To set the context, let's first do things in HANA Text Analysis (TA) without CGUL rules.
1. Let's create a small table with texts
So let's create a table which looks something like this:
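A minimal sketch of such a table, inferred from the INSERT statements and the index definition used in this post. Only the columns LANG and TEXT are named in the post; the names "ID" and "REMARKS" and all column types are assumptions chosen for illustration.

```sql
-- Hypothetical definition: four columns to match the four-value INSERTs.
-- "ID" and "REMARKS" are assumed names; the types are assumptions as well.
-- A primary key is included because text analysis needs one on the source table.
CREATE COLUMN TABLE "S_JTRND"."TA_TEST" (
    "ID"      INTEGER       PRIMARY KEY,  -- document key (assumed name)
    "LANG"    NVARCHAR(2),                -- language code, used as LANGUAGE COLUMN
    "TEXT"    NVARCHAR(200),              -- the text that gets indexed
    "REMARKS" NVARCHAR(200)               -- fourth column, always '' in this post (assumed name)
);
```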
Now let's insert two texts into it:
insert into "S_JTRND"."TA_TEST" values(1,'EN','TO BE','');
insert into "S_JTRND"."TA_TEST" values(2,'EN','NOT TO BE','');
So now the table entries look like:
2. Text analysis via dictionary
Now let's say we want to do text analysis where we say:
- if the text is "TO BE" it is to be treated as POSITIVE_CONTEXT
- if the text is "NOT TO BE" it is to be treated as NEGATIVE_CONTEXT
Let's create a dictionary that holds these two values.
In the XSJS project we create a file english-Contextdict.hdbtextdict with the following content (also attached):
<dictionary xmlns="http://www.sap.com/ta/4.0">
<entity_category name="POSITIVE_CONTEXT">
<entity_name standard_form="TO BE">
<variant name="TO BE" />
</entity_name>
</entity_category>
<entity_category name="NEGATIVE_CONTEXT">
<entity_name standard_form="NOT TO BE">
<variant name="NOT TO BE" />
</entity_name>
</entity_category>
</dictionary>
Now we use the dictionary above to create a configuration file (also attached).
Pick the content of any existing .hdbtextconfig and add the path to the above dictionary to it:
<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" based-on="CommonSettings">
<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
</configuration>
3. Create a full-text index on the table using this configuration
CREATE FULLTEXT INDEX "IDX_CONTEXT" ON "S_JTRND"."TA_TEST" ("TEXT")
LANGUAGE COLUMN "LANG"
CONFIGURATION 'JTRND.TABlog.cfg::JT_TEST_CFG' ASYNC
LANGUAGE DETECTION ('en','de')
PHRASE INDEX RATIO 0.000000
FUZZY SEARCH INDEX OFF
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING OFF
TEXT ANALYSIS ON;
Check the TA results:
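With TEXT ANALYSIS ON, the results are written to a generated table named $TA_<index name> in the schema of the source table, so for the index above a query along these lines should show them. The key column name "ID" is an assumption (this post does not name the first column of TA_TEST); the TA_* columns are the standard ones of the $TA table.

```sql
-- The index was created ASYNC, so the rows may take a moment to appear.
-- "ID" is an assumed name for the key column of "S_JTRND"."TA_TEST".
SELECT "ID",
       "TA_COUNTER",   -- running number of the extracted entity per document
       "TA_TOKEN",     -- the matched text
       "TA_TYPE",      -- e.g. POSITIVE_CONTEXT or NEGATIVE_CONTEXT
       "TA_SENTENCE",
       "TA_OFFSET"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
ORDER BY "ID", "TA_COUNTER";
```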
Note: for NOT TO BE we did not get both POSITIVE (for the substring TO BE) and NEGATIVE. Although this is the result we want, it is a fluke: TA took the longest matching string, so for NOT TO BE and its substring TO BE we got only a NEGATIVE. This could still create problems.
Moving on, let's add more to this context: the text NOT-TO BE should also count as NEGATIVE_CONTEXT. In fact, NOT followed by TO BE anywhere in the same sentence should be a NEGATIVE_CONTEXT.
Without changing anything, let's insert some more values and see how they look:
insert into "S_JTRND"."TA_TEST" values(3,'EN','NOT-TO BE','');
insert into "S_JTRND"."TA_TEST" values(4,'EN','NOT, TO BE','');
insert into "S_JTRND"."TA_TEST" values(5,'EN','NOT, Negates TO BE','');
Check the TA results:
So you see we now have a problem. Also, we could have NOT, -, NEG etc. as possible predecessors of TO BE that mark it as a NEGATIVE_CONTEXT.
Solution 1: Put the synonyms of NOT in one category and TO BE in a "CONTEXT" category, and in post-processing of the TA results check whether the TA_TYPE values CONTEXT and NEGATIVE occur in the same sentence; if they do, it is a NEGATIVE_CONTEXT.
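As a rough sketch, the post-processing of Solution 1 could be a self-join on the TA result table. It assumes the dictionary has been changed to emit the two categories CONTEXT and NEGATIVE described above, that the key column of TA_TEST is called "ID" (an assumption), and that the negation has to appear before TO BE in the sentence.

```sql
-- For every CONTEXT hit, look for a NEGATIVE hit earlier in the same sentence.
-- 'CONTEXT' and 'NEGATIVE' are the hypothetical dictionary categories of Solution 1;
-- "ID" is an assumed key column name.
SELECT DISTINCT
       c."ID",
       c."TA_COUNTER",
       c."TA_TOKEN" AS "CONTEXT_TOKEN",
       CASE WHEN n."ID" IS NULL THEN 'POSITIVE_CONTEXT'
            ELSE 'NEGATIVE_CONTEXT'
       END AS "DERIVED_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT" AS c
LEFT JOIN "S_JTRND"."$TA_IDX_CONTEXT" AS n
       ON  n."ID"           = c."ID"
       AND n."TA_PARAGRAPH" = c."TA_PARAGRAPH"
       AND n."TA_SENTENCE"  = c."TA_SENTENCE"
       AND n."TA_TYPE"      = 'NEGATIVE'
       AND n."TA_OFFSET"    < c."TA_OFFSET"   -- negation precedes TO BE
WHERE c."TA_TYPE" = 'CONTEXT';
```

This works, but every consumer of the TA table has to repeat the join.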
But wouldn't it be great if the index could do this on its own?
CGUL rules save the day.
So here we go:
4. Create a .rul file
Create CONTEXT.rul (also attached) containing the following rule:
#group NEGATIVE_CONTEXT (scope="sentence") : { <NOT> <>*? <TO> <>*? <BE> }
We need to compile this rule to get a .fsm file and put it on the server under ...lexicon/lang (compiling is out of scope for this blog; I have attached the compiled file here).
Now enhance your configuration file with a reference to this .fsm file:
<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" based-on="CommonSettings">
<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
<property name="ExtractionRules" type="string-list">
<string-list-value>CONTEXT.fsm</string-list-value>
</property>
</configuration>
5. Restart the indexserver process so that the newly compiled rule file is picked up by the system.
6. Recreate the index using the same statement as above and check the TA table:
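A minimal sketch of this step, assuming the index lives in schema S_JTRND and using a shortened form of the CREATE statement from step 3 (the remaining options can be added back as needed):

```sql
-- Drop the old index so that the new configuration and rule file are applied.
DROP FULLTEXT INDEX "S_JTRND"."IDX_CONTEXT";

-- Recreate it with the same configuration (shortened option list).
CREATE FULLTEXT INDEX "IDX_CONTEXT" ON "S_JTRND"."TA_TEST" ("TEXT")
LANGUAGE COLUMN "LANG"
CONFIGURATION 'JTRND.TABlog.cfg::JT_TEST_CFG' ASYNC
TEXT ANALYSIS ON;

-- Show only the rows of type NEGATIVE_CONTEXT
-- ("ID" is an assumed key column name).
SELECT "ID", "TA_RULE", "TA_TOKEN", "TA_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
WHERE "TA_TYPE" = 'NEGATIVE_CONTEXT'
ORDER BY "ID", "TA_COUNTER";
```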
So, as you can see, the highlighted values come from the rule and mark the extracted NEGATIVE_CONTEXT. Below them I kept, for comparison, the dictionary value that wrongly identified a POSITIVE_CONTEXT; ideally this should not be handled by dictionaries.
So, in this context, To Be or Not To Be: HANA Text Analysis CGUL rules indeed have the answer!
Hope this helps,
Bricks and Bats are Welcome