To Be or Not To Be: HANA Text Analysis CGUL Rules Have the Answer

To set the context, let's first do things in HANA Text Analysis (TA) without CGUL rules.

1. Create a small table with texts

Let's create a table that looks something like this:
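A minimal sketch of a matching table definition, inferred from the INSERT statements and the index definition further down. Only the LANG and TEXT column names appear in this post; the names ID and NOTE, and all column types and lengths, are assumptions.

```sql
-- Hypothetical definition: "ID" and "NOTE" are assumed names, types are assumed.
-- A primary key is included because the text analysis results are keyed by it.
CREATE COLUMN TABLE "S_JTRND"."TA_TEST" (
    "ID"   INTEGER NOT NULL PRIMARY KEY,  -- assumed key column
    "LANG" NVARCHAR(2),                   -- language code, e.g. 'EN'
    "TEXT" NVARCHAR(500),                 -- the text to be analysed
    "NOTE" NVARCHAR(100)                  -- assumed fourth column, left empty below
);
```

The column order matches the positional values used in the INSERT statements that follow.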

Now let's insert two texts into it:

insert into "S_JTRND"."TA_TEST" values(1,'EN','TO BE','');

insert into "S_JTRND"."TA_TEST" values(2,'EN','NOT TO BE','');

The table now holds two entries, both with language 'EN': row 1 with the text 'TO BE' and row 2 with the text 'NOT TO BE'.

2. Text Analysis Via Dictionary

Now let's say we want text analysis to apply the following rules:

If the text is "TO BE", it is to be treated as POSITIVE_CONTEXT.
If the text is "NOT TO BE", it is to be treated as NEGATIVE_CONTEXT.

Let's create a dictionary that holds these two values.

In the XSJS project, we create a file english-Contextdict.hdbtextdict with the following content (also attached):

<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="POSITIVE_CONTEXT">
    <entity_name standard_form="TO BE">
    </entity_name>
  </entity_category>
  <entity_category name="NEGATIVE_CONTEXT">
    <entity_name standard_form="NOT TO BE">
    </entity_name>
  </entity_category>
</dictionary>

Now we reference the dictionary above in a configuration file (also attached).

Copy the content of any existing .hdbtextconfig and add the path to the dictionary at the end of it:

<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
</configuration>

3. Create a full-text index on the table using this configuration

CREATE FULLTEXT INDEX "IDX_CONTEXT" ON "S_JTRND"."TA_TEST" ("TEXT")
LANGUAGE COLUMN "LANG"
CONFIGURATION 'JTRND.TABlog.cfg::JT_TEST_CFG' ASYNC
LANGUAGE DETECTION ('en','de')
PHRASE INDEX RATIO 0.000000
FUZZY SEARCH INDEX OFF
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING OFF
TEXT ANALYSIS ON;

Check the TA results:
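The results can be read from the table that HANA generates for a text analysis index, conventionally named $TA_ followed by the index name and placed in the same schema as the source table. A hedged sketch of such a query, assuming the key column of TA_TEST is called ID (that name is not given in this post):

```sql
-- "ID" is an assumed key column name; the TA_* columns are the standard
-- columns of the generated text analysis result table.
SELECT "ID", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_SENTENCE", "TA_OFFSET"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
ORDER BY "ID", "TA_COUNTER";
```

Because the index is created with ASYNC, the result table may take a moment to fill after new rows are inserted.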

Note: for NOT TO BE, we did not get both POSITIVE_CONTEXT (for the substring TO BE) and NEGATIVE_CONTEXT. Although this is the result we want, it is a fluke: TA took the longest matching string, so NOT TO BE, which contains the substring TO BE, came out as NEGATIVE_CONTEXT only. This can still create problems.

Moving on, let's add more to this context. The text NOT-TO BE should also count as NEGATIVE_CONTEXT. In fact, NOT followed by TO BE anywhere in the same sentence should be a NEGATIVE_CONTEXT.

Without changing anything, let's insert some more values and see how they look:

insert into "S_JTRND"."TA_TEST" values(3,'EN','NOT-TO BE','');

insert into "S_JTRND"."TA_TEST" values(4,'EN','NOT, TO BE','');

insert into "S_JTRND"."TA_TEST" values(5,'EN','NOT, Negates TO BE','');

Check the TA results:

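So we now have a problem: the dictionary only knows the exact string NOT TO BE, so in the new texts it matches just the TO BE part and wrongly reports POSITIVE_CONTEXT. On top of that, NOT, -, NEG and so on could all appear before TO BE to indicate that it is a NEGATIVE_CONTEXT.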

Solution 1: Put the synonyms of NOT in one category (NEGATIVE) and TO BE in a CONTEXT category. Then, in post-processing of the TA results, if TA_TYPE values CONTEXT and NEGATIVE occur in the same sentence, treat it as a NEGATIVE_CONTEXT.
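As a rough illustration of that post-processing step, the check could be a self-join on the TA result table. This is only a sketch under assumptions: the key column name ID is not given in this post, the categories CONTEXT and NEGATIVE would first have to be defined in the dictionary, and DERIVED_TYPE is a made-up output column name.

```sql
-- Find sentences where a NEGATIVE entity appears before a CONTEXT entity.
-- "ID" and "DERIVED_TYPE" are assumed/hypothetical names.
SELECT DISTINCT c."ID",
                c."TA_PARAGRAPH",
                c."TA_SENTENCE",
                'NEGATIVE_CONTEXT' AS "DERIVED_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT" AS c
JOIN "S_JTRND"."$TA_IDX_CONTEXT" AS n
  ON  n."ID"           = c."ID"
  AND n."TA_PARAGRAPH" = c."TA_PARAGRAPH"
  AND n."TA_SENTENCE"  = c."TA_SENTENCE"
WHERE c."TA_TYPE"  = 'CONTEXT'
  AND n."TA_TYPE"  = 'NEGATIVE'
  AND n."TA_OFFSET" < c."TA_OFFSET";  -- the negation precedes TO BE
```

This works, but it moves the logic out of the index and into every consumer of the results.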

But wouldn't it be great if the index could do this on its own?

CGUL rules save the day.

So here we go:

4. Create a .rul file

CONTEXT.rul (also attached) contains the following rule:

#group NEGATIVE_CONTEXT (scope="sentence") : { <NOT> <>*? <TO> <>*? <BE> }

We need to compile this rule to get a .fsm file and put it on the server under ...lexicon/lang. The compilation itself is out of scope for this blog; I have attached the compiled file here.

Now enhance your configuration file with a reference to this .fsm file:

<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
<property name="ExtractionRules" type="string-list">
<string-list-value>CONTEXT.fsm</string-list-value>
</property>
</configuration>

5. Restart the indexserver process so that the newly compiled rule file is picked up by the system.

6. Recreate the index using the same statement as above and check the TA table:
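A hedged sketch of this step: drop the existing index, re-run the CREATE FULLTEXT INDEX statement from step 3 unchanged, then query the result table. The key column name ID is an assumption, as before.

```sql
-- Drop the existing index so it can be rebuilt with the updated configuration.
DROP FULLTEXT INDEX "S_JTRND"."IDX_CONTEXT";

-- Re-run the CREATE FULLTEXT INDEX statement from step 3 here, unchanged.

-- Then list only the context annotations ("ID" is an assumed key column name).
SELECT "ID", "TA_TOKEN", "TA_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
WHERE "TA_TYPE" IN ('POSITIVE_CONTEXT', 'NEGATIVE_CONTEXT')
ORDER BY "ID", "TA_OFFSET";
```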

As you can see, the highlighted values come from the rule and mark the extracted NEGATIVE_CONTEXT. Below them, I kept the dictionary value that wrongly identified POSITIVE_CONTEXT for comparison; ideally, this case should not be handled by dictionaries at all.

So, in this context: To Be or Not To Be, HANA Text Analysis CGUL rules indeed have the answer!

Hope this helps,

Bricks and Bats are Welcome
