This is a blog on some of the options that are available in text analysis. Shore descriptions with example codes.
Text Analysis is the process of analyzing unstructured text, extracting relevant information and then transforming that information into structured information that can be leveraged in different ways.
Full Text Indexing:
When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some rudimentary tools, such as grep, do when searching.However, when the number of documents to search is potentially large, the problem of full-text search is often divided into two tasks: indexing and searching.The indexing stage will scan the text of all the documents and build a list of search terms (often called an index). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document.Conceptually, full-text indexes support searching on columns in the same way that indexes support searching through books.
CREATING A FULL TEXT INDEX:
CREATE FULLTEXT INDEX "nameofindex" On <SCHEMA_NAME>."<table>"("<column>")
TEXT ANALYSIS ON
CONFIGURATION {'EXTRACTION_CORE_VOICEOFCUSTOMER' || 'EXTRACTION_CORE' || 'LINGANALYSIS_BASIC' || 'LINGANALYSIS_STEMS' ||
'LINGANALYSIS_FULL'};
Configurations:
5 Predefined configurations. GROUPED INTO 3 categories.
LINGUISTIC ANALYSIS CONFIGURATIONS:
Linguistic analysis
- Segmentation–the separation of input text into its elements
- Stemming–the identification of word stems, or dictionary forms
- Tagging–the labeling of words' parts of speech (POS)
- LINGANALYSIS_BASIC : Segmentation (tokenization)
- LINGANALYSIS_STEMS : BASIC + Stemming (identifying words in dictionary based form -- like Work is the stem of working/worked/works )
- LINGANALYSIS_FULL : BASIC + STEM + TAGGING ( Labeling of words POS – Verb, noun,etc).
EXTRACTION CONFIGURATIONS: ( to extract Entities & FACTS )
- EXTRACTION_CORE : Only extracts Entity types (like name , organization , city, language, etc).
- EXTRACTION_CORE_VOICEOFCUSTOMER : extracts Entity types and relationship between them (sentiment analysis)
CUSTOM CONFIGURATIONS:
- Creating a custom configuration from existing configuration: (.hdbtextconfig)
Repositories->SAP->HANA->TA->Config. We can find all 5 config files(Ling analysis & extraction)
We can,
- Include / exclude Analyzers ,
- increase /decrease the sample text for analyzing language,
- Turn ON / OFF the POS, stemming,Tokenizing
- Enable custom dictionaries Should always be set to TRUE
Activate the text config file and use the file in the CONFIGURATION part of the index.
CREATE FULLTEXT INDEX "nameofindex" On <SCHEMA_NAME>."<table>"("<column>")
TEXT ANALYSIS ON
CONFIGURATION 'sap.hana.ta.config::<CUSTOMCONFIG>'
EXAMPLES:
ENTITY EXTRACTION (SENTIMENT ANALYSIS):
DROP TABLE LANGUAGE_DETECT;
Create column table language_detect (ID smallint Primary Key, content nvarchar(50), lang varchar(2));
insert into language_detect values(2,'JOHN LOVES TO PLAY FOOTBALL','EN');
DROP FULLTEXT INDEX LANG_INDEX;
CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
LANGUAGE COLUMN LANG
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
SELECT TOP 1000 * FROM "$TA_LANG_INDEX";
![]()
LINGUISTIC ANALYSIS :
DROP FULLTEXT INDEX LANG_INDEX;
CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
LANGUAGE COLUMN LANG
CONFIGURATION 'LINGANALYSIS_FULL'
TEXT ANALYSIS ON;
SELECT TOP 1000 * FROM "$TA_LANG_INDEX";
![]()
LANGUAGE DETECTION PARAMETER:
select * from SYS.M_TEXT_ANALYSIS_LANGUAGES -- TO FIND THE LIST OF SUPPORTED LANGUAGES.
![]()
Insert the following texts into a table for testing the detection:
German : Ich mag Musik
English: I love Music
Chinese (traditional) : 我愛音樂
insert into language_detect values(1,'我愛音樂');
insert into language_detect values(2,'I LOVE MUSIC');
insert into language_detect values(3,'Ich mag Musik');
Create Index without specifying the LANGUAGE DETECTION option:
CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
TEXT ANALYSIS ON
LANGUAGE DETECTION(‘ZH’,'EN')-- Analyse and display the results only for the specified two languages
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';
SELECT TOP 1000 * FROM "$TA_LANG_INDEX"
If no LANGUAGE DETECTION parameter is specified, then only the Words which are in ENGLISH (Default language) are considered.
LANGUAGE COLUMN:
Speed up the analysis by eliminating the detection of language by explicitly having the language code in a separate column.
DROPTABLE LANGUAGE_DETECT;
Createcolumntable language_detect (ID smallintPrimaryKey, content nvarchar(50), lang varchar(2));
insertinto language_detect values(2,'I LOVE MUSIC','EN');
insertinto language_detect values(3,'Ich mag Musik','DE');
DROP FULLTEXT INDEX LANG_INDEX;
CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
LANGUAGECOLUMN LANG --Specified the language column as LANG (To fetch the language code from this column)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
SELECT TOP 1000 * FROM"$TA_LANG_INDEX";
MIME TYPES:
Used to specify MIME type (Format of the content) to improve the performance.
Supported mime types for text analysis:
SELECT * FROM SYS.M_TEXT_ANALYSIS_MIME_TYPES
![]()
DROPTABLE mime_types;
Createcolumntable mime_types (ID smallintPrimaryKey, content BLOB);
--NOTE: BLOB will take up more memory and hence no point in using BLOB to just load plain text files.
--Insert the data through any program (Java/Python,etc)
DROP FULLTEXT INDEX MIME_INDEX;
CREATE FULLTEXT INDEX MIME_INDEX ON mime_types(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON
MIME TYPE'application/pdf'; --Specify the type of file to be loaded to the BLOB column
SELECT TOP 1000 * FROM"$TA_MIME_INDEX";
![]()
TOKEN SEPARATORS:
Specify only the separators that are needed to split the words in a set of text.
DROPTABLE TOKEN_SEPARATOR;
Createcolumntable TOKEN_SEPARATOR (ID smallintPrimaryKey, content nvarchar(200));
DROP FULLTEXT INDEX TOKEN_INDEX;
CREATE FULLTEXT INDEX TOKEN_INDEX ON
TEXT ANALYSIS ON
TOKEN SEPARATORS '/\:;"''[]_';
insert into TOKEN_SEPARATOR VALUES (1,'fellow-student writes semi-final'); -- Treats '-' as a part of the string and doesn't tokenize.
SELECT TOP 1000 * FROM"$TA_TOKEN_INDEX";
![]()
SYNCHRONIZATION:
We can set the time/document count for synchronizing the index table with the contents.
DROPTABLE SYNC;
Createcolumntable SYNC (ID smallintPrimaryKey, content nvarchar(200));
DROP FULLTEXT INDEX SYNC_INDEX;
CREATE FULLTEXT INDEX SYNC_INDEX ON SYNC(CONTENT)
TEXT ANALYSIS ON
ASYNC FLUSH EVERY 1 MINUTES; --SYNCHRONIZES ONCE EVERY MINUTE (Only once every minute, index table entries are refreshed)
DROP FULLTEXT INDEX SYNC_INDEX;
CREATE FULLTEXT INDEX SYNC_INDEX ON SYNC(CONTENT)
TEXT ANALYSIS ON
ASYNC FLUSH AFTER 5 DOCUMENTS; --SYNCHRONIZES ONCE AFTER EVERY 5 RECORDS
(Index table entries are refreshed once in every 5 records inserted into the content table)
ALTER FULLTEXT INDEX SYNC_INDEX SUSPEND QUEUE; -- SUSPEND THE ANALYSIS (Index table entries are not refreshed)
ALTER FULLTEXT INDEX SYNC_INDEX ACTIVATE QUEUE; -- ACTIVATE THE ANALYSIS (Index table entries are refreshed)
CUSTOM DICTIONARY:
Step1: Create the list of words in a file in project explorer(.hdbtextdict)
Step2: Add the file to any of the “.hdbtextconfig” file.
.hdbtextdict:
This file should be in XML format.
<?xml version=”1.0” encoding=”UTF-8”?>
<dictionary xmlns=http://sap.com/ta/4.0>
<entity_category name=”IPL CRICKET TEAM”>
<entity_name standard_form="Chennai Super Kings">
<variant name="CSK"type="ABBREV" />
<variant name="C.S.K"type="ABBREV" />
<variant name="super kings" />
<variant generation type ="standard"language="english" />
</entity_name>
</entity_category>
</dictionary>
.hdbtextconfig :
Create a custom configuration file (.hdbtextconfig) having any configuration as the base.
We will find the following at the bottom of any configuration file:
--create a .hdbtextconfig file ("iplteamconfig.hdbtextconfig" )by editing the following in the existing configuration files.
<property name=”Dictionaries” type=”string-list”>
<string-list-value> sap.hana.ta.config::iplteam.hdbtextdict</ string-list-value>
</property>
Use this in the Configuration while defining the INDEX creation:
--Use this in the Configuration while defining the INDEX creation:
DROP FULLTEXT INDEX DICT_INDEX;
CREATE FULLTEXT INDEX DICT_INDEX ON DICT(CONTENT)
CONFIGURATION 'sap.hana.ta.config::iplteam'; --The custom configuration file name
TEXT ANALYSIS ON;
REQUEST EXTRACTION:
General Requests : general req. by customers for enhancement , improvement, etc
Contact Requests : contacts given by customer like contact me at… call me at …
INSERTINTO prof_test VALUES (2,' CALL ME AT 8884484855 ');
INSERTINTO prof_test VALUES (10,' An additional key would be good on this keyboard');
INSERTINTO prof_test VALUES (3,'I want your customer care to contact me as soon as possible');
DROP FULLTEXT INDEX prof_index;
CREATE FULLTEXT INDEX prof_index ON prof_test(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
SELECT * FROM"$TA_PROF_INDEX";
EMOTICONS EXTRACTION:
![]()
DROPTABLE prof_test;
Createcolumntable prof_test (ID smallintPrimaryKey, content nvarchar(300));
INSERTINTO prof_test VALUES (2,' I CLEARED MY CERTIFICATION :-D ');
INSERTINTO prof_test VALUES (10,' THE PARTY WAS GOOD :)');
INSERTINTO prof_test VALUES (3,'I HATE TRAFFIC :(((');
DROP FULLTEXT INDEX prof_index;
CREATE FULLTEXT INDEX prof_index ON prof_test(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
SELECT * FROM"$TA_PROF_INDEX";
![]()