TSM - Text search and analysis in SAP HANA

Iulia Bicu - Software Developer @ Siemens România

This paper presents the options available in SAP HANA for text search and analysis. These options relate to the numerous ways in which the HANA platform can be used, especially in business applications, which have not been equipped with these types of functions before. The SAP HANA platform offers the users the possibility to search and visualize the content of the tables with the same simplicity as an Internet search does.

The text search function entails a certain tolerance for errors (fuzzy search) as well as the processing of multilingual terms and synonyms. One possible example is filling in the value "buenes eires tele" in a field, and finding the correct value despite the faulty spelling, by searching the value across several columns. Nevertheless, the users cannot always easily determine if the result is the expected one, the correct one, within the fuzzy search context which tolerates errors. Have you ever wondered why you get unexpected results when you search for something on the web?

Pattern recognition in and across texts and documents is a totally different analysis function.

For example, the relations, the intentions and even the feelings in a given text can be identified. Let us consider that you have a web store which allows your customers to shop online, as well as post comments about the products and their supplier. The so-called "Feeling analysis" is part of the engine which drives the text processing functionality in SAP HANA and enables you to identify patterns in these types of unstructured data. In the context of the online store, for instance, you can analyze if, for a given product, there are more positive comments than there are negative ones.

Quick search for text strings:

Search for similar words (error tolerance):

Search for linguistic variations of the same word:

Figure 1. Text search examples

Basic principles for text search and analysis in SAP HANA

Before activating the search function, take the following prerequisites into account:

the SAP HANA database must contain column-oriented/column-based tables.
a valid data type must be associated to each column. The data types in the columns determine the way in which you can query the data.

The main purpose of the search text in SAP HANA is to offer users an optimal use at the GUI level.

The following features are included:

Freestyle search. The users do not need to know the exact columns in the database where the query takes effect. For example, the address is searched via one input (insertion) field and all the technical features are filled in: street name, postal code, city, country and so on.
Fuzzy search. The users can easily vary the spelling in their search.
Linguistic search and synonym search.
Value suggestions and hints. The system can identify the possible search results efficiently, while the users introduce the text, making the hints visible for the user in real time.
Text analysis ("Feeling analysis" in particular). You can extract supplementary information from texts, semantic information for example.

Using text search via SQL

As is the case with the majority of functions in SAP HANA, you can access the text search function via SQL. To do this, you must use the SELECT command together with the key word CONTAINS, which enables you to search for the text. The standard syntax is the following:

The following example offers one of the ways in which you can use the CONTAINS clause.

SELECT \* FROM "MYTABLE" WHERE CONTAINS 
("FIELD1" ,  , EXACT)

Searching across several columns

The CONTAINS clause enables you to specify several columns that can be considered during the search. The following examples show that the search is performed in a list of columns or in all columns (freestyle).

There is another possibility of performing the search in a union of specified columns. For these examples we will use the Documents table, which includes three fields: ID, TITLE and ABSTRACT, and contains the following content:

Figure2. The Documents table and its content

In the next example, the word 'business' is searched for in the TITLE and ABSTRACT fields.

SELECT * FROM "DOCUMENTS" WHERE 
CONTAINS (("TITLE","ABSTRACT"), 'business');

In this example, 'business software' will be searched for in all the fields.

SELECT * FROM "DOCUMENTS" WHERE
CONTAINS (*, 'business software');

Text data types and full text index in SAP HANA

Fuzzy search in SAP HANA is based on the data types stored in columns. In this context, TEXT and SHORTTEXT represent two types of specific data, which are specialized in text search and analysis. The SHORTTEXT data type is used for character strings of a certain length (similar to NVARCHAR), because TEXT represents a larger object (similar to NCLOB- the SQL data type for a string in the ABAP Data Dictionary). In this context, the texts are parsed into tokens, which is the basis for search and analysis.

Unfortunately, at present, there is no support available in ABAP for the TEXT and SHORTTEXT data types. Therefore, it is not possible to create a table which uses these data types in the ABAP Data Dictionary. Although the fuzzy search function, allows for these data types (VARCHAR and NVARCHAR, for example), the function is not extended enough, and, therefore, fuzzy search is generally not recommended here. Without the possibility of parsing texts into tokens, the system is not capable of recognizing a permutation of words, which is a standard in modern applications.

Full Text Index

You can add the functionality offered by the text data types to a certain column by creating a full text index. In this way, you can activate the search and analysis functions for the majority of the data types in the ABAP Dictionary (including CHAR, STRING, and so on).

When you create a full text index for a column-oriented/column-based table, the system creates a shadow column of the TEXT type, which contains the same data, but which is integrated into an optimal search frame.

In this context, the text is parsed into tokens, and a dictionary is additionally generated. Figure 3. provides an example based on the name of an airline company. You must remember that this shadow column can be found only in the main memory.

Figure 3. Creating a full text index

You can create the full text index using the CREATE FULLTEXT INDEX command. The underlying syntax is the following:

CREATE FULLTEXT INDEX <index-name> 
ON <table> (<column>) <options>

The cases where a full text index is required are the following:

A. For an exact search, namely for finding the exact words with the correct spelling upon the search proper

In the example in Figure 6. , we use the Messages table which contains three fields ID, MESSAGE and LANG (see Figure 4.) and the full text index created (în Figura 5.) to search for the exact 'businesses' value.

Figure 4. The Messages table

Figure 5. The full text index

select * from MESSAGES WHERE CONTAINS 
("MESSAGE", 'businesses', EXACT);

Figure 6. Exact search using a full text index

B. For a linguistic search, namely for finding the words that have the same linguistic stem. In the next example, in Figure 7., we use the Messages table to look for the word 'go' .

select ID, MESSAGE, LANG from MESSAGES WHERE 
CONTAINS ("MESSAGE", 'go', LINGUISTIC);

Figure 7. Linguistic search using a full text index

Fuzzy Search

In SAP HANA, fuzzy search is a quick search which tolerates errors. A fuzzy search returns results even if the term to be searched contains additional or missing characters or other spelling errors. This section offers an outlook on the techniques used for fuzzy search in SAP HANA.

The degree to which the data must match the search query is generally determined by mathematical algorithms which form the basis of fuzzy search.

The result of the calculations is seldom a numerical value which is used to decide if the data are similar enough with the requested search.

As far as texts are concerned, the easiest type of algorithm deals with the minimum number of operations (like replacing or displacing characters) which are necessary to generate a portion of real data upon the request proper.

In practice, it is very hard to determine the degree of similarity among texts.

The text search function in SAP HANA results in a value between 0 and 1 which represents the degree of similarity. As developers, you must define a threshold (for example, 0,8) which is the value that results at the end of the search, namely the degree of similarity between the input value and the output value. Figure 8. contains an example of exact text search for the 'ENTERPRISE' word with a similarity degree of 0,8.

SELECT "TITLE", "ABSTRACT"
FROM DOCUMENTS
WHERE contains ("ABSTRACT",'ENTPRISE', FUZZY(0.8));

Figure 8. Search using a similarity degree

Special functions

There are special, scalar functions which enable you to get additional information for the results of an individual search. SAP HANA offers the following available functions: score(), highlighted(), and snippets(). The next paragraph provides a more detailed description of the score() function.

The score() function offers information about the degree of similarity between the search output and the search input. The value is situated within the 0 and 1 range, and higher values indicate a higher level of similarity.
Normally, the function is used to sort the results after the search takes place, so that the results with a higher degree of similarity are displayed in the upper part of the list, contrasting with the ones that have an inferior value.

SELECT SCORE (),"TITLE", "ABSTRACT"
FROM "Z003H05R"."DOCUMENTS"
WHERE contains ("ABSTRACT",'ENTPRISE', FUZZY(0.8))
ORDER BY SCORE() DESC;

Figure 9. Using the score() functions

In this paper, we discussed the basic architecture regarding the text search and analysis in SAP HANA. We mentioned the fields relating to the column-oriented/column-based tables, which contain certain data types (TEXT and SHORTTEXT), and which provide powerful functions for the text search and analysis of unstructured data. For the ABAP text types, you can use a full text index to create a virtual, TEXT-type column.