TSM - Large Scale Text Classification

Cristian Raț - Software Developer

In recent years, artificial intelligence has become the answer to many problems, such as detecting fraud or spam messages, classifying images, or determining the topic of an article. With the rise in the number of internet users, the quantity of data that needs to be processed has also increased. Storing and processing data on one server has therefore become too difficult, and the best available solution is to process it within a distributed system.

Classification is a decision-making process based on previous examples of correct decisions; it is therefore a supervised learning technique, because in order to make its own decisions it needs a pre-classified set of data. The training process discovers which properties of the data indicate that it belongs to a certain class, and the relationship between those properties and the class is saved in a model that will then be used to classify new data.

The process of creating an automated classification system is the same no matter which classification algorithm is used, and it comprises several stages: data processing, training, testing and adjusting, validation (final testing), and deployment in production.

What the classification of a text actually means is the association between a text and a pre-defined text category. Each word from the document is seen as an attribute of that document; therefore, a vector of attributes is created for each document, using the words from the text. In order to improve the classification process, we can either eliminate certain words, especially the words that are very frequent in the vocabulary (e.g. and, for, a, an), or group consecutive words into units of two or three words called n-grams.
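As an illustration (independent of Mahout), a minimal Python sketch of this pre-processing step could look like the one below; the stop-word list is purely illustrative:

import re

STOP_WORDS = {"and", "for", "a", "an", "the", "of", "to"}  # illustrative; real lists are much larger

def tokenize(text):
    # lowercase the text and split it into words
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    # drop very frequent words that carry little information
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    # group consecutive words into n-word units (n-grams)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("A quick guide for large scale text classification"))
print(tokens)             # ['quick', 'guide', 'large', 'scale', 'text', 'classification']
print(ngrams(tokens, 2))  # ['quick guide', 'guide large', 'large scale', ...]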

Several algorithms that deal with this problem have been developed: Naive Bayes, decision trees, neural networks, logistic regression etc.

One of the most used text classification algorithms is Naive Bayes. Naive Bayes is a probabilistic classification algorithm: its decisions are based on probabilities derived from the pre-classified set of data.

The training process analyzes the relationships between the words that appear in the documents and the categories that the documents are associated with. Bayes’ theorem is then used to determine the probability that a series of words might belong to a certain category.

Bayes' theorem states that the probability of an event A occurring, given that another event B has already occurred, is equal to the probability of B occurring given A, multiplied by the probability of A and divided by the probability of B.
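In formula form:

P(A | B) = P(B | A) · P(A) / P(B)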

Therefore, in the case of our problem, the probability that a certain document belongs to a category C, given that it contains the word vector X = (x1, x2, …, xn), is the following:
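P(C | x1, x2, …, xn) = P(x1, x2, …, xn | C) · P(C) / P(x1, x2, …, xn)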

This equation is extremely difficult to calculate and requires tremendous computational power. In order to simplify the problem, the Naive Bayes algorithm assumes that the attributes from the vector X are independent (this is why the algorithm is called "naive"), which makes the formula look like this:
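P(C | x1, x2, …, xn) = P(C) · P(x1 | C) · P(x2 | C) · … · P(xn | C) / P(x1, x2, …, xn)

Since the denominator is the same for every category, it can be ignored when comparing categories: the document is assigned to the category that maximizes P(C) · P(x1 | C) · … · P(xn | C).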

This can be computed much faster but, if we have a large number of classes and a large amount of data, the training process will still take too long for the algorithm to have any practical use. The solution is distributed processing.
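To make the formula concrete, here is a minimal single-machine sketch of the idea (a multinomial Naive Bayes with Laplace smoothing, written in plain Python purely as an illustration, not as Mahout's implementation):

from collections import Counter, defaultdict
import math

def train(documents):
    # documents: a list of (category, list_of_words) pairs
    class_counts = Counter()            # number of training documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for category, words in documents:
        class_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(words, class_counts, word_counts, vocabulary):
    # pick the category that maximizes log P(C) + sum of log P(word | C)
    total_docs = sum(class_counts.values())
    best_category, best_score = None, float("-inf")
    for category in class_counts:
        score = math.log(class_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in words:
            # Laplace smoothing so that unseen words do not produce zero probabilities
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

training_data = [
    ("sport", ["match", "goal", "team"]),
    ("sport", ["team", "score", "goal"]),
    ("politics", ["election", "vote", "party"]),
]
model = train(training_data)
print(classify(["goal", "team"], *model))  # prints: sport

Log-probabilities are used instead of the raw product because multiplying thousands of small probabilities would quickly underflow.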

Hadoop is a framework that allows data to be processed in a distributed manner. It comes with a distributed file system for storing the data and it can scale up to several thousand machines, which should be enough to solve any classification problem.

Mahout is a library of scalable algorithms for clustering, classification and collaborative filtering. Many of these algorithms implement the map-reduce paradigm and run in a distributed manner on top of Hadoop. When the amount of data is small, the algorithms in this library are slow compared to other options, so it does not make sense to use Mahout; however, Mahout can scale massively and the size of the data is not a problem, which is why the recommendation is to use it only when the amount of data is very large (more than one million training documents).

To make the algorithms easier to use, Mahout provides a tool that can be run from the command line, which simplifies the creation of a scalable automated text classification process.

The first step is to transform the text into attribute vectors, but for this to be possible we need the documents in SequenceFile format. A SequenceFile is a file that contains key-value pairs and is used by programs that implement the map-reduce paradigm.
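For example, a directory of pre-classified documents could end up as pairs of the form below (the exact keys and contents are only illustrative):

/sport/article_001.txt     ->  "text of the first document ..."
/politics/article_002.txt  ->  "text of the second document ..."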

To transform our files into a SequenceFile, we use the seqdirectory command:

mahout seqdirectory -i initial_data -o sequencefile 

Transforming the text into vectors of attributes is done using the seq2sparse command, which creates a new SequenceFile containing the TF-IDF weighted vectors:

mahout seq2sparse -i sequencefile -o vectors -wt tfidf

Dividing the data into training data and test data is as easy as the previous steps. The selection is random, and the percentage of the total data reserved for testing is given by the --randomSelectionPct parameter:

mahout split -i vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 -ow --sequenceFiles

The training and test processes are carried out using the following commands:

mahout trainnb -i train-vectors -el -o model -li labelindex -ow 
mahout testnb -i test-vectors -m model -l labelindex -ow -o rezultate_test

At this point, the test results can be evaluated and, if necessary, we can adjust the initial data and repeat the process until the accuracy of the classification is satisfactory.

Text classification has many applications, and at this point the size of the data no longer represents a limitation. Hadoop and Mahout provide the necessary tools for building a classification process that can work with millions of documents, a process which would otherwise have been extremely hard to implement.