Sentiment Analysis and the complexity of online opinions

Cosmin Gabriel Popa
SA R&D Osprov Team
@Hewlett Packard

PROGRAMMING

If you lived in a world where your opinions mattered, would you try to make a change? Without any doubt, opinions are a constant presence in our day-to-day lives: suggestions and reviews are only a few of the many ways to influence future decisions. In an era when the multitude of options is overwhelming, opinions can help you make a quick decision under the spotlight. But let's assume that we can evaluate everything by pros and cons: would our decisions be correct? Is it actually that simple?

Sentiment Analysis, or Opinion Mining, is a branch of Natural Language Processing (NLP) that studies opinions, sentiments, evaluations, attitudes and emotions directed at entities such as products, organizations, individuals and events. There was little interest in this field before the year 2000, but with the spread of commercial applications (online and offline), the perspective on analyzing opinions has changed radically. For the first time in history, social media provides a large and consistent body of opinion data. The shift is not surprising given the popularity of social networks and the field's applications in areas ranging from politics and sociology to economics and culture.

The problem

Let's get back to the initial question. We'll take movies as an example, along with a series of their reviews. We'll split the opinions into the categories we established initially: positive (pros) and negative (cons). The following questions emerge: If there is a negative review of the lead actor and a positive one of the director, would you watch that movie? Would you need another opinion on that movie? How can you determine whether the next opinion is positive or not? And how would that influence your decision?

There are several levels of analysis, depending on the precision or interest required: Document Analysis, which determines whether an entire document expresses a positive or negative opinion; Sentence Analysis, which determines whether a sentence is positive, negative or neutral; and Entity and Aspect (or Feature) Analysis, which determines whether a piece of text is an opinion, what it is about, and its polarity.

Solution

There are two broad approaches to classifying opinions: training a classifier on labeled examples (Supervised Learning), and methods that need no labeled training data (Unsupervised Learning). Simplifying the analysis to Sentence Analysis and assuming that we have positive and negative documents at our disposal, we'll try a Supervised Learning solution with the help of NLTK (Natural Language Toolkit).

{python code}
from nltk.corpus import movie_reviews
positive_ids = movie_reviews.fileids('pos')
negative_ids = movie_reviews.fileids('neg')
{/python code}

Note that the "movie_reviews" corpus contains movie reviews already separated into the two categories we established initially. From this data we'll extract the words that will serve as training data.

{python code}
positive_data = [movie_reviews.words(fileids=[f]) for f in positive_ids]
negative_data = [movie_reviews.words(fileids=[f]) for f in negative_ids]
{/python code}

We now have to decide what these examples will look like when we train our classifier. Keeping this simple, we'll choose as features the words that best distinguish the two categories in the movie review corpus. First we'll implement a function that determines the frequency of every word in the two categories, separately and as a whole.

{python code}
import itertools
from nltk import FreqDist, ConditionalFreqDist

def buildFreqDistribution(positiveWords, negativeWords):
  # Overall word counts, and word counts per category
  word_fd = FreqDist()
  cond_word_fd = ConditionalFreqDist()
  # Each argument is a list of reviews; chain() flattens it into one stream of words
  for word in itertools.chain(*positiveWords):
    word_fd[word.lower()] += 1
    cond_word_fd['positive'][word.lower()] += 1
  for word in itertools.chain(*negativeWords):
    word_fd[word.lower()] += 1
    cond_word_fd['negative'][word.lower()] += 1
  return (word_fd, cond_word_fd)
{/python code}
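To make the two frequency tables concrete, the sketch below repeats the same counting on a few tiny invented reviews. It uses `collections.Counter` from the standard library in place of NLTK's `FreqDist` and `ConditionalFreqDist`; the sample data and variable names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import chain

# Invented toy data: each review is a list of tokens.
positive_reviews = [["A", "great", "film"], ["great", "acting"]]
negative_reviews = [["A", "dull", "film"]]

# One counter per category plays the role of the conditional distribution.
per_label = {
    "positive": Counter(w.lower() for w in chain.from_iterable(positive_reviews)),
    "negative": Counter(w.lower() for w in chain.from_iterable(negative_reviews)),
}

# Adding the two counters gives the overall distribution.
overall = per_label["positive"] + per_label["negative"]

print(overall["film"])                 # occurrences across both categories
print(per_label["positive"]["great"])  # occurrences in positive reviews only
```

A word that shows up about equally often in both categories says little about polarity, which is why the raw counts are only an input to the scoring step.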

Based on the calculated frequencies, we will build a dictionary that keeps a score for every word. In short, the "chi_sq" method of NLTK's "BigramAssocMeasures" class measures how strongly a word is associated with a category and returns a representative value: the higher the score, the more relevant the word.

{python code}
from nltk import BigramAssocMeasures

def buildWordsScores(word_fd, cond_word_fd, total_word_count):
  word_scores = {}
  for word, freq in word_fd.items():
    # Chi-square association between the word and each category
    positive_score = BigramAssocMeasures.chi_sq(
      cond_word_fd['positive'][word],
      (freq, cond_word_fd['positive'].N()),
      total_word_count)
    negative_score = BigramAssocMeasures.chi_sq(
      cond_word_fd['negative'][word],
      (freq, cond_word_fd['negative'].N()),
      total_word_count)
    # A word scores high if it is strongly tied to either category
    word_scores[word] = positive_score + negative_score
  return word_scores
{/python code}
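The chi-square score used above can also be written out by hand for a 2x2 contingency table (the word versus all other words, the category versus the other category). The sketch below is a plain-Python version of the standard Pearson chi-square statistic for such a table. The function name `chi_square_2x2` and the counts in the example are assumptions for illustration, and the sketch is not claimed to match NLTK's implementation in every detail.

```python
def chi_square_2x2(n_word_label, n_word, n_label, n_total):
    """Pearson chi-square for a 2x2 table described by its marginal counts.

    n_word_label: occurrences of the word inside the category
    n_word:       occurrences of the word in the whole corpus
    n_label:      total number of words in the category
    n_total:      total number of words in the corpus
    """
    a = n_word_label                               # the word, in the category
    b = n_word - n_word_label                      # the word, outside the category
    c = n_label - n_word_label                     # other words, in the category
    d = n_total - n_word - n_label + n_word_label  # other words, outside it
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denominator


# A word spread evenly over both categories carries no signal.
print(chi_square_2x2(10, 20, 500, 1000))  # 0.0
# A word found mostly in one category gets a clearly higher score.
print(chi_square_2x2(18, 20, 500, 1000))
```

The score grows as the word's distribution departs from what an even split between categories would predict, which is what makes it usable as a relevance ranking.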

Next we need functions that filter out the relevant information for training the classifier.

{python code}
def getFeatures(label, data, best_words):
  # Pair the feature dictionary of each review with its label
  features = []
  for feat in data:
    words = [selectBestWords(feat, best_words), label]
    features.append(words)
  return features

def selectBestWords(words, best_words):
  # Presence features: keep only the words found in best_words
  return dict([(word, True) for word in words
    if word in best_words])

def findBestWords(scores, number):
  # Keep the `number` highest-scoring words
  best_vals = sorted(scores.items(),
    key=lambda w_s: w_s[1], reverse=True)[:number]
  best_words = set([w for w, s in best_vals])
  return best_words
{/python code}
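As a quick illustration of what the classifier will receive, the sketch below applies the same "keep only the best words" idea to one invented review. The function `select_best_words` mirrors `selectBestWords` above; the word set and the tokens are assumptions made for the example.

```python
def select_best_words(words, best_words):
    """Map each retained word to True: a bag-of-words presence feature."""
    return {word: True for word in words if word in best_words}


# Invented example: a tiny "best words" set and one tokenized review.
best_words = {"terrific", "boring", "dragged"}
review_tokens = ["the", "rescue", "scenes", "are", "terrific", "but", "it", "dragged"]

features = select_best_words(review_tokens, best_words)
print(features)

# A training example pairs the feature dictionary with its label.
labeled_example = (features, "negative")
```

Everything that is not in the selected set is discarded, so word order and word counts within a review play no role in this representation.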

All that remains is to put the pieces together. With the extracted information we will train a NaiveBayesClassifier. Keeping it simple, we'll use the implementation already available in NLTK.

{python code}
from nltk import NaiveBayesClassifier

(word_fd, cond_word_fd) = buildFreqDistribution(positive_data, negative_data)

# Total number of words across both categories
total_word_count = (cond_word_fd['positive'].N() +
  cond_word_fd['negative'].N())

word_scores = buildWordsScores(word_fd, cond_word_fd,
  total_word_count)

# Keep the 1000 highest-scoring words as features
best_words = findBestWords(word_scores, 1000)
positive_features = getFeatures('positive',
  positive_data, best_words)

negative_features = getFeatures('negative',
  negative_data, best_words)

classifier = NaiveBayesClassifier.train(
  positive_features + negative_features)

classifier.show_most_informative_features(10)
{/python code}
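To show roughly what a Naive Bayes classifier does with presence features like these, here is a small standard-library sketch. It uses add-one (Laplace) smoothing and sums log-probabilities; those choices, the class name `TinyNaiveBayes` and the toy training data are assumptions for illustration, not a description of the internals of NLTK's `NaiveBayesClassifier`.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy Naive Bayes over word-presence features (illustrative only)."""

    def __init__(self, labeled_featuresets):
        self.label_counts = Counter()            # number of examples per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()
        for features, label in labeled_featuresets:
            self.label_counts[label] += 1
            for word in features:
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def log_scores(self, features):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_examples = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total_examples)
            # Add-one smoothing keeps unseen word/label pairs above zero.
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in features:
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return scores

    def classify(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Invented training examples in the (features, label) shape used above.
training = [
    ({"terrific": True, "moving": True}, "positive"),
    ({"terrific": True}, "positive"),
    ({"boring": True, "dragged": True}, "negative"),
    ({"dragged": True}, "negative"),
]
model = TinyNaiveBayes(training)
print(model.classify({"terrific": True}))
print(model.classify({"dragged": True, "boring": True}))
```

Each word contributes independently to the score of each label, so a handful of strongly associated words can outweigh the rest of a review.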

We now have a mechanism that can determine, with a certain precision, whether a movie review is positive or negative. Let's take as an example the following review of a Kevin Costner film: "Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters [...]". At first glance, this one seems negative.

{python code}
# words_in_review is assumed to hold the review as a list of lowercase tokens
features = selectBestWords(words_in_review,
  best_words)

print(classifier.classify(features))
{/python code}

The classifier labels it positive, with a score difference of about 1.4. If we add the rest of the review: "[...] Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as a kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet.", the result changes: the review is now classified as negative, with a score difference of about 0.9.

Case study

Running the SVM (Support Vector Machines) algorithm in parallel with the NaiveBayesClassifier used previously, and testing both on a dataset of about 6 million words, we get the following results. As the number of features extracted from the text and used to train the classifiers grows, we can observe differences in accuracy and classification.

We can observe that the accuracy grows steadily up to a point where it levels off at around 81%. Even if the number of features keeps growing, the accuracy oscillates around that value. The instability of the NaiveBayesClassifier algorithm is clear in the last chart, however; the SVM seems to be more reliable.
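Accuracy is usually measured by comparing a classifier's predictions with known labels on reviews that were kept out of training. The sketch below shows that measurement with a hypothetical `accuracy` helper and invented labels; the actual test setup behind the figures above is not described in this article.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Invented labels for five held-out reviews.
expected = ["positive", "negative", "negative", "positive", "negative"]
predicted = ["positive", "negative", "positive", "positive", "negative"]

print(accuracy(predicted, expected))  # 4 of 5 predictions are correct
```

Repeating this measurement while varying the number of selected words is one way to obtain an accuracy curve like the one discussed above.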

Conclusions

It seems that it is not THAT simple. We were fooled by our own algorithm. In fact, there is no standardized way of determining the "feeling" of an opinion. Even with a blunt simplification of the problem, the accuracy never reached 95%. As already mentioned, there are different methods and algorithms that can be used in Supervised and Unsupervised Learning, but the suggestion is to combine the two categories. Of course, how the training data is used, which methods are implemented and which test corpus is chosen all matter a great deal; they have to match the user's initial intention. Cumulated opinions can determine the value of a product, an individual, an idea or an event. Sometimes they are more powerful than the price tag. You are overwhelmed with opinions every day, on every device, app, webpage or street. Finally, does your own opinion matter? I say: Yes!