Sentiment analysis in text mining is the process of categorizing the opinions expressed in a piece of text. Its most basic form predicts whether the opinion about something is positive or negative (polarity). Other forms of sentiment analysis or opinion mining include predicting a rating on a scale from a product review, predicting polarity for individual aspects of a product, and detecting subjectivity and objectivity in sentences.

Our objective: to perform sentiment polarity analysis on movie reviews. In other words, to classify the opinion expressed in a text review (document) and determine whether the reviewer’s sentiment towards the movie is positive or negative.

The corpus used here is polarity dataset v2.0. It contains 2000 labelled movie review files, 1000 for each of the two sentiments. The sentences in the files are already processed and lowercased, so we do not need any preprocessing and can start building the application directly. As with any other classification problem, we have to train a classifier on some features of the sentiment classes. So there are two sub-tasks:

1. Feature extraction process
2. Training the classifier

In this blog-post, we will focus mainly on the most popular and widely used word weighting scheme in text mining problems, known as term frequency and inverse document frequency (tf-idf). We will then train a Support Vector Machine (SVM) classifier and a Multinomial Naive Bayes classifier on tf-idf weighted word frequency features. Finally, we will analyse the effect of this scheme by checking the performance of the trained models on test movie review files.

Tf-Idf weighted Word Count Feature Extraction

Conventionally, a histogram of words serves as the feature for text classification problems. In general, we first build the vocabulary of the corpus and then generate a word count vector from each file, which is nothing but the frequency of each vocabulary word in that file. Most entries will be zero, as a single file won’t contain all the words in the vocabulary. For example, suppose we have 500 words in the vocabulary. Each word count vector will then contain the frequencies of those 500 vocabulary words in the text file. Suppose the text in a file was “Get the work done, work done”. A fixed-length encoding will be generated as [0,0,0,0,0,…….0,0,2,0,0,0,……,0,0,1,0,0,…0,0,1,0,0,……2,0,0,0,0,0]. Here, the non-zero word counts are placed at the 296th, 359th, 415th and 495th indices of the 500-length word count vector and the rest are zero. The blog-post presents a document classification application using the conventional approach mentioned above. But this conventional approach of extracting features has the limitations listed below:

a) Frequently occurring words that are present in all files of the corpus irrespective of the sentiment, such as ‘movie’ and ‘acting’ in this case, will be treated the same as the distinguishing words in the document.
b) Stop words will be present in the vocab if not processed properly.
c) Rare words or key words which can be distinguishing will not get special weight.
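To make the conventional encoding concrete, here is a minimal sketch that builds the word count vector for the example sentence above. The six-word vocabulary and the helper name count_vector are made up for illustration; a real vocabulary would be built from the whole corpus and would be far larger.

```python
from collections import Counter
import re

# Hypothetical toy vocabulary (a real one is built from the whole corpus).
vocabulary = ["acting", "done", "get", "movie", "the", "work"]

def count_vector(text, vocab):
    """Return the fixed-length word count vector of `text` over `vocab`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Words of the vocabulary that are absent from the text get a count of 0.
    return [counts[word] for word in vocab]

vector = count_vector("Get the work done, work done", vocabulary)
print(vector)  # [0, 2, 1, 0, 1, 2]
```

Note that ‘the’ receives a count just like ‘work’ does: plain counts have no notion of how common a word is across the corpus.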

This is where the tf-idf weighting factor comes in, as it addresses these limitations. The first question that comes to mind is “what does tf-idf do to these conventional features?”.

Term frequency

It increases the weight of the terms (words) that occur more frequently in the document. Quite intuitive, right? So it can be defined as tf(t,d) = F(t,d), where F(t,d) is the number of occurrences of term ‘t’ in document ‘d’. In practice, though, it seems unlikely that thirty occurrences of a term in a document truly carry thirty times the significance of a single occurrence. To make it more pragmatic, tf is scaled logarithmically, so that as the frequency of a term increases exponentially, its weight increases only additively.

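tf(t,d) = 1 + log(F(t,d)) for F(t,d) > 0, and 0 otherwise

(The added 1 keeps a term that occurs exactly once from getting a weight of log(1) = 0. This is the form scikit-learn applies when sublinear_tf is enabled.)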
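The effect of the logarithmic scaling is easy to check numerically. The sketch below uses the 1 + log(count) variant, which is what scikit-learn applies when sublinear_tf=True; the offset of 1 is an addition to the plain log so that a single occurrence does not end up with weight zero. The function name sublinear_tf is just an illustrative helper.

```python
import math

def sublinear_tf(count):
    """Log-scaled term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# Each tenfold increase in the raw count adds the same amount, ln(10), to the weight.
for count in (1, 10, 100, 1000):
    print(count, round(sublinear_tf(count), 3))
```

So a term occurring 1000 times ends up with a weight of roughly 8 instead of 1000.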

Inverse document frequency

It diminishes the weight of the terms that occur in all the documents of the corpus and, conversely, increases the weight of the terms that occur in only a few documents across the corpus. Basically, the rare keywords get special treatment and stop words/non-distinguishing words get punished. It is defined as:

idf(t,D) = log(N / N_t)

Here, ‘N’ is the total number of files in the corpus ‘D’ and ‘N_t’ is the number of files in which term ‘t’ is present. By now, we can agree that tf is an intra-document factor which depends on the individual document, while idf is a per-corpus factor which is constant for a given term across the corpus. Finally, tf-idf is calculated as:

tf-idf(t,d,D) = tf(t,d) × idf(t,D)
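The formulas above can be worked through by hand on a tiny example. The sketch below uses the raw count for tf and log(N / N_t) for idf; the three tokenized documents and the function names are made up for illustration.

```python
import math

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
    ["great", "plot", "boring", "acting", "movie"],
]

def tf(term, doc):
    """Raw term frequency F(t, d)."""
    return doc.count(term)

def idf(term, corpus):
    """log(N / N_t); assumes the term occurs in at least one document."""
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ("movie", "great", "boring"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

‘movie’ occurs in every document, so its idf is log(1) = 0 and it contributes nothing, while ‘great’ keeps a positive weight in the first document. Note that scikit-learn by default uses a smoothed variant of idf, so its numbers will differ from this hand computation.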

Enough with the theory, let’s get hands-on and write Python code for extracting such features using the scikit-learn machine learning library. It is an open-source Python ML library that comes bundled with the third-party Anaconda distribution, or it can be installed separately by following this.

sklearn.feature_extraction.text.TfidfVectorizer: Python implementation

At first sight, the above heading may seem strange, but it is the name of a class implemented in the sklearn library. We can extract tf-idf weighted features with the help of its methods. Let’s recall the description of the polarity movie review data-set used here. We will divide the corpus in a 90:10 split, so that 1800 review files are used as the training set and the remaining 200 review files as the test set. The code snippet below shows how to extract features from the text files.

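from sklearn.feature_extraction.text import TfidfVectorizer

# X_train and X_test are lists of strings, one string per review
vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')
train_corpus_tf_idf = vectorizer.fit_transform(X_train)  # learn vocabulary and idf on the training set
test_corpus_tf_idf = vectorizer.transform(X_test)        # reuse them for the test set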

Let us understand this. An object of the TfidfVectorizer class can be initialized with the following parameters –

  • min_df – remove the words from the vocabulary which have occurred in fewer than ‘min_df’ files (when given as an integer).
  • max_df – remove the words from the vocabulary which have occurred in more than ‘max_df’ * total number of files in the corpus (when given as a float between 0 and 1).
  • sublinear_tf – scale the term frequency logarithmically (as discussed earlier).
  • stop_words – remove the predefined stop words of the given language if present.
  • use_idf – weight the features by inverse document frequency.
  • token_pattern – a regular expression for the kind of words chosen for the vocabulary. The default is r'(?u)\b\w\w+\b', which means only words with 2 or more alphanumeric characters. If you want to keep only words with 2 or more letters (no digits), set token_pattern to r'(?u)\b[^\W\d][^\W\d]+\b'.
  • max_features – the maximum number of words to be kept in the vocabulary, ordered by term frequency across the corpus.
  • vocabulary – if you have created your own vocabulary, pass it here; otherwise the vocabulary is generated from the training data.
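The two token_pattern regular expressions mentioned above can be tried out directly with Python’s re module, independently of scikit-learn. This is only a small sketch; the sample sentence is made up.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"          # words of 2 or more alphanumeric characters
letters_only = r"(?u)\b[^\W\d][^\W\d]+\b"   # words of 2 or more characters, no digits

text = "a 12 angry men in 1957"

default_tokens = re.findall(default_pattern, text)
letter_tokens = re.findall(letters_only, text)

print(default_tokens)  # ['12', 'angry', 'men', 'in', '1957']
print(letter_tokens)   # ['angry', 'men', 'in']
```

The single-character word ‘a’ is dropped by both patterns, while the numbers survive only under the default one.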

After initialization, the fit_transform() method is called on the vectorizer object with the parameter X_train. X_train is a list (iterable) of strings where each string holds the content of one document, so the length of this list is the number of training documents. Here, fit_transform(X_train) does the following things.

1. Tokenizes each string in the iterable into words, dropping special characters and any token that does not match the token_pattern regex, and removes stop words.
2. Creates a vocabulary of words with their counts in the training set, taking max_features, min_df and max_df into consideration.
3. Finally, for each single string (document), it creates the tf-idf word count vector. This is a vector over all words in the vocabulary, with each frequency weighted by term frequency and inverse document frequency.
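The steps above can be observed on a miniature example. The sketch below is an illustration only: the four short "reviews" are invented, and the filtering parameters (min_df, max_df, stop_words) are left at their defaults because the corpus is too small for them to be meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus standing in for the review files.
train_docs = [
    "a great movie with great acting",
    "a boring movie with a weak plot",
    "great plot but boring acting",
]
test_docs = ["a great plot and a surprising ending"]

vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary and idf weights
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have one column per vocabulary word. Words seen only in the test document, such as ‘surprising’, are not in the vocabulary and are simply ignored by transform().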

The fit_transform() call returns a matrix of feature vectors, with one fixed-length tf-idf weighted word count vector for each document in the training set. This is also called a document-term matrix. With this we are ready to train our SVM and MultinomialNB classifiers, whose fit() method takes two parameters, namely the document-term matrix and the polarity labels of the 1800 training files (900 positive and 900 negative). This completes our training process. Similarly, vectorizer.transform(X_test) generates a document-term matrix for the 200 test files using the same vocabulary generated while training. We can now check the performance of our trained models on the document-term matrix of the test set. Below is the full code of sentiment analysis on the movie review polarity data-set using tf-idf features.

import os
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

def make_Corpus(root_dir):
    # sorted() makes the order deterministic; this assumes the negative folder
    # sorts before the positive one (e.g. 'neg' before 'pos')
    polarity_dirs = [os.path.join(root_dir, f) for f in sorted(os.listdir(root_dir))]
    corpus = []
    for polarity_dir in polarity_dirs:
        reviews = [os.path.join(polarity_dir, f) for f in sorted(os.listdir(polarity_dir))]
        for review in reviews:
            # Read all lines of a review file into one string per document
            with open(review) as rev:
                doc_string = rev.read()
            corpus.append(doc_string)
    return corpus

#Create a corpus with each document having one string
root_dir = 'txt_sentoken'
corpus = make_Corpus(root_dir)

#Stratified 10-fold cross-validation with SVM and Multinomial NB
#Assumes the corpus holds the 1000 negative reviews first, then the 1000 positive ones
labels = np.zeros(2000)
labels[1000:] = 1
kf = StratifiedKFold(n_splits=10)

totalsvm = 0           # Number of correct predictions over all 2000 files
totalNB = 0
totalMatSvm = np.zeros((2,2))  # Confusion matrix on 2000 files
totalMatNB = np.zeros((2,2))

for train_index, test_index in kf.split(corpus,labels):
    X_train = [corpus[i] for i in train_index]
    X_test = [corpus[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    vectorizer = TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True,stop_words='english')
    train_corpus_tf_idf = vectorizer.fit_transform(X_train) 
    test_corpus_tf_idf = vectorizer.transform(X_test)
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model1.fit(train_corpus_tf_idf, y_train)
    model2.fit(train_corpus_tf_idf, y_train)
    result1 = model1.predict(test_corpus_tf_idf)
    result2 = model2.predict(test_corpus_tf_idf)
    totalMatSvm = totalMatSvm + confusion_matrix(y_test, result1)
    totalMatNB = totalMatNB + confusion_matrix(y_test, result2)
    totalsvm = totalsvm+sum(y_test==result1)
    totalNB = totalNB+sum(y_test==result2)
print(totalMatSvm, totalsvm/2000.0, totalMatNB, totalNB/2000.0)

There are two things which may seem unexpected to you. Firstly, make_Corpus() reads each file of the data-set and converts the multiple lines of text in a document into one string per document. Secondly, kf = StratifiedKFold(n_splits=10) initializes K-fold cross-validation, so that the data-set is partitioned into 10 parts while preserving the class balance in each part. Of these 10 parts, 1 part is used for testing the model while the other 9 parts are used for training. The validation process is repeated K times, so that each partition is used for testing exactly once without being included in the training set of that round.
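To see what the stratified split does, here is a small sketch on stand-in data: 20 placeholder documents and 5 folds instead of 2000 reviews and 10 folds. The variable names times_tested and fold_ratios are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in: 20 documents, the first 10 negative (0), the rest positive (1).
labels = np.array([0] * 10 + [1] * 10)
documents = ["doc %d" % i for i in range(20)]

kf = StratifiedKFold(n_splits=5)
times_tested = np.zeros(len(documents), dtype=int)
fold_ratios = []

for train_index, test_index in kf.split(documents, labels):
    times_tested[test_index] += 1
    # Share of positive documents in this test fold
    fold_ratios.append(labels[test_index].mean())

print(times_tested)  # how often each document was used for testing
print(fold_ratios)   # class balance of each test fold
```

Every document lands in a test fold exactly once, and each fold keeps the 50:50 class ratio of the full data-set.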

Checking Performance

10-fold cross-validation allows us to test all the files of the corpus. There are 1000 text reviews for each of the two sentiments (positive and negative). The results below show the number of text files whose sentiment was correctly identified by each classifier, so one can compare the two classifiers (MultinomialNB and SVM). We can also compare these results with those obtained when the same was implemented using conventional word count features (Github link).

Features/Models            Multinomial NB    SVM
Conventional word count    1646 (82.3%)      1636 (81.8%)
Tf-idf weighted factor     1665 (83.25%)     1748 (87.4%)

SVM outperforms Multinomial NB when using tf-idf weighted features. For SVM, the true identification rate improves by 5.6 percentage points (from 81.8% to 87.4%) when the weighting factors are applied to the word count features; for Multinomial NB the gain is smaller, just under 1 percentage point (from 82.3% to 83.25%). The confusion matrices for both Multinomial NB and SVM using tf-idf features are shown below:

Multinomial NB       Predicted negative    Predicted positive
Actual negative      856                   144
Actual positive      191                   809

SVM (Linear)         Predicted negative    Predicted positive
Actual negative      874                   126
Actual positive      126                   874
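The accuracy figures quoted earlier can be recomputed from these confusion matrices, along with precision and recall for the positive class. This is a small sketch; it follows scikit-learn’s confusion_matrix convention (rows are the actual classes, columns the predicted ones), and the helper name summarize is made up.

```python
# Confusion matrices copied from the tables above.
# Rows = actual class, columns = predicted class, ordered (negative, positive).
nb = [[856, 144], [191, 809]]
svm = [[874, 126], [126, 874]]

def summarize(matrix):
    """Accuracy plus precision and recall of the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_pos": tp / (tp + fp),
        "recall_pos": tp / (tp + fn),
    }

nb_summary = summarize(nb)
svm_summary = summarize(svm)
print("Multinomial NB:", nb_summary)
print("SVM (Linear):  ", svm_summary)
```

For Multinomial NB the recall on positive reviews (809 of 1000) is lower than on negative reviews (856 of 1000), whereas the SVM errors are spread evenly over both classes.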

Concluding Remarks

I hope I have done justice to tf-idf features in this blog. I have tried to explain the usefulness of these features with a sentiment analysis application. Beginners are encouraged to implement it, match their outputs with the results shown here and try to analyse the difference between conventional word count features and tf-idf weighted features. One can read my previous post to learn how to implement conventional features for a classification problem. Also, to get more insight into the TfidfVectorizer class of sklearn, one should play with its various parameters like token_pattern, vocabulary, stop_words, max_features, etc.

The machine learning models (Multinomial NB and SVM) have been used here without giving their background, as that might overwhelm readers with too much information in a single blog-post. One may apply other variants of these classifiers in order to compare them and analyse the underlying differences among them. Here, the purpose was to present an understanding of term frequency and inverse document frequency and their importance in text mining applications.

The full Python implementation of sentiment analysis on the polarity movie review data-set using both types of features can be found on Github link here.

If you liked the post, follow this blog to get updates about the upcoming articles. Also, share this article so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂