Sentiment Analysis: SVM vs. Naive Bayes
This recipe compares two machine learning approaches to see which gives the more accurate analysis of sentiment. Both approaches are trained on a corpus of positive and negative Movie Review data and then tested on a second corpus to obtain an accuracy score. The techniques are Support Vector Machines (SVM) and Naive Bayes. This recipe is based on Jinman Zhang's IPython Notebook, found in the TAPoR GitHub repository.
- Extract the training dataset
- The downloaded polarity dataset v1.1 contains two distinct sets of reviews, one positive and one negative. In preparation for training, we create two lists: one stores the actual review texts, and the other labels each movie review document 'positive' or 'negative' depending on the set it came from.
- Vectorize
- We convert these raw review texts into a tf-idf (Term Frequency - Inverse Document Frequency) matrix of features. The vectorizer has parameters that set the document-frequency thresholds beyond which terms are left out of the vocabulary, for example terms that appear in nearly every document.
- Classification and testing using SVM
- The first approach uses SVM. We fit the classifier on the vectorized training data together with its polarity labels (positive and negative). This generates a model that we use in the next step, testing.
- Extract the testing dataset
- We now import polarity dataset v0.9 in the same manner as we imported the training dataset in step one. We vectorize the test dataset and use the model generated from the training dataset to predict a label for each test review. To obtain an accuracy score, the predicted labels are compared against the polarity labels of the test dataset. The result is a float between zero and one: values closer to one indicate higher accuracy, and values closer to zero indicate lower accuracy.
- Classification and testing using Naive Bayes
- The process is similar to the one for SVM, except that the classifier here is multinomial Naive Bayes, which is suited to discrete features and also works with the tf-idf matrices we created in step two. Prediction and the accuracy score are obtained as in the previous step.
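The two extraction steps above both build a pair of parallel lists, one of review texts and one of labels. A minimal sketch of that idea follows. The `load_reviews` helper and the `pos/` and `neg/` folder names are assumptions made for illustration, not the layout of the actual polarity download, and a tiny throwaway corpus stands in for the real data so the example runs on its own.

```python
import tempfile
from pathlib import Path


def load_reviews(root):
    """Return (texts, labels) for the .txt files under root/pos and root/neg.

    The pos/ and neg/ folder names are an assumed layout for this sketch.
    """
    texts, labels = [], []
    for folder, label in (("pos", "positive"), ("neg", "negative")):
        # sorted() keeps the order of documents stable between runs.
        for path in sorted(Path(root, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels


# Build a tiny stand-in corpus so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as root:
    for folder, text in (("pos", "a moving, well acted film"),
                         ("neg", "a dull and lifeless script")):
        Path(root, folder).mkdir()
        Path(root, folder, "review1.txt").write_text(text, encoding="utf-8")
    texts, labels = load_reviews(root)

print(labels)  # ['positive', 'negative']
```

The same helper could be called once on the training set and once on the test set, giving two pairs of lists that line up index by index.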
Both techniques serve as good primers for conducting sentiment analysis using machine learning techniques. The scikit-learn library simplifies the process into five major steps: training, vectorization, classification, prediction and testing.
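As a rough illustration of how these steps map onto scikit-learn calls, the sketch below runs both classifiers on a handful of made-up reviews. It is not the notebook's code: the toy texts, the `min_df` and `max_df` values, and the choice of `LinearSVC` as the SVM implementation are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-ins for the training and test corpora (invented for this sketch).
train_texts = [
    "a wonderful and moving film",
    "brilliant acting and a great story",
    "great fun from start to finish",
    "a dull and boring film",
    "terrible acting and a weak story",
    "boring from start to finish",
]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_texts = ["a great and moving story", "dull acting and a terrible film"]
test_labels = ["positive", "negative"]

# Vectorization: min_df and max_df are the document-frequency thresholds.
# The values here are illustrative, not the ones used in the notebook.
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9)
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and weights
X_test = vectorizer.transform(test_texts)        # reuse the training vocabulary

# Classification, prediction and testing, once per classifier.
scores = {}
for name, clf in (("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())):
    clf.fit(X_train, train_labels)
    predicted = clf.predict(X_test)
    scores[name] = accuracy_score(test_labels, predicted)
    print(f"{name}: {scores[name]:.2f}")
```

Note that the test texts are passed through `transform` rather than `fit_transform`, so both corpora share the vocabulary learned from the training data. Scores on a corpus this small say nothing about the real datasets.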
This recipe is useful in deciding which approach will give you the most reliable results when analysing your corpora. The accuracy scores of the two techniques, SVM and Naive Bayes, also raise an important question: why would the former perform better than the latter in this case? This could be due to many factors: the nature of the datasets used, the quality of the training dataset, the different technical approaches used internally, etc.