Text Classification by Topic

Introduction 

This recipe shows you how to classify text into general topics. We use a supervised machine learning model called a Support Vector Machine (SVM), trained on the 20 Newsgroups data set, as the topic classifier. This recipe is based on Jinman Zhang's Cookbook.

Ingredients 
Steps 
  • Import the 20 Newsgroups Dataset
    • We import the dataset loader from scikit-learn (sklearn) and create a shortlist of the subject categories we want to classify with. The dataset is downloaded automatically the first time it is fetched and is cached afterwards, so it is downloaded only once. It comprises both a training and a testing dataset.
  • Load the dataset
    • We load the training dataset into a variable and turn the dataset text into vectors of numerical values for statistical analysis.
  • Testing the model
    • We provide a couple of sentences to test whether the model trained on the training dataset categorizes them into the topics we expect.
    • The output of the classification is a prediction of where each sentence fits in our shortlist of subject-terms.
  • Verifying the tests
    • To measure the accuracy of the above results, we use the same shortlist of subject-terms, but this time with the testing dataset rather than the training dataset. We then compute the F1 score, a metric between 0 and 1 where higher is better.
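
The steps above can be sketched in Python with scikit-learn roughly as follows. This is a minimal sketch rather than the original cookbook code: the category shortlist, the two sample sentences, the helper names `build_classifier` and `run_recipe`, and the choice of TF-IDF vectors with macro-averaged F1 are all assumptions made for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumed shortlist; any of the 20 newsgroup names could be used instead.
CATEGORIES = ["comp.graphics", "rec.autos", "sci.med", "sci.space"]

# Illustrative sentences only; replace them with your own text.
SAMPLES = [
    "The shuttle will launch a new satellite into orbit next week.",
    "My car needs new brake pads and an oil change.",
]


def build_classifier():
    """Chain a TF-IDF vectorizer (text to numbers) with a linear SVM."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())


def run_recipe(categories=CATEGORIES, samples=SAMPLES):
    """Train on the training subset, classify samples, return macro F1."""
    # The first call downloads the data; later calls read the local cache.
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    # Load the dataset: vectorize the training text and fit the model.
    model = build_classifier()
    model.fit(train.data, train.target)

    # Testing the model: predict a category for each sample sentence.
    for text, label in zip(samples, model.predict(samples)):
        print(f"{train.target_names[label]}: {text}")

    # Verifying the tests: score predictions on the held-out testing subset.
    predicted = model.predict(test.data)
    return f1_score(test.target, predicted, average="macro")
```

Calling `run_recipe()` prints the predicted category for each sample sentence and returns the F1 score. Wrapping the vectorizer and classifier in one pipeline means the same text-to-vector step is applied to the training data, the sample sentences, and the testing data.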
Discussion 

There are several classification models among the SVM supervised learning methods, each with its pros and cons. We use the linear approach (LinearSVC in scikit-learn) in this recipe because it scales better with larger data samples. Other SVM classifiers are SVC and NuSVC.

Other machine learning methods, such as the Naive Bayes classifier, also exist, but they are often less accurate than SVM on text classification tasks like this one.
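
One way to check these trade-offs on your own data is to fit each classifier on the same training texts and compare their F1 scores. The sketch below assumes TF-IDF features, default settings for every model, and macro-averaged F1; the helper names `candidate_models` and `compare_models` are hypothetical, and `MultinomialNB` is used here as one common Naive Bayes variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC, NuSVC


def candidate_models():
    """Return one TF-IDF pipeline per classifier, keyed by display name."""
    return {
        "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "SVC": make_pipeline(TfidfVectorizer(), SVC()),
        "NuSVC": make_pipeline(TfidfVectorizer(), NuSVC()),
        "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    }


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit every candidate and return its macro F1 on the test texts."""
    scores = {}
    for name, model in candidate_models().items():
        model.fit(train_texts, train_labels)
        predicted = model.predict(test_texts)
        scores[name] = f1_score(test_labels, predicted, average="macro")
    return scores
```

The function takes plain lists of texts and labels, so it works with the training and testing subsets of 20 Newsgroups as well as with your own labelled corpus. Expect SVC and NuSVC to take noticeably longer than LinearSVC as the number of documents grows.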

This is a useful introductory recipe for machine learning techniques in text analysis. Working with your own dataset, you can train a model on an initial labelled sample and thereafter classify texts into predefined categories depending on your project needs. You could also use this technique on large research corpora to categorize them into topics and themes.

Next steps / further information 

TAPor Entry

Status