Recipe
This recipe compares two machine learning approaches to see which gives a more accurate sentiment analysis. Both approaches are trained on a corpus of positive and negative movie reviews and then tested to produce an accuracy score. The techniques are Support Vector Machines (SVM) and Naive Bayes.
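To make the train-then-test idea concrete, here is a minimal sketch of the Naive Bayes half only, written with the standard library. The four training sentences, the function names and the add-one smoothing are illustrative assumptions, not the recipe's data or code. An SVM comparison would normally rely on a machine learning library and is not shown here.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical, for illustration only).
training = [
    ("a wonderful moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull boring film", "neg"),
    ("terrible acting and a weak story", "neg"),
]
model = train_naive_bayes(training)

# Toy test data: accuracy is the share of correct predictions.
testing = [("great film", "pos"), ("boring story", "neg")]
correct = sum(predict(model, text) == label for text, label in testing)
accuracy = correct / len(testing)
```

With a real corpus, the test reviews would be held out from training, and the same accuracy calculation would be repeated for each classifier being compared.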
Multiple Correspondence Analysis (MCA) is a data analysis technique that can detect and represent the underlying structures of a dataset. In terms of textual analysis, we can identify and graph simultaneously occurring variables from the texts that comprise a corpus.
You can retrieve the MCA module at https://pypi.python.org/pypi/mca/1.0.3, or by typing the following line in a terminal:
pip install --user mca
In this recipe, we measure a corpus to determine the authorship of the featured texts and visualize the texts by author. We will use multidimensional scaling (MDS) as one of the techniques for analyzing similar and dissimilar data. This recipe is based on Jinman Zhang's Cookbook on GitHub.
This recipe will use regular expressions to clean up a webpage. This is useful if you want to carry out any meaningful textual analysis of the content of a web page. We can remove the HTML tags and other unnecessary textual elements with this method.
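The cleanup described in the regular-expression recipe above can be sketched as follows. This is a minimal illustration using Python's `re` and `html` modules. The function name `strip_html` and the sample markup are assumptions made for this example, and regular expressions are only a rough tool for real-world HTML.

```python
import re
from html import unescape

def strip_html(raw_html):
    """Return the visible text of an HTML string, with tags removed."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw_html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Drop any remaining tags, leaving a space so words do not fuse.
    text = re.sub(r"<[^>]+>", " ", text)
    # Convert entities such as &amp; back into plain characters.
    text = unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample page, for illustration only.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; welcome.</p></body></html>"
)
cleaned = strip_html(sample)
print(cleaned)
```

For badly formed pages, a dedicated HTML parser is usually more reliable than patterns like these.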
This recipe will show you how to classify text into general topics. We will use a supervised machine learning model called Support Vector Machines (SVM), trained on the 20 Newsgroups subject matter data set, as the topic classifier. This recipe is based on Jinman Zhang's Cookbook.
This recipe will show you how to prepare Voronoi diagrams, one way of showing relationships between words in a text and a search term. To do this, we will use word embeddings, which represent individual words in a text as real-valued vectors in a confined vector space. This recipe is based on Kynan Ly's cookbook as seen on this notebook.
Similar to finding People and Characters, finding locations in text is a common exploratory technique. This recipe shows how to extract places, countries, and cities from a text. We will use the Named-Entity Recognition (NER) module of the NLTK library to achieve this. This recipe is based on Jinman Zhang's cookbook.
In this recipe we use 3 ebooks to show how topic analysis can identify the different topics each text represents. We will use the Latent Dirichlet Allocation (LDA) approach, a widely used modelling method for discovering topics. We can then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook found on TAPoR.
NB: Any number of texts can be used; we chose 3 for this recipe.
Word frequencies and counts are text analysis methods that return results about the words in a text or set of texts. Counts return the number of times a word is used in the text, whereas frequencies give a sense of how often a word is used in comparison to others in the text.
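The difference between counts and frequencies can be illustrated with a short sketch. It uses only the standard library. The function names, the simple word pattern and the sample sentence are assumptions for this example, and frequency is taken here to mean a word's count divided by the total number of words.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how many times each lowercased word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def relative_frequencies(counts):
    """Divide each count by the total number of words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat too."
counts = word_counts(sample)
freqs = relative_frequencies(counts)

# Show the three most common words with both measures.
for word, count in counts.most_common(3):
    print(word, count, round(freqs[word], 2))
```

Relative frequencies make texts of different lengths comparable, which raw counts do not.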
Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be letters, punctuation, words, or sentences. They could even be a combination of all these elements. This recipe was adapted from a Python Notebook written by Kynan Ly.
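The token types mentioned above can be sketched with regular expressions from the standard library. The function names, patterns and sample text are assumptions for illustration. The sentence splitter is deliberately naive and will mis-split abbreviations such as "Dr.".

```python
import re

def sentence_tokens(text):
    """Split text after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokens(text, keep_punctuation=True):
    """Split text into words, optionally keeping punctuation marks."""
    word = r"\w+(?:'\w+)?"          # a word, allowing one inner apostrophe
    pattern = word + r"|[^\w\s]" if keep_punctuation else word
    return re.findall(pattern, text)

def character_tokens(text):
    """Split text into individual non-whitespace characters."""
    return [ch for ch in text if not ch.isspace()]

# Hypothetical sample text, for illustration only.
sample = "Don't panic. It's only text!"
sentences = sentence_tokens(sample)
words_and_marks = word_tokens(sample)
words_only = word_tokens(sample, keep_punctuation=False)
letters = character_tokens("a b")
```

Which granularity to choose depends on the analysis: word tokens suit frequency counts, while sentence tokens suit tasks that need context.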