Multiple Correspondence Analysis (MCA) is a data analysis technique that detects and represents the underlying structures of a dataset. In textual analysis, it lets us identify and graph variables that occur together across the texts that make up a corpus.
You can retrieve the MCA module at https://pypi.python.org/pypi/mca/1.0.3, or install it by running the following command in a terminal:
pip install --user mca
This recipe will show you how to prepare Voronoi diagrams, one way of showing relationships between words in a text and a search term. To do this, we will use word embeddings, which represent individual words in a text as real-valued vectors in a vector space of fixed dimension. This recipe is based on Kynan Ly's cookbook as seen in this notebook.
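As a rough illustration of the Voronoi step only, here is a minimal sketch using `scipy.spatial.Voronoi`. It assumes the word embeddings have already been reduced to two dimensions; the words and coordinates below are made up for illustration and do not come from a trained model or from the original notebook.

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical 2D positions for a handful of words (illustrative values only,
# standing in for embeddings that were reduced to two dimensions).
words = ["whale", "ship", "sea", "captain", "harpoon", "storm"]
points = np.array([
    [0.1, 0.9],
    [0.8, 0.7],
    [0.4, 0.2],
    [0.9, 0.1],
    [0.3, 0.6],
    [0.6, 0.45],
])

# Each input point gets one Voronoi region: the area closer to that word
# than to any other word in the set.
vor = Voronoi(points)

# Map each word to the vertex indices of its region.
# An index of -1 means the region extends to infinity (an unbounded cell).
cells = {
    word: vor.regions[vor.point_region[i]]
    for i, word in enumerate(words)
}

for word, region in cells.items():
    kind = "unbounded" if -1 in region else "bounded"
    print(f"{word}: {kind} cell with vertex indices {region}")
```

If matplotlib is available, `scipy.spatial.voronoi_plot_2d(vor)` draws the resulting diagram, and the word labels can be added at the coordinates in `points`.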
Like finding people and characters, finding locations in a text is a common exploratory technique. This recipe shows how to extract places, countries, and cities from a text. We will use the Named-Entity Recognition (NER) module of the NLTK library to achieve this. This recipe is based on Jinman Zhang's cookbook.
Word frequencies and counts are text analysis methods that return results about the words in a text or set of texts. Counts return the number of times a word is used in the text, whereas frequencies give a sense of how often a word is used in comparison to the other words in the text.
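The distinction between counts and frequencies can be sketched with the standard library alone. The tokenizer below (lowercase runs of letters and apostrophes) and the function names are assumptions made for this example, not the recipe's own code.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercased word to its raw count."""
    # Simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def word_frequencies(counts):
    """Convert raw counts to relative frequencies (count / total words)."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


sample = "the cat sat on the mat and the dog sat too"
counts = word_counts(sample)
freqs = word_frequencies(counts)

print(counts.most_common(2))    # the two most used words with raw counts
print(round(freqs["the"], 3))   # share of all words that are "the"
```

Raw counts depend on the length of the text, while relative frequencies can be compared across texts of different sizes.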
A common requirement in extracting information is the ability to identify all persons or characters referred to in a text. This involves identifying the parts of speech in the text, tagging them, and retrieving the names from the tagged text. The common approach is to use a part-of-speech (POS) tagger, which analyses a sentence and associates each word with its lexical descriptor, i.e. whether it is an adverb, noun, adjective, conjunction, etc. NLTK is a robust library and is therefore the main ingredient of our recipe.
This recipe compares word usage across two corpora to see whether there is any difference between the two. For a given word in the two texts, you use the U test to determine whether the difference in usage is significant. This helps determine the uniqueness of a term.
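One way to set this up is sketched below. It assumes the U test meant here is the Mann-Whitney U test as implemented by `scipy.stats.mannwhitneyu`, and that each corpus is split into equal-sized chunks so that every chunk contributes one count of the term. The chunking scheme, the helper names, and the sample counts are hypothetical and chosen only for illustration.

```python
from scipy.stats import mannwhitneyu


def per_chunk_counts(tokens, term, chunk_size):
    """Count occurrences of `term` in consecutive chunks of `chunk_size` tokens."""
    return [
        tokens[i:i + chunk_size].count(term)
        for i in range(0, len(tokens), chunk_size)
    ]


def usage_differs(counts_a, counts_b, alpha=0.05):
    """Run a two-sided Mann-Whitney U test on two lists of per-chunk counts.

    Returns the p-value and whether it falls below the threshold `alpha`.
    """
    result = mannwhitneyu(counts_a, counts_b, alternative="two-sided")
    return result.pvalue, result.pvalue < alpha


# Made-up per-chunk counts of one term in two corpora (illustration only).
corpus_a_counts = [5, 6, 7, 8, 9, 10]
corpus_b_counts = [0, 1, 1, 2, 0, 3]

p_value, significant = usage_differs(corpus_a_counts, corpus_b_counts)
print(f"p = {p_value:.4f}, significant at 0.05: {significant}")
```

In practice the count lists would come from `per_chunk_counts` applied to the tokens of each corpus. The chunk size affects the result, so it is worth keeping it the same for both corpora.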
Term Frequency-Inverse Document Frequency, or TF-IDF, is used to determine how important a word is to a single document within a collection or corpus. It ranks the importance of a word based on how often it appears in that document, but the rank is offset by how often the word occurs in the whole collection.
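The weighting described above can be sketched in a few lines of plain Python. This is one common variant, assuming term frequency is the word's share of the document and inverse document frequency is the unsmoothed logarithm of (number of documents / documents containing the word). Libraries often use smoothed or normalized versions, so their scores will differ. The function name and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf (share of the document) times idf (rarity in the collection)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy collection of three tokenized documents (illustration only).
docs = [
    "the whale swam past the ship".split(),
    "the ship sailed at dawn".split(),
    "the crew slept".split(),
]
scores = tf_idf(docs)

# Terms of the first document, highest weight first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this sketch a word that appears in every document, such as "the", scores zero however often it is used, while a word confined to one document, such as "whale", ranks highest there.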