Comparison

This recipe is part of the Text Analysis for Twitter Research (TATR) series. It looks at categorizing text using the General Inquirer categories released by Harvard.

Multiple Correspondence Analysis (MCA) is a data analysis technique that can detect and represent the underlying structures of a dataset. In terms of textual analysis, we can identify and graph simultaneously occurring variables from the texts that comprise a corpus.

You can retrieve the MCA module at https://pypi.python.org/pypi/mca/1.0.3, or install it by running the following command in a terminal:

pip install --user mca

Multidimensional Scaling (MDS) is a method to convert sets of document terms into a data frame that can then be visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to another. This recipe deals with several advanced text analysis concepts and methods. Links to additional information on these terms are provided below.

Pandas DataFrame: https://pandas.pydata.org/

TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
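As a rough illustration of the TF-IDF weighting linked above, here is a minimal sketch in plain Python. It assumes documents arrive as non-empty lists of tokens and uses the simple unsmoothed form (term frequency times log of N over document frequency); libraries often apply smoothed variants, so their numbers may differ. The function name tf_idf and the sample sentences are illustrative only and do not come from the recipe.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample corpus, for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

A term that occurs in every document (here "the") gets a weight of zero, while a term unique to one document (here "mat") is weighted highest within it.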

Principal Component Analysis (PCA) is a method to convert sets of document terms into a data frame that can then be visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to another. PCA tries to identify a smaller number of uncorrelated variables, called "principal components", from the dataset. The goal is to explain the maximum amount of variance with the fewest principal components. This recipe deals with several advanced text analysis concepts and methods.
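The idea of explaining the most variance with the fewest components can be sketched with NumPy. This is a minimal example that assumes a small document-by-term count matrix; the function name pca and the sample counts are hypothetical and are not taken from the recipe, which may use a different library.

```python
import numpy as np

def pca(matrix, n_components=2):
    """Project rows of `matrix` onto their first principal components."""
    X = np.asarray(matrix, dtype=float)
    # Center each column (term) so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # Singular value decomposition; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]
    # Share of total variance carried by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]

# Hypothetical term counts: four documents (rows) by four terms (columns).
counts = [
    [4, 0, 1, 2],
    [3, 1, 0, 2],
    [0, 5, 4, 0],
    [1, 4, 5, 1],
]
coords, explained = pca(counts)
```

Each row of coords is a document's position in the reduced space, so documents with similar term profiles land close together, and explained shows how much of the variance each retained component accounts for.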

Word frequencies and counts are text analysis methods that return results about the words in a text or set of texts. Counts return the number of times a word is used in the text, whereas frequencies give a sense of how often a word is used in comparison to others in the text.
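The difference between counts and frequencies can be shown in a few lines of Python. This is a minimal sketch using only the standard library; the simple regular-expression tokenizer and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how many times each lowercased word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def relative_frequencies(counts):
    """Convert raw counts into each word's share of all words."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample text, for illustration only.
text = "The cat sat on the mat. The mat was flat."
counts = word_counts(text)
freqs = relative_frequencies(counts)
```

Here counts holds the raw tallies, and freqs expresses each tally as a proportion of all words, which makes texts of different lengths comparable.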

This recipe uses Python and the NLTK to explore repeating phrases (ngrams) in a text. An ngram is a sequence of consecutive words, where the 'n' stands for the number of words and the 'gram' stands for the words themselves; e.g. a 'trigram' is a three-word ngram. Ngrams that occur more than once reveal the repeating phrases in a text.
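To make the ngram idea concrete, here is a minimal sketch in plain Python. The recipe itself uses the NLTK; this version builds the ngrams by hand so that it runs without extra installs. The helper name ngrams and the sample sentence are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text, split on whitespace for simplicity.
tokens = "to be or not to be that is the question".split()

# Count each bigram, then keep those that occur more than once.
bigram_counts = Counter(ngrams(tokens, 2))
repeated = [gram for gram, count in bigram_counts.items() if count > 1]
```

Changing the second argument to 3 produces trigrams instead, and the same counting step then finds repeated three-word phrases.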

This recipe helps you explore how to analyze a text based on its parts of speech (e.g. nouns, adjectives, prepositions, etc.).

This recipe uses an Aggregate Text tool, frequency lists, Concordance and Collocation tools to explore how a writer’s use of language changes over a lifetime.

This recipe takes two works purported to come from the same author and uses tools such as distribution, Word Lists, etc. to suggest whether they may have been created by the same author.

This is a recipe for comparing two different texts and exploring where they are similar and where they diverge in terms of words used, word distribution, theme and so on.
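One simple way to start such a comparison is to look at which words the two texts share and which are unique to each. The sketch below is a minimal standard-library example, not the recipe's own tooling; the function names, the tokenizer and the two sample sentences are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def compare_vocabulary(text_a, text_b):
    """Return shared words, words unique to each text, and Jaccard similarity."""
    vocab_a = set(tokenize(text_a))
    vocab_b = set(tokenize(text_b))
    shared = vocab_a & vocab_b
    only_a = vocab_a - vocab_b
    only_b = vocab_b - vocab_a
    # Jaccard similarity: shared vocabulary divided by combined vocabulary.
    jaccard = len(shared) / len(vocab_a | vocab_b)
    return shared, only_a, only_b, jaccard

# Hypothetical sample texts, for illustration only.
text_a = "The whale swam in the sea"
text_b = "The ship sailed on the sea"
shared, only_a, only_b, jaccard = compare_vocabulary(text_a, text_b)
```

The shared set points to common ground, the two unique sets point to where the texts diverge, and the Jaccard score gives a single rough measure of vocabulary overlap between 0 and 1.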