Multiple Correspondence Analysis and Content Analysis

Introduction 

Multiple Correspondence Analysis (MCA) is a data analysis technique for detecting and representing the underlying structure of a dataset made up of categorical variables. Applied to textual analysis, it lets us identify and plot variables that occur together across the texts that make up a corpus.

You can download the mca package from https://pypi.python.org/pypi/mca/1.0.3, or install it by running the following command in a terminal (not inside the Python interpreter):

pip install --user mca
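
To confirm that the installation worked, a short check like the one below can be run from Python. It is only a sketch: the helper name find_missing is made up for this illustration, and the list contains the import names of the packages this recipe relies on.

```python
import importlib.util

# Import names of the packages this recipe relies on.
REQUIRED = ["nltk", "pandas", "numpy", "mca", "matplotlib"]


def find_missing(module_names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# find_missing is a hypothetical helper, not part of any of the packages above.
missing = find_missing(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```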

Ingredients 
  • Python 3
  • Natural Language Toolkit (NLTK)
  • pandas
  • NumPy
  • mca (Multiple Correspondence Analysis)
  • Matplotlib (Plotting)
  • A collection of electronic texts
  • Kynan Ly’s sample code available on TAPoR
Steps 
  • Open the Python notebook and import the NLTK, pandas, NumPy, mca, and Matplotlib libraries.
  • Gather the electronic text content.
    • Set the path to the folder that contains the novels.
    • Save the titles and contents of the novels in separate variables.
    • Append the file contents.
  • Define a cleaning and tokenization function.
    • Pass each text in the corpus through this function.
  • Run the cleaned corpus through General Inquirer Categories.
    • Load in the General Inquirer dictionary.
    • Save and append the rows of data.
  • Write helper functions for the General Inquirer.
    • Speed up loading time by creating a function to set row attributes.
    • Combine attributes that are already present.
    • Insert these into a dictionary list.
    • Create the columns for the dataframe.
    • Loop through the tokens.
  • Create a dictionary of words and their categories.
  • Set up MCA.
    • Create a dataframe.
    • Insert tokenized data.
  • Get a relative frequency for each category.
  • Run MCA on the dataframe.
  • Plot the points in a visualization.
    • Set shapes for the categories and authors.
    • Label points for the categories and authors.
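
The text-preparation steps above (cleaning and tokenizing, looking tokens up in a category dictionary, and computing a relative frequency per category) can be sketched with the standard library alone. This is a simplified illustration and not Kynan Ly's sample code: a regular expression stands in for NLTK tokenization, TOY_CATEGORIES is a tiny hypothetical stand-in for the General Inquirer dictionary with invented word-to-category assignments, and the two "texts" are made-up placeholders for files read from a folder.

```python
import re
from collections import Counter

# Hypothetical stand-in for the General Inquirer dictionary (word -> categories).
# The assignments are invented for illustration; the real dictionary is far
# larger and is loaded from a file.
TOY_CATEGORIES = {
    "love": ["Positiv", "Affil"],
    "happy": ["Positiv"],
    "war": ["Negativ", "Hostile"],
    "fear": ["Negativ"],
}


def clean_and_tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def category_counts(tokens, categories):
    """Count how many tokens fall into each dictionary category."""
    counts = Counter()
    for token in tokens:
        for category in categories.get(token, []):
            counts[category] += 1
    return counts


def relative_frequencies(counts, n_tokens):
    """Divide each category count by the number of tokens in the text."""
    if n_tokens == 0:
        return {}
    return {category: count / n_tokens for category, count in counts.items()}


# Made-up texts keyed by title (placeholders for the novels in the corpus).
corpus = {
    "Text A": "Love and war. Happy love!",
    "Text B": "Fear of war; fear, fear.",
}

# One row of category frequencies per text.
table = {}
for title, content in corpus.items():
    tokens = clean_and_tokenize(content)
    counts = category_counts(tokens, TOY_CATEGORIES)
    table[title] = relative_frequencies(counts, len(tokens))

for title, row in table.items():
    print(title, row)
```

A dictionary of this shape (one entry per text, one value per category) can be turned into a dataframe, for example with pandas.DataFrame.from_dict(table, orient="index").fillna(0), before it is handed to the MCA step.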
Discussion 

MCA coupled with Matplotlib is a visually interesting way to represent associations found in a corpus. It is only one of many ways to conduct MCA and plot the resulting data. Using the General Inquirer categories with this technique allows for an in-depth examination of the language used by a single author or by a set of authors.
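
To give a sense of what a correspondence analysis computes, here is a minimal NumPy sketch of simple correspondence analysis on a small text-by-category table. It is not the mca package's API and not the code used in this recipe: the function name correspondence_analysis and the counts are invented for illustration, and the sketch assumes every row and column of the table has a non-zero total.

```python
import numpy as np


def correspondence_analysis(table, n_axes=2):
    """Return row coordinates, column coordinates, and inertia per axis."""
    X = np.asarray(table, dtype=float)
    P = X / X.sum()                  # correspondence matrix (sums to 1)
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)        # values expected under independence
    # Standardized residuals from the independence model.
    S = (P - expected) / np.sqrt(expected)
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: singular vectors scaled by singular values and masses.
    rows = (U * sigma) / np.sqrt(r)[:, None]
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return rows[:, :n_axes], cols[:, :n_axes], (sigma ** 2)[:n_axes]


# Made-up counts: three texts (rows) by three categories (columns).
counts = [
    [30, 10, 5],
    [28, 12, 6],
    [5, 25, 20],
]

rows, cols, inertia = correspondence_analysis(counts)
print("Row (text) coordinates:\n", rows)
print("Column (category) coordinates:\n", cols)
print("Inertia per axis:", inertia)
```

Texts with similar category profiles end up close together on the resulting axes, which is what makes a scatter plot of the row and column coordinates readable as a map of the corpus.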

Next steps / further information 