Document Clustering by Topic using K-Means and MDS

Introduction 

Multidimensional Scaling (MDS) is a method for converting sets of document terms into a data frame that can then be visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to those of another. This recipe deals with several advanced text-analysis concepts and methods; links to additional information on these terms are provided below.
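As a minimal sketch of the core idea, the snippet below projects a toy, hand-made matrix of pairwise document dissimilarities into 2-D points with scikit-learn's MDS; the real recipe derives these distances from the texts themselves.

```python
# Minimal sketch: MDS turns pairwise document dissimilarities into 2-D
# coordinates whose spacing reflects those dissimilarities.
# The three "documents" and their distance values here are toy numbers.
import numpy as np
from sklearn.manifold import MDS

# Symmetric matrix of pairwise dissimilarities between three documents:
# documents 0 and 1 are close (0.2); document 2 is far from both.
distances = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
coords = mds.fit_transform(distances)  # one (x, y) point per document
print(coords.shape)  # (3, 2)
```

In the resulting layout, the two similar documents land near each other and the dissimilar one sits apart, which is exactly what the final cluster graph conveys.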

Pandas DataFrame: https://pandas.pydata.org/

TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Vectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

K-Means Clustering: https://en.wikipedia.org/wiki/K-means_clustering

Multidimensional Scaling: https://en.wikipedia.org/wiki/Multidimensional_scaling

Dendrogram: https://en.wikipedia.org/wiki/Dendrogram

Ingredients 
  • Python 3
  • Natural Language Toolkit (NLTK)
  • Pandas
  • NumPy
  • Scikit-learn (Machine Learning in Python)
  • SciPy (Open-source Software for Mathematics, Science, and Engineering)
  • Matplotlib (Plotting)
  • A collection of electronic texts
  • Sample code available on TAPoR
Steps 
  • Open the Python notebook and import the NLTK, Pandas, NumPy, scikit-learn, SciPy, and Matplotlib libraries.
  • Gather the electronic text content.
    • Set a path for the destination folder of the novels.
    • Save the titles and contents of the novels in separate variables.
    • Append the file contents.
  • Double-check that the number of titles matches the number of texts.
  • Prepare the text for cleaning.
    • Import a stopword list.
    • Import a stemmer tool.
  • Create cleaning functions to:
    • tokenize and stem the text.
    • remove the stopwords from the text.
    • remove proper nouns from the text.
    • remove contractions from the text.
  • Combine these functions together into one function.
  • Apply the cleaning function to all the texts in the corpus.
    • Create a loop that cleans each text separately.
    • Append the cleaned files in a variable for tokenized text and a variable for stemmed text.
  • Convert these lists into a pandas DataFrame.
    • Use the stemmed list as an index.
    • Use the tokenized words as columns.
  • Convert the raw text into TF-IDF features using TfidfVectorizer.
  • Place the terms from the TF-IDF matrix into a variable.
  • Calculate the similarities of the documents with cosine similarity.
  • Cluster the similarities with K-Means.
    • Declare the number of clusters.
    • Store the document clusters into a list.
  • Set up the results for plotting.
    • Create a dictionary to hold the book titles, contents, and cluster assignments.
    • Print the pandas DataFrame.
  • Examine the clusters and their contents (optional).
  • Use MDS to group the clusters.
    • Fit the model.
    • Save the results.
  • Set up the MDS cluster graph.
    • Assign colours to the document clusters.
    • Assign names to the document clusters.
  • Create a function to produce and visualize the MDS graph.
  • Display the MDS cluster graph.
  • Graph the results with a Dendrogram.
    • Create a matrix for the clusters.
    • Set graph size.
    • Set graph type and other labels.
  • Display the Dendrogram graph.
Discussion 

Using document clusters with MDS provides not only visually striking graphs but also a deeper understanding of how the texts in a corpus relate to one another. This is useful in projects comparing the works of a single author over time, or of a group of authors from the same genre or era.

Next steps / further information 
Status