Document Clustering by Topic using K-Means and PCA


Principal Component Analysis (PCA) is a method for reducing a set of document terms to a small number of dimensions that can then be visualized. The distances in the resulting plot show how similar, or dissimilar, the contents of one text are to another. PCA identifies a smaller number of uncorrelated variables, called "principal components," in the dataset; the goal is to explain the maximum amount of variance with the fewest principal components. This recipe deals with several advanced text analysis concepts and methods. Links are provided to additional information on these terms.
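As a minimal sketch of the idea, scikit-learn's PCA can reduce a handful of correlated term counts to two components; the five-term toy matrix below is invented for illustration and is not part of the recipe's corpus:

```python
import numpy as np
from sklearn.decomposition import PCA

# Four "documents" described by five term frequencies each.
# The first two and last two rows use overlapping vocabulary.
X = np.array([
    [3, 0, 1, 0, 2],
    [2, 0, 1, 1, 3],
    [0, 4, 0, 3, 0],
    [0, 3, 1, 4, 0],
])

# Reduce five correlated term dimensions to two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

print(coords.shape)                   # (4, 2): one 2-D point per document
print(pca.explained_variance_ratio_)  # share of variance each component explains
```

Plotting the two columns of `coords` against each other places documents with similar term profiles near one another, which is exactly how the cluster graph at the end of this recipe is built.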

pandas DataFrame:



K-Means Clustering:

Principal Component Analysis:


  • Python 3
  • Natural Language Toolkit (NLTK)
  • pandas
  • NumPy
  • Scikit-learn (Machine Learning in Python)
  • SciPy (Open-source Software for Mathematics, Science, and Engineering)
  • Matplotlib (Plotting)
  • A collection of electronic texts
  • Sample code available on TAPoR
  • Open the Python notebook and import the NLTK, pandas, NumPy, scikit-learn, SciPy, and Matplotlib libraries.
  • Gather the electronic text content.
    • Set a path for the destination folder of the novels.
    • Save the titles and contents of the novels in separate variables.
    • Append the file contents.
  • Double-check that the number of titles matches the number of texts.
  • Prepare the text for cleaning.
    • Import a stopword list.
    • Import a stemmer tool.
  • Create cleaning functions to:
    • tokenize the text.
    • remove the stopwords from the text.
    • remove proper nouns from the text.
    • remove contractions from the text.
  • Combine these functions together into one function.
  • Apply the cleaning function to all the texts in the corpus.
    • Create a loop that cleans each text separately.
    • Append the cleaned files in a variable for tokenized text and a variable for stemmed text.
  • Convert these lists into a pandas DataFrame.
    • Use the stemmed list as an index.
    • Use the tokenized words as columns.
  • Convert the raw text into TF-IDF features using TfidfVectorizer.
  • Place the terms from the TF-IDF matrix into a variable.
  • Calculate the similarities of the documents with Cosine Similarity.
  • Cluster the similarities with K-Means.
    • Declare number of clusters.
    • Store the document clusters into a list.
  • Set up the results for plotting.
    • Create a dictionary to hold the book title, content, and clusters.
    • Print the pandas DataFrame.
  • Examine the clusters and their contents (optional).
  • Use PCA to group the clusters.
    • Fit the model.
    • Save the results.
  • Set up the PCA cluster graph.
    • Assign colours to the document clusters.
    • Assign names to the document clusters.
  • Create a function to produce and visualize the PCA graph.
  • Display the PCA cluster graph.
  • Graph the results with a Dendrogram.
    • Create a matrix for the clusters.
    • Set graph size.
    • Set graph type and other labels.
  • Display the Dendrogram graph.

Using document clusters with PCA provides not only visually striking graphs but also a deeper understanding of how the texts in a corpus are related to one another. This is useful in projects comparing the works of a single author over time, or of a group of authors from the same genre or era.

Next steps / further information