Word Frequency and Count of a Set of Documents


Word frequencies and counts are text analysis methods that report how words are used in a text or set of texts. A count is the number of times a word appears in the text, whereas a frequency expresses how often a word appears relative to the other words in the text.
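
As a small illustration of the difference between a count and a frequency, the sketch below uses a made-up sentence and Python's standard `collections.Counter`. The sample text and variable names are assumptions for illustration only, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

# A made-up sentence standing in for a real text.
text = "the cat sat on the mat"
tokens = text.split()  # naive whitespace tokenization, for illustration only

# Count: how many times each word appears.
counts = Counter(tokens)

# Frequency: each word's count relative to the total number of words.
total = sum(counts.values())
frequencies = {word: n / total for word, n in counts.items()}

print(counts["the"])        # count of "the"
print(frequencies["the"])   # share of all words that are "the"
```

Here "the" appears twice among six words, so its count is 2 and its frequency is one third.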

  • Open the Python notebook and import the NLTK, pandas, os, and re libraries.
  • Import the novels and their contents.
    • Set the path to the folder that contains the novels.
    • Save the titles and contents of the novels in separate variables.
    • Append the file contents.
  • Double-check that the number of titles matches the number of content entries.
  • Create a cleaning function.
    • Tokenize the text with the cleaning function.
  • Get the word frequency.
    • Create a Pandas data frame for each novel.
    • Combine the data frames.
    • Display the Pandas data frame.
  • Turn word frequencies into a percentage.
    • Get the total word count for each row of words.
    • Create a new Pandas data frame.
    • Calculate the percent for each row in the Pandas data frame.
    • Display the results.
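
The steps above could be sketched roughly as follows. This is a minimal sketch under several assumptions, not the notebook's actual code: the function names (`load_texts`, `clean_and_tokenize`, `word_counts`, `to_percent`) are hypothetical, a simple `re` pattern is used in place of an NLTK tokenizer so the example runs without downloading NLTK data, the two sample files are made up, and "percent" is taken to mean each word's share of all words in the corpus.

```python
import os
import re
import tempfile

import pandas as pd


def load_texts(folder):
    """Return parallel lists of titles and contents for every .txt file in folder."""
    titles, contents = [], []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as handle:
                titles.append(os.path.splitext(name)[0])
                contents.append(handle.read())
    return titles, contents


def clean_and_tokenize(text):
    """Lowercase the text and keep only runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(titles, contents):
    """Build one count column per text and combine them into one data frame."""
    columns = [
        pd.Series(clean_and_tokenize(text), dtype="object").value_counts().rename(title)
        for title, text in zip(titles, contents)
    ]
    # Words missing from a text become 0 rather than NaN.
    return pd.concat(columns, axis=1).fillna(0).astype(int)


def to_percent(counts):
    """Add a per-word total and its share (in percent) of all words in the corpus."""
    result = counts.copy()
    result["total"] = result.sum(axis=1)
    result["percent"] = result["total"] / result["total"].sum() * 100
    return result.sort_values("total", ascending=False)


# Demo on two tiny made-up files standing in for the novels.
with tempfile.TemporaryDirectory() as folder:
    samples = {"novel_a": "The cat sat. The cat slept.", "novel_b": "The dog sat."}
    for title, text in samples.items():
        with open(os.path.join(folder, title + ".txt"), "w", encoding="utf-8") as handle:
            handle.write(text)
    titles, contents = load_texts(folder)

# Double-check that every title has a matching content string.
assert len(titles) == len(contents)

counts = word_counts(titles, contents)
percents = to_percent(counts)
print(percents)
```

With real data, `load_texts` would be pointed at the folder that holds the novels instead of a temporary directory. NLTK's tokenizers could replace `clean_and_tokenize` if finer control over tokenization is needed.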

Word frequencies and counts are a good starting point for a more in-depth text analysis of a set of documents. Knowing how many times certain words are used, and how their usage compares with that of other words, can suggest hypotheses and point to deeper patterns in the corpus.

Next steps / further information