Word Frequency and Count of a Set of Documents
Introduction
Word frequencies and counts are text analysis methods that report on the words in a text or set of texts. A count gives the number of times a word is used in the text, whereas a frequency expresses how often a word is used relative to the other words in the text.
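To make the distinction concrete, here is a minimal sketch using only the Python standard library. The sample sentence is invented for illustration, and splitting on whitespace stands in for proper tokenization.

```python
from collections import Counter

# Invented sample text; a real analysis would read this from a file.
text = "the cat and the dog and the bird"
tokens = text.split()

counts = Counter(tokens)        # raw count of each word
total = sum(counts.values())    # total number of tokens (8 here)

# Relative frequency: a word's count as a share of all tokens.
frequencies = {word: count / total for word, count in counts.items()}

print(counts["the"])        # 3
print(frequencies["the"])   # 0.375
```

The count says "the" appears three times; the frequency says it makes up 37.5% of the words, which is the figure that stays comparable across texts of different lengths.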
Ingredients
- Python 3
- The Natural Language Toolkit (NLTK) and pandas libraries, plus the os and re modules from the Python standard library
- A collection of electronic texts
- Sample code available on TAPoR.
Steps
- Open a Python notebook and import the NLTK, pandas, os, and re libraries.
- Import the novels and their contents.
- Set the path of the destination folder of the novels.
- Save the titles and contents of the novels in separate variables.
- Append the file contents.
- Double check that the number of titles matches the number of content entries.
- Create a cleaning function.
- Tokenize the text with the cleaning function.
- Get the word frequency.
- Create a Pandas data frame for each novel.
- Combine the data frames.
- Display the Pandas data frame.
- Turn word frequencies into a percentage.
- Get the total count for each row (that is, for each word).
- Create a new Pandas data frame.
- Calculate the percent for each row in the Pandas data frame.
- Display the results.
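The steps above can be sketched roughly as follows. This is not the sample code from TAPoR but one possible reading of the recipe, under several assumptions: the novels are UTF-8 `.txt` files in a single folder, the function names (`load_novels`, `clean`, `word_counts`, `add_percentages`) are hypothetical, NLTK's `RegexpTokenizer` is used so that no extra NLTK data download is needed, and a word's percentage is taken to be its total count divided by the total of all counts in the corpus.

```python
import os
import re

import pandas as pd
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

# Assumption: words are runs of ASCII letters, optionally with one
# internal apostrophe (e.g. "don't"). Accented letters are dropped.
_tokenizer = RegexpTokenizer(r"[a-z]+(?:'[a-z]+)?")


def load_novels(folder):
    """Return parallel lists of titles and contents for each .txt file."""
    titles, contents = [], []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            titles.append(os.path.splitext(name)[0])
            contents.append(handle.read())
    # Double check that every title has a matching content entry.
    assert len(titles) == len(contents)
    return titles, contents


def clean(text):
    """Lowercase the text and blank out everything except letters,
    apostrophes, and whitespace."""
    return re.sub(r"[^a-z'\s]", " ", text.lower())


def word_counts(titles, contents):
    """Build one data frame: rows are words, columns are novels,
    cells are raw counts (0 where a word does not occur)."""
    columns = []
    for title, text in zip(titles, contents):
        tokens = _tokenizer.tokenize(clean(text))
        freq = FreqDist(tokens)  # word -> count for this novel
        columns.append(pd.Series(dict(freq), name=title))
    # Combine the per-novel columns; missing words become 0.
    return pd.concat(columns, axis=1).fillna(0).astype(int)


def add_percentages(counts):
    """Return a new data frame with a per-word total and that total
    as a percentage of all word occurrences in the corpus."""
    table = counts.copy()
    table["total"] = counts.sum(axis=1)
    table["percent"] = table["total"] / table["total"].sum() * 100
    return table.sort_values("total", ascending=False)
```

With a hypothetical folder named `novels`, the pipeline would be run as `titles, contents = load_novels("novels")`, then `table = add_percentages(word_counts(titles, contents))`, and finally `table.head(20)` to display the most frequent words in a notebook.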
Discussion
Word frequencies and counts are a good starting point for a more in-depth text analysis of a set of documents. Getting a sense of the number of times certain words are used, and of how their usage relates to one another, can fuel speculation and give insight into deeper correlations in the corpus.
Status
Submitted by Jason on Sun, 02/11/2018 - 12:46