Word Frequency and Count of a Set of Documents
Word frequencies and counts are text analysis methods that report on the words in a text or set of texts. A count is the number of times a word is used in the text, whereas a frequency gives a sense of how often a word is used relative to the other words in the text.
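To make the distinction concrete, here is a minimal sketch using only the Python standard library. The function name `count_and_frequency` and the sample sentence are illustrative assumptions, not part of the recipe's sample code, and frequency is taken here to mean a word's count divided by the total number of words.

```python
from collections import Counter


def count_and_frequency(tokens):
    """Return (counts, relative frequencies) for a list of word tokens.

    Hypothetical helper for illustration only.
    """
    counts = Counter(tokens)  # raw number of occurrences per word
    total = sum(counts.values())  # total number of tokens
    # Relative frequency: each word's share of all tokens (empty input gives {}).
    freqs = {word: n / total for word, n in counts.items()} if total else {}
    return counts, freqs


tokens = "the cat sat on the mat".split()
counts, freqs = count_and_frequency(tokens)
print(counts["the"], round(freqs["the"], 3))
```

In this sample, "the" has a count of 2 and a relative frequency of 2 out of 6 words, or about one third.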
- Python 3
- The Natural Language Toolkit (NLTK) and Pandas libraries, plus Python's built-in os and re modules
- A collection of electronic texts
- Sample code available on TAPoR.
- Open the Python notebook and import the NLTK, Pandas, os, and re libraries.
- Import the novels and their contents.
- Set the path to the folder that contains the novels.
- Save the titles and contents of the novels in separate variables.
- Append the file contents.
- Double check that the number of titles matches the number of file contents.
- Create a cleaning function.
- Tokenize the text with the cleaning function.
- Get the word frequency.
- Create a Pandas data frame for each novel.
- Combine the data frames.
- Display the Pandas data frame.
- Turn word frequencies into a percentage.
- Get the total count of each word across all novels (the sum of each row).
- Create a new Pandas data frame.
- Calculate the percent for each row in the Pandas data frame.
- Display the results.
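The steps above can be sketched roughly as follows. This is not the sample code from TAPoR: the function names, the ".txt" filter, the regular-expression tokenizer, and the "total" and "percent" column names are all assumptions made for illustration. `collections.Counter` stands in for NLTK's `FreqDist` (which is a `Counter` subclass) so that the sketch runs without downloading NLTK data, and "percent" is interpreted as each word's share of all words in the corpus.

```python
import os
import re
from collections import Counter

import pandas as pd


def load_texts(folder):
    """Read every .txt file in `folder`; return parallel lists of titles and contents."""
    titles, contents = [], []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):  # assumption: the novels are plain-text files
            with open(os.path.join(folder, name), encoding="utf-8") as handle:
                titles.append(os.path.splitext(name)[0])
                contents.append(handle.read())
    # Double check that the number of titles matches the number of contents.
    assert len(titles) == len(contents)
    return titles, contents


def clean_and_tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_count_table(titles, contents):
    """Build a table with one row per word and one column of counts per novel."""
    columns = [
        pd.Series(Counter(clean_and_tokenize(text)), name=title, dtype="float64")
        for title, text in zip(titles, contents)
    ]
    # Words missing from a novel come out as NaN after combining; treat them as 0.
    return pd.concat(columns, axis=1).fillna(0).astype(int)


def add_percentages(table):
    """Add a per-word total and its percentage of all words in the corpus."""
    result = table.copy()
    result["total"] = table.sum(axis=1)
    result["percent"] = result["total"] / result["total"].sum() * 100
    return result.sort_values("total", ascending=False)


# Usage with two tiny in-memory "novels" (hypothetical data); with real files,
# call load_texts("path/to/novels") to obtain titles and contents instead.
titles = ["novel_a", "novel_b"]
contents = ["The cat sat. The cat ran.", "A dog sat."]
table = add_percentages(word_count_table(titles, contents))
print(table)
```

The simple regular expression drops digits, punctuation, and accented characters. For real corpora, an NLTK tokenizer and a stop-word list would likely give cleaner results.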
Word frequencies and counts are a good starting point for a more in-depth text analysis of a set of documents. Getting a sense of the number of times certain words are used and how they relate to one another can fuel speculation and give insight into deeper correlations in the corpus.
Next steps / further information
- Recipe on Tokenization
Submitted by Jason on Sun, 02/11/2018 - 12:46