Word Frequency and Count of a Set of Documents

Introduction 

Word frequencies and counts are text analysis methods that report on the words in a text or set of texts. A count is the number of times a word is used in the text, whereas a frequency expresses how often a word is used relative to the other words in the text.
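
To make the distinction concrete, here is a minimal sketch using only the Python standard library. The sample sentence is made up for illustration, and splitting on whitespace is a simplifying assumption rather than the tokenization this recipe uses.

```python
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "the cat sat on the mat and the dog sat too"
words = text.split()

# Count: how many times each word occurs.
counts = Counter(words)

# Frequency: each word's count divided by the total number of words.
total = sum(counts.values())
frequencies = {word: count / total for word, count in counts.items()}

print(counts.most_common(2))
print(round(frequencies["the"], 3))
```

Here "the" has a count of 3, and its frequency is 3 divided by the 11 words in the sentence.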

Ingredients 
Steps 
  • Open a Python notebook and import the NLTK, pandas, os, and re libraries.
  • Import the novels and their contents.
    • Set the path of the destination folder of the novels.
    • Save the titles and contents of the novels in separate variables.
    • Append the file contents.
  • Double-check that the number of titles matches the number of file contents.
  • Create a cleaning function.
    • Tokenize the text with the cleaning function.
  • Get the word frequency.
    • Create a Pandas data frame for each novel.
    • Combine the data frames.
    • Display the Pandas data frame.
  • Turn word frequencies into a percentage.
    • Get the total word count for each row of words.
    • Create a new Pandas data frame.
    • Calculate the percent for each row in the Pandas data frame.
    • Display the results.
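
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function names, the sample files, and the `.txt` filter are hypothetical; a simple regular expression stands in for NLTK tokenization so the example runs without downloading tokenizer data; and "percent" is read here as each word's total count divided by the total number of words in the corpus.

```python
import os
import re
import tempfile

import pandas as pd


def load_novels(folder):
    """Return (titles, contents) for every .txt file in folder."""
    titles, contents = [], []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as handle:
                titles.append(os.path.splitext(name)[0])
                contents.append(handle.read())
    return titles, contents


def clean_and_tokenize(text):
    """Lowercase the text and return its alphabetic tokens.

    Assumption: a regex replaces the NLTK tokenizer; it splits
    contractions such as "don't" into two tokens.
    """
    return re.findall(r"[a-z]+", text.lower())


def word_counts(titles, contents):
    """Build one data frame per novel, then combine them by word."""
    frames = []
    for title, content in zip(titles, contents):
        counts = pd.Series(clean_and_tokenize(content)).value_counts()
        frames.append(counts.rename(title).to_frame())
    # Words missing from a novel get a count of 0.
    return pd.concat(frames, axis=1).fillna(0).astype(int)


def add_percentages(counts):
    """Add a per-word total and its share of all words (assumed meaning)."""
    result = counts.copy()
    result["total"] = counts.sum(axis=1)
    result["percent"] = result["total"] / result["total"].sum() * 100
    return result.sort_values("total", ascending=False)


# Demo on two tiny made-up "novels" written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "novel_a.txt": "The whale! The sea.",
        "novel_b.txt": "The sea, the ship.",
    }
    for name, body in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    titles, contents = load_novels(folder)

# Double-check that every title has matching content.
assert len(titles) == len(contents)

table = add_percentages(word_counts(titles, contents))
print(table)
```

With a real corpus, the temporary folder would be replaced by the path to the folder that holds the novels, and `clean_and_tokenize` could be swapped for an NLTK-based tokenizer.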
Discussion 

Word frequencies and counts are a good starting point for a more in-depth text analysis of a set of documents. Knowing how many times certain words are used, and how those numbers compare with one another, can suggest hypotheses and point to deeper patterns in the corpus.

Next steps / further information 
Status