Analyze a Text's Parts of Speech with Python


This recipe helps you explore how to analyze a text based on its parts of speech (nouns, adjectives, prepositions, and so on).

  • Import the text to analyze into a string, and tokenize it
  • Tag the parts of speech in the text
    • Import the NLTK library and run the nltk.pos_tag() function with the tokens
    • Review the results to look for incorrectly tagged words
  • Plot the tagged elements
    • Plot the frequencies and distributions of individual parts of speech
    • Plot the overall frequency of the parts of speech
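Tag frequencies can be counted with NLTK's FreqDist; a small sketch, where the `tagged` list is a made-up stand-in for your own tagged tokens:

```python
# Count how often each tag appears, using nltk.FreqDist.
# `tagged` stands in for the (token, tag) list produced by nltk.pos_tag.
import nltk

tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

tag_freqs = nltk.FreqDist(tag for word, tag in tagged)
print(tag_freqs.most_common())
# tag_freqs.plot() draws the overall frequency chart (needs matplotlib);
# a FreqDist over the words of a single tag shows that tag's distribution.
```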
  • Combine related tag categories, e.g. all punctuation categories
  • Normalize the tag data to reduce subcategorization
    • Most tags are two or three characters, with the characters after the second marking a subcategory (e.g. NNS is a plural noun, NN a singular one)
    • Normalize the tags by truncating them to their first two characters
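The combining and normalizing steps might look like this (a sketch over a made-up tagged list; `PUNC` is an invented label for the merged punctuation categories):

```python
# Collapse subcategories: fold the assorted punctuation tags into one
# made-up 'PUNC' label, and truncate word tags to two characters.
tagged = [("Dogs", "NNS"), ("ran", "VBD"), (",", ","),
          ("quickly", "RB"), ("home", "NN"), (".", ".")]

def normalize(tag):
    if not tag[0].isalpha():   # punctuation tags such as '.', ',', ':'
        return "PUNC"
    return tag[:2]             # NNS -> NN, VBD -> VB, ...

normalized = [(word, normalize(tag)) for word, tag in tagged]
print(normalized)
```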
  • Use a lexical dispersion plot of the normalized categories to see what to explore next
    • Find an abnormality in the plot, and print out the tokens around it
    • Find a keyword from the tokens, and use it to print that segment of the text’s string
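Once the dispersion plot (drawn with `nltk.Text(...).dispersion_plot`, which needs matplotlib) flags an odd region, its surroundings can be inspected roughly like this; the sample text, position, and keyword here are all invented:

```python
# Inspect the region around a hypothetical anomaly. The dispersion plot
# itself would be drawn with nltk.Text(tokens).dispersion_plot([...]);
# here we just slice around an invented position.
text = "in the beginning the whale surfaced near the ship"
tokens = text.split()   # stand-in for nltk.word_tokenize(text)

position = 4            # hypothetical index flagged by the plot
print(tokens[position - 2:position + 3])   # tokens around the anomaly

# Use a keyword from that window to jump to the segment in the raw string.
start = text.index("whale")
print(text[start:start + 25])
```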
  • Find the top frequency terms for each part of speech
    • Normalize the original tagged tokens list so that it uses two-character part-of-speech tags
    • Refine the list to only include parts of speech categories that have at least two different words in the category
    • Plot the top frequency terms for each part-of-speech tag
    • Import matplotlib and create one chart for each part of speech’s top words
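A sketch of finding the top terms per normalized tag; the `normalized` list is a made-up stand-in, and the matplotlib charting lines are left commented since they need a display or a saved figure:

```python
# Find the top terms per normalized tag, keeping only categories with at
# least two distinct words. `normalized` stands in for your (word, tag) list.
from collections import Counter

normalized = [("cat", "NN"), ("dog", "NN"), ("cat", "NN"),
              ("ran", "VB"), ("sat", "VB"), ("red", "JJ")]

by_tag = {}
for word, tag in normalized:
    by_tag.setdefault(tag, Counter())[word] += 1

# Drop categories with fewer than two distinct words.
by_tag = {tag: c for tag, c in by_tag.items() if len(c) >= 2}

for tag, counts in by_tag.items():
    print(tag, counts.most_common(5))
    # One bar chart per tag (uncomment; requires matplotlib):
    # import matplotlib.pyplot as plt
    # words, freqs = zip(*counts.most_common(5))
    # plt.figure(); plt.bar(words, freqs); plt.title(tag); plt.show()
```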
  • Try comparing the parts of speech in this text with another
    • Create a function that takes in a string and a part-of-speech tag, and counts the occurrences of that part of speech
    • In the function, tokenize and tag the string, then create a list of each token whose tag matches the one being searched for
    • In the function, return the ratio of matched tokens to total tokens
    • Use the function with two different texts to compare their reliance on the same part of speech
Next steps / further information 
  • This recipe is based on a section of The Art of Literary Text Analysis
  • Instead of using a normalized 2-letter code for the verb tags, keep the original tags and try to construct an argument about the different types of verbs in the text.
  • Which part of speech has the greatest percentage difference between the two texts you compared? Try using a loop over the parts of speech.
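The suggested loop might be sketched like this, assuming both texts have already been tagged into (word, tag) lists; the two sample lists below are invented:

```python
# Loop over every two-character tag seen in either text and report which
# part of speech differs most between them. The tagged lists are made up.
from collections import Counter

def tag_ratios(tagged):
    """Map each two-character tag to its share of the text's tokens."""
    counts = Counter(tag[:2] for word, tag in tagged)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

a = tag_ratios([("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
                ("on", "IN"), ("mat", "NN")])
b = tag_ratios([("run", "VB"), ("run", "VB"), ("fast", "RB"), ("go", "VB")])

diffs = {tag: abs(a.get(tag, 0.0) - b.get(tag, 0.0))
         for tag in set(a) | set(b)}
widest = max(diffs, key=diffs.get)
print(widest, diffs[widest])
```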