Analyzing Repeating Phrases (ngrams) with Python


This recipe uses Python and NLTK to explore repeating phrases (ngrams) in a text. An ngram is a sequence of n consecutive words, where the 'n' stands for 'number' and the 'gram' stands for the words; e.g. a 'trigram' is a three-word ngram. An ngram that occurs more than once in a text is a repeating phrase.
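
To make the definition concrete, here is a minimal sketch of how ngrams are formed by sliding a window of n words across a token list. The helper name `make_ngrams` and the sample sentence are illustrative assumptions, not part of the recipe; in the recipe itself, NLTK's `nltk.ngrams()` plays this role.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # Zip n staggered copies of the list; zip stops at the shortest copy,
    # so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the cat sat on the mat".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:2])  # [('the', 'cat'), ('cat', 'sat')]
print(trigrams[0])  # ('the', 'cat', 'sat')
```

Note that consecutive bigrams overlap: the second word of one tuple is the first word of the next.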

  • Import the text and tokenize it on import
  • Use the NLTK collocations() method (called on an nltk.Text object) to search for the top bigrams (two-word phrases)
  • Now, to get more control, use the ngrams() function
    • Use it to turn the tokenized list into a list of bigram tuples, where the second word in one tuple is the first word in the following one
  • Use the ngrams() function to create a list of four-grams
    • Use nltk.FreqDist() on the list to determine the most commonly occurring four-grams
    • Note how some of the results are overlapping phrases in the text
  • Try to plot the location of the most common phrase
    • It isn’t possible to plot the phrase itself since the tokens are single words
    • Create a new text out of the four-grams, and search for the tuple
  • Search for the longest repeating phrases
    • Create a list that keeps track of the last set of repeating phrases
    • Start a for loop that has a range from 2 to the length of the whole text
    • With each iteration of the for loop:
      • Get ngrams of the whole text at the current length
      • Calculate the frequencies of the ngrams
      • If no ngram occurs more than once, stop; otherwise store the repeating ngrams and continue
  • Segment the text to analyze term distribution across sections
    • Import the NumPy library (useful for working with lists/arrays)
    • Split the tokenized text into 10 equal segments
    • Run a for loop to count how many times a given phrase appears in each segment
    • Import matplotlib and plot the results as a bar graph
    • Try plotting a second phrase alongside the first
  • Plot correlations between phrase frequencies
    • Use the NumPy corrcoef() function to compute the correlation
    • Make a list of the most frequent phrases
    • Build a dictionary of counts for each top phrase
    • Build a dictionary of correlation values for a phrase
    • Plot the results
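
The counting steps above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the recipe's own code: `make_ngrams` stands in for `nltk.ngrams()`, `Counter` for `nltk.FreqDist()`, `split_segments` for a NumPy array split, and `pearson` for the value `numpy.corrcoef()` would report. All helper names and the toy text are hypothetical, and plotting is left out.

```python
from collections import Counter
from math import sqrt

def make_ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples (stand-in for nltk.ngrams)."""
    return list(zip(*(tokens[i:] for i in range(n))))

def longest_repeats(tokens):
    """Grow n until no ngram occurs twice; return the last repeating set."""
    last = []
    for n in range(2, len(tokens)):
        counts = Counter(make_ngrams(tokens, n))
        repeats = [gram for gram, c in counts.items() if c > 1]
        if not repeats:
            break  # nothing repeats at this length, so nothing longer can
        last = repeats
    return last

def split_segments(tokens, parts):
    """Split into `parts` nearly equal slices; earlier slices absorb the remainder."""
    base, extra = divmod(len(tokens), parts)
    segments, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments

def segment_counts(tokens, phrase, parts=10):
    """Occurrences of a phrase (tuple of words) within each segment.
    Occurrences that straddle a segment boundary are not counted."""
    n = len(phrase)
    return [make_ngrams(seg, n).count(tuple(phrase))
            for seg in split_segments(tokens, parts)]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 (an assumption) when a series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy text (20 tokens), split into 4 segments rather than 10 to keep it readable.
text = "to be or not to be that is the question to be or not to be is what he said"
tokens = text.split()

print(longest_repeats(tokens))  # [('to', 'be', 'or', 'not', 'to', 'be')]

# Most frequent bigrams, their per-segment counts, and correlation with the top one.
top = [gram for gram, _ in Counter(make_ngrams(tokens, 2)).most_common(3)]
counts_by_phrase = {g: segment_counts(tokens, g, parts=4) for g in top}
target = top[0]
correlations = {g: pearson(counts_by_phrase[target], counts_by_phrase[g]) for g in top}
print(counts_by_phrase[target])  # [1, 0, 1, 0]
```

In the toy text, "to be" occurs four times overall but only twice in the per-segment counts, because two occurrences fall across a segment boundary. With real texts and longer segments this effect is small, but it is worth keeping in mind when reading the bar graph.
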
Next steps / further information 
  • This recipe is based on a section of The Art of Literary Text Analysis
  • Create a stopword list for the initial import of tokenized words
  • Compare the ngram results with and without the use of a stopword list
  • Using the stopword-filtered list of word tokens, determine the correlations between the top ngram and the top 20 ngrams (for 10 text segments)
  • Repeat this, but with 100 text segments – does the ranking change?
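
As a starting point for the stopword comparison, the sketch below filters a token list against a stopword set and compares the top bigrams before and after. The stopword set, the sample sentence, and the `top_ngrams` helper are all illustrative assumptions; a real run would more likely use a fuller list, such as the one in NLTK's stopwords corpus.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, far from complete).
STOPWORDS = {"the", "of", "and", "a", "in", "to"}

def top_ngrams(tokens, n, k):
    """Return the k most frequent ngrams of length n (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return [gram for gram, _ in Counter(grams).most_common(k)]

text = ("the art of text analysis and the art of text mining "
        "in the art of text analysis")
tokens = text.lower().split()

# Drop stopwords before building ngrams.
filtered = [t for t in tokens if t not in STOPWORDS]

with_stops = top_ngrams(tokens, 2, 3)
without_stops = top_ngrams(filtered, 2, 3)

print(with_stops)        # every top bigram contains a stopword
print(without_stops[0])  # ('art', 'text')
```

One design consequence to watch for: after filtering, words that were separated by a stopword become adjacent, so "art text" appears as a bigram even though the original text reads "art of text".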