Comparing Frequencies with the Mann-Whitney U Test in Python


This recipe compares word usage across two corpora to see whether they differ. For a given word, the Mann-Whitney U test indicates whether the difference between its frequencies in the two corpora is statistically significant, which helps determine how distinctive a term is to one corpus.

  • Python 3
  • Libraries: os, string, pandas, csv
  • The mannwhitneyu function, imported from scipy.stats
  • Texts or documents saved as .txt files
  • Calculate term frequency (TF)
    • Write the term frequencies of each corpus to its own CSV file
    • Save every word that appears in either corpus to a grand word list CSV file
  • Run the Mann-Whitney U test
    • Read both TF CSV files
    • Ensure both CSV files have the same number of rows; if a term appears in one corpus but not the other, insert 0 for that word
    • Read the grand word list CSV file
    • For every word in the grand word list, run the Mann-Whitney U test on that word's values from the two corpora
  • Save the results to a CSV file
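
The steps above can be sketched in Python as follows. This is a minimal sketch rather than the recipe's own code. It assumes that each corpus is a set of documents, that term frequencies are relative frequencies computed per document, and that the test compares a word's per-document frequencies in one corpus against those in the other. The function names term_frequencies and compare_corpora, the sample texts, and the output filenames are hypothetical, and the texts are held in memory here instead of being read from .txt files.

```python
import string
from collections import Counter

import pandas as pd
from scipy.stats import mannwhitneyu


def term_frequencies(docs):
    """Build a words-by-documents table of relative term frequencies.

    `docs` maps a document name to its text (an assumed input shape).
    """
    strip_punct = str.maketrans("", "", string.punctuation)
    columns = {}
    for name, text in docs.items():
        tokens = text.lower().translate(strip_punct).split()
        counts = Counter(tokens)
        total = sum(counts.values())
        columns[name] = {word: n / total for word, n in counts.items()}
    # A word absent from a document shows up as NaN; record it as 0.
    return pd.DataFrame(columns).fillna(0.0)


def compare_corpora(tf_a, tf_b):
    """Run a two-sided Mann-Whitney U test for every word in either corpus."""
    # Grand word list: the union of both vocabularies.
    vocab = sorted(set(tf_a.index) | set(tf_b.index))
    # Give both tables the same rows, inserting 0 where a word is missing.
    tf_a = tf_a.reindex(vocab, fill_value=0.0)
    tf_b = tf_b.reindex(vocab, fill_value=0.0)
    rows = []
    for word in vocab:
        u_stat, p_value = mannwhitneyu(
            tf_a.loc[word], tf_b.loc[word], alternative="two-sided"
        )
        rows.append({"word": word, "U": u_stat, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Tiny made-up corpora, only to show the shape of the data.
corpus_a = {
    "a1": "The whale swims in the sea.",
    "a2": "A whale and the sea.",
    "a3": "The whale dives deep.",
}
corpus_b = {
    "b1": "The garden grows in the sun.",
    "b2": "A garden and the sun.",
    "b3": "The garden blooms.",
}

tf_a = term_frequencies(corpus_a)
tf_b = term_frequencies(corpus_b)
results = compare_corpora(tf_a, tf_b)
print(results.head())

# To keep the outputs, write each table to CSV (hypothetical filenames):
# tf_a.to_csv("corpus_a_tf.csv")
# tf_b.to_csv("corpus_b_tf.csv")
# results.to_csv("mann_whitney_results.csv", index=False)
```

Words with the smallest p-values are those whose frequencies differ most consistently between the two corpora. With so few documents per corpus the test has little power, so treat this sample as a check that the pipeline runs, not as a finding.
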
Next steps / further information 

Explore sample code available on TAPoR.

