Comparing Frequencies with the Mann-Whitney U Test in Python
Compare word usage across two corpora to see whether a word is used differently in each. For a given word, the Mann-Whitney U test indicates whether the difference in its frequency between the two corpora is statistically significant, which helps identify terms that are distinctive of one corpus.
- Python 3
- Libraries: os, string, pandas, csv
- The mannwhitneyu function, imported with: from scipy.stats import mannwhitneyu
- Text/documents in .txt files
- Calculate term frequency (TF)
- Write the term frequencies of each corpus to its own CSV file
- Save all words from the two corpora to a grand word list CSV file
- Run the Mann-Whitney U test
- Read both TF csv files
- Ensure both CSV files contain the same rows: if a term appears in one corpus but not the other, insert 0 for that term in the corpus where it is missing.
- Read the grand word list csv file
- For every word in the grand word list, run the Mann-Whitney U test on the two corpora's frequency values for that word
- Save the results to a CSV file
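The steps above can be sketched in a few lines. This is a minimal illustration, not the TAPoR sample code, and it rests on some assumptions: each corpus is made up of several documents, term frequency is computed per document as a relative frequency (so every word has several values per corpus for the test to compare), and the helper names term_frequencies, tf_table and compare_corpora are hypothetical. The texts are made-up strings held in memory so the sketch runs without any .txt or CSV files.

```python
import string
from collections import Counter

import pandas as pd
from scipy.stats import mannwhitneyu


def term_frequencies(text):
    """Relative term frequencies for one document (assumed TF definition)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def tf_table(documents):
    """One row per word, one column per document; absent words get 0."""
    rows = [term_frequencies(doc) for doc in documents]
    return pd.DataFrame(rows).T.fillna(0.0)


def compare_corpora(docs_a, docs_b):
    """Run the Mann-Whitney U test for every word in the grand word list."""
    tf_a = tf_table(docs_a)
    tf_b = tf_table(docs_b)

    # Grand word list: every word that occurs in either corpus.
    grand_word_list = sorted(set(tf_a.index) | set(tf_b.index))

    # Give both tables the same rows, inserting 0 for missing words.
    tf_a = tf_a.reindex(grand_word_list, fill_value=0.0)
    tf_b = tf_b.reindex(grand_word_list, fill_value=0.0)

    results = []
    for word in grand_word_list:
        stat, p_value = mannwhitneyu(
            tf_a.loc[word], tf_b.loc[word], alternative="two-sided"
        )
        results.append({"word": word, "U": stat, "p_value": p_value})
    return pd.DataFrame(results)


# Illustrative, made-up documents standing in for the .txt files.
docs_a = [
    "The whale swam near the ship.",
    "A whale and a ship at sea.",
    "The white whale sank the ship.",
]
docs_b = [
    "The garden was full of roses.",
    "A garden by the sea.",
    "Roses grew in the old garden.",
]

results = compare_corpora(docs_a, docs_b)
print(results.sort_values("p_value").head())
```

To follow the file-based steps, the two term frequency tables and the result table can each be written out with the pandas to_csv method, for example results.to_csv("results.csv", index=False), where the file name is only a placeholder. With only a handful of documents per corpus the p-values stay fairly large, so real corpora need many more documents for the test to be informative.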
Explore sample code available on TAPoR.