Compare word usage across two corpora to see if there is any difference between the two. Look at a word in the two texts, you use the U test to determine if the difference is significant. Helps determine the uniqueness of a term.
Term Frequency-Inverse Document Frequency or TF-IDF, is used to determine how important a word is within a single document of a collection. It will help determine the importance or weight of word to a document in a collection or corpus. It ranks the importance of word based on how often it appears in a text but the rank is offset by how often it occurs in the whole collection.
The technique known as "indexing" plays a fundamental role in search engines like Google and Yahoo, and can help researchers rapidly expedite their data analysis. This recipe will describe the steps one can follow in order to index data with the Python package Whoosh.