Analysis

Compare word usage across two corpora to see whether the two differ. For a given word, you compare how often it occurs in the texts of each corpus and use the U test (commonly known as the Mann-Whitney U test) to determine whether the difference is statistically significant. This helps determine how distinctive a term is to one corpus.
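
As a rough illustration of the comparison described above, here is a minimal sketch in plain Python. The function names, the whitespace tokenizer, and the toy corpora are all assumptions made for this example. The p-value uses the normal approximation with a tie correction and no continuity correction, which is only a rough guide for samples this small; a statistics library would normally be used instead.

```python
import math

def relative_frequency(word, text):
    """Share of tokens in `text` equal to `word` (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    return tokens.count(word) / len(tokens) if tokens else 0.0

def mann_whitney_u(xs, ys):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        tied = j - i
        tie_term += tied ** 3 - tied
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Toy corpora invented for illustration: one text per list entry.
corpus_a = ["the soul seeks the truth", "truth is the aim of the soul", "the truth of the matter"]
corpus_b = ["a dog ran home", "the dog ate a bone", "a cat and a dog"]

freqs_a = [relative_frequency("the", text) for text in corpus_a]
freqs_b = [relative_frequency("the", text) for text in corpus_b]
u_stat, p_value = mann_whitney_u(freqs_a, freqs_b)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

A small p-value suggests that the word's frequency differs between the two corpora by more than chance alone would explain.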

Term Frequency-Inverse Document Frequency, or TF-IDF, is used to determine how important a word is to a single document within a collection or corpus. It weights a word by how often it appears in that document, then offsets the weight by how often the word occurs across the whole collection, so words that are common everywhere rank low.
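
The weighting described above can be sketched in a few lines of plain Python. This assumes one common variant of the formula, tf = count / document length and idf = log(N / number of documents containing the term); other variants add smoothing or normalization. The function name `tf_idf` and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (naive whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Sample documents invented for illustration.
docs = ["the cat sat", "the dog barked", "the cat purred"]
for doc, doc_scores in zip(docs, tf_idf(docs)):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(doc, "->", ranked)
```

In this toy collection "the" appears in every document, so its score is zero, while a word that appears in only one document ranks highest for that document.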

Let's say that you have a large collection of texts and you want to use a computer to help you classify those texts into two or more groups, such as "Philosophical" and "Other". One technique to accomplish this task is supervised learning, whereby you train a computer to classify texts for you. Training involves manually classifying a subset of your texts, having the computer analyze the features of the texts in each group, and then having the computer try to classify texts that haven't already been classified.
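
To make the train-then-classify workflow concrete, here is a minimal sketch that uses a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is just one possible supervised technique, chosen here because it fits in a few lines; the class name, the whitespace tokenizer, and the tiny hand-labeled training set are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        """Learn word counts from (text, label) pairs classified by hand."""
        for text, label in labeled_texts:
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the label with the highest log-probability for `text`."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hand-labeled training texts, invented for illustration.
training = [
    ("being and knowledge and the nature of truth", "Philosophical"),
    ("ethics reason and the meaning of existence", "Philosophical"),
    ("the recipe needs flour butter and sugar", "Other"),
    ("the train departs from the station at noon", "Other"),
]

classifier = NaiveBayesTextClassifier()
classifier.train(training)
print(classifier.classify("the nature of reason and truth"))
```

In practice the training set would be far larger, and you would hold back some hand-classified texts to check how often the computer's labels agree with yours.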

The technique known as "indexing" plays a fundamental role in search engines like Google and Yahoo, and can help researchers speed up their data analysis. This recipe describes the steps one can follow to index data with the Python package Whoosh.
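
To show what an index buys you, here is a minimal inverted index in plain Python. It is a stand-in for the general idea only and does not use or imitate the Whoosh API; the function names and sample documents are made up for this example. The index maps each word to the set of documents that contain it, so a search looks words up directly instead of rereading every text.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word in `query`."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Sample documents invented for illustration.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "quick dog",
}

index = build_index(documents)
print(sorted(search(index, "quick")))
print(sorted(search(index, "the dog")))
```

A full-text package such as Whoosh builds on the same idea and adds features like stored schemas, query parsing, and relevance ranking.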