A common requirement in information extraction is the ability to identify all persons or characters referred to in a text. This involves determining the parts of speech in the text, tagging them, and retrieving the names from the tagged words. The common approach is to use a part-of-speech (POS) tagger, which analyses a sentence and associates each word with its lexical category, i.e. whether it is an adverb, noun, adjective, conjunction, etc. NLTK is a robust library and is therefore the main ingredient of our recipe.
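The name-retrieval step can be illustrated with a minimal sketch. It assumes the text has already been tokenized and tagged with Penn Treebank tags, as a list of (word, tag) pairs of the kind `nltk.pos_tag` returns. The helper `extract_names` and the hand-tagged sample sentence are hypothetical and used only for illustration.

```python
def extract_names(tagged_tokens):
    """Group consecutive proper nouns (NNP/NNPS) into candidate names.

    tagged_tokens: list of (word, tag) pairs with Penn Treebank tags.
    """
    names, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            # Still inside a run of proper nouns: extend the current name.
            current.append(word)
        elif current:
            # The run ended: store the collected name and start over.
            names.append(" ".join(current))
            current = []
    if current:
        # Flush a name that ends the sentence.
        names.append(" ".join(current))
    return names


# Hand-tagged sample (an assumption, not real tagger output).
tagged = [
    ("Elizabeth", "NNP"), ("Bennet", "NNP"), ("met", "VBD"),
    ("Mr.", "NNP"), ("Darcy", "NNP"), ("at", "IN"),
    ("the", "DT"), ("ball", "NN"), (".", "."),
]
candidates = extract_names(tagged)
```

Proper-noun tags also cover places and organizations, so the result is a list of candidate names that would still need filtering to keep only persons.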
The technique known as "indexing" plays a fundamental role in search engines like Google and Yahoo, and can help researchers speed up their data analysis. This recipe describes the steps for indexing data with the Python package Whoosh.
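To make the idea of indexing concrete, here is a toy inverted index in plain Python. It is not Whoosh's API; it only sketches the general principle that full-text search libraries build on, namely mapping each term to the documents that contain it so that a query does not have to scan every document. The function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Sample documents (assumed data for illustration).
docs = {
    "a": "Whoosh is a search library",
    "b": "Indexing speeds up search",
    "c": "Python packages for data analysis",
}
index = build_index(docs)
hits = search(index, "search library")
```

The index is built once, and each lookup then touches only the entries for the query terms. A real library adds persistence, scoring, and richer text analysis on top of this idea.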