Finding People or Characters in a Text (Named-Entity Recognition)

Introduction 

A common requirement in information extraction is identifying all the persons or characters referred to in a text. This involves working out the part of speech of each word, tagging the words accordingly, and then retrieving the names from among the tagged words. The common approach uses a part-of-speech (POS) tagger, which analyses a sentence and associates each word with its lexical descriptor, i.e. whether it is an adverb, noun, adjective, conjunction, etc. NLTK is a robust library that provides these tools and is therefore the main ingredient of our recipe.

Ingredients 
  • NLTK library
  • NumPy (required internally by NLTK for this recipe)
  • Python 3
  • A text file
  • This recipe is adapted from Zhang Jinman's cookbook
Steps 
  1. Open the text file and load the contents into a container
  2. Create a container to store the person or character names that will be found in the text
  3. Split the text by sentences
  4. Split the sentences further into words
  5. Tag / label each word with its lexical association
  6. Finally, extract the words labelled "PERSON" and deposit them into the container we created in step 2.
  7. Feel free to print out the words!
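The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not necessarily the recipe's own code: the function names persons_in_tree and find_persons are made up for illustration, and it assumes the NLTK tokenizer, POS tagger, named-entity chunker and word-list data have already been fetched with nltk.download(...) (the exact resource names depend on the NLTK version).

```python
from nltk import Tree, ne_chunk, pos_tag, sent_tokenize, word_tokenize


def persons_in_tree(tree):
    """Return the names found in the PERSON chunks of one chunked sentence.

    Hypothetical helper: `tree` is the kind of nltk.Tree that ne_chunk returns.
    """
    names = []
    for node in tree:
        # Named entities are subtrees; ordinary words are plain (word, tag) tuples.
        if isinstance(node, Tree) and node.label() == "PERSON":
            # Join multi-word names such as "Elizabeth Bennet" into one string.
            names.append(" ".join(word for word, _tag in node.leaves()))
    return names


def find_persons(path):
    """Hypothetical helper that walks through steps 1 to 6 for one text file."""
    # Step 1: open the text file and load its contents.
    with open(path, encoding="utf-8") as handle:
        text = handle.read()

    # Step 2: container for the names that will be found.
    persons = []

    # Step 3: split the text into sentences.
    for sentence in sent_tokenize(text):
        # Step 4: split the sentence into words.
        words = word_tokenize(sentence)
        # Step 5: tag each word with its part of speech.
        tagged = pos_tag(words)
        # Step 6: chunk named entities and keep those labelled "PERSON".
        persons.extend(persons_in_tree(ne_chunk(tagged)))

    return persons
```

For step 7, something like print(find_persons("my_text.txt")) would print the collected names, where my_text.txt stands in for your own text file. A name appears once for every mention it receives, so wrap the result in set(...) if only distinct names are wanted.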
Next steps / further information 
  • Explore the code on TAPoR 
Status