Finding People or Characters in a Text (Named-Entity Recognition)
A common requirement in information extraction is the ability to identify all persons or characters referred to in a text. Doing so involves working out the parts of speech in the text, tagging them, and retrieving the names that appear among the tagged words. The common approach is to use a part-of-speech (POS) tagger, which analyses a sentence and associates each word with its lexical descriptor, i.e. whether it is an adverb, noun, adjective, conjunction, etc. NLTK is a robust library that provides these tools, and it is therefore the main ingredient of our recipe.
- NLTK library
- Numpy (required internally by NLTK for this recipe)
- Python 3
- A text file
- This recipe is adapted from Zhang Jinman's cookbook
- Open the text file and load the contents into a container
- Create a container to store the person or character names that will be found in the text
- Split the text by sentences
- Split the sentences further into words
- Tag each word with its part of speech (its lexical category)
- Finally, group the tagged words into named entities, extract the ones labelled "PERSON", and deposit them into the container we created in step 2.
- Feel free to print out the words!
- Explore the code on TAPoR
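The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the recipe's original code: the function names `find_people` and `people_in_chunked_sentence` are hypothetical, the container is assumed to be a plain list, and the file is assumed to be UTF-8 text. It also assumes the NLTK data for the sentence tokenizer, the POS tagger, and the named-entity chunker has already been fetched with `nltk.download`; the resource names differ between NLTK versions, so they are not hard-coded here.

```python
def people_in_chunked_sentence(chunked):
    """Return the PERSON names found in one chunked sentence.

    `chunked` is expected to look like the tree that nltk.ne_chunk
    returns: an iterable whose items are either (word, tag) tuples
    or labelled subtrees offering label() and leaves().
    """
    names = []
    for node in chunked:
        # Plain tokens are tuples; named entities are labelled subtrees.
        if hasattr(node, "label") and node.label() == "PERSON":
            # Join multi-word names such as "Elizabeth Bennet".
            names.append(" ".join(word for word, _tag in node.leaves()))
    return names


def find_people(path, encoding="utf-8"):
    """Collect every PERSON mention in the text file at `path`."""
    # Imported here so the helper above can be used without NLTK.
    import nltk

    # Step 1: open the text file and load its contents.
    with open(path, encoding=encoding) as handle:
        text = handle.read()

    # Step 2: container for the names that will be found.
    people = []

    # Step 3: split the text into sentences.
    for sentence in nltk.sent_tokenize(text):
        # Step 4: split the sentence into words.
        tokens = nltk.word_tokenize(sentence)
        # Step 5: tag each word with its part of speech.
        tagged = nltk.pos_tag(tokens)
        # Step 6: group tagged words into named entities, keep PERSON.
        chunked = nltk.ne_chunk(tagged)
        people.extend(people_in_chunked_sentence(chunked))

    return people
```

Calling `find_people("sample.txt")` (with a hypothetical file name) returns the list, which can then be printed. Every mention is kept, so a name that occurs several times appears several times; wrapping the result in `set(...)` would give the distinct names instead.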