Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be the letters, punctuation, words, or sentences. They could even be a combination of all these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.
A common requirement in extracting information is the ability to identify all persons or characters referred to in the text. It is an elaborate process of knowing parts-of-speech in the text, tagging them and retrieving the names associated with those dialogues. The common approach is by using part-of-speech POS-tagger which analyses a sentence and associates words with their lexical descriptor i.e. whether it is an adverb, noun, adjective, conjuntion e.t.c. NLTK is a robust library and therefore the main ingredient of our recipe.
Stemming and Lemmatization are text analysis methods that return the root word of derivative forms of the word. This is done by removing the suffixes of words (stemming) or by comparing the derivative words to a predetermined vocabulary of their root forms (lemmatization). This recipe was adapted from a Python Notebook written by Kynan Lee.
This recipe shows how to scrape comments from a YouTube video to analyze.
This utility is for creating a simple web scraper with Python.
This section presents a concise summary of what the recipe will teach, focusing primarily on: (1) the outcome of the recipe (what are you trying to achieve, in non-technical terms); (2) the main technical approaches employed; and (3) whether the recipe is based on someone else’s work/code (they should be cited if so).
This recipe discusses ways to find electronic texts (e-texts) online that can be used by other text analysis tools.
Compare word usage across two corpora to see if there is any difference between the two. Look at a word in the two texts, you use the U test to determine if the difference is significant. Helps determine the uniqueness of a term.
Term Frequency-Inverse Document Frequency or TF-IDF, is used to determine how important a word is within a single document of a collection. It will help determine the importance or weight of word to a document in a collection or corpus. It ranks the importance of word based on how often it appears in a text but the rank is offset by how often it occurs in the whole collection.