Example

There are many different methods and ways to study Twitter data. Unlike more traditional text (e.g. books), discourse on twitter can encompass a wide range of perspectives, ideas, thoughts, and context. The analysis of this discourse is something that needs that requires different cleaning methods, refinement, and categorization.

In this recipe we use 3 ebooks to show how topic analysis can identify the different topics each text represents. We will use Latent Dirichlet Allocation (LDA) approach which is the most common modelling method to discover topics. We can then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook found on TAPoR.

NB: Any number of texts can be used, we choose 3 for this recipe.

Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be the letters, punctuation, words, or sentences. They could even be a combination of all these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.

A common requirement in extracting information is the ability to identify all persons or characters referred to in the text. It is an elaborate process of knowing parts-of-speech in the text, tagging them and retrieving the names associated with those dialogues. The common approach is by using part-of-speech POS-tagger which analyses a sentence and associates words with their lexical descriptor i.e. whether it is an adverb, noun, adjective, conjuntion e.t.c. NLTK is a robust library and therefore the main ingredient of our recipe.

Pages