This recipe is part of the Text Analysis for Twitter Research (TATR) series, and will look at tokenizing and extracting key features from a Tweet.
This recipe uses regular expressions (or Regex) to clean a text document. This recipe is based on the Using Regular Expressions to Clean a Text code.
This recipe will use regular expressions to clean up a webpage. This is useful if you want to carry out any meaningful textual analysis of the content in a web page. We can remove the html tags and other unnecessary textual elements with this method.
In this recipe we use 3 ebooks to show how topic analysis can identify the different topics each text represents. We will use Latent Dirichlet Allocation (LDA) approach which is the most common modelling method to discover topics. We can then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook found on TAPoR.
NB: Any number of texts can be used, we choose 3 for this recipe.
Stemming and Lemmatization are text analysis methods that return the root word of derivative forms of the word. This is done by removing the suffixes of words (stemming) or by comparing the derivative words to a predetermined vocabulary of their root forms (lemmatization). This recipe was adapted from a Python Notebook written by Kynan Lee.