Text Cleaning

This tutorial goes over how to manipulate Twitter data that has already been processed into a CSV. It touches on topics such as tokenization and graphing, with the aim of showcasing a wide variety of analytical methods. Although the included Jupyter IPython notebooks contain some dummy CSV data (for use in case you do not have any), feel free to use your own. If you do not have your own Twitter data, or have it only as a JSON object, see the following:

This tutorial covers natural language processing using Python. It focuses on how to use NLTK for text preprocessing tasks such as tokenizing text, finding synonyms and antonyms, and word stemming.


This tutorial is divided into 6 parts; they are:

  1. Metamorphosis by Franz Kafka
  2. Text Cleaning is Task Specific
  3. Manual Tokenization
  4. Tokenization and Cleaning with NLTK
  5. Additional Text Cleaning Considerations
  6. Tips for Cleaning Text for Word Embedding
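The manual tokenization step (part 3) can be sketched with only the Python standard library; the sample sentence (the opening line of Metamorphosis) is used here purely for illustration:

```python
import string

def manual_tokenize(text):
    """Split on whitespace, lowercase, and strip punctuation from each token."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().split()
    return [t.translate(table) for t in tokens if t.translate(table)]

sample = "One morning, when Gregor Samsa woke from troubled dreams."
print(manual_tokenize(sample))
# ['one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams']
```

NLTK's `word_tokenize` (part 4) handles edge cases such as contractions better, but this hand-rolled pass shows what tokenization actually does.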

This recipe is part of the Text Analysis for Twitter Research (TATR) series and looks at tokenizing a Tweet and extracting its key features.
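Extracting key features from a Tweet can be sketched with regular expressions; the sample tweet text and the choice of features (hashtags, mentions, URLs) are assumptions for illustration:

```python
import re

def tweet_features(tweet):
    """Pull hashtags, @-mentions, and URLs out of a tweet's text."""
    return {
        "hashtags": re.findall(r"#(\w+)", tweet),   # e.g. #TATR -> 'TATR'
        "mentions": re.findall(r"@(\w+)", tweet),   # e.g. @tapor_org -> 'tapor_org'
        "urls": re.findall(r"https?://\S+", tweet), # any http(s) link
    }

tweet = "Loving the #TATR recipes by @tapor_org! https://tapor.ca"
print(tweet_features(tweet))
# {'hashtags': ['TATR'], 'mentions': ['tapor_org'], 'urls': ['https://tapor.ca']}
```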

This recipe uses regular expressions (regex) to clean a text document. It is based on the Using Regular Expressions to Clean a Text code.
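A typical regex cleaning pass lowercases the text, drops digits and punctuation, and collapses whitespace; this is a minimal sketch, and the sample string is an assumption:

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"\d+", " ", text)          # remove digits
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(clean_text("Chapter 1: It was a DARK, stormy night!!"))
# chapter it was a dark stormy night
```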

This recipe uses regular expressions to clean up a webpage, which is useful if you want to carry out any meaningful textual analysis of a page's content. With this method we can remove the HTML tags and other unnecessary textual elements.
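Stripping tags with a regex can be sketched as below; the sample markup is an assumption, and for badly formed pages a full HTML parser is more robust than a regex:

```python
import re
from html import unescape

def strip_html(page):
    # Drop <script> and <style> blocks together with their contents
    page = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", page, flags=re.S | re.I)
    page = re.sub(r"<[^>]+>", " ", page)       # remove remaining tags
    page = unescape(page)                      # decode entities like &amp;
    return re.sub(r"\s+", " ", page).strip()   # tidy whitespace

html_page = "<html><body><h1>Title</h1><p>Some &amp; more text.</p><script>var x=1;</script></body></html>"
print(strip_html(html_page))
# Title Some & more text.
```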

In this recipe we use three ebooks to show how topic analysis can identify the different topics each text represents. We use the Latent Dirichlet Allocation (LDA) approach, the most common modelling method for discovering topics, and then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook, found on TAPoR.

NB: Any number of texts can be used; we chose three for this recipe.

Stemming and lemmatization are text analysis methods that reduce derivative forms of a word to its root. Stemming does this by removing the suffixes of words; lemmatization compares the derivative words to a predetermined vocabulary of their root forms. This recipe was adapted from a Python Notebook written by Kynan Lee.
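The distinction can be illustrated with a toy sketch in plain Python; the suffix list and lookup vocabulary are assumptions, and the actual notebook likely uses NLTK's PorterStemmer and WordNetLemmatizer instead:

```python
def toy_stem(word):
    """Toy stemmer: strip a few common suffixes (a stand-in for NLTK's PorterStemmer)."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look up a predetermined vocabulary of root forms.
LEMMA_VOCAB = {"ran": "run", "running": "run", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    return LEMMA_VOCAB.get(word, word)

print(toy_stem("jumping"), toy_stem("walked"))   # jump walk
print(toy_lemmatize("mice"), toy_lemmatize("better"))  # mouse good
```

Note the trade-off this exposes: stemming is fast but can produce non-words, while lemmatization returns real words but only for forms its vocabulary knows.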