TATR: Tokenization and Extraction


This recipe is part of the Text Analysis for Twitter Research (TATR) series. It looks at tokenizing the text of a tweet and extracting key features from it.

  • Open a new Jupyter Notebook and import the following libraries:
    • NLTK
    • pandas
    • NumPy
  • Import Twitter data
  • Insert the data into a pandas DataFrame
  • Write a function to tokenize the text within the dataframe
  • Apply the function to every cell in the text column of the DataFrame (see the pandas “apply” method and Python “lambda” functions)
  • Write a function to extract your desired column of token data from the dataframe
  • Save the extracted data into a new pandas DataFrame
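
The steps above can be sketched roughly as follows. This is a minimal illustration, not the TATR notebook code itself. The sample tweets, the column names ("text", "tokens"), and the helper names tokenize_text and extract_tokens are all assumptions made for the example. It uses NLTK's TweetTokenizer, which is one reasonable choice for tweets because it keeps hashtags, handles, and emoticons together as single tokens. NumPy is not needed for this small sketch.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# Hypothetical sample data standing in for an imported Twitter dataset.
tweets = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "Loving the #DigitalHumanities panel at Congress! @someone",
        "Tokenizing tweets with NLTK :) http://example.com",
    ],
})

# preserve_case=False lowercases ordinary tokens (emoticons are left alone).
tokenizer = TweetTokenizer(preserve_case=False)


def tokenize_text(text):
    """Split the text of one tweet into a list of tokens."""
    return tokenizer.tokenize(text)


def extract_tokens(df, column="tokens"):
    """Return the chosen column as a new single-column DataFrame."""
    return df[[column]].copy()


# Apply the tokenizer to every cell of the text column.
tweets["tokens"] = tweets["text"].apply(lambda t: tokenize_text(t))

# Save the extracted token column into a new DataFrame.
token_df = extract_tokens(tweets)
print(token_df)
```

With real data, the sample DataFrame would be replaced by one loaded from your own file, for example with pd.read_csv, and "text" would be whichever column holds the tweet body.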

The TATR library was presented as an academic poster at Congress 2018, held in Regina, SK. For a PDF version of the full poster, please visit:

Next steps / further information

Certain aspects of this recipe draw upon code from the companion TATR notebooks and recipes. In particular, please see:

TATR: Panda and CSV of Tweets

This recipe describes components that are fundamental for some of the more advanced TATR notebooks.