TATR: Tokenization and Extraction
This recipe is part of the Text Analysis for Twitter Research (TATR) series. It walks through tokenizing a tweet and extracting key features from the resulting tokens.
- Python 3
- Twitter data
- Kynan Ly’s sample code on TAPoR
- Open a new Jupyter notebook and import the required libraries
- Import Twitter data
- Insert the data into a pandas DataFrame
- Write a function to tokenize the text within the DataFrame
- Apply the function to every cell in the DataFrame (see pandas' `apply` method and Python's `lambda` expressions)
- Write a function to extract your desired column of token data from the DataFrame
- Save the extracted data into a new pandas DataFrame
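The steps above can be sketched as follows. This is a minimal illustration, not the TATR implementation itself: the sample tweets, the column names, and the regex-based `tokenize` function are all assumptions made for the example; your own Twitter data and tokenizer may differ.

```python
import re
import pandas as pd

# Hypothetical sample tweets standing in for imported Twitter data.
tweets = pd.DataFrame({"text": [
    "Analyzing tweets with #TATR is fun!",
    "Tokenization splits text into words.",
]})

def tokenize(text):
    """Split a tweet into lowercase tokens, keeping #hashtags and @mentions intact."""
    return re.findall(r"[#@]?\w+", text.lower())

# Apply the tokenizer to every cell of the "text" column via a lambda.
tweets["tokens"] = tweets["text"].apply(lambda t: tokenize(t))

# Extract the token column into a new DataFrame.
tokens_df = tweets[["tokens"]].copy()
print(tokens_df)
```

Because `apply` with a `lambda` runs the function on each cell independently, the same pattern works for any per-tweet transformation, not just tokenization.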
The TATR library was presented as an academic poster at the 2018 Congress in Regina, SK. For a PDF version of the full poster, please visit:
Certain aspects of this recipe draw upon code from the companion TATR notebooks and recipes. In particular, please see:
This recipe describes components that are fundamental to some of the more advanced TATR notebooks.