TATR: Tokenization and Extraction
Introduction
This recipe is part of the Text Analysis for Twitter Research (TATR) series; it shows how to tokenize tweets and extract key features from them.
Ingredients
- Python 3
- NLTK
- pandas
- NumPy
- Twitter data
- Kynan Ly’s sample code on TAPoR
Steps
- Open a new Jupyter Notebook and import the following libraries:
- NLTK
- pandas
- NumPy
- Import your Twitter data
- Insert the data into a pandas DataFrame
- Write a function to tokenize the text within the DataFrame
- Apply the function to every cell in the DataFrame (see pandas' "apply" method and Python's "lambda" expressions)
- Write a function to extract your desired column of token data from the DataFrame
- Save the extracted data into a new pandas DataFrame (a sketch combining these steps follows this list)
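The following is a minimal sketch combining these steps. It assumes the Twitter data has been exported to a CSV file with one tweet per row; the file name tweets.csv and the column name text are hypothetical placeholders for your own data:

    import pandas as pd
    import numpy as np  # listed in the ingredients; not strictly needed for this sketch
    from nltk.tokenize import TweetTokenizer

    # Load the Twitter data into a pandas DataFrame.
    # "tweets.csv" and the "text" column are hypothetical; adjust to your export.
    df = pd.read_csv("tweets.csv")

    # NLTK's TweetTokenizer is tuned for Twitter text, so hashtags,
    # @-mentions, and emoticons survive as single tokens.
    tokenizer = TweetTokenizer()

    def tokenize(text):
        """Tokenize one tweet, returning a list of tokens."""
        return tokenizer.tokenize(str(text))

    # Apply the function to every cell of the text column
    # using pandas' apply with a lambda.
    df["tokens"] = df["text"].apply(lambda t: tokenize(t))

    def extract_column(frame, column):
        """Extract a single column of token data into a new DataFrame."""
        return frame[[column]].copy()

    # Save the extracted tokens into a new pandas DataFrame.
    tokens_df = extract_column(df, "tokens")
    print(tokens_df.head())

Because apply runs the tokenizer row by row, the same pattern works for any per-tweet feature extractor, not just tokenization.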
Discussion
The TATR library was presented as an academic poster at the 2018 Congress in Regina, SK. For a PDF version of the full poster, please visit: