Tokenization using NLTK
Introduction
Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be letters, punctuation marks, words, or sentences, or a combination of these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.
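For example (a minimal illustration, assuming NLTK and its 'punkt' models are installed; the sentence is ours rather than from the original notebook), the same text splits quite differently at the word level and at the sentence level:

```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello, world! How are you?"
print(word_tokenize(text))   # ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']
print(sent_tokenize(text))   # ['Hello, world!', 'How are you?']
```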
Ingredients
- Python 3
- The Natural Language Toolkit (NLTK) module
- NLTK tokenizers (from the nltk.tokenize module): word_tokenize, sent_tokenize, TweetTokenizer, and MWETokenizer.
- A text to analyze.
- Sample tokenization code on TAPoR.
Steps
- Open the Python notebook and import the NLTK tokenizers.
- The tokenizer you choose will depend on the type of tokenization you require (a combined sketch of the steps below follows this list).
- Tokenize a string.
- Assign a sentence to a variable.
- Pass the variable, in parentheses, to the NLTK tokenizer of your choice.
- Print the results.
- Tokenize a file.
- Make sure you have a .txt file in the same directory as your notebook.
- Open the file using its full file name and read its contents into a variable.
- Pass the variable, in parentheses, to the NLTK tokenizer of your choice.
- Print the results.
- Tokenize with multi-word expressions.
- Choose two words that you want to treat as one token.
- Give the words as a comma-separated pair and choose a separator character to join them with.
- Add additional multi-word combinations if you wish.
- Run the multi-word tokenizer on a tokenized sentence.
- Print the results.
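The following is a minimal sketch of the three steps above, assuming NLTK is installed and the 'punkt' tokenizer models have been downloaded; the file name sample.txt, the example sentences, and the multi-word pairs are placeholders rather than part of the original recipe.

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, MWETokenizer

nltk.download('punkt')  # one-time download of the models used by word_tokenize and sent_tokenize

# Step 1: tokenize a string stored in a variable
sentence = "Don't hesitate to ask questions :)"
print(word_tokenize(sentence))              # word- and punctuation-level tokens
print(TweetTokenizer().tokenize(sentence))  # keeps emoticons and contractions intact
print(sent_tokenize("First sentence. Second one."))  # sentence-level tokens

# Step 2: tokenize a file (sample.txt is a placeholder file in the notebook's directory)
with open("sample.txt", encoding="utf-8") as f:
    text = f.read()
print(word_tokenize(text))

# Step 3: multi-word tokenization -- ("New", "York") is rejoined as "New_York"
mwe = MWETokenizer([("New", "York"), ("hot", "dog")], separator="_")
print(mwe.tokenize(word_tokenize("I ate a hot dog in New York")))
```

Note that MWETokenizer works on a list of tokens, so the sentence is passed through word_tokenize first.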
Discussion
Tokenization is the starting point for many text analysis projects. By breaking text down in this way, you create a “bag of words”: a collection of tokens that can be counted and compared independently of their original order. The tokens can then feed higher-order text analysis such as word frequency counts or n-grams, as sketched below.
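As a small follow-on sketch (the sentence below is purely illustrative), a tokenized text can be passed directly to NLTK's frequency and n-gram utilities:

```python
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("the cat sat on the mat and the cat slept")

freq = FreqDist(tokens)         # word frequency counts over the bag of words
print(freq.most_common(3))      # e.g. [('the', 3), ('cat', 2), ('sat', 1)]

print(list(ngrams(tokens, 2)))  # bigrams: pairs of adjacent tokens
```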