Tokenization using NLTK

Introduction 

Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be letters, punctuation marks, words, or sentences, or even a combination of these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.

Ingredients 
  • Python 3
  • The Natural Language Toolkit (NLTK) module
  • NLTK tokenizers: word_tokenize, TweetTokenizer, MWETokenizer, and sent_tokenize.
  • A text to analyze.
  • Sample tokenization code on TAPoR.
Steps 
  • Open the Python notebook and import the NLTK libraries (see the import sketch after this list).
    • The tokenizer you choose will depend upon the type of tokenization you require.
  • Tokenize a string (see the first example after this list).
    • Write a sentence into a variable.
    • Pass the variable to the NLTK tokenizer of your choice.
    • Print the results.
  • Tokenize a file (see the file example after this list).
    • Make sure you have a .txt file in your Python directory.
    • Open the file by its full file name and read its contents into a variable.
    • Pass the variable to the NLTK tokenizer of your choice.
    • Print the results.
  • Tokenize with multi-word expressions (see the MWETokenizer example after this list).
    • Choose two words that you want to treat as one token.
    • Add the words to the tokenizer as a pair, separated by a comma, and choose a symbol to join them in the output.
    • Add additional multi-word combinations if you so desire.
    • Run the multi-word tokenizer on a word-tokenized sentence.
    • Print the results.
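
The import cell might look like the following minimal sketch. It assumes NLTK is already installed (pip install nltk); depending on your NLTK version, the tokenizer models are downloaded as “punkt” or “punkt_tab”.

    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, MWETokenizer

    # word_tokenize and sent_tokenize rely on the pre-trained "punkt" models;
    # download them once per environment ("punkt_tab" on newer NLTK releases).
    nltk.download("punkt")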
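
To tokenize a string, write a sentence into a variable and pass it to a tokenizer such as word_tokenize (the sentence below is just a placeholder):

    from nltk.tokenize import word_tokenize

    # Write a sentence into a variable, then pass it to the tokenizer.
    sentence = "NLTK makes tokenization straightforward, doesn't it?"
    print(word_tokenize(sentence))
    # ['NLTK', 'makes', 'tokenization', 'straightforward', ',', 'does', "n't", 'it', '?']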
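
To tokenize a file, read its contents into a variable first (sample.txt is a hypothetical file name; substitute any .txt file in your working directory):

    from nltk.tokenize import word_tokenize

    # "sample.txt" is a placeholder; the file must exist in your directory.
    with open("sample.txt", encoding="utf-8") as f:
        text = f.read()

    print(word_tokenize(text)[:20])  # show the first 20 tokens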
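
For multi-word tokenization, MWETokenizer takes each expression as a tuple of words plus a separator symbol, and it operates on an already-tokenized list rather than a raw string. The expressions below are illustrative choices:

    from nltk.tokenize import MWETokenizer, word_tokenize

    # Each multi-word expression is a tuple; the separator joins its words
    # in the output. Both expressions here are arbitrary examples.
    tokenizer = MWETokenizer([("New", "York"), ("machine", "learning")], separator="_")

    sentence = "I study machine learning in New York."
    print(tokenizer.tokenize(word_tokenize(sentence)))
    # ['I', 'study', 'machine_learning', 'in', 'New_York', '.']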
Discussion 

Tokenization is the starting point for many text analysis projects. Breaking text down in this way creates a “bag of words”: a collection of tokens stripped of their original order and context. The tokens can then be compared and analyzed for higher-order text analysis such as word frequency counts or n-grams, as in the frequency sketch below.
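
As a quick illustration of that next step, a frequency count over the tokens might look like this sketch (the text is a placeholder):

    from nltk import FreqDist
    from nltk.tokenize import word_tokenize

    # Count how often each token appears in a toy sentence.
    tokens = word_tokenize("the cat sat on the mat and the dog sat too")
    print(FreqDist(tokens).most_common(3))
    # [('the', 3), ('sat', 2), ('cat', 1)]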

Next steps / further information 
Status