Stemming and Lemmatization

Introduction 

Stemming and lemmatization are text analysis methods that reduce the inflected or derived forms of a word to a common base form. Stemming does this by stripping suffixes according to rules, so the result (the stem) is not always a dictionary word. Lemmatization instead compares each word to a predetermined vocabulary and returns its dictionary form (the lemma). This recipe was adapted from a Python Notebook written by Kynan Lee.

Ingredients 
  • Python 3
  • The Natural Language Toolkit (NLTK) module
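  • NLTK classes from the nltk.stem package: RegexpStemmer, LancasterStemmer, PorterStemmer, SnowballStemmer, and WordNetLemmatizer.
  • The WordNet corpus, which WordNetLemmatizer requires (download it once with nltk.download("wordnet")).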
  • A text to analyze.
Steps 
  • Open a Python notebook and import the NLTK classes you need.
    • Import each stemmer or lemmatizer of your choice on a separate line.
  • Apply stemming to a text.
    • Read the .txt file you would like to analyze into a variable.
    • Split the text into words and loop over them.
    • Pass each word, in parentheses, to the stem() method of the stemmer of your choice.
    • Print the results.
  • Apply lemmatization to a text.
    • Read the .txt file you would like to analyze into a variable.
    • Split the text into words and loop over them.
    • Pass each word, in parentheses, to the lemmatize() method of the lemmatizer.
    • Print the results.
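
As a minimal sketch of the stemming steps above, the example below runs one short sentence through each of the four stemmers. The sample sentence, the regular-expression tokenizer, the suffix pattern given to RegexpStemmer, and the file name in the comment are all assumptions made for illustration; they are not taken from the original notebook.

```python
import re

from nltk.stem import LancasterStemmer, PorterStemmer, RegexpStemmer, SnowballStemmer

# Stand-in for the contents of your .txt file. To read a real file instead:
#     with open("my_text.txt", encoding="utf-8") as f:  # hypothetical file name
#         text = f.read()
text = "The runners were running quickly through the cities"

# Split the text into lowercase word tokens with a simple regular expression.
tokens = re.findall(r"[a-z]+", text.lower())

# One instance of each stemmer. RegexpStemmer needs a suffix pattern; this
# pattern is only an example and ignores words shorter than four letters.
stemmers = {
    "regexp": RegexpStemmer("ing$|s$|e$|able$", min=4),
    "lancaster": LancasterStemmer(),
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# Loop over the tokens and pass each one to the stemmer's stem() method.
stems = {
    name: [stemmer.stem(token) for token in tokens]
    for name, stemmer in stemmers.items()
}

for name, result in stems.items():
    print(name, result)
```

To try a different stemmer, change the entries of the stemmers dictionary; the loop stays the same.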
Discussion 

Stemming and lemmatization may appear to operate in a similar fashion, but there are major differences between the two methods. Stemming “lops” off the end of a word by rule, whereas lemmatization looks the word up in a vocabulary. As a result, a stemmer can return a non-word such as “citi” for “cities”, while a lemmatizer returns the dictionary form “city”. Because of the vocabulary lookup, processing time may also differ significantly between the two methods in certain situations.

These methods are useful for projects where the singular or root forms of a word are required. Most stemming and lemmatization tools only work for English, but the Snowball stemmer supports Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish.
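
To check which languages an installed NLTK version accepts, the Snowball stemmer exposes a languages attribute. The sketch below prints that list and stems one German word; the sample word is an assumption chosen for illustration.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor in this NLTK version.
supported = SnowballStemmer.languages
print(supported)

# Create a stemmer for a specific language by passing its lowercase name.
german = SnowballStemmer("german")
german_stem = german.stem("Autobahnen")
print(german_stem)
```

Passing a name that is not in the list raises a ValueError, so checking the attribute first is a cheap safeguard.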

Next steps / further information 
  • Additional stemmer library documentation.
  • Explore sample code on TAPoR.
Status