Calculating TF-IDF with Python

Introduction 

Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a single document within a collection, or corpus. A word's weight rises with how often it appears in the document, but is offset by how often the word occurs across the whole collection, so words that are common everywhere score low.

Ingredients 
  • Python 3
  • Libraries: os, csv, string, re, pandas, math
  • From the collections module: defaultdict
  • Texts/documents as .txt files
Steps 

First, gather the files you want to use and put them in a folder called “texts” on your computer. Make sure your notebook application or coding environment has all the required ingredients installed.
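As a minimal sketch of this preparation step, the code below reads every .txt file in a folder and splits it into lowercase words. The function names `tokenize` and `load_texts` are hypothetical, and the tokenization (strip ASCII punctuation, split on whitespace) is an assumption; your texts may need different cleaning.

```python
import os
import re
import string


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    # Assumption: only ASCII punctuation is removed; curly quotes and
    # dashes would need extra handling.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.findall(r"\S+", text)


def load_texts(folder="texts"):
    """Return {filename: list of words} for every .txt file in the folder."""
    documents = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                documents[name] = tokenize(handle.read())
    return documents
```

Calling `load_texts("texts")` should give one list of words per file, ready for counting.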

  • Calculate the term frequency (TF) in each document
    • Iterate through each document and count how often each word appears
    • Determine the relative frequency:
      • Take the number of times a word appears and divide by the total word count of the document
    • Write the results to a CSV file
  • Calculate the inverse document frequency (IDF): divide the total number of documents by the number of documents containing the word; the standard definition then takes the logarithm of this ratio
    • Open the CSV file containing the TF values
    • Iterate through the file, calculating the IDF for each word
    • Write the results to a new CSV file
  • Calculate TF-IDF: multiply TF and IDF together
    • Open both CSV files
    • For each row and column, multiply the corresponding TF and IDF values
    • Write the results to a new CSV file
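The three calculations above can be sketched as follows, assuming each document has already been split into a list of words. All function names are hypothetical. Two choices here are assumptions rather than part of the recipe: the IDF uses the natural logarithm of the ratio (the standard definition), and the values are kept in memory between steps instead of being re-read from intermediate CSV files.

```python
import csv
import math
from collections import defaultdict


def term_frequencies(documents):
    """Relative frequency of each word in each document."""
    tf = {}
    for name, tokens in documents.items():
        counts = defaultdict(int)
        for token in tokens:
            counts[token] += 1
        total = len(tokens)
        tf[name] = {word: count / total for word, count in counts.items()}
    return tf


def inverse_document_frequencies(tf):
    """log(total documents / documents containing the word) for each word."""
    doc_counts = defaultdict(int)
    for frequencies in tf.values():
        for word in frequencies:
            doc_counts[word] += 1
    total_docs = len(tf)
    # Assumption: natural log of the ratio, as in the standard definition.
    return {word: math.log(total_docs / n) for word, n in doc_counts.items()}


def tf_idf(tf, idf):
    """Multiply each TF value by the IDF of the same word."""
    return {
        name: {word: value * idf[word] for word, value in frequencies.items()}
        for name, frequencies in tf.items()
    }


def write_matrix(path, scores):
    """Write one row per document and one column per word; absent words get 0."""
    vocabulary = sorted({word for row in scores.values() for word in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["document"] + vocabulary)
        for name, row in scores.items():
            writer.writerow([name] + [row.get(word, 0) for word in vocabulary])
```

With the logarithm, a word that appears in every document gets an IDF of 0, so its TF-IDF is 0 however frequent it is. `write_matrix` can be called on the TF values as well as the final scores if you want the intermediate files the steps describe.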
Next steps / further information 
  • Using the Python libraries pandas and matplotlib, you can graph, tabulate, or otherwise manipulate the data from the CSV file
  • A sample notebook is available on TAPoR
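As one possible starting point with pandas, the sketch below loads a TF-IDF CSV file and lists the highest-scoring words for a document. It assumes a hypothetical layout: a first column named "document" followed by one column per word. The function name `top_terms` is also hypothetical.

```python
import pandas as pd


def top_terms(csv_path, document, n=5):
    """Return the n highest-scoring words for one document as a pandas Series."""
    # Assumption: the CSV has a "document" column and one column per word.
    table = pd.read_csv(csv_path, index_col="document")
    return table.loc[document].sort_values(ascending=False).head(n)
```

If matplotlib is installed, calling `.plot.barh()` on the returned Series should draw a simple bar chart of those words.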
TaDiRAH goals/methods 
Status