# Calculating TF-IDF with Python

Introduction

Term Frequency-Inverse Document Frequency, or TF-IDF, measures how important a word is to a single document within a collection, or corpus. A word's weight rises with how often it appears in that document, but is offset by how often the word occurs across the whole collection, so words that are common everywhere score low.

Ingredients
• Python 3
• Libraries: os, csv, string, re, pandas, math
• defaultdict, imported from the “collections” module
• Texts or documents saved as .txt files

Steps

First, gather the files you want to analyze and put them in a folder called “texts” on your computer. Then make sure your notebook application or coding environment has all the required ingredients installed.
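As a minimal sketch of this setup step, the snippet below reads every .txt file in a folder and splits it into lowercase words. The function names `tokenize` and `load_texts` are hypothetical, and the tokenization rule (strip punctuation, split on whitespace) is an assumption; the recipe itself does not say how words should be separated.

```python
import os
import string


def tokenize(text):
    """Lowercase the text, delete punctuation, and split on whitespace."""
    # Assumption: punctuation is simply removed, so "don't" becomes "dont".
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def load_texts(folder="texts"):
    """Return {filename: list of words} for every .txt file in the folder."""
    documents = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                documents[name] = tokenize(handle.read())
    return documents
```

Calling `load_texts()` with no argument reads from the “texts” folder in the current working directory.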

• Calculate the term frequency (TF) of each word in each document
  • Iterate over each document and count how often each word appears
  • Determine the relative frequency:
    • Divide the number of times a word appears by the total word count of the document
  • Write the results into a CSV file
• Calculate the inverse document frequency (IDF): divide the total number of documents by the number of documents containing the word (the logarithm of this ratio is commonly used)
  • Open the CSV file containing the TF
  • Iterate through the file, calculating the IDF for each word
  • Write the results into a new CSV file
• Calculate TF-IDF: multiply TF and IDF together
  • Open both CSV files
  • For each row and column, multiply the corresponding TF and IDF values
  • Write the results into a new CSV file
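The three calculations above can be sketched as follows. This is an illustration under stated assumptions, not the sample notebook's code: the function names are hypothetical, the toy corpus is made up, and the scores are kept in memory rather than passed through intermediate CSV files. The IDF here takes the logarithm of the ratio, which is the common convention; dropping `math.log` gives the plain ratio described in the steps.

```python
import csv
import math
from collections import defaultdict


def term_frequencies(documents):
    """Relative frequency of each word within each document."""
    tf = {}
    for name, words in documents.items():
        counts = defaultdict(int)
        for word in words:
            counts[word] += 1
        tf[name] = {word: n / len(words) for word, n in counts.items()}
    return tf


def inverse_document_frequencies(tf):
    """log(total documents / documents containing the word) for each word."""
    doc_counts = defaultdict(int)
    for word_scores in tf.values():
        for word in word_scores:
            doc_counts[word] += 1
    total = len(tf)
    # Assumption: natural logarithm of the ratio (common convention).
    return {word: math.log(total / n) for word, n in doc_counts.items()}


def tf_idf(tf, idf):
    """Multiply each TF value by the IDF of the same word."""
    return {
        name: {word: value * idf[word] for word, value in word_scores.items()}
        for name, word_scores in tf.items()
    }


def write_table(path, table):
    """Write {document: {word: score}} as a CSV, one row per document."""
    vocabulary = sorted({w for word_scores in table.values() for w in word_scores})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["document"] + vocabulary)
        for name, word_scores in table.items():
            writer.writerow([name] + [word_scores.get(w, 0.0) for w in vocabulary])


# Toy corpus standing in for the tokenized files (made-up data).
documents = {
    "a.txt": ["the", "cat", "sat"],
    "b.txt": ["the", "dog", "sat", "down"],
    "c.txt": ["the", "bird", "flew"],
}
tf = term_frequencies(documents)
idf = inverse_document_frequencies(tf)
scores = tf_idf(tf, idf)
```

In this toy corpus “the” appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is 0 everywhere, while “cat” appears in only one document and keeps a positive score. To save any of the three tables, call for example `write_table("tfidf.csv", scores)`; the filename is only a placeholder.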

Next steps / further information
• Using the Python libraries pandas and matplotlib, you can graph, tabulate, or otherwise manipulate the data from the CSV file
• A sample notebook is available on TAPoR
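As one possible starting point for the pandas suggestion above, the sketch below turns TF-IDF scores into a table and finds the highest-scoring word in each document. The scores are made-up placeholder values, not results from any real corpus, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up TF-IDF scores standing in for the values computed earlier.
scores = {
    "a.txt": {"cat": 0.37, "sat": 0.14, "the": 0.0},
    "b.txt": {"dog": 0.27, "sat": 0.10, "the": 0.0},
}

# Rows are documents, columns are words; words missing from a document become 0.
table = pd.DataFrame.from_dict(scores, orient="index").fillna(0.0)

# Highest-scoring word in each document.
top_words = table.idxmax(axis=1)

print(table)
print(top_words)
```

If the scores were saved to a CSV file with one row per document, `pd.read_csv` with `index_col` set to the document column should yield a similar table. With matplotlib installed, something like `table.loc["a.txt"].plot(kind="bar")` would chart one document's scores.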