Classifying Documents Using Supervised Learning
Let's say that you have a large collection of texts and you want to use a computer to help you classify those texts into two or more groups, such as "Philosophical" and "Other". One technique for this task is supervised learning, in which you train a computer to classify texts for you. Training involves manually classifying a subset of your texts, having the computer analyze the features of the texts in each class, and then having the computer try to classify texts that haven't already been classified. You can get a sense of how accurately the classification is working by using only part of the manually classified texts to train the classifier and then using the remaining classified texts to test it. If the classifier's predictions for the test texts mostly match your manual labels, you can feel more confident using it to predict the class of unknown texts.
You will need:
- two or more sets of texts organized into folders with class names (such as "Philosophical" and "Other")
- IPython Notebooks (see these instructions)
(for now, follow along at https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingPh...)
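To make the train-then-test idea described above concrete, here is a minimal sketch in plain Python. It is not the code from the linked notebook. It assumes a simple word-count (naive Bayes) classifier with add-one smoothing, and the sample sentences, the function names (`tokenize`, `train`, `classify`, `accuracy`) and the choice of which texts to hold out are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude feature extractor)."""
    return text.lower().split()

def train(labeled_texts):
    """Count documents and words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the class with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, labeled_texts):
    """Fraction of texts whose predicted class matches the manual label."""
    correct = sum(classify(text, model) == label for text, label in labeled_texts)
    return correct / len(labeled_texts)

# Hypothetical, manually classified texts (stand-ins for files in class folders).
labeled = [
    ("the nature of knowledge and truth", "Philosophical"),
    ("reason and the limits of knowledge", "Philosophical"),
    ("ethics concerns virtue and reason", "Philosophical"),
    ("truth and virtue in ethics", "Philosophical"),
    ("the ship sailed across the harbor", "Other"),
    ("rain fell on the harbor town", "Other"),
    ("the market sold fish and bread", "Other"),
    ("bread and fish for the town", "Other"),
]

# Train on three texts per class and hold one per class back for testing.
train_set = labeled[0:3] + labeled[4:7]
test_set = [labeled[3], labeled[7]]

model = train(train_set)
print(accuracy(model, test_set))                # 1.0 on this toy data
print(classify("virtue and knowledge", model))  # Philosophical
```

With a real collection you would read each file's text and take its label from the name of the folder it sits in. A perfect score on two held-out sentences says very little; a larger test set gives a more trustworthy estimate of accuracy.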