Clustering Texts by Authorship
In this recipe, we measure a corpus to determine the authorship of its texts and visualize the texts by author. We use multidimensional scaling (MDS) as one of the techniques for analyzing similarity and dissimilarity in the data. This recipe is based on Jinman Zhang's Cookbook on GitHub.
- Python 3
- Numpy
- scikit-learn
- Scipy
- Matplotlib
- Corpora:
- Charles John Huffam Dickens (1812-1870) – A Christmas Carol, A Tale of Two Cities, Great Expectations
- Jane Austen (1775-1817) – Emma, Persuasion, Pride and Prejudice
- William Shakespeare (1564-1616) – Henry V, Macbeth, Richard III
- Provenance: Jinman Zhang's Notebook
- Download the texts (see links in ingredients section)
- All the texts by the authors are preferably stored in a common directory.
- Calculate the length of each text
- Go through the texts, calculating the frequency of each word in each text document, then sum the frequencies to get the word count per text document. The word frequencies are stored as a matrix.
- Plotting the frequencies
- At this point we can plot the text lengths using the Matplotlib ingredient. Using a histogram, we can visualize the word counts to spot outliers and to see the general distribution and skewness of the corpus.
- Analyzing text relationships
- Using MDS, we now model the similarity or dissimilarity between texts as distances in a geometric space: similar texts sit close together and dissimilar texts sit far apart. We use scikit-learn's machine learning functions in this step.
- Plotting the distances
- We can now plot the distances on a scatter plot using Matplotlib, with the texts appearing in cluster formations. Similar texts are clustered together and share the same colour.
- Further plotting
- Using dendrograms, we can better illustrate the clusters in a hierarchical chart. The tree-like structure of a dendrogram can associate and distinguish texts by author, giving more insight than a scatter plot.
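The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the original notebook: the function names are hypothetical, and the choices of scikit-learn's CountVectorizer for counting words, cosine distance for comparing texts, and average linkage for the dendrogram are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances


def word_count_matrix(texts):
    """Build a document-term matrix: one row per text, one column per word.

    With its default settings CountVectorizer lowercases the text and
    ignores one-character tokens such as "a" or "I".
    """
    return CountVectorizer().fit_transform(texts)


def text_lengths(counts):
    """Sum each row of the matrix to get the word count per text."""
    return np.asarray(counts.sum(axis=1)).ravel()


def text_distances(counts):
    """Pairwise cosine distance between texts (assumed metric).

    0 means identical word proportions, 1 means no words in common.
    """
    return cosine_distances(counts)


def mds_coordinates(distances, random_state=1):
    """Place each text in 2D so that plotted distances approximate the input."""
    mds = MDS(n_components=2, dissimilarity="precomputed",
              random_state=random_state)
    return mds.fit_transform(distances)


def hierarchical_linkage(distances):
    """Average-linkage clustering (assumed method) on the distance matrix."""
    condensed = squareform(distances, checks=False)
    return linkage(condensed, method="average")


def plot_results(lengths, coords, link, titles, authors):
    """Draw the histogram, the MDS scatter plot and the dendrogram."""
    import matplotlib.pyplot as plt

    # Give every author an integer so texts by one author share a colour.
    author_ids = {a: i for i, a in enumerate(dict.fromkeys(authors))}
    colours = [author_ids[a] for a in authors]

    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    axes[0].hist(lengths)
    axes[0].set_xlabel("Words per text")
    axes[1].scatter(coords[:, 0], coords[:, 1], c=colours, cmap="tab10")
    for (x, y), title in zip(coords, titles):
        axes[1].annotate(title, (x, y))
    dendrogram(link, labels=list(titles), orientation="right", ax=axes[2])
    fig.tight_layout()
    return fig
```

To use the sketch, read each downloaded text into a string, pass the list of strings to word_count_matrix, and feed the result through the remaining functions in order. The lists titles and authors are assumed to be supplied by hand, in the same order as the texts.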
This recipe would serve as an important step, especially when working with historical and classical texts. In particular, it can be used to compare an author's texts to analyze linguistic differences, to ascertain authorship, or as an exploratory step that raises further questions about authorship.
Beyond clustering by authorship, some of the techniques mentioned here can be used to cluster texts by other parameters, such as theme, topic, or other common or uncommon characteristics of the text.
This recipe or its techniques would accompany well any research aiming to support an argument where the corpora are measurable and each corpus can be analyzed in relation to another.