Show graphically key facts about the distribution of a word in a text
This recipe performs three different tasks:
(1) Plot the cumulative type/token ratio of a text;
(2) Track the occurrences of a particular word in a text and show them all in a dispersion plot;
(3) Show graphically the relative frequency of the word across n equal sub-parts of the text, and annotate the plot with the chi-square value and a dispersion measure (default: Juilland's D).
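As an illustration of task 1, the cumulative type/token ratio at a given position is the number of distinct words (types) seen so far divided by the number of words (tokens) read so far. The lines below are a minimal sketch in base R, not the recipe's own script: the toy sentence, the simple letters-only tokenization, and the variable names are assumptions made for the example.

```r
# Toy text; the recipe itself reads a raw text file instead.
text <- "the cat sat on the mat and the dog sat on the cat"

# Assumption: lowercase everything and split on runs of non-letters.
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nzchar(words)]

# A token adds a new type only if it has not occurred earlier.
types_so_far <- cumsum(!duplicated(words))

# Cumulative type/token ratio at every position in the text.
ttr <- types_so_far / seq_along(words)

# Positions of a (hypothetical) search word, e.g. for marking in red.
search_word <- "cat"
hits <- which(words == search_word)

print(round(ttr, 2))
print(hits)
```

The vector ttr can then be passed to plot(), and points(hits, ttr[hits], col = "red") would mark the occurrences of the search word.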
Resources:
- Raw text file
- The R programming/statistical package (base package)
User-specified input:
- A search word
- Number of parts (n) into which the text will be divided (task 3)
- Dispersion measure to use (task 3)
Procedure:
- Read a text file into R.
- If necessary, clean and organize the text.
- Tokenize the text file into words.
- Make one vector containing all the words of the text file in the order in which they occur in the original text.
- Identify the positions where the search word occurs in the vector.
- Calculate the type/token ratio incrementally at each position and plot it, marking in red the positions where the search word occurs (task 1 above).
- Make a dispersion plot showing the position of each occurrence of the search word (task 2 above).
- Divide the vector of words into n equal sub-parts.
- Make a barplot showing the frequency of the search word within each sub-part.
- Calculate the frequencies and percentages, the chi-square value, and the selected dispersion measure, which indicates how evenly the search word is dispersed across the text. Add all these measures to the barplot (task 3 above).
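The last three steps can be sketched as follows. This is a hedged example in base R rather than the recipe's actual code: the toy word vector and all variable names are assumptions. Juilland's D is computed in the form commonly given for equal-sized parts, D = 1 - V / sqrt(n - 1), where V is the population standard deviation of the per-part frequencies divided by their mean. The chi-square value compares the observed frequencies with an even spread across the parts.

```r
# Toy word vector (12 tokens); the recipe builds this from the text file.
words <- c("a", "b", "a", "c", "a", "d", "b", "c", "a", "b", "c", "d")
search_word <- "a"   # hypothetical search word
n <- 3               # number of (nearly) equal sub-parts

# Assign every token position to one of n consecutive parts.
part_id <- ceiling(seq_along(words) * n / length(words))

# Frequency of the search word in each part, and as percentages.
freq <- tabulate(part_id[words == search_word], nbins = n)
pct <- 100 * freq / sum(freq)

# Chi-square against equal expected frequencies in every part.
expected <- sum(freq) / n
chisq <- sum((freq - expected)^2 / expected)

# Juilland's D: 1 = perfectly even, 0 = all occurrences in one part.
pop_sd <- sqrt(mean((freq - mean(freq))^2))  # population sd (divides by n)
V <- pop_sd / mean(freq)
D <- 1 - V / sqrt(n - 1)

print(freq)
print(chisq)
print(D)
```

Calling barplot(freq, names.arg = seq_len(n)) would draw the bars, and the computed values could be added with mtext() or legend().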
The recipe produces three .png plots, one for each task.
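Writing one such plot to a .png file might look like the sketch below, which draws a simple dispersion plot for task 2 using the base graphics devices. The file name, image size, and toy data are assumptions for illustration, and the recipe's own plots may be laid out differently.

```r
# Toy word vector and positions of a hypothetical search word.
words <- c("a", "b", "a", "c", "a", "d", "b", "c", "a", "b", "c", "d")
search_word <- "a"
hits <- which(words == search_word)

# Assumption: write to a temporary directory under a made-up file name.
out_file <- file.path(tempdir(), "dispersion_plot.png")

# Open a PNG device, draw one vertical line per occurrence, close the device.
png(out_file, width = 800, height = 300)
plot(hits, rep(1, length(hits)),
     type = "h", col = "red",
     xlim = c(1, length(words)), ylim = c(0, 1),
     yaxt = "n", ylab = "",
     xlab = "Position in text (tokens)",
     main = paste("Dispersion of", search_word))
invisible(dev.off())
```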
For a critical overview of various dispersion measures, see Gries (2008, 2009):
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403-437.
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion@IJCL....
Additional web resources (dispersion scripts) for this paper are available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/link...
Gries, Stefan Th. 2009. Dispersions and adjusted frequencies in corpora: further explorations. In Stefan Th. Gries, Stefanie Wulff, & Mark Davies (eds.), Corpus linguistic applications: current studies, new directions, 197-212. Amsterdam: Rodopi.
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion_Rodo....