Show graphically key facts about the distribution of a word in a text

Introduction 

This recipe performs three tasks:
(1) Plot the cumulative type/token ratio of a text;
(2) Track the occurrences of a particular word in a text and plot them all in a dispersion plot;
(3) Show graphically the relative frequency of the word across n equal sub-parts of the text, and annotate the plot with a chi-square statistic and a dispersion measure (default is Juilland's D).
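
Task (3) names Juilland's D without defining it, so a small sketch may help. It assumes the commonly cited definition, D = 1 - V / sqrt(n - 1), where V is the standard deviation of the per-part frequencies divided by their mean. The function name juilland_d is hypothetical, and the use of the population standard deviation (divisor n) is an assumption that keeps D between 0 and 1 for equal-sized parts.

```r
# Juilland's D for a vector of per-part frequencies (equal-sized parts assumed).
# D = 1 - V / sqrt(n - 1), with V = sd / mean.
# Assumption: sd is the population standard deviation (divisor n, not n - 1).
juilland_d <- function(freqs) {
  n <- length(freqs)
  m <- mean(freqs)
  if (n < 2 || m == 0) return(NA_real_)  # undefined for one part or no hits
  pop_sd <- sqrt(mean((freqs - m)^2))
  1 - (pop_sd / m) / sqrt(n - 1)
}

d_even  <- juilland_d(c(5, 5, 5, 5))    # word spread perfectly evenly
d_clump <- juilland_d(c(20, 0, 0, 0))   # all occurrences in one part
```

Under these assumptions, a value near 1 indicates an even spread across the parts and a value near 0 indicates that the occurrences are concentrated in one part.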

Ingredients 

Resources:

  • Raw text file
  • The R programming/statistical package (base package)

User-specified input:

  • A search word
  • Number of parts (n) into which the text will be divided (task 3)
  • Dispersion measure to use (task 3)

Steps 

  • Read into R a text file.
  • If necessary, clean/organize text.
  • Tokenize the text file into words.
  • Make one vector containing all the words of the text file in the order in which they occur in the original text.
  • Identify the positions where the search word occurs in the vector.
  • Calculate the type/token ratio incrementally for each position and plot it, marking in red the positions where the search word occurs (task 1 above).
  • Make a dispersion plot showing the position of each occurrence of the search word (task 2 above).
  • Divide the vector of words into n equal sub-parts.
  • Make a barplot showing the frequency of the search word within each sub-part.
  • Calculate frequencies and percentages, the chi-square statistic, and the selected dispersion measure, which indicates how evenly the search word is dispersed within the text. Add all these measures to the barplot (task 3 above).
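
The computational core of these steps can be sketched in base R roughly as follows. This is a minimal illustration, not the recipe's own script: the toy string stands in for a text read with readLines(), the variable names are hypothetical, and the tokenizer (lower-casing, then splitting on anything that is not a letter or apostrophe) is only one possible cleaning choice. Plotting and the dispersion measure are left out.

```r
# Toy text standing in for the contents of a raw text file (assumption).
raw_text <- "The cat sat on the mat. The dog saw the cat, and the cat ran."
search_word <- "cat"
n_parts <- 3

# Clean and tokenize: lower-case, split on non-letter characters.
words <- unlist(strsplit(tolower(raw_text), "[^a-z']+"))
words <- words[words != ""]          # one vector of words, in text order

# Cumulative type/token ratio: distinct words seen so far / tokens so far.
ttr <- cumsum(!duplicated(words)) / seq_along(words)

# Positions of the search word in the vector.
positions <- which(words == search_word)   # 2 11 14

# Assign every token to one of n_parts near-equal consecutive parts.
part <- ceiling(seq_along(words) * n_parts / length(words))
sizes <- tabulate(part, nbins = n_parts)               # tokens per part
freqs <- tabulate(part[positions], nbins = n_parts)    # 1 0 2

# Relative frequency per part, and each part's share of all occurrences.
rel_freq <- freqs / sizes
pct <- 100 * freqs / sum(freqs)

# Chi-square goodness-of-fit test against a spread proportional to part size.
# suppressWarnings() hides the small-sample warning triggered by the toy data.
chi <- suppressWarnings(chisq.test(freqs, p = sizes / sum(sizes)))
chi_stat <- unname(chi$statistic)
```

With a real text the expected counts are usually large enough that the chi-square warning does not arise; when it does, the p-value should be read with caution.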

Discussion 

The recipe produces three PNG plots, one for each task.
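
As a rough illustration of how three such PNG files might be written with base R graphics, the sketch below uses a toy token vector. The file names, image sizes, and output directory are all hypothetical, and the annotation of the barplot with chi-square and the dispersion measure is omitted for brevity.

```r
# Toy token vector; in practice this is the tokenized text file.
words <- c("the", "cat", "sat", "on", "the", "mat", "the", "dog",
           "saw", "the", "cat", "and", "the", "cat", "ran")
search_word <- "cat"
n_parts <- 3
out_dir <- tempdir()   # hypothetical output location

idx <- seq_along(words)
hits <- which(words == search_word)
ttr <- cumsum(!duplicated(words)) / idx

# Plot 1: cumulative type/token ratio, search-word positions in red.
png(file.path(out_dir, "ttr.png"), width = 800, height = 500)
plot(idx, ttr, type = "l",
     xlab = "Token position", ylab = "Cumulative type/token ratio")
points(hits, ttr[hits], col = "red", pch = 19)
invisible(dev.off())

# Plot 2: dispersion plot, one vertical line per occurrence.
png(file.path(out_dir, "dispersion.png"), width = 800, height = 250)
plot(hits, rep(1, length(hits)), type = "h",
     xlim = c(1, length(words)), ylim = c(0, 1), yaxt = "n",
     xlab = "Token position", ylab = "", main = search_word)
invisible(dev.off())

# Plot 3: relative frequency of the search word in each of n_parts parts.
part <- ceiling(idx * n_parts / length(words))
rel_freq <- tabulate(part[hits], nbins = n_parts) /
  tabulate(part, nbins = n_parts)
png(file.path(out_dir, "barplot.png"), width = 800, height = 500)
barplot(rel_freq, names.arg = seq_len(n_parts),
        xlab = "Part", ylab = "Relative frequency")
invisible(dev.off())
```

The summary statistics could be added to the third plot with mtext() or the sub argument of barplot(), depending on where the annotation should appear.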

For a critical overview of various dispersion measures, see Gries (2008, 2009):

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403-437. 
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion@IJCL...
Additional web resources (dispersion scripts) for this paper available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/link...

Gries, Stefan Th. 2009. Dispersions and adjusted frequencies in corpora: further explorations. In Stefan Th. Gries, Stefanie Wulff, & Mark Davies (eds.), Corpus linguistic applications: current studies, new directions, 197-212. Amsterdam: Rodopi. 
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion_Rodo....
