Generating Concordances

Introduction 

This recipe shows how to generate a basic concordance for a word, displaying each occurrence within its textual context. The approach uses regular expressions, a pattern-matching technique familiar from "find and replace" operations. Regex, as it is also known, is available in most programming languages and is a well-documented method of parsing text. This recipe is based on Jinman's cookbook.

Ingredients 
Steps 
  • Open the text file and load the contents into a variable
  • Clean up and tokenize the contents, then convert the tokens to a single case
    • Specify how to tokenize the words in the text using regex, and decide whether to ignore punctuation and contractions.
    • It is recommended to convert all tokens to either lowercase or uppercase, which simplifies the regular expression used to find the keyword.
  • Create a function that takes three arguments (a keyword, a list of tokens, and an optional context width) and generates a list of concordances along with where they occur in the text. It does this iteratively, going through the list of tokenized words:
    • Compare each word with the keyword.
    • Whenever a match is found, get its location in the list.
    • Get a specific number of words appearing before and after the keyword, based on the given context width, where possible.
  • Finally, append each keyword-in-context to a list of concordances.
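
The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the original cookbook's code: the function names tokenize and concordance, the default width of 3, and the sample sentence are all assumptions made for the example. It also compares tokens with simple equality once they are lowercased; a regex match could be substituted where pattern matching on the keyword is needed.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters or digits, optionally joined by
    internal apostrophes, so "don't" stays whole and other punctuation is
    dropped.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


def concordance(keyword, tokens, width=3):
    """Return (position, keyword-in-context) pairs for every match."""
    keyword = keyword.lower()
    results = []
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - width)  # do not run past the start of the list
            context = tokens[start:i + width + 1]  # slicing clips at the end
            results.append((i, " ".join(context)))
    return results


# In practice the text would be loaded from a file, for example:
#     with open("mytext.txt", encoding="utf-8") as f:  # hypothetical file name
#         text = f.read()
text = "The cat sat on the mat. The dog saw the cat, and the cat ran."
tokens = tokenize(text)

for position, line in concordance("cat", tokens, width=2):
    print(position, line)
```

Each result pairs the token position of the match with its surrounding words. Matches near the start or end of the text simply get a shorter context.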
Discussion 
  • Removing punctuation and sanitizing contractions are extra steps that you can apply at your discretion, depending on your needs. However, this can be a complex process, and it raises the dilemma of how to handle such cases without losing valuable content or including unnecessary content.
  • Concordances are useful for appreciating how often, and in what contexts, a keyword occurs in a text.
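
To illustrate the first point, here is a small sketch of how the choice of tokenizing pattern changes what happens to contractions. Both patterns and the sample sentence are assumptions chosen for the example. Neither is the "right" answer, and real texts (curly apostrophes, hyphenated words, and so on) will need more care.

```python
import re

sample = "Don't stop; it's the reader's choice!"

# Option 1: keep only runs of letters. All punctuation is dropped,
# which splits contractions and possessives into fragments.
split_tokens = re.findall(r"[a-z]+", sample.lower())

# Option 2: also allow apostrophes that sit between letters,
# so contractions and possessives survive as single tokens.
whole_tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", sample.lower())

print(split_tokens)
print(whole_tokens)
```

With the first pattern, a search for "don't" finds nothing, while stray tokens such as "t" and "s" enter the list. With the second pattern, "don't" is searchable, but a search for "reader" no longer matches "reader's".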
Next steps / further information 
  • There are also other approaches besides the use of regular expressions. The NLTK library, for example, has a method for searching text.
  • Stéfan Sinclair's Notebook provenance
Status