Find Patterns in a Text File Using Regular Expressions

Introduction 

This recipe demonstrates how to use the command-line tool grep with regular expressions to find patterns in a plain text file.

Ingredients 
  1. text file: e.g. Alice’s Adventures in Wonderland - http://www.gutenberg.org/files/11/11.txt
  2. Mac OS X (macOS), Linux/Unix, or Cygwin on Windows
    1. Specifically, the grep command

Steps 
  1. Download a copy of the text file(s)
  2. Open a shell or terminal
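  3. In general, you will use commands of the form grep "PATTERN" filename.txt
    1. PATTERN is the regular expression and filename.txt is the file to search
    2. grep prints every line of the file that contains a match for PATTERN
    3. Type plain straight quotes around the pattern; the shell treats curly quotes as ordinary characters, so they would become part of the pattern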
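As a minimal sketch of the steps above, the following builds a three-line stand-in file and runs the general form of the command on it. The names workdir and sample.txt are made up for this illustration so that it runs without downloading anything; with the real text you would pass 11.txt instead. The -n and -c options are standard grep options that are not otherwise covered in this recipe.

```shell
# Scratch directory, so nothing in your own files is touched.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"   # hypothetical stand-in for 11.txt

# To fetch the real text instead (needs network access; curl is only one option):
#   curl -L -O http://www.gutenberg.org/files/11/11.txt

printf '%s\n' \
  'Alice was beginning to get very tired' \
  'the Rabbit took a watch out of its waistcoat-pocket' \
  'and looked at it, and then hurried on' > "$sample"

# General form: grep "PATTERN" filename.txt
# grep prints every line that contains a match for the pattern.
matching_lines=$(grep "Rabbit" "$sample")
echo "$matching_lines"

# -n prefixes each matching line with its line number.
grep -n "it" "$sample"

# -c prints only the number of matching lines.
line_count=$(grep -c "it" "$sample")
echo "$line_count"
```

Note that grep reports matching lines, not matching words: the pattern "it" above also selects the line containing "Rabbit" and "its".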

Examples:

Using our example of Alice’s Adventures in Wonderland, here are some searches you could run using regular expressions. In each case grep prints every line that contains a match.

  • To find every line that contains the word “waistcoat”, the command would be:
    • grep "waistcoat" 11.txt
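  • To find two-letter words that end in ‘t’ (such as it, at), the command would be:
    • grep " .t " 11.txt
    • The dot matches exactly one character, whatever it is. The spaces on either side stand in for word boundaries, so a word at the start or end of a line, or one followed by punctuation, is missed.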
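  • To find words that end in ‘ed’, the command would be:
    • grep ".*ed " 11.txt
    • The dot matches any single character and the star means zero or more repetitions of the preceding item, so .* matches any run of characters. Because grep already searches anywhere within a line, this selects the same lines as the shorter pattern "ed ".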
  • To find either the word ‘she’ or ‘the’ in the text, the command would be:
    • grep " [st]he " 11.txt
    • The square brackets match exactly one character from the set inside them, so the pattern matches ‘he’ preceded by either ‘s’ or ‘t’. It will not match the word ‘he’ on its own.
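  • To search more than one text file, or a whole corpus, add the -R option so that grep searches a directory recursively. With PATTERN and directory/ as placeholders, the command would be:
    • grep -R "PATTERN" directory/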
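The searches above can be tried on a few lines of stand-in text before running them on the whole book. This is only a sketch: the directory corpus, the files part1.txt and part2/more.txt, and the sample sentences are all invented for the illustration. It assumes a grep that supports -R (GNU and BSD grep both do); -c counts matching lines and -l lists the names of matching files.

```shell
workdir=$(mktemp -d)
corpus="$workdir/corpus"        # hypothetical directory name
mkdir -p "$corpus/part2"
sample="$corpus/part1.txt"      # hypothetical stand-in for 11.txt

printf '%s\n' \
  'Alice looked at the White Rabbit' \
  'it took a watch out of its waistcoat-pocket' \
  'she hurried on and he followed the path' \
  'then he opened a door' > "$sample"
printf '%s\n' 'the Rabbit ran past' > "$corpus/part2/more.txt"

# A literal word: only the second line contains "waistcoat".
waistcoat=$(grep -c "waistcoat" "$sample")
echo "$waistcoat"

# " .t " finds "at" on the first line, but misses "it" on the second line
# because that word starts the line and has no space in front of it.
two_letter=$(grep " .t " "$sample")
echo "$two_letter"

# ".*ed " and the shorter "ed " select exactly the same lines.
ed_long=$(grep ".*ed " "$sample")
ed_short=$(grep "ed " "$sample")
echo "$ed_long"

# " [st]he " matches " the " on the first and third lines.
# The last line contains " he " only, so it is not selected.
she_the=$(grep -c " [st]he " "$sample")
echo "$she_the"

# -R searches every file under the directory; -l lists the matching files.
hits=$(grep -R -l "Rabbit" "$corpus" | sort)
echo "$hits"
```

The third line begins with “she”, but it is selected only because of “ the ” later in the line: a word at the very start of a line has no leading space for the pattern to match.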

Discussion 

For the grep man page, type man grep in the terminal, or read it online at http://www.ss64.com/bash/grep.html

A good online tutorial is A Tao of Regular Expressions.

Mastering Regular Expressions (by Jeffrey Friedl) is the O’Reilly manual for regular expressions (2nd edition published 2002).

Regular expressions can also be used in many different programming languages.

Things to look out for:

  • Every program that implements regular expressions has slightly different syntax.
  • Some patterns cannot be described with regular expressions. These often involve nested or paired elements, such as parentheses that must balance at any depth of nesting.
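The first point can be seen within grep itself, which supports more than one regular expression syntax. The sketch below uses three made-up input lines. In grep's default (basic) syntax a plus sign is an ordinary character, while with the -E option (extended syntax) it means one or more of the preceding item.

```shell
# Three hypothetical input lines, chosen for this illustration.
input=$(printf '%s\n' 'aaa' 'a+b' 'b')

# Basic syntax: "+" is literal, so only the line containing "a+" is selected.
basic=$(printf '%s\n' "$input" | grep 'a+')
echo "$basic"

# Extended syntax: "a+" means one or more "a", so two lines are selected.
extended=$(printf '%s\n' "$input" | grep -E 'a+')
echo "$extended"
```

The same pattern text therefore selects different lines depending on which syntax the tool is using, so it is worth checking the documentation of each program before reusing a pattern.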

Next steps / further information 
  • There are interesting connections between regular expressions, formal languages, and finite state machines or automata.
  • Once you understand regular expressions, you can use them to search and replace in text editors.
  • The Unix/Linux pipe character ( | ) strings multiple commands together in sequence, which allows manipulations that would otherwise be difficult without specialized tools.
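The last two points can be sketched with standard Unix tools. The input lines below are invented for the illustration. The choice of sed (a stream editor, standing in for an interactive text editor) and of sort, uniq, head, and awk is an assumption of this sketch and not something the recipe prescribes. The -o option, which prints only the matched text, is supported by GNU and BSD grep.

```shell
# Hypothetical input lines for the first example.
text=$(printf '%s\n' \
  'Alice looked at the White Rabbit' \
  'she hurried on and he followed the path' \
  'then he opened a door')

# Chain two searches: of the lines containing "ed ", count those that
# also contain " she " or " the ".
both=$(printf '%s\n' "$text" | grep "ed " | grep -c " [st]he ")
echo "$both"

# Word frequency: put each lowercase word on its own line, sort, count
# duplicates, sort by count, keep the top entry, and print "word count".
top_word=$(printf '%s\n' 'the cat and the hat' 'the end' \
  | grep -oE '[a-z]+' | sort | uniq -c | sort -rn | head -n 1 \
  | awk '{print $2, $1}')
echo "$top_word"

# Search and replace with sed: swap the first "she" or "the" on each line.
replaced=$(printf '%s\n' 'Alice saw the hat' | sed 's/[st]he/THE/')
echo "$replaced"
```

Each command in a pipeline reads the output of the one before it, so a search can be narrowed or summarised step by step without writing intermediate files.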