Find Patterns in a Text File Using Regular Expressions
Introduction
This recipe demonstrates how to use the command-line grep command and regular expressions to find patterns in a plain text file.
Ingredients
- text file: e.g. Alice’s Adventures in Wonderland - http://www.gutenberg.org/files/11/11.txt
- Mac OS X, Linux/Unix, or Cygwin on Windows
- Specifically, the grep command
Steps
- Download a copy of the text file(s)
- Open a shell or terminal
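- In general, you will use commands of the form: grep "PATTERN" filename.txt
- PATTERN is the regular expression; grep prints every line of the file that contains a match. Type the quotes as plain straight quotes, since the shell does not treat curly quotes as quoting characters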
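The general form above can be tried without the download. This is a minimal sketch, assuming a small stand-in file called sample.txt (a hypothetical name, not part of the recipe); with the real text, substitute 11.txt. The -n, -c and -i options go beyond what the recipe itself uses, but they are standard grep options.

```shell
# Build a tiny stand-in file so the commands run without the download.
# sample.txt is a hypothetical name; with the real text, use 11.txt instead.
printf '%s\n' \
  'Alice was beginning to get very tired' \
  'the Rabbit took a watch out of its waistcoat-pocket' \
  'Oh dear! I shall be late!' > sample.txt

# Basic form: print every line that contains the pattern.
grep "waistcoat" sample.txt

# -n prefixes each matching line with its line number.
grep -n "waistcoat" sample.txt

# -c prints only the number of matching lines.
grep -c "waistcoat" sample.txt

# -i ignores case, so "alice" also matches "Alice".
grep -i "alice" sample.txt
```

Because grep reports whole lines, the count from -c is the number of matching lines, not the number of times the pattern occurs.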
Examples:
- Using Alice’s Adventures in Wonderland, here are some searches you could run with regular expressions:
- To find every instance of the word “waistcoat” in the file the command would be:
- grep "waistcoat" 11.txt
- To find every two-letter word that ends in ‘t’ (such as it, at) the command would be:
- grep " .t " 11.txt
- The dot matches any single character; the spaces on either side limit the match to two-letter words
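- To find the lines containing words that end in ‘ed’ the command would be:
- grep ".*ed " 11.txt
- The dot matches any single character and the star means zero or more copies of the preceding item, so .* matches any run of characters before the ‘ed’. The trailing space means the word must be followed by a space, so words at the end of a line or before punctuation are missed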
- To find either the word ‘she’ or ‘the’ in the text, the command would be:
- grep " [st]he " 11.txt
- The square brackets match exactly one character, either ‘s’ or ‘t’, in front of ‘he’. The pattern will not match the word ‘he’ on its own
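- If you wanted to search more than one text file, or a whole corpus, the command would be:
- grep -R "PATTERN" directory
- The -R option searches every file under the named directory, including its subdirectories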
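The example searches above can be checked against a short file where the expected matches are easy to see. This is a rough sketch, assuming a hypothetical sample.txt and a hypothetical corpus directory in place of the real text. The -w and -o options are not part of the recipe; they are available in GNU and BSD grep and are shown here as one possible way around the reliance on surrounding spaces.

```shell
# sample.txt is a hypothetical stand-in for 11.txt.
printf '%s\n' \
  'she looked at it and then the door opened' \
  'he jumped up' \
  'Alice waited' > sample.txt

# Two-letter words ending in t: matches line 1 only (" at " and " it ").
grep " .t " sample.txt

# Words ending in "ed" followed by a space: matches lines 1 and 2.
# Line 3 is missed because "waited" ends the line, with no space after it.
grep ".*ed " sample.txt

# " [st]he " with a space on each side: matches line 1 because of " the ".
# The "she" at the start of that line has no space before it.
grep " [st]he " sample.txt

# -w matches whole words without relying on spaces, and -o prints
# only the matched text, one match per line.
grep -ow "[st]he" sample.txt
grep -ow "[[:alpha:]]*ed" sample.txt

# Recursive search: corpus is a hypothetical directory of text files.
mkdir -p corpus
cp sample.txt corpus/chapter1.txt
grep -R "jumped" corpus
```

With -R, each matching line is prefixed with the name of the file it came from, which is what makes the option useful across a corpus.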
Discussion
The grep man page: in the terminal, type "man grep", or online go to http://www.ss64.com/bash/grep.html
A good online tutorial is A Tao of Regular Expressions.
Mastering Regular Expressions by Jeffrey Friedl is the O’Reilly book on regular expressions (2nd edition published 2002).
Regular expressions can also be used in many different programming languages.
Things to look out for:
- Programs that implement regular expressions often differ slightly in syntax, so a pattern that works in one tool may need adjusting in another.
- Some patterns cannot be described with regular expressions. These often involve nested or paired elements.
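Both cautions above can be seen within grep itself. This is an illustrative sketch, not an exhaustive comparison: it contrasts grep's basic syntax with the extended syntax selected by -E, and then shows a pattern that looks like a check for paired parentheses but cannot count them. The test strings are made up for the example.

```shell
# Basic syntax (plain grep): "+" is an ordinary character,
# so only the line containing a literal "a+" is printed.
printf '%s\n' 'aaa' 'a+' | grep 'a+'

# Extended syntax (grep -E): "+" means one or more of the preceding item,
# so both lines are printed.
printf '%s\n' 'aaa' 'a+' | grep -E 'a+'

# Nested or paired elements: this pattern accepts "some opening parentheses
# followed by some closing parentheses", but it has no way to check that the
# counts agree. Both lines are printed, even though "(()" is unbalanced.
printf '%s\n' '(())' '(()' | grep -E '^\(+\)+$'
```

Checking that every opening parenthesis has a matching close requires counting, which is usually handled by a parser or a short program instead of a single regular expression.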
Status
Submitted by sondheim on Sat, 02/26/2011 - 00:00