Find Patterns in a Text File Using Regular Expressions
This recipe demonstrates how to use the command-line tool grep with regular expressions to find patterns in a plain-text file.
- Download a copy of the text file(s)
- Open a shell or terminal
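- In general, you will use commands of the form grep "PATTERN" filename.txt (type plain straight quotes; the shell does not treat curly quotes as quotes)
- PATTERN is the regular expression, and filename.txt is the file to search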
- Using our example of Alice’s Adventures in Wonderland (saved here as 11.txt), these are some searches you could run with regular expressions:
- To find every line that contains the word “waistcoat”, the command would be:
- grep "waistcoat" 11.txt
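- To find every two-letter word that ends in ‘t’ (such as it, at), the command would be:
- grep " .t " 11.txt
- The dot matches exactly one character of any kind, and the spaces on either side limit the match to two-character words. Words at the start or end of a line, or next to punctuation, are missed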
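- To find the lines that contain words ending in ‘ed’, the command would be:
- grep ".*ed " 11.txt
- The dot matches any single character and the star means zero or more repetitions of the preceding item, so .* matches any run of characters before ‘ed’. The trailing space marks the end of the word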
- To find either the word ‘she’ or ‘the’ in the text, the command would be:
- grep " [st]he " 11.txt
- The square brackets match exactly one of the characters listed inside them, so the pattern matches ‘he’ preceded by either ‘s’ or ‘t’. It will not match the word ‘he’ on its own
- If you wanted to search more than one text file, or a whole corpus stored in a directory, the command would be:
- grep -R "PATTERN" directory
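The searches above can be tried on a small sample before running them on the full text. The sketch below assumes a made-up file, sample.txt, in a temporary directory; the file name, the workdir variable, and the three sample lines are invented for illustration and are not taken from 11.txt.

```shell
# Build a small sample file so the searches can be tried quickly.
# sample.txt and its contents are hypothetical, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
The Rabbit took a watch out of its waistcoat pocket.
Alice looked at it and then she hurried on.
So he sat down and waited.
EOF

# Literal word: prints every line that contains "waistcoat".
grep "waistcoat" "$workdir/sample.txt"

# Two-letter word ending in t, with a space on each side.
grep " .t " "$workdir/sample.txt"

# Any run of characters ending in "ed" followed by a space.
grep ".*ed " "$workdir/sample.txt"

# " she " or " the ", but not " he ".
grep " [st]he " "$workdir/sample.txt"

# -c counts the matching lines instead of printing them.
grep -c " [st]he " "$workdir/sample.txt"

# -R searches every file under a directory and prefixes each match with its file name.
grep -R "waistcoat" "$workdir"
```

On this sample, the second, third, and fourth searches each print only the "Alice looked at it" line. The last sample line is printed by none of them: "waited." is followed by a full stop rather than a space, and " he " begins with neither s nor t. This shows how much these simple patterns depend on the surrounding spaces.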
The grep man page: in the terminal, type man grep; online, go to http://www.ss64.com/bash/grep.html
A good online tutorial is: A Tao of Regular Expressions
Mastering Regular Expressions (by Jeffrey Friedl) is the O’Reilly manual for regular expressions (2nd edition published 2002)
Regular expressions can also be used in many different programming languages.
Things to look out for:
- Every program that implements regular expressions has slightly different syntax.
- Some patterns cannot be described with regular expressions. These often involve nested or paired elements, such as parentheses balanced to an arbitrary depth.
- There are interesting connections between regular expressions, formal languages, and finite state machines or automata.
- Once you understand regular expressions, you can also use them to search and replace in text editors.
- Using the Unix or Linux pipe character ( | ) to string multiple commands together in sequence allows manipulations that would otherwise be difficult without specialized tools.
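Two of the points above, differing syntax and pipelines, can be illustrated with a short sketch. It assumes a grep that supports the -E, -o, and -w options (GNU and BSD grep both do); the sample file and the variable names are hypothetical.

```shell
# Sample text invented for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
Alice looked at it and then she hurried on.
She waited, and the Rabbit looked back.
EOF

# Basic syntax (plain grep): parentheses and | are ordinary characters,
# so this pattern finds nothing in the sample. grep exits with status 1
# when there is no match, hence the "|| true".
basic_count=$(grep -c " (she|the) " "$workdir/sample.txt" || true)

# Extended syntax (grep -E): the same pattern now means " she " or " the ".
extended_count=$(grep -E -c " (she|the) " "$workdir/sample.txt")

echo "basic: $basic_count, extended: $extended_count"

# Pipeline: pull out each word ending in "ed", then count how often each occurs.
# -o prints only the matched text and -w accepts whole-word matches only.
# sort groups identical words, uniq -c counts them, sort -rn puts the largest count first.
ed_counts=$(grep -o -w "[[:alpha:]]*ed" "$workdir/sample.txt" | sort | uniq -c | sort -rn)
echo "$ed_counts"
```

The first half shows why a pattern copied from one tool may silently match nothing in another. The second half extracts the words themselves rather than whole lines, which also catches "waited," next to a comma, something a pattern that relies on a trailing space would miss.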