Using Regular Expressions to Clean a Text
This recipe uses regular expressions (or Regex) to clean a text document. It is based on the "Using Regular Expressions to Clean a Text" code.
- A text file
- Import the regular expressions library
- Load and read the text file by assigning it to a variable
- Now you can clean the text using any of the regular expression functions, such as:
  - Eliminate white space
  - Isolate dialogue
  - Remove punctuation
  - Divide your text into a list of sentences
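The steps above can be sketched in Python as follows. The recipe does not name a language, so Python and its built-in `re` module are assumptions here, as are the function names, the file name `my_text.txt`, and the sample string. Treat this as one possible starting point rather than the recipe's own code.

```python
import re
from pathlib import Path


def load_text(path):
    """Read the whole file into one string (UTF-8 encoding is an assumption)."""
    return Path(path).read_text(encoding="utf-8")


def eliminate_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def isolate_dialogue(text):
    """Return the passages enclosed in straight double quotes."""
    return re.findall(r'"([^"]*)"', text)


def remove_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def split_sentences(text):
    """Split on whitespace that follows ., !, or ? (optionally plus a closing quote)."""
    return re.split(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+', text.strip())


# With a real file you would start from: raw = load_text("my_text.txt")
# A small made-up sample keeps the sketch runnable without a file.
sample = 'She  said,\n"Hello there."   He replied, "Go away!" Then silence.'

clean = eliminate_whitespace(sample)
dialogue = isolate_dialogue(clean)
no_punctuation = remove_punctuation(clean)
sentences = split_sentences(clean)
```

The sentence splitter is deliberately simple: it also splits after abbreviations such as "Mr." and does not recognize curly quotation marks, so a real text may need a more careful pattern.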
Regex is a versatile way to clean and slice text. Its strength lies in its succinct code and in how similar its syntax is across most programming languages.
You can use Project Gutenberg to find text files to practice with, although some further cleaning may be required to remove the additional material added by the website, including trademarks, notes, and branding.
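As a rough illustration of that extra cleaning, the sketch below keeps only the text between a start marker and an end marker. The marker wording in the two patterns is an assumption about how such files are usually delimited and varies between files, so check it against the file you downloaded. The function name `strip_gutenberg_boilerplate` and the sample string are hypothetical.

```python
import re

# Assumed marker wording; adjust the patterns to match your file.
START = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they are present."""
    start = START.search(text)
    begin = start.end() if start else 0
    # Look for the end marker only after the start marker.
    end = END.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Made-up sample standing in for a downloaded file.
sample_file = (
    "Site header and notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License and branding"
)
body = strip_gutenberg_boilerplate(sample_file)
```

If neither marker is found, the function returns the input with only leading and trailing white space removed, so nothing is lost when the assumption about the markers does not hold.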
Regular expressions (or Regex) are a pattern-matching technique available in many programming languages. Regex combines metacharacters (such as ., ^, and ?) with literal strings to carry out its operations. For a full list of Regex metacharacters and their associated functions, please see the Regex cheatsheet: http://www.rexegg.com/regex-quickstart.html
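To make the difference between literal strings and metacharacters concrete, here is a short sketch using Python's `re` module. Python is an assumption, since the recipe does not name a language, and the sample sentence and variable names are invented for illustration.

```python
import re

line = "The cat sat. The cot sat? The cut sat!"

# A literal string matches exactly itself.
literal = re.findall(r"cat", line)

# An unescaped . is a metacharacter: it matches any single character.
any_char = re.findall(r"c.t", line)

# A backslash turns the metacharacter back into a literal full stop.
escaped = re.findall(r"sat\.", line)

# ^ anchors the match to the start of the string.
anchored = re.findall(r"^The", line)

# ? makes the preceding character optional.
optional = re.findall(r"cats?", "cat cats")

# Inside square brackets, . ? ! are treated as literal characters.
in_class = re.findall(r"sat[.?!]", line)
```

Running this gives `['cat']`, `['cat', 'cot', 'cut']`, `['sat.']`, `['The']`, `['cat', 'cats']`, and `['sat.', 'sat?', 'sat!']` respectively.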
After cleaning your text, you can explore it using text analysis tools.