Using Regular Expressions to Clean a Text
Introduction
This recipe uses regular expressions (regex) to clean a text document. It is based on the Using Regular Expressions to Clean a Text code.
Ingredients
- Python3
- re
- A text file
Steps
- Import the regular expressions library
- Open the text file, read it, and assign its contents to a variable
- Now you can clean the text using any of the regular expression functions, such as:
- Eliminate White Space
- Isolate Dialogue
- Remove Punctuation
- Divide your text into a list of sentences
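The steps above can be sketched in Python as follows. This is a minimal illustration, not the recipe's original code: the function names, the UTF-8 encoding, and the specific patterns are all assumptions, and each pattern is only one of several reasonable ways to do the job.

```python
import re


def load_text(path):
    """Read the whole file into one string (UTF-8 encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def eliminate_whitespace(text):
    # Collapse every run of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()


def isolate_dialogue(text):
    # Capture whatever sits between straight or curly double quotes.
    return re.findall(r'["“]([^"“”]+)["”]', text)


def remove_punctuation(text):
    # Drop every character that is not a letter, digit, underscore, or whitespace.
    return re.sub(r"[^\w\s]", "", text)


def split_sentences(text):
    # Split on whitespace that follows ., !, or ?.
    # Naive on purpose: abbreviations such as "Mr." will also trigger a split.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical sample text; replace with load_text("your_file.txt").
sample = 'She said, "Hello there."   Then she left.\n\nWas it late?'
clean = eliminate_whitespace(sample)
print(clean)
print(isolate_dialogue(clean))
print(remove_punctuation(clean))
print(split_sentences("She left. Was it late? Yes!"))
```

The order of the steps matters: isolate dialogue and split sentences before removing punctuation, since both rely on quotation marks and sentence-ending punctuation still being present.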
Discussion
Regex is a versatile way to clean and slice text. Its strength lies in its succinct code and in a syntax that is similar across most programming languages.
You can use Project Gutenberg to find text files to practice with, although some further cleaning may be required to remove the additional material added by the website, including trademarks, notes, and branding.
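One possible way to trim that extra material is sketched below. It assumes the file wraps the book in lines such as "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF THE PROJECT GUTENBERG EBOOK ... ***". That marker wording is an assumption and varies between files, so check your own download and adjust the patterns. The function name is hypothetical.

```python
import re

# Assumed marker wording; older files may say "THIS" instead of "THE".
START_MARKER = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*]*\*\*\*",
    re.IGNORECASE,
)
END_MARKER = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*]*\*\*\*",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they are found."""
    start = START_MARKER.search(text)
    begin = start.end() if start else 0
    # Look for the end marker only after the start marker.
    end = END_MARKER.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```

If neither marker is found, the function returns the text unchanged apart from trimming leading and trailing whitespace, so it is safe to run on files from other sources.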
Status
Submitted by Kaitlyn on Thu, 03/22/2018 - 22:28