Using Regular Expressions to Clean a Webpage
Introduction
This recipe uses regular expressions to clean up a webpage. Cleaning is a necessary first step if you want to carry out any meaningful textual analysis of a web page's content. With this method we can remove the HTML tags and other unnecessary textual elements, leaving only the text we want to analyze.
Ingredients
- Python3
- re
- urllib
- Example code on TAPoR
Steps
- Import content from a website by passing the URL of the site you want to clean to urllib and assigning the result to a variable
- Before cleaning the web content, eliminate extra lines and spaces using the re.sub function. This step is important because, by default, the "." metacharacter does not match newline characters, so a pattern meant to span several lines can fail to match without raising any error.
- Remove HTML tags by using Regex functions to isolate anything found between "< >" brackets and replacing it with a space.
- Eliminate any remaining web scripting and other unnecessary code, such as the contents of script and style elements
- Repeat step 2 to remove the extra spaces added during the cleaning process
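The steps above can be sketched roughly as follows. This is a minimal illustration, not the example code hosted on TAPoR: the function names (fetch_page, collapse_whitespace, clean_html) and the sample HTML string are hypothetical, and the page is assumed to be UTF-8 encoded. One deliberate departure from the order of the list: the sketch removes script and style blocks before the generic tag pattern, because once the tags are gone their contents can no longer be located.

```python
import re
from urllib.request import urlopen


def fetch_page(url):
    """Step 1: download a page and return its HTML as a string.

    Assumes the page is UTF-8 encoded; undecodable bytes are replaced.
    """
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collapse_whitespace(text):
    """Steps 2 and 5: turn every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def clean_html(raw_html):
    """Apply the cleaning steps to an HTML string and return plain text."""
    text = collapse_whitespace(raw_html)
    # Step 4 (done early in this sketch): drop <script> and <style>
    # blocks together with everything inside them.
    text = re.sub(
        r"<(script|style)\b.*?</\1\s*>",
        " ",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Step 3: replace anything between "<" and ">" with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Step 5: remove the extra spaces introduced by the replacements.
    return collapse_whitespace(text)


# Hypothetical sample page, used so the sketch runs without a network call.
sample = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1 < 2;</script></head>
<body><h1>Hello</h1>
<p>Regular   expressions
are <b>useful</b></p></body></html>"""

cleaned = clean_html(sample)
print(cleaned)  # Demo Hello Regular expressions are useful
```

To clean a live page, the two functions can be chained, for example clean_html(fetch_page("https://example.com")), where the URL is a placeholder for the site you want to clean.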
Discussion
Importing a web page into Python also brings in all of its HTML markup, which can get in the way of any textual analysis we need to do on the content. Regex is a fairly simple way to eliminate those elements. We can even use it to isolate certain tags and lines of code that we might need for our project. The Regex required for cleaning is going to change from site to site, so be sure to check a full list of Regex metacharacters and their associated functions.
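As a rough illustration of isolating particular tags instead of deleting them, the sketch below pulls the title, the paragraph contents, and the link targets out of a small HTML string. The sample markup and variable names are hypothetical, and the patterns assume simple, well-formed tags with double-quoted attributes; a real site will likely need its own patterns.

```python
import re

# Hypothetical sample markup standing in for a downloaded page.
sample_html = (
    "<html><head><title>Demo Page</title></head><body>"
    '<p>First <a href="https://example.com/a">link</a>.</p>'
    "<p>Second paragraph.</p>"
    "</body></html>"
)

# "(.*?)" captures as little as possible between the opening and closing
# tag; re.DOTALL lets "." match newlines if the element spans lines.
title_match = re.search(
    r"<title[^>]*>(.*?)</title>", sample_html, flags=re.DOTALL | re.IGNORECASE
)
title = title_match.group(1).strip() if title_match else None

# re.findall returns the captured group for every non-overlapping match.
paragraphs = re.findall(
    r"<p\b[^>]*>(.*?)</p>", sample_html, flags=re.DOTALL | re.IGNORECASE
)

# Capture the value of each double-quoted href attribute.
links = re.findall(r'href="([^"]*)"', sample_html)

print(title)
print(paragraphs)
print(links)
```

The captured paragraphs still contain any tags nested inside them, so they can be passed through the same tag-removal step described in the recipe.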
Status
Submitted by Kaitlyn on Thu, 03/22/2018 - 22:20