Using Regular Expressions to Clean a Webpage

Introduction 

This recipe uses regular expressions to clean up a webpage. Cleaning is a necessary first step if you want to carry out any meaningful textual analysis of the content of a webpage. With this method we can remove the HTML tags and other unnecessary textual elements.

Ingredients 
  • Python3
  • re
  • urllib
  • Example code on TAPoR
Steps 
  • Import content from a website by using the URL of the site you want to clean and assigning the content to a variable
  • Before cleaning the web content, eliminate extra lines and spaces using the re.sub function. This step is important because, by default, the "." metacharacter does not match newline characters, so a pattern meant to span several lines can silently fail to match.
  • Remove HTML tags by using Regex functions to isolate everything found between “< >” brackets and then replacing it with a space.
  • Eliminate any remaining web scripting and other unnecessary code
  • Repeat step 2 to remove the extra spaces added during the cleaning process
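
The steps above can be sketched as follows. This is a minimal illustration, not the example code hosted on TAPoR: the function names (fetch_html, collapse_whitespace, clean_html), the sample string, and the UTF-8 decoding are assumptions made for the example. One deliberate deviation: script, style, and comment blocks are removed before the remaining tags, so that their contents can still be located by the tags that enclose them.

```python
import re
import urllib.request


def fetch_html(url):
    """Step 1: download a page and return it as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collapse_whitespace(text):
    """Steps 2 and 5: replace each run of spaces, tabs, and newlines with one space."""
    return re.sub(r"\s+", " ", text).strip()


def clean_html(html):
    """Apply the cleaning steps to a string of HTML and return plain text."""
    text = collapse_whitespace(html)
    # Step 4 (done early here): drop script/style elements together with their contents.
    text = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", text,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop HTML comments.
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    # Step 3: replace every remaining tag with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Step 5: tidy up the spaces introduced by the replacements.
    return collapse_whitespace(text)


# A small hypothetical page, used instead of a live URL so the example runs offline.
sample = """<html><head><title>Demo</title>
<script>var x = 1;</script></head>
<body><h1>Hello</h1>
<p>Regex   is <b>useful</b> here</p></body></html>"""

print(clean_html(sample))  # Demo Hello Regex is useful here
```

To clean a live page, pass the result of fetch_html to clean_html, for example clean_html(fetch_html("https://example.org")). The patterns will likely need adjusting for each site.
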
Discussion 

Importing websites into Python also brings in all of the additional HTML elements that may get in the way of any textual analysis we need to do on the web content. Regex is a fairly simple way to eliminate those elements. We can even use it to isolate certain tags and lines of code that we might need for our project. The Regex required for cleaning will change from site to site, so be sure to check a full list of Regex metacharacters and their associated functions.
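
As an illustration of isolating certain tags rather than deleting them, the sketch below pulls the address and link text out of every link in a snippet of HTML. The function name extract_links and the sample snippet are hypothetical, and the pattern assumes href values are wrapped in double quotes, which will not hold for every site.

```python
import re

# Capture the href value (group 1) and the link text (group 2) of each <a> element.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, text) pairs for the links found in html."""
    return LINK_PATTERN.findall(html)


snippet = (
    '<p>See <a href="https://example.org/a">first</a> and '
    '<a class="ext" href="https://example.org/b">second</a>.</p>'
)

for url, text in extract_links(snippet):
    print(url, text)
```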

Next steps / further information 

Once your text is clean to your specifications, you can use text analysis tools to explore it further.

Regular Expressions (or Regex) are a pattern-matching technique supported by many programming languages. Regex combines metacharacters (such as . ^ $ * + ?) with literal strings to describe the text it should match.
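
The difference between a metacharacter and a literal character can be seen in a few short searches. The strings below are made up for the illustration.

```python
import re

sample = "3.50 3x50"

# An unescaped "." is a metacharacter: it matches any character except a newline.
any_char = re.findall(r"3.50", sample)

# Escaping it with a backslash makes it a literal full stop.
literal_dot = re.findall(r"3\.50", sample)

# "?" makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour")

# "^" anchors the pattern to the start of the string.
starts_with_regex = bool(re.search(r"^Regex", "Regex is useful"))

print(any_char)           # ['3.50', '3x50']
print(literal_dot)        # ['3.50']
print(optional_u)         # ['color', 'colour']
print(starts_with_regex)  # True
```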

A Regex cheatsheet is available here: http://www.rexegg.com/regex-quickstart.html 
