Creating a Basic Web Scraper with Python

Introduction 

This recipe shows how to create a simple web scraper with Python.

Ingredients 
  • Python 3
  • A notebook editor such as Jupyter
  • Example code from The Art of Literary Text Analysis (forthcoming)

Steps 
  • Create a text file listing the URLs to scrape, with each URL on a new line
  • Create a list of URLs to scrape by reading the file into a string and splitting it on the newline character
  • Import the urllib.request library
  • Try to scrape a single page that you know works
  • Create a loop that scrapes each site and puts it in a list:
    • Prime an empty list before the loop
    • In each loop iteration, open the current URL, read the page, and save the contents as a new list item
  • Create a new for loop to print out the results in the list:
    • For each iteration of the loop, print a segment of the scraped text (e.g. text[:200] to print the first 200 characters) and a newline character "\n" to create a space between the scraped sites
  • Save a file of the results:
    • Open a new file to write
    • Create a loop to cycle through the list of scraped sites. With each iteration of the loop:
      • Create a string containing the current list item (scraped website)
      • Add a newline character to the string to create space between each website in the file
      • Write the string to the file
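
The steps above can be sketched roughly as follows. This is only one possible arrangement, not the code from The Art of Literary Text Analysis: the function names (read_url_list, scrape_page, scrape_all, print_previews, save_pages) are made up for illustration, and the sketch assumes that every page can be decoded as UTF-8.

```python
import urllib.request


def read_url_list(path):
    """Read a text file and return its non-empty lines as a list of URLs."""
    with open(path, encoding="utf-8") as url_file:
        contents = url_file.read()
    # Split on the newline character; skip blank lines and stray whitespace.
    return [line.strip() for line in contents.split("\n") if line.strip()]


def scrape_page(url):
    """Open one URL, read it, and return the contents as a string."""
    with urllib.request.urlopen(url) as response:
        raw_bytes = response.read()
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not raised.
    return raw_bytes.decode("utf-8", errors="replace")


def scrape_all(urls):
    """Scrape each URL in turn and collect the results in a list."""
    pages = []  # prime an empty list before the loop
    for url in urls:
        pages.append(scrape_page(url))
    return pages


def print_previews(pages, length=200):
    """Print the first `length` characters of each scraped page."""
    for text in pages:
        # The added "\n" leaves a blank line between the scraped sites.
        print(text[:length] + "\n")


def save_pages(pages, path):
    """Write every scraped page to one file, separated by newlines."""
    with open(path, "w", encoding="utf-8") as out_file:
        for text in pages:
            out_file.write(text + "\n")


# Example usage (the file names here are placeholders):
# urls = read_url_list("urls.txt")
# pages = scrape_all(urls)
# print_previews(pages)
# save_pages(pages, "results.txt")
```

Calling scrape_page on a single known-good URL first is an easy way to check that scraping works before looping over the whole list.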

Further information 
  • This recipe is based on an example in The Art of Literary Text Analysis (forthcoming)
  • Try to write a function that tests whether a link is bad before scraping it, so that your file does not need to contain only working links
  • Try to automate the cleanup of the HTML formatting so that only the text remains, for example with Beautiful Soup for Python
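
As a starting point for the link-testing suggestion, here is a minimal sketch that uses only urllib from the standard library. The names link_works and keep_good_links are hypothetical, and treating "the URL can be opened" as the definition of a good link is an assumption; a stricter check might also inspect the page contents.

```python
import urllib.request


def link_works(url, timeout=10):
    """Return True if the URL can be opened, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, ValueError):
        # URLError and its subclass HTTPError are OSError subclasses, so this
        # covers unreachable hosts and HTTP error responses.
        # ValueError is raised for malformed URLs, e.g. a missing scheme.
        return False


def keep_good_links(urls):
    """Return only the URLs that could be opened."""
    return [url for url in urls if link_works(url)]
```

Filtering the URL list with keep_good_links before the scraping loop means one bad link no longer stops the whole run, at the cost of requesting each page twice.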