Creating a Basic Web Scraper with Python
Introduction
This recipe shows how to build a simple web scraper with Python that downloads a list of pages and saves their contents to a file.
Ingredients
- Python 3
- A notebook editor such as Jupyter
- Example code from The Art of Literary Text Analysis (forthcoming)
Steps
- Create a text file listing the URLs to scrape, with each URL on its own line
- Create a list of URLs by reading the file into a string and splitting it on the newline character
- Import the urllib.request module
- Test the approach by scraping a single page that you know works
- Create a loop that scrapes each page and stores its contents in a list:
  - Initialize an empty list before the loop
  - In each iteration, open the current URL, read the response, decode it to a string (read() returns bytes), and append the result to the list
- Create a second for loop to print the results in the list:
  - In each iteration, print a segment of the scraped text (e.g. text[:200] for the first 200 characters) followed by a newline character "\n" to leave a blank line between the scraped pages
- Save the results to a file:
  - Open a new file for writing
  - Create a loop to cycle through the list of scraped pages. In each iteration:
    - Create a string containing the current list item (the scraped page)
    - Add a newline character to the string to separate the pages in the file
    - Write the string to the file
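The steps above can be sketched roughly as follows. This is a minimal illustration, not the example code from The Art of Literary Text Analysis: the function names (read_url_list, scrape_pages, print_previews, save_results) are hypothetical, and it assumes every page can be decoded as UTF-8 and that every URL in the list can be reached.

```python
import urllib.request


def read_url_list(path):
    """Read a text file with one URL per line and return the URLs as a list."""
    with open(path, encoding="utf-8") as url_file:
        contents = url_file.read()
    # Split the string on the newline character, skipping blank lines
    return [line.strip() for line in contents.split("\n") if line.strip()]


def scrape_pages(urls):
    """Open each URL, read it, and collect the decoded contents in a list."""
    pages = []  # empty list created before the loop
    for url in urls:
        with urllib.request.urlopen(url) as response:
            raw_bytes = response.read()  # read() returns bytes, not str
        # Assumption: pages are UTF-8; undecodable bytes are replaced
        pages.append(raw_bytes.decode("utf-8", errors="replace"))
    return pages


def print_previews(pages, length=200):
    """Print the first `length` characters of each page, then a blank line."""
    for text in pages:
        print(text[:length] + "\n")


def save_results(pages, path):
    """Write every scraped page to one file, separated by newlines."""
    with open(path, "w", encoding="utf-8") as out_file:
        for text in pages:
            out_file.write(text + "\n")
```

With a URL list saved as urls.txt (a placeholder file name), the functions could be chained in a notebook cell: pages = scrape_pages(read_url_list("urls.txt")), then print_previews(pages), then save_results(pages, "scraped_pages.txt"). A URL that fails to load will raise an exception and stop the loop, so it is worth trying a single known-good page first, as the steps suggest.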
Submitted by GregWS on Wed, 03/29/2017 - 16:40