Mapping Locations from a Corpus

Introduction 

This recipe explores how to analyze a corpus for the locations written about within it. These results can be mapped to visualize the spatial focus of the corpus. This recipe is based on an IPython notebook by Matthew Wilkens.

Ingredients 
Steps 

Recognizing Named Entities with Stanford NLP
In this recipe we will use the Stanford Named Entity Recognizer to identify locations. The Stanford named entity recognizer works by learning (from hand-tagged training data) the words and word types that are typically used as locations (and other types of named entities) in context. This means that it can recognize places that were not present in the training data if the context in which they appear strongly indicates a place name. For instance, "I was born in Xxx" obviously contains a place name ("Xxx"), even though we don't know what that place is. Unfortunately, this method requires a trained entity model for each language you want to analyze.

To determine locations, run the Stanford Named Entity Recognition package from the command line, passing in each text file you want to analyze and outputting a tab-separated (.tsv) file. After completing this, import the corpus and create a metadata table for its information, then import and parse the .tsv files to determine locations. Create three lists of equal length, identifying file IDs, location strings, and the number of occurrences in each file. Once we have this data, we can combine it in database-like ways.
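The parsing step can be sketched as follows. This assumes the NER output is one "token<TAB>tag" pair per line (the exact layout depends on the output format you choose when running Stanford NER), and the sample lines are invented for illustration:

```python
from collections import Counter

def count_locations(lines):
    """Count LOCATION-tagged tokens in Stanford NER tabbed output.
    Assumes each line is 'token<TAB>tag'; consecutive LOCATION tokens
    are merged into one place name (e.g. 'New' + 'York')."""
    counts = Counter()
    current = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2 and parts[1] == 'LOCATION':
            current.append(parts[0])
        else:
            if current:  # a place name just ended; record it
                counts[' '.join(current)] += 1
                current = []
    if current:  # flush a place name that ends the file
        counts[' '.join(current)] += 1
    return counts

# Invented sample standing in for one file's .tsv contents.
sample = ["I\tO", "visited\tO", "New\tLOCATION", "York\tLOCATION",
          "and\tO", "Paris\tLOCATION", "twice\tO"]
location_counts = count_locations(sample)
```

Running this once per file, then appending the file's ID alongside each place and its count, yields the three parallel lists described above.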

At this stage, you can plot the location data to see which areas are written about the most. Some possible things to examine are the total number of times a location is mentioned across the corpus, or how many different books mention a place.

It is useful to build a summary dataframe with total occurrence counts for easier manipulation (rather than recalculating every time we need the numbers). This list will likely be very long, and we can cull much of the data by choosing to work with a more limited set of important terms, for instance only using terms that appear more than 5 times in at least 2 volumes. The tension here is between recall and precision, and the thresholds can be tweaked to fit the inquiry.
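One reading of that threshold (a place must occur more than 5 times in each of at least 2 volumes), sketched on toy data with plain dictionaries; the notebook itself works with dataframes:

```python
from collections import defaultdict

# Parallel lists as built in the parsing step (toy data for illustration).
file_ids = ['vol1', 'vol1', 'vol2', 'vol2', 'vol3']
places   = ['London', 'Paris', 'London', 'Rome', 'London']
counts   = [8, 2, 6, 1, 7]

totals = defaultdict(int)          # total mentions across the corpus
strong_volumes = defaultdict(int)  # volumes mentioning the place > 5 times

for fid, place, n in zip(file_ids, places, counts):
    totals[place] += n
    if n > 5:
        strong_volumes[place] += 1

# Keep only places that clear both thresholds; raising or lowering
# them trades recall against precision.
important = {p for p in totals if strong_volumes[p] >= 2}
```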

Performing Geocoding with Google APIs
In order to perform geocoding you will need to sign up for a free Google Cloud Services API key. Google API keys are short (40-character) strings of random-looking text (e.g. 'AIzaJNEjnvslrknvslDWNNW9fr6DWojnsvokjtl'). You should store yours somewhere non-public, since it allows applications to run against Google services (potentially incurring charges) on your behalf.

If you do not already have an API Key, follow these steps to get one:
  • Go to https://cloud.google.com/storage/docs/json_api/v1/how-tos/authorizing#APIKey
  • Click on the "Credentials page" link in step 1
  • Click "Create a Project" (the site might sit and think for about 30 seconds)
  • Click "Create Credentials" and select "API Key" (not OAuth or Service Account) from the dropdown menu
  • Copy the key that is given to you and paste it into a plain text document, storing it somewhere safe on your computer
Geocoding with Google requires two stages. First, use Google's Places API to identify the location in question, then use the place ID you receive with the Geocoding API to look up the full details for that location. These APIs return JSON data, which looks a lot like a Python dictionary; the googlemaps client parses the incoming JSON into Python dictionaries for you. The full geo data returned by the API is likely more than you need, so extract only the fields you require. Remember to deliberately limit the number of queries per second to avoid going over usage limits.
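A minimal sketch of the two stages and the field extraction. The googlemaps calls shown in the comments are assumptions about that client's API and are not executed here; the sample response and its coordinates are illustrative only:

```python
# Two-stage lookup with the googlemaps client (assumed API; not run here):
#   import googlemaps
#   gmaps = googlemaps.Client(key=API_KEY)
#   found = gmaps.find_place('Dublin', 'textquery')                       # stage 1
#   result = gmaps.geocode(place_id=found['candidates'][0]['place_id'])[0]  # stage 2
#   time.sleep(0.1)  # throttle between queries to stay under usage limits

# One result dict, shaped like the Geocoding API's JSON (values illustrative).
sample_result = {
    'formatted_address': 'Dublin, Ireland',
    'geometry': {'location': {'lat': 53.3498053, 'lng': -6.2603097}},
    'types': ['locality', 'political'],
}

def extract_geo(result):
    """Keep only the fields we need from one geocoding result."""
    loc = result['geometry']['location']
    return {'address': result['formatted_address'],
            'lat': loc['lat'], 'lng': loc['lng'],
            'types': result['types']}

geo = extract_geo(sample_result)
```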
 
Now that you have the geodata, combine it with the existing dataset to geocode the locations and have everything in one place.
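The combination step, again on toy data (the notebook does this as a database-style merge); places that failed to geocode are simply dropped:

```python
# Per-place totals from the counting step and coordinates from geocoding
# (both invented for illustration).
totals = {'London': 21, 'Paris': 2}
coords = {'London': (51.5074, -0.1278), 'Paris': (48.8566, 2.3522)}

geocoded = [
    {'place': p, 'count': n, 'lat': coords[p][0], 'lng': coords[p][1]}
    for p, n in totals.items()
    if p in coords  # skip places the geocoder could not resolve
]
```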
 
Mapping the Data
Import mapping libraries into Python and use them to map the locations. Play with mapping different bits of the data, such as only mapping cities, or aggregating places by their country and plotting the countries. The dots of locations on the map can be scaled to reflect the number of occurrences, visualizing the spatial emphasis of the corpus.
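The dot scaling can be sketched as below; the records are the toy data from the earlier steps, and the resulting radii could feed, say, folium's CircleMarker or a matplotlib scatter (both hypothetical choices here):

```python
import math

# Toy geocoded records (place, total mentions, coordinates).
geocoded = [
    {'place': 'London', 'count': 21, 'lat': 51.5074, 'lng': -0.1278},
    {'place': 'Paris',  'count': 2,  'lat': 48.8566, 'lng': 2.3522},
]

def marker_radius(count, base=3.0):
    """Radius in pixels, sqrt-scaled so the dot's *area* tracks the count."""
    return base * math.sqrt(count)

radii = {r['place']: marker_radius(r['count']) for r in geocoded}
```

Scaling the radius by the square root keeps heavily mentioned places from visually overwhelming the map, since readers perceive a dot's size as its area rather than its radius.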
Next steps / further information 
  • Explore Matthew Wilkens' Example Code
  • Try the Named Entity Recognition step with a fictional-world corpus such as Tolkien's The Lord of the Rings/The Hobbit/The Silmarillion. Once locations are identified, consider how to map them without Google.
Status