Mapping Locations from a Corpus
This recipe explores how to analyze a corpus for the locations that are written about within it. These results can be mapped to visualize the spatial focus of the corpus. This recipe is based on an IPython notebook by Matthew Wilkens.
- Python 3
- Stanford NER Java Package
- Google Cloud Services API key
- A corpus of texts to analyze
- Matthew Wilkens' Example Code
Recognizing Named Entities with Stanford NLP
In this recipe we will use the Stanford named entity recognizer (Stanford NER) to identify locations. The recognizer works by learning (from hand-tagged training data) the words and word types that are typically used as locations (and other types of named entities) in context. This means that it can recognize places that were not present in the training data if the context in which they appear strongly indicates a place name. For instance, "I was born in Xxx" obviously contains a place name ("Xxx"), even though we don't know what that place is. Unfortunately, this method requires a trained entity model for each language you want to analyze.
To determine locations, run the Stanford Named Entity Recognition package from the command line on each text file you want to analyze, writing the output for each one to a tab-separated (.tsv) file. After completing this, import the corpus and create a metadata table describing the texts, then import and parse the .tsv files to extract the locations. Create three lists of equal length, holding the file ID, the location string, and the number of occurrences of that location in that file. Once we have this data, we can combine it in database-like ways.
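The parsing step can be sketched as follows. This is a minimal sketch, not Matthew Wilkens' code. It assumes each .tsv line holds one token and its tag separated by a tab (the layout Stanford NER produces with `-outputFormat tsv`), and that runs of adjacent `LOCATION` tokens belong to one place name. The function names, the sample data, and the paths in the command shown in the comment are hypothetical.

```python
from collections import Counter

# One possible invocation (paths are assumptions; adjust to your install):
#   java -mx1g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
#     -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
#     -textFile vol1.txt -outputFormat tsv > vol1.tsv

def locations_from_tsv(lines):
    """Count location strings in token<TAB>tag lines from one file."""
    counts = Counter()
    current = []  # tokens of the location currently being assembled
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1] == "LOCATION":
            current.append(parts[0])
        elif current:
            # A non-location token (or blank line) ends the current place name.
            counts[" ".join(current)] += 1
            current = []
    if current:
        counts[" ".join(current)] += 1
    return counts

def build_lists(tagged_by_file):
    """Return three parallel lists: file IDs, location strings, counts."""
    file_ids, places, occurrences = [], [], []
    for file_id, lines in tagged_by_file.items():
        for place, n in sorted(locations_from_tsv(lines).items()):
            file_ids.append(file_id)
            places.append(place)
            occurrences.append(n)
    return file_ids, places, occurrences

# Hypothetical tagged output for a single file.
sample = {
    "vol1": [
        "I\tO", "was\tO", "born\tO", "in\tO",
        "New\tLOCATION", "York\tLOCATION", ".\tO", "",
        "London\tLOCATION", "calls\tO", "London\tLOCATION",
    ]
}
file_ids, places, occurrences = build_lists(sample)
```

To use this with real output, read each .tsv file with `open(path, encoding="utf-8")` and pass the file object as the list of lines.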
At this stage, you can plot the location data to see which areas are written about the most. Two useful measures are the total number of times a location is mentioned across the corpus and the number of different books that mention it.
It is useful to build a summary dataframe with total occurrence counts for easier manipulation (rather than recalculating every time we need the numbers). This list will likely be very long, and we can cull much of the data by choosing to work with a more limited set of important terms, for instance only using terms that appear more than 5 times in at least 2 volumes. The tension here is between recall and precision, and the thresholds can be tweaked to fit the inquiry.
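As an illustration, the summary and the culling step might look like the sketch below. It uses plain Python rather than a dataframe library so that it stays self-contained. It assumes the three parallel lists described above, and it reads "more than 5 times in at least 2 volumes" as a total count above 5 combined with mentions in 2 or more volumes, which is one possible interpretation. All names and the sample data are hypothetical.

```python
from collections import defaultdict

def summarize(file_ids, places, occurrences):
    """Total mentions and number of distinct volumes for each place."""
    totals = defaultdict(int)
    volumes = defaultdict(set)
    for file_id, place, n in zip(file_ids, places, occurrences):
        totals[place] += n
        volumes[place].add(file_id)
    return {
        place: {"total": totals[place], "volumes": len(volumes[place])}
        for place in totals
    }

def filter_places(summary, more_than=5, min_volumes=2):
    """Keep places mentioned more than `more_than` times overall
    and in at least `min_volumes` different volumes."""
    return {
        place: stats
        for place, stats in summary.items()
        if stats["total"] > more_than and stats["volumes"] >= min_volumes
    }

# Hypothetical data: three volumes, three places.
file_ids = ["a", "a", "b", "b", "c"]
places = ["London", "Paris", "London", "Rome", "Rome"]
occurrences = [4, 9, 3, 1, 1]

summary = summarize(file_ids, places, occurrences)
kept = filter_places(summary)
```

Here Paris is dropped despite its high count because only one volume mentions it, and Rome is dropped because it is mentioned too rarely. Lowering either threshold keeps more places at the cost of more noise.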
Performing Geocoding with Google APIs
In order to perform geocoding you will need to sign up for a free Google Cloud Services API key. Google API keys are short (roughly 40-character) strings of random-looking text (e.g. 'AIzaJNEjnvslrknvslDWNNW9fr6DWojnsvokjtl'). You should store yours somewhere non-public, since it allows applications to run against Google services (potentially incurring charges) on your behalf.
- Go to https://cloud.google.com/storage/docs/json_api/v1/how-tos/authorizing#APIKey
- Click on the "Credentials page" link in step 1
- Click "Create a Project" (the site might sit and think for about 30 seconds)
- Click "Create Credentials" and select "API Key" (not OAuth or Service Account) from the dropdown menu
- Copy the key that is given to you and paste it into a plain text document, storing it somewhere safe on your computer
- Explore Matthew Wilkens' Example Code
- Try the Named Entity Recognition step with a fictional-world corpus such as Tolkien's The Lord of the Rings/The Hobbit/The Silmarillion. Once locations are identified, consider how to map them without Google.
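For the Google geocoding step described above, a request for a single place name might be sketched as follows. This is a minimal sketch that assumes the Geocoding API is enabled for the project your key belongs to and that the key is passed in by the caller (for example, read from the plain text file where you stored it). The function names and sample data are hypothetical, and the sample response is a hand-written fragment in the shape the API returns, not real output.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(place, api_key):
    """Build the request URL for one place name."""
    query = urllib.parse.urlencode({"address": place, "key": api_key})
    return GEOCODE_URL + "?" + query

def parse_geocode(payload):
    """Return (lat, lng) from a decoded response, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

def geocode(place, api_key):
    """Query the service over the network (may incur charges on your key)."""
    with urllib.request.urlopen(geocode_url(place, api_key)) as response:
        return parse_geocode(json.load(response))

# Hand-written response fragments for illustration only.
sample_ok = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 51.5, "lng": -0.12}}}],
}
sample_empty = {"status": "ZERO_RESULTS", "results": []}
```

Invented place names from a fictional corpus will often return no result or an unrelated real-world match, which is one reason to consider mapping them by other means.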