Locate and identify
This recipe is part of the Text Analysis for Twitter Research (TATR) series, and will look at tokenizing and extracting key features from a Tweet.
This recipe will show you how to prepare Voronoi diagrams, one way of showing relationships between words in a text and a search term. To do this, we will use word embeddings, which represent individual words in a text as real-valued vectors in a confined vector space. This recipe is based on Kynan Ly's cookbook as seen on this notebook.
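To illustrate what "words as real-valued vectors" means in practice, here is a minimal sketch that ranks words by cosine similarity to a search term. The three-dimensional vectors, the word list, and the helper names `cosine_similarity` and `rank_by_similarity` are invented for illustration; they are not taken from the original notebook, and a real recipe would load vectors from a trained embedding model.

```python
import math

# Hypothetical toy embeddings; real models use hundreds of dimensions.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(term, vectors):
    """Rank every other word by its similarity to the search term."""
    target = vectors[term]
    scores = {
        word: cosine_similarity(target, vec)
        for word, vec in vectors.items()
        if word != term
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_similarity("king", embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Words whose vectors point in a similar direction to the search term score closer to 1, which is the kind of relationship a diagram of the embedding space is meant to make visible.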
Principal Component Analysis (PCA) is a method to convert sets of document terms into a data frame that can then be visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to another. PCA tries to identify a smaller number of uncorrelated variables, called "principal components", from the dataset. The goal is to explain the maximum amount of variance with the fewest principal components. This recipe deals with several advanced text analysis concepts and methods.
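As a rough sketch of the idea, the following example projects a small document-term matrix onto its first two principal components using NumPy's singular value decomposition. The term counts, the text labels, and the `pca` helper are made up for illustration and are not part of the recipe itself.

```python
import numpy as np

# Hypothetical document-term counts: rows are texts, columns are terms.
terms = ["whale", "sea", "love", "marriage"]
counts = np.array(
    [
        [9, 7, 1, 0],  # text A
        [8, 6, 0, 1],  # text B
        [0, 1, 9, 8],  # text C
    ],
    dtype=float,
)


def pca(matrix, n_components=2):
    """Project the rows of `matrix` onto their first principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by variance explained.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]


coords, explained = pca(counts)
print(coords)     # one 2-D point per text, ready for a scatter plot
print(explained)  # share of variance captured by each component
```

In this toy data, texts A and B share vocabulary and land close together, while text C sits far from both; that distance is what the visualization expresses.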
Similar to finding People and Characters, finding locations in text is a common exploratory technique. This recipe shows how to extract places, countries, and cities from a text. We will use the Named-Entity Recognition (NER) module of the NLTK library to achieve this. This recipe is based on Jinman Zhang's cookbook.
In this recipe we use 3 ebooks to show how topic analysis can identify the different topics each text represents. We will use the Latent Dirichlet Allocation (LDA) approach, one of the most common modelling methods for discovering topics. We can then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook found on TAPoR.
NB: Any number of texts can be used; we chose 3 for this recipe.
A common requirement in extracting information is the ability to identify all persons or characters referred to in the text. This is an elaborate process of identifying the parts of speech in the text, tagging them, and retrieving the names associated with those tags. The common approach is to use a part-of-speech (POS) tagger, which analyses a sentence and associates each word with its lexical descriptor, i.e. whether it is an adverb, noun, adjective, conjunction, etc. NLTK is a robust library and therefore the main ingredient of our recipe.
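A minimal sketch of the name-retrieval step, assuming the text has already been tokenized and tagged (for example with NLTK's `pos_tag`). The tagged sentence below is hand-written with Penn Treebank tags rather than produced by a tagger, and `extract_names` is a hypothetical helper, not a function from NLTK.

```python
def extract_names(tagged_tokens):
    """Group consecutive proper nouns (tag NNP) into candidate names."""
    names = []
    current = []
    for word, tag in tagged_tokens:
        if tag == "NNP":
            current.append(word)
        elif current:
            # A non-NNP token ends the current run of proper nouns.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-tagged example sentence (an assumption, not real tagger output).
tagged = [
    ("Elizabeth", "NNP"), ("Bennet", "NNP"), ("met", "VBD"),
    ("Mr.", "NNP"), ("Darcy", "NNP"), ("at", "IN"),
    ("the", "DT"), ("ball", "NN"), (".", "."),
]

names = extract_names(tagged)
print(names)
```

Grouping adjacent proper nouns keeps multi-word names together, although it will also pick up places and organisations, so the results usually need a further filtering pass.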
This recipe explores how to analyze a corpus for the locations that are written about within it. These results can be mapped to visualize the spatial focus of the corpus. This recipe is based on an iPython notebook by Matthew Wilkins.
This recipe uses an Aggregate Text tool to generate Dynamically Aggregated Text, and uses tools such as a frequency list and Concordance to explore the results.
This recipe demonstrates how to use the command-line 'grep' command and regular expressions to find patterns within a plain text file.
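For readers who want to try the same idea outside the shell, here is a small Python sketch that mimics the basic behaviour of grep: it returns each line that matches a regular expression, together with its line number. The `grep` function below is a hypothetical helper written for this example, the sample lines are illustrative, and Python's regular expression syntax differs in places from that of the command-line tool.

```python
import re


def grep(pattern, lines, ignore_case=False):
    """Return (line_number, line) pairs for lines matching the pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


# Illustrative sample text, one string per line of the file.
lines = [
    "Call me Ishmael.",
    "Some years ago, never mind how long precisely,",
    "I thought I would sail about a little",
    "and see the watery part of the world.",
]

# Words beginning with a lowercase "s".
matches = grep(r"\bs\w+", lines)
for number, line in matches:
    print(f"{number}: {line}")
```

Passing `ignore_case=True` also matches "Some" on the second line, much as the `-i` option does for grep itself.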
This recipe uses List Words, Concordance and Collocation tools to explore themes in blog discourse. It is important to keep individual blog entries around the same length to ensure consistent results when analyzing your compiled text.