Voronoi Diagram with Word Embedding


This recipe will show you how to prepare Voronoi diagrams, one way of showing relationships between words in a text and a search term. To do this, we will use word embeddings, which represent individual words in a text as real-valued vectors in a predefined vector space. This recipe is based on Kynan Ly's cookbook as seen on this notebook.

  • Open the file and load the corpus into a sentence tokenizer.
    • This will split the text into chunks of sentences and create a list with all the sentences.
  • Clean the text to remove punctuation and make the text uniform.
    • Depending on your needs, you can choose which punctuation marks and even contractions to remove. In this recipe, we will simply lowercase all the words in the corpus, remove standard stopwords and lemmatize the text. Feel free to perform any custom cleaning of the corpus to your taste.
  • Do basic collocation with a search term in the corpus. 
    • This is done by iterating through the list of cleaned sentences and taking note of all the sentences that feature the given search term. This is important for the next step, where we perform word embedding.
  • Use the results of the previous step to train a model that assigns a real-valued vector to each word. We can set a limit on the minimum number of words to model with and also reduce the number of dimensions. A trained model is generated at this point.
    • The trained model holds vectors in many dimensions, which need to be reduced to two dimensions (2D) in order to be represented as a Voronoi diagram.
  • To make the final diagram useful, create a canvas on which to present your Voronoi diagram, since the regions generated from the 2D points can have vertices at infinity. To prevent the diagram from extending infinitely, define a bounded container in which every region is represented with finite points.
  • To actually plot the Voronoi diagram, we take the 2D results, create polygons from them and append each polygon to a list in an iterative manner. These polygon regions are assigned different colors and annotated with labels for rendering.
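
The first three steps above (sentence splitting, cleaning and collocation) can be sketched with the standard library alone. This is a minimal illustration, not the code from the original notebook: the regex-based splitter stands in for a proper sentence tokenizer, the `STOPWORDS` set is a tiny hypothetical list, lemmatization is left out, and the function names are made up for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "it", "on"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(sentence):
    """Lowercase, drop punctuation and remove stopwords (no lemmatization here)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def collocated_sentences(cleaned, term):
    """Keep only the cleaned sentences that contain the search term."""
    return [words for words in cleaned if term in words]

text = "The whale surfaced. Ahab watched the whale! The sea was calm."
cleaned = [clean_sentence(s) for s in split_sentences(text)]
hits = collocated_sentences(cleaned, "whale")
print(hits)  # [['whale', 'surfaced'], ['ahab', 'watched', 'whale']]
```

The list in `hits` is the kind of input a word-embedding model would then be trained on.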

This recipe uses Principal Component Analysis (PCA) to project the vectors onto a 2D plane so that they can be plotted. Read more about PCA.
Word embedding is a machine learning technique for mapping words to real-valued vectors. Read more on word embedding here and here.
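
As a rough illustration of the PCA step, the sketch below projects a set of vectors onto their first two principal components using NumPy's singular value decomposition. It assumes NumPy is installed. The random `word_vectors` array is a stand-in for real embeddings, and `pca_2d` is a hypothetical helper name, so this may differ from how the original notebook performs the reduction.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(6, 50))  # 6 hypothetical words, 50 dimensions each
points_2d = pca_2d(word_vectors)
print(points_2d.shape)  # (6, 2)
```

Each row of `points_2d` is the 2D position of one word, ready to be used as a site in the diagram.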

Voronoi diagrams are excellent at visually representing relationships between words in a corpus, especially in relation to a particular keyword or search term. They offer a more appealing visualization of collocated words, showing how close the collocating words are to each other as well as to the search term at the center.
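
To make the idea of bounded regions concrete, here is one possible way to compute finite Voronoi polygons inside a rectangular container. It is a sketch of a different technique from whatever the original notebook uses: each cell starts as the full bounding box and is clipped against the perpendicular bisector between its site and every other site. The function names, the sample points and the box are all assumptions made for this example.

```python
def clip_half_plane(polygon, a, b):
    """Keep the part of a convex polygon where a[0]*x + a[1]*y <= b."""
    result = []
    for i, current in enumerate(polygon):
        previous = polygon[i - 1]
        d_prev = a[0] * previous[0] + a[1] * previous[1] - b
        d_curr = a[0] * current[0] + a[1] * current[1] - b
        if (d_prev <= 0) != (d_curr <= 0):
            # The edge crosses the boundary line: add the intersection point.
            t = d_prev / (d_prev - d_curr)
            result.append((previous[0] + t * (current[0] - previous[0]),
                           previous[1] + t * (current[1] - previous[1])))
        if d_curr <= 0:
            result.append(current)
    return result

def bounded_voronoi(points, box):
    """Return one finite polygon per point, clipped to box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    cells = []
    for i, p in enumerate(points):
        cell = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
        for j, q in enumerate(points):
            if i == j:
                continue
            # Points closer to p than to q satisfy (q - p) . x <= (|q|^2 - |p|^2) / 2.
            a = (q[0] - p[0], q[1] - p[1])
            b = (q[0] ** 2 + q[1] ** 2 - p[0] ** 2 - p[1] ** 2) / 2
            cell = clip_half_plane(cell, a, b)
        cells.append(cell)
    return cells

def polygon_area(polygon):
    """Shoelace formula for the area of a simple polygon."""
    pairs = zip(polygon, polygon[1:] + polygon[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2

points = [(1.0, 1.0), (3.0, 1.0), (2.0, 3.0)]  # hypothetical 2D word positions
cells = bounded_voronoi(points, (0.0, 0.0, 4.0, 4.0))
areas = [polygon_area(cell) for cell in cells]
print(areas)
```

Because every cell is clipped to the box, the polygons tile the container exactly and can be handed to any plotting library to be filled with colors and labeled.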

Next steps / further information 

Here are other recipes that use some of the concepts in this recipe:
Document Clustering by Topic using K-Means and PCA  
Basic Collocation in Python 
Find Collocated Words 

Explore the code on TAPoR