Topic Modelling to Identify Themes

Introduction 

In this recipe we use 3 ebooks to show how topic analysis can identify the different topics each text represents. We will use the Latent Dirichlet Allocation (LDA) approach, one of the most common modelling methods for discovering topics. We can then spice it up with an interactive visualization of the discovered themes. This recipe is based on Zhang Jinman's notebook found on TAPoR.

NB: Any number of texts can be used; we chose 3 for this recipe.

Ingredients 
Steps 
  • Extract the 3 text files and make a list containing all the texts.
  • Preprocessing: we iterate through the list and, for each text, we:
    • tokenize sentences into words
    • remove punctuation and stopwords
    • lemmatize the text
  • The processed corpus is then converted into a matrix by:
    • first, getting the top vocabulary terms ordered by term frequency across the corpus. Since the tokenized texts can be very large, we simply keep the top 1000 terms.
    • then learning the vocabulary, using the results above as the training set
    • finally, returning the transformed matrix.
  • Specify the number of topics (typically the same as the number of texts, in this case 3) and a threshold for the number of top words each topic can have.
  • We fit an LDA model and extract, for each topic, the words that most closely represent it.
  • Visualizations
    • A basic visualization is done by first getting the distribution of topics in each text, then transforming the LDA modelling results into a document-word matrix
    • A more advanced, interactive visualization uses the pyLDAvis ingredient, which takes the LDA modelling results as input
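
The preprocessing and matrix steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's code: the `STOPWORDS` set, the `naive_lemma` helper (a crude stand-in for a real lemmatizer), the function names and the three sample sentences are all assumptions made for the example. A real run would normally rely on a dedicated NLP library for stopwords and lemmatization.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "was", "were"}


def naive_lemma(word):
    """Crude stand-in for lemmatization: strip a trailing plural 's' (hypothetical helper)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic tokens only (drops punctuation), remove stopwords, lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_lemma(t) for t in tokens if t not in STOPWORDS]


def build_matrix(docs_tokens, max_vocab=1000):
    """Keep the max_vocab most frequent terms across the corpus and count them per document."""
    totals = Counter(tok for doc in docs_tokens for tok in doc)
    vocab = [term for term, _ in totals.most_common(max_vocab)]
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs_tokens:
        row = [0] * len(vocab)
        for tok in doc:
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix


# Placeholder sentences standing in for the 3 ebooks.
texts = [
    "The whales and the ships were on the sea.",
    "A detective follows the clues in London.",
    "The garden roses bloom; the roses need rain.",
]
processed = [preprocess(t) for t in texts]
vocab, matrix = build_matrix(processed)
print(vocab)
print(matrix)
```

Each row of `matrix` corresponds to one text and each column to one vocabulary term, which is the shape an LDA implementation typically expects as input.
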
Discussion 

Depending on the size of the texts collected, topic modelling will give you a fairly good idea of their content. Even without the visualization ingredients, the resulting word-sets should give you insight, especially when working with large, diverse and unstructured texts.
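
To make the idea of word-sets concrete, here is a small sketch that fits LDA with a collapsed Gibbs sampler written in plain Python and prints the top words per topic and the topic mix per text. It is only an illustration under stated assumptions: the sampler, the function names `lda_gibbs` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy token lists are all hypothetical, and this is not the notebook's implementation. In practice a library implementation such as scikit-learn's `LatentDirichletAllocation` would usually be used instead.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # total words per topic
    assign = []
    for d, doc in enumerate(docs):                             # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]                               # remove the current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z                               # record the new assignment
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, vocab, n_top):
    """Return the n_top highest-count words for each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n_top]])
    return result


# Toy, already-preprocessed token lists standing in for the 3 texts.
docs_tokens = [
    ["whale", "ship", "sea", "whale", "captain", "sea"],
    ["detective", "clue", "london", "clue", "crime", "detective"],
    ["garden", "rose", "rain", "rose", "bloom", "garden"],
]
vocab = sorted({w for doc in docs_tokens for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}
docs = [[word_id[w] for w in doc] for doc in docs_tokens]

n_topics, n_top = 3, 3   # number of topics and top-word threshold
doc_topic, topic_word = lda_gibbs(docs, n_topics, len(vocab))
for k, words in enumerate(top_words(topic_word, vocab, n_top)):
    print(f"Topic {k}: {', '.join(words)}")
for d, counts in enumerate(doc_topic):
    total = sum(counts)
    print(f"Text {d}:", [round(c / total, 2) for c in counts])
```

The sampler is stochastic, so the topics found depend on the seed and on the number of iterations; on texts this small the result is only suggestive.
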

This is a common method for preliminary analysis of texts and for identifying themes you may want to confirm and pursue.

Next steps / further information 

This recipe is based on Jinman Zhang's cookbook (see TAPoR). 
