Miscellaneous

This recipe compares two machine learning approaches, Support Vector Machines (SVM) and Naive Bayes, to see which gives the more accurate sentiment analysis. Both approaches are trained on a corpus of positive and negative movie reviews and then tested to produce an accuracy score.

This recipe uses regular expressions (regex) to clean a text document. It is based on the code from the "Using Regular Expressions to Clean a Text" recipe.

This recipe uses regular expressions to clean up a web page by removing HTML tags and other unnecessary textual elements. This is a useful first step before carrying out any meaningful textual analysis of the page's content.
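As a rough illustration of the kind of cleaning described above, the following sketch strips tags from a small, made-up HTML string using only Python's standard library. The function name clean_html and the sample markup are assumptions for illustration, not the recipe's own code, and regular expressions are not a full HTML parser, so malformed pages may need a more careful approach.

```python
import re
from html import unescape


def clean_html(raw: str) -> str:
    """Strip tags and collapse whitespace using regular expressions only."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space so adjacent words stay separate.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; into plain characters.
    text = unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample page, invented for this example.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>"
)
print(clean_html(sample))  # Title Fish & chips
```

Replacing each tag with a space, rather than with nothing, keeps words from neighbouring elements from running together.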

Multidimensional Scaling (MDS) is a method for placing documents, represented by their sets of terms, in a low-dimensional space that can be stored in a data frame and then visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to another. This recipe deals with several advanced text analysis concepts and methods. Links to additional information on these terms are provided below.

pandas DataFrame: https://pandas.pydata.org/

TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
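To make the idea of distance between texts more concrete, here is a minimal standard-library sketch that weights terms with TF-IDF and computes pairwise cosine distances. It is not MDS itself; it only produces the kind of distance matrix that a scaling method could take as input. The weighting used (relative term frequency times log(N / document frequency)) is one of several common variants, and the function names and sample documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (TF times log IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_distance(a, b):
    """1 minus cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical mini-corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
distances = [[cosine_distance(a, b) for b in vecs] for a in vecs]
for row in distances:
    print([round(d, 3) for d in row])
```

In this toy corpus the first two documents share several terms and so end up closer to each other than either is to the third, which shares none.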

Principal Component Analysis (PCA) is a method to convert sets of document terms into a data frame that can then be visualized. The distances expressed in the visualization show how similar, or dissimilar, the contents of one text are to another. PCA tries to identify a smaller number of uncorrelated variables, called "principal components", from the dataset. The goal is to explain the maximum amount of variance with the fewest principal components. This recipe deals with several advanced text analysis concepts and methods.
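The idea of explaining most of the variance with few components can be sketched without any third-party libraries. The example below finds only the first principal component of a tiny, made-up two-column dataset using power iteration on the covariance matrix. This is a stand-in chosen to keep the example self-contained, not the method the recipe uses; the function name and the data are assumptions, and the fixed starting vector would fail in the rare case where it is exactly orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) for the top component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply the matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical data: two strongly correlated columns.
rows = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.2], [8.0, 7.8]]
direction, explained = first_principal_component(rows)
print(direction, round(explained, 4))
```

Because the two columns move together almost perfectly, a single component accounts for nearly all of the variance here.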

Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be letters, punctuation marks, words, or sentences. They could even be a combination of all these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.
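The different token granularities mentioned above can be shown with a short standard-library sketch. The sample sentence and the regular expressions are assumptions chosen for illustration; they are not taken from the original notebook, and the sentence splitter is deliberately naive (it would break on abbreviations such as "Dr.").

```python
import re

text = "Tokenization splits text. It isn't hard, is it?"

# Character tokens: every letter, space, and punctuation mark.
characters = list(text)

# Word tokens: runs of letters, keeping contractions such as "isn't" whole.
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

# Word and punctuation tokens together.
words_and_punct = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Sentence tokens: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)
# ['Tokenization', 'splits', 'text', 'It', "isn't", 'hard', 'is', 'it']
print(words_and_punct)
# ['Tokenization', 'splits', 'text', '.', 'It', "isn't", 'hard', ',', 'is', 'it', '?']
print(sentences)
# ['Tokenization splits text.', "It isn't hard, is it?"]
```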

Stemming and lemmatization are text analysis methods that reduce the derivative forms of a word to its root. Stemming does this by removing the suffixes of words; lemmatization does it by comparing the derivative words to a predetermined vocabulary of their root forms. This recipe was adapted from a Python Notebook written by Kynan Lee.
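The contrast between the two approaches can be seen in a deliberately simplified sketch. The suffix list, the lookup table, and both function names below are toy assumptions invented for this example; real stemmers and lemmatizers, such as those shipped with NLTK, use far richer rules and vocabularies.

```python
# Toy suffix list, checked in this order (illustrative only).
SUFFIXES = ("ing", "ed", "es", "s")

# Toy vocabulary mapping derivative forms to root forms (illustrative only).
LEMMAS = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "studies": "study",
}


def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned as-is."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["jumping", "studies", "mice", "running"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
# jumping -> jump / jumping
# studies -> studi / study
# mice -> mice / mouse
# running -> runn / run
```

The output shows the trade-off: suffix stripping needs no vocabulary but can produce non-words such as "studi", while the lookup returns real words but only for forms it already knows.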

The introductory section of a tutorial.