This recipe uses regular expressions (regex) to clean a text document. It is based on the "Using Regular Expressions to Clean a Text" code.
This recipe uses regular expressions to clean up a web page. This is useful if you want to carry out any meaningful textual analysis of the page's content. With this method we can remove the HTML tags and other unnecessary textual elements.
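As a rough illustration of the tag-removal idea, the sketch below strips markup from a short HTML string using only Python's standard library. The function name `strip_html` and the sample string are hypothetical, chosen for this example, and the patterns are a simplification: regular expressions cope with tidy pages but can be fooled by malformed or unusual markup.

```python
import re
from html import unescape


def strip_html(raw_html):
    """Remove tags and collapse whitespace; a rough regex-based cleaner."""
    # Drop script and style elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw_html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space so adjacent words do not merge.
    text = re.sub(r"<[^>]+>", " ", text)
    # Convert entities such as &amp; back into plain characters.
    text = unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample page, used only to show the effect.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Fish &amp; chips.</p></body></html>"
)
cleaned = strip_html(sample)
print(cleaned)
```

Entities are decoded only after the tags are gone, so an escaped `&lt;` in the page text is not mistaken for the start of a tag.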
This recipe shows how to generate a basic concordance of a word, displaying each occurrence within its textual context. The approach relies on regular expressions, a pattern-based "find and replace" strategy. Regex, as it is also known, is available in most programming languages and is a well-documented method of parsing text. This recipe is based on Jinman's cookbook.
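A minimal sketch of a regex-based concordance might look like the following. The function name `concordance`, the `width` parameter, and the sample sentence are assumptions made for this illustration and are not taken from the cookbook the recipe is based on.

```python
import re


def concordance(text, keyword, width=30):
    """Return each whole-word match of keyword with `width` characters of context per side."""
    # \b restricts matches to whole words; re.escape protects special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        left = text[start:match.start()]
        right = text[match.end():end]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Hypothetical sample text, used only to show the output shape.
sample = "The whale surfaced. Ahab watched the whale dive. Whales are not the whale."
hits = concordance(sample, "whale", width=15)
for line in hits:
    print(line)
```

Because of the word boundaries, "Whales" is not counted as a hit for "whale"; dropping the trailing `\b` would be one way to include inflected forms.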
Tokenization is the process of splitting a sentence or a chunk of text into its constituent parts. These “tokens” may be letters, punctuation marks, words, or sentences. They could even be a combination of all these elements. This recipe was adapted from a Python Notebook written by Kynan Lee.
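To make the different token levels concrete, here is a small sketch using only Python's `re` module. The function names and the sample text are hypothetical, and the patterns are deliberately naive: the sentence splitter, for instance, would break on abbreviations such as "Dr." and is not the method used in the original notebook.

```python
import re


def sentence_tokens(text):
    """Split after ., ! or ? when followed by whitespace (naive)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def word_tokens(text):
    """Runs of word characters, keeping internal apostrophes (e.g. isn't)."""
    return re.findall(r"\w+(?:'\w+)*", text)


def word_and_punct_tokens(text):
    """Words plus each punctuation mark as its own token."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def letter_tokens(text):
    """Individual alphabetic characters."""
    return [ch for ch in text if ch.isalpha()]


# Hypothetical sample text.
sample = "Call me Ishmael. Some years ago, I went to sea!"
print(sentence_tokens(sample))
print(word_tokens(sample))
print(word_and_punct_tokens(sample))
print(letter_tokens(sample)[:4])
```

Which level to use depends on the analysis: word tokens suit frequency counts, while sentence tokens are more useful for tasks such as readability measures.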
Stemming and lemmatization are text analysis methods that reduce the derivative forms of a word to its root. Stemming does this by removing the suffixes of words; lemmatization does it by comparing the derivative words to a predetermined vocabulary of their root forms. This recipe was adapted from a Python Notebook written by Kynan Lee.
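The difference between the two approaches can be sketched with a toy example. Everything below is a hypothetical simplification written for illustration: `toy_stem` strips a handful of suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-made dictionary rather than a real vocabulary. In practice a library such as NLTK would normally supply both.

```python
# Suffixes are checked in this order; an illustrative list, not a real stemmer's rules.
SUFFIXES = ("ing", "ed", "es", "s")

# A tiny hand-made vocabulary mapping derived forms to their root forms.
LEMMAS = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "studies": "study",
}


def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["walking", "studies", "ran", "mice"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The output shows the usual trade-off: suffix stripping needs no vocabulary but can produce non-words ("studi") and misses irregular forms ("ran", "mice"), while the lookup handles irregular forms but only for words it already knows.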
The introductory section of a tutorial.
This section presents a concise summary of what the recipe will teach, focusing primarily on: (1) the outcome of the recipe (what are you trying to achieve, in non-technical terms); (2) the main technical approaches employed; and (3) whether the recipe is based on someone else’s work/code (they should be cited if so).
This recipe discusses ways to find electronic texts (e-texts) online that can be used by other text analysis tools.