Gathering
This recipe uses regular expressions (regex) to clean a text document. It is based on the code from Using Regular Expressions to Clean a Text.
This recipe uses regular expressions to clean up a web page, a useful first step before any meaningful textual analysis of the page's content. The method removes HTML tags and other unnecessary textual elements.
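As a rough illustration of the kind of cleanup described here, the sketch below strips tags from a small HTML string using Python's standard `re` and `html` modules. The function name `strip_html` and the sample page are invented for this example and are not taken from the recipe itself. Regular expressions are adequate for simple pages like this one, but a real HTML parser is usually more reliable on messy markup.

```python
import html
import re


def strip_html(page: str) -> str:
    """Return the visible text of an HTML string (hypothetical helper)."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space so adjacent words stay apart.
    text = re.sub(r"<[^>]+>", " ", text)
    # Turn entities such as &amp; back into plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# A made-up page, used only to show the function at work.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Fish &amp; chips.</p><!-- note --></body></html>"
)
print(strip_html(sample))  # prints: Title Fish & chips.
```

The order of the steps matters: script and style blocks are removed first so that their contents do not survive as stray text once the tags around them are stripped.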
This recipe discusses ways to find electronic texts (e-texts) online that can be used by other text analysis tools.
This recipe looks at how a Wikipedia article has changed over time and generates a corpus of its different edited versions.
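One possible way to gather the edited versions is sketched below. It assumes the MediaWiki Action API on English Wikipedia, which the recipe summary does not mention, so treat it as one option rather than the recipe's own method. The helper names `revisions_url`, `corpus_from_response`, and `fetch_revisions`, the User-Agent string, and the sample response are all invented for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the English Wikipedia endpoint of the MediaWiki Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def revisions_url(title, limit=10):
    """Build a query URL asking for the latest revisions of one article."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "rvlimit": str(limit),
    }
    return API_URL + "?" + urlencode(params)


def corpus_from_response(data):
    """Turn a decoded API response into (timestamp, revid, text) tuples, oldest first."""
    corpus = []
    for page in data.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            text = rev.get("slots", {}).get("main", {}).get("content", "")
            corpus.append((rev["timestamp"], rev["revid"], text))
    # ISO 8601 timestamps sort chronologically as plain strings.
    return sorted(corpus)


def fetch_revisions(title, limit=10):
    """Download and parse revisions; needs network access."""
    request = Request(
        revisions_url(title, limit),
        headers={"User-Agent": "revision-corpus-sketch/0.1"},  # hypothetical name
    )
    with urlopen(request) as response:
        return corpus_from_response(json.load(response))


# A hand-written response in the expected shape, so the parser can be tried offline.
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Example",
                "revisions": [
                    {"revid": 2, "timestamp": "2021-05-01T10:00:00Z",
                     "slots": {"main": {"content": "second version"}}},
                    {"revid": 1, "timestamp": "2020-01-01T09:00:00Z",
                     "slots": {"main": {"content": "first version"}}},
                ],
            }
        ]
    }
}
for timestamp, revid, text in corpus_from_response(sample_response):
    print(timestamp, revid, text)
```

Calling `fetch_revisions("Text mining", limit=5)` would request the five most recent revisions of that article. Each tuple could then be written to its own file to form the corpus.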