- Aggregate
An aggregate is a text formed by sequentially joining together a number of texts from a variety of sources into one large text.
- Co-occurrence
A co-occurrence is the number of times two patterns, such as words or word fragments, occur in a set order within a set distance of one another in a source text.
- Co-pattern
In a Co-occurrence, the co-pattern is searched for within a set distance before and after the Primary pattern.
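A co-occurrence count can be sketched in a few lines of Python. This is only an illustration, not the method of any particular tool: the function name `count_cooccurrences` is hypothetical, patterns are assumed to be whole lower-cased words, and distance is assumed to be measured in words on both sides of the primary pattern.

```python
import re

def count_cooccurrences(text, primary, co_pattern, distance):
    """Count co-pattern hits within `distance` words before or after each primary hit."""
    words = re.findall(r"\w+", text.lower())
    count = 0
    for i, word in enumerate(words):
        if word != primary:
            continue
        # Window of words on both sides of the primary pattern, excluding the hit itself.
        start = max(0, i - distance)
        window = words[start:i] + words[i + 1:i + 1 + distance]
        count += window.count(co_pattern)
    return count

sample = "She sells sea shells by the sea shore"
print(count_cooccurrences(sample, "sea", "shells", 2))  # 1
print(count_cooccurrences(sample, "sea", "shells", 3))  # 2
```

Widening the distance from two words to three lets the second 'sea' reach back to 'shells', which is why the count rises.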
- Collocation
Collocation refers to the occurrence of words adjacently more often than would be expected by chance. It is the relationship between two words or groups of words that often go together and form a common expression. If the expression is heard often, the words become 'glued' together in our minds. 'Crystal clear', 'middle management', 'nuclear family' and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.
- Concordance
A Concordance is a gathering of passages that "concord" or agree. Usually it is a gathering of passages containing a sought-for word. Concordances are a form of reading tool dating back to the Middle Ages. They are typically lists of words with their appearances. A concordance for the Bible, for example, would have entries for all the content words of the Bible in alphabetical order. Each entry would include information about where the word appears and some context. Searching for words on a computer now typically returns a concordance called a Key Word in Context (KWIC), with the sought word down the center and a few words of context on either side. Google returns a type of concordance when you search for a word, showing an example of the word in context for each page it recommends.
- Context
In text analysis, context refers to the text surrounding a string of characters, which may be as short as a word or as long as a paragraph. Context is particularly important when generating a concordance for a string.
- Extract text
To extract text is to remove HTML or XML elements from it. This process returns a plain text document. All text can be extracted from an HTML or XML document, or only the text within particular elements.
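Text extraction can be sketched with Python's standard `html.parser` module. The class `TextExtractor` and the function `extract_text` are hypothetical names for this illustration. The sketch assumes reasonably well-formed HTML, joins text fragments with single spaces, and does not skip script or style content.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, either from the whole document or only inside one element type."""

    def __init__(self, only_tag=None):
        super().__init__()
        self.only_tag = only_tag  # e.g. "p"; None means keep all text
        self.depth = 0            # how many matching elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.only_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.only_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.only_tag is None or self.depth > 0:
            self.chunks.append(data)

def extract_text(markup, only_tag=None):
    parser = TextExtractor(only_tag)
    parser.feed(markup)
    parser.close()
    # Join fragments and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())

page = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"
print(extract_text(page))       # Title First bold paragraph.
print(extract_text(page, "p"))  # First bold paragraph.
```

Because fragments are joined with spaces, a word split by inline markup would come out as two words; a fuller tool would need to handle that case.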
- Finite state machine
A finite state machine (FSM) is a mathematical abstraction sometimes used to design digital logic or computer programs. It is a behavior model composed of a finite number of states, transitions between those states, and actions. An FSM begins in one state (called the start state) and moves between states according to its input. It can stop in any state, but only a designated set of states (called accept states) marks a successful flow of operation.
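A small finite state machine can be written as a table of transitions. The sketch below is a hypothetical example, not drawn from any particular tool: `run_fsm` and the `EVEN_ONES` machine are names invented here, and actions are left out to keep the states, transitions, start state and accept states visible.

```python
def run_fsm(transitions, start, accept_states, symbols):
    """Feed symbols through the machine and report whether it ends in an accept state."""
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False  # no transition defined for this input: reject
        state = transitions[key]
    return state in accept_states

# Hypothetical machine: accepts strings of 0s and 1s that contain an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

print(run_fsm(EVEN_ONES, "even", {"even"}, "1001"))  # True
print(run_fsm(EVEN_ONES, "even", {"even"}, "1011"))  # False
```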
- Fixed phrase
A fixed phrase is a phrase containing a specified word, within a context of a specified number of words on either side of that word. For example, if one were to search the sentence 'She sells sea shells by the sea shore' for 'sea' with a context of one word, the results would include 'sells sea shells' and 'the sea shore'.
- Fixed phrase list
A fixed phrase list is a list of all phrases containing a specified word, within a context of a specified number of words on either side of that word, in a given document. For example, if one were to search the sentence 'She sells sea shells by the sea shore' for 'sea' with a context of one word, the results would include 'sells sea shells' and 'the sea shore'.
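The 'sea' example can be reproduced with a short Python sketch. The function name `fixed_phrases` is hypothetical, and the sketch assumes that words are separated by whitespace and that matching ignores case.

```python
def fixed_phrases(text, word, context=1):
    """List every phrase made of `word` plus `context` words on either side."""
    words = text.split()
    phrases = []
    for i, candidate in enumerate(words):
        if candidate.lower() == word.lower():
            start = max(0, i - context)  # do not run past the start of the text
            phrases.append(" ".join(words[start:i + context + 1]))
    return phrases

print(fixed_phrases("She sells sea shells by the sea shore", "sea", 1))
# ['sells sea shells', 'the sea shore']
```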
- Glasgow Stop Words list
The Glasgow Stop Words list is a popular Stop list developed by the Information Retrieval Group at the University of Glasgow. The TAPoR and Voyant toolsets use a modified version of the Glasgow Stop Words list in their respective text analysis tools. The modifications include the addition of numeric characters, punctuation, other text symbols and individual letters, and the removal of words such as 'top', 'sincere' and 'beyond'. This list may be applied or ignored according to the needs of the user: for example, a user searching for common phrases may wish to retain the stop words in the results, while a user searching for the top words may wish to filter them out.
- Grep
To grep is to search a text for a string or regular expression pattern of characters.
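A grep-style search can be sketched with Python's standard `re` module. The function name `grep` is chosen here for illustration only, and the sketch assumes the text is searched line by line, returning each matching line with its line number.

```python
import re

def grep(pattern, text):
    """Return (line number, line) pairs for every line that matches the pattern."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]

sample = "riding boots\nmotor cyclist\nriding crop"
print(grep(r"^riding", sample))       # [(1, 'riding boots'), (3, 'riding crop')]
print(grep(r"cycl(e|ist)", sample))   # [(2, 'motor cyclist')]
```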
- HTML
HTML (Hypertext Markup Language) is a language used in web development to make a text readable by web browsers. HTML is primarily formed of paired elements, such as
<body></body>
<p></p>
that apply some characteristic to the text within them. One pair of elements may be nested inside another like this:
<body><p></p></body>
In this case, <body></body> marks the beginning and end of the body of the document, while <p></p> marks the beginning and end of a paragraph within the body.
- Interactive graph
An interactive graph is a graph designed to provide further information based on how the user interacts with it. For example, hovering over a data point may trigger more details about that point, while clicking on it may cause more related points to appear in the graph.
- Key Word In Context (KWIC)
A Key Word In Context (or KWIC) is a display of results in which the word searched for, the keyword, is in the centre surrounded by one line of context. This is how a Concordance is usually displayed.
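A KWIC display can be sketched in Python as follows. The function name `kwic`, the context width measured in words, and the 20-character left column are all assumptions made for this illustration rather than the behaviour of any specific tool.

```python
import re

def kwic(text, keyword, width=3):
    """Return one line per hit, with the keyword between `width` words of context."""
    words = re.findall(r"\w+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>20} | {word} | {right}")
    return lines

for line in kwic("She sells sea shells by the sea shore", "sea", width=2):
    print(line)
```

Right-aligning the left context is what puts the keyword in a single column down the centre of the display.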
- Local file
A local file is a file located on one's own computer, rather than on a website (located 'remotely'). Many tools, such as those in the TAPoRware and Voyant toolsets, allow users to upload files from their own computers rather than using a text located on a website. To do so, select the appropriate option and use the tool's 'browse' button to locate the correct .txt, .html or .xml file from your computer's directory.
- Pattern
In text analysis, a pattern is a string of characters (such as a word or phrase) or Regular expression to be searched for within the source text.
- Plain text
Plain text refers to a text without any additional formatting affecting its human readability, often found in .txt files. These files do not require a specialized program, such as a word processor, to read them.
- Primary pattern
When searching for a Co-occurrence, the primary pattern is the pattern searched for first within the source text; the Co-pattern or Secondary pattern is then searched for within a set distance of that primary pattern.
- Regular expression
A regular expression is a means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
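As a brief illustration, Python's standard `re` module is one such regular expression processor. The pattern below is a hypothetical example that matches every word beginning with 'sh', ignoring case.

```python
import re

sentence = "She sells sea shells by the sea shore"

# \b marks a word boundary; \w* matches the rest of the word.
sh_words = re.findall(r"\bsh\w*", sentence, flags=re.IGNORECASE)
print(sh_words)  # ['She', 'shells', 'shore']
```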
- Relative frequency
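In statistics, relative frequency describes the number of times an event occurs over the course of an experiment or study, divided by the total number of trials or observations. In text analysis, for example, a word that appears 5 times in a 1,000-word text has a relative frequency of 0.005. Relative frequencies are commonly plotted on histograms to provide a visual representation of the frequency distribution.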
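For word counts, relative frequency is commonly computed as each word's count divided by the total number of words. A minimal Python sketch, with the hypothetical function name `relative_frequencies` and the assumption that the text has already been split into words:

```python
from collections import Counter

def relative_frequencies(words):
    """Map each word to its count divided by the total number of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

words = "she sells sea shells by the sea shore".split()
frequencies = relative_frequencies(words)
print(frequencies["sea"])  # 0.25
print(frequencies["she"])  # 0.125
```

Here 'sea' occurs twice among eight words, giving 2 / 8 = 0.25.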
- Secondary pattern
In a Co-occurrence, the secondary pattern is searched for within a set distance following the Primary pattern.
- Source text
In text analysis, the source text is the text to be acted on. The source text can be hosted on a web page or uploaded from one's local files. When using the TAPoRware and Voyant toolsets, the source text must be in Plain text (.txt), HTML (.html) or XML (.xml) format.
- Stop list
A Stop list is a series of words that you may choose to exclude from a particular operation because you deem them to be irrelevant or obstructive to your analysis task. If you are searching for descriptive terms, for example, you may choose to exclude function words that normally occur as part of everyday speech. Your interest may lie only in extraordinary words.
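Applying a stop list amounts to filtering a word list against a set. In the Python sketch below, `STOP_WORDS` is a tiny hypothetical list made up for illustration, not the Glasgow list, and `remove_stop_words` is likewise an invented name.

```python
# Hypothetical, deliberately tiny stop list; real lists hold hundreds of words.
STOP_WORDS = {"a", "and", "by", "of", "the"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not on the stop list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]

words = "She sells sea shells by the sea shore".split()
print(remove_stop_words(words))
# ['She', 'sells', 'sea', 'shells', 'sea', 'shore']
```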
- Text Encoding
One of the most important aspects of the text input process is the encoding of the text you are working with. It must be encoded as either UTF-8 or Latin-1, both of which provide proper mapping of accented and other extended characters. For example, when properly encoded, the character 'e' is differentiated from the character 'é', and 'é' is not seen as the character 'e' plus some symbol. See the links below for more background information on encoding processes.
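The effect of encoding on an accented character can be seen with Python's built-in `str.encode` and `bytes.decode`. This is a small illustration of the general principle, not of how any particular tool reads its input.

```python
text = "é"

utf8_bytes = text.encode("utf-8")       # two bytes: b'\xc3\xa9'
latin1_bytes = text.encode("latin-1")   # one byte:  b'\xe9'

# Decoding with the matching encoding recovers the original character.
print(utf8_bytes.decode("utf-8") == text)      # True
print(latin1_bytes.decode("latin-1") == text)  # True

# Decoding UTF-8 bytes as Latin-1 turns one character into two wrong ones.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # Ã©
```

The last line shows the typical symptom of a mismatched encoding: the accented character is replaced by a pair of unrelated symbols.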
- Tokens
Tokens are strings of characters, such as word fragments, words, phrases or sentences, generated from a source text. In text analysis, tokens are useful for generating everything from word counts, to statistical analysis, to creating a concordance.
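Word-level tokenization can be sketched with a regular expression. The function name `tokenize` is hypothetical, and the rule used here (lower-cased runs of letters or digits, with internal apostrophes kept) is only one of many reasonable choices.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-cased word tokens, keeping apostrophes inside words."""
    return re.findall(r"\w+(?:'\w+)*", text.lower())

tokens = tokenize("She sells sea shells; she doesn't sell shores.")
print(tokens)
# ['she', 'sells', 'sea', 'shells', 'she', "doesn't", 'sell', 'shores']

# Tokens feed directly into a word count.
print(Counter(tokens)["she"])  # 2
```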
Unicode character encoding is an evolution of the ASCII set to permit support of a greater number of alphanumeric characters including those with diacritical marks such as accents. More information on UTF-8 is available at Wikipedia.
- Word cloud
A visual presentation of keywords drawn from a text, visually differentiated based on their position and frequency of use in that text.
- XML
XML, or Extensible Markup Language, is a language used in web development to make a text readable by web browsers and/or store data. Like HTML, XML is primarily formed of paired elements. Unlike HTML, the elements are defined by the user, rather than predefined. For example, both
are valid element pairs. These elements apply characteristics and metadata to the text within them. One pair of elements may be nested inside another: