Program Overview 

XTRACT was a lexical collocation tool developed by Frank Smadja in the early 1990s that used statistical techniques to retrieve and identify collocations in large textual corpora (Smadja, "XTRACT" 399).

Rather than a single tool, XTRACT was a set of tools for locating words in context and making statistical observations to identify collocations (Smadja, "Retrieving Collocations" 150). Smadja promoted XTRACT as useful for any text-based application, such as language generation, retrieving grammatical collocations (Smadja, "Retrieving Collocations" 171), or multilingual lexicography (Smadja, "Retrieving Collocations" 174).

In 1993, Smadja released an expanded and refined version of XTRACT that computed more information and was optimized to extract more functional information (Smadja, "Retrieving Collocations" 150). According to Smadja, the 1993 version of XTRACT worked in three stages:

In the first stage, pairwise lexical relations are retrieved using only statistical information....In the second stage, multiple-word combinations and complex expressions are identified....Finally, by combining parsing and statistical techniques the third stage labels and filters collocations retrieved at stage one. The third stage has been evaluated to raise the precision of Xtract from 40% to 80% with a recall of 94%. (Smadja, "Retrieving Collocations" 145)

Smadja also found that the third stage could "be considered as a retrieval system that retrieves valid collocations from a set of candidates" (Smadja, "Retrieving Collocations" 166).
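Treating the third stage as a retrieval system over a set of candidates, its precision and recall can be computed as in the minimal sketch below. The counts are hypothetical, chosen only so that the arithmetic reproduces the reported percentages; they are not Smadja's data. The function name `precision_recall` and the assumption that recall is measured against the valid collocations among the stage-one candidates are illustrative choices, not details confirmed by the source.

```python
def precision_recall(retrieved, valid):
    """Return (precision, recall) of a retrieved set against a set of valid items."""
    retrieved, valid = set(retrieved), set(valid)
    hits = len(retrieved & valid)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(valid) if valid else 0.0
    return precision, recall


# Hypothetical numbers (not from the source): 1000 stage-one candidates,
# of which 400 are valid collocations, i.e. 40% precision before filtering.
candidates = set(range(1000))
valid = set(range(400))

# Hypothetical filter output: keeps 376 valid and 94 invalid candidates.
kept = set(range(376)) | set(range(400, 494))

before = precision_recall(candidates, valid)
after = precision_recall(kept, valid)

print(f"before filtering: precision={before[0]:.0%}")
print(f"after filtering:  precision={after[0]:.0%}, recall={after[1]:.0%}")
```

Under these assumed counts, the filter discards most invalid candidates while losing only a small share of the valid ones, which is the trade-off the reported figures describe.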

XTRACT used statistics to retrieve pairwise lexical relations between words that were correlated within a sentence of a corpus, and "retain[ed] words (or parts of speech) occupying a position with probability greater than a given threshold" (Smadja, "Retrieving Collocations" 151). XTRACT could also use its collocations to produce other lexicographic output, such as adding syntax (Smadja, "Retrieving Collocations" 161), producing tagged concordances, parsing texts, and labeling sentences (Smadja, "Retrieving Collocations" 162).
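The idea of retaining words that occupy a position with probability above a threshold can be illustrated with a small sketch. This is not Smadja's implementation: the function names, the window size, the threshold value, and the toy sentences are all assumptions made for illustration, and the sketch works on plain tokens rather than parts of speech.

```python
from collections import Counter, defaultdict


def words_by_position(sentences, anchor, window=2):
    """For each offset relative to `anchor`, count the words seen at that offset."""
    by_offset = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != anchor:
                continue
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    by_offset[j - i][tokens[j]] += 1
    return by_offset


def retain_frequent(by_offset, threshold=0.75):
    """Keep words that fill a position with probability greater than `threshold`."""
    kept = {}
    for offset, counter in by_offset.items():
        total = sum(counter.values())
        for word, count in counter.items():
            if count / total > threshold:
                kept.setdefault(offset, []).append(word)
    return kept


# Toy corpus (hypothetical), already split into sentences and tokens.
sentences = [
    "the stock market fell sharply".split(),
    "the stock market rose today".split(),
    "analysts watch the stock market".split(),
    "the stock market closed higher".split(),
]

kept = retain_frequent(words_by_position(sentences, anchor="market"))
print(kept)
```

In this toy corpus, the positions to the left of "market" are always filled by the same words and survive the threshold, while the positions to the right vary from sentence to sentence and are dropped.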

XTRACT's results varied based on the size of the corpus (Smadja, "Retrieving Collocations" 168). Smadja found that the program was not effective for low-frequency words, which negatively affected smaller texts because those words did not have a large enough distribution among their collocates (168). The results also varied based on the content of a corpus, as shown by Smadja's example using Wall Street data:

Food is not eaten at Wall Street but rather traded, sold, offered, bought, etc. If the corpus only contains stories in a given domain, most of the collocations retrieved will also be dependent on this domain...in addition to jargonistic words, there are a number of more familiar terms that form collocations when used in different domains. A corpus containing stock market stories is obviously not a good choice for retrieving collocations related to weather reports or for retrieving domain independent collocations such as "make-decision." (Smadja, "Retrieving Collocations" 169)

In Smadja's view, XTRACT produced results with the highest quality and greatest range of collocations, which he credited to its filtering system and syntactic labeling (Smadja, "XTRACT" 411).

Last Update 
Dec 30, 2014
This document was retrieved from the Internet Archive.