Sidebar: The search engine SmarTrieve worked by building an inverted index of a target text corpus. That is, after eliminating "stop" words (a hundred or so words so common that they have no value in discriminating one piece of text from another), it created a list of every remaining word that occurred in the text and, for each word, a list of the addresses where it occurred. Building the index also involved stemming algorithms, which reduced inflected word forms that would otherwise overcomplicate search and retrieval to a common base form. Thus, plurals were stemmed to their singular form, verb inflections to the infinitive form, and so on.
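
The idea of an inverted index can be illustrated with a short Python sketch. This is not SmarTrieve's code: the stop list, the irregular-form table, the crude plural-stripping rule, and all function names here are assumptions made up for illustration, and an "address" is taken to be a (segment number, word position) pair.

```python
import re
from collections import defaultdict

# Hypothetical, heavily abbreviated stop list (a real one has a hundred or so words).
STOP_WORDS = {"a", "an", "and", "do", "from", "in", "of", "the", "to", "why"}

# Hypothetical lookup table for irregular inflections.
IRREGULAR = {"leaves": "leaf"}


def stem(word):
    """Crude illustrative stemmer: irregular lookup, then strip a plural 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(segments):
    """Map each stemmed term to a list of (segment number, word position) addresses."""
    index = defaultdict(list)
    for seg_id, text in enumerate(segments):
        for pos, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words are never indexed
            index[stem(word)].append((seg_id, pos))
    return index


segments = [
    "Leaves fall from trees in autumn.",
    "The tree lost a leaf.",
]
index = build_index(segments)
```

In this sketch, "Leaves" and "leaf" end up under the same entry, so a query for either form retrieves both segments.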

In addition, co-occurrence indices were built to record instances of words that tended to occur close together in text - fixed phrases and other forms. These indices permitted the identification and retrieval of segments of text containing a so-called "key word in context." Thus, for a query such as "Why do leaves fall from trees?" the retrieval engine would not merely find all isolated instances of "leaf," "fall," and "tree," but would also give special weight to passages where all three or, at lesser weight, two of the three fell close together in running text.
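
A rough way to picture the proximity weighting is to count how many distinct query terms fall inside a small sliding window of words. The Python sketch below is an assumption-laden illustration, not SmarTrieve's method: the function name, the window size of five words, and the sample token lists are all hypothetical, and the tokens are assumed to be stemmed already.

```python
def proximity_score(tokens, query_terms, window=5):
    """Return the largest number of distinct query terms found in any
    run of `window` consecutive tokens (window size is an assumption)."""
    wanted = set(query_terms)
    best = 0
    for start in range(len(tokens)):
        seen = wanted.intersection(tokens[start:start + window])
        best = max(best, len(seen))
    return best


query = ["leaf", "fall", "tree"]

# All three terms occur within five consecutive words.
close = ["leaf", "fall", "from", "tree", "in", "autumn"]

# The same three terms occur, but far apart from one another.
spread = ["leaf", "is", "green", "and", "small", "but", "the",
          "tree", "is", "tall", "so", "apples", "fall"]

close_score = proximity_score(close, query)
spread_score = proximity_score(spread, query)
```

A ranking step could then give the most weight to a score of three, less to a score of two, and none to isolated terms.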

Finally, relevance-ranking algorithms in the retrieval engine would order the returned list of "hits" by expected relevance to the original query, giving extra weight, for example, to passages in which key terms appeared in a title, or appeared several times, or appeared close to the beginning of a segment rather than near the end.
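
The three signals named above (a term in the title, repeated terms, and an early first appearance) can be combined into a single score, as in the following Python sketch. The weights 3.0, 1.0, and 2.0, the segment layout, and the function names are invented for illustration; the sidebar does not say how SmarTrieve actually combined these factors.

```python
def relevance(segment, query_terms):
    """Score one segment; the weights below are invented for illustration."""
    wanted = set(query_terms)
    title_hits = sum(1 for t in segment["title"] if t in wanted)
    body = segment["body"]
    positions = [i for i, t in enumerate(body) if t in wanted]
    frequency = len(positions)
    # An earlier first occurrence earns a larger bonus, between 0 and 1.
    earliness = 1 - positions[0] / len(body) if positions else 0.0
    return 3.0 * title_hits + 1.0 * frequency + 2.0 * earliness


def rank(segments, query_terms):
    """Order segments from most to least relevant."""
    return sorted(segments, key=lambda s: relevance(s, query_terms), reverse=True)


# Hypothetical segments, already tokenized and stemmed.
segments = [
    {"id": "A", "title": ["autumn"], "body": ["wind", "rain", "leaf", "fall"]},
    {"id": "B", "title": ["leaf", "fall"], "body": ["leaf", "fall", "tree", "autumn"]},
    {"id": "C", "title": ["garden"], "body": ["soil", "water", "sun", "tree"]},
]
ranked = [s["id"] for s in rank(segments, ["leaf", "fall", "tree"])]
```

Here segment B ranks first because the query terms appear in its title, occur three times in its body, and begin at the very first word.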


©2003 by Robert McHenry