Mortimer

Early in 1993 Clarke and Bartell began a series of experiments in automatic text classification. Clarke brought to the work an understanding of the nature of classification and experience in automated cluster analysis, Bartell his knowledge of the application of neural networks to the problem of information retrieval. Still fondly recalled are Clarke's first sketches of a graphical search interface, in which a few circles represented broad topics, within which would be represented individual "hits" (documents in the data being searched that were found to satisfy in significant degree the criteria for the various topics); the different diameters of the circles were proportional to the number of hits falling within each, or, as it was termed at the time, their "pregnancy."

In building his experimental software, Bartell employed a mathematical technique called multidimensional scaling. Very roughly, groups of text documents were analyzed statistically to find associations among the lexical elements (words and phrases) of which they were composed. While particular terms might be characteristic of certain subjects, such terms are too few and infrequent to serve as reliable indicators for each of a large set of diverse documents. The tendency of clusters of terms to occur together in certain patterns, however, is much more diagnostic of subject matter, and it was these patterns that Bartell's program was intended to detect.
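The idea of looking for terms that tend to occur together can be illustrated with a minimal sketch. This is not Bartell's program, which is not reproduced here, and it does not perform multidimensional scaling; it shows only the kind of co-occurrence count that such an analysis might start from. The function name, the crude tokenization, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how many documents each unordered pair of terms appears in together.

    Hypothetical helper for illustration; not taken from the original software.
    """
    pair_counts = Counter()
    for text in documents:
        # Crude tokenization (an assumption): split on whitespace,
        # strip common punctuation, and lowercase.
        terms = {word.strip('.,;:"()').lower() for word in text.split()}
        terms.discard("")
        # Each pair is counted once per document, in alphabetical order.
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return pair_counts

# Invented sample sentences, not drawn from any of the sources named here.
docs = [
    "The bank raised interest rates",
    "Interest rates fell as the bank cut lending",
    "The river bank flooded after heavy rain",
]
counts = cooccurrence_counts(docs)
```

In this toy sample "bank" by itself appears in all three documents and so says little about subject, while the pairing of "bank" with "interest" or with "river" separates the financial sentences from the geographical one.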

Meanwhile, Clarke discovered the Propaedia. This volume of the Encyclopædia Britannica consisted chiefly of an elaborate "Outline of Knowledge." The outline provided precisely what was needed in order to construct a tool that would not only cluster but classify documents, namely a logical and thorough taxonomy of topics. As part of his study of the Propaedia, Clarke got in touch directly with the Editorial group in Chicago, from whom he learned that each of the 64,000-odd articles in the Micropaedia portion of the encyclopedia had been classified against the Propaedia outline. One or more tags denoting the classification(s) of each article were part of each article's text file in PSEdit.

A training set of about 45,000 tagged Micropaedia documents (some 70% of the whole) was selected and submitted to analysis. The neural network in effect "learned" how to recognize the semantic characteristics of articles in various categories. The first test of the software's efficacy as a classifier was then to submit articles from the remaining 30%, without their Propaedia tags, and compare the software's suggested classification with the editors'. Accuracy at the top level of classification - the ten principal parts of the Propaedia - was about 96%; it decayed progressively at successively deeper layers, falling to about 75% at the fourth level. In subsequent tests and demonstrations the software's output was generally held to three levels.
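The comparison of the software's suggestions with the editors' tags, level by level, might be sketched as follows. This is an illustration under stated assumptions, not the evaluation code actually used: each article is assumed to carry a single classification path (the real articles could carry more than one tag), and the function name and the sample paths are invented.

```python
def accuracy_by_level(true_paths, predicted_paths, max_depth):
    """Fraction of documents whose predicted path agrees with the true path
    down to each depth, for depths 1 through max_depth.

    Hypothetical helper for illustration only.
    """
    results = []
    for depth in range(1, max_depth + 1):
        # A prediction counts as correct at this depth only if every
        # level down to and including this one matches.
        matches = sum(
            1
            for true, predicted in zip(true_paths, predicted_paths)
            if true[:depth] == predicted[:depth]
        )
        results.append(matches / len(true_paths))
    return results

# Invented classification paths; these are not Propaedia data.
editors = [(1, 2, 3), (1, 2, 4), (5, 1, 1), (7, 3, 2)]
software = [(1, 2, 3), (1, 3, 4), (5, 1, 2), (7, 3, 2)]
scores = accuracy_by_level(editors, software, 3)
```

Because a match at a deeper level requires a match at every level above it, a measure of this kind can only hold steady or fall as the depth increases, which is consistent with the decline reported above.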

The next and far more demanding test was to submit text documents from other sources to the classifier. In the course of many months of development, sample data were obtained from the Los Angeles Times, the New York Times, and the yet-unpublished new edition of the Grove Dictionary of Music and Musicians. Results varied with source but were uniformly good, sometimes excellent. The Grove articles, written in a style similar to that of Britannica, were classified with great accuracy. Articles from the New York Times varied in style and vocabulary; news articles were handled very well, while some feature columns proved hard to place in topic space. Editorial inspection of some of these showed that they tended to be musings about this and that, with no consistent or firmly stated subject; the classifier, interestingly, tended to place these in the category Literature.

The system was dubbed "Mortimer" by Kester, in honor of the originator of the Propaedia. An essential element of Mortimer was a method for displaying the results of the calculations performed by the software. In essence, for each analyzed document, a probability score was calculated for each possible topic, in this case 176 topics (the number of third-level rubrics in the Propaedia). These scores constituted a vector in a 176-dimensional space. That vector, and those of all other documents under analysis, had then to be projected onto the two-dimensional space of the computer screen. Distortion of one sort or another was inescapable, and there was no one correct answer; and, indeed, a lively debate over how to effect the visualization persisted for some months. The actual user interface adopted for demonstrations within Britannica and to outside parties, designed by John Dimm, showed the topic areas as large, irregular colored regions. (The shape, size, and relative positions of these regions had previously been calculated in a similar fashion from the semantic statistics for all training documents in each classification.) Clicking on one of these areas would bring it to the foreground and reveal the next lower level of classification. By this means the user was able, in effect, to navigate down through the outline.
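One simple way to bring a many-dimensional score vector down to a point on a flat map is to give each topic a fixed position and to place the document at the score-weighted average of those positions. The sketch below shows that approach only as an illustration of the problem; it is not claimed to be the projection Mortimer used, and the function name, topic names, coordinates, and scores are all hypothetical. Three topics stand in for the 176.

```python
def project_to_map(topic_scores, topic_positions):
    """Place a document at the score-weighted average of its topics' positions.

    topic_scores: dict mapping topic name to a nonnegative score.
    topic_positions: dict mapping topic name to an (x, y) map position.
    Both structures are assumptions made for this illustration.
    """
    total = sum(topic_scores.values())
    if total == 0:
        raise ValueError("document has no nonzero topic score")
    x = sum(score * topic_positions[topic][0]
            for topic, score in topic_scores.items()) / total
    y = sum(score * topic_positions[topic][1]
            for topic, score in topic_scores.items()) / total
    return (x, y)

# Invented positions and scores for three stand-in topics.
positions = {
    "economics": (0.0, 0.0),
    "psychology": (4.0, 0.0),
    "geology": (0.0, 4.0),
}
doc_scores = {"economics": 0.5, "psychology": 0.25, "geology": 0.25}
point = project_to_map(doc_scores, positions)
```

The distortion mentioned above is easy to see even in so small a case: quite different score vectors can land on the same point, so the position of a dot can never recover the full vector behind it.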

In March 1995 Bartell and Clarke filed a patent application for Mortimer; their patent for "Method and System for two-dimensional visualization of an information taxonomy and of text documents based on topical content of the documents" was granted in April 1997.

When Mortimer was given a search query, it would determine which documents in the set were relevant, assign a topic value to each, and place dots - indicators of "hits" - in the topic map. This ability was especially impressive when the query term was polysemous, i.e., might have any of several quite different meanings. Kester's favorite term for demonstrating the power of Mortimer was "depression," which might refer to economics, psychology, geology, or other matters. Here is the top level of Mortimer's display of articles in Britannica relevant to the query term "depression." The green dots indicate Britannica articles. The one judged most relevant, because its title matches the query term, has its title shown as well.

If the user decided that the sense of "depression" in which he was interested lay in the area of Human Life, he could click on that area to produce a second topic map at the next level down; another click would bring the view down to the third level. Note that article titles are revealed as part of this navigation downward. Alternatively, had the area of interest been geological depression, the user might have clicked down from The Earth to this third level display.

This ability to detect and distinguish different senses of search terms is called "disambiguation." No ordinary search engine could do this; no ordinary search engine can do it today. More remarkable still, Mortimer could run over several distinct databases of text documents simultaneously. Hits from different sources appeared as differently colored dots on the map. (Unfortunately, no screen shot of this is available.) It was this aspect of the tool that underlay the third of the imagined extensions of Britannica Online, the notion of Gateway Britannica.

©2003 by Robert McHenry