3.3 WORD ABSTRACTION
Harris [1982] indicated that we should group words based on the classifications of their predecessor and successor words. This, of course, implies a bootstrapping problem. However, our word groups in Figure 3.4 could be grouped in the manner that Harris suggested. Using our three main word groups, we see the following patterns:
At this higher, more "abstract" characterization of the data, the two verb sets have identical feature sets: <NOUN, NOUN>. At this abstract level, therefore, the two verb groups would combine. The noun group, however, shares no features with the verb groups. Thus, we get a clean split between nouns and verbs for this "toy" language.
The implementation of this abstraction step is again quite simple. As indicated in Algorithm 3.5, we scan table LEXTRIPLES, retaining all entries that have at least one of their three words grouped. It is important to allow words that were not grouped on the initial clustering pass to remain in the database. Recall that the word "eat" was not grouped with any other word on the first pass. It would be impossible to group it with the other verbs if it were not retained during the abstraction process. Although it is not evident in this "toy" language, this step is a filtering mechanism. In a more realistic language, there would be a number of entries removed from the table LEXTRIPLES because NONE of their component words would be grouped.
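The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not a transcription of Algorithm 3.5. It assumes that each LEXTRIPLES entry is a tuple of three words and that the current grouping is held in a dictionary from word to group label. The names `filter_triples` and `group_of`, and every word except "eat", are hypothetical.

```python
def filter_triples(triples, group_of):
    """Keep each triple in which at least one of the three words is grouped."""
    return [t for t in triples if any(w in group_of for w in t)]


# Hypothetical grouping from a first clustering pass (assumed data).
group_of = {"dogs": "NOUN", "cats": "NOUN", "chase": "VERB1", "see": "VERB1"}

# Hypothetical LEXTRIPLES entries, each a tuple of three words.
triples = [
    ("dogs", "chase", "cats"),  # all three words are grouped: kept
    ("cats", "eat", "mice"),    # only "cats" is grouped: kept, so "eat" survives
    ("the", "old", "man"),      # no word is grouped: removed
]

kept = filter_triples(triples, group_of)
```

Note that the ungrouped word "eat" remains in the retained entries, so it can still be grouped on a later pass.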
The second part of the abstraction process is to replace all grouped words by their group label. For example, in the "toy" language each noun occurrence would be replaced by the label NOUN. This has the effect of shrinking the lexicon. Further, this reduction in "vocabulary" will necessarily affect the number of distinct words appearing in the LEXTRIPLES entries. Thus, we will get a reduction in the number of features associated with words at this "abstracted," more general level.
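The label replacement just described might look like the following Python sketch. It assumes the same representation as before: three-word tuples and a word-to-label dictionary. The function name `abstract_triples` and the sample words other than "eat" are hypothetical. Collapsing identical entries after replacement is also an assumption made here; an implementation that needs occurrence counts would keep the duplicates.

```python
def abstract_triples(triples, group_of):
    """Replace each grouped word by its group label; leave ungrouped words alone.

    Identical triples produced by the replacement are collapsed (an assumption),
    keeping the order of first occurrence.
    """
    replaced = (tuple(group_of.get(w, w) for w in t) for t in triples)
    return list(dict.fromkeys(replaced))


# Hypothetical grouping and entries (assumed data).
group_of = {"dogs": "NOUN", "cats": "NOUN", "chase": "VERB1"}
triples = [
    ("dogs", "chase", "cats"),
    ("cats", "chase", "dogs"),
    ("dogs", "eat", "cats"),
]

abstracted = abstract_triples(triples, group_of)
# The first two entries both become ("NOUN", "VERB1", "NOUN"), so three
# entries shrink to two, and "eat" now has the single context <NOUN, NOUN>.
```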
This abstraction process will have two effects. First, words become closer because they have fewer features. This reduces the size of the denominator in the Tanimoto coefficient (which will increase the coefficient - similarity - and decrease the distance). Second, words become closer because more general features lead to more sharing of features. Thus, the numerator of the Tanimoto coefficient will become larger. Both of these effects are evident above. The abstraction process has reduced the set of features for the two verb sets to a single, shared context.
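A small numeric illustration of both effects follows. It assumes the set form of the Tanimoto coefficient, the number of shared features divided by the number of distinct features in either set, and treats each feature as a (predecessor, successor) pair. The helper names and the sample nouns are hypothetical, and the exact form of the coefficient used elsewhere in this work may differ.

```python
def tanimoto(a, b):
    """Set form of the Tanimoto coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


def abstract_features(features, group_of):
    """Replace grouped words inside each (predecessor, successor) feature."""
    return {tuple(group_of.get(w, w) for w in f) for f in features}


# Hypothetical noun grouping and word-level contexts of two verb groups.
group_of = {"dogs": "NOUN", "cats": "NOUN", "mice": "NOUN"}
verbs_a = {("dogs", "cats"), ("cats", "dogs")}
verbs_b = {("dogs", "mice"), ("cats", "mice")}

before = tanimoto(verbs_a, verbs_b)  # no shared features out of four
after = tanimoto(
    abstract_features(verbs_a, group_of),
    abstract_features(verbs_b, group_of),
)  # both sets collapse to the single feature ("NOUN", "NOUN")
```

Here the denominator falls from four distinct features to one, and the numerator rises from zero shared features to one, so the coefficient moves from 0.0 to 1.0.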
Thus, we are now using five algorithms to group words. The actual steps in our procedure are: