5.4 CLASSIFYING THE REMAINDER OF THE LEXICON
Iterative clustering has now identified a small subset of words from the lexicon. We will call this subset the CORE GROUPING (see Figure 5.9 at the end of this section). In Chapter 4, we revised the goal of our grouping algorithm: to seek a "well-behaved" subset of the language that can be used as the basis for categorizing the remainder of the language. To verify that the CORE GROUPING meets this goal, we attempted to classify the remainder of the lexicon using the CORE GROUPING.
We use the contexts associated with the CORE words to classify non-CORE words. In particular, we scan through the LEXTRIPLES database, identifying all of the original data contexts associated with each CORE word. We will call these the CORE contexts. Since we are looking for contexts that might be used to classify OTHER words, we limit the search to shared contexts. Each CORE context is stored in a new database, TAGFEATURES, along with the CORE word's group label. Note that each shared context is associated with more than one word, and there is no guarantee that all CORE words associated with a context will belong to the same CORE group. In fact, it is highly likely that there will be multiple entries in TAGFEATURES for many of the CORE contexts.
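For concreteness, the sketches that follow adopt a purely hypothetical representation (none of these names come from the actual system): a context is a tuple of neighboring tokens with a "_" marking the target word's slot, lextriples yields (word, context) pairs, core_groups maps each CORE word to its group label, and shared_contexts is the set of contexts associated with more than one word. A first pass over the data might then look like this:

    from collections import defaultdict

    def collect_core_contexts(lextriples, core_groups, shared_contexts):
        # Map each shared CORE context to the set of CORE group labels
        # observed in it.  A context that gathers more than one label
        # produces multiple TAGFEATURES entries, as noted above.
        entries = defaultdict(set)
        for word, context in lextriples:
            if word in core_groups and context in shared_contexts:
                entries[context].add(core_groups[word])
        return entries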
By storing the CORE contexts in their "raw" data form, we are able to classify a non-CORE word that appears in exactly the same context as a CORE word. We also want to be able to use some generalization here. In particular, we found that the generalization embodied in our abstraction process was a key component of the iterative clustering procedure. There, we generalized by replacing all grouped words with their group labels. Similarly, here we will replace any CORE words in the CORE contexts with their group labels. To separate the two types of CORE contexts, we mark the "raw data" CORE contexts with a DATA flag and the "abstract" CORE contexts with an ABSTRACT flag. Thus, the TAGFEATURES table actually contains two databases: the DATA CORE contexts and the ABSTRACT CORE contexts.
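Under the assumed representation, the abstraction step is a simple substitution; the helper below is a sketch, not the system's actual code:

    def abstract_context(context, core_groups):
        # Replace every CORE word in the context with its group label,
        # leaving non-CORE tokens (and the "_" slot) untouched.
        return tuple(core_groups.get(token, token) for token in context)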
Since we are dealing with real-world data, we must be concerned with noise. In particular, we are concerned with spurious CORE contexts: isolated shared contexts that a CORE word shares with no other CORE word in the same group. In the iterative clustering procedure, we found that the effects of noisy shared contexts could be neutralized by requiring that words share a minimum number of contexts (2 seemed to be a sufficient number). Here, we will attempt to neutralize the effect of noisy CORE contexts by retaining only those CORE context entries for which the context is associated with a minimum number of CORE words (again, 2 seems to be a sufficient number). We do this by adding a word count to each CORE context entry. Each time a given CORE context appears with a new word from a given CORE group, we increment the counter. Algorithm 5.1 actually builds the TAGFEATURES table. When the TAGFEATURES table is completed, we scan through it, eliminating all entries with too low a word count (algorithm not provided).
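The following sketch, in the spirit of Algorithm 5.1 and still under the assumed representation, folds the word counting and pruning into the construction of TAGFEATURES. Tracking the set of distinct CORE words per <context, flag, group> entry implements the rule that only a new word increments the counter:

    from collections import defaultdict

    def build_tagfeatures(lextriples, core_groups, shared_contexts,
                          min_words=2):
        # support[(flag, context, group)] holds the distinct CORE words
        # from `group` seen in `context`; its size is the word count.
        support = defaultdict(set)
        for word, context in lextriples:
            if word not in core_groups or context not in shared_contexts:
                continue
            group = core_groups[word]
            support[('DATA', context, group)].add(word)
            support[('ABSTRACT', abstract_context(context, core_groups),
                     group)].add(word)
        # Eliminate entries with too low a word count, reorganizing the
        # survivors as flag -> context -> {group: count} for fast lookup.
        tables = {'DATA': defaultdict(dict), 'ABSTRACT': defaultdict(dict)}
        for (flag, context, group), words in support.items():
            if len(words) >= min_words:
                tables[flag][context][group] = len(words)
        return tables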
Table TAGFEATURES has all the information necessary for classifying the remainder of the lexicon. The classifying procedure involves comparing the contexts associated with a non-CORE word against the entries in TAGFEATURES. As CORE contexts are identified, the CORE groups corresponding to those contexts need to be recorded. If all the CORE contexts identified are associated with only one CORE group, then the non-CORE word will be classified in that group.
However, quite frequently the contexts are associated with multiple CORE groups. This can happen if: (1) one of the CORE contexts is associated with more than one CORE group, or (2) separate CORE contexts are associated with separate groups. To resolve this situation, it is necessary to identify the "best" CORE group for classifying the non-CORE word (or to decide not to classify the word). It seems wise to assign the word to the most frequently occurring CORE group over its set of contexts. So we "tally" the CORE groups encountered during the scan through the non-CORE word's contexts.
But how should we do this tallying? We could simply add 1 each time a CORE group is encountered. But this seems wrong. Some CORE contexts are better than others. A CORE context that is associated with 10 words from its CORE group should carry more weight than a CORE context that is associated with 2 words from its CORE group. A simple method for incorporating this "weighting" is to add the word count associated with the <CORE context, CORE group> pair to the appropriate CORE group accumulator each time a CORE context is matched.
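In code, the weighted tally might look like the following sketch, where table is one of the flag-specific tables built above:

    from collections import Counter

    def tally_groups(contexts, table):
        # Each matched CORE context adds its stored word count to the
        # accumulator of every CORE group it is associated with.
        scores = Counter()
        for context in contexts:
            for group, count in table.get(context, {}).items():
                scores[group] += count
        return scores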
Finally, there is the issue of deciding whether or not to classify a word. We have decided to be conservative, and only to classify a word if its contexts "strongly" select a CORE group. This is determined by summing all of the CORE-group weighted scores associated with the word. Then, if there is a CORE group that has greater than 50% of this total, the non-CORE word is classified in that group.
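The decision rule then reduces to a majority test over the tallied scores; a minimal sketch:

    def select_group(scores):
        # Classify only if a single CORE group holds more than half of
        # the total weighted score; otherwise leave it unclassified.
        total = sum(scores.values())
        for group, score in scores.items():
            if score > total / 2:
                return group
        return None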
Algorithm 5.2 actually performs this classification. Note that we maintain separate CORE-group scores for DATA and ABSTRACT contexts. Also note that we only need to check the DATA CORE contexts if the non-CORE word's context is a shared context. However, all of the contexts must be checked against the ABSTRACT CORE contexts. The DATA contexts are given primacy because they are actual contexts that words have occurred in. In this sense, they deal with what is really known about the words. The ABSTRACT contexts, on the other hand, are generalizations. An ABSTRACT context match may occur even though NO word has ever actually occurred in that context. This is good because it allows us to classify words that occur infrequently (even words with only 1 occurrence). But since the DATA contexts really carry more information, we will attempt to classify a word using them first. If this fails, we will then attempt to classify the word based on its ABSTRACT contexts. If both fail, the word is not classified.
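Putting the pieces together gives a sketch in the spirit of Algorithm 5.2: a DATA pass restricted to the word's shared contexts, followed by an ABSTRACT fallback over all of its contexts:

    def classify_word(word_contexts, tables, core_groups, shared_contexts):
        # DATA pass: only shared contexts can match a raw CORE context.
        group = select_group(tally_groups(
            [c for c in word_contexts if c in shared_contexts],
            tables['DATA']))
        if group is not None:
            return group
        # ABSTRACT fallback: abstract every context and try again; this
        # can succeed even for a word with a single occurrence.
        return select_group(tally_groups(
            [abstract_context(c, core_groups) for c in word_contexts],
            tables['ABSTRACT']))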
Although our requirement - that a classifying CORE group must receive in excess of 50% of the summed weighted scores over all CORE contexts encountered - seems quite restrictive, the classifying procedure does quite well in practice. In the primary corpus, the 267 CORE words generated a set of CORE contexts that actually classifies an additional 1629 words. This resulted in a 6-fold increase in the size of the grouped portion of the lexicon (to 1896 words - 18.87% of the lexicon). Further, the number of errors rises only to 128 (6.75% of the grouped words). Figure 5.9 shows the CORE GROUPING for the primary corpus, and Figure 5.10 shows the word groupings generated by applying our classification process to that CORE GROUPING. In the next section, we will discuss these results in more detail.