5.6 EXPANDING THE CORE
The concept underlying our approach to expanding the CORE GROUPING is that the important element in the classification process is the set of CORE contexts, not the CORE words. The CORE words allowed us to identify an initial set of CORE contexts. Once these contexts were identified, they were the critical factor in deciding whether a word entered a given class. The key to improved classification therefore seems to be the identification of a better set of CORE contexts.
In emphasizing CORE contexts, we are following Pinker's theory of language acquisition [PINKER84]. His system bootstraps itself with a set of initial word categories that are temporary. Words are assigned permanent categories only when they appear in the proper "structure-dependent distribution." We treat our initial CORE GROUPING of Figure 5.9 as an initial temporary word classification. Our CORE contexts provide the critical structure dependencies that determine whether a word should continue (or enter, or leave) the CORE GROUPING.
We begin the CORE expansion process by examining the set of CORE contexts that were stored in the TAGFEATURES table during classification. First we remove all abstract contexts from the table. Then, from the set of DATA CORE contexts, we remove all contexts that are associated with more than one CORE group. Then we eliminate all remaining contexts that have only one associated word. We call the set of contexts remaining after these filtering steps the KEY contexts. Algorithm 5.3 performs these tasks.
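The three filtering steps described above can be sketched as follows. This is only an illustration, not Algorithm 5.3 itself: it assumes TAGFEATURES can be read as rows of (context, CORE group label, word, abstract flag), and the function name `key_contexts`, the row layout, the context strings, and the group labels G1 and G2 are all hypothetical.

```python
from collections import defaultdict

def key_contexts(tagfeatures):
    """Return {context: group} for the contexts that survive all three filters.

    tagfeatures: iterable of (context, core_group, word, is_abstract) rows.
    The row layout is an assumption made for this sketch.
    """
    groups = defaultdict(set)   # context -> CORE group labels seen with it
    words = defaultdict(set)    # context -> words seen in it
    for context, group, word, is_abstract in tagfeatures:
        if is_abstract:         # filter 1: drop abstract contexts
            continue
        groups[context].add(group)
        words[context].add(word)

    key = {}
    for context, labels in groups.items():
        if len(labels) != 1:            # filter 2: more than one CORE group
            continue
        if len(words[context]) < 2:     # filter 3: only one associated word
            continue
        key[context] = next(iter(labels))
    return key

# Toy data (invented for illustration only).
rows = [
    ("the _ of", "G1", "house", False),
    ("the _ of", "G1", "city", False),
    ("to _ the", "G2", "see", False),
    ("to _ the", "G1", "house", False),   # context shared by two groups
    ("will _ a", "G2", "see", False),     # context with a single word
    ("<abstract>", "G1", "house", True),
    ("<abstract>", "G1", "city", True),
]
print(key_contexts(rows))
```

In the toy data only "the _ of" survives: "to _ the" is tied to two groups, "will _ a" has a single associated word, and the abstract context is dropped first.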
Next, we use the KEY contexts to locate all words that occur in those contexts. We place these words, along with their associated KEY contexts, in the table TAGWORDS. We then scan TAGWORDS, eliminating all words associated with more than one CORE group (recall that TAGFEATURES contains a CORE GROUP label for every KEY context). Finally, we scan through TAGWORDS one more time, eliminating all <context, word> pairs for contexts that have only one associated word. Algorithm 5.4 performs these tasks.
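A minimal sketch of the TAGWORDS construction, under stated assumptions: the KEY contexts are given as a mapping from context to CORE group label, corpus occurrences are given as (context, word) pairs, and the final scan is read as removing every pair whose context is left with only one word. The name `build_tagwords` and the toy data are hypothetical, and this is not Algorithm 5.4 itself.

```python
from collections import defaultdict

def build_tagwords(key, occurrences):
    """Return the set of (context, word) pairs that survive both scans.

    key: {context: core_group} for the KEY contexts (assumed layout).
    occurrences: iterable of (context, word) pairs observed in the corpus.
    """
    # Collect every word that occurs in some KEY context.
    tagwords = {(c, w) for c, w in occurrences if c in key}

    # Scan 1: drop words associated with more than one CORE group.
    word_groups = defaultdict(set)
    for c, w in tagwords:
        word_groups[w].add(key[c])
    tagwords = {(c, w) for c, w in tagwords if len(word_groups[w]) == 1}

    # Scan 2: drop pairs whose context now has only one associated word.
    context_words = defaultdict(set)
    for c, w in tagwords:
        context_words[c].add(w)
    return {(c, w) for c, w in tagwords if len(context_words[c]) >= 2}

# Toy data (invented for illustration only).
key = {"the _ of": "G1", "will _ a": "G2", "to _ it": "G2"}
occurrences = [
    ("the _ of", "house"), ("the _ of", "city"), ("the _ of", "run"),
    ("will _ a", "run"), ("will _ a", "see"),
    ("to _ it", "see"),
    ("of _ and", "house"),   # not a KEY context, ignored
]
tagwords = build_tagwords(key, occurrences)
print(sorted(tagwords))
```

Here "run" is removed because it appears under both G1 and G2, after which "will _ a" and "to _ it" each hold only "see" and are removed as well.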
When table TAGWORDS has been completely processed, it contains the next CORE. Algorithm 5.5 transfers this CORE to the lexicon. Note that the old CORE is completely erased prior to the new CORE transfer. This allows for the non-monotonicity we mentioned in the previous section. Words stay in the CORE only if they managed to survive the sequence of filters that were applied to TAGFEATURES and TAGWORDS. After the new CORE has been transferred, the lexicon is scanned and any word groups that have fewer than two members are removed.
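The transfer step might look like the sketch below. It assumes the lexicon is a mapping from word to CORE group label (or None for ungrouped words); `transfer_core` and the toy entries are hypothetical, and the code illustrates the erase-then-transfer behavior rather than reproducing Algorithm 5.5.

```python
from collections import Counter

def transfer_core(lexicon, tagwords, key):
    """Return a new lexicon holding only the CORE derived from TAGWORDS.

    lexicon: {word: core_group or None} (assumed layout).
    tagwords: set of (context, word) pairs left after filtering.
    key: {context: core_group} for the KEY contexts.
    """
    # Erase the old CORE completely, so words can leave as well as enter.
    new_lexicon = {word: None for word in lexicon}

    # Transfer the new CORE: each word takes the group of its KEY context.
    for context, word in tagwords:
        new_lexicon[word] = key[context]

    # Remove any word group that has fewer than two members.
    sizes = Counter(g for g in new_lexicon.values() if g is not None)
    return {
        word: (group if group is not None and sizes[group] >= 2 else None)
        for word, group in new_lexicon.items()
    }

# Toy data (invented for illustration only).
old_lexicon = {"house": "G1", "city": "G1", "run": "G1", "see": "G2", "the": None}
key = {"the _ of": "G1", "will _ a": "G2"}
tagwords = {("the _ of", "house"), ("the _ of", "city"), ("will _ a", "see")}
new_lexicon = transfer_core(old_lexicon, tagwords, key)
print(new_lexicon)
```

In the toy run "run" leaves the CORE because it did not survive filtering, and "see" is dropped because its group would have only one member.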
When this expansion procedure was applied to the CORE GROUPING, the new CORE had 371 words distributed over 39 word groups with an error count of 13 words (3.50% of the grouped words). Most of this error was located in the word group labeled AT, which had 4 nouns and 4 verbs. This accounted for 8 of the 13 errors. Significantly, the largest word group, which was labeled MAJORITY, had its error count reduced from 5 to 1. Thus, almost all of the verbs in that group had been removed. The intent of this procedure was to isolate a reliable subset of the classified words. These words would then form a new, hopefully improved, CORE.
When we applied the classification procedure to this new expanded CORE, we found that fewer words were classified. However, the quality of the classification seemed to be better: the 1,742 words that were grouped produced only 98 noun-verb mismatches, a 5.63% error rate. This improvement was encouraging, so we decided to iterate the expansion procedure until no more changes occurred (that is, until a stable CORE was produced).
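Iterating until a stable CORE is produced amounts to a fixed-point loop, sketched below. The function `iterate_until_stable`, the `max_rounds` safety bound, and the toy `toy_round` step are assumptions made for illustration; in the actual system one round would consist of classification followed by the expansion procedure.

```python
def iterate_until_stable(core, expand_once, max_rounds=50):
    """Apply expand_once repeatedly until the CORE stops changing.

    core: any value comparable with == (here, a frozenset of words).
    expand_once: function mapping one CORE to the next (hypothetical).
    max_rounds: safety bound, an assumption not taken from the text.
    Returns (stable_core, history), where history lists each distinct CORE.
    """
    history = [core]
    for _ in range(max_rounds):
        new_core = expand_once(core)
        if new_core == core:        # no more changes: a stable CORE
            break
        history.append(new_core)
        core = new_core
    return core, history

# Toy stand-in for one classify-and-expand round: a word stays only if
# its partner word is still in the CORE (invented for illustration).
partners = {"a": "b", "b": "a", "c": "d", "d": "e", "e": "missing"}

def toy_round(core):
    return frozenset(w for w in core if partners[w] in core)

stable, history = iterate_until_stable(frozenset(partners), toy_round)
print(sorted(stable), len(history))
```

Keeping the history of intermediate COREs makes it easy to report how the derivation evolved from round to round.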
The above procedure iterated 4 times before it terminated. The history of that derivation and the resulting stable CORE are shown in Figure 5.14. The resulting classification is shown in Figure 5.15.