5.7 HOW GOOD IS THIS EXTENDED CLASSIFICATION?
Our classification routine has generated a new set of word classes for our primary corpus. Not only do we have more words that have been grouped, but they are distributed over a smaller number of groups. To see if this classification is any better than the one described in section 5.5, we will examine how well it covers our word-occurrence and shared-context distribution classes.
As before, we will concentrate on the shared-context distribution classes. Now we have classified 100% of the words that have 10 or more shared contexts. Prior to extending the CORE, we had 100% coverage of words with 20 or more shared contexts. So we see an improvement here. Further, there are now only 9 words with 5 or more shared contexts that are not classified (down from 14 before). When we consider the set of groupable words (those words with 2 or more shared contexts) the classification process increased the percentage of grouped words from 72% to 75%. We seem to be seeing a small improvement in performance.
Recall that one of the goals of the extension process was to improve classification performance on words that have fewer than two shared contexts. We have had some success at improving our ability to classify the words with one shared context (adding 52 grouped words to raise the percentage grouped to 37%). For words with no shared contexts, we have more than tripled the number of words grouped (from 30 to 98), but this still represents less than 2% of the words of this type.
As far as errors (that is, the noun-verb mismatches detected) are concerned, we see a quantitative and qualitative improvement. The number of errors dropped from 128 (6.75% of the 1,896 words grouped prior to CORE extension) to 80 (3.90% of the 2,051 words grouped using the extended CORE). The large concentration of errors associated with the large noun group labeled MAJORITY has disappeared. That group's composition has changed to: 729 nouns (versus 687 before), 6 verbs (versus 88), and 41 words from other categories (versus 49). Thus, this group seems to be a qualitatively better grouping than we had before CORE extension.
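The error percentages above can be checked directly from the counts given in the text. The following is only an illustrative sketch; the function name error_rate and the variable names are ours and are not part of the classification routine.

```python
def error_rate(errors, grouped):
    """Noun-verb mismatches as a percentage of the words grouped."""
    return 100.0 * errors / grouped

# Counts taken from the text: (mismatches, words grouped)
before = error_rate(128, 1896)   # prior to CORE extension
after = error_rate(80, 2051)     # using the extended CORE

# Composition of the MAJORITY group: (nouns, verbs, other)
majority_before = (687, 88, 49)
majority_after = (729, 6, 41)

def verb_share(group):
    """Percentage of a group's words that are verbs."""
    nouns, verbs, other = group
    return 100.0 * verbs / (nouns + verbs + other)

print(f"error rate before: {before:.2f}%  after: {after:.2f}%")
print(f"MAJORITY verb share before: {verb_share(majority_before):.2f}%  "
      f"after: {verb_share(majority_after):.2f}%")
```

On these counts the share of verbs in the MAJORITY group falls from roughly 10.7% to under 1%, which is the qualitative change described above.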
The two largest concentrations of error in the classification yielded by CORE extension are in the groups labeled AT and HAVE. The AT group is a MISC group that has a rating of -0.097. It is composed of 36 verbs, 22 nouns, and 87 miscellaneous words (including 18 prepositions, 17 gerunds, 10 adverbs, 9 adjectives, and 4 WH-words). Thus, we have a "borderline" group, with no dominant category. If we look at the CORE group that was the basis for this grouping we find the following:
Notice that all the nouns include the punctuation characters ".^". This is the LOB Corpus symbol string for end-of-sentence. These symbols are attached to these nouns because of our initial decision to attach all punctuation to the preceding word. Thus, all of these nouns are sentence-final. Looking at the 22 nouns the classification puts in this group, we found that 18 of them were followed by either the end-of-sentence sequence, a comma, or an opening quote. In most cases, then, these words would be followed by a noun phrase (either sentence-initial, clause-initial, or quote-initial). This is also true of the other words in this group's CORE. So there is a coherent syntactic pattern to the words appearing in this group.
The HAVE group has a rating of -0.700, with 165 verbs, 25 nouns, and 10 miscellaneous words. The CORE group that generated this classification had the highest error count of the extended CORE groups (9 noun-verb mismatches out of a total of 22 for the entire CORE GROUPING). This suggests that this group replaced the MAJORITY group as the "worst" CORE group. But, qualitatively, HAVE is not nearly as bad as MAJORITY was. The CORE contexts associated with HAVE classified an additional 144 words (a 257% increase) while the error grew by 177% (from 9 to 25). (The comparable numbers for the pre-CORE extension MAJORITY group were a 956% increase in grouped words and a 1660% increase in error.) With its error growing more slowly than the number of grouped words, the CORE contexts associated with HAVE seem to be doing a reasonably good job of classification. Thus, our worst CORE group after CORE extension seems much better than our worst CORE group before CORE extension.
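The figures quoted for the AT and HAVE groups can be reproduced from the counts in the text. The sketch below assumes that a group's rating is (nouns - verbs) / total words. This is an inference from the numbers reported here (it yields -0.097 for AT and -0.700 for HAVE), not a restatement of the definition given earlier. The function names are ours, and the size of the HAVE CORE (56 words) is derived as 200 - 144.

```python
def rating(nouns, verbs, other):
    """Assumed rating: (nouns - verbs) / total. Inferred from the text's figures."""
    return (nouns - verbs) / (nouns + verbs + other)

def pct_increase(before, after):
    """Percentage growth from 'before' to 'after'."""
    return 100.0 * (after - before) / before

at_rating = rating(nouns=22, verbs=36, other=87)
have_rating = rating(nouns=25, verbs=165, other=10)

# HAVE: 200 words in the group, 144 of them added by classification,
# so the CORE itself held 200 - 144 = 56 words (derived, not quoted).
have_total = 165 + 25 + 10
have_core = have_total - 144
word_growth = pct_increase(have_core, have_total)
error_growth = pct_increase(9, 25)

# Ratio of error growth to word growth; below 1 means error grows more slowly.
have_ratio = error_growth / word_growth
majority_ratio = 1660 / 956   # percentages as quoted for pre-extension MAJORITY

print(f"AT rating {at_rating:.3f}, HAVE rating {have_rating:.3f}")
print(f"HAVE word growth {word_growth:.1f}%, error growth {error_growth:.1f}%")
print(f"error/word growth: HAVE {have_ratio:.2f}, MAJORITY {majority_ratio:.2f}")
```

The error growth computes to just under 178%, so the 177% quoted in the text appears to be truncated rather than rounded. The ratio makes the comparison explicit: for HAVE, error grows at roughly 0.7 times the rate of the grouped words, whereas for the pre-extension MAJORITY group it grew at roughly 1.7 times that rate.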
The HAVE group has 4 words that are used ambiguously as nouns and verbs: FINISH, MOVE, REVIEW, and TURN. Interestingly, all of these words were classified in the MAJORITY noun group prior to extending the CORE. Further, 20 of the 46 verbs that are in the CORE of the HAVE group were classified in the MAJORITY noun group prior to CORE extension. So, it appears that the extension process has shifted many of the verbal CORE contexts that were previously associated with the pre-extension MAJORITY group to the post-extension HAVE group.
All of the above indicates that CORE extension has significantly improved the quality of word classification. However, we are still faced with very poor performance on the classification of words with fewer than 2 shared contexts. As a final attempt to improve this performance, we will look at the possibility of "splitting" words with ambiguous usage.