5.8 WORD SPLITTING
In section 5.6 we discussed our algorithm for expanding the CORE. A key component of that algorithm was that we limited our attention to words that were associated with only one CORE group. In this section we look at words that are associated with more than one CORE group. If our CORE groups are isolating key usage-related contexts, as we believe they are, then words associated with multiple CORE groups are subject to ambiguous usage.
As a preliminary experiment, we have created a very conservative algorithm (not provided here) that selectively splits very few words (a total of 26 words over 3 passes through the splitting algorithm). We scan through the INACTIVE words in the TAGWORDS table (recall that the active words formed the new CORE extension), looking for words that predominantly belong to one CORE group but have some CORE contexts from other groups. Our metric for selecting words is the same metric that we used in the classification algorithm: the word must have more than 50% of the sum of its weighted-context scores from one CORE group. If a word passes this test, then the contexts associated with its predominant CORE group are split off from the word and assigned to a new lexical entry.
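The selection test and the split itself can be illustrated with a minimal sketch. This is not the algorithm used in the experiment (which is not provided here); it only illustrates the 50% rule under stated assumptions. The function names, the (context, group, weight) tuple layout, and the "word/GROUP" naming of the new lexical entry are all hypothetical, and the sketch assumes every context passed in is a CORE context already labeled with its group.

```python
from collections import defaultdict

def dominant_group(contexts, threshold=0.5):
    """Return the CORE group holding more than `threshold` of the word's
    summed weighted-context score, or None if no group qualifies.

    `contexts` is a list of (context_id, core_group, weight) tuples;
    this layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    for _context_id, group, weight in contexts:
        totals[group] += weight
    grand_total = sum(totals.values())
    # A word tied to a single group is not ambiguous, so it is never split.
    if grand_total <= 0 or len(totals) < 2:
        return None
    group, score = max(totals.items(), key=lambda item: item[1])
    return group if score > threshold * grand_total else None

def split_word(word, contexts, threshold=0.5):
    """Split off the dominant group's contexts into a new lexical entry.

    Returns ((new_entry_name, split_off_contexts), (word, remaining_contexts)),
    or None when the word fails the dominance test.
    """
    group = dominant_group(contexts, threshold)
    if group is None:
        return None
    split_off = [c for c in contexts if c[1] == group]
    remainder = [c for c in contexts if c[1] != group]
    # The "word/GROUP" name for the new entry is a hypothetical convention.
    return (f"{word}/{group}", split_off), (word, remainder)
```

In this sketch a word whose score is divided exactly evenly between two groups is left alone, since the test requires strictly more than 50%.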
This algorithm is applied to the stable CORE extension produced by our CORE extension procedure. The word splitting algorithm causes sufficient changes in the lexicon to allow the CORE extension procedure to extend the CORE further. In practice, the fourth attempt at splitting words found no words to split. The stable CORE GROUPING at that point is found in Figure 5.16, and its accompanying classification is found in Figure 5.17.
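The alternation between CORE extension and word splitting amounts to a loop that stops when a splitting pass finds nothing to split. The following is a rough sketch of that control flow only. The callables `extend_core` and `split_words`, their signatures, and the `max_rounds` safety limit are hypothetical placeholders, not the procedures of sections 5.6 and 5.8.

```python
def alternate_until_stable(lexicon, extend_core, split_words, max_rounds=20):
    """Alternate CORE extension and word splitting until no word is split.

    `extend_core(lexicon)` is assumed to return the lexicon after the CORE
    extension has stabilized; `split_words(lexicon)` is assumed to return
    (new_lexicon, number_of_words_split). Both are placeholders.
    """
    rounds = 0
    while rounds < max_rounds:
        lexicon = extend_core(lexicon)
        lexicon, words_split = split_words(lexicon)
        rounds += 1
        if words_split == 0:
            # Nothing changed in the lexicon, so the CORE grouping is stable.
            break
    return lexicon, rounds
```

The `max_rounds` limit is only a guard against non-termination in the sketch; the text reports that splitting stopped on its own.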
We should note that all the comments we made in section 5.7 concerning CORE extensions continue to hold. One point of concern is the high number of noun-verb mismatches associated with the group labeled AT. This count has risen to 43 (41% of the 105 mismatches for the entire classification). However, a closer examination of the nouns classified in this group indicates that 37 of the 43 are nouns followed by various forms of punctuation. Therefore, this group continues to isolate words that precede noun phrases.
So, it appears that even the very conservative word-splitting algorithm we have tried is capable of improving the classification performance. As we mentioned in section 5.5, the key to performance improvement is increasing the number of words in the CORE GROUPING. This will both expand the set of CORE contexts and expand the scope of generalization. Both of these are critical to the classification process. The word-splitting algorithm described in this section splits each selected word into two lexical entries. One of these is a new lexical entry that joins the dominant CORE group associated with that word. Although this adds a new word to that CORE grouping, it will not add any new CORE contexts. The improvement in classification seems to be coming from the remaining contexts associated with the split word. That word's reduced set of contexts now appears to make it available for classification with another CORE group. And that classification WILL lead to new CORE contexts and increase the scope of classification. Thus, it seems reasonable to continue the search for better splitting algorithms.