5.9 Examining some Actual Word Groupings

Before we conclude this chapter, we thought it would be interesting to examine a word group from each of our major types: NOUN, VERB, and MISC. We begin with the VERB group labeled CHOSEN. Its CORE group has six members:

First, notice that all of these words have very low frequency counts. Further, they all have very few shared contexts. Nevertheless, our algorithms seem to have located a stable set of contexts that can identify verbal participles. Second, the word HAD_1 is a new word that was created by our word splitting algorithm. That algorithm has split one participle context away from the set of all HAD contexts (which includes its use as a past-tense verb), and used that context to create the word HAD_1. Finally, the word PICKED is ambiguous between its participle and past-tense usage. It has one context for each usage. The fact that it is grouped here implies that its past-tense context is not part of the set of all CORE contexts. This could be problematic, because it will allow the classification process to include a past-tense context in the set of CORE contexts for this group. Note that this group has a rating of -0.833 because the ambiguous word PICKED is not included in the verb count (only unambiguous verbs are included).

Classification adds the following words to the CHOSEN group:

These words, when combined with the CORE words, produce a word group of size 17. The rating for this group is -0.941 (again, because PICKED is not included in the verb count). Note that the word GOT has entered this group. Although its dominant form, in the primary corpus, is as a past-tense verb, it did have some participle occurrences. So, this group seems to be doing well at isolating verbal participles.
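The two ratings quoted for the CHOSEN group are consistent with a simple signed fraction: 5 unambiguous verbs out of 6 CORE members, and 16 out of 17 members after classification. The sketch below only checks that arithmetic. The function name `group_rating`, and the reading of the negative sign as a marker for verb groups, are assumptions made for illustration and are not taken from the text.

```python
def group_rating(unambiguous_count: int, group_size: int, sign: int = -1) -> float:
    """Signed fraction of group members that are unambiguous.

    Assumption: the magnitude is unambiguous_count / group_size, and the
    sign (-1 here) merely marks the group type. Neither detail is stated
    explicitly in this section.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return sign * unambiguous_count / group_size


# CORE group: six members, PICKED excluded from the verb count.
core_rating = group_rating(5, 6)
# After classification: seventeen members, PICKED still excluded.
full_rating = group_rating(16, 17)

print(f"{core_rating:.3f}")  # -0.833
print(f"{full_rating:.3f}")  # -0.941
```

Under this reading, the rating moves closer to -1 after classification simply because eleven unambiguous verbs were added while the single ambiguous member stayed constant.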

The next group we will examine is the NOUN group labeled BALL. Its CORE group has eight members:

Notice that 5 of these words have only one shared context. This is an example of how the CORE extension process can build a larger CORE GROUPING. If a word has only one shared context, and that context is a member of the CORE contexts for a given group, then the word is classified in that group. Note, further, that if a new shared context should occur for that word in the future and that context is a member of the CORE contexts for another group, then the word will be removed from the CORE GROUPING. This is an example of the non-monotonicity in group membership that we mentioned earlier.
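The membership rule just described, and its non-monotonic behavior, can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual CORE extension procedure: contexts are modeled as hypothetical (left word, right word) pairs, the word BAT and all context values are invented for the example, and the function `core_group_for` is a name introduced here.

```python
from typing import Dict, Optional, Set, Tuple

Context = Tuple[str, str]  # hypothetical (left word, right word) pair


def core_group_for(
    shared_contexts: Set[Context],
    core_contexts: Dict[str, Set[Context]],
) -> Optional[str]:
    """Return the one group whose CORE contexts the word touches, else None.

    Assumption: a word stays in a CORE GROUPING only while its shared
    contexts fall among the CORE contexts of exactly one group.
    """
    groups_hit = {
        label
        for label, contexts in core_contexts.items()
        if shared_contexts & contexts
    }
    return groups_hit.pop() if len(groups_hit) == 1 else None


# Invented CORE context sets for two of the groups discussed in this section.
core_contexts = {
    "BALL": {("THE", "WAS"), ("A", "IN")},
    "CHOSEN": {("HAD", "BY")},
}

# A word with a single shared context that is a BALL CORE context.
bat_contexts = {("THE", "WAS")}
print(core_group_for(bat_contexts, core_contexts))  # BALL

# A later shared context belonging to another group's CORE contexts
# removes the word from the CORE GROUPING.
bat_contexts.add(("HAD", "BY"))
print(core_group_for(bat_contexts, core_contexts))  # None
```

The second call shows the non-monotonicity: seeing more data for a word can withdraw a membership that was granted on less data.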

The classification process adds 5 more words to this group (for a total of 13).

Note that the quantifier OTHER is part of this group. In the LOB clustering done in section 4.3.1, we found that quantifiers shared many contexts with nouns. In fact, they were among the closest tags to the large NOUN cluster. Here, it is this overlap that is allowing the word OTHER to join this otherwise exclusively noun group.

The final group that we will look at is the MISC group labeled NOW. It has the following 10 words in its CORE group:

This is a predominantly adverbial group (tags RB and RN). Note that the LOB Corpus comes with the negative morpheme N'T as a separate lexical item. We did not attempt to re-combine this morpheme with its predecessor word. In the LOB tag analysis, RB, RN, and XNOT tags were all members of the same tag cluster. Again, we have some overlapping contexts that are allowing the morpheme N'T to enter this word group.

Classification added the following 5 words to the NOW group (for a total of 15):

Note that all of these words except NOT are classified using abstraction. Thus we see the strength of generalization when words like HARDLY, REALLY, and THEREFORE are classified correctly even though they share NO contexts with the words in the CORE group. And, we see the weakness of abstraction when a pronoun like ANYONE is allowed to join this adverb/negation group.

This ends our brief survey of some representative groups. It shows some of the strengths and weaknesses of our technique. Also recall our discussions of the groups labeled AT and HAVE in the preceding sections. Although the technique has its weaknesses, it does seem to be creating meaningful word groups.
