5.2 ITERATIVE CLUSTERING
Of course, the ideal would be to group the full 11% potential of each of these corpora. To be grouped, two words must share at least two shared-contexts and have a distance that is less than the currently specified THRESHOLD. So, even if we let the iterative clustering algorithm continue until no more words can be grouped (that is, until no more "grouping arcs" can be found), we would expect to group far fewer words than this ideal (in fact, about 400 words in these two corpora).
Further, as the process of iterative clustering continues, we find that all words tend to merge into a single large group (along with some lingering, small "outlier" groups). So, if we want to find meaningful word groups, we must stop the grouping process early. The word-grouping summaries in Figures 5.3 and 5.4 show the results of running our iterative clustering technique with a starting distance threshold of 0.75. The procedure was stopped when a single word group contained more than 50% of the words (a heuristic for "large").
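The grouping loop just described - merge words that share at least two shared-contexts and lie within the distance threshold, stopping once a single group becomes "large" - can be sketched as follows. This is an illustrative sketch, not our implementation: the names iterative_cluster and shared_contexts, the distance function, and the context representation (sets of left/right word pairs) are all simplifying assumptions.

```python
def shared_contexts(a, b, contexts):
    """Contexts (here, left-word/right-word pairs) in which both a and b occur."""
    return contexts[a] & contexts[b]

def iterative_cluster(words, contexts, distance, threshold=0.75,
                      stop_fraction=0.50):
    """Merge words while 'grouping arcs' exist: two words may be grouped
    when they share at least two contexts and their distance is below
    the threshold.  Stop early once one group holds more than
    stop_fraction of the grouped words (the 'large group' heuristic)."""
    groups = {w: {w} for w in words}              # each word starts alone
    changed = True
    while changed:
        changed = False
        for a in words:
            for b in words:
                if groups[a] is groups[b]:
                    continue                      # already in the same group
                if (len(shared_contexts(a, b, contexts)) >= 2
                        and distance(a, b) < threshold):
                    merged = groups[a] | groups[b]   # follow the grouping arc
                    for w in merged:
                        groups[w] = merged
                    changed = True
        grouped = {w for w in words if len(groups[w]) > 1}
        if grouped and max(len(g) for g in groups.values()) \
                > stop_fraction * len(grouped):
            break                                 # a single group is 'large'
    return {frozenset(g) for g in groups.values() if len(g) > 1}
```

With a toy corpus in which two words share two contexts and a third shares none, the loop forms exactly one two-word group and then halts under the 50% heuristic.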
Note that the "total" and "group-rating" columns each contain two numbers separated by a "/". The first number indicates the total number of words over all groups that would be classified in that column; the second, the number of groups formed. Both are important measures of grouping performance. The word count indicates our progress in identifying words; the group count, how sparsely those words are distributed. Note that these tables contain the same error measure and group ratings that we introduced in Section 4.3.1.
(The group rating is equal to the number of nouns occurring in that group MINUS the number of verbs, divided by the total number of words in the group. Thus, a rating of +1.000 implies that the group is a "pure" noun group; -1.000, a pure verb group. In Figures 5.3 and 5.4, NOUN groups have a preponderance of nouns (group rating > +0.500); VERB, a preponderance of verbs (group rating < -0.500). Other word groups (with group ratings between -0.500 and +0.500) will be MISC groups. An error occurs when nouns and verbs are placed in the same word group (that is, a noun-verb mismatch). In a group with a rating > 0, the error count will be the number of verbs occurring; in a group with a rating < 0, the number of nouns. If there are an equal number of nouns and verbs in a group, it will receive a 0.000 rating, and its error count will be the sum of its nouns and verbs.)
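The rating and error definitions above can be computed directly. The following sketch assumes groups are given as lists of (word, label) pairs; the function name and input format are illustrative assumptions.

```python
def rate_group(members):
    """members: list of (word, label) pairs, label in {NOUN, VERB, MISC}.
    rating = (nouns - verbs) / total words: +1.0 is a pure noun group,
    -1.0 a pure verb group.  The error count is the number of minority
    noun/verb words (nouns + verbs when the rating is exactly 0)."""
    nouns = sum(1 for _, lab in members if lab == "NOUN")
    verbs = sum(1 for _, lab in members if lab == "VERB")
    rating = (nouns - verbs) / len(members)
    if rating > 0:
        errors = verbs
    elif rating < 0:
        errors = nouns
    else:
        errors = nouns + verbs      # equal split: count both sides
    # classify the group as in Figures 5.3 and 5.4
    kind = "NOUN" if rating > 0.5 else "VERB" if rating < -0.5 else "MISC"
    return rating, errors, kind
```

For example, the MAJORITY group of Figure 5.5 (54 nouns joined by 5 verbs) rates (54 - 5)/59 = +0.831, classifies as NOUN, and contributes 5 errors.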
Based on our discussions in Section 4.3, we expected to find three basic types of word groups: (1) "pure" nouns or nouns mixed with a few miscellaneous words (especially adjectives), (2) "pure" miscellaneous or miscellaneous mixed with a few nouns and/or verbs, and (3) "pure" verbs or verbs mixed with a few miscellaneous words (especially adjectives). These expectations were based on our experiments on LOB tags. Our three group-rating columns roughly correspond to these types.
It should be noted that we followed the analysis of Section 4.3.2 in assigning the labels NOUN and VERB to words. Essentially, all non-possessive LOB tags that start with an N were labeled NOUNS. All LOB tags that started with VB, BE, DO, HV, or MD were labeled VERBS. There were three exceptions to this verb labeling scheme: the LOB tags VBG, BEG, and HVG were not labeled VERBS. These are all gerunds. Since gerunds are "intermediate" forms - verbs acting as nouns - we felt safe in this exclusion. The tree in Figure 4.5 shows that BEG and HVG are peripheral forms that will have little bearing on the results. However, VBG is the "closest" verb form to the nouns. We could probably legitimately include VBG with either verbs or nouns, but chose to mark it MISC instead. (In most cases, this decision had little or no effect on the error measure.) Finally, note that all other LOB tags (that is, all non-VERBS and non-NOUNS) were marked MISC.
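The labeling scheme can be summarized in a few lines. This sketch is an illustration with one assumption beyond the text: it takes possessive LOB tags to be marked with a trailing "$" (as in NN$).

```python
VERB_PREFIXES = ("VB", "BE", "DO", "HV", "MD")
GERUND_TAGS = {"VBG", "BEG", "HVG"}   # gerunds: excluded from VERB

def label(tag):
    """Map a LOB tag to NOUN, VERB, or MISC per Section 4.3.2."""
    if tag.startswith("N") and not tag.endswith("$"):
        return "NOUN"                 # non-possessive N* tags
    if tag in GERUND_TAGS:
        return "MISC"                 # 'intermediate' verb-as-noun forms
    if tag.startswith(VERB_PREFIXES):
        return "VERB"
    return "MISC"                     # everything else
```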
Figures 5.3 and 5.4 show a similar pattern. Both analyses start off with relatively low error counts in abstraction # 0 (that is, the analysis of the original, raw-text data). All five errors in the primary corpus occur in a single group - a large group, labeled MAJORITY, of 54 nouns joined by 5 verbs (see Figure 5.5). This exhibits a key problem with real world text - ambiguous word usage (see Section 4.1.4 for a detailed discussion). Here we find a number of words that can be used both as nouns and verbs. In particular, FORM has three shared contexts associated with its usage as a verb, and three others associated with its usage as a noun. It is classified as a NOUN because 8 of its 11 occurrences are in noun contexts. But, since its shared-contexts are equally split between noun and verb usages, it becomes a point where nouns and verbs can be combined in a single word group. And, this is exactly what happens. (See Figure 5.5 for details.)
Another problem is exhibited by the word WISH in Figure 5.5. Here we have a word that is primarily used as a verb, but can also be used as a noun. A minimal number of noun occurrences (2) just happens to overlap the noun occurrences of GROUND. Tuning the starting distance threshold could remove WISH from the group. However, we must balance this "error reduction" against the accompanying reduction in grouped words. In this case, the reduction in the total number of words that iterative clustering grouped seemed too great (25% fewer words when the starting distance threshold was dropped to 0.70). But the issue of when the set of grouped words is "too small" requires additional study.
In the secondary corpus, we have a more dispersed pattern of similar types of errors. Two of the "error" groups are: (CHANGE, STOP) and (SUIT, NAME). In each case, the two grouped words are used as both a verb and a noun. And, in each group, one word is predominantly used as a noun and the other as a verb. Thus, each group contributes a count of 2 to the error count. The remaining errors in the secondary corpus involve two groups: a large noun group with two verbs, and a moderate verb group with one noun. Although the error counts are somewhat higher in the secondary corpus, we will concentrate on the primary corpus (where the pattern of error is more interesting).
The first abstraction was significant in both corpora. It was at this point that the determiners "A" and "THE" were grouped. Because of the high number of contexts associated with these words, an abstraction step was necessary to bring them "close enough" to form a group. Similar behavior held for adjectival and prepositional groupings. Thus we see significant growth in the MISC groups column (a 41% increase in the grouped words in the primary corpus; a 113% increase in the secondary corpus).
The error counts did not exceed 5% of the total grouped words for a number of abstractions (through abstraction # 10 in the primary corpus; abstraction # 9 in the secondary corpus). Through these final stable abstractions we find that the NOUN columns hold a majority of the grouped words, ending with about 50% of the grouped words. Further, there is a significant move from "pure" noun groups to predominantly noun groups during these abstractions. In the primary corpus, "pure" noun groups account for 50% of the words in the NOUN column at abstraction # 0; 24% of the words at the final stable abstraction (# 10). (In the secondary corpus, the comparable numbers are 39% and 22%.) In both corpora, the NOUN groups at the final stable abstraction are dominated by a single, large group of nouns which has a small subset of verbs and miscellaneous words. (In both corpora, this group has over 100 words.) Recall from our analysis of the LOB tags that we found only one NOUN group, and its rating was 0.75 (see Figure 4.4). Thus, we see a consistent pattern over all of these experiments. Figure 5.6 shows a list of the word groups in the primary corpus at the final stable abstraction (# 10).
In both corpora, there was a giant leap in the error count following the last "stable" abstraction mentioned in the preceding paragraph. This is symptomatic of the onset of noun/verb mixing. In both corpora we found that the largest NOUN group was combining with the largest VERB group. From then on, the error count grew rapidly as NOUN and VERB groups continued to combine (and there was a corresponding increase in the number of words found in MISC groups). Whenever we allow iterative clustering to continue until there are no more "grouping arcs" (on real world data), we always reach a point where over 90% of the grouped words are in a single MISC group. Thus, the final stable abstraction seems to occupy a privileged position in our analysis. It is the ideal point to stop iterative clustering.
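One way to locate the final stable abstraction automatically is to monitor the error rate after each abstraction step and stop as soon as errors exceed 5% of the grouped words. A minimal sketch, assuming per-abstraction summaries of (grouped-word count, error count):

```python
def final_stable_abstraction(summaries, max_error_rate=0.05):
    """summaries[i] = (grouped_word_count, error_count) for abstraction i.
    Returns the index of the last abstraction whose error count stays
    at or below max_error_rate of its grouped words, or None if even
    abstraction # 0 exceeds the rate."""
    last_stable = None
    for i, (grouped, errors) in enumerate(summaries):
        if grouped and errors / grouped <= max_error_rate:
            last_stable = i
        else:
            break           # noun/verb mixing has set in; stop here
    return last_stable
```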
Before we leave this section, we would like to establish some points of contact with Harris's theory. Above we have discussed the dominance of the NOUN groups in both corpora (with about 50% of the grouped words at the final stable abstraction). But, if we look at the MISC and VERB groups, we see two quite different patterns across the two corpora. In the primary corpus there was a preference for VERB groups (which contain about twice as many words as the MISC groups throughout most of the early abstractions). In the secondary corpus we found that the MISC groups grow more rapidly than the VERB groups. Thus, the data seems to be driving these analyses in different directions. But, this is only a function of how we look at the data. In Harris's theory of grammar, there are TWO basic word types: null operators (NOUNS) that place NO restriction on the words that surround them, and operators (MISC and VERBS) that DO place restrictions on their surrounding words (see Section 2.4). If we look at the grouped words for some key abstractions using these Harris word group types, we see:
In both corpora, we see a dominance of N words at abstraction # 0. This is because the Harris system has a basic bootstrapping problem. If we are going to define word groups by the word classes of the words they REQUIRE to surround them, then it will be hard to group words prior to forming some initial word groups. So, to get initial word groups, it is necessary to have some words with very restricted contexts. Then, these words can be grouped on their actual surrounding words, without resorting to the classes those words belong to. The preceding chart implies that this is more likely to happen for NOUN-type words than for operators.
Once we have an initial set of grouped words, our abstraction step replaces these words with their group labels (which correspond to Harris's word classes). Because O-type words are expected to be classified based on the types of words surrounding them, we would expect the number of words in that class to grow rapidly once an abstraction step has occurred. And, this is exactly what happened in these two corpora.
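The abstraction step itself - replacing each grouped word in the corpus by its group label, so that later clustering passes see classes rather than individual words - is simple to sketch. The label naming scheme (G0, G1, ...) is an assumption for illustration.

```python
def abstract(corpus, groups):
    """corpus: list of tokens; groups: iterable of word sets.
    Returns the corpus with every grouped word replaced by its
    group label; ungrouped words pass through unchanged."""
    label_of = {}
    for i, group in enumerate(sorted(groups, key=sorted)):
        for word in group:
            label_of[word] = "G%d" % i
    return [label_of.get(tok, tok) for tok in corpus]
```

After this substitution, words like "A" and "THE" contribute a single shared class to every context they appear in, which is what lets operator (O-type) groups grow rapidly on the next pass.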
Although the actual texts in the two corpora cause the percentages to vary, we see the following trends: (1) N-type word groups are easier to form with no abstraction, and (2) O-type word groups grow rapidly once abstraction occurs. These results are consistent with Harris's predictions. Therefore, our grouping procedure may be considered a viable automatic technique for bootstrapping a Harris-type analysis of real world text. (Note that the only prior application of Harris's theory to extracting word groups from real world text was done manually [HARRIS89].)