3.4 ITERATIVE CLUSTERING
Returning to Figure 3.4, notice that within the original three orthogonal word groups, the intermediate categorizations are not very good. As mentioned in Chapter 2, any classifier should be able to identify the six distinct noun groups based on their usage in the text generator. Moreover, we would expect a good classifier to at least separate the subjects from the non-subjects. Further, we would like the verb classes associated with the ambiguous verb "see" to end up somewhat close together, and we would expect the same for "break."
The problem here is related to a second major characteristic of single-linkage clustering: the technique is vulnerable to chaining effects. That is, "the presence of a few [data] points located so as to produce a bridge between ... [more natural] clusters results in a rather unexpected grouping into one large, elongated cluster ..." [DUDA73, p.233]. Whether this is a problem depends on the data being clustered. In the "toy" language data, the nouns are being chained into a single large word group.
The usual remedy for the chaining effect is to adjust the specified distance threshold. Lowering that threshold terminates the "chains" before they become overly large. If a lower threshold is used in conjunction with the word abstraction process mentioned above, we can overcome the chaining effect in the "toy" language.
But we also want to be able to continue the abstractions until all the verbs are in one group and all the nouns are in another. To achieve that, the distance threshold must be allowed to increase periodically so that the procedure does not terminate too soon. We have implemented this by letting the algorithm increment the distance threshold whenever no new grouping activity occurs. The threshold is allowed to increase until it reaches 1; recall that our distance computation algorithm (3.3) does not create graph arcs when the computed distance between two words is 1 (that is, when their feature sets are disjoint).
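The increment-on-stall loop can be sketched as follows. The feature-set distance here is a Jaccard-style stand-in for the computation of (3.3) (it agrees with it only in that disjoint feature sets score 1), and the words and feature sets at the bottom are invented for illustration:

```python
def feature_distance(a, b):
    # Stand-in for the distance computation of (3.3): any set-overlap
    # measure works here so long as disjoint feature sets score 1.
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def iterative_cluster(features, threshold=0.25, step=0.05):
    """Sketch of the iterative procedure: merge a pair of word groups
    whenever their distance falls below the threshold; when a pass
    produces no new merge, raise the threshold.  Stop once the threshold
    reaches 1, since (3.3) creates no arc for disjoint feature sets."""
    groups = [{w} for w in features]          # singleton word groups

    def group_features(group):
        out = set()
        for word in group:
            out |= features[word]
        return out

    while threshold < 1.0 and len(groups) > 1:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if feature_distance(group_features(groups[i]),
                                    group_features(groups[j])) < threshold:
                    groups[i] |= groups.pop(j)  # abstract the pair
                    merged = True
                    break
            if merged:
                break
        if not merged:
            threshold += step  # no grouping activity: relax the threshold
    return groups

# Hypothetical feature sets, not the real "toy" language data:
words = {
    "boy":   {"subj", "human", "eats"},
    "girl":  {"subj", "human", "eats"},
    "rock":  {"obj", "inanim"},
    "plate": {"obj", "inanim", "fragile"},
}
```

On this invented data, "boy" and "girl" merge at the initial threshold; "rock" and "plate" (distance 1/3) merge only after the threshold has been raised past 0.35; the two resulting groups have disjoint features, so the procedure stops when the threshold reaches 1.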
Again, we will be using five algorithms to group words. The actual steps in our procedure are:
Finally, the initial distance threshold must be sufficiently large to capture enough initial word groups to allow us to group all of the data. This is not a problem in the "toy" language. The words within the primary, "well-defined" word groups (AGRESS, ANIM, HUM, INANIM, FRAGILE, FOOD, INTRAN, DO-REQD, and AGPAT) are very close together (distances between words that belong to the same group are less than 0.15). Further, the distance between words that are not in the same primary, "well-defined" word group is much greater (at least 0.35). So, we set the initial distance threshold to 0.25, and allowed the procedure to go through a series of 11 abstractions. Figure 3.6 shows the results.
Notice that we now have much better behaved intermediate groupings. For example, there is now a clean separation between subjects and non-subjects. Further, we can see that the subjects are categorized by the verbs they take.
We also find that the verb "see" spans its two verb uses, clustering into the TRANS group after "chase" and "like," and bringing "smell" in after it joins the VERB group. Similarly, "break" spans its two verb uses. We find our word groupings to be a very satisfying match to the "syntax" of the "toy" language.