5.3 AUTOMATICALLY STOPPING ITERATIVE CLUSTERING
We seek an automatic way to stop the iterative clustering technique at an optimal point. Concretely, we want to stop abstraction before the nouns and verbs combine (that is, near the final stable abstraction identified in the previous section). Otherwise, we will be unable to classify the remainder of the corpus. Our initial plan was to stop the procedure when a single group contained more than half the words (presumably a single large noun group). Unfortunately, that criterion did not stop the analysis in Figures 5.3 and 5.4 soon enough. The groupings found at that point provide no discrimination between word classes.
(An obvious stopping criterion would be to monitor the error rate. However, the error rate only measures performance, and computing it requires a "pre-tagged" corpus. Since we are developing algorithms to "tag" the words of a corpus, the error rate will not be available to our algorithms.)
An obvious stopping point is the final stable abstraction. Unfortunately, there is no simple tag-independent metric for identifying this point. For the two corpora we have been considering, the final stable abstraction occurred three abstractions prior to the formation of a "very large" group (that is, a group with over 50% of the grouped words). A test based on this observation seems quite ad hoc, and it would be difficult to implement.
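To make the "very large" group test concrete, the following is a minimal sketch in Python, assuming each abstraction is represented as a list of word groups. The function names, the history list, and the lookback parameter are hypothetical illustrations, not part of the procedure described here. The sketch also shows why the test is awkward: the clustering must run past the desired stopping point and then back up.

```python
from typing import Collection, Optional, Sequence


def largest_group_fraction(groups: Sequence[Collection[str]]) -> float:
    """Fraction of all grouped words that fall in the largest group."""
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    if total == 0:
        return 0.0
    return max(sizes) / total


def has_very_large_group(groups: Sequence[Collection[str]],
                         fraction: float = 0.5) -> bool:
    """True when one group holds over `fraction` of the grouped words."""
    return largest_group_fraction(groups) > fraction


def backtrack_stop_index(history: Sequence[Sequence[Collection[str]]],
                         lookback: int = 3,
                         fraction: float = 0.5) -> Optional[int]:
    """Index of the abstraction `lookback` steps before the first very large group.

    `history[i]` is the grouping after abstraction i (hypothetical layout).
    Returns None if no very large group ever forms.
    """
    for i, groups in enumerate(history):
        if has_very_large_group(groups, fraction):
            # Assumption: the offset of 3 is only the value observed
            # for the two corpora discussed in the text.
            return max(i - lookback, 0)
    return None
```

Note that the lookback of three abstractions is only an observation about the two corpora considered here. Nothing in the sketch justifies it for other corpora, which is what makes the test ad hoc.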
A better solution comes to light when we realize that the analyses of both corpora degenerate when the distance threshold reaches 0.85. (Recall that our analysis of LOB tags predicted that this would be the problem area. It was predicted to be the distance threshold at which nouns and verbs would begin to combine.) Since we began our analysis with a distance threshold of 0.75, our revised stopping condition is: Stop the iterative clustering procedure when the distance threshold exceeds the starting threshold by 0.10. Using this condition, we will continue to get the advantages of incremental clustering, and we will stop the procedure while we still have a reasonable set of "classifying" word groups. Thus, our final grouping procedure will be: