5.10 Harris Revisited

Before we end this chapter, let us return to Harris's theory one more time. In section 5.2, we produced the following chart using Harris's two primary word types, null operators (NOUNS) and operators (VERBS and MISC):

Then we noted: Although the actual text in the two corpora causes the percentages to vary, we see the following trends: (1) N-type word groups are easier to form with no abstraction, and (2) O-type word groups grow rapidly when abstraction occurs. These results are consistent with Harris's predictions. Therefore, our grouping procedure may be considered a viable automatic technique for bootstrapping a Harris-type analysis of real-world text.

Below we replicate the above chart using the results of our classification, extension, and word splitting procedures.

The earlier remarks were based on the results of the iterative clustering technique, assuming we could stop it at the last stable abstraction. We have devoted the bulk of this chapter to developing new algorithms that (1) stop the iterative process at an easily determined point, and then (2) extend those results to cover more of the lexicon. The two charts on the preceding page indicate that both techniques arrive at essentially the same point (that is, essentially the same mix of nouns and operators). In both corpora, the error rate rises along with the number of words classified, but in neither case does it become excessive (4.76% in the primary corpus, 8.36% in the secondary corpus). Therefore, the revised grouping procedure that we have developed in this chapter may also be considered a viable automatic technique for bootstrapping a Harris-type analysis of real-world text.
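The two quantities compared above, the mix of N-type and O-type words and the error rate, can be illustrated with a minimal Python sketch. It is not the procedure used in this chapter: the function names `type_mix` and `error_rate`, the labels "N" and "O", and the five-word toy lexicon are all assumptions made for illustration, and the numbers it prints have no connection to either corpus.

```python
from collections import Counter

def type_mix(assigned):
    """Percentage of classified words falling under each word type."""
    counts = Counter(assigned.values())
    total = sum(counts.values())
    return {word_type: 100.0 * n / total for word_type, n in counts.items()}

def error_rate(assigned, reference):
    """Percentage of classified words whose type disagrees with a reference labeling."""
    wrong = sum(1 for word, word_type in assigned.items()
                if reference[word] != word_type)
    return 100.0 * wrong / len(assigned)

# Hypothetical toy data (not drawn from the primary or secondary corpus).
# "N" stands for null operators (nouns), "O" for operators (verbs and misc).
assigned = {"dog": "N", "house": "N", "run": "O", "quickly": "O", "table": "O"}
reference = {"dog": "N", "house": "N", "run": "O", "quickly": "O", "table": "N"}

mix = type_mix(assigned)
err = error_rate(assigned, reference)
print(mix)
print(err)
```

Running the two functions on the output of each technique would give the kind of side-by-side comparison that the charts summarize: a similar N/O mix suggests the techniques reach the same point, while the error rate shows what the wider lexical coverage costs.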
