CHAPTER 5
GROUPING WORDS FROM THE LOB CORPUS
In this chapter we turn our attention to real-world text. We apply the algorithms developed in the last two chapters to a continuous stream of text. The corpora in this chapter were drawn from the hand-tagged LOB Corpus (see section 4.2). Because the tags are known, we can "rate" our algorithms' performance in forming word groups. As we discuss this processing, the final forms of our plain-text algorithms will take shape. These final algorithms will be applied to an untagged corpus in Chapter 6.
As we process the text of the LOB Corpus, a number of key questions will come forward. In Chapter 4, we established our goal as identifying a well-behaved subset of the words in a corpus and then using that subset to tag the remaining words. In pursuing this goal, we must remember that since we are working with real text, all parts of speech are present. This implies that eventually most words will be combined into a single large group (see section 4.1.2). So, we must be concerned with the following questions: (1) When should we stop the iterative clustering process? One large group of words would be meaningless for tagging the remainder of the corpus. (2) How big a subset of the words is big enough? (3) If the iterative clustering algorithm produces too small a subset, should we perform some additional processing to expand the subset? The decision about subset size will ultimately be based on how well the subset performs in tagging the remainder of the words. This suggests two more questions: (4) What is "the remainder of the words"? And (5) how do we know that we have classified "most" of it? In answering questions (4) and (5), we must be aware that only a small portion of the words (about 11% for the corpora we will discuss in this chapter) meet the groupability constraint that we established in the previous chapter (that is, have at least two shared contexts).
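To make the groupability constraint concrete, the following minimal sketch counts the words in a token stream that share at least two contexts with some other word. It rests on assumptions that are not taken from this chapter: a "context" is taken here to be the (preceding word, following word) pair, and the function names contexts_by_word, groupable_words, and groupable_fraction are hypothetical. The definition given in Chapter 4 may differ in detail.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each word to the set of (left, right) neighbor pairs it occurs in.

    Assumption: a context is the immediately preceding and following word.
    The first and last tokens have no full context and are skipped.
    """
    ctx = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        ctx[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return ctx


def groupable_words(tokens, min_shared=2):
    """Return the words that share at least min_shared contexts with another word."""
    ctx = contexts_by_word(tokens)
    words = list(ctx)
    groupable = set()
    # Compare every pair of distinct words (quadratic; fine for a sketch).
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            if len(ctx[w] & ctx[v]) >= min_shared:
                groupable.update((w, v))
    return groupable


def groupable_fraction(tokens, min_shared=2):
    """Fraction of the vocabulary that meets the groupability constraint."""
    vocab = set(tokens)
    if not vocab:
        return 0.0
    return len(groupable_words(tokens, min_shared)) / len(vocab)
```

Under these assumptions, groupable_fraction gives the kind of proportion quoted above for a given corpus. The pairwise comparison is quadratic in vocabulary size, so a larger corpus would call for an index from contexts to words instead.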
In this chapter, we will address each of these questions in turn. In the process, we will introduce the following: (1) an automatic stopping criterion for the iterative clustering technique, (2) a new set of algorithms to classify the remainder of the words, and (3) additional algorithms to expand the subset of words identified by iterative clustering.