CHAPTER 4
TRANSITION TO A MORE "REALISTIC" CORPUS
In Chapter 3, we demonstrated our categorization technique using a very simple "toy" language. In this chapter, we begin the transition to a more "realistic" corpus. First, we will examine a number of key issues that arise in this new corpus. For the most part, these issues are addressed by allowing some parametric variation in the way context features are identified. Then, we will introduce our first "realistic" corpus, the hand-tagged LOB Corpus. Finally, we will perform an analysis of the tags of the LOB Corpus. This tag analysis will serve as a preliminary investigation of how our technique will perform when applied to the actual words of the LOB Corpus. In Chapter 5, we will categorize the words of the LOB Corpus themselves and attempt to evaluate our technique's performance. Then, in Chapter 6, we will apply our occurrence-based methodology to a non-tagged corpus of text.