CHAPTER 1
AN INTRODUCTION TO OCCURRENCE-BASED PROCESSING
We have embarked on a research program that we call OCCURRENCE-BASED processing. This methodology, quite simply, monitors the contexts in which data elements appear. As such, it is similar to co-occurrence statistical studies, but we do not tally the number of times a data element occurs in a context; we simply record that it has occurred in that context. Thus, one occurrence is treated the same as 1,000 occurrences.
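The distinction can be made concrete with a small sketch (our own illustration, not an implementation from this document). Here a word's "context" is taken, for simplicity, to be its pair of immediate neighbors; a co-occurrence approach tallies how often each context appears, while the occurrence-based approach only records membership in a set.

```python
from collections import defaultdict

# Toy corpus; a word's "context" is its (left, right) neighbor pair.
corpus = "the cat sat on the mat and the cat sat on the rug".split()

counts = defaultdict(lambda: defaultdict(int))   # co-occurrence: tally per context
occurred = defaultdict(set)                      # occurrence-based: context seen or not

for i, word in enumerate(corpus):
    left = corpus[i - 1] if i > 0 else "<s>"
    right = corpus[i + 1] if i + 1 < len(corpus) else "</s>"
    context = (left, right)
    counts[word][context] += 1      # one occurrence differs from 1,000 here...
    occurred[word].add(context)     # ...but they are identical here

# "sat" appears twice in the same context (cat, on): the tally is 2,
# while the occurrence set records that context exactly once.
print(counts["sat"][("cat", "on")])
print(len(occurred["sat"]))
```

The set-based record deliberately discards frequency: a context is either an observed descriptor of the word or it is not.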
We have been applying this methodology to the task of categorizing words from a natural language. In particular, we have been applying it to corpora consisting of samples of written English text (edited newspaper articles and unedited technical articles). These samples can be assumed to be reasonably "correct" samples of the English language, so all sequences of words in them can be considered valid sequences in the language. Though there will be some "noise" in the data, the underlying assumption of occurrence-based processing (that all context occurrences are equally valid descriptors of a data item) seems reasonably valid for these corpora. This is borne out in the empirical studies that we will mention later.
The key problem here is: How do we start the categorization process? Word categories are identified by certain characteristic distributions of linguistic objects. Unfortunately, the linguistic objects that define word categories are word categories. Thus, the language learner is faced with the circular task of needing to know word categories in order to learn word categories. The only way that the language learner can succeed is to acquire an initial set of categorized words that is capable of starting and sustaining the categorization task. So, we have a bootstrapping problem: How does the language learner acquire that initial set of categorized words?
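The circularity can be seen in even the most naive distributional procedure. As a toy illustration (ours, not the method developed later in this document), one might group words whose observed context sets overlap; but note that the contexts below are made of raw words, whereas reliable category-defining contexts are themselves made of categories, which is precisely what the learner does not yet have.

```python
from collections import defaultdict

# Toy corpus; collect each word's set of (left, right) neighbor contexts.
corpus = "the cat sat on the mat the dog sat on the rug".split()

contexts = defaultdict(set)
for i, word in enumerate(corpus):
    left = corpus[i - 1] if i > 0 else "<s>"
    right = corpus[i + 1] if i + 1 < len(corpus) else "</s>"
    contexts[word].add((left, right))

def share_context(w1, w2):
    # Words with overlapping context sets are candidate category-mates.
    return bool(contexts[w1] & contexts[w2])

print(share_context("cat", "dog"))  # both occur in the context (the, sat)
print(share_context("cat", "on"))
```

Here "cat" and "dog" are grouped because they share the literal context (the, sat); but had the corpus used "a dog" instead of "the dog", the word-level contexts would no longer match, even though the category-level context (DETERMINER, VERB) is the same. An initial seed of categorized words is needed to break this circle.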
In this chapter, we will begin by discussing the bootstrapping problem in more detail. Because our focus is primarily machine categorization of words, we are interested in how far the categorization process can be taken using only the word-distribution information from a corpus. Distributional analysis of text is the foundation of structural linguistics, so we will next examine the pertinent aspects of the work of Zellig Harris, a key American structuralist. Then we will examine a connectionist model by Elman that shows that, at least in a "toy" language, the structuralist techniques can be automated. Finally, we will present an overview of our occurrence-based methodology, previewing the material that appears in the remainder of this document.