7.1 WHAT WE HAVE LEARNED FROM OCCURRENCE-BASED WORD CATEGORIZATION
This methodology grew out of an interest in computer-based approaches to language acquisition. Initially, we looked at statistical approaches. The problem we saw there was the inability to deal with low-frequency data. In essence, the existing statistical techniques treat ALL low-frequency occurrences as noise. In the connectionist models, such as Elman's (see Chapter 2), the "toy" languages that are processed use training sets with hundreds of repetitions of each lexeme. In statistical approaches to "realistic" text, like [SCHUTZE93], we find a system that is bootstrapped using the most frequently occurring words in a large corpus. Again, we have the creation of a mini-language, and co-occurrence counts are maintained between objects in that mini-language. (The Schutze system uses the 5,000 most frequently occurring words from a corpus drawn from the New York Times News Service.) The low-frequency objects are ignored during an initial "training" period, when word classes are developed. It is important to note that low-frequency objects are in fact classified in a later "tagging" stage. But, they are not considered in the initial training period.
Unfortunately, most linguistic objects (words) have very low frequency counts, and they are initially ignored by the existing statistical techniques. We believe that this type of data should not be ignored in the initial determination of word categories. According to Harris [HARRIS82], two conditions influence the presence of a word in a sentence: (1) a necessary condition that allows the word to appear in the given abstract context; and, given that the word is allowed to appear in that abstract context, (2) the likelihood that the word will occur with the actual surrounding words. The abstract context in the necessary condition is formed by the word classes of the surrounding elements. It is important to note that the necessary condition does not involve likelihoods. Further, in Harris's theory, word categories are built up from similar sets of abstract contexts. Significantly, likelihoods play no role here either. Since frequency is a statistical analog of likelihood, this suggests that frequency counts might be ignored. Our approach changes the focus from word frequencies to "key" contexts (that is, contexts that can uniquely assign words to a single word class). We allow all words of the language to participate in the initial development of word categories. Thus, we do not create an artificial mini-language for the initial determination of word categories.
It is interesting to note that the statistical approaches to word categorization have been able to bring more words into their initial training sets by increasing the volume of text being examined. In recent years, we have seen a great expansion in the size of the corpora being subjected to statistical analysis. Unfortunately, Zipf's law prevails: in any text, a large portion of the lexical items will always have very low frequency counts. So, although increasing corpus size may make more words available for statistical analysis, words that are not analyzable will continue to make up a large portion of the lexicon. Since approaches like [SCHUTZE93] create mini-languages, many interesting contexts will be lost because the participating words will not belong to the mini-language.
The existing statistical techniques deal with the problem of noise in the data by using a frequency filter. By limiting their attention to frequently occurring words, they establish a frequency threshold for noise. Words that occur less frequently than the established threshold are ignored. But, the question becomes: what should the noise threshold be? We believe that most, if not all, written language events are NOT subject to significant noise. In fact, the basic assumption underlying occurrence-based processing is that almost all events are statistically relevant. Under this assumption, it is no longer necessary to maintain frequency counts; we simply record which events have occurred, and treat them all as being equally valid.
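To make this concrete, consider the following sketch (illustrative Python, not the implementation used in our experiments), in which the local context of a word is taken to be its preceding and following words. Only the fact of occurrence is recorded; no frequency counts are maintained, so a single occurrence carries the same weight as a frequent one.

    from collections import defaultdict

    def record_contexts(tokens):
        # Record which (preceding word, following word) contexts each
        # word has occurred in, and which words each context has been
        # seen with.  No frequency counts are kept.
        contexts_of = defaultdict(set)   # word -> contexts it has occurred in
        words_in = defaultdict(set)      # context -> words observed in it
        for i in range(1, len(tokens) - 1):
            context = (tokens[i - 1], tokens[i + 1])
            contexts_of[tokens[i]].add(context)
            words_in[context].add(tokens[i])
        return contexts_of, words_in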
But, there is still the problem of some noise in the data. The kind of noise that we encounter falls into two classes. The first is that a large number of contexts occur with only one word. If these contexts are allowed to participate in similarity determinations, we get a very poor classifier of words. Therefore, we have required that a context must occur with at least two words before it is considered in our similarity metric. (We call such a context a "shared" context.) It should be noted, however, that we still record all contexts. This is because our abstraction process (that is, replacing the words in a context with their word classes) may change a previously non-shared context into a shared context.
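In terms of the sketch above, the shared-context filter is simply a restriction to contexts that have been observed with at least two distinct words; the full record of contexts is retained so that abstraction can later promote a non-shared context to shared status.

    def shared_contexts(words_in):
        # A "shared" context is one that has occurred with at least two
        # distinct words.  Non-shared contexts remain recorded in
        # words_in, since abstraction may later make them shared.
        return {c for c, ws in words_in.items() if len(ws) >= 2}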
The second kind of noise that we encounter is related to ambiguous word usage. Words like CUT and MOVE can be used both as nouns and as verbs. Allowing such words to enter word classes can quickly cause meaningful word categories to disappear. In most cases, these words have a predominant usage in a corpus, and the problem does not occur. But, for low-frequency words, it is possible that such a word may occur equally often in both categories. To prevent such occurrences, we have added another filter, this time on word pairs. We require that word pairs share at least a minimum number of contexts (in our corpora, TWO has proven to be sufficient).
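Continuing the same sketch, the word-pair filter admits a pair of words only when they have some minimum number of shared contexts in common. (The similarity metric applied to the admitted pairs is not reproduced here; the simple overlap count below serves only as a placeholder.)

    from itertools import combinations

    def candidate_pairs(contexts_of, shared, min_common=2):
        # Admit a word pair only if the two words have at least
        # min_common shared contexts in common (two proved sufficient
        # in our corpora).
        pairs = []
        for w1, w2 in combinations(sorted(contexts_of), 2):
            common = contexts_of[w1] & contexts_of[w2] & shared
            if len(common) >= min_common:
                pairs.append((w1, w2))
        return pairs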
Note that the filtering focus has been completely changed. We have shifted that focus from words to word combinations. This technique will only work in domains where the noise level is very low, and we believe that written language syntax is such a domain. This is especially true of edited text. We normally find a large number of unique text sequences within any written document. We would argue that, syntactically, all of these text sequences can be assumed to be equally valid. The addition of editing provides further support for this assumption.
It should be noted that we did not embark on this word categorization task with any preconceived notions as to what categories should be found. We simply let our algorithms process the text data and determine a set of "natural" categories. This is an important point. Most word categorization systems are designed to identify the standard set of syntactic categories (noun, verb, etc.). The algorithms are tuned to produce these desired categories. Recent statistical categorization systems have approached this problem in a different way (see the discussion of Elman's work in Chapter 1 and [ELMAN90,89]). In these systems, similar words are allowed to combine with each other into natural categories. These categories are then allowed to combine until all words are merged into one large super-category (the language). The hierarchical tree that describes this clustering process provides a series of more and more general "natural" categories for the words that reside at its leaf nodes.
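Schematically, this bottom-up process is ordinary agglomerative clustering. The sketch below (illustrative Python, assuming an arbitrary similarity function over clusters) records the sequence of merges, which can be read back as the hierarchical category tree. In our own processing, of course, the merging is halted by the stopping metric discussed below rather than being run all the way to a single super-category.

    def agglomerate(words, similarity):
        # Schematic bottom-up clustering: repeatedly merge the two most
        # similar clusters until a single super-category remains.  The
        # recorded merges describe the hierarchical category tree.
        clusters = [frozenset([w]) for w in words]
        merges = []
        while len(clusters) > 1:
            pairs = [(a, b) for i, a in enumerate(clusters)
                     for b in clusters[i + 1:]]
            a, b = max(pairs, key=lambda p: similarity(p[0], p[1]))
            clusters = [c for c in clusters if c not in (a, b)]
            clusters.append(a | b)
            merges.append((a, b))
        return merges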
We were motivated to follow the latter approach to categorization. Two main points influenced this decision. First, the categories that Elman produced did include the standard syntactic categories at a very general level (as children of the root node of the category tree). This implied that actual word usage may be sufficient to detect the standard categories without a pre-processing bias. Second, the linguistic theory of Harris (see [HARRIS84]) generates a set of syntactic categories that does not coincide completely with the standard set. Since his theory is an inductive theory based on actual language usage, this suggested that the standard categories may not be natural. By not providing any pre-processing bias on the categories we were seeking, we avoided any theoretical bias toward either Harris's theory or the standard theory.
We centered our research on identifying a set of natural syntactic categories while limiting the pre-theoretic bias. Although the actual categorization routines are not biased toward standard categories, our validation methods are. To judge the validity of the categories produced, it is necessary to examine the types of words that occur in these categories. To do this quantitatively, we sought a corpus that included hand-tagged words (we chose to use the LOB Corpus - see section 4.2). The hand-tags in this corpus are based on the standard set of categories.
To limit the bias imposed on our validation system by these tags, we performed a detailed pre-analysis of the tags used in the corpora which we extracted from the LOB Corpus. We treated the tags assigned to words as lexical items, and then formed tag categories based on similar tags. Section 4.3 discusses the details of this study. We want to highlight two key findings. First, all tags are similar to all other tags. Most surprisingly, there is significant overlap between the contexts associated with some noun-verb tag pairs. This is surprising because the standard theory provides nouns and verbs with orthogonal syntactic feature sets (nouns have the features (+N, -V); verbs, (-N, +V)). On the other hand, this is not surprising when one considers that X-bar theory provides all major categories (including nouns and verbs) with the same general phrase structure (see [RAD88]).
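This pre-analysis amounts to running the same machinery over the stream of tags rather than the stream of words. The sketch below (again illustrative, assuming a hypothetical (word, tag) token representation) shows the substitution step.

    def tag_stream(tagged_tokens):
        # Treat the hand-assigned tags themselves as the lexical items;
        # the same context-recording and similarity machinery can then
        # be run over the stream of tags instead of the stream of words,
        # e.g. record_contexts(tag_stream(tagged_corpus)).
        return [tag for (word, tag) in tagged_tokens]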
The second key finding was that, above a certain similarity threshold, nouns were in fact separated from verbs. However, we do find that all other kinds of tag pairings continue to occur (noun with adjective, adjective with verb, etc.). This seems to be consistent with the predictions of the syntactic feature set of the standard theory. It suggests that if we perform our processing below this threshold, we should find categories that are consistent with the standard syntactic features: ±N and ±V. But this does not imply that we will find the standard categories. We actually found many categories that could be interpreted as representing single features like +N, not(+V), and so on.
We established a theory-independent stopping metric for terminating our categorization process prior to the onset of noun-verb mixing (that is, before major noun and verb categories began to be combined). When our algorithms were stopped by this metric, an interesting set of categories emerged. In both our primary corpus and our validating secondary corpus, we found that the words assigned to predominantly noun categories made up approximately equal percentages of the total words grouped (that is, 51% of the grouped words were assigned to noun categories in the primary corpus; 52% in the secondary corpus). However, the standard theory's non-noun categories did not display similar word distributions across the two corpora. Interestingly, Harris's categorization system makes a binary distinction between word categories (null operators are essentially the set of nouns, and operators are the set of non-nouns - see section 1.2). His system yields consistent results across the two corpora. This provides some validation of Harris's categorization system. So, we are seeing some convergence between the category systems of these two competing theories. We have both some validation of the standard theory's syntactic feature set, and some validation of the binary category distinction of Harris.
Before we leave this section, we want to mention one additional point. Recall that in section 2.6, we identified a situation where occurrence-based techniques would have problems. There we were concerned with a verb that was ambiguously used as either a transitive or intransitive verb. Further, the verb's transitive objects came from the same set as potential subjects. Since the "toy" language where this instance occurred had no end-of-sentence markers, it became impossible to separate these usages on the basis of the local context (preceding word and following word). In that section, we demonstrated how this problem could be overcome by maintaining some simple frequency counts.
The implication that was drawn from this example was that the kinds of information extracted from the data by the existing frequency-based statistical systems would be different from the "possibility" information extracted by our occurrence-based methodology. This prediction seems to be validated by the differences between statistical results reported in Chapter 2 and our occurrence-based results reported in Chapter 3 (analyzing data from the same "toy" language). These results argue strongly for a hybrid system using a combination of frequency-based and occurrence-based techniques. Future research should address how these two techniques complement each other, and how such a hybrid could take advantage of the strengths of both techniques.
In summary, we want to re-emphasize the three key findings mentioned above. First, treating all word contexts as equally statistically valid has identified interesting natural word categories. Thus, occurrence-based processing seems to be a viable research tool in the domain of written language syntax. Second, the word categories that we found are compatible with both standard linguistic theory and the inductive linguistic theory of Harris. This points to the possibility of further experimental results that might bridge the two theories. And finally, frequency-based and occurrence-based processing extract different kinds of information from a text stream. Both of these forms of information are relevant to syntax. Thus, a viable syntactic machine will need to be a hybrid, including both types of processing.