1.4 OCCURRENCE-BASED PROCESSING
In the following subsections we provide an overview of our occurrence-based methodology. Two key points deserve emphasis. First, our system is designed to work with real-text corpora, so it must cope with the problems that real text presents, including low-frequency words and ambiguous word usage. Second, since we use a distribution-based word categorization system, we do not assume that the word categories we find will automatically match the standard linguistic word categories (noun, verb, etc.). With these points in mind, let us proceed with the overview.
1.4.1 THE ORIGINS OF OCCURRENCE-BASED PROCESSING
This methodology grew out of a prolonged study of Elman's connectionist model of word categorization [ELMAN89,90]. Two concepts from Harris's theory were pivotal in the development of our methodology. The first was that local contexts should be sufficient for determining word classes. (Harris implied that the preceding and following word should be enough.) The second was that local contexts would be defined not in terms of the words themselves but in terms of the classes of the words forming those contexts. The generalization technique of "abstraction" (which we introduce in section 3.3) captures this notion of a context based on classes of words.
Our goal was to apply the concepts embodied in Elman's work to more realistic corpora. His model dealt with a very small lexicon (29 words), and a very simple "toy" language (sentences of the form NOUN-VERB and NOUN-VERB-NOUN). The simplicity of the "toy" language convinced us that a simpler statistical method might duplicate Elman's results. We also hoped that the technique might be scalable to larger domains. We were not able to duplicate the model's results using a variety of simple statistical approaches (see Chapter 2). Although we were not successful in extracting all of the subtleties that Elman's model found in the data, we did find an analysis that provided the "natural" categories of the "toy" language. That analytical technique was occurrence-based processing.
In the preceding analysis of Elman's model, we found support for the two concepts from Harris's theory mentioned above. We found no significant improvement in results when we expanded the analytical context beyond the preceding and following word. (This was true for both the statistical and the occurrence-based analyses.) And we found that we were unable to form a complete set of "natural" categories without generalization. The actual occurrence-based procedure for word categorization is detailed in Chapter 3.
Two key elements in our word-categorization model are the use of contexts as features and an iterative approach to hierarchical clustering. We treat contexts as features because they are either present or absent ("yes/no" data elements). We form "natural" classes of words using hierarchical clustering, which requires a similarity metric. Because of the feature-like nature of our contexts, we chose one of the feature-related metrics - the Tanimoto coefficient. A feature space is formed by taking the union of the context sets for two words; the similarity between the two words is then the number of contexts the two words have in common divided by the size of this combined feature set.
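To make the metric concrete, here is a minimal sketch in Python (the function and data names are illustrative, not the code used in this work) that computes the Tanimoto coefficient for two words represented by their sets of contexts:

    # Each word is represented by the set of (preceding word, following word)
    # contexts in which it occurs.
    def tanimoto(contexts_a, contexts_b):
        """Common contexts divided by the size of the combined feature set."""
        union = contexts_a | contexts_b
        if not union:
            return 0.0
        return len(contexts_a & contexts_b) / len(union)

    # Toy example: two words that share one of their three total contexts.
    dog = {("the", "runs"), ("a", "barks")}
    cat = {("the", "runs"), ("a", "sleeps")}
    print(tanimoto(dog, cat))   # 0.333...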
Actual words are combined using a modified form of single-linkage clustering. This technique merges the most similar words until ALL remaining word-pair similarities fall beneath a predefined threshold. We have found that if we start with a relatively high similarity threshold and slowly let the threshold drop, we get very good "natural" categories for Elman's "toy" language.
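The following sketch illustrates this threshold-lowering variant of single-linkage clustering; the particular threshold schedule and the representation of contexts as Python sets are assumptions made for illustration only:

    def tanimoto(a, b):
        # The Tanimoto measure sketched earlier.
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def cluster(words, contexts, thresholds=(0.8, 0.6, 0.4)):
        """Merge words whose best cross-cluster similarity meets the current
        threshold, lowering the threshold in stages."""
        clusters = [{w} for w in words]
        for t in thresholds:
            merged = True
            while merged:
                merged = False
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        # Single linkage: the best similarity between any
                        # word pair drawn from the two clusters.
                        best = max(tanimoto(contexts[a], contexts[b])
                                   for a in clusters[i] for b in clusters[j])
                        if best >= t:
                            clusters[i] |= clusters.pop(j)
                            merged = True
                            break
                    if merged:
                        break
        return clusters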
1.4.2 MOVING TO "REALISTIC" DATA
We used Elman's "toy" language to help us hone our technique for more "realistic" corpora. The move to "realistic" data brought many problems that were not present in the "toy" language. These problems are discussed in detail in Chapter 4. Probably the most significant problems are: (1) words from all linguistic categories will be present, (2) ambiguous word usage will occur, and (3) punctuation will be present.
Words from all linguistic categories were a major problem: their presence produced a proliferation of contexts, and as a result very few words were similar to each other. We were able to achieve reasonable results by changing our features from ALL contexts to "shared" contexts - contexts that at least two words share. Using shared contexts as our features, we were again able to obtain reasonable similarity measures.
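A minimal sketch of this filtering step, assuming each word's contexts are stored as a Python set, is:

    from collections import defaultdict

    def shared_contexts(word_contexts):
        """Keep, for each word, only the contexts that at least two
        distinct words occur in ("shared" contexts)."""
        owners = defaultdict(set)
        for word, ctxs in word_contexts.items():
            for ctx in ctxs:
                owners[ctx].add(word)
        shared = {ctx for ctx, words in owners.items() if len(words) >= 2}
        return {w: ctxs & shared for w, ctxs in word_contexts.items()}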
Ambiguous word usage was also a concern. A word like CUT can be used either as a noun or a verb. If it is used as both in the corpus, it could link nouns and verbs into a single large category instead of keeping them separate. Our similarity threshold was able to block this in most cases (the noun contexts separating CUT from verbs, and the verb contexts separating it from nouns). But for infrequently occurring words we found it necessary to revise our clustering algorithm as follows: to be considered similar, two words must have a minimum number of shared contexts in common (we found that TWO was a sufficient threshold).
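The revised eligibility test can be sketched as follows (the names are illustrative; TWO is the value reported above):

    MIN_COMMON = 2   # minimum number of shared contexts in common

    def eligible(ctxs_a, ctxs_b, threshold):
        """Two words may be combined only if they have enough shared
        contexts in common AND their Tanimoto similarity is high enough."""
        common = ctxs_a & ctxs_b
        if len(common) < MIN_COMMON:
            return False
        return len(common) / len(ctxs_a | ctxs_b) > threshold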
The main problem with punctuation was deciding what to do with it. The three obvious choices were: (1) treat punctuation as separate lexical items, (2) ignore punctuation, or (3) treat it as part of its accompanying word. Note that options (1) and (2) would require special pre-processing of raw textual input. Further, there are some serious problems with punctuation identification that we mention in Chapter 4. We chose to implement option (3). This eliminates the need for pre-processing the textual input (we only need to parse the input on white space - that is, spaces, tabs, etc.). But it does create a number of unique lexical items (such as "HARRIS", "HARRIS.", "HARRIS,", etc.). Empirically, this has actually proven beneficial to our analysis. It appears that by separating punctuation-related contexts from a word, we get a better set of "grouping" contexts.
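A small illustration of option (3), assuming nothing beyond white-space splitting:

    # Splitting only on white space leaves punctuation attached to its word,
    # so "HARRIS", "HARRIS." and "HARRIS," remain distinct lexical items.
    text = "We follow HARRIS. HARRIS, however, used base sentences."
    tokens = text.split()   # splits on spaces, tabs, newlines
    # ['We', 'follow', 'HARRIS.', 'HARRIS,', 'however,', 'used', 'base', 'sentences.']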
These results led us to conclude that "natural" categories could be determined for a "realistic" corpus by clustering words using the Tanimoto coefficient for word-pair similarity. Words are eligible for combination only if they have a minimum number of shared contexts in common (TWO) and if their similarity exceeds a particular threshold. The resulting categories are reasonably good - based on our metric of noun-verb mismatches - but the number of words placed in these categories is a very small subset of the total lexicon.
1.4.3 CLASSIFYING THE WORDS FROM A "REALISTIC" CORPUS
Our clustering technique identified a set of "natural" categories, but it did not classify very many words as members of those categories. However, the key to the categories is not the words that belong to them, but the contexts that those words share. Thus, it seemed reasonable that these KEY contexts could be used to classify additional words as category members.
We call the set of clustered words the CORE GROUPING. From this grouping we extract all shared contexts associated with the CORE words and assign a CORE group label to each. In this sense, we have identified the CORE contexts associated with our "natural" categories. We then scan through the shared contexts associated with all non-CORE words. If a word has a significant number of shared contexts in common with a "natural" category, it is classified in that category (see Chapter 5 for details). This step provides a substantial increase in the number of classified words (from 267 words to 1,534 words in the primary corpus discussed in Chapter 5).
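A minimal sketch of this first classification pass follows; the exact overlap criterion is given in Chapter 5, so the min_overlap value here is only an assumption:

    def classify_by_core_contexts(word_ctxs, core_contexts, min_overlap=2):
        """core_contexts maps each CORE context to its category label.
        A non-CORE word is assigned to the category whose CORE contexts
        it shares most, provided the overlap is large enough."""
        votes = {}
        for ctx in word_ctxs:
            label = core_contexts.get(ctx)
            if label is not None:
                votes[label] = votes.get(label, 0) + 1
        if not votes:
            return None
        label, count = max(votes.items(), key=lambda kv: kv[1])
        return label if count >= min_overlap else None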
But this limits our classification to words that occur in the same shared contexts as CORE words. We would like to classify words that occur in "similar" contexts as well. So we perform a second pass through the contexts of the remaining non-CORE words, this time looking at ALL contexts. Each context is converted into a "generalized" context (by replacing any CORE word in the context with its category label). Then, if a word has a significant number of generalized contexts in common with the generalized CORE contexts of a "natural" category, it is classified in that category (again, see Chapter 5 for details). This step further expands the number of classified words (to 1,896 words in the primary corpus in Chapter 5). Also, note that these two classification steps can be done in parallel, in effect requiring only one additional pass through the corpus.
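The generalization step itself is simple; a sketch (with illustrative category labels) is:

    def generalize(context, core_labels):
        """Replace every CORE word in a (preceding, following) context
        with its category label; other words are left unchanged."""
        return tuple(core_labels.get(w, w) for w in context)

    core_labels = {"dog": "CAT1", "cat": "CAT1"}
    print(generalize(("the", "dog"), core_labels))   # ('the', 'CAT1')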
The preceding classification procedure classified about 19% of the lexicon. Looking at the details, we find that over 70% of the words with at least TWO shared contexts are classified, and 35% of the words with ONE shared context are classified. This seems quite good, but the pattern of verb-noun mismatches reveals one large category containing 687 nouns, 88 verbs, and 49 words from other categories, which suggests that this category may not be very "natural."
Pinker's [PINKER84] model of language acquisition provides some insight here. In that model, Pinker bootstraps his system with semantically based initial categories. However, all semantically categorized words are designated temporary; words are permanently categorized only after they have appeared in the proper "structure-dependent distribution." In our case, we treat the initial CORE GROUPING as a temporary set of CORE words. The CORE contexts are the critical "structure-dependencies" that determine whether a word should continue in (or enter, or leave) the CORE GROUPING. Each word's set of shared contexts is intersected with the set of CORE contexts. If the resulting subset contains only CORE contexts associated with a single "natural" category, the word joins the CORE as a member of that category. Otherwise, it is designated a non-CORE word.
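A sketch of this structure-dependent membership test, under the same illustrative representation used above, is:

    def core_membership(word_shared_ctxs, core_contexts):
        """core_contexts maps each CORE context to its category label.
        A word is a CORE member only if all of its CORE contexts point
        to a single category; otherwise it is a non-CORE word (None)."""
        labels = {core_contexts[c] for c in word_shared_ctxs
                  if c in core_contexts}
        return labels.pop() if len(labels) == 1 else None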
Note that this procedure will add new words to the CORE, but it will also remove previously grouped words from the CORE. After all words in the lexicon have been processed, the CORE is examined and ALL "natural" categories that have fewer than two words remaining are discarded. The net result is a new, expanded CORE with fewer categories (the number of "natural" categories drops from 61 to 39 in the primary corpus of Chapter 5).
This procedure can be iterated until a stable CORE results. Note that changes in the set of CORE words will cause changes in the set of CORE contexts, which will cause changes in the set of CORE words, and so on. Ultimately, the CORE stabilizes, and that stable CORE produces a much better classification of the lexicon. The total number of verb-noun mismatches over all categories is only 80 (3.90% of the 2,051 words classified). Further, the large class of nouns mentioned above is now composed of 729 nouns, 6 verbs, and 41 other words. And, the 6 verbs all have ambiguous usage in the corpus. (Details can be found in Chapter 5.) In addition, we found that 75% of the words that have at least TWO shared contexts are classified, and 37% of the words with only ONE shared context are classified.
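The iteration can be sketched as follows (reusing the core_membership function above; the handling of contexts claimed by more than one category is simplified in this sketch):

    def iterate_core(lexicon_ctxs, core_contexts, max_iters=20):
        """lexicon_ctxs maps every word to its shared contexts;
        core_contexts maps each CORE context to a category label."""
        core = {}
        for _ in range(max_iters):
            # 1. Re-test every word against the current CORE contexts.
            core = {w: core_membership(c, core_contexts)
                    for w, c in lexicon_ctxs.items()}
            core = {w: l for w, l in core.items() if l is not None}
            # 2. Discard categories left with fewer than two member words.
            counts = {}
            for label in core.values():
                counts[label] = counts.get(label, 0) + 1
            core = {w: l for w, l in core.items() if counts[l] >= 2}
            # 3. Recompute the CORE contexts from the surviving CORE words.
            new_ctxs = {ctx: core[w] for w in core for ctx in lexicon_ctxs[w]}
            if new_ctxs == core_contexts:
                break   # the CORE has stabilized
            core_contexts = new_ctxs
        return core, core_contexts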
In an attempt to further expand the CORE, we have experimented with "splitting" words. Above, we only added words to the CORE if their contexts overlapped the CORE contexts of a single category. For words that intersect the CORE contexts of more than one category, we could form a new lexical entry (word) by splitting off the contexts associated with ONE category. Preliminary experiments along these lines are discussed in Chapter 5 (with continued improvement in lexical classification).
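One way to realize such a split, sketched under the same representation (the WORD/CATEGORY naming convention is purely illustrative), is:

    def split_word(word, word_ctxs, core_contexts):
        """Create one new lexical entry per category that the word's
        CORE contexts point to, splitting the contexts accordingly."""
        by_label = {}
        for ctx in word_ctxs:
            label = core_contexts.get(ctx)
            if label is not None:
                by_label.setdefault(label, set()).add(ctx)
        return {word + "/" + label: ctxs for label, ctxs in by_label.items()}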
The results discussed above were duplicated on two additional databases, as discussed in Chapters 5 and 6. This provides valuable evidence that the technique is robust.
1.4.4 HOW OUR METHOD RELATES TO THE BOOTSTRAPPING PROBLEM
It is important to emphasize how our method relates to the bootstrapping problem. We have developed a technique that uses distributional information from a corpus to identify an initial set of word groups. This is done without resorting to any information outside the actual text in the corpus. In this sense, we have implemented an automatic version of the structuralist solution to the bootstrapping problem.
We differ from the program established by the structuralists in two key respects. First, we maintain order information in our co-occurrence sets; this is accomplished by storing the contexts in which a word occurs, rather than merely maintaining predecessor and follower sets as the structuralists did. This step was critical to the second point of divergence, which has to do with input filtering. We do not filter the sentences of the input stream: where the structuralists limit their analysis to a privileged set of sentences (the "base" sentences), we allow all sentences to be processed. But we do find it necessary to filter the input. Our filtering is centered on a privileged set of contexts (that is, the shared contexts). The filtering that the structuralists do requires human intervention, whereas the lower-level filtering that we perform (contexts rather than sentences) is easily automated.
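The first difference can be seen in a small illustration (the data are invented): storing ordered contexts keeps each predecessor paired with its follower, whereas separate predecessor and follower sets lose that pairing.

    occurrences = [("the", "cut", "bled"), ("they", "cut", "wood")]

    # Our representation: each context of CUT is an ordered pair.
    contexts = {(before, after) for before, _, after in occurrences}
    # {('the', 'bled'), ('they', 'wood')}

    # Structuralist-style sets: the pairing of predecessor with follower is lost.
    predecessors = {before for before, _, _ in occurrences}   # {'the', 'they'}
    followers    = {after for _, _, after in occurrences}     # {'bled', 'wood'}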