1.3 ELMAN'S MODEL OF WORD CATEGORIZATION
Elman's model of word categorization [ELMAN89,90] was designed to extract significant regularities from "streams" of input data. A key component of this work was the inclusion of time. That is, this network used both the current input and the previous network state to determine its output. This inclusion of time has allowed his network to recognize regularities that go beyond the reach of earlier network designs. Significantly, his network's internal word representations lead to word categories that very closely follow the predictions of Harris.
In this model, Elman sought to demonstrate that "a network could learn the lexical category structure which was implicit in a language corpus." [ELMAN89, pg.3] A key assumption behind this model was: "One of the consequences of lexical category structure is word order. Not all classes of words may appear in any position. Furthermore, certain classes of words ... tend to co-occur with other words" [ELMAN89, pg.3]. The network was trained "to take successive words from the input stream and to predict the subsequent word" [ELMAN89, pg.4]. After being trained on 6 cycles through 10,000 two- and three-word sentences, the network's internal representations were examined. The hidden unit activations were averaged over all occurrences for each word in the lexicon. Then, these "mean vectors" were analyzed using hierarchical clustering analysis. The resulting similarity structure shows a grouping of the language's words into the traditional lexical categories (noun and verb). The verbs are further divided by their argument requirements. The nouns are divided into animates and inanimates. And each of these noun sub-categories is further divided into groups based on the set of verb argument roles they can fill.
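To make this pipeline concrete, the sketch below (in Python, using numpy and scipy) follows the steps just described: a simple recurrent network is trained to predict the next word, the hidden unit activations are averaged over every occurrence of each word, and the resulting "mean vectors" are submitted to hierarchical clustering. This is our own minimal sketch; the tiny lexicon, sentences, and all hyper-parameters are illustrative assumptions, not Elman's actual settings.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Tiny illustrative lexicon and corpus (assumptions, not Elman's data).
lexicon = ["man", "woman", "dog", "cat", "chase", "break", "smash", "sleep"]
word_id = {w: i for i, w in enumerate(lexicon)}
V, H = len(lexicon), 16                      # vocabulary size, hidden units

rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(H, V))   # input -> hidden weights
W_ctx = rng.normal(scale=0.1, size=(H, H))   # context (previous hidden) -> hidden
W_out = rng.normal(scale=0.1, size=(V, H))   # hidden -> output weights

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

sentences = [["man", "chase", "dog"], ["woman", "break", "cat"],
             ["dog", "sleep"], ["cat", "smash", "man"]]

# Train the network to predict the next word from the current word plus the
# copied-back context, using plain backprop at each step (no backprop through
# time), in the spirit of Elman's training scheme.
lr = 0.1
for epoch in range(200):                     # stand-in for 6 passes over 10,000 sentences
    for sent in sentences:
        h_prev = np.zeros(H)                 # context units start at zero
        for cur, nxt in zip(sent[:-1], sent[1:]):
            x, t = one_hot(word_id[cur]), one_hot(word_id[nxt])
            h = sigmoid(W_in @ x + W_ctx @ h_prev)
            y = sigmoid(W_out @ h)
            d_out = (y - t) * y * (1 - y)    # squared-error gradient at the output
            d_hid = (W_out.T @ d_out) * h * (1 - h)
            W_out -= lr * np.outer(d_out, h)
            W_in  -= lr * np.outer(d_hid, x)
            W_ctx -= lr * np.outer(d_hid, h_prev)
            h_prev = h

# Average the hidden unit activations over every occurrence of each word to
# obtain its "mean vector", then cluster the mean vectors hierarchically.
sums, counts = np.zeros((V, H)), np.zeros(V)
for sent in sentences:
    h_prev = np.zeros(H)
    for w in sent:
        h_prev = sigmoid(W_in @ one_hot(word_id[w]) + W_ctx @ h_prev)
        sums[word_id[w]] += h_prev
        counts[word_id[w]] += 1
mean_vectors = sums / counts[:, None]

tree = linkage(mean_vectors, method="average")
print(tree)   # a dendrogram of this linkage gives the lexical-category tree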
Elman summarizes the model's performance as follows: "The network is not able to predict the precise order of specific words, but it recognizes that (in this corpus) there is a class of inputs (viz., verbs) which typically follow other inputs (viz., nouns). This knowledge of class behavior is quite detailed; from the fact that there is a class of items which always precedes chase, break, and smash, it infers a category of large animals (or possibly, aggressors)" [ELMAN89, pg.7].
Harris predicted that, when a corpus of language material is analyzed in this way, all of its words will fall into large operator classes. In this particular example, the nouns will be classified as type N operators, and the verbs will be classified either as type On operators (intransitive) or as type Onn operators (transitive). Note that Elman identified a third class of verbs that take an optional direct object. Harris would eliminate this class by assigning each such verb to either class On or class Onn, depending on whether it occurs more often in its intransitive or its transitive form. Then, sentences in which the verb appears in its non-assigned form would be excluded from processing as non-base sentences.
(Note that Harris would not object to having a missing direct object in the test set for the model. He would predict that the model would partially activate all of the possible objects when the verb is presented. Then, when no object occurred, the model should have a "very likely" object highly activated. This object could be at the word level, or it could be at a higher "word group" level.)
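As a concrete, purely hypothetical illustration of this assignment rule, the short sketch below places an optionally transitive verb in On or Onn according to whichever usage is more frequent; the verbs and counts are invented for the example.

# Purely hypothetical counts illustrating the assignment rule described above.
usage_counts = {
    # verb: (intransitive occurrences, transitive occurrences); invented data
    "break": (12, 30),
    "eat":   (25, 10),
    "move":  (18, 18),
}

def harris_class(verb):
    intransitive, transitive = usage_counts[verb]
    return "Onn" if transitive > intransitive else "On"

for verb in usage_counts:
    print(verb, "->", harris_class(verb))
# Sentences using a verb in its non-assigned form would then be excluded
# from the analysis as non-base sentences.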
It is significant that Elman's model was able to categorize words without resorting to base sentences. Elman's third class of verbs did not prevent the model from establishing the desired noun/verb distinction. In effect, what the Elman model gives us is a separation between the null operators and all other operators. However, the division between intransitive verbs and transitive verbs is not quite as strong as Harris predicts (see section 2.2). We find that the initial word categories respect the On and Onn distinction, but the optionally transitive/intransitive verbs rapidly begin to merge these two categories. Recall, however, that the key distinction in describing a word's arguments was between n (null operator) and o (other operators). This distinction remains.
This result has two implications. First, it appears that allowing non-base sentences to enter the analysis may not have an adverse effect. Therefore, it may be possible to build a categorization system without a base-sentence filter. This filter was a point in the Harris system that required human intervention; the fact that it may be omitted makes the automation of Harris's word categorization system far more likely.
The second implication is that some order information needs to be maintained in the co-occurrence sets. The internal word representations that Elman's model produces not only yield good word categories; they also support outputs that are context-sensitive next-word probabilities. This implies that the representations must maintain some word sequence information. We found this to be an important point, and have included word sequences in our co-occurrence sets. These sets can then be used to generate the appropriate predecessor and follow sets that the Harris theory requires.
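The sketch below shows one way (our own construction, not Elman's or Harris') to keep this order information: each word's co-occurrence set records the ordered sentences it appears in, and the predecessor and follow sets are derived from those ordered records. The toy corpus and variable names are illustrative assumptions.

from collections import defaultdict

# Assumed toy corpus of pre-segmented sentences.
sentences = [["dog", "chase", "cat"], ["man", "break", "glass"], ["cat", "sleep"]]

cooccurrence_sets = defaultdict(list)   # word -> ordered contexts it occurred in
predecessor_sets  = defaultdict(set)    # word -> words seen immediately before it
follow_sets       = defaultdict(set)    # word -> words seen immediately after it

for sent in sentences:
    for i, w in enumerate(sent):
        cooccurrence_sets[w].append((i, tuple(sent)))   # position + word sequence
        if i > 0:
            predecessor_sets[w].add(sent[i - 1])
        if i < len(sent) - 1:
            follow_sets[w].add(sent[i + 1])

print(follow_sets["dog"])          # {'chase'}
print(predecessor_sets["chase"])   # {'dog'}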
It should be noted that the Elman noun category has a number of subdivisions. Further, as Harris predicted, those subdivisions are based on the similarity between the co-occurrence sets associated with individual words. In the Elman corpus, we find that, since there is a class of items within the N operator class that always acts as the first argument of the Onn operators chase, break, and smash, a subdivision of "large animals" occurs within the N operator class. Note that the Elman model does form a subclass of words that are purely intransitive verbs. Thus, there is the possibility that the Harris On/Onn distinction is being maintained at the subclass level.
Note that in Harris' theory the co-occurrence sets are dynamic, subject to change as language users vary the meaning of their words. Thus, at any point in time, a word's co-occurrence set reflects all previous experience with that word. In this respect, the co-occurrence set is a direct analog of Elman's "mean vector" of hidden unit activations (the model's internal word representations).
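A minimal sketch of this analogy, using our own assumed data structures: a word's co-occurrence set grows incrementally with each newly observed context, just as a running mean of hidden unit activations would be updated with each new occurrence of the word.

import numpy as np

class WordRecord:
    """Per-word record: an ordered co-occurrence set plus a running mean vector."""
    def __init__(self, hidden_size):
        self.contexts = []                        # dynamic co-occurrence set
        self.mean_vector = np.zeros(hidden_size)  # analog of Elman's "mean vector"
        self.count = 0

    def update(self, context, hidden_activation):
        self.contexts.append(tuple(context))      # grows with every new experience
        self.count += 1
        # incremental running mean over all occurrences seen so far
        self.mean_vector += (hidden_activation - self.mean_vector) / self.count

record = WordRecord(hidden_size=16)
record.update(["dog", "chase", "cat"], np.random.default_rng(0).random(16))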
In the Elman model, the corpus of sentences was generated by a very simple sentence generator. It had a set of simple two- and three-word sentence "templates" that it randomly filled with words from the lexicon. This corpus was so constrained that it would easily satisfy Harris' criterion for forming a sublanguage [HARRIS89]. A sublanguage is a very restricted subset of the language as a whole; the key restriction is that the words assigned to the sublanguage have only a "standard" usage. It is interesting to note that in the Harris sublanguage study, the objects of study were likewise simple two- and three-word sentences. The analysts selectively extracted specific sentences to serve as base sentences and hand parsed them into either <SUBJECT, PREDICATE> or <SUBJECT, VERB, OBJECT> sequences. The words were then manually categorized based on these mini-sentences.
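The sketch below is an illustrative reconstruction (not Elman's actual generator) of such a template-driven corpus generator: each template slot names a word class, and sentences are produced by filling the slots at random from the lexicon. The classes and words shown are assumptions for demonstration.

import random

# Assumed word classes and templates, loosely modeled on the description above.
lexicon = {
    "NOUN-HUM":  ["man", "woman"],
    "NOUN-ANIM": ["cat", "dog"],
    "VERB-TRAN": ["chase", "smash", "break"],
    "VERB-INTR": ["sleep", "think"],
}

templates = [
    ["NOUN-HUM", "VERB-INTR"],                  # two-word sentence template
    ["NOUN-ANIM", "VERB-TRAN", "NOUN-ANIM"],    # three-word sentence template
]

def generate(n_sentences, seed=0):
    rng = random.Random(seed)
    for _ in range(n_sentences):
        template = rng.choice(templates)
        yield [rng.choice(lexicon[word_class]) for word_class in template]

for sentence in generate(5):
    print(" ".join(sentence))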
We find it interesting that the actual data used in this manual implementation of Harris' theory should so closely resemble Elman's data. Again, the key point in the Harris system was that the words actually being studied should have a standard usage. And the Elman model seems to show that some deviation from the standard-usage criterion will not be fatal to the categorization process.
In conclusion, let us emphasize two points from Elman's model that are important to our work. First, the model shows that it may not be absolutely necessary to avoid processing non-base sentences. It appears that allowing non-base sentences will "muddle" the distinction between certain operator types, but that it will not adversely affect the key distinction for classification: n (for null operators) vs. o (for all other operators). Further, the finer distinction between the O-type operators may be obtainable by dividing the resulting O class into sub-classes. Second, it appears that maintaining order information in the actual co-occurrence sets will be important (see section 2.5 for a different route to the same conclusion).
We would like to emphasize the critical influence of Elman's work on our research. It showed us that it was possible to produce an automatic word categorization system using concepts compatible with Harris's theory. Further, it provided us with a demonstration "toy" language which was critical to the development of our system.