In his first model, Elman sought to demonstrate that "a network could learn the lexical category structure which was implicit in a language corpus." [ELMAN89, pg.3] A key assumption behind this model was:
One of the consequences of lexical category structure is word order. Not all classes of words may appear in any position. Furthermore, certain classes of words ... tend to cooccur with other words. [ELMAN89, pg.3]
The network was trained "to take successive words from the input stream and to predict the subsequent word" [ELMAN89, pg.4]. After the network was trained for six passes through 10,000 two- and three-word sentences, its internal representations were examined. The hidden unit activations were averaged over all occurrences of each word in the lexicon, and these "mean vectors" were then analyzed using "hierarchical clustering analysis." The resulting similarity structure shows a grouping of the words into the traditional lexical categories of verb and noun. The verbs are further divided by their argument requirements; the nouns are divided into animates and inanimates, and each of these categories is further subdivided into groups based on the set of verb argument roles its members can fill.
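A rough sketch of this analysis step (not of Elman's training procedure) is given below in Python. The hidden-state array, the token list, and the choice of Ward linkage are assumptions introduced here for illustration only.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Hypothetical inputs: one hidden-state vector per token occurrence and
    # the word that produced it. In Elman's setup these would come from a
    # trained simple recurrent network.
    def mean_vectors(hidden_states, tokens):
        """Average hidden-unit activations over all occurrences of each word."""
        words = sorted(set(tokens))
        means = np.stack([
            hidden_states[[i for i, t in enumerate(tokens) if t == w]].mean(axis=0)
            for w in words
        ])
        return words, means

    def cluster_words(hidden_states, tokens):
        """Hierarchically cluster the per-word mean vectors."""
        words, means = mean_vectors(hidden_states, tokens)
        # Elman's paper does not specify a linkage method; Ward is an assumption.
        return words, linkage(means, method="ward")

    # Toy random data standing in for a trained network's hidden states.
    rng = np.random.default_rng(0)
    words, tree = cluster_words(rng.normal(size=(12, 16)),
                                ["dog", "cat", "chase", "break"] * 3)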
Elman summarizes the model's performance as follows:
The network is not able to predict the precise order of specific words, but it recognizes that (in this corpus) there is a class of inputs (viz., verbs) which typically follow other inputs (viz., nouns). This knowledge of class behavior is quite detailed; from the fact that there is a class of items which always precedes chase, break, and smash, it infers a category of large animals (or possibly, aggressors). [ELMAN89, pg.7]
In Harris' theory, the ability of words to enter together into a sentence is based on their likelihood to co-occur. This likelihood is a statistical relationship between words that can be observed over time. In this theory, all words are operators. Operators can take zero or more arguments, with the operator's first argument always preceding it. Words that can start sentences form a privileged class as operators with no argument predecessors. These "null" operators are labelled N. All other operators are labelled O(x,y,...), where x, y, ... identify the classes of arguments that will co-occur with the operator. Each of x, y, ... is always chosen from n (for "null" operators) or o (for all other kinds of operators). Thus, as Harris is gathering co-occurrence statistics, he is also using the sequential ordering of words to determine operator classes.
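A toy sketch of this classification step might look like the following. The corpus, the assumption of simple subject-verb(-object) word order, and the decision rules are all invented here for illustration and only approximate the procedure described above.

    from collections import defaultdict

    # Toy corpus of simple sentences; word order alone drives the labelling.
    sentences = [
        ["dog", "chase", "cat"],     # transitive verb: O(n,n)
        ["boy", "sleep"],            # intransitive verb: O(n)
        ["girl", "break", "glass"],
        ["cat", "sleep"],
    ]

    def classify_operators(corpus):
        """Label sentence-initial words N; label later words O(n), O(n,n), ...
        according to how many arguments co-occur with them."""
        labels = defaultdict(set)
        for sent in corpus:
            labels[sent[0]].add("N")          # nothing precedes it
            if len(sent) >= 2:
                verb = sent[1]
                arity = 1 + len(sent[2:])     # first argument precedes the operator
                labels[verb].add("O(" + ",".join(["n"] * arity) + ")")
        return dict(labels)

    print(classify_operators(sentences))
    # {'dog': {'N'}, 'chase': {'O(n,n)'}, 'boy': {'N'}, 'sleep': {'O(n)'}, ...}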
When a corpus of language material has been analyzed in this way, Harris predicts that all words will fall into large operator classes. In this particular example, the nouns will be classified as type N operators, and the verbs will be classified as either a type O(n) operator -- intransitive -- or as a type O(n,n) operator -- transitive. Note that Elman identified a third class of verb with an optional direct object. Harris would eliminate this class by making the verb transitive and claiming that its object was "reduced" to zero. In fact, he would not have included any of the sentences with the missing direct objects in his "base" corpus on strictly theoretical grounds. Thus, this would be a case where having an operational paradigm to follow would have influenced the data selected to train the model.
(Note that Harris would not object to having a missing direct object in the test set for the model. He would predict that the model would partially activate all of the possible objects when the verb is presented. Then, when no object occurred, the model should have a "very likely" object highly activated. This object could be at the word level, or it could be at a higher "word group" level. In both cases, the object would be providing little or no information to the sentence and would be a candidate for reduction -- see TRANSFORMATIONS below.)
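This expectation can be illustrated with a toy output distribution. The vocabulary and activation values below are invented; the sketch only shows how a single most probable object would stand out as the "very likely" candidate for reduction.

    import numpy as np

    # Hypothetical output activations over possible objects after a transitive
    # verb is presented with no object following it.
    objects = ["cat", "mouse", "glass", "plate"]
    logits = np.array([2.0, 1.5, 0.2, 0.1])          # invented values
    probs = np.exp(logits) / np.exp(logits).sum()    # partial activation of all objects

    # The most probable object carries little new information and is the
    # candidate for reduction.
    likely = objects[int(np.argmax(probs))]
    print(dict(zip(objects, probs.round(3))), "->", likely)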
The main goal of Harris' analysis is to determine co-occurrence "likelihoods" between words in the target lexicon. A further subdivision of the operator classes will occur based on the similarity between the co-occurrence sets associated with individual words. From the fact that there is a class of items within the N operator class which always acts as the first argument for the O(n,n) operators chase, break, and smash, it follows that a subdivision of "large animals" will occur in the N operator class.
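The following sketch shows how such a subdivision could be computed from an invented corpus of (first argument, verb) pairs: each noun gets a vector of co-occurrence likelihoods over the verbs it precedes, and nouns with similar vectors cluster together.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Invented (first argument, verb) pairs.
    pairs = [
        ("lion", "chase"), ("lion", "break"), ("lion", "smash"),
        ("tiger", "chase"), ("tiger", "smash"),
        ("mouse", "sleep"), ("girl", "sleep"), ("girl", "chase"),
    ]

    nouns = sorted({n for n, _ in pairs})
    verbs = sorted({v for _, v in pairs})
    counts = np.zeros((len(nouns), len(verbs)))
    for n, v in pairs:
        counts[nouns.index(n), verbs.index(v)] += 1

    # Normalize each row into co-occurrence "likelihoods" and cluster the rows.
    likelihoods = counts / counts.sum(axis=1, keepdims=True)
    tree = linkage(likelihoods, method="average")
    # Nouns with similar rows (here lion and tiger) fall into the same
    # subdivision, mirroring the "large animals" grouping.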
Note that in Harris' theory these co-occurrence sets are "fuzzy". They are dynamic, subject to change as language users vary the meanings of their words. Thus, at any point in time, a word's co-occurrence set reflects all previous experience with that word. In other words, the co-occurrence set is a direct analog of Elman's "mean vector" of hidden unit activations. However, the "lexical" operator classes that words belong to will remain constant over time. This again is based on theoretical considerations. Harris attempts to restrict all words to one operator class, leaving any appearance of membership in multiple classes to be explained by grammatical reductions.