2.5 "OCCURRENCE-BASED" WORD REPRESENTATION
Steven Pinker's work on language acquisition [PINKER84, PINKER89] provided insight at this point. In particular, his account of how inflection is acquired proved to be relevant to our task. Pinker describes the problem as follows:
The principal difficulty that a learning mechanism for affixation faces is that different languages grammaticize different aspects of an event, and when they do, they do so OBLIGATORILY. Thus the child cannot encode the pragmatically salient notions in his or her interpretation of an input sentence containing inflections and work on the assumption that the inflections are encoding only those notions. ... And the child not only cannot use the situation to determine which notions are encoded, he or she cannot use a priori knowledge either, since for all the child knows, it could be subject animacy or object number that an affix is encoding. [PINKER84, p.168]
Thus the child has a two-fold problem. First, there are a large number of possible features that might be grammaticized using inflection, and each language draws from that class a subset that is mandatorily implemented. Second, the child must learn not only which features are used in his or her language, but also what morphology is used to represent those features. This is a very difficult search problem.
Pinker's solution involves a stochastic system. It is important to realize that Pinker uses semantics in conjunction with syntax to acquire the inflection system. Thus, the child will have a well-defined semantic notion of what the target utterance is trying to convey. In a situation where multiple features might be relevant to a given word, the child simply chooses one of the features to be the correct one. This <word, feature> pair is retained in memory for future reference. If the same word occurs in a different situation, that word can be postulated as representing multiple features. Thus, we gradually build up a set of word forms for each semantic concept, with the varying forms representing syntactically relevant features of the concept.
But what if the child incorrectly assigns a <word, feature> pair? "There is an important constraint on affixation ...: no complete set of grammatical feature values may be encoded by two or more distinct morphemes" [PINKER84, p.177]. Pinker refers to this as the "Unique Entry" principle. He implements it by allowing the child to hold conflicting <word, feature> pairs early in the learning process, but eventually allowing the most frequently occurring pairs to eliminate the "erroneous" ones from the system.
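Pinker's actual model is considerably richer, but the bookkeeping can be sketched in a few lines. The following Python fragment is my own illustration, not Pinker's implementation; the word, the candidate feature labels, and the data are invented. It simply accumulates candidate <word, feature> pairs and then lets the most frequently supported pair drive out conflicting entries, in the spirit of the Unique Entry principle:

    from collections import Counter

    # Hypothetical record of <word, feature> pairings observed across input
    # events; each observation pairs an inflected form with one candidate
    # feature the child guessed it encodes.
    observations = [
        ("walked", "past-tense"),
        ("walked", "object-number"),   # an early, erroneous guess
        ("walked", "past-tense"),
        ("walked", "past-tense"),
    ]

    pair_counts = Counter(observations)

    # Unique Entry: for each word, keep only the most frequently supported
    # feature, driving out conflicting, rarely supported entries.
    lexicon = {}
    for (word, feature), count in pair_counts.items():
        best = lexicon.get(word)
        if best is None or count > best[1]:
            lexicon[word] = (feature, count)

    print(lexicon)   # {'walked': ('past-tense', 3)}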
The key point, for us, in this system is that Pinker is able to select relevant features for words by recording them from actual linguistic events. These events will present certain dominant feature patterns that are the appropriate patterns for the language being learned. In our case, we will be considering only syntax, not semantics. Thus, the relevant linguistic context will be the words that surround our target word, and the key features will be those contexts.
To apply this concept to Elman's data, we need to establish an appropriate context. In section 2.3, we experimented with a number of possible contexts and determined that expanding the context beyond the immediate predecessor and successor did not improve our results. This coincides with Harris's insight that one neighbor on either side of the target word should be sufficient to classify that word. So, we decided to see what kind of information was available if we simply assigned each word the set of all <previous-word, next-word> features that occur in the corpus.
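A minimal sketch of this bookkeeping, assuming the corpus has already been tokenized into a flat list of words, is given below; the sample sentence is invented in the style of Elman's corpus, not drawn from it:

    from collections import defaultdict

    def occurrence_features(tokens):
        """Map each word to the set of <previous-word, next-word> contexts
        in which it occurs (corpus-initial/final positions use None)."""
        features = defaultdict(set)
        for i, word in enumerate(tokens):
            prev_word = tokens[i - 1] if i > 0 else None
            next_word = tokens[i + 1] if i < len(tokens) - 1 else None
            features[word].add((prev_word, next_word))
        return features

    # Invented sample: three Elman-style sentences run together with no
    # boundary markers, as in the original corpus.
    tokens = "boy chases dog girl sees boy dog sleeps".split()
    for word, contexts in sorted(occurrence_features(tokens).items()):
        print(word, contexts)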
The results are found in Figure 2.7. Again, I have used the primary groupings for the nouns, the INTRAN verbs, and the DO-REQD verbs. The grid shows all possible features. Within the grid, I have labeled each square with a code representing which words have that feature. Note that the nouns have low numbers and letters from early in the alphabet; the verbs, high numbers and letters from later in the alphabet.
Probably the most important result is that all the verbs are found in the upper right-hand quadrant. This seems obvious in retrospect because the verbs always appear in a <NOUN, NOUN> context. (The two possibilities that can occur in the Elman corpus are: (1) transitive verbs have <SUBJECT, OBJECT> features and (2) intransitive verbs have <SUBJECT, SUBJECT> features.) Further, it is never possible to get three nouns in succession, so nouns will NEVER have a <NOUN, NOUN> feature. Using these features as the internal representation of words in Elman's lexicon therefore places nouns and verbs in orthogonal sets, and any feature-based clustering technique applied to this representation should give us the clean split between nouns and verbs we have been seeking.
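To make the orthogonality claim concrete, one can check that the feature sets collected for the nouns never overlap those collected for the verbs. The sketch below reuses the occurrence_features function and the invented tokens from the previous fragment; the noun and verb class labels are supplied by hand for the check:

    # Hypothetical check of the orthogonality claim: the <previous-word,
    # next-word> feature sets for nouns and verbs should be disjoint, so any
    # set-based clustering will separate the two classes.
    features = occurrence_features(tokens)

    nouns = {"boy", "dog", "girl"}
    verbs = {"chases", "sees", "sleeps"}

    noun_features = set().union(*(features[w] for w in nouns))
    verb_features = set().union(*(features[w] for w in verbs))

    print(noun_features & verb_features)   # prints set(): the classes are disjoint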
Figure 2.8 summarizes the key generalizations that can be drawn from figure 2.7. It implies that we may be able to actually detect the end of sentences. Further, we may be able to detect subject and object roles within the sentences. This all seems quite promising, but we must keep in mind that we are only looking at Elman's toy language.
It should be noted that we are only storing lists of features for each word in the lexicon; we are not maintaining any frequency counts. Thus, one occurrence of a feature counts just as much as 1000 occurrences of another feature. This gives us the notion of what can "possibly" occur rather than what is "most likely" to occur. Since the occurrence of a feature is the key event in this system, we have decided to call this an "occurrence-based" word representation.
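The distinction can be stated directly in code: an occurrence-based representation records only that a context has been seen at least once, whereas a frequency-based alternative would also weight contexts by how often they occur. The fragment below, again using invented tokens, shows both side by side; only the set-based version is used in this work:

    from collections import Counter, defaultdict

    tokens = "boy chases dog girl sees boy dog sleeps".split()   # invented sample

    occurrence = defaultdict(set)      # occurrence-based: membership only
    frequency = defaultdict(Counter)   # frequency-based alternative (not used here)

    for i, word in enumerate(tokens):
        context = (tokens[i - 1] if i > 0 else None,
                   tokens[i + 1] if i < len(tokens) - 1 else None)
        occurrence[word].add(context)     # what can "possibly" occur
        frequency[word][context] += 1     # what is "most likely" to occur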