2.7 CONCLUSION
We have developed a representation that is quite simple and that should generalize to more realistic language domains. It requires only noting the <previous-word, next-word> contexts in which a word occurs. Since our current research is primarily concerned with categorizing words, this representation appears sufficient for our needs. However, as indicated in the preceding section, more sophisticated linguistic analysis will require additional statistical information.
NOTES
1. The vector we are using for the context-sensitive representation of words has two primary motivations. First, it is consistent with Elman's net design. For the word categorization experiment discussed in this chapter, he provided enough hidden units to hold up to 5 input vectors, so the net could actually retain the previous 5 inputs if it wished. In a second experiment, discussed in Chapter 2, he allowed sufficient hidden units to retain 7 input vectors, and evidence from the center-embedding results suggests that the net was in fact retaining information from the 7 preceding inputs. The second reason for the representation we have chosen is our desire to use only simple statistics. The simplest way to allow the prior context to influence the representation of a word was to count the occurrences of every word that preceded the subject word in the corpus; normalizing these counts so that they sum to 1 produces a previous-word transition vector. Pairing this previous-word transition vector with the corresponding next-word transition vector produces the context-sensitive representation that we use here.
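As a concrete illustration, the following Python sketch builds this representation from a pre-tokenized corpus, assumed here to be a list of sentences, each a list of word strings. The function and variable names are illustrative only and are not taken from the original experiments.

    from collections import defaultdict

    def context_vectors(sentences):
        # Build a <previous-word, next-word> representation for each word type.
        vocab = sorted({w for s in sentences for w in s})
        index = {w: i for i, w in enumerate(vocab)}
        n = len(vocab)

        prev_counts = defaultdict(lambda: [0.0] * n)  # occurrences of each preceding word
        next_counts = defaultdict(lambda: [0.0] * n)  # occurrences of each following word

        for s in sentences:
            for i, w in enumerate(s):
                if i > 0:
                    prev_counts[w][index[s[i - 1]]] += 1
                if i < len(s) - 1:
                    next_counts[w][index[s[i + 1]]] += 1

        def normalize(v):
            total = sum(v)
            return [x / total for x in v] if total else v

        # Concatenate the normalized previous-word and next-word transition vectors.
        return {w: normalize(prev_counts[w]) + normalize(next_counts[w]) for w in vocab}

Words that occur in similar <previous-word, next-word> contexts thereby receive similar paired transition vectors, which is what the categorization analysis relies on.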
It should be noted that we did some experiments with conditional probabilities. These involved representing a word by a matrix of transition counts, with the previous word as one dimension and the next word as the other: each cell held the number of occurrences of a given next word given the current word and the previous word, which, once normalized, yields the conditional probability of that next word. However, the "toy" language corpus was not large enough to produce meaningful results. Since we were using a corpus six times the size of Elman's (60,000 randomly generated sentences versus Elman's 10,000), this is further evidence of the power of his model's higher-order statistical analysis. We could have made these conditional probabilities meaningful by increasing the size of the corpus, but doing so would not be consistent with our goal of opening our analysis to low-frequency words.
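For reference, a minimal sketch of this second representation is given below, under the same assumptions about corpus format as the previous sketch; again, the names are purely illustrative. Each word is mapped to a (previous word x next word) count matrix, and row-normalizing that matrix estimates the probability of the next word given the current and previous words.

    from collections import defaultdict

    def transition_matrices(sentences):
        vocab = sorted({w for s in sentences for w in s})
        index = {w: i for i, w in enumerate(vocab)}
        n = len(vocab)

        # One (previous-word x next-word) count matrix per word type.
        matrices = defaultdict(lambda: [[0.0] * n for _ in range(n)])
        for s in sentences:
            for i in range(1, len(s) - 1):
                prev_w, cur_w, next_w = s[i - 1], s[i], s[i + 1]
                matrices[cur_w][index[prev_w]][index[next_w]] += 1

        # Row-normalize to estimate P(next word | current word, previous word).
        # With a small corpus most rows are empty or near-empty, which is the
        # data-sparseness problem noted in the text.
        for m in matrices.values():
            for row in m:
                total = sum(row)
                if total:
                    for j in range(n):
                        row[j] /= total
        return dict(matrices)

Because the number of cells grows with the square of the vocabulary, most cells remain empty unless the corpus is made much larger, which is why these estimates were unreliable, particularly for low-frequency words.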