2.1 INTRODUCTION
Elman's neural network models, along with most other statistical approaches to language acquisition, assume that there is sufficient information in the input stream of words for the system to build a word classifier from scratch. This approach emphasizes frequency-based statistical inference, employing various methods of mechanically extracting relevant information from the language input. The hope is that a coherent "inductive" theory of language acquisition will emerge. Although Elman's work showed that such a classifier can be built from scratch, the implementations have been restricted to "toy" domains, and the approach has not yet yielded an "inductive" theory of acquisition.
A key disadvantage of frequency-based statistical approaches is that their statistics only become meaningful in the limit: the data set must contain a large number of examples of each lexeme and each grammatical construct, so low-frequency words must be ignored. Yet humans are often capable of making linguistic generalizations on the basis of only a few examples, or even a single example. Steven Pinker (1984) indicates that there is a severe paucity of data available to humans learning lexical categories, which implies that the child must learn much of the required information from one (or very few) relevant examples. This observation coincides with the version of Zipf's Law that states that there will always be a large tail of words that appear only a few times.
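To make the long-tail claim concrete, the following sketch (a hypothetical illustration, not drawn from Elman's or Pinker's work) counts word frequencies in a tokenized corpus and reports what fraction of distinct word types occur at most twice; for natural-language corpora this fraction is typically large, which is precisely the sense of Zipf's Law invoked above.

    from collections import Counter

    def tail_fraction(tokens, max_count=2):
        """Fraction of distinct word types occurring at most max_count times.

        In natural-language corpora this fraction is typically large: most
        word types fall in the low-frequency tail described by Zipf's Law,
        so a frequency-based learner sees too few examples of them.
        """
        freqs = Counter(tokens)                       # word -> occurrence count
        rare = sum(1 for c in freqs.values() if c <= max_count)
        return rare / len(freqs)

    # Hypothetical usage, assuming a plain-text file and whitespace tokenization.
    tokens = open("corpus.txt").read().lower().split()
    print(f"{tail_fraction(tokens):.0%} of word types occur at most twice")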
The neural nets that Elman uses are complex statistical inference machines. In this chapter we investigate whether simpler statistics, based on the outputs of Elman's word-categorization model, can replicate his results. Although we will not succeed in fully duplicating those results, our investigation will yield a representation that replicates the portion of them that is most applicable to our research. This representation is based on the contexts in which words actually occur, and it forms the basis for our "occurrence-based" model of word categorization (see Chapter 3).
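As a preview of the kind of representation we have in mind, the sketch below (a simplified illustration, not the model developed in Chapter 3) describes each word by the counts of the words that immediately precede and follow it in a corpus; words with similar context profiles then become candidates for the same lexical category.

    from collections import Counter, defaultdict

    def context_vectors(tokens):
        """Map each word to counts of its immediate left and right neighbors.

        A simplified, hypothetical illustration of a context-based word
        representation: each word is characterized by the contexts in which
        it actually occurs, rather than by its overall frequency alone.
        """
        vectors = defaultdict(Counter)
        for i, word in enumerate(tokens):
            if i > 0:
                vectors[word]["left:" + tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                vectors[word]["right:" + tokens[i + 1]] += 1
        return vectors

    # Hypothetical usage: words whose context vectors overlap heavily
    # (e.g. both appearing after "the" and before "chased") are candidates
    # for membership in the same lexical category.
    vectors = context_vectors("the dog chased the cat".split())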