Pinker [PINKER84] described the bootstrapping problem as follows: "Linguistic entities ... have characteristic distributions: they appear in particular phrase structure positions and are marked by particular affixes, or both. However, this cannot be much help to the child, since the particular phrase structure rules and affixes that signal their presence in a language are part of what the child has to learn in the first place" [PINKER84, p. 38]. In particular, to assign a word to a category, it is necessary to know the categories of the surrounding words. For such a system to work, it must begin with an initial set of categorized words sufficient to start and sustain the categorization process.
There are two basic approaches to this problem. The first assumes that some modality external to the word stream will provide sufficient information to bootstrap the process. Pinker's theory of language acquisition follows this approach [PINKER84]. He assumes (following [GRIMSHAW81] and [MACNAMARA82]) that the child can use the semantics of the corresponding event to assist in assigning categories to the words in a given utterance. Because these semantically-based categories were not established by distributional properties, they are considered temporary. The assumption underlying this semantic bootstrapping hypothesis is that the semantically-based initial categories will be sufficient to allow permanent, distribution-based categories to be assigned to words.
This first approach rests on several key assumptions. First, the actual word categories must be provided by something other than the word stream; in Pinker's system, the standard linguistic word categories (noun, verb, etc.) are assumed to be the "natural" categories for the word stream. Second, the actual stream of utterances must be "filtered" in some way to avoid erroneous initial word classifications. Some noise can be tolerated, since subsequent distributional learning can correct erroneous semantic classifications, but only a small amount of such noise is recoverable. Finally, the bootstrapping system must provide a set of categorized words sufficient to sustain subsequent word categorization.
The second approach to the bootstrapping problem assumes that there is enough information in the stream of words itself to start the categorization process. The work of the structural linguists follows this approach: they assume that by examining only the distribution of words within a body of language, it is possible to induce a set of categorized words sufficient to begin the categorization process. Harris was a leading proponent of this approach, and we will discuss his work in more detail below.
This approach also rests on several key assumptions. First, there must be enough stable word contexts to establish an initial set of word classes by reference to word distributions alone. It is by no means obvious that this can be done automatically by a machine, but the presence of many high-frequency, closed-class function words in English makes this step appear feasible. Note that this approach builds word categories up from the actual word stream; the resulting "natural" word categories may or may not match the standard linguistic categories. This method is also subject to noise, which in this case involves words whose distributions are compatible with multiple word categories; again, a filtering mechanism on the input is required. Finally, the bootstrapping system must provide a set of categorized words sufficient to sustain word categorization.
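To make the distributional idea concrete, the following is a minimal, illustrative sketch of such a procedure, not the methodology of section 1.4. It assumes a tokenized corpus, takes a word's immediate left and right neighbors as its context, and greedily groups frequent words whose context profiles overlap. All function names, thresholds, and the overlap measure are hypothetical choices made for illustration.

```python
# Illustrative sketch only: induce rough word classes from neighbor contexts.
# The profile-overlap measure and both thresholds are hypothetical choices.
from collections import Counter, defaultdict

def context_profiles(tokens, min_count=25):
    """Map each sufficiently frequent word to a count of its
    (left neighbor, right neighbor) contexts."""
    freq = Counter(tokens)
    profiles = defaultdict(Counter)
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if freq[word] >= min_count:
            profiles[word][(tokens[i - 1], tokens[i + 1])] += 1
    return profiles

def overlap(p, q):
    """Fraction of the smaller profile's mass found in shared contexts."""
    shared = sum(min(p[c], q[c]) for c in p if c in q)
    return shared / min(sum(p.values()), sum(q.values()))

def induce_categories(profiles, threshold=0.4):
    """Greedily group words whose context profiles overlap strongly,
    starting from the most frequent (most stable) words."""
    categories = []
    for word in sorted(profiles, key=lambda w: -sum(profiles[w].values())):
        for members in categories:
            if overlap(profiles[word], profiles[members[0]]) >= threshold:
                members.append(word)
                break
        else:
            categories.append([word])
    return categories
```

On English text, the high-frequency function words (the, of, a, in, ...) accumulate stable context profiles quickly, which is what makes a frequency-ordered greedy pass a plausible starting point. A real system would also need the input filtering discussed above to handle words whose profiles straddle several categories.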
We are primarily concerned with machine categorization of words. For us, a critical question is: how much can a machine learn from the stream of words alone? (Whatever cannot be learned from the word stream must be provided to the machine by some other mechanism.) Our research has focused on answering this question, and to that end we have followed the latter approach to the bootstrapping problem. Our occurrence-based methodology has been reasonably successful at inducing interesting, "natural" word categories from a moderate-size corpus (approximately 40,000 words). We have also been able to assign categories to many of the words used in the corpus; thus, it appears to be a sustainable word categorization procedure. We will discuss our technique in more detail in section 1.4.