Zelig Harris was one of the most prominent American practitioners of linguistic structuralism. The structuralists believed that the correct way to study language was to look for regularities in actual usage. That is, they believed that word categories, grammar, etc. could be identified by direct observation of the word streams produced by native speakers of the language. Their work was built on a foundation of word-to-word interactions; structuralism postulates that "grammar-like" behavior emerges from these low-level interactions.
In Harris's theory, the ability of words to enter together into a sentence is based on their likelihood of co-occurring. This likelihood is a statistical relationship between words that can be observed over time. For example, the word "John" may admit a number of words into a sentence: "departs", "falls", etc. However, we cannot have "*John entails ...". Thus, we see that there is a class of words (verbs that allow human subjects) that are allowed to enter a sentence once "John" has occurred in the subject position, while all other words are blocked from entering it.
The class of words that can follow "John" is established by identifying all words that have been observed to follow "John". Similarly, for words like "falls" and "departs", two sets of words are identified. One set records the words that precede the target word, and a second set records the words that follow it. In our preceding example, "John" would be a member of the class of words that could precede "falls" and "departs". But, "John" would not be a member of the word class that precedes "entails". As we observe a language, we can build up a set of co-occurrence relations for each word in that language.
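To make this concrete, the following Python sketch (ours, not Harris's; the toy corpus and the whitespace tokenization are assumptions made purely for illustration) builds the two co-occurrence sets, predecessors and followers, for each word in a small corpus:

    from collections import defaultdict

    def build_cooccurrence_sets(sentences):
        # For each word, record the words observed immediately before
        # it (predecessors) and immediately after it (followers).
        predecessors = defaultdict(set)
        followers = defaultdict(set)
        for sentence in sentences:
            words = sentence.lower().split()
            for i, word in enumerate(words):
                if i > 0:
                    predecessors[word].add(words[i - 1])
                if i < len(words) - 1:
                    followers[word].add(words[i + 1])
        return predecessors, followers

    # Toy corpus (hypothetical): "John" is observed before "falls" and
    # "departs", but never before "entails".
    corpus = ["John falls", "John departs", "Mary departs",
              "this entails that"]
    pred, foll = build_cooccurrence_sets(corpus)
    print(pred["falls"])    # {'john'}
    print(pred["entails"])  # {'this'} ("john" is absent)

Scanning a larger corpus with this procedure yields exactly the per-word co-occurrence relations described above.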
In Harris's theory, the major word classes are operator types (he does not use the standard word classes: noun, verb, etc.). Operators can take zero or more arguments. The operator's first argument always precedes it, and the remaining arguments follow it. Words that can start sentences form a privileged class: operators with no predecessors. These "null" operators are labeled N. In the preceding example, "John" would belong to the set of "null" operators.
All other operators are labeled Oxy..., where x, y, ... identify the classes of the arguments that co-occur with the operator. Each of x, y, ... is chosen from either n (for null operators) or o (for all other kinds of operators). Thus, verbs would normally fall into the classes On, Onn, or Ono, depending on the complements they allow. For example, "John reads books" would imply that "read" belongs to the operator class Onn.
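As a minimal sketch of this labeling scheme (the string representation is our own convention, not Harris's notation), an operator's label can be composed directly from the classes of its observed arguments:

    def operator_label(arg_classes):
        # arg_classes: one 'n' or 'o' per argument, in order.
        # A word with no arguments is a null operator, labeled N.
        if not arg_classes:
            return "N"
        return "O" + "".join(arg_classes)

    print(operator_label([]))          # N    (e.g., "John")
    print(operator_label(["n"]))       # On   (e.g., "falls")
    print(operator_label(["n", "n"]))  # Onn  (e.g., "reads" in
                                       #       "John reads books")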
But "read" can also appear in the intransitive form "John reads". This would place "read" in two operator classes: On and Onn. Harris's stated goal is to place each word in only one operator class. In this case, we find that "read" is normally a transitive verb (that is, it most often appears with two N arguments). Therefore, it is preferable to treat the intransitive form as a non-basic form (possibly formed by eliminating an indefinite noun from the form "John reads things"). Thus, we have a distinction between the basic ("base") sentences of the language and all other sentences. Only the base sentences participate in word classification.
This, then, is the filtering mechanism that Harris uses to control which sentences actually influence the word categorization process. It is also a key problem for any machine implementation of his methodology. Although it may be relatively simple for a human to isolate the base sentences associated with words like "read", automating this step is difficult. Further, identifying base sentences in general is far more complex than this example suggests. The preceding example involved a sentence in which a component has been reduced to zero, but other reductions also make a sentence non-basic. Many of these involve affixes and would require the language learner to have an understanding of morphology. We will return to this filtering problem when we discuss Elman's work in the next section.
Note that Harris's system bootstraps itself. After a corpus of text has been scanned, we will have established a large number of co-occurrence sets. We begin forming word classes by combining words that have similar co-occurrence sets. We then modify the co-occurrence sets by replacing occurrences of these initially classified words with their class labels. This produces new, more general co-occurrence sets. We can iterate this process until all words are classified. If the process stalls before that point, human intervention can restart it (either by adding more base sentences, or by removing some questionable base sentences).
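A rough Python sketch of this bootstrapping loop follows. The similarity threshold, the Jaccard overlap measure, and the generated class labels ("C0", "C1", ...) are our assumptions; Harris does not specify a particular similarity measure.

    def jaccard(a, b):
        # Set overlap: |a & b| / |a | b| (0.0 for two empty sets).
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def bootstrap_classes(contexts, threshold=0.5, max_iters=10):
        # contexts: word -> set of co-occurring tokens. Repeatedly
        # (1) group still-unclassified words whose context sets are
        # similar, then (2) generalize every context set by replacing
        # classified words with their class label.
        word_class = {}
        for _ in range(max_iters):
            unclassified = [w for w in contexts if w not in word_class]
            changed = False
            for i, w in enumerate(unclassified):
                if w in word_class:
                    continue
                for v in unclassified[:i]:
                    if jaccard(contexts[w], contexts[v]) >= threshold:
                        label = word_class.get(
                            v, "C%d" % len(set(word_class.values())))
                        word_class[v] = label
                        word_class[w] = label
                        changed = True
                        break
            contexts = {w: {word_class.get(t, t) for t in ctx}
                        for w, ctx in contexts.items()}
            if not changed:
                break
        return word_class

    contexts = {"john": {"falls", "departs"},
                "mary": {"falls", "departs"},
                "falls": {"john", "mary"},
                "departs": {"john", "mary"}}
    print(bootstrap_classes(contexts))
    # {'john': 'C0', 'mary': 'C0', 'falls': 'C1', 'departs': 'C1'}

Note that the loop terminates either when every word is classified or when an iteration produces no new classes; the latter case corresponds to the stall that requires human intervention.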
The preceding discussion highlights the key parts of Harris's theory that directly impact our work. Before we leave, however, we want to mention three additional points. The first is that Harris states: "It should be noted that the assignment of a word X to a word class is made not on the basis of meaning (although a rough connection with meaning exists) but solely on what word classes are necessarily present with X ... [in base sentences]" [HARRIS82, p.3]. Thus, the word categorization produced by this system is expected to be quite high-level and general. (That is, we do not expect word classes to isolate semantically equivalent sets of words. Such classes might arise because the words have very close co-occurrence sets, but this is not necessary.)
The second point we wish to raise is that the process of bootstrapping the system will generate natural sub-classes within the word classes. The initial word categories are established from actual words rather than word classes. This implies that a further subdivision of the operator classes is possible, based on the similarity between the co-occurrence sets associated with individual words. For example, if a set of items within the N operator class consistently acts as the first argument of the operators "think", "read", and "write", a sub-class of human subjects will emerge within the N operator class.
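The following sketch illustrates one such subdivision (the probe operators and the subset criterion are illustrative assumptions, not Harris's procedure):

    def subdivide(class_members, contexts, probe_operators):
        # Split a word class by whether each member co-occurs with
        # every operator in probe_operators.
        inside = {w for w in class_members
                  if probe_operators <= contexts.get(w, set())}
        return inside, class_members - inside

    # Hypothetical: N-class words that act as first argument of all of
    # "think", "read", and "write" form a human-subject sub-class.
    contexts = {"john": {"think", "read", "write", "fall"},
                "mary": {"think", "read", "write"},
                "rock": {"fall"}}
    human, other = subdivide({"john", "mary", "rock"}, contexts,
                             {"think", "read", "write"})
    print(human)  # {'john', 'mary'} (set order may vary)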
Finally, note that in Harris's theory the co-occurrence sets are dynamic: they change as language users vary the meanings of their words. Thus, a word's co-occurrence set at any point in time reflects all previous experience with that word. But a word's operator class (that is, its major word category) remains constant over time. This is based on theoretical considerations: as we mentioned above, Harris attempted to restrict each word to one operator class, and any sentence that might make a word appear to be a member of multiple classes is considered a non-base sentence and excluded from analysis.
In conclusion, let us emphasize the points from Harris's structuralism that are important to our work. First, the information necessary for categorizing words can be extracted from a corpus by recording the co-occurrence sets associated with each word used in a base sentence of that corpus. Second, there should be sufficient regularity in these co-occurrence sets to begin forming word classes based on the actual words found in those sets. Once an initial set of words has been classified, further classification is based on the word classes surrounding unclassified words. The resulting set of "natural" word classes is a set of operator classes that does not exactly coincide with the standard linguistic word classes.