1.5 SUMMARY AND FUTURE DIRECTIONS
The studies described above show great promise for our methodology. The simple metric of recording the contexts in which a data element occurs can be quite useful in the study of natural language. Because of the nature of the natural language domain, every context provides valuable information about a word; indeed, a single context can tell us enough to classify a word. Obviously, a single context (or very few contexts) is an extremely fragile classifier and is highly susceptible to noise. Our methodology copes with this fragility by being very selective about which contexts are allowed to classify words. This has allowed us to process a large subset of the low-frequency words, but a much larger subset remains beyond the current capabilities of our technique.
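As a rough illustration of the two ideas in this paragraph (all names and the choice of a one-word left/right context window are hypothetical, not the thesis's actual procedure), the following sketch records each word's contexts and then classifies a word only when its contexts license exactly one category:

```python
# Minimal sketch, assuming a context is the (preceding word, following word)
# pair and that unreliable contexts have already been filtered out.
from collections import defaultdict

def collect_contexts(tokens):
    """Map each word to the set of (left, right) contexts it occurs in."""
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts

def classify_low_frequency(contexts, classifying_contexts):
    """Assign a category to a word only when its contexts are selective.

    `classifying_contexts` maps a context to a single category; contexts
    seen with words of more than one category are assumed to have been
    removed beforehand (the selectivity step described in the text).
    """
    labels = {}
    for word, ctxs in contexts.items():
        hits = {classifying_contexts[c] for c in ctxs if c in classifying_contexts}
        if len(hits) == 1:          # exactly one category is licensed
            labels[word] = hits.pop()
    return labels
```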
As indicated above, we have only begun to study the use of this context-based knowledge. The study has borne fruit in the area of word categorization thus far, but the procedure has a number of parameters that may be subject to further tuning, and word "splitting" has received only token attention. Yet to be determined is the optimal size for the initial CORE GROUPING generated by our clustering procedure. A related question is how small a CORE can be while still generating reasonable classification results.
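To make the CORE-size parameter concrete, the sketch below seeds a CORE greedily from the words whose context sets overlap most with a seed word; `core_size` is the quantity whose lower bound is in question. This is a hypothetical illustration of where such a parameter sits, not the clustering procedure used in this work.

```python
def build_core(contexts, seed_word, core_size):
    """Return a CORE of `core_size` words built around `seed_word`.

    `contexts` is the word -> set-of-contexts mapping from the sketch above;
    words are ranked by how many contexts they share with the seed.
    """
    def overlap(w):
        return len(contexts[w] & contexts[seed_word])

    candidates = sorted(
        (w for w in contexts if w != seed_word),
        key=overlap,
        reverse=True,
    )
    return [seed_word] + candidates[: core_size - 1]
```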
An additional set of questions arises when we consider whether a larger subset of the low-frequency words can be classified. An obvious step would be to increase the corpus size. This, however, introduces two major problems. First, Zipf's law implies a proportional increase in the number of low-frequency words; thus, although the number of words classified should grow, the percentage of words classified may not improve. Second, a larger corpus will compound the ambiguous-usage problem, so the parameters that control this problem will need to be retuned. The key to expanding the coverage of low-frequency words is producing a set of generalized CORE contexts that more completely cover the corpus. A combination of corpus expansion and parameter tuning may be able to meet this goal.
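The Zipf's-law point can be checked with a small simulation (the Zipfian exponent of 1.0, the vocabulary size, and the frequency threshold below are assumptions for illustration): doubling the corpus raises the raw count of low-frequency types, while their share of the observed vocabulary stays roughly constant.

```python
import random
from collections import Counter

def sample_corpus(n_tokens, vocab_size=50_000, seed=0):
    """Draw tokens (as ranks) from a Zipfian distribution, freq ~ 1/rank."""
    rng = random.Random(seed)
    ranks = range(1, vocab_size + 1)
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)

def low_frequency_stats(tokens, threshold=5):
    """Count types occurring at most `threshold` times and their share of types."""
    counts = Counter(tokens)
    low = sum(1 for c in counts.values() if c <= threshold)
    return low, low / len(counts)

for n in (100_000, 200_000):
    low, share = low_frequency_stats(sample_corpus(n))
    print(f"{n:>7} tokens: {low} low-frequency types ({share:.0%} of types seen)")
```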
Additional research topics arise both within and beyond the class level. Within word classification, there is the question of sub-categories; specifically, it would be interesting to determine how the CORE contexts partition the "natural" categories. Beyond word classification, there is the question of phrasal components. Here we have already seen patterns in which subgroups of CORE contexts share a common initial or final word. For example, consider a hypothetical category in which all CORE contexts begin with the word TO. A simple next step would be to treat the combination "TO word", for CORE words from that category, as a lexical item. These items would form "mini-phrases." Our occurrence-based procedure would then treat these mini-phrases just like words, allowing us to begin examining higher-order components, as in the sketch below. (See Chapter 7 for more details.)
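A hedged sketch of the mini-phrase step follows: when a word from the hypothetical TO-initial category follows "to", the pair is fused into a single lexical item, and the retokenized text can be fed back through the same occurrence-based procedure (e.g., `collect_contexts` above). The token format and function names are illustrative only.

```python
def fuse_mini_phrases(tokens, category_words, trigger="to"):
    """Merge 'to' + category word into one 'TO_word' token; copy others as-is."""
    fused, i = [], 0
    while i < len(tokens):
        if (
            tokens[i].lower() == trigger
            and i + 1 < len(tokens)
            and tokens[i + 1] in category_words
        ):
            fused.append(f"TO_{tokens[i + 1]}")  # one higher-order unit
            i += 2
        else:
            fused.append(tokens[i])
            i += 1
    return fused
```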
The preceding paragraphs have mentioned only a few possible research directions for this approach. We believe that our preliminary investigations have shown the viability of this method and that these and other research projects should be pursued.