7.2 FUTURE RESEARCH
As we mentioned earlier, the research in this document is very preliminary. Its intent was to test the feasibility of occurrence-based processing as a research approach. We believe that the preceding section has shown that the technique has much to offer the field of computational linguistics. This section highlights several areas where further work might be fruitful.
A key point of concern is that only about 25% of the lexicon was actually categorized by our technique. A large portion of the words in each corpus that we studied had fewer than two shared contexts, and our similarity metric requires a similar word pair to share at least two contexts. As a result, a large number of words were not eligible for categorization. We used our similarity metric to extract a small subset of words that would form the basis for categorization. This initial CORE categorization was then expanded using a separate set of algorithms, which grew the CORE from less than 5% of the lexicon to about 8%. A further set of algorithms then performed the actual categorization of words. The expanded CORE produced a larger and qualitatively better set of categorized words than the initial CORE did. Unfortunately, this set of categorized words still covered only about 25% of the lexicon.
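The eligibility constraint described above can be illustrated with a minimal sketch. It is not the similarity metric used in this research. It assumes, for illustration only, that a context is the (predecessor, successor) pair surrounding a word in a word triple, and it reports which word pairs share at least two such contexts and what fraction of the lexicon is thereby eligible. The names shared_context_pairs, eligible_fraction, and min_shared are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_context_pairs(triples, min_shared=2):
    """Map each word pair to its shared contexts, keeping only pairs
    that share at least `min_shared` contexts.

    Assumption: a context is the (predecessor, successor) pair around
    the middle word of a triple. This is an illustrative stand-in, not
    the similarity metric used in the study.
    """
    contexts = defaultdict(set)
    for pred, word, succ in triples:
        contexts[word].add((pred, succ))

    pairs = {}
    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        if len(shared) >= min_shared:
            pairs[(a, b)] = shared
    return pairs


def eligible_fraction(triples, min_shared=2):
    """Fraction of the lexicon (middle words only) that appears in at
    least one qualifying pair."""
    triples = list(triples)
    lexicon = {word for _, word, _ in triples}
    if not lexicon:
        return 0.0
    pairs = shared_context_pairs(triples, min_shared)
    eligible = {w for pair in pairs for w in pair}
    return len(eligible) / len(lexicon)


# Toy data, invented for this example.
toy_triples = [
    ("the", "cat", "sat"), ("a", "cat", "ran"),
    ("the", "dog", "sat"), ("a", "dog", "ran"),
    ("the", "bird", "sat"),
    ("my", "fish", "swam"),
]
print(shared_context_pairs(toy_triples))
print(eligible_fraction(toy_triples))
```

In the toy data, "bird" shares only one context with "cat" and "dog", so it is excluded under a threshold of two. Lowering min_shared would admit more words at the cost of weaker evidence for each pair, which is the trade-off behind the conservative design described here.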
Our primary goal was to show that meaningful word categories could be formed. Therefore, we were very conservative in the way we designed our algorithms. This led to reasonably good categories, but we did not actually attempt to optimize the categories. Some key areas that need to be examined in more detail are:
Looking beyond the categorization technique developed thus far, there are two broad areas for further study. The first involves within-category studies. Harris's theory states that the way words are used in contexts will be a key determinant of sub-categories. For example, a detailed study of the predecessor words within a category may yield interesting sub-categories (of the predecessor words, within their own categories). More intuitively, the actual CORE contexts for a category may divide the words within that category in an interesting way. There seems to be plenty of ground to explore at the sub-category level.
The second broad area of study is phrasal components. Once the realm of word triples has been thoroughly investigated, we will want to move on to larger text segments. One method for doing that is to identify key word pairs that co-occur. This will probably involve a sub-category and a context element that is common to all elements within that sub-category. We have done some processing in which such patterns do appear. Once a pattern like this is recognized, we could treat the <CORE word, context element> pair as a single lexical item. Whenever this sequence is detected in a stream of text, the single lexical item for the pair would be used in its place. The remaining categorization algorithms would work the same as before. With these new phrasal components as lexical items, we would begin to look at generalized lexical triples that cover larger blocks of text. This will be the critical step in determining whether occurrence-based processing is capable of actually detecting phrases. However, the exploration of this area awaits the completion of the study of sub-categories.
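The substitution step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implemented part of this research: it assumes the recognized pairs are adjacent, ordered token pairs, that the merged item is written by joining the two tokens with an underscore, and that triples are formed from three consecutive items. The names merge_pairs and lexical_triples, and the example pair, are hypothetical.

```python
def merge_pairs(tokens, pairs):
    """Replace each adjacent (first, second) pair found in `pairs`
    with a single merged lexical item.

    Assumption: pairs are ordered and adjacent in the token stream;
    the underscore-joined form is only an illustrative encoding.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # consume both members of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def lexical_triples(items):
    """Form triples from every three consecutive lexical items."""
    return list(zip(items, items[1:], items[2:]))


# Invented example; the pair below is not a result from the study.
stream = "we looked at the new york office".split()
known_pairs = {("new", "york")}

items = merge_pairs(stream, known_pairs)
print(items)
print(lexical_triples(items))
```

Because the merged pair occupies a single position, each triple built from the merged stream can span up to four of the original words, which is how the generalized triples would come to cover larger blocks of text.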