3.1 WORD CONTEXTS
Our method is quite simple. We have been processing corpora of text that range in size from 40,000 to 60,000 words. As we scan through this text, we record each word in the lexicon. At the same time, we store the word, its predecessor, and its successor in a context database. It is this context database that we use to categorize words.
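The scan described above can be sketched in a few lines. This is an illustrative reconstruction, not the implementation reported in the paper: the function name, the sentence-boundary padding tokens, and the simple whitespace tokenization are all assumptions.

```python
def build_context_database(text):
    """Scan a text, recording each word in a lexicon and each
    (word, predecessor, successor) triple in a context database."""
    words = text.split()  # assumption: whitespace tokenization
    lexicon = set()
    contexts = set()  # set of (word, predecessor, successor) triples
    # Pad the sequence so the first and last words also receive a context
    # (the boundary markers are an assumption of this sketch).
    padded = ["<BOS>"] + words + ["<EOS>"]
    for i in range(1, len(padded) - 1):
        word = padded[i]
        lexicon.add(word)
        contexts.add((word, padded[i - 1], padded[i + 1]))
    return lexicon, contexts
```

A single pass over the corpus thus yields both the lexicon and the context database used in the sections that follow.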
We have established some nomenclature. In particular, we have chosen to call the contexts associated with a given word its feature set. We consider these contexts features since we do not count the number of times each context occurs in the text, but simply note that it has occurred at all. This simple "yes"/"no" identification of contexts is similar to the binary encoding of "featural" representations. We have found that these feature-based similarity measures work well with our context variables.
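The "occurred at all" encoding can be illustrated by reducing a word's observed contexts to a set, discarding their frequencies; the contexts below are invented for illustration.

```python
# Contexts observed for some word, possibly with repeats (invented data).
observed = [("the", "sat"), ("a", "ran"), ("the", "sat"), ("the", "slept")]

# Binary ("featural") encoding: keep only presence/absence, not counts.
feature_set = set(observed)
# The repeated ("the", "sat") context contributes a single feature.
```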
However, it should be noted that our contexts do not represent any standard "linguistic" feature set. A large word group may have no single context ("feature") common to all of its members, and contexts are frequently shared across word groups. Algorithm 3.1 describes the way that we scan the text and store the necessary contextual data. We are using a relational database with an accompanying high-level programming language, M. The relational database has a full range of application-generation tools, and the accompanying M language has complete embedded-SQL functionality. Thus, our algorithms will not mention data structures explicitly, but will refer to appropriate TABLES. It should be understood that these tables are indexed such that appropriate access routes to the data are provided. For example, the word triples in table LEXTRIPLES are indexed such that we can easily identify all contexts associated with a given word AND all words associated with a given context.
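Since the M/SQL environment is not reproduced here, a sketch using Python's built-in sqlite3 module can illustrate the LEXTRIPLES table and its two access routes; the column names and sample rows are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# LEXTRIPLES holds one row per (word, predecessor, successor) triple.
# The primary key doubles as an index for word-to-contexts lookups.
cur.execute("""CREATE TABLE LEXTRIPLES (
    word TEXT, pred TEXT, succ TEXT,
    PRIMARY KEY (word, pred, succ))""")

# A second index supports the reverse route: all words for a given context.
cur.execute("CREATE INDEX ctx_idx ON LEXTRIPLES (pred, succ)")

cur.executemany("INSERT OR IGNORE INTO LEXTRIPLES VALUES (?, ?, ?)",
                [("cat", "the", "sat"),
                 ("dog", "the", "sat"),
                 ("cat", "a", "ran")])

# All contexts associated with a given word:
contexts_of_cat = cur.execute(
    "SELECT pred, succ FROM LEXTRIPLES WHERE word = ?", ("cat",)).fetchall()

# All words associated with a given context:
words_in_ctx = cur.execute(
    "SELECT word FROM LEXTRIPLES WHERE pred = ? AND succ = ?",
    ("the", "sat")).fetchall()
```

The `INSERT OR IGNORE` clause enforces the binary "occurred at all" treatment of contexts: re-inserting a triple already in the table has no effect.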
Algorithm 3.2 posts a count of the features associated with each lexical object (word) in the table LEX. This "bookkeeping" will assist us in performing distance calculations in the next section.
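This bookkeeping amounts to counting, for each word, how many distinct triples mention it. A minimal sketch over the same kind of triple set (invented data; the actual Algorithm 3.2 operates on the database tables):

```python
from collections import Counter

# (word, predecessor, successor) triples as stored by the scan (invented data).
triples = {("cat", "the", "sat"), ("cat", "a", "ran"), ("dog", "the", "sat")}

# Post a count of distinct features (contexts) for each lexical item,
# playing the role of the feature-count column in table LEX.
lex_counts = Counter(word for word, _, _ in triples)
```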