4.2 THE LOB CORPUS
We decided that it was necessary to validate our algorithms by applying them to a tagged corpus. This would allow us to compare the word groupings that our algorithms generated to actual word categories. We chose the Lancaster-Oslo/Bergen (LOB) Corpus because it was hand-tagged and was large enough for the experiments we wished to perform.
The LOB Corpus is a corpus of approximately one million words of British English. Most of it is edited prose (primarily newspaper articles from the early 1960s). It is available in approximately 50 files of about 20,000 words each. Our analyses are performed on corpora of about 40,000 words, so an experiment can be run by sampling two files from the corpus at large.
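The sampling step can be sketched as follows. This is a minimal illustration only: the file names are hypothetical placeholders, and the function name sample_files and its parameters are our own assumptions, not part of the corpus distribution.

```python
import random


def sample_files(file_names, target_words=40_000, words_per_file=20_000, seed=None):
    """Pick enough files, without replacement, to reach the target corpus size."""
    n_files = target_words // words_per_file  # 40,000 / 20,000 = 2 files
    return random.Random(seed).sample(file_names, n_files)


# Hypothetical names standing in for the roughly 50 corpus files.
files = [f"lob_{i:02d}.txt" for i in range(1, 51)]
chosen = sample_files(files, seed=0)
```

Fixing the seed makes a given sample reproducible, which is convenient when an experiment has to be rerun on the same pair of files.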
The actual LOB text is a series of word/tag pairs, with each word joined to its tag by an underscore. An example is:
As mentioned in section 4.1.1, we have decided to attach punctuation to preceding words, as is done in standard written text. Consequently, all of our analyses of LOB text require a pre-processing stage in which each punctuation mark is attached to the preceding word. In most cases, punctuation is identified by the pattern
symbol_symbol
so this is a relatively simple step. It is performed on the words as they are loaded into our lexical databases; we merely needed to provide a one-word lookahead for punctuation detection. (Details are not provided here.)
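Although the details of our loader are not given, the idea of the pre-processing stage can be sketched under a few assumptions: tokens are separated by whitespace, each token is split at its last underscore, and a token counts as punctuation when its two halves are the same non-alphanumeric symbol. The function names and the sample tokens below are illustrative and are not quoted from the corpus.

```python
def split_pair(token):
    """Split 'word_TAG' at the last underscore into (word, tag)."""
    word, _, tag = token.rpartition("_")
    return word, tag


def is_punctuation(token):
    """Assumed test for the symbol_symbol pattern: both halves are the same symbol."""
    word, tag = split_pair(token)
    return word != "" and word == tag and not word.isalnum()


def attach_punctuation(tokens):
    """Return (word, tag) pairs with punctuation attached to the preceding word."""
    result = []
    i = 0
    while i < len(tokens):
        word, tag = split_pair(tokens[i])
        # One-word lookahead: absorb the next token if it is punctuation.
        if (not is_punctuation(tokens[i])
                and i + 1 < len(tokens)
                and is_punctuation(tokens[i + 1])):
            word += split_pair(tokens[i + 1])[0]
            i += 1  # skip the punctuation token that was just absorbed
        result.append((word, tag))
        i += 1
    return result


# Illustrative tokens only.
sample = "the_AT dog_NN barked_VBD ._.".split()
pairs = attach_punctuation(sample)
```

In this sketch the word keeps its own tag and the tag of the punctuation mark is discarded. A punctuation token with no preceding word, or a second consecutive punctuation token, is left as a separate entry, since the lookahead covers only one token.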
A problem with using the LOB Corpus is its large number of word-category tags. In the corpus discussed in the next section (files 1 and 2 of the LOB Corpus), there are 121 distinct tags. The key question becomes: what is the correct way to combine tags into tag-groups? Two answers suggest themselves immediately: (1) group the tags by traditional categories, or (2) group the tags by the way they are used.
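A count of distinct tags such as the one quoted above can be obtained with a simple tally. The sketch below assumes whitespace-separated word_TAG tokens; the function name tag_inventory and the sample tokens are illustrative, not taken from the corpus.

```python
from collections import Counter


def tag_inventory(tokens):
    """Count how often each tag occurs in a list of 'word_TAG' tokens."""
    return Counter(token.rpartition("_")[2] for token in tokens)


# Illustrative tokens only.
sample = "the_AT dog_NN saw_VBD the_AT cats_NNS ._.".split()
inventory = tag_inventory(sample)
distinct_tags = len(inventory)
```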
Solution (1) has a number of problems. What do you do with tags like "NR" (a noun used as an adverb) and "RN" (nominal adverb)? How do you group gerunds (verbal nouns capable of taking objects and adverbs)? And so on.
We prefer solution (2). It involves analyzing the contexts associated with each tag and grouping the tags accordingly. This has the advantage of reflecting the actual usage of the words assigned those tags, so we do not rely on a subjective evaluation of how the tags might be used.
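One way to make "the contexts associated with each tag" concrete is sketched below. It is an assumption for illustration, not a description of our procedure: each tag is given a profile of the tags that immediately precede and follow it, and two profiles are compared by cosine similarity. The function names and the short tag sequence are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_profiles(tags):
    """Map each tag to counts of the tags seen immediately to its left and right."""
    profiles = defaultdict(Counter)
    for i, tag in enumerate(tags):
        if i > 0:
            profiles[tag][("L", tags[i - 1])] += 1
        if i + 1 < len(tags):
            profiles[tag][("R", tags[i + 1])] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two context-count profiles."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Illustrative tag sequence only.
tags = "AT NN VBD AT NNS VBD AT NN".split()
profiles = context_profiles(tags)
sim_nouns = cosine(profiles["NN"], profiles["NNS"])  # similar contexts
sim_mixed = cosine(profiles["NN"], profiles["AT"])   # no shared contexts
```

In this toy sequence the two noun tags occur in nearly the same contexts and score close to 1, while the noun and article tags share no context and score 0. Tags whose similarity exceeds some chosen threshold would then be candidates for the same tag-group.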