2.4 A DIFFERENT CLUSTERING TECHNIQUE
In the preceding sections, we were seeking a statistical technique that would produce the higher-level clustering that Elman found. Specifically, we were looking for a technique that could separate the nouns from the verbs. Once that was accomplished, our intuition was that lower-level groupings could be achieved by incorporating context.
The main reason that the AGR group did not cluster with the other nouns in the preceding experiment was that it was too "far" away. Under the standard Dmean clustering criterion [DUDA73], it would always be clustered after the verb group. However, if AGR is much "closer" to all the other nouns than it is to the verbs, then maybe the lack of a clean split between the NOUNS and VERBS is an artifact of that clustering technique. It may be that Elman's neural net model is simply transforming context-sensitive representations into vectors that happen to provide good Dmean word clusters.
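For reference, Dmean measures the distance between two clusters as the mean of all pairwise distances between their members [DUDA73]; the hierarchical procedure then repeatedly merges the pair of clusters with the smallest Dmean. The following is a minimal sketch of that measure, assuming Euclidean distances between word vectors (the function and variable names are ours, for illustration only):

    import numpy as np

    def d_mean(cluster_a, cluster_b):
        """Mean pairwise Euclidean distance between two clusters of word vectors."""
        a = np.asarray(cluster_a, dtype=float)   # shape (n_a, dim)
        b = np.asarray(cluster_b, dtype=float)   # shape (n_b, dim)
        diffs = a[:, None, :] - b[None, :, :]    # all pairwise difference vectors
        return np.linalg.norm(diffs, axis=2).mean()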
To investigate the actual distances between the words of the lexicon, we returned to the next-word probability experiment. We created a fully-connected graph with a node for each word. Then, we labeled each arc with the distance between its source and destination nodes. Finally, we used Prim's algorithm [HU82, pp. 28-29] to find the minimum spanning tree of this graph, that is, the minimum-distance spanning tree for the word graph. This tree is a form of clustering, since each new tree arc brings the next "closest" word into the tree. Figure 2.6 shows the minimum-distance spanning tree for this experiment.
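A minimal sketch of this construction, assuming the pairwise word distances have already been computed into a matrix (the names below are ours, not Elman's or Hu's):

    def prim_mst(dist):
        """Prim's algorithm on a fully-connected graph given as an n x n distance
        matrix. Returns the minimum spanning tree as a list of (i, j, distance)
        arcs, in the order in which each word joins the tree."""
        n = len(dist)
        in_tree = {0}                      # grow the tree from an arbitrary start word
        arcs = []
        while len(in_tree) < n:
            best = None
            for i in in_tree:              # scan arcs from the tree to words outside it
                for j in range(n):
                    if j not in in_tree and (best is None or dist[i][j] < best[2]):
                        best = (i, j, dist[i][j])
            arcs.append(best)
            in_tree.add(best[1])
        return arcs

Each successive arc attaches whichever remaining word is currently closest to some word already in the tree, which is what makes the tree behave like a rough clustering.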
Notice that this tree shows that although the AGR group may be farther away from the other nouns than many of the verbs are (see Figure 2.5), it is in fact closer to the other nouns than to any other words in the lexicon. Further, disregarding the END-OF-SENTENCE group (FOOD, FRAG, and INTRANS), each verb has another verb as its closest neighbor. This looks promising, but the clustering technique that closely follows the minimum spanning tree ("nearest-neighbor") does not improve on the clustering diagrams we have examined thus far.
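For comparison, the nearest-neighbor technique referred to above is single-linkage (Dmin) clustering: at each step it merges the two clusters whose closest members are closest, so its merges correspond to the arcs of the minimum spanning tree. A minimal sketch over the same assumed distance matrix (names are illustrative):

    def nearest_neighbor_clustering(dist, n_clusters):
        """Agglomerative nearest-neighbor (single-linkage) clustering
        from an n x n distance matrix."""
        clusters = [[i] for i in range(len(dist))]
        while len(clusters) > n_clusters:
            # Find the pair of clusters whose closest members are closest (Dmin).
            pairs = [(a, b) for a in range(len(clusters))
                            for b in range(a + 1, len(clusters))]
            a, b = min(pairs, key=lambda ab: min(dist[i][j]
                                                 for i in clusters[ab[0]]
                                                 for j in clusters[ab[1]]))
            clusters[a].extend(clusters[b])   # merge the two closest clusters
            del clusters[b]
        return clusters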
At this point, we decided that Dmean clustering of Elman's outputs, using a Euclidean distance metric, would not provide a suitable word clustering. This suggests three alternatives: (1) try another clustering technique, (2) try a different distance metric, or (3) try a different representation. In this section we tried a different clustering technique, without much success. It should be noted that all of these experiments were duplicated using a "cosine" distance metric (both metrics are sketched below); that change did not significantly improve the results. In the next section we discuss a different internal word representation which does meet the goals established for this chapter.
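For clarity, the two distance metrics compared in these experiments can be written as follows (a minimal sketch; u and v stand for any two word vectors, and the names are ours):

    import math

    def euclidean_distance(u, v):
        """Straight-line distance between two word vectors."""
        return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

    def cosine_distance(u, v):
        """One minus the cosine of the angle between two word vectors;
        sensitive to direction only, not to vector length."""
        dot = sum(ui * vi for ui, vi in zip(u, v))
        norm_u = math.sqrt(sum(ui * ui for ui in u))
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        return 1.0 - dot / (norm_u * norm_v)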