3.2 GROUPING WORDS
Once the contextual features have been identified for each word, we proceed with forming word groups. To accomplish this, we use single-linkage clustering. That is, we combine the two closest word groups provided their similarity measure exceeds a specified threshold.
The similarity measure that we use is known as the Tanimoto coefficient. For any two words, let the first word's feature set be A and the second word's feature set be C. Then their Tanimoto coefficient is

$$ T(A, C) = \frac{|A \cap C|}{|A \cup C|} $$
Note that this number ranges from 0 to 1, with a measure of 1 indicating identical feature sets and a measure of 0 indicating disjoint feature sets.
Since similarity and distance are complementary measures, we define the Tanimoto distance between two words as

$$ D(A, C) = 1 - T(A, C) $$
This distance metric is sometimes referred to as the Jaccard distance.
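To make these definitions concrete, a minimal Python sketch of the two measures follows; the feature strings in the example are hypothetical placeholders, not our actual feature encoding.

```python
def tanimoto_coefficient(a: set, c: set) -> float:
    """Similarity of two feature sets: |A ∩ C| / |A ∪ C|."""
    if not a and not c:
        return 1.0  # two empty feature sets are trivially identical
    return len(a & c) / len(a | c)


def tanimoto_distance(a: set, c: set) -> float:
    """Jaccard distance: one minus the Tanimoto coefficient."""
    return 1.0 - tanimoto_coefficient(a, c)


# Two words sharing one of three distinct features:
w1 = {"precedes:eat", "follows:dog"}
w2 = {"precedes:eat", "follows:cat"}
print(tanimoto_distance(w1, w2))  # 1 - 1/3 ≈ 0.667
```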
Algorithm 3.3 computes our distance metric. Since the number of features associated with each word was pre-computed by Algorithm 3.2, we need only determine the number of features shared by each word-pair to compute their distance. We store two directed arcs for each word-pair; this is possible because all word-distances are symmetric. The result is a graph with a node for each word, where the nodes are connected by directed arcs labeled with the Tanimoto distance between their associated words. If two words share no features (that is, their distance is 1.0), their nodes are not connected. Thus, the resulting graph may not actually be a connected graph, but may instead be a collection of connected subgraphs. Finally, the arcs stored in table LEXDISTANCE are indexed both by increasing distance and by destination node. These two indexes help make the clustering task computationally efficient.
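As an illustration of this construction (reusing the set representation from the sketch above; the function name is ours, not the actual LEXDISTANCE implementation), the graph-building step might look like this:

```python
from itertools import combinations


def build_distance_graph(features: dict) -> list:
    """Build (distance, source, destination) arcs for every word-pair
    that shares at least one feature, one arc per direction."""
    sizes = {w: len(f) for w, f in features.items()}  # pre-computed feature counts
    arcs = []
    for w1, w2 in combinations(features, 2):
        shared = len(features[w1] & features[w2])
        if shared == 0:
            continue  # distance 1.0: leave these nodes unconnected
        # |A ∪ C| = |A| + |C| - |A ∩ C|, so the shared count is the only new work
        dist = 1.0 - shared / (sizes[w1] + sizes[w2] - shared)
        arcs.append((dist, w1, w2))  # one directed arc ...
        arcs.append((dist, w2, w1))  # ... and its symmetric twin
    arcs.sort()  # primary index: increasing distance
    return arcs
```

A secondary index keyed by destination node can then be built in one pass over the sorted arc list.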
Now our grouping algorithm can be restated as: combine the two closest word groups provided that their distance does not exceed a specified threshold. This is the standard definition of single-linkage clustering, and we will use it henceforth. A key advantage of single-linkage clustering is that it is computationally very cheap: it can be implemented with a minimum-spanning-tree algorithm.¹
Algorithm 3.4 describes our single-linkage clustering method. Starting with the graph created by Algorithm 3.3, we remove all arcs labeled with a Tanimoto distance greater than the specified threshold. We then repeatedly select the minimum-distance active arc in the graph. After an arc has been used to combine two nodes, we "mark" it as part of the spanning tree and deactivate it. If the minimum-distance arc would join two nodes that have already been combined (that is, they already belong to the same cluster), the arc is simply deactivated. When there are no more active arcs, the "marked" arcs form a minimum spanning tree, and the words within each cluster have been assigned a common label. Finally, since we "mark" the spanning-tree arcs with sequential "clustering" numbers, an ordered list of the spanning-tree arcs produces a hierarchical cluster chart. [Note that this is Kruskal's Algorithm for determining a minimum spanning tree.²]
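A minimal sketch of this threshold-then-Kruskal procedure is shown below; it uses a union-find structure in place of explicit arc "marking," which yields the same cluster assignments.

```python
class DisjointSet:
    """Union-find over word nodes; each root identifies one cluster."""

    def __init__(self, nodes):
        self.parent = {n: n for n in nodes}

    def find(self, n):
        while self.parent[n] != n:
            self.parent[n] = self.parent[self.parent[n]]  # path halving
            n = self.parent[n]
        return n

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # same cluster already: the arc is deactivated
        self.parent[rb] = ra
        return True


def single_linkage(nodes, arcs, threshold):
    """Kruskal's Algorithm over arcs sorted by increasing distance;
    returns the "marked" spanning-forest arcs in clustering order."""
    ds = DisjointSet(nodes)
    marked = []
    for dist, a, b in sorted(arcs):
        if dist > threshold:
            break  # every remaining arc exceeds the threshold
        if ds.union(a, b):
            marked.append((dist, a, b))  # sequential order = hierarchical chart
    clusters = {}
    for n in nodes:
        clusters.setdefault(ds.find(n), []).append(n)
    return marked, list(clusters.values())
```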
It should be noted that the graph will normally contain a number of disjoint subgraphs. Thus, the minimum-spanning-tree algorithm must actually detect a minimum spanning forest. Kruskal's Algorithm does this automatically.
We have also been calculating the minimum spanning tree using Prim's Algorithm (as in Chapter 3). Prim's Algorithm produces a list of arcs in the order they are added to each individual tree, whereas Kruskal's Algorithm lists the arcs in order of increasing distance; it is therefore difficult to extract the actual spanning trees from the Kruskal output. Although the two algorithms are not guaranteed to produce the same spanning tree, they have done so on our data. We mention this because our implementation of Prim's Algorithm requires that we explicitly store two directed arcs between each pair of connected nodes, whereas Kruskal's Algorithm requires only a single undirected arc between each pair of connected nodes.
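For comparison, a sketch of the Prim variant (restarted once per component so that it yields a spanning forest) shows why the adjacency representation needs a directed arc in each direction: each node must be able to reach its neighbors from its own adjacency list.

```python
import heapq


def prim_forest(nodes, adjacency):
    """Prim's Algorithm, restarted per component. 'adjacency' must hold a
    directed arc in each direction: adjacency[a] lists (dist, b) and
    adjacency[b] lists (dist, a)."""
    visited, forest = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        tree = []
        heap = [(d, start, nbr) for d, nbr in adjacency.get(start, [])]
        heapq.heapify(heap)
        while heap:
            dist, src, dst = heapq.heappop(heap)
            if dst in visited:
                continue
            visited.add(dst)
            tree.append((dist, src, dst))
            for d, nbr in adjacency.get(dst, []):
                if nbr not in visited:
                    heapq.heappush(heap, (d, dst, nbr))
        forest.append(tree)  # arcs arrive grouped by individual tree
    return forest
```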
In summary, we group words using four algorithms. The actual steps in our procedure are:
1. Identify the contextual features for each word.
2. Pre-compute the number of features associated with each word (Algorithm 3.2).
3. Compute the Tanimoto distance for each word-pair and build the distance graph (Algorithm 3.3).
4. Cluster the words by single linkage using Kruskal's Algorithm (Algorithm 3.4).
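Under the assumptions of the sketches above, the procedure chains together roughly as follows; the feature sets here are hypothetical stand-ins for the output of the feature-identification step.

```python
features = {
    "dog": {"precedes:eat", "precedes:chase"},
    "cat": {"precedes:eat", "precedes:chase"},
    "eat": {"follows:dog", "follows:cat"},
}

arcs = build_distance_graph(features)
marked, clusters = single_linkage(list(features), arcs, threshold=0.99)
print(clusters)  # e.g. [['dog', 'cat'], ['eat']]
```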
Figure 3.4 shows the results that we obtained using Elman's "toy" language and a distance threshold of 0.99. Notice that we obtained three disjoint sets: (1) the nouns, (2) "eat", and (3) all other verbs. It would have been nice if the two verb sets could be combined; in fact, they can be, using a process that we call "abstraction."