Author: Jesús Emeterio Navarro-Barrientos
Year: 2007
Company: Acrolinx GmbH
Goal: the main goal of this project is to find clusters (groups) of sentences that are most similar to each other within a document or a directory of documents.
The core of the program is a non-hierarchical clustering algorithm, k-means, in which each centroid is a sentence that can be regarded as the template of its cluster, providing the shortest within-cluster distance to the members. This method initially needs a starting number of clusters, k, and a similarity metric to compare the elements, in this case sentences.
In general, approaches based on hierarchical methods are impractical for large amounts of data and lead to inefficient results, whereas with non-hierarchical approaches the starting number of clusters can be guessed and then improved.
Scientific articles
H. Ye and S. Young, "A Clustering Approach to Semantic Decoding", Interspeech, Pittsburgh, USA, 2006.
Algorithm based on the non-hierarchical method k-means. The similarity between sentences is computed by means of two methods: "saliency" and "mutual information".
S.S.L. Piao and A. McEnery, "A Tool for Text Comparison", Proc. of Corpus Linguistics, Lancaster, 2003, pp. 637-646.
The algorithm finds similar texts by means of a hierarchical clustering algorithm for sentences. The algorithm uses "spelling variants" in order to normalize sentences and uses a "digram-based Dice Coefficient" to measure the similarity between sentences.
H. Zha, "Generic Sumarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering“, Proc. Of the 25th annual intercontinental ACM SIGIR Conf. on Research and Development in Information Retrieval, Tampere Finland, 2002, pp. 113-120.
Algorithm for "keyphrase” extraction und „generic text” sumarization. The method builds a "bipartite Graph“ for terms and sentences based on a „mutual reinforcement principle“. Afterwards from the bipartite graphs a "undirected Graph“ between sentences is obtained. From these a hierarchical clustering method is used to obtain clusters for terms and a non-hierarchical clustering method is used to obtain clustering of sentences
V. Hatzivassiloglou et al., "SIMFINDER: A Flexible Clustering Tool for Summarization", Proc. of the Workshop on Automatic Summarization, Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), Pittsburgh, Pennsylvania, June 2001, pp. 41-49.
Tool to find similar short texts (paragraphs or sentences). Texts are normalized based on lexical and syntactic templates. The tool finds clusters of sentences by means of a non-hierarchical clustering algorithm, and summarization is done by means of an information retrieval system.
Software
SimFinder (Hatzivassiloglou et al., '01), Columbia University. For educational, research and in-house users (www1.cs.columbia.edu/nlp/tools.cgi).
RIPPER (Cohen '96), Carnegie Mellon University. For educational purposes (www.cs.cmu.edu/~wcohen).
METER (Measuring Text Reuse) Project, Sheffield University. TESAS, for detecting and measuring journalistic text reuse. For educational and research users (www.dcs.shef.ac.uk/nlp/meter).
A similar program that finds clusters of sentences is the "Y-Clustering" algorithm; see Hui Ye and Steve Young's paper "A Clustering Approach to Semantic Decoding", presented at Interspeech 2006, Pittsburgh, USA. However, in our approach, instead of the word saliency and DWT alignment between sentences specified in Ye and Young's paper, sentences are exchanged between clusters based on the clusters' average similarity and the sentences' similarity. For this, a similarity threshold, also called the penalty, is needed to determine when two sentences are similar.
In more detail, the algorithm works as follows:
Initialization: the user specifies the desired initial number of clusters k, a range of values in which to search for the best k, a similarity metric to use when comparing sentences, and a similarity threshold, also called the penalty.
Search for the best k: the algorithm iterates over different numbers of clusters, looking for the value of k that leads to the smallest within-cluster distance between the sentences in each cluster. The k-means algorithm needs an initial number of clusters k to start with, but the search for a good initial k can be tedious; therefore Sente-Clus searches for the k that minimizes the average within-cluster distance by means of the following two algorithms (a sketch of the combined search follows the list):
Exponential Search: to perform a broad search over possible good values of k, clusterings are computed for an exponentially growing number of clusters until a bracketed minimum is found or the maximal number of exponential steps is reached.
Golden Section Search: to fine-tune the parameter k once a bracketed minimum is found. It uses the golden section method (see the one-dimensional golden section search associated with Brent's method) to determine how much to increase or decrease k in order to find the minimum.
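To make the two-phase search concrete, here is a minimal Python sketch. It assumes a hypothetical helper avg_cluster_distance(sentences, k) that runs the clustering once for a given k and returns the average within-cluster distance; this helper and the exact step budget are assumptions for illustration, not the original program's API.

    import math

    # Hypothetical helper, not part of the original program: runs the
    # clustering for a given k and returns the average within-cluster
    # distance (lower is better).
    def avg_cluster_distance(sentences, k):
        raise NotImplementedError

    def find_best_k(sentences, k_min, k_max, max_exp_steps=10):
        cache = {}  # memoize costs so a stochastic clustering is not re-run per k
        def cost(k):
            if k not in cache:
                cache[k] = avg_cluster_distance(sentences, k)
            return cache[k]

        # Exponential search: double k until the cost curve turns upward
        # (a bracketed minimum) or the step budget / k_max is exhausted.
        ks, k = [], k_min
        for _ in range(max_exp_steps):
            ks.append(min(k, k_max))
            if len(ks) >= 3 and cost(ks[-2]) < cost(ks[-3]) and cost(ks[-2]) < cost(ks[-1]):
                break
            if ks[-1] == k_max:
                break
            k *= 2
        if len(ks) < 3:
            return min(ks, key=cost)

        # Golden section search: shrink the bracket [a, b] around the minimum.
        a, b = ks[-3], ks[-1]
        inv_phi = (math.sqrt(5) - 1) / 2
        while b - a > 2:
            c = round(b - inv_phi * (b - a))
            d = round(a + inv_phi * (b - a))
            if c >= d:
                break  # integer bracket too small to split further
            if cost(c) < cost(d):
                b = d
            else:
                a = c
        return min(range(a, b + 1), key=cost)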
For each candidate value of k, the following clustering algorithm is used:
Initialization: k sentences are selected randomly from the pool of data and assigned to be the centroids of the clusters.
Clustering: the remaining sentences are randomly drawn from the pool, their distance to all centroids is calculated, and each sentence is assigned to the cluster with the smallest centroid-sentence distance, if any.
Centroid Recalculation: for every cluster, the sentence with the smallest within-cluster distance is selected as the new centroid, which may swap the roles of a member sentence and the current centroid.
Exchange: in every iteration of the algorithm, sentences and centroids from different clusters are compared. Aggregation, disaggregation, or exchange of sentences occurs whenever the quality of the affected clusters is improved by performing one of these operations.
The two previous steps are repeated until convergence or until a maximal number of iterations is reached.
In our approach, the quality of a cluster is the normalized sum of the distances between its members and its centroid. Moreover, if there are free clusters (clusters with no centroid and no sentences), the clustering algorithm places sentences whose distance to their centroid is larger than the penalty into a free cluster, forming a singleton; this eventually gives each sentence the opportunity to become a centroid.
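As an illustration, here is a minimal Python sketch of the inner loop (initialization, assignment, centroid recalculation). The exchange step and the penalty-based singleton formation are left out for brevity, and dist(s, t) is an assumed function returning a normalized distance between two sentences:

    import random

    def cluster_sentences(sentences, k, dist, max_iter=100):
        """Sketch of the per-k clustering loop (a k-medoids-style scheme)."""
        centroids = random.sample(sentences, k)        # initialization
        for _ in range(max_iter):
            clusters = [[c] for c in centroids]
            pool = [s for s in sentences if s not in centroids]
            random.shuffle(pool)                       # sentences drawn randomly
            for s in pool:
                # Assign each sentence to the cluster with the nearest centroid.
                best = min(range(k), key=lambda i: dist(s, centroids[i]))
                clusters[best].append(s)
            # Centroid recalculation: the member with the smallest summed
            # distance to its cluster mates becomes the new centroid.
            new_centroids = [
                min(members, key=lambda m: sum(dist(m, x) for x in members))
                for members in clusters
            ]
            if new_centroids == centroids:             # converged
                break
            centroids = new_centroids
        return clusters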
Once two sentences have been normalized, the user has two options for obtaining the distance between them:
to compare the sentences by character matching and
to compare the sentences by word matching.
For character matching, the similarity metric is applied to the pair of sentences directly, i.e., the distance between sentences S and T is simply calculated with the chosen similarity metric over their character strings. For word matching, the similarity metric is applied to each pair of words in the two sentences, and the similarity between S and T is calculated by means of the formula:
Sim(S,T)= #matches(S,T) / (|S|+|T|),
where #matches(S,T) is the number of words in S that match a word in T plus the number of words in T that match a word in S, and |S| and |T| are the numbers of words in S and T, respectively. One of the most important ingredients in the formation of clusters is the similarity metric used to compare a pair of strings. The following similarity metrics are supported in this program: Edit Distance (LDN), Jaro (JDN), Block (BLOD), Cosine (COS), Dice Coeff. (DIS), Euclidean (EUD), Jaccard (JAS), Matching Coeff. (MAC) and Q-gram (QGD).
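For word matching, a small Python sketch of the formula above follows. The word_sim function and the match threshold are assumptions for illustration (two words count as matching when their similarity under the chosen metric exceeds the threshold); the program's own matching criterion may differ:

    # Sketch of Sim(S, T) = #matches(S, T) / (|S| + |T|) for word matching.
    # `word_sim(w, v)` is an assumed word-level similarity in [0, 1], e.g. a
    # normalized edit distance turned into a similarity.
    def sentence_similarity(s, t, word_sim, threshold=0.8):
        s_words, t_words = s.split(), t.split()
        matches_s = sum(1 for w in s_words
                        if any(word_sim(w, v) >= threshold for v in t_words))
        matches_t = sum(1 for v in t_words
                        if any(word_sim(v, w) >= threshold for w in s_words))
        return (matches_s + matches_t) / (len(s_words) + len(t_words))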
This program reads sentences from a file or from the files in a directory, and finds the clusters that lead to the largest similarity between sentences. If a directory is selected, the program creates a new directory to save the results for each file in the input directory, as well as a file "all_sentences" with the global results. A variety of options is available to filter and normalize sentences, as well as to configure the clustering algorithm.
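A hypothetical driver illustrating the directory mode just described, reusing the cluster_sentences sketch from above; the one-sentence-per-line input format and the output file naming are assumptions, not the program's actual formats:

    import os

    def process_directory(in_dir, out_dir, k, dist):
        os.makedirs(out_dir, exist_ok=True)
        all_sentences = []
        for name in sorted(os.listdir(in_dir)):
            path = os.path.join(in_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as f:
                sentences = [line.strip() for line in f if line.strip()]
            all_sentences.extend(sentences)
            clusters = cluster_sentences(sentences, k, dist)
            # One result file per input file in the output directory.
            with open(os.path.join(out_dir, name + ".clusters"), "w",
                      encoding="utf-8") as out:
                for i, members in enumerate(clusters):
                    out.write("cluster %d: %s\n" % (i, " | ".join(members)))
        # Global results over all pooled sentences ("all_sentences").
        with open(os.path.join(out_dir, "all_sentences"), "w",
                  encoding="utf-8") as out:
            for i, members in enumerate(cluster_sentences(all_sentences, k, dist)):
                out.write("cluster %d: %s\n" % (i, " | ".join(members)))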
To test the performance of Sente-Clus, we perform the following steps to compare the resulting clusters against a desired grouping of sentences given beforehand by the user (a code sketch of the scoring follows the list):
We obtain from the input file the groups of sentences that are supposed to be in the same cluster.
For each distance percentage threshold and for all sentences in the file, we perform the following:
Obtain the clusters.
Compare the resulting clusters against the predefined grouping; for every pair of sentences:
If the sentences are in the same cluster and in the same group, then +1 is added to the score for the corresponding distance threshold.
If the sentences are in the same group but not in the same cluster, or vice versa, then -1 is added to the score.
Finally, if the sentences are neither in the same group nor in the same cluster, +1 is added to the score.
Normalize the score to the range (-1, 1) by dividing it by the total number of pairs of sentences.
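The scoring loop can be written compactly. In this sketch, clusters and groups are assumed to be dicts mapping each sentence to its cluster id and its gold group id, respectively:

    from itertools import combinations

    # Pairwise scoring as described above: +1 when the clustering and the
    # gold grouping agree on a pair (both together or both apart), -1 when
    # they disagree, normalized by the total number of pairs.
    def pairwise_score(sentences, clusters, groups):
        pairs = list(combinations(sentences, 2))
        score = sum(
            1 if (clusters[s] == clusters[t]) == (groups[s] == groups[t]) else -1
            for s, t in pairs
        )
        return score / len(pairs)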
The following table shows the score of Sente-Clus for different similarity metrics and different distance threshold values; the computing time that Sente-Clus needed for the clustering is also shown in the "runtime" rows (in ms):
From these results we can see that the similarity measure QGD reaches the highest score for distThres = 0.4, but it takes considerably more computing time. On the other hand, both BLOD and LDN achieve high scores with low computing times for distThres = 0.4.