Clustering Algorithms

Predefined Clustering

 

Users can run their own clustering algorithm independently and plug the results directly into the Medusa application for visualization. Nodes that belong to the same cluster are brought together around a circle, and each cluster is assigned a different color.
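
As a rough sketch of this workflow, one could compute cluster assignments with any external tool and export one node-cluster pair per line. The example below uses scikit-learn's KMeans purely for illustration; the node names, features, and output file are made up, and the exact file format that Medusa expects for predefined clusters should be taken from the Medusa documentation rather than from this snippet.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative node names and features; in practice these come from the user's own data.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
features = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# Any clustering method could be used here; KMeans is just an example.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Export one "node <tab> cluster" pair per line (a generic format, not
# necessarily the one Medusa reads -- check the Medusa documentation).
with open("clusters.tsv", "w") as fh:
    for node, label in zip(nodes, labels):
        fh.write(f"{node}\t{label}\n")
```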

 

 

Affinity Propagation

Affinity Propagation (Frey and Dueck, 2007) is a machine-learning-based clustering algorithm that can be applied to data from many different fields. The algorithm is fast and efficient: it takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and the corresponding clusters gradually emerge. The user does not have to define the number of clusters, since it is determined automatically. The algorithm is also applicable to biological problems; in the original paper the authors show how Affinity Propagation can efficiently identify segments of DNA that reflect the expression properties of genes.

Complexity: O(N^2 log N), where N is the number of nodes.
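
As an illustration of how such a clustering could be computed, the following is a minimal sketch using scikit-learn's AffinityPropagation on a precomputed similarity matrix; the matrix values and node count are purely illustrative.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical symmetric similarity matrix for 6 nodes;
# in practice this would hold e.g. pairwise sequence similarities.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.2, 0.1, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.2, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.8, 0.9],
    [0.2, 0.1, 0.2, 0.8, 1.0, 0.7],
    [0.1, 0.1, 0.2, 0.9, 0.7, 1.0],
])

# affinity='precomputed' tells scikit-learn that S already contains pairwise
# similarities; note that the number of clusters is not specified anywhere.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)

print("cluster labels:", labels)                    # e.g. [0 0 0 1 1 1]
print("exemplar indices:", ap.cluster_centers_indices_)
```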

 

 

Spectral Clustering

Spectral clustering is a clustering method in which, as in k-Means, the user needs to define the number of clusters. Spectral methods allow one to study global properties of a dataset by making only pairwise similarity measurements between data points. Spectral clustering partitions a weighted undirected graph G into a set of discrete clusters. Each node in the graph corresponds to a protein sequence and the weight on each edge corresponds to the similarity between the two protein sequences it connects. Ideally, we are looking for areas in the graph, the clusters, in which the nodes are connected by highly similar edges, while the connections between such areas are weak, consisting of edges with low similarity. The spectral algorithm was extensively tested and compared with other local methods on several subsets of the Structural Classification of Proteins (SCOP) database, a gold standard for protein structure classification. The authors consistently observed that the number of clusters obtained for a given set of proteins is close to the number of superfamilies in that set, that there are fewer singletons, and that the method correctly groups most remote homologs. Its biological applicability is discussed extensively in the relevant article.

Complexity: The complexity of this algorithm has not been determined, but it is clearly slower than the other algorithms described here.

 

PARAMETERS

The user needs to provide a parameter that specifies the number of clusters.
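
A minimal sketch of this approach, assuming scikit-learn's SpectralClustering with a precomputed similarity (adjacency) matrix; the matrix, node count, and number of clusters below are purely illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical weighted adjacency (similarity) matrix for a 6-node graph,
# e.g. pairwise similarities between protein sequences.
W = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 0.0, 0.7, 0.0, 0.1, 0.0],
    [0.8, 0.7, 0.0, 0.1, 0.0, 0.2],
    [0.1, 0.0, 0.1, 0.0, 0.8, 0.9],
    [0.0, 0.1, 0.0, 0.8, 0.0, 0.7],
    [0.1, 0.0, 0.2, 0.9, 0.7, 0.0],
])

# The number of clusters must be given by the user, as noted above.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(W)
print("cluster labels:", labels)   # e.g. [0 0 0 1 1 1]
```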

 

 

   

k-Means

k-Means (MacQueen, 1967) is an algorithm to cluster data in which the user must define the number of clusters (k) manually. It is not often used in biological applications, but it is a very well known and widely used algorithm that can be applied in many fields. k-Means accepts as input a full all-against-all distance matrix and has a running-time complexity of O(nlk), where n is the number of nodes, l is the number of iterations and k is the number of clusters.

The procedure follows a simple way to classify a given data set into a number of clusters (say k) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different initial locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the k centroids are re-calculated as the barycenters of the clusters resulting from the previous step. Once these k new centroids are obtained, a new binding is made between the same data set points and the nearest new centroid. A loop has thus been generated: as a result of this loop the k centroids change their location step by step until no more changes occur, that is, until the centroids no longer move. k-Means is a suitable algorithm when all of the relationships among all of the data points are known; a minimal sketch of the procedure is given at the end of this section.

Complexity: O(nlk), where n is the number of nodes, k is the number of user-defined clusters and l is the number of iterations until the algorithm converges.

PARAMETERS

The user needs to provide the parameter k, which specifies the number of clusters.
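
The following is a minimal NumPy sketch of the procedure described above (assign each point to the nearest centroid, recompute the centroids as barycenters, and loop until the centroids stop moving); the toy data and the random initialization are purely illustrative.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-Means: pick k initial centroids, assign each point to the
    nearest centroid, recompute the centroids as the barycenters of their
    clusters, and repeat until the centroids no longer move."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: each centroid becomes the barycenter of its cluster.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Toy 2-D data with two obvious groups.
pts = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(pts, k=2)
print(labels)   # e.g. [0 0 0 1 1 1]
```

In practice one would typically rely on an existing implementation such as scikit-learn's KMeans, which adds smarter initialization (k-means++) and multiple restarts on top of this basic loop.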