Stability-based methods for biomolecular cluster assessment

Validating the clusters discovered by clustering algorithms is a central problem in bioinformatics: algorithms can find clusters in biomolecular data, but we need to assess whether the discovered clusters are statistically significant and biologically meaningful.

We developed stability-based algorithms and specific statistical tests for:

  1. Assessing the overall reliability of a clustering and selecting the model order (the number of clusters) in an unsupervised setting (Bertoni and Valentini, 2006, 2007, 2008; Valentini, 2007)
  2. Assessing the reliability of individual clusters within a clustering (Bertoni and Valentini, 2005, 2006; Valentini, 2006)
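
The general idea behind stability-based assessment can be illustrated with a minimal sketch. It is not the algorithm implemented in Mosclust or Clusterv: it assumes random subsampling as the perturbation, a plain k-means with farthest-first initialization as the clustering algorithm, and the Jaccard coefficient over co-clustered pairs as the similarity measure. All function names (`kmeans`, `pair_agreement`, `stability`) and the toy data are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Lloyd's k-means with farthest-first initialization; returns labels."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pair_agreement(labels_a, labels_b):
    """Jaccard coefficient over pairs of points placed in the same cluster."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability(points, k, n_trials=20, frac=0.8, seed=0):
    """Mean agreement between clusterings of pairs of random subsamples."""
    rng = random.Random(seed)
    n, m = len(points), int(frac * len(points))
    scores = []
    for _ in range(n_trials):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k, rng)))
        lab_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k, rng)))
        # Compare the two clusterings only on the points they share.
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(pair_agreement([lab_a[i] for i in shared],
                                     [lab_b[i] for i in shared]))
    return sum(scores) / len(scores)


# Toy data (an assumption for illustration): two well-separated 2-D groups.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
data += [(rng.gauss(20, 1), rng.gauss(20, 1)) for _ in range(20)]

scores = {k: stability(data, k) for k in range(2, 5)}
best_k = max(scores, key=scores.get)  # most stable number of clusters
print(scores, best_k)
```

In this sketch the number of clusters with the highest mean agreement is taken as the most reliable model order. The published methods go further, using different perturbations (such as random projections) and statistical tests on the distribution of the similarity values rather than a simple comparison of means.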

The new methods have been applied to the analysis and validation of subclasses of pathologies characterized at the bio-molecular level, and to the discovery of multiple structures (e.g. hierarchical structures) in complex bio-molecular data generated through high-throughput biotechnologies (Bertoni and Valentini, 2006, 2007; Valentini and Ruffino, 2006).

We also developed stability-based methods to assess the reliability of hierarchical clusterings with a large number of clusters and examples, aimed at the unsupervised search for and validation of functional classes of genes (Avogadri et al., 2008, 2009).

Publications

A. Bertoni, G. Valentini, Discovering multi-level structures in bio-molecular data through the Bernstein inequality, BMC Bioinformatics 9(Suppl 2):S4, 2008.

A. Bertoni, G. Valentini, Model order selection for biomolecular data clustering, BMC Bioinformatics, vol. 8, Suppl. 3, 2007.

A. Bertoni, G. Valentini, Discovering Significant Structures in Clustered Bio-molecular Data Through the Bernstein Inequality, in: Knowledge-Based Intelligent Information and Engineering Systems, 11th International Conference, KES 2007, Lecture Notes in Computer Science, vol. 4694, pp. 886-891, 2007.

A. Bertoni, G. Valentini, Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses, Artificial Intelligence in Medicine 37(2):85-109, 2006.

A. Bertoni, G. Valentini, Model order selection for clustered bio-molecular data, in: Probabilistic Modeling and Machine Learning in Structural and Systems Biology, J. Rousu, S. Kaski and E. Ukkonen (Eds.), Tuusula, Finland, 17-18 June, pp. 85-90, Helsinki University Printing House, 2006.

A. Bertoni, G. Valentini, Random projections for assessing gene expression cluster stability, in: IJCNN '05, Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 1, pp. 149-154, 2005.

G. Valentini, Mosclust: a software library for discovering significant structures in bio-molecular data, Bioinformatics 23(3):387-389, 2007.

Mosclust web-site

G. Valentini, Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data, Bioinformatics 22(3):369-370, 2006.

Clusterv web-site

G. Valentini, F. Ruffino, Characterization of Lung tumor subtypes through gene expression cluster validity assessment, RAIRO - Theoretical Informatics and Applications, 40:163-176, 2006.

R. Avogadri, M. Brioschi, F. Ferrazzi, M. Re, A. Beghini, G. Valentini, A stability-based algorithm to validate hierarchical clusters of genes, International Journal of Knowledge Engineering and Soft Data Paradigms, 1(4), pp. 318-330, 2009.

R. Avogadri, M. Brioschi, F. Ruffino, F. Ferrazzi, A. Beghini, G. Valentini, An algorithm to assess the reliability of hierarchical clusters in gene expression data, in: 12th International Conference, KES 2008, Zagreb, Croatia, September 3-5, 2008, Proceedings, Part III, Lecture Notes in Computer Science, vol. 5179, pp. 764-770, Springer, 2008.