Hierarchical ensembles of learning machines for gene function prediction

Gene function prediction (GFP) is a complex multi-class and multi-label classification problem characterized by a hierarchical structure of the classes: a DAG for the Gene Ontology (GO) and a forest of trees for FunCat (Functional Categories).

GFP requires the development of methods and algorithms to analyze and process the GO and FunCat graphs and to pre-process complex, multi-view bio-molecular data, in order to support hierarchical classification methods based on the FunCat taxonomy and on the Gene Ontology (Valentini and Cesa-Bianchi, 2008; see Valentini, 2014 for a review of hierarchical ensemble methods for protein function prediction).

To explicitly consider the relationships between functional classes we designed hierarchical classification methods for GFP based on tree-structured ensembles of learning machines:

  1. Methods based on the "true path rule" (TPR) that governs both FunCat and the GO (Valentini, 2009, 2011; Re and Valentini, 2009, 2010).
  2. Cost-sensitive Bayesian methods for the "reconciliation" of the probabilistic output of the base learners (Cesa-Bianchi and Valentini, 2009, 2010).
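
The true path rule states that a gene annotated to a class is implicitly annotated to all of that class's ancestors, so a consistent set of predictions should never score a class higher than any of its parents. The following is a minimal sketch of that constraint only, not the TPR ensemble or the Bayesian reconciliation algorithms of the cited papers. The function name, the class labels and the scores are hypothetical, and the hierarchy is assumed to be acyclic (a forest as in FunCat, or a DAG as in the GO).

```python
from typing import Dict, List


def enforce_true_path_rule(
    scores: Dict[str, float], parents: Dict[str, List[str]]
) -> Dict[str, float]:
    """Cap each class score by the (already capped) scores of its parents.

    `scores` maps a class label to the score of its base learner.
    `parents` maps a class label to its direct parents (empty for roots).
    Illustrative assumption: the hierarchy contains no cycles.
    """
    consistent: Dict[str, float] = {}

    def resolve(node: str) -> float:
        # Reuse the result if this class was already processed.
        if node in consistent:
            return consistent[node]
        capped = scores[node]
        # A class cannot be more likely than any of its parents.
        for parent in parents.get(node, []):
            capped = min(capped, resolve(parent))
        consistent[node] = capped
        return capped

    for node in scores:
        resolve(node)
    return consistent


# Hypothetical hierarchy: "A" and "B" are roots, "AB" has two parents.
parents = {"A": [], "B": [], "A.1": ["A"], "AB": ["A", "B"], "AB.1": ["AB"]}
scores = {"A": 0.4, "B": 0.9, "A.1": 0.7, "AB": 0.8, "AB.1": 0.3}

consistent_scores = enforce_true_path_rule(scores, parents)
print(consistent_scores)
```

In this example the scores of "A.1" and "AB" are lowered to 0.4, the score of their parent "A", while "AB.1" keeps its score of 0.3 because it already respects the constraint.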

By combining hierarchical ensemble methods, cost-sensitive strategies and the integration of multiple sources of biomolecular data (Re and Valentini, 2010), we can significantly improve performance in gene function prediction problems at the genome-wide and ontology-wide level (Cesa-Bianchi, Re and Valentini, 2012).

Publications

G. Valentini, Hierarchical Ensemble Methods for Protein Function Prediction, ISRN Bioinformatics, 2014 (accepted for publication)

N. Cesa-Bianchi, M. Re, G. Valentini, Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference, Machine Learning, vol.88(1), pp. 209-241, 2012.

N. Cesa-Bianchi, G. Valentini, Hierarchical cost-sensitive algorithms for genome-wide gene function prediction, Journal of Machine Learning Research, W&C Proceedings, vol.8: Machine Learning in Systems Biology, pp.14-29, 2010.

N. Cesa-Bianchi, G. Valentini, Hierarchical cost-sensitive algorithms for genome-wide gene function prediction, Machine Learning in Systems Biology, Proceedings of the Third international workshop, Ljubljana, Slovenia, pp. 25-34, 2009.

M. Re, G. Valentini, Simple ensemble methods are competitive with state-of-the-art data integration methods for gene function prediction, Journal of Machine Learning Research, W&C Proceedings, vol.8: Machine Learning in Systems Biology, pp. 98-111, 2010.

M. Re, G. Valentini, Integration of heterogeneous data sources for gene function prediction using Decision Templates and ensembles of learning machines, Neurocomputing, 73(7-9), pp. 1533-1537, 2010.

M. Re, G. Valentini, An experimental comparison of Hierarchical Bayes and True Path Rule ensembles for protein function prediction, In: (N. El Gayar, J. Kittler and F. Roli, Eds.) Ninth International Workshop on Multiple Classifier Systems MCS 2010, Lecture Notes in Computer Science, vol. 5997, pp. 294-303, Springer, 2010.

G. Valentini, N. Cesa-Bianchi, HCGene: a software tool to support the hierarchical classification of genes, Bioinformatics, 24(5), pp. 729-731, 2008.

HCGene web-site

G. Valentini, True Path Rule hierarchical ensembles for genome-wide gene function prediction, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8 n.3, pp. 832-847, 2011. IEEE CS Digital library

G. Valentini, True Path Rule Hierarchical Ensembles, In: (J. Kittler, J. Benediktsson, F. Roli, Eds.) Eighth International Workshop on Multiple Classifier Systems MCS 2009, Lecture Notes in Computer Science, vol. 5519, pp. 232-241, Springer, 2009.

G. Valentini, M. Re, Weighted True Path Rule: a multilabel hierarchical algorithm for gene function prediction, MLD-ECML 2009, 1st International Workshop on learning from Multi-Label Data, Bled, Slovenia, pp. 133-146, 2009.