Detection of pathogenic variants in the non-coding human genome

The identification of genetic variants associated with human diseases is one of the core challenges in precision medicine. It requires the design and application of a new generation of machine learning-based prediction methods able to prioritize potentially “deleterious” variants (i.e. causative or otherwise linked with disease risk) among the very large number of neutral variants that make up the natural genetic variation present in individuals.

Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We developed hyperSMURF (hyper-ensemble of SMOTE under-sampled random forests), a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach to deal with highly imbalanced genomic data (Schubach et al. 2017).
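The resampling idea behind the method's name can be illustrated with a minimal sketch. It is not the hyperSMURF implementation: it assumes that the majority (neutral) class is split into partitions, that in each partition the minority (deleterious) class is over-sampled by SMOTE-style interpolation while the majority class is under-sampled, and that the scores of the learners trained on each balanced set are averaged. The function names, the parameters (n_partitions, oversample_factor, majority_ratio) and their defaults are hypothetical, and the random forest training itself is left out.

```python
import random


def smote_oversample(minority, n_new, k=3, rng=random):
    """Create n_new synthetic minority points (SMOTE-style interpolation).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Assumes the minority set contains at least two points.
    """
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        # k nearest neighbours of x by squared Euclidean distance
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def balanced_training_sets(minority, majority, n_partitions=4,
                           oversample_factor=2, majority_ratio=1, seed=0):
    """Build one balanced (positives, negatives) pair per majority partition.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Split the majority class into disjoint partitions
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    sets = []
    for part in partitions:
        # Over-sample the minority class with synthetic examples
        n_new = oversample_factor * len(minority)
        positives = minority + smote_oversample(minority, n_new, rng=rng)
        # Under-sample this majority partition to a fixed ratio
        n_neg = min(len(part), majority_ratio * len(positives))
        negatives = rng.sample(part, n_neg)
        sets.append((positives, negatives))
    return sets


def hyper_ensemble_score(models, x):
    """Average the scores of the per-partition models for one example.

    Each model is any callable returning a score; in the setting described
    above it would be a random forest trained on one balanced set.
    """
    return sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    # Toy data: 5 "deleterious" points versus 200 "neutral" points
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    majority = [[float(i), float(-i)] for i in range(200)]
    for positives, negatives in balanced_training_sets(minority, majority):
        print(len(positives), len(negatives))
```

Each (positives, negatives) pair would then be passed to a base learner, and a new variant would be scored with hyper_ensemble_score over the resulting models. Because the partitions are disjoint, the learners can be trained independently of one another.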

This machine learning approach has been successfully applied as part of Genomiser, a software tool that uses both genotypic and phenotypic information to discover variants associated with specific Mendelian diseases in both coding and non-coding regulatory regions (Smedley et al. 2016).

Fine-tuning the learning parameters of hyperSMURF may lead to significantly better results (Petrini et al. 2017), and we are developing a High Performance Computing parallel version of this hyper-ensemble method in the context of the LISA project HyperGeV - Detection of Deleterious Genetic Variation through Hyper-ensemble Methods.


M. Schubach, M. Re, P.N. Robinson and G. Valentini. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports, Nature Publishing, 7:2959, 2017.

M. Schubach, M. Re, P.N. Robinson and G. Valentini. Variant relevance prediction in extremely imbalanced training sets. F1000Research, 6(ISCB Comm J):1392 (poster), 2017 (doi: 10.7490/f1000research.1114637.1). Presented at the 25th International Conference on Intelligent Systems for Molecular Biology (ISMB), Prague, 2017.

A. Petrini, M. Schubach, M. Re, M. Frasca, M. Mesiti, G. Grossi, T. Castrignanò, P.N. Robinson and G. Valentini. Parameters tuning boosts hyperSMURF predictions of rare deleterious non-coding genetic variants. PeerJ Preprints, 5:e3185v1, 2017. Presented at Methods, tools & platforms for Personalized Medicine in the Big Data Era (NETTAB 2017), Palermo, Italy.

D. Smedley, M. Schubach, J. Jacobsen, S. Kohler, T. Zemojtel, M. Spielmann, M. Jager, H. Hochheiser, N. Washington, J. McMurry, M. Haendel, C. Mungall, S. Lewis, T. Groza, G. Valentini and P.N. Robinson. A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. The American Journal of Human Genetics, 99:3, pp. 595-606, September 2016.