Accurate and Efficient Gene Function Prediction using a Multi-Bacterial Network

Tens of thousands of bacterial genomes are sequenced each year, yet fewer than 0.01% of their genes have experimentally validated functional annotations. Current standards can infer some cellular function for about 60% of sequences, but those annotations are far from complete. Not knowing the cellular processes in which genes are involved limits the usefulness of these genomes in biomedical and other types of research. Function prediction methods that integrate multiple types of molecular data, including gene expression, protein interactions, and genomic and sequence features, have worked well at the single-species level, but such data are unlikely to be available for newly sequenced genomes. Can we leverage the data available for well-studied bacteria to improve the quality of predictions for new species?

We propose to construct a multi-species network that integrates heterogeneous datasets from the STRING database with a sequence similarity network (SSN). We would then propagate functional labels from the genes of species with annotations (core) to the newly sequenced species with no annotations (target). However, the large size of such multi-species networks poses a challenge for the scalability of current state-of-the-art methods, which typically operate on a single species or a small group of genes at a time.

Motivated by this challenge, we developed a novel iterative label propagation algorithm called FastSinkSource. By using mathematically provable bounds on the rate of progress of FastSinkSource to develop a new convergence strategy, we decreased the running time by a factor of 100 or more without sacrificing prediction accuracy.
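
To make the idea of iterative label propagation concrete, here is a minimal sketch in Python. It is not the FastSinkSource implementation from the repository: the function name `propagate_labels`, the damping parameter `alpha`, the example node names, and the simple "largest score change below `tol`" stopping rule are all assumptions made for illustration. In particular, the bound-based convergence strategy described above is not reproduced here.

```python
def propagate_labels(edges, positives, negatives, alpha=0.9, tol=1e-6, max_iters=1000):
    """Illustrative SinkSource-style label propagation (hypothetical helper).

    edges: dict mapping (u, v) -> positive weight of an undirected edge.
    positives / negatives: labeled nodes, held fixed at scores 1 and 0.
    Returns (scores, number_of_iterations_run).
    """
    # Build a symmetric adjacency map from the edge list.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Labeled nodes never change; unlabeled nodes start at 0.
    fixed = {n: 1.0 for n in positives}
    fixed.update({n: 0.0 for n in negatives})
    scores = {n: fixed.get(n, 0.0) for n in adj}
    unlabeled = [n for n in adj if n not in fixed]

    iteration = 0
    for iteration in range(1, max_iters + 1):
        new_scores = dict(scores)
        max_change = 0.0
        for u in unlabeled:
            total = sum(adj[u].values())
            # Weighted average of the neighbors' current scores, damped by alpha.
            avg = sum(w * scores[v] for v, w in adj[u].items()) / total
            new_scores[u] = alpha * avg
            max_change = max(max_change, abs(new_scores[u] - scores[u]))
        scores = new_scores
        # Assumed stopping rule: stop once no score moved by more than tol.
        if max_change < tol:
            break
    return scores, iteration


# Toy network: an unannotated "target" gene linked to two annotated "core" genes.
toy_edges = {("core_pos", "target_x"): 1.0, ("target_x", "core_neg"): 1.0}
toy_scores, toy_iters = propagate_labels(toy_edges, {"core_pos"}, {"core_neg"})
```

In the toy network, `target_x` sits between one positive and one negative example, so its score settles at `alpha` times the average of its neighbors' scores. In a real multi-species network, the edge weights would come from the integrated STRING and sequence-similarity data, and unlabeled genes would then be ranked by their final scores.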

We then systematically compared and evaluated many approaches to constructing a multi-species bacterial network, and applied FastSinkSource along with other state-of-the-art methods to these networks. We found that by pre-computing scores for species with experimentally validated annotations and then transferring those scores to other species, FastSinkSource is able to make the most accurate functional predictions for 200 bacterial species, taking under 4 minutes for this computation.

Our results point to the feasibility and promise of multi-species, genome-wide gene function prediction, especially as more experimental data and annotations become available for a diverse variety of organisms.


The code for this project is available under an open source license at https://github.com/Murali-group/multi-species-GOA-prediction. For more information, please see our paper "Accurate and Efficient Gene Function Prediction using a Multi-Bacterial Network".