PhD position: candidate gene prioritization using knowledge graphs and AI

Context: To meet the global demand for food under climate change, a better understanding of agronomically important traits, such as yield, quality, and resistance to abiotic and biotic stresses, is crucial to improving crop production capacity. Deciphering the molecular mechanisms that drive a particular trait is one of the most critical research areas in biology. However, these genotype-phenotype interactions are difficult to identify because they occur at different molecular levels in the plant and are strongly influenced by environmental factors (e.g., climate change). Relevant information is also hard for biologists to find, as it is often dispersed across several online databases, each with its own data model, scale, and means of access. Today's major challenges therefore lie in developing methods to integrate these heterogeneous data and to enrich biological knowledge. Scientists also need methods to mine this mass of data and highlight the information that identifies key genes. To this end, we developed the AgroLD platform [1], a knowledge graph that uses Semantic Web technologies to integrate heterogeneous agronomic data from the genome to the phenome (i.e., from the set of genes to the set of phenotypes observed in a plant organism). AgroLD is under active development and currently contains more than 900 million triples, resulting from the integration of around 100 datasets gathered in 33 named graphs.

The thesis is proposed within the framework of the DIG-AI ANR project, which aims to develop machine learning methods combined with knowledge graphs such as AgroLD to study the molecular interactions that drive phenotype development in crops.


Objective 1: The current challenges relate to the development of methods for the functional analysis of genes, and in particular methods for the prioritization of candidate genes. The data integrated from databases are incomplete, heterogeneous, and insufficient to infer gene function accurately. One of the first objectives of the thesis will therefore be to develop knowledge extraction methods that retrieve functional information on genes from scientific documents.


Objective 2: The recent success of graph neural networks (GNNs) suggests the possibility of systematically incorporating multiple sources of information into a heterogeneous network and learning the nonlinear relationship between phenotypes and genes [2]. However, knowledge graphs like AgroLD can be complex and contain noisy information. As proposed by [3, 4], some GNN models can reduce the influence of noisy data on the overall prediction by assigning low weights to unreliable nodes and edges. The second objective will be to develop an approach adapted to the AgroLD context by building meaningful representations from the high-dimensional and complex gene data.
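To illustrate the idea of giving low weights to unreliable edges, here is a minimal Python sketch of attention-style neighbor aggregation. It is not the method of [3, 4] and is not tied to AgroLD: the function names, the toy feature vectors, and the reliability scores are all hypothetical, and in a real GNN the scores would be learned rather than fixed by hand.

```python
import math

def softmax(scores):
    """Turn raw edge scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(neighbor_features, edge_scores):
    """Weighted mean of neighbor feature vectors, weights = softmax(edge_scores)."""
    weights = softmax(edge_scores)
    dim = len(neighbor_features[0])
    aggregated = [
        sum(w * feats[i] for w, feats in zip(weights, neighbor_features))
        for i in range(dim)
    ]
    return aggregated, weights

# Hypothetical toy data: three neighbors of one gene node.
# The third neighbor is connected through an unreliable (noisy) edge.
neighbor_features = [[1.0, 0.0], [0.9, 0.1], [-5.0, 4.0]]
edge_scores = [2.0, 2.0, -3.0]  # assumed reliability scores, fixed here for illustration

aggregated, weights = aggregate(neighbor_features, edge_scores)
print("edge weights:", [round(w, 4) for w in weights])
print("aggregated features:", [round(x, 4) for x in aggregated])
```

Because the noisy edge receives a weight close to zero, its outlying features barely shift the aggregated representation of the gene node.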


Objective 3: Finally, based on previous candidate gene studies in the biomedical field [5, 6], and because inferring gene regulatory networks (GRNs) can be formulated as a link prediction problem for GNNs [7], the third objective will be to apply GNN models to candidate gene prioritization and GRN inference in order to answer biological questions related to the adaptation of crops to drought stress and plant diseases.
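As a rough illustration of how candidate gene prioritization can be cast as link prediction, the Python sketch below scores a possible link between a trait node and each gene node from their embeddings, then ranks the genes by score. The embeddings, gene names, and the sigmoid dot-product decoder are assumptions chosen for illustration; in the thesis they would come from a trained GNN, and the project may well use a different scoring function.

```python
import math

def link_score(u, v):
    """Sigmoid of the dot product of two embeddings, used here as a simple link decoder."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 / (1.0 + math.exp(-dot))

def rank_candidates(target_embedding, gene_embeddings):
    """Return (gene, score) pairs sorted from most to least likely link."""
    scored = [
        (gene, link_score(target_embedding, emb))
        for gene, emb in gene_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: one trait node and three candidate gene nodes.
trait = [1.0, 0.5, -0.2]
genes = {
    "geneA": [0.9, 0.4, -0.1],
    "geneB": [-0.8, 0.1, 0.6],
    "geneC": [0.2, 0.1, 0.0],
}

ranking = rank_candidates(trait, genes)
for gene, score in ranking:
    print(gene, round(score, 3))
```

The same scoring scheme applies to GRN inference if both endpoints are gene nodes and the predicted link stands for a regulatory interaction.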


Keywords: Deep Learning, Graph Neural Networks, Bioinformatics, Gene Prioritization, Gene Regulatory Networks, Knowledge Graphs, Neuro-symbolic AI


References


1. Venkatesan A, Tagny Ngompe G, Hassouni NE, Chentli I, Guignon V, Jonquet C, et al. Agronomic Linked Data (AgroLD): A knowledge-based system to enable integrative biology in agronomy. PLOS ONE. 2018;13:1–17. 

2. Zhang X-M, Liang L, Liu L, Tang M-J. Graph Neural Networks and Their Current Applications in Bioinformatics. Front Genet. 2021;12.

3. Neil D, Briody J, Lacoste A, Sim A, Creed P, Saffari A. Interpretable Graph Convolutional Neural Networks for Inference on Noisy Knowledge Graphs. arXiv:1812.00279 [cs, stat]. 2018.

4. Li X, Saude J. Explain Graph Neural Networks to Understand Weighted Graph Features in Node Classification. arXiv:2002.00514 [cs]. 2020.

5. Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. Bioinformatics. 2018;34:i901–7.

6. Chen J, Althagafi A, Hoehndorf R. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics. 2021;37:853–60.

7. Gligorijević V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018;34:3873–81.


Location: University of Montpellier. LIRMM computer science lab and IRD research institute, Montpellier, France.

 

Contact: pierre (dot) larmande (at) ird (dot) fr and jerome (dot) aze (at) lirmm (dot) fr 



Thesis advisors: 

LIRMM Univ. Montpellier:  Jérôme Azé and François Scharffe

DIADE IRD: Mikael Lucas and Pierre Larmande


Expected profile

The candidate must hold a Master's degree or equivalent (BAC+5) from a university or engineering school, with a specialization in data science, graph theory, machine learning, or a related field. A good understanding of molecular biology and bioinformatics is a plus. We expect applicants to have a solid background in programming (Python) and a good command of English.


Duration: 3 years

Funding: secured by the DIG-AI ANR research project (2023-2027)

Salary: 2,135 euros gross per month

Starting date: between September 1 and December 1, 2023



Who we are

LIRMM – Laboratory of Informatics, Robotics, and Microelectronics of Montpellier: LIRMM (https://www.lirmm.fr) is a 350-person cross-faculty joint research unit of the University of Montpellier (UM) and CNRS whose research activities cover a broad range of topics, including AI, knowledge engineering, bioinformatics, integrated, mobile and communicating systems, algorithms, human-machine interaction, robotics, databases, distributed systems and more. LIRMM's computer science department has 85 permanent researchers and more than 70 PhD candidates.


DIADE-IRD is a 100-person joint unit of IRD, Univ. Montpellier and CIRAD (http://www.diade.ird.fr/en/) which aims to understand the diversification of tropical plants whose conservation, management and exploitation are important issues for sustainable development. The unit develops new approaches and tools for managing big data and integrating multi-scale data so that these data can be used optimally. DIADE-IRD leads the development of the AgroLD platform, with the ambition of providing the tools and methods needed to exploit the data and knowledge produced on cultivated tropical plants.