"Ensuring Robustness in LLM-based Research: Reproducibility, Interoperability, and Reliable Evaluation" --- BLAH9
Rice and wheat are important food crops worldwide. Data resources for rice and wheat are rich and diverse, covering genomics, transcriptomics, proteomics, and metabolomics, which are collectively known as multi-omics data. Known data resources for characterizing rice genes include Rice Gene Index, Oryzabase, Rice-Alterome, and Rice Trait Ontology. Those for wheat include WheatOmics, Triticeae-Gene Tribe, and Wheat Trait Ontology. The Ensembl Plants database covers the genetic information of various crops, including both rice and wheat (Tab. 1).
A dilemma arises when researchers search for comprehensive information about a biological object, because multi-omics data is usually scattered across different databases and lacks a unified format. For example, a researcher interested in a drought-resistant phenotype and its associated genes, DNA sequences, protein sequences, and transcripts has to move between databases to obtain data for each omics layer. Moreover, data formats from diverse resources always need to be aligned.
To avoid the complexity of shifting from one database to another, Q&A platforms based on Large Language Models (LLMs) can integrate the existing resources and provide fast, easy-to-understand responses to users' questions using the Retrieval Augmented Generation (RAG) strategy. RAG gives an LLM the ability to retrieve information from data sources and use it as the basis for generating responses.
Supported by BLAH9, we call for the construction of "An LLM-based Retrieval Augmented Generation Pipeline for Multi-omics Rice and Wheat Data". The project steps and open issues are as follows:
How to semantically parse multi-omics data by accurate and reliable indexing, thus facilitating personalized Q&A generation for resource linking.
How to smoothly handle database disagreements during resource linking when using the RAG strategy.
How to employ RAG and LLM to build a reliable pipeline offering user-friendly and knowledge-supportive query services.
BLAH9 (the 9th Biomedical Linked Annotation Hackathon): BLAH is an annual hackathon that promotes the development of the BioNLP community through the sharing and linking of biomedical literature annotation and mining resources. This year, BLAH9 is organized around the special theme "Ensuring Robustness in LLM-based Research: Reproducibility, Interoperability, and Reliable Evaluation". The registration, timeline, and more information about BLAH9 can be found here.
Retrieval Augmented Generation obtains relevant knowledge through retrieval and integrates it into the prompt so that the language model can refer to that knowledge when generating an answer. The core of RAG can therefore be understood as “Retrieval + Generation”: the former uses the efficient storage and retrieval capabilities of a vector database to recall the target knowledge, and the latter uses a large model and prompt engineering to turn the recalled knowledge into the target answer (Fig. 1).
Figure 1. Mechanism of Retrieval Augmented Generation.
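The “Retrieval + Generation” mechanism can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the passages are placeholders rather than real annotations, retrieval uses simple token overlap instead of a vector database, and `generate` is a hypothetical stub standing in for the LLM call.

```python
import re

# Placeholder passages standing in for records in the integrated database.
# Gene names and traits are invented for illustration only.
PASSAGES = [
    "GeneA is annotated with the trait drought tolerance in rice.",
    "GeneB is annotated with the trait grain length in wheat.",
    "GeneC is annotated with the trait plant height in rice.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, k=2):
    """Retrieval step: rank passages by the number of tokens shared with the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Integrate the retrieved knowledge into the prompt."""
    lines = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{lines}\n"
        f"Question: {query}\n"
        "Answer:"
    )

def generate(prompt):
    """Hypothetical stand-in for the LLM: returns the top-ranked context line."""
    context_lines = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return context_lines[0] if context_lines else "No supporting context was retrieved."

question = "Which gene is linked to drought tolerance in rice?"
context = retrieve(question, PASSAGES, k=2)
print(generate(build_prompt(question, context)))
```

In a real pipeline the overlap score would be replaced by embedding similarity and the stub by a call to the chosen model; the division of labor between the two steps stays the same.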
Multi-omics data refers to a combination of multiple types of biological data, including genomics, transcriptomics, proteomics, and metabolomics, which together give a comprehensive view of the biological system. The complexity of the data and the dispersion of the resources pose a major challenge for data indexing. A RAG+LLM-based approach helps integrate these distinct omics datasets and supports a deeper understanding of the complex structures within multi-omics data. This, in turn, improves the system's ability to comprehend user queries and generate reliable, accurate responses.
To build a data resource with more comprehensive coverage of rice and wheat omics data, we need to integrate information from multiple data sources. Based on our research, the data resources we intend to include and their corresponding data types are listed below:
For Rice:
Rice Gene Index: Gene ID, Ortholog Gene Indices, Assembly Information, Gene Loci, Gene Symbols, Function, GO.
Oryzabase: Trait Id, CGSNL Gene Symbol, Gene symbol synonym(s), CGSNL Gene Name, Gene name synonym(s), Allele, Chromosome No., Explanation Ja, Explanation En, Trait Class, RAP ID, MSU ID, Gramene ID, Arm, Locate(cM), Gene Ontology, Trait Ontology, Plant Ontology.
Rice-Alterome: Gene Symbol, Alteration, Downstream Term, Term Ancestor, Related Publications, Information of the Publications.
Rice Trait Ontology: An ontology for rice traits and phenotypes in scientific publications.
Ensembl Plants: Assembly Tables, External References, Features, Fundamental Tables, ID Mapping, Miscellaneous.
For Wheat:
WheatOmics: Gene ID, Gene Name, Position Trait, Species, DOI.
Triticeae-Gene Tribe: Gene ID, Genome, Location, Description, Protein Property, Gene Ontology, Homology, Microcollinearity, Regulation, Expression, Transcript, Exon/CDS Regions, Protein Sequence.
Wheat Trait Ontology: An ontology for wheat traits and phenotypes in scientific publications.
Ensembl Plants: Assembly Tables, External References, Features, Fundamental Tables, ID Mapping, Miscellaneous.
Building an indexing service to parse data from different resources.
Gene ID Addition: For data with only Gene Symbols, add a Gene ID column based on the Gene IDs in Ensembl Plants.
Gene ID Normalization: To eliminate disagreements in how genes are mentioned across data resources, Gene IDs will be normalized to the version used in Ensembl Plants (Disagreement).
Data Fields Selection: Before integrating the data, identify the key fields to be extracted from each database. Ensure that each field (e.g., Gene Symbol, Gene Function, Disease Association) is distinct and unambiguous.
Matching and integrating data: Use SQLiteStudio to match and merge the data from different sources and to handle disagreements among them.
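The Gene ID addition and normalization steps above can be sketched with Python's built-in `sqlite3` module, which operates on the same SQLite databases that SQLiteStudio manages. The table names, columns, and identifiers (`REF0001`, `SYM1`, `ALT-1`, and so on) are hypothetical placeholders, not real schemas or gene IDs; the `reference` table merely stands in for an Ensembl Plants ID mapping.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with placeholder tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE reference (gene_id TEXT PRIMARY KEY, symbol TEXT);
        CREATE TABLE alias (alias_id TEXT PRIMARY KEY, gene_id TEXT);
        CREATE TABLE source_a (symbol TEXT, trait TEXT);
        CREATE TABLE source_b (local_id TEXT, description TEXT);
    """)
    # reference: stand-in for the unified ID scheme.
    conn.executemany("INSERT INTO reference VALUES (?, ?)",
                     [("REF0001", "SYM1"), ("REF0002", "SYM2")])
    # alias: maps IDs from other schemes onto reference IDs.
    conn.executemany("INSERT INTO alias VALUES (?, ?)", [("ALT-1", "REF0001")])
    # source_a: a resource that only provides gene symbols.
    conn.executemany("INSERT INTO source_a VALUES (?, ?)",
                     [("sym1", "drought tolerance"), ("SYM3", "grain length")])
    # source_b: a resource that mixes its own IDs with reference IDs.
    conn.executemany("INSERT INTO source_b VALUES (?, ?)",
                     [("ALT-1", "description from source B"),
                      ("REF0002", "another description")])
    return conn

def add_gene_ids(conn):
    """Gene ID addition: attach a reference ID to symbol-only rows (case-insensitive)."""
    return conn.execute("""
        SELECT r.gene_id, a.symbol, a.trait
        FROM source_a AS a
        LEFT JOIN reference AS r ON UPPER(a.symbol) = UPPER(r.symbol)
        ORDER BY a.rowid
    """).fetchall()

def normalize_ids(conn):
    """Gene ID normalization: map alias IDs to reference IDs, keep reference IDs as-is."""
    return conn.execute("""
        SELECT COALESCE(al.gene_id, r.gene_id) AS gene_id, b.local_id, b.description
        FROM source_b AS b
        LEFT JOIN alias AS al ON b.local_id = al.alias_id
        LEFT JOIN reference AS r ON b.local_id = r.gene_id
        ORDER BY b.rowid
    """).fetchall()

conn = build_demo_db()
for row in add_gene_ids(conn) + normalize_ids(conn):
    print(row)
```

The `LEFT JOIN` keeps rows that cannot be matched, leaving their `gene_id` empty, so unresolved disagreements stay visible for manual review instead of being silently dropped.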
Search engine setup:
Use BioBERT¹ to convert the textual information in the database (e.g., gene descriptions, disease associations) into numerical embedding vectors.
Chroma, a specialized vector database, will be used to store and organize the generated embedding vectors. The embeddings will be indexed in a way that allows for fast and efficient similarity searches.
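The embed-store-search flow can be sketched without external dependencies. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words function standing in for a BioBERT encoder, `TinyVectorIndex` is a hypothetical in-memory stand-in for a vector database collection, and the record texts are placeholders.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: token counts (placeholder for a BioBERT sentence vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class TinyVectorIndex:
    """Hypothetical in-memory stand-in for a vector database collection."""

    def __init__(self):
        self._items = []  # each entry is (record id, text, vector)

    def add(self, ids, documents):
        """Embed each document and store it alongside its record id."""
        for doc_id, doc in zip(ids, documents):
            self._items.append((doc_id, doc, embed(doc)))

    def query(self, text, n_results=2):
        """Return the n_results stored records most similar to the query text."""
        q = embed(text)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(doc_id, doc) for doc_id, doc, _ in ranked[:n_results]]

index = TinyVectorIndex()
index.add(
    ids=["r1", "r2", "r3"],
    documents=[
        "protein kinase involved in drought response",
        "transcription factor controlling grain size",
        "enzyme in starch biosynthesis",
    ],
)
print(index.query("drought response kinase", n_results=1))
```

A dense encoder would also match records that share meaning but not vocabulary, which is the main reason to prefer learned embeddings over the word counts used here.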
LLM integration and prompt design for Q&A generation:
The LLM to be used is Llama2². This model will be tasked with understanding user queries and generating coherent responses based on the information retrieved from the vector database.
Design a prompt to guide Llama2 in answering questions in the most relevant and accurate way.
Create templated responses for frequently asked questions to ensure consistency and speed in answering common inquiries.
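One possible way to combine a guiding prompt with templated answers is sketched below. The prompt wording, the single FAQ pattern, and the record fields (`symbol`, `gene_id`, `description`) are assumptions made for illustration, and the identifiers are placeholders; the actual prompt and FAQ set would be decided during the hackathon.

```python
import re
from string import Template

# Hypothetical prompt that constrains the model to the retrieved records.
PROMPT = Template(
    "You are an assistant for rice and wheat multi-omics data.\n"
    "Use only the records below. If they do not contain the answer, say so.\n"
    "Records:\n$records\n"
    "Question: $question\n"
    "Answer:"
)

# Hypothetical FAQ patterns paired with templated answers.
FAQ_PATTERNS = [
    (
        re.compile(r"^what is the gene id of (?P<symbol>\w+)\??$", re.IGNORECASE),
        Template("The gene ID recorded for $symbol is $gene_id."),
    ),
]

def answer_or_prompt(question, records):
    """Return ('template', answer) for a known FAQ, otherwise ('prompt', prompt text)."""
    for pattern, template in FAQ_PATTERNS:
        match = pattern.match(question.strip())
        if not match:
            continue
        symbol = match.group("symbol").lower()
        for rec in records:
            if rec["symbol"].lower() == symbol:
                return "template", template.substitute(
                    symbol=rec["symbol"], gene_id=rec["gene_id"]
                )
    # No FAQ template applies: build a prompt for the LLM from the retrieved records.
    lines = "\n".join(
        f"- {r['symbol']} ({r['gene_id']}): {r['description']}" for r in records
    )
    return "prompt", PROMPT.substitute(records=lines, question=question)

records = [{"symbol": "SYM1", "gene_id": "REF0001", "description": "placeholder description"}]
print(answer_or_prompt("What is the gene ID of sym1?", records))
print(answer_or_prompt("Which genes relate to drought?", records)[0])
```

Answering frequent questions from templates avoids a model call entirely, which keeps those answers consistent and fast, while everything else falls through to the prompt.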
To provide Q&A generation and linking services through a web platform, Django will be used to handle both frontend and backend development.
Frontend:
The platform will feature a simple user interface allowing users to input queries about the genetic characteristics of rice and wheat.
Backend:
User queries will be passed through the system: BioBERT embeddings are used to retrieve relevant data, and Llama2 generates the responses. The results will be delivered in real time through the platform.
Achieving reliable and unified indexing for multi-omics data from different databases.
Using RAG to provide fast and accurate access to integrated multi-omics data.
Providing a Q&A service for resource linking.
Enhancing the conceptual understanding of bio-related professional queries.
The task flow chart is in Figure 2.
Figure 2. Process of RAG+LLM to build the intelligent resource linking of multi-omics information for rice and wheat
Day 1. Collecting Data from Different Resources and Integrating the Data
Downloading multi-omics data of rice and wheat from different resources.
Normalization of the data.
Organizing the data in an appropriate structure (e.g., indexing, duplicate removal).
-----------------------------------------------------------------
Day 2. Assessment of the Integrated Multi-omics Data
Evaluating coverage level of the multi-omics data.
Evaluating strategies for data standardization.
Designing the architecture of the integrated multi-omics data.
-----------------------------------------------------------------
Day 3. Link the Integrated Data Resources to LLM Using RAG Strategy
Vectorizing the integrated multi-omics data and storing it in the Chroma vector database.
Setting up the search engine with an LLM based on the RAG architecture, using the Llama2 model.
Implementing the LLM+RAG intelligent resource linking.
-----------------------------------------------------------------
Day 4. Evaluation and Web Page Publication
Checking the accuracy of the links provided.
Providing convenient query functions in a user-friendly web interface.
Web page publication.
-----------------------------------------------------------------
*Specific scheduling may be flexible according to hackathon discussions.
1. Lee, Jinhyuk et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics 36(4) (2020): 1234-1240.
2. Touvron, Hugo et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv:2307.09288 (2023).
College of Informatics
Huazhong Agricultural University
Wuhan, Hubei 430070, China
Yawen Liu, liuyawen854@foxmail.com
Jingbo Xia, xiajingbo.math@gmail.com