We start by collecting gene information from reliable external databases. Specifically, we use data from NCBI, which provides comprehensive and up-to-date gene information, including gene names, functional descriptions, expression information, and related biological processes. We chose NCBI for its credibility and the depth of its gene-related data.
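As an illustrative, non-authoritative sketch, gene records of this kind can be retrieved from NCBI's public E-utilities (the esummary endpoint of the gene database); the Gene IDs and the fields retained below (name, description, summary) are assumptions for illustration, not necessarily the exact schema GeneRAG ingests.

```python
# Sketch: fetch gene summaries from NCBI E-utilities (esummary, gene database).
# The gene IDs and the fields kept here (name, description, summary) are
# illustrative assumptions, not the exact schema used by GeneRAG.
import requests

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def fetch_gene_records(gene_ids):
    """Return one plain-text record per NCBI Gene ID."""
    params = {"db": "gene", "id": ",".join(gene_ids), "retmode": "json"}
    result = requests.get(ESUMMARY_URL, params=params, timeout=30).json()["result"]
    records = []
    for uid in result["uids"]:
        doc = result[uid]
        records.append(
            f"Gene: {doc.get('name', '')}\n"
            f"Description: {doc.get('description', '')}\n"
            f"Summary: {doc.get('summary', '')}"
        )
    return records

# Example: TP53 (Gene ID 7157) and BRCA1 (Gene ID 672).
records = fetch_gene_records(["7157", "672"])
```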
To prepare the data for our model, we preprocess it by normalizing the text, removing duplicates, and ensuring consistency in gene terminology. This preprocessing ensures that the information fed into the LLM is clean and standardized, which is crucial for accurate embedding creation.
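A minimal sketch of this kind of preprocessing is given below; the alias table and normalization rules are placeholders, since the exact rules are not specified here.

```python
# Sketch of the preprocessing step: whitespace normalization, deduplication,
# and mapping gene aliases to a canonical symbol. The alias table is a toy
# placeholder; a real pipeline would draw synonyms from NCBI's gene_info file.
import re

ALIAS_TO_SYMBOL = {"P53": "TP53", "TRP53": "TP53"}  # illustrative only

def normalize_record(text):
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    for alias, symbol in ALIAS_TO_SYMBOL.items():
        text = re.sub(rf"\b{alias}\b", symbol, text)  # unify gene terminology
    return text

def preprocess(records):
    seen, cleaned = set(), []
    for rec in records:
        rec = normalize_record(rec)
        if rec and rec not in seen:                   # drop exact duplicates
            seen.add(rec)
            cleaned.append(rec)
    return cleaned
```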
Once we have the processed gene data, we use an LLM-based embedding model to create embeddings. We chose text-embedding-3-large (Harris et al., 2024) because of its ability to capture the semantic content of text, making it suitable for creating meaningful embeddings from the gene information.
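For reference, embedding a batch of records with text-embedding-3-large via the OpenAI Python client might look as follows; batching, retries, and rate-limit handling are omitted, and the snippet is a sketch rather than GeneRAG's implementation.

```python
# Sketch: embed preprocessed gene records with text-embedding-3-large using the
# OpenAI Python client (v1+ interface). Batching and error handling are omitted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts, model="text-embedding-3-large"):
    response = client.embeddings.create(model=model, input=texts)
    # The API preserves input order, so embeddings line up with `texts`.
    return [item.embedding for item in response.data]

texts = ["Gene: TP53\nDescription: tumor protein p53\nSummary: ..."]  # toy input
gene_vectors = embed_texts(texts)
```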
These embeddings are vector representations of the gene data, capturing the semantic meaning of the information. We create a vector database from these embeddings, which allows us to efficiently search and compare the data later. This step is crucial as it forms the foundation for matching user queries with the relevant gene information.
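The specific vector store is not named here; as one hedged possibility, a FAISS inner-product index over L2-normalized embeddings provides the required storage and search (inner product on unit vectors equals the cosine similarity used below).

```python
# Sketch of a vector database built with FAISS. FAISS is one possible backend;
# the paper does not specify which vector store GeneRAG uses. Embeddings are
# L2-normalized so that inner product equals cosine similarity.
import faiss
import numpy as np

def build_index(embeddings):
    matrix = np.asarray(embeddings, dtype="float32")
    faiss.normalize_L2(matrix)                     # in-place row normalization
    index = faiss.IndexFlatIP(matrix.shape[1])     # exact inner-product search
    index.add(matrix)
    return index

def search(index, query_vector, k=5):
    query = np.asarray([query_vector], dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)           # top-k most similar records
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```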
When a user inputs a prompt, GeneRAG processes it to determine its intent and content. The system converts the prompt into an embedding with the same embedding model used for the gene data, so that the query and the stored records lie in the same vector space and can be compared directly.
For detecting similarity, we use cosine similarity. Cosine similarity measures the cosine of the angle between two vectors, which in our case are the embeddings of the prompt and the gene data. It is a widely used method in natural language processing due to its effectiveness in capturing semantic similarity between texts. By using cosine similarity, we can accurately match user queries with the most relevant gene information in our database.
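Concretely, the similarity between the query embedding q and a gene-record embedding d is cos(q, d) = q·d / (||q|| ||d||); a minimal NumPy version, assuming the gene embeddings are stacked into a matrix, is shown below as a sketch.

```python
# Cosine similarity between the query embedding and each stored gene embedding.
# `gene_matrix` is an (n_records x dim) array of gene embeddings; `query_vec`
# is produced by the same embedding model as the gene records.
import numpy as np

def cosine_similarities(query_vec, gene_matrix):
    query = np.asarray(query_vec, dtype="float32")
    matrix = np.asarray(gene_matrix, dtype="float32")
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

def top_k(query_vec, gene_matrix, k=5):
    sims = cosine_similarities(query_vec, gene_matrix)
    best = np.argsort(-sims)[:k]          # indices of the most similar records
    return list(zip(best.tolist(), sims[best].tolist()))
```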
In the final step, GeneRAG uses retrieval-augmented generation to provide accurate and contextually relevant answers. We employ the Maximal Marginal Relevance (MMR) algorithm to enhance the retrieval process. MMR balances relevance and diversity in the retrieved results, ensuring that the information is both pertinent and non-redundant.
This approach ensures that the selected documents are relevant to the query while also providing diverse information, leading to more comprehensive and accurate answers.
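A standard formulation of MMR selects, at each step, the candidate d that maximizes λ·sim(q, d) − (1 − λ)·max_{s∈S} sim(d, s), where S is the set of already-selected documents. The sketch below implements this generic form with cosine similarity; the λ value and pool size are illustrative, as GeneRAG's settings are not given here.

```python
# Sketch of Maximal Marginal Relevance (MMR) re-ranking over a candidate pool.
# At each step we pick the candidate that best trades off relevance to the
# query against redundancy with documents already selected. lambda_ = 1.0
# reduces to pure relevance ranking; 0.7 is an illustrative choice.
import numpy as np

def mmr_select(query_vec, candidate_matrix, k=5, lambda_=0.7):
    q = np.asarray(query_vec, dtype="float32")
    q = q / np.linalg.norm(q)
    docs = np.asarray(candidate_matrix, dtype="float32")
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    relevance = docs @ q                          # cosine similarity to the query
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        if not selected:
            scores = {i: relevance[i] for i in remaining}
        else:
            chosen = docs[selected]
            scores = {
                i: lambda_ * relevance[i]
                   - (1 - lambda_) * float(np.max(chosen @ docs[i]))
                for i in remaining
            }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected                               # indices into the candidate pool
```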
In this paper, we introduced GeneRAG, a framework for enhancing LLMs in gene-related tasks using RAG. By leveraging the MMR algorithm, GeneRAG improves retrieval quality and provides more accurate responses. Evaluations show that GeneRAG outperforms GPT-3.5 and GPT-4 on gene-related question answering, cell type annotation, and gene interaction prediction. These findings highlight GeneRAG's potential to bridge the gap between LLMs and external knowledge bases, advancing their application in genetics and other scientific fields.