To evaluate the effectiveness of GeneRAG, we aim to answer the following two research questions:
RQ1: Q\&A Effectiveness: How effectively does GeneRAG answer gene-related questions?
RQ2: Downstream Task Effectiveness: How effectively does GeneRAG perform gene-oriented downstream analysis?
Downstream Tasks
Beyond the Q\&A task, we also evaluate \tool{} on other gene-oriented downstream tasks using LLMs. Specifically, we select two downstream tasks.
Cell Type Annotation. Cell type annotation is a crucial step in gene-oriented analysis, because the genes that are highly expressed in a cell indicate its cell type. GPTCelltype~\cite{hou2024assessing} uses LLMs to generate cell type annotations from the highly expressed genes in each cell. Following this method, we select 3,000 cells and instruct \tool{} to annotate their cell types.
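To make the annotation setup concrete, the following is a minimal sketch of how a prompt could be assembled from the highly expressed genes of one cell. The helper names \texttt{top\_expressed\_genes} and \texttt{build\_annotation\_prompt}, the cutoff \texttt{k}, the prompt wording, and the toy expression values are all assumptions made for illustration. They are not the prompt used by GPTCelltype or by \tool{}.

```python
from typing import Dict, List


def top_expressed_genes(expression: Dict[str, float], k: int = 10) -> List[str]:
    """Return the k gene names with the highest expression values."""
    ranked = sorted(expression.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


def build_annotation_prompt(genes: List[str]) -> str:
    """Assemble a hypothetical annotation prompt; the wording is illustrative."""
    gene_list = ", ".join(genes)
    return (
        "Identify the most likely cell type for a cell whose highly "
        f"expressed genes are: {gene_list}. Answer with the cell type name only."
    )


# Toy expression profile for a single cell (made-up values).
cell = {"CD3E": 8.2, "CD3D": 7.9, "ACTB": 5.1, "MS4A1": 0.2, "IL7R": 6.4}
prompt = build_annotation_prompt(top_expressed_genes(cell, k=3))
print(prompt)
```

The resulting string would then be sent to the model under evaluation, and the returned label compared with the reference annotation of the cell.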
Gene Interaction Prediction. Gene interaction prediction helps us understand how genes interact with each other. Following previous work~\cite{azam2024comprehensive}, we create a dataset of known interaction relationships between genes. In this task, we provide the cell type and the name of a query gene and instruct \tool{} to list the other genes that interact with it.
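As an illustration of this task, the sketch below builds a query from a cell type and a gene name, parses a comma-separated answer, and compares it with a reference set of known partners. The prompt wording, the function names, the toy reference set, and the use of precision and recall as scores are assumptions. The text does not specify how predicted gene lists are scored.

```python
from typing import Set, Tuple


def build_interaction_prompt(cell_type: str, gene: str) -> str:
    """Assemble a hypothetical query; the wording is illustrative."""
    return (
        f"In {cell_type} cells, list the genes known to interact with {gene}. "
        "Return a comma-separated list of gene symbols."
    )


def parse_gene_list(response: str) -> Set[str]:
    """Split a comma-separated answer into a set of upper-case gene symbols."""
    return {tok.strip().upper() for tok in response.split(",") if tok.strip()}


def score_prediction(predicted: Set[str], known: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted partners against the known set."""
    hits = len(predicted & known)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(known) if known else 0.0
    return precision, recall


# Toy example: the reference set and the model answer are made up.
known_partners = {"MDM2", "CDKN1A", "BAX", "ATM"}
answer = "MDM2, CDKN1A, FOO1"
precision, recall = score_prediction(parse_gene_list(answer), known_partners)
print(build_interaction_prompt("T", "TP53"))
print(f"precision={precision:.2f} recall={recall:.2f}")
```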
Q\&A Effectiveness
As shown in Table 1, GeneRAG consistently outperforms both GPT-3.5 and GPT-4o across all question types in terms of accuracy and error rates. On average, GeneRAG achieves a 39\% improvement over GPT-4o. Notably, the false negative rate on trap questions drops by 65\%. This improvement is likely due to GeneRAG's reliable reference information, which lets it focus on facts and ignore the traps in the questions. In addition, GeneRAG's accuracy on exact answer questions increases by 43\%, which we attribute to its access to trustworthy data sources.
These results suggest that GeneRAG is more effective and reliable than the baseline LLMs at answering gene-related questions, which makes it a stronger foundation for downstream analysis.
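For reference, the quantities discussed above can be written out explicitly. The definitions below assume the conventional confusion-matrix forms, with $TP$, $TN$, $FP$, and $FN$ denoting true positives, true negatives, false positives, and false negatives. The section does not spell out its definitions, nor whether the reported percentages are relative or absolute changes, so the relative form shown here is an assumption.
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\mathrm{FNR} = \frac{FN}{FN + TP},
\]
\[
\Delta_{\mathrm{rel}} = \frac{m_{\mathrm{GeneRAG}} - m_{\mathrm{baseline}}}{m_{\mathrm{baseline}}} \times 100\%,
\]
where $m$ is the value of a given metric for the indicated system. Under this reading, a negative $\Delta_{\mathrm{rel}}$ on an error rate such as the FNR corresponds to a reduction.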
Downstream Task Effectiveness
In the downstream analysis tasks, GeneRAG consistently surpasses GPT-3.5 and GPT-4o, as shown in Table 1. In the cell type annotation task, GeneRAG uses external knowledge to address the lack of information on rare cell types, achieving significant improvements over GPT-3.5 and GPT-4o (66\% and 41\%, respectively). In the gene interaction prediction task, verifying gene-gene interactions against reliable sources likewise improves accuracy and reduces error rates. Notably, the false negative rate of GPT-4o is not significantly lower than that of GPT-3.5, which indicates that stronger inference capability and training on more data that contains low-quality information do not by themselves substantially improve performance. Incorporating high-quality external data, in contrast, does.