"Biomedical annotations in the age of LLMs" --- BLAH 8
Rice is a crucial global food source. Understanding genetic alteration regulation events (GARE) in rice is pivotal for precision breeding research. With the rapid growth of rice-related literature, a large amount of GARE information on rice has been extracted by an AGAC-based BioNLP pipeline and made accessible on Rice-Alterome.
Today, traditional BioNLP corpora and methods show limited performance on rice-related literature because they were designed for other domains. Interestingly, the emerging large language models (LLMs) exhibit broad background knowledge and powerful semantic understanding, offering the possibility of enhancing traditional BioNLP methods. Integrating LLMs with traditional BioNLP is therefore a promising way to automatically construct a valuable resource for rice precision breeding.
Supported by BLAH8, we call for discussion and collaboration on "enhancing the genetic alterations regulation events extraction on rice literature by synergizing LLM and traditional BioNLP methods". Project issues and discussion points are as follows:
How to organize a golden dataset for LLM prompts from the existing annotated rice corpus.
How to design and build a pipeline that integrates traditional BioNLP methods with LLMs for the extraction of genetic alteration regulation events.
How to use the pipeline to extract events from crop texts and build accessible structured databases.
BLAH8 (the 8th Biomedical Linked Annotation Hackathon): BLAH is an annual hackathon that promotes the development of the BioNLP community through the sharing and linking of biomedical literature annotation and mining resources. This year, BLAH8 is organized around the special theme "Biomedical annotations in the age of LLMs". Registration, the timeline, and more information about BLAH8 can be found here.
Alterome is a term from the AGAC corpus (Annotation of Genes with Alteration-Centric function changes). The AGAC corpus aims to annotate functional mutations and the subsequent biological processes. The "Variation" named entity and 4 other bio-concept named entities help recognize the mutation semantics and the subsequent biological processes in text, while the direction of a mutation's effect on biological processes is annotated by 3 regulatory named entities. In addition, 2 types of thematic relations annotate the semantic relations between labels. More information about AGAC [1-3] can be found here.
BioNER methods for rice-related concepts.
Rice Gene: PubTator (e.g. ABA)
Genetic Alterations: PubTator & -based model (e.g. p.K32Q)
Biological Process: OGER & spaCy (e.g. GO:0006915 apoptotic process)
Rice Trait: OGER & spaCy (e.g. TO:0000144 milled rice yield)
BioRE methods for relations between concepts.
Rule-based Rice Gene Alteration Regulatory Event Identification.
Complete events: Rice Gene -- occurs in -- gene alteration -- causes -- BP & Rice Trait
(e.g. PD1 -- occurs in -- p.K32Q -- causes -- TO:0000329 tillering ability)
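The shape of a complete event can be illustrated with a minimal sketch. The source does not spell out the actual rules, so the code below assumes the simplest possible one: every gene, alteration, and BP/trait recognized in the same sentence is combined into one event. The class names, label strings, and the `complete_events` helper are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    label: str  # hypothetical labels: "Gene", "Alteration", "BP", "Trait"
    text: str   # surface form or normalized ID


@dataclass(frozen=True)
class Event:
    gene: str
    alteration: str
    effect: str  # a biological process or a rice trait


def complete_events(entities):
    """Combine every gene, alteration, and BP/trait found in one sentence.

    Assumption: plain sentence-level co-occurrence stands in for the real rules.
    An event is only produced when all three slots can be filled.
    """
    genes = [e.text for e in entities if e.label == "Gene"]
    alterations = [e.text for e in entities if e.label == "Alteration"]
    effects = [e.text for e in entities if e.label in ("BP", "Trait")]
    return [Event(g, a, x) for g, a, x in product(genes, alterations, effects)]


# Entities taken from the example event above.
sentence_entities = [
    Entity("Gene", "PD1"),
    Entity("Alteration", "p.K32Q"),
    Entity("Trait", "TO:0000329 tillering ability"),
]
events = complete_events(sentence_entities)
for ev in events:
    print(f"{ev.gene} -- occurs in -- {ev.alteration} -- causes -- {ev.effect}")
```

A sentence that lacks any one of the three slots yields no event under this rule, which matches the notion of a "complete" event.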
Literature source:
32,229 abstracts from PubMed.
56,368 full-text articles from PMC.
Annotations:
530,993 records.
8,256 unique rice genes.
3,376 unique gene alterations.
4,855 unique BPs.
513 normalized rice traits.
63,860 unnormalized rice traits.
4,195 unique rice genetic alteration regulatory events.
Web site: http://lit-evi.hzau.edu.cn/RiceAlterome/
Web Service:
Data query for rice genes.
Citation table browser: Gene, Genetic alteration, BP & Trait, # of citations.
Detail table browser: Literature source information, Sentence support.
Figure 1. Data browser on the Rice-Alterome web page.
Explore and develop strategies for using LLMs to enhance traditional BioNLP results, using Rice-Alterome as an example.
Discuss and develop an evaluation strategy for the LLM-generated annotation results.
Build an integrable pipeline for LLM-enhanced BioNLP annotation, and attempt to release it to platforms (e.g. PubAnnotation).
Construct and release the LLM-enhanced Rice-Alterome datasets.
Figure 2. An attempt at LLM-aided prompting and annotation.
Day 1. Golden dataset construction
Construct the golden datasets of rice gene regulatory events from the currently annotated rice corpus.
Discuss the quality of selected rice gene regulatory events.
Discuss the representativeness and balance of the constructed datasets.
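One way to approach the balance question is stratified sampling with a cap per group. The sketch below is only an illustration under assumptions: the records are toy placeholders (not real Rice-Alterome data), the `effect_type` field and the `sample_balanced` helper are hypothetical, and the grouping key would need to be chosen during the discussion.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(records, key, per_group, seed=0):
    """Draw at most `per_group` records from each group defined by `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy placeholder records: 5 trait events and 2 BP events.
records = (
    [{"id": f"trait-{i}", "effect_type": "Trait"} for i in range(5)]
    + [{"id": f"bp-{i}", "effect_type": "BP"} for i in range(2)]
)
gold = sample_balanced(records, key="effect_type", per_group=3)
counts = Counter(rec["effect_type"] for rec in gold)
print(counts)
```

Capping each group keeps frequent event types from dominating the golden dataset, while small groups are kept in full.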
-----------------------------------------------------------------
Day 2. Prompt design & Example data generation
Design a prompt that describes the task and provides sample data to guide LLM generation.
Generate novel rice gene regulatory events from example data using the online version of ChatGPT.
Discuss the reasonableness of the prompt design.
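A few-shot prompt of the kind described above could be assembled roughly as follows. This is a sketch, not the project's prompt: the task wording, the `Sentence:`/`Event:` layout, and the `build_prompt` helper are assumptions, and the example sentences are toy placeholders rather than corpus text.

```python
def build_prompt(task, examples, sentence):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task, ""]
    for ex in examples:
        lines.append(f"Sentence: {ex['sentence']}")
        lines.append(f"Event: {ex['event']}")
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    lines.append("Event:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical task wording; the event notation follows the pipeline description.
task = (
    "Extract the rice gene regulatory event from the sentence, in the form: "
    "gene -- occurs in -- alteration -- causes -- effect."
)
examples = [
    {
        # Toy placeholder sentence, not taken from the literature.
        "sentence": "A toy sentence mentioning PD1, p.K32Q and tillering ability.",
        "event": "PD1 -- occurs in -- p.K32Q -- causes -- TO:0000329 tillering ability",
    }
]
prompt = build_prompt(task, examples, "A new toy sentence to annotate.")
print(prompt)
```

Keeping the prompt builder as a plain function makes it easy to swap examples in and out when the reasonableness of the design is discussed.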
-----------------------------------------------------------------
Day 3. LLM API construction & Batch data generation
Build an experimental platform based on the ChatGPT 3.5 API.
Batch-generate new rice gene regulatory event annotations using the API.
Discuss and test the feasibility of using the API for batch annotation.
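The batch step can be prototyped independently of any particular client library. In the sketch below, `call_llm` is a hypothetical callable that takes a prompt string and returns the model's answer; it stands in for whatever API wrapper the platform ends up using, and no real API is called. The retry count and pause are likewise illustrative assumptions.

```python
import time


def annotate_batch(prompts, call_llm, max_retries=2, pause=0.0):
    """Send each prompt through `call_llm`, retrying failed calls.

    `call_llm` is a hypothetical stand-in for the real API wrapper.
    A prompt whose calls all fail gets `None` as its answer.
    """
    results = []
    for prompt in prompts:
        answer = None
        for _ in range(max_retries + 1):
            try:
                answer = call_llm(prompt)
                break
            except Exception:  # broad on purpose: the real client's errors are unknown here
                time.sleep(pause)
        results.append({"prompt": prompt, "answer": answer})
    return results


def fake_llm(prompt):
    """Offline stub used only to exercise the batch loop."""
    return f"echo: {prompt}"


results = annotate_batch(["p1", "p2"], fake_llm)
print(results)
```

Passing the client in as a function keeps the loop testable offline and makes it straightforward to compare different models or endpoints later.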
-----------------------------------------------------------------
Day 4. Generated data evaluation & Prompt optimization
Design evaluation metrics for the generated rice gene regulatory events.
Optimize the prompt based on the evaluation.
Discuss the justification of the evaluation metrics.
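As a starting point for that discussion, one common choice is precision, recall, and F1 over exact-match event triples. The sketch below assumes events are compared as (gene, alteration, effect) tuples; the metric choice, the `prf1` helper, and the toy tuples (apart from the PD1 example) are illustrative assumptions, not the project's agreed metrics.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 for exact-match event tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy events: one shared event, one gold-only, one predicted-only.
gold_events = {
    ("PD1", "p.K32Q", "TO:0000329"),
    ("geneB", "alt2", "GO:0006915"),
}
generated_events = {
    ("PD1", "p.K32Q", "TO:0000329"),
    ("geneC", "alt3", "TO:0000144"),
}
p, r, f = prf1(generated_events, gold_events)
print(p, r, f)
```

Exact matching is strict; a relaxed variant that scores partially correct events (for example, right gene and trait but wrong alteration) may be worth considering during prompt optimization.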
-----------------------------------------------------------------
Day 5. Generated data curation and release
Curate the generated rice gene regulatory events and present them on the online data browser.
Discuss the format of the curated corpus and the function of the online data browser.
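One candidate format for the curated corpus is JSON Lines, with one event per line. The sketch below is only a suggestion for the discussion: the field names, the `to_jsonl` helper, and the `source` placeholders are hypothetical, and duplicates are detected on the (gene, alteration, effect) triple alone.

```python
import json


def to_jsonl(events):
    """Serialize events as JSON Lines, keeping the first copy of each triple."""
    seen = set()
    lines = []
    for ev in events:
        key = (ev["gene"], ev["alteration"], ev["effect"])
        if key in seen:  # same triple already kept; later sources are dropped
            continue
        seen.add(key)
        lines.append(json.dumps(ev, sort_keys=True))
    return "\n".join(lines)


# Toy events; the "source" values are placeholders, not real document IDs.
curated_input = [
    {"gene": "PD1", "alteration": "p.K32Q", "effect": "TO:0000329", "source": "toy-doc-1"},
    {"gene": "PD1", "alteration": "p.K32Q", "effect": "TO:0000329", "source": "toy-doc-2"},
    {"gene": "geneB", "alteration": "alt2", "effect": "GO:0006915", "source": "toy-doc-3"},
]
jsonl = to_jsonl(curated_input)
print(jsonl)
```

A line-oriented format is easy to load into a data browser and to diff between releases; whether supporting sources should be merged rather than dropped is a design question for the curation step.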
-----------------------------------------------------------------
*The specific schedule may be adjusted according to hackathon discussions.
[1] Yuxing Wang, et al. Guideline Design of an Active Gene Annotation Corpus for the Purpose of Drug Repurposing. 11th CISP-BMEI 2018, Oct 2018, Beijing.
[2] Yuxing Wang, et al. An Active Gene Annotation Corpus and Its Application on Anti-epilepsy Drug Discovery. BIBM 2019: International Conference on Bioinformatics & Biomedicine, Nov 2019, San Diego, U.S.
[3] Yuxing Wang, Kaiyin Zhou, Mina Gachloo, Jingbo Xia*. An Overview of the Active Gene Annotation Corpus and the BioNLP OST 2019 AGAC Track Tasks. BioNLP Open Shared Task 2019, workshop in EMNLP-IJCNLP 2019, Hong Kong.
College of Informatics
Huazhong Agricultural Univ
Wuhan, Hubei 430070, China
Jingbo Xia, xiajingbo.math@gmail.com
Xinzhi Yao, yaoxinzhiwork@gmail.com
Ziming Tang, tangziming1998@gmail.com