Corpus-ET, BLAH8 Hackathon Project

"automatic Corpus Expansion with Topic transferring by leveraging LLM"

---the 8th BLAH Biomedical Linked Annotation Hackathon


"Biomedical annotations in the age of LLMs "  

                                 --- BLAH 8 

1. Project Background

In the era of large language models (LLMs), developing an NLP pipeline for automatic corpus expansion is a question of growing interest in the BioNLP community.

Two limitations in corpus construction

i) Corpora are usually constructed manually, which is time-consuming and labor-intensive and makes large-scale corpora difficult to build.

ii) Corpora usually focus on a specific background, which makes them difficult to apply across topics.

Corpus construction plays a key role in downstream model building and knowledge discovery in the BioNLP community, but it commonly suffers from two obvious limitations. First, common corpora, such as GENIA [1], Bacteria Biotope (BB) [2], and Annotation of Genes with Alteration-Centric function changes (AGAC) [3], are usually annotated manually, which makes a large corpus size difficult to reach; these annotations are also usually limited to abstracts and not extended to full text. Second, these corpora usually focus on specific topics and therefore have a limited scope of use. For example, GENIA only contains annotations for 9 categories of biological events, including gene expression, transcription, and protein catabolism, while the BB corpus focuses only on annotations of associations between microorganisms, habitats, and phenotypes.

Advantages of LLMs in BioNLP

The rise of large language models has widened the opportunities for overcoming the above limitations.

i) LLMs have powerful generative ability: given a small number of samples and appropriate prompts, they can output similar samples in large quantities and are not limited to one language style, which makes it possible to scale up a corpus in both size and style.

ii) LLMs have strong semantic comprehension and can capture the underlying semantics behind the annotations, which makes it promising to transfer annotations that focus on one specific topic to another.

During BLAH8, we plan to explore the potential of LLMs for expanding existing corpora and for topic transfer, to build a complete pipeline that generates the expanded corpus, and to explore the possibility of integrating this pipeline into PubAnnotation.


2. Scientific and Engineering Issues Addressed by the Project

Automatic expansion of BioNLP corpora using LLMs

Topic transfer in BioNLP corpora using LLMs

3. Case Study, "AGAC-to-AD"

i) Original corpus with cancer topic, AGAC

AGAC [3] is the corpus proposed by the HZAU-BioNLP team in BLAH4 and initially released and applied in BLAH5. The corpus focuses on genetic variants that bring about downstream functional changes. Its 500 manually annotated abstracts are PubMed abstracts that include cancer as a keyword.

ii) Example data repository for AGAC annotations: AML-Alterome

We developed NLP pipelines to annotate PubMed and PMC texts using the annotation logic of AGAC.

A set of PubMed/PMC-wide literature on acute myeloid leukemia (AML) was annotated and collected.

Data repo provides annotation visualization and query. Please visit here

iii) "AGAC to AD": topic transfer from cancer to a genetic disease.

We plan to use AGAC as a case study for corpus expansion and topic transfer using LLMs.

To-do: Use a small number of AGAC samples as prompt input to guide the LLM to generate more AGAC annotations and to expand the annotations to full text.

To-do: Use the LLM to transfer AGAC from its cancer topic to texts on a genetic disease, Alzheimer's disease (AD), resulting in the construction of an AD-AGAC corpus.
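
The two to-do items above both start from a few-shot prompt built from a handful of annotated samples. A minimal sketch of how such a prompt might be assembled is given below. The `AnnotatedSample` class, the `build_prompt` function, the prompt wording, the example sentence, and the labels are all hypothetical illustrations, not the project's actual prompt or data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedSample:
    """One annotated text: the raw text plus (mention, label) pairs."""
    text: str
    annotations: List[Tuple[str, str]]


def format_sample(sample: AnnotatedSample) -> str:
    """Render one sample as a text/annotation pair for the prompt."""
    lines = [f"Text: {sample.text}", "Annotations:"]
    lines += [f"- {mention} -> {label}" for mention, label in sample.annotations]
    return "\n".join(lines)


def build_prompt(samples: List[AnnotatedSample], target_text: str, target_topic: str) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the new text."""
    instruction = (
        "You are annotating biomedical text following the examples below. "
        f"Apply the same annotation logic to a text about {target_topic}."
    )
    examples = "\n\n".join(format_sample(s) for s in samples)
    # The prompt ends with an open "Annotations:" slot for the model to complete.
    return f"{instruction}\n\n{examples}\n\nText: {target_text}\nAnnotations:"


# Invented sentence and labels, for illustration only (not taken from AGAC).
samples = [
    AnnotatedSample(
        text="A mutation in GENE1 reduces protein activity.",
        annotations=[("mutation", "Var"), ("GENE1", "Gene")],
    )
]
prompt = build_prompt(samples, "Variants in GENE2 are discussed.", "Alzheimer's disease")
print(prompt)
```

Keeping the target topic as a parameter lets the same builder serve both corpus expansion (same topic, new text) and topic transfer (new topic, new text).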

4. Applications for More Corpora

AGAC to AD 

As stated in Section 3.

GENIA to COVID-19

GENIA [1] is a semantically annotated corpus that mainly focuses on 9 biological processes, e.g. gene expression, transcription, and protein catabolism. It provides an extremely important resource for BioNLP technology in bioinformatics.

Using an LLM to capture the internal annotation logic of GENIA, and to extend and transfer GENIA's annotations to COVID-19 literature, is promising.

BB to metagenome

The BB corpus [2] is a valuable corpus focusing on microbial habitats and phenotypes. With LLMs, the BB corpus also has the potential to be transferred to the metagenomic domain, focusing on gut flora research and helping to extract structured knowledge.

5. Brief LLM-based Ideas

Two Examples to Expand and Transfer Corpus Using LLM

Evaluate the quality of the generated corpus

Corpus precision is paramount. We provide three assessment directions for the LLM-generated corpus to ensure its quality and credibility.
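
The assessment directions are not spelled out in this section. As one possible check, assuming a small manually verified gold subset is available, generated annotations could be scored with exact-match span precision, recall, and F1. The `Span` type and the function name below are hypothetical, and the offsets and labels are invented for illustration.

```python
from typing import Set, Tuple

# (begin offset, end offset, label); an assumed representation of one annotation.
Span = Tuple[int, int, str]


def precision_recall_f1(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if offsets and label all match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented spans: two of three predictions match the gold set exactly.
predicted = {(0, 5, "Gene"), (10, 18, "Var"), (20, 25, "Gene")}
gold = {(0, 5, "Gene"), (10, 18, "Var"), (30, 35, "Disease")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Exact matching is the strictest choice; a relaxed variant that accepts overlapping spans would be a separate design decision.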




6. Discussion Issues During BLAH8

Issue 1. Prompt design and text selection for LLM-based corpus expansion and transfer

i) Discussing methods and criteria for AGAC sample selection.

ii) Evaluating representativeness and balance of the selected samples.

iii) Testing the prompt using ChatGPT online.

Issue 2. “AGAC to AD”, a case study for corpus expansion and topic transferring

i) Expanding AGAC to full-text using ChatGPT online.

ii) Transferring AGAC to AD-topic literature.

iii) Evaluating the quality of generated corpus.

iv) Discussing the evaluation method of generated corpus.

Issue 3. Applications for other corpora

i) Building a ChatGPT 3.5 API-based experimental platform for batch generation.

ii) Exploring LLM for GENIA to COVID-19.

iii) Exploring LLM for BB to metagenome.

iv) Discussing the reasonableness and quality of generated corpus.
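
For the batch generation platform in item i) above, a minimal sketch of a batch runner is shown below. It assumes the LLM call is wrapped in a caller-supplied `generate` function, so no specific API client is implied; `run_batch`, its parameters, and the result layout are hypothetical.

```python
import time
from typing import Callable, Dict, List


def run_batch(
    prompts: List[str],
    generate: Callable[[str], str],
    max_retries: int = 2,
    delay_seconds: float = 0.0,
) -> List[Dict[str, object]]:
    """Send each prompt through `generate`, retrying failures and recording the outcome."""
    results = []
    for index, prompt in enumerate(prompts):
        output, error = None, None
        for _ in range(max_retries + 1):
            try:
                output = generate(prompt)
                error = None
                break
            except Exception as exc:  # keep the batch alive when one call fails
                error = str(exc)
                time.sleep(delay_seconds)
        results.append({"index": index, "prompt": prompt, "output": output, "error": error})
    return results


def echo_generate(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to exercise the runner."""
    return prompt.upper()


batch_results = run_batch(["first prompt", "second prompt"], echo_generate)
```

Recording the error alongside each prompt makes it possible to rerun only the failed items instead of the whole batch.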

Issue 4. Pipeline integration with PubAnnotation

i) Building a complete pipeline for corpus expanding and transferring using LLM.

ii) Discussing the possibility of integrating the pipeline into PubAnnotation.
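
One concrete step toward such an integration is converting LLM output into the JSON layout that PubAnnotation uses for annotations (a `text` field plus `denotations` with character-offset spans). The sketch below assumes the LLM returns (mention, label) pairs without offsets; `to_pubannotation` is a hypothetical helper, and the sentence and labels are invented for illustration.

```python
import json
from typing import Dict, List, Tuple


def to_pubannotation(text: str, mentions: List[Tuple[str, str]]) -> Dict[str, object]:
    """Locate each (mention, label) pair in the text and emit denotations with offsets."""
    denotations = []
    next_search_start: Dict[str, int] = {}
    for mention, label in mentions:
        # Resume after the previous hit so repeated mentions map to successive occurrences.
        begin = text.find(mention, next_search_start.get(mention, 0))
        if begin == -1:
            continue  # skip mentions that do not literally occur in the text
        end = begin + len(mention)
        next_search_start[mention] = end
        denotations.append(
            {"id": f"T{len(denotations) + 1}", "span": {"begin": begin, "end": end}, "obj": label}
        )
    return {"text": text, "denotations": denotations}


# Invented sentence and labels; "GENE9" is absent from the text and is dropped.
sentence = "A mutation in GENE1 reduces protein activity."
document = to_pubannotation(
    sentence, [("mutation", "Var"), ("GENE1", "Gene"), ("GENE9", "Gene")]
)
print(json.dumps(document, indent=2))
```

Dropping mentions that cannot be found in the text is a simple guard against hallucinated entities; a real pipeline might log them for review instead.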

References

[1] Kim, J-D., et al. "GENIA corpus—a semantically annotated corpus for bio-textmining." Bioinformatics 19.suppl_1 (2003): i180-i182.

[2] Bossy, Robert, et al. "Bacteria biotope at BioNLP open shared tasks 2019." Proceedings of the 5th workshop on BioNLP open shared tasks. 2019.

[3] Wang, Yuxing, et al. "Guideline design of an active gene annotation corpus for the purpose of drug repurposing." 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 2018. https://hzaubionlp.com/agac/

Contact Us

College of Informatics

Huazhong Agricultural University

Wuhan, Hubei 430070

China

Project Applicant: Ziming Tang 

Lab name: HZAU-BioNLP. Visit our lab

Lab PI: Jingbo Xia

Affiliation: Hubei Key Lab of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China

Project page: https://github.com/HeartrooT/Corpus-ET

Ziming Tang is a master's student at HZAU. He is currently a key research member of the HZAU-BioNLP laboratory led by Dr. Jingbo Xia. His main research directions include data mining, natural language processing, and machine learning.

Email: tangziming1998@gmail.com