BioNLP OST 2019 (AGAC Track)

The International Workshop on BioNLP Open Shared Tasks (BioNLP-OST) 2019 has been accepted for co-location with EMNLP-IJCNLP 2019, on either the 3rd or 4th of November, in Hong Kong.

BioNLP Open Shared Task 2019

AGAC Track

Overview

The AGAC corpus contains eight trigger labels and two thematic roles.

Trigger words:

      • Trigger labels:
        1. Variation (Var) : mutations in DNA, RNA, or protein, and structural changes in molecules, e.g., mutations on the Arg248 and Arg282, mutant R282W, missense mutations.
        2. Molecular Physiological Activity (MPA) : activity at the molecular level, including molecular activity, gene expression and molecular physiological activity, e.g., phosphorylation, transcription, histone methylation, bioactivation of cyclophosphamide.
        3. Interaction : the association between molecule and molecule or molecule and cell, e.g., bind, interaction.
        4. Pathway : the various pathways, e.g., Bmp pathway, PI3K pathway.
        5. Cell Physiological Activity (CPA) : the activities that are at or above cell level, including cell responsiveness and the development and growth of cells or organs, e.g., T helper cell responses, renal development.
        6. Regulation (Reg) : neutral clue word or phrase which means no loss or gain, e.g., resulted in, regulated.
        7. Positive Regulation (PosReg) : clue word or phrase that means gain of function, e.g., facilitates, enhanced, increased.
        8. Negative Regulation (NegReg) : clue word or phrase that means loss of function, e.g., suppressed, decreased, inhibited.
      • Other entities:
        1. Disease
        2. Gene
        3. Protein
        4. Enzyme

Thematic roles:

      • ThemeOf: pointing from the theme to the center.
      • CauseOf: pointing from the cause to the center.
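The label inventory above can be captured as a small lookup, for instance to validate predicted annotations before submission. This is only a minimal sketch: the constant and function names (TRIGGER_LABELS, ENTITY_LABELS, THEMATIC_ROLES, is_valid_label) are illustrative and not part of any official AGAC tooling.

```python
# Label inventory as listed in the overview.
# All names below are hypothetical helpers, not official AGAC code.
TRIGGER_LABELS = frozenset(
    {"Var", "MPA", "Interaction", "Pathway", "CPA", "Reg", "PosReg", "NegReg"}
)
ENTITY_LABELS = frozenset({"Disease", "Gene", "Protein", "Enzyme"})
THEMATIC_ROLES = frozenset({"ThemeOf", "CauseOf"})

ALL_LABELS = TRIGGER_LABELS | ENTITY_LABELS


def is_valid_label(label: str) -> bool:
    """Return True if `label` is one of the twelve labels (case-sensitive)."""
    return label in ALL_LABELS
```

The check is case-sensitive on the assumption that labels must be spelled exactly as in the list, e.g. "PosReg" rather than "posreg".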

NOTE: Important annotation guideline: only sentences that refer simultaneously to a specific mutation and to a biological function or disease are annotated in AGAC. See the thorough annotation guideline in xxxx.

Tasks

The AGAC track consists of three tasks: trigger word NER, thematic role extraction, and mutation-disease knowledge discovery. Participants may choose any of the tasks described below, but Task 2 requires Task 1, and Task 3 can be performed independently or based on Task 1 and Task 2.

Task 1: Trigger words NER

      • Recognize the trigger words in PubMed abstracts and annotate them with the correct trigger labels or entities (Var, MPA, Interaction, Pathway, CPA, Reg, PosReg, NegReg, Disease, Gene, Protein, Enzyme).

Task 2: Thematic roles identification

      • Identify the thematic roles (ThemeOf, CauseOf) between trigger words.

Task 3: "Gene;Function change;disease" link discovery

      • Extract the gene-function change-disease link. There are four kinds of function change that link a gene and a disease: Loss of Function (LOF), Gain of Function (GOF), Regulation (REG), and Complex (COM). LOF and GOF denote loss or gain of function, REG denotes a neutral or unknown link, and COM denotes a function change between the gene and the disease that is too complex to be classified as either LOF or GOF.
      • For example, consider the sentence "Mutations in SHP-2 phosphatase that cause hyperactivation of its catalytic activity have been identified in human leukemias, particularly juvenile myelomonocytic leukemia." From a biological point of view, hyperactivation of catalytic activity clearly describes a gain of function. Hence, the sentence carries the semantic information that the gene "SHP-2", after mutation, plays a "GOF" role related to the disease "juvenile myelomonocytic leukemia". Task 3 therefore requires participants to extract the following triple from this sentence: SHP-2;GOF;juvenile myelomonocytic leukemia.
      • Another sentence, "Lynch syndrome (LS) caused by mutations in DNA mismatch repair genes MLH1.", describes an association between the disease "Lynch syndrome" and the gene "MLH1", but the phrase "caused by" implies neither loss nor gain, so the triple from this sentence should be MLH1;REG;Lynch syndrome.
      • In a COM example, "Here, we describe a fourth case of a human with a de novo KCNJ6 (GIRK2) mutation, who presented with clinical findings of severe hyperkinetic movement disorder and developmental delay. Heterologous expression of the mutant GIRK2 channel alone produced an aberrant basal inward current that lacked G protein activation, lost K+ selectivity and gained Ca2+ permeability.", the description "lost K+ selectivity and gained Ca2+ permeability" shows both LOF and GOF, so the function change cannot be labeled as LOF or GOF but only as COM: GIRK2;COM;hyperkinetic movement disorder.
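The three examples above suggest a simple way to think about the four function-change labels in terms of the regulation triggers from Task 1. The sketch below is a naive heuristic written for illustration only; it is an assumption, not the organizers' method, and real COM decisions need biological judgment. The function names function_change and format_triple are hypothetical.

```python
def function_change(reg_labels):
    """Collapse regulation trigger labels into LOF / GOF / REG / COM.

    Naive illustrative heuristic (an assumption, not an official rule):
    both gain and loss evidence -> COM, gain only -> GOF,
    loss only -> LOF, anything else -> REG.
    """
    labels = set(reg_labels)
    has_gain = "PosReg" in labels
    has_loss = "NegReg" in labels
    if has_gain and has_loss:
        return "COM"
    if has_gain:
        return "GOF"
    if has_loss:
        return "LOF"
    return "REG"


def format_triple(gene, change, disease):
    """Join the three fields with semicolons, as in the examples above."""
    return f"{gene};{change};{disease}"
```

For instance, treating "hyperactivation" as a PosReg trigger, format_triple("SHP-2", function_change(["PosReg"]), "juvenile myelomonocytic leukemia") reproduces the triple from the first example.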

Sample data for Tasks 1, 2, and 3:

{ "target": "http://pubannotation.org/docs/sourcedb/PubMed/sourceid/25805808", "sourcedb": "PubMed", "sourceid": "25805808", "text": "Loss-of-function de novo mutations play an important role in severe human neural tube defects.\nBACKGROUND: Neural tube defects (NTDs) are very common and severe birth defects that are caused by failure of neural tube closure and that have a complex aetiology. Anencephaly and spina bifida are severe NTDs that affect reproductive fitness and suggest a role for de novo mutations (DNMs) in their aetiology.\nMETHODS: We used whole-exome sequencing in 43 sporadic cases affected with myelomeningocele or anencephaly and their unaffected parents to identify DNMs in their exomes.\nRESULTS: We identified 42 coding DNMs in 25 cases, of which 6 were loss of function (LoF) showing a higher rate of LoF DNM in our cohort compared with control cohorts. Notably, we identified two protein-truncating DNMs in two independent cases in SHROOM3, previously associated with NTDs only in animal models. We have demonstrated a significant enrichment of LoF DNMs in this gene in NTDs compared with the gene specific DNM rate and to the DNM rate estimated from control cohorts. We also identified one nonsense DNM in PAX3 and two potentially causative missense DNMs in GRHL3 and PTPRS.\nCONCLUSIONS: Our study demonstrates an important role of LoF DNMs in the development of NTDs and strongly implicates SHROOM3 in its aetiology.", "project": "AGAC2_PubMed_2", "denotations": [ { "id": "T8", "span": { "begin": 771, "end": 778 }, "obj": "Protein" }, { "id": "T7", "span": { "begin": 779, "end": 789 }, "obj": "NegReg" }, { "id": "T6", "span": { "begin": 790, "end": 794 }, "obj": "Var" }, { "id": "T9", "span": { "begin": 823, "end": 830 }, "obj": "Gene" }, { "id": "T10", "span": { "begin": 936, "end": 939 }, "obj": "NegReg" }, { "id": "T11", "span": { "begin": 940, "end": 944 }, "obj": "Var" }, { "id": "T12", "span": { "begin": 961, "end": 965 }, "obj": "Disease" }, { "id": "T3", "span": { "begin": 1224, "end": 1227 }, "obj": "NegReg" }, { "id": "T1", "span": { "begin": 1228, "end": 1232 }, "obj": "Var" }, { "id": "T2", "span": { "begin": 1255, "end": 1259 }, "obj": "Disease" }, { "id": "T5", "span": { "begin": 1284, "end": 1291 }, "obj": "Gene" } ], "relations": [ { "id": "R1", "pred": "CauseOf", "subj": "T1", "obj": "T3" }, { "id": "R10", "pred": "ThemeOf", "subj": "T12", "obj": "T10" }, { "id": "R11", "pred": "ThemeOf", "subj": "T5", "obj": "T1" }, { "id": "R2", "pred": "ThemeOf", "subj": "T2", "obj": "T3" }, { "id": "R5", "pred": "CauseOf", "subj": "T6", "obj": "T7" }, { "id": "R6", "pred": "ThemeOf", "subj": "T8", "obj": "T7" }, { "id": "R7", "pred": "ThemeOf", "subj": "T9", "obj": "T6" }, { "id": "R8", "pred": "ThemeOf", "subj": "T9", "obj": "T11" }, { "id": "R9", "pred": "CauseOf", "subj": "T11", "obj": "T10" } ]}


The data are in JSON format. "target" is the address of the annotated text. "sourcedb" is the database the text comes from; all texts in the AGAC corpus are from PubMed. "sourceid" is the PMID of the text. "text" contains the raw abstract.

"denotations" for Task 1:

"denotations" contains the trigger word annotations corresponding to Task 1. Each trigger word annotation has an "id"; a "span": its position in the abstract; an "obj": the trigger label it belongs to.

"relations" for Task 2:

"relations" contains the themetic roles between the trigger words, which corresponds to Task 2. Each relation contains an "id"; a "pred": the themetic roles; "subj" and "obj": the trigger word "id" that the relation associates, and the derection of the relation is from "subj" to "obj".

Note that Task 2 requires the result of Task 1.
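The dependency of Task 2 on Task 1 is visible in the data: each relation only stores denotation ids, so the denotations are needed to interpret it. The following sketch resolves the ids to surface strings and keeps the "subj" to "obj" direction. The toy document and the function name resolve_relations are invented for illustration, and end offsets are again assumed to be exclusive.

```python
# Toy document in the sample layout; text, offsets, and relations are invented.
doc = {
    "text": "Mutant R282W inhibited binding.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 12}, "obj": "Var"},
        {"id": "T2", "span": {"begin": 13, "end": 22}, "obj": "NegReg"},
        {"id": "T3", "span": {"begin": 23, "end": 30}, "obj": "Interaction"},
    ],
    "relations": [
        {"id": "R1", "pred": "CauseOf", "subj": "T1", "obj": "T2"},
        {"id": "R2", "pred": "ThemeOf", "subj": "T3", "obj": "T2"},
    ],
}


def resolve_relations(doc):
    """Return (subject string, role, object string) for every relation."""
    text = doc["text"]
    # Map each denotation id to the text it covers.
    surface = {
        d["id"]: text[d["span"]["begin"]:d["span"]["end"]]
        for d in doc["denotations"]
    }
    return [
        (surface[r["subj"]], r["pred"], surface[r["obj"]])
        for r in doc["relations"]
    ]


edges = resolve_relations(doc)
```

A relation that points to an id missing from "denotations" raises a KeyError here, which is a convenient way to catch inconsistent predictions.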

Triples for Task 3:

25805808;SHROOM3;LOF;Neural tube defects

The triple shown above is the Task 3 result that should be extracted from the sample text.

The format of triples is:

pmid;gene;function change;disease
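A triple line in this format can be split into its four fields as sketched below. The sketch assumes that no field contains a semicolon and that the function change is one of the four codes LOF, GOF, REG, and COM; the Triple type and the parse_triple function are hypothetical names.

```python
from typing import NamedTuple

FUNCTION_CHANGES = {"LOF", "GOF", "REG", "COM"}


class Triple(NamedTuple):
    pmid: str
    gene: str
    function_change: str
    disease: str


def parse_triple(line: str) -> Triple:
    """Parse one 'pmid;gene;function change;disease' line.

    Assumes that fields never contain ';' themselves.
    """
    parts = [p.strip() for p in line.strip().split(";")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    triple = Triple(*parts)
    if triple.function_change not in FUNCTION_CHANGES:
        raise ValueError(f"unknown function change: {triple.function_change!r}")
    return triple


example = parse_triple("25805808;SHROOM3;LOF;Neural tube defects")
```

The PMID is kept as a string so that the line can be written back out unchanged.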

Submission format

Task 1: Please submit a JSON file in the same format as the example above, excluding the "relations" section.

Task 2: Please submit a JSON file in the same format as the example above.

Task 3: Please submit the triples as plain text, one triple per line.
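The submission formats above could be produced along the following lines. This is a sketch under assumptions: the helper names task1_submission and task3_submission are invented, and whether the final line of the Task 3 file should end with a newline is a guess.

```python
import json


def task1_submission(doc: dict) -> str:
    """Serialize a document for Task 1: same layout, without "relations"."""
    kept = {key: value for key, value in doc.items() if key != "relations"}
    return json.dumps(kept, ensure_ascii=False)


def task3_submission(triples) -> str:
    """Render (pmid, gene, function change, disease) tuples, one per line."""
    return "\n".join(";".join(t) for t in triples) + "\n"


# Toy inputs for illustration only.
toy_doc = {
    "sourcedb": "PubMed",
    "sourceid": "25805808",
    "text": "toy text",
    "denotations": [],
    "relations": [],
}
task1_text = task1_submission(toy_doc)
task3_text = task3_submission(
    [("25805808", "SHROOM3", "LOF", "Neural tube defects")]
)
```

For Task 2 the document would be serialized unchanged, i.e. with both "denotations" and "relations" present.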

!!!Result submission template.