BioCreative/OHNLP Challenge 2018

The application of Natural Language Processing (NLP) methods and resources to clinical and biomedical text has received growing attention in recent years, but progress has been limited by the difficulty of accessing shared tools and resources, partly because of patient privacy and data confidentiality constraints. Efforts to increase the sharing and interoperability of the few existing resources are needed to enable the kind of progress observed in the general NLP domain. Leveraging our research in corpus analysis and de-identification, we have created multiple synthetic data sets, based on real clinical sentences, for a couple of NLP tasks. We are organizing a challenge workshop to promote community efforts toward advancing clinical NLP.

The challenge workshop will have two tasks:

1. Family History Information Extraction

2. Clinical Semantic Textual Similarity

Task 1 – Family History Extraction

The fact that many care process models use family history (FH) information highlights the importance of FH in decision-making for diagnosis and treatment. However, acquiring accurate and complete FH information remains challenging for the clinical NLP community. The main source of FH data is Patient Provided Information (PPI) questionnaires, which are usually stored in semi-structured or unstructured format in electronic health records. To provide comprehensive patient-provided FH data to physicians, there is a need for NLP systems that can extract FH from text. The elements of FH data are not predetermined or limited; they depend on the information that patients provide about their relatives’ health during visits. FH elements may include disease, family member, cause, medication, age of onset, length of disease, etc. This variety makes extraction from unstructured data challenging. Although several systems have been proposed and implemented for this purpose, their number is quite limited. To address this issue, we plan to organize a shared task and encourage researchers in relevant areas to propose and develop FH extraction (FHE) systems.


We divided the challenge into two subtasks:

1) Entity identification (family members and disease names)

2) Family history extraction: participating systems are expected to extract family members and their corresponding observations.
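To make the two subtasks concrete, here is a minimal sketch of a dictionary-based baseline. It is only an illustration under stated assumptions: the term lists, the function names, and the rule of pairing every family member with every observation in the same sentence are all hypothetical, and none of them reflect the challenge's annotation guidelines or any participant's system.

```python
import re

# Hypothetical, deliberately tiny lexicons (assumptions for illustration only).
FAMILY_MEMBERS = ["mother", "father", "sister", "brother",
                  "grandmother", "grandfather", "aunt", "uncle"]
OBSERVATIONS = ["diabetes", "breast cancer", "hypertension", "heart disease"]


def find_terms(sentence, lexicon):
    """Subtask 1 sketch: return (term, start, end) for each whole-word match."""
    hits = []
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((term, match.start(), match.end()))
    # Order matches by their position in the sentence.
    return sorted(hits, key=lambda hit: hit[1])


def extract_family_history(text):
    """Subtask 2 sketch: pair each family member with each observation
    found in the same sentence (a naive co-occurrence heuristic)."""
    pairs = []
    # Naive sentence splitting on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        members = find_terms(sentence, FAMILY_MEMBERS)
        observations = find_terms(sentence, OBSERVATIONS)
        for member, _, _ in members:
            for observation, _, _ in observations:
                pairs.append((member, observation))
    return pairs


if __name__ == "__main__":
    note = ("Her mother had breast cancer. "
            "The patient's father has diabetes and hypertension.")
    print(extract_family_history(note))
```

A heuristic like this breaks down quickly on real notes (negation, several relatives in one sentence, unseen disease names), which is part of why the extraction problem is hard.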

To participate in Task 1: !forum/ohnlp2018

Task 2 – Clinical Semantic Textual Similarity

The wide adoption of electronic health records (EHRs) has provided a way to electronically document patients’ medical conditions, thoughts, and actions among the care team. While the use of EHRs has led to improvements in the quality of healthcare, it has also introduced new challenges. One such challenge is the growing use of copy-and-paste, templates, and smart phrases, which are easy to use but lead to bloated clinical notes and poorly organized or erroneous documentation, among many other problems. EHRs are no longer optimized for tracking multiple complex medical problems or for maintaining the continuity and quality of the clinical decision-making process. There is a growing need for automated methods that better synthesize patient data from EHRs and reduce the cognitive burden on providers during clinical decision-making. Patient data can be scattered across several heterogeneous sources. Tools are desired that can aggregate data from diverse sources, minimize data redundancy, and organize and present the data in a user-friendly way.

One necessary task for extracting and consolidating information is computing the semantic similarity between text snippets. In the general English domain, the SemEval Semantic Textual Similarity (STS) shared tasks have been organized since 2012 to develop automated methods for this task. Clinical text contains highly domain-specific terminology, and thus domain-specific NLP tools and resources are needed for the analysis, interpretation, and management of clinical text. In the clinical domain, there is no existing resource for the study of STS.

Constructing a dataset by gathering naturally occurring pairs of sentences with different degrees of semantic equivalence is itself a very challenging task. The objective of this shared task is to build systems for clinical STS.
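As a rough illustration of what scoring the similarity of two text snippets involves, the sketch below computes a token-overlap (Jaccard) score. This is a simple lexical baseline chosen for illustration, not the method used by the organizers or by any participating system. The function names and the 0 to 1 output range are assumptions and do not reflect the challenge's scoring scale.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(sentence_a, sentence_b):
    """Return |A & B| / |A | B| over the two token sets (0.0 to 1.0)."""
    tokens_a = tokenize(sentence_a)
    tokens_b = tokenize(sentence_b)
    if not tokens_a and not tokens_b:
        # Two empty snippets are treated as identical by convention.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    a = "Take one tablet daily"
    b = "take one tablet by mouth daily"
    print(round(jaccard_similarity(a, b), 3))
```

A purely lexical score like this misses paraphrases that share few words, which is one reason domain-specific resources matter for clinical STS.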

To participate in Task 2: !forum/ohnlp2018sts

Timeline of the challenges:

· Registration: June 1st, 2018

· Sample data release: June 1st, 2018

· Training data release: June 15th, 2018

· Test data release: August 1st, 2018

· Paper submission: August 15th, 2018

Location and Time

ACM-BCB 2018, which will be held on August 29th, 2018 at JW Marriott Washington DC.

Workshop Proceedings

Task 1: Family History Extraction

  1. Overview of the BioCreative/OHNLP 2018 Family History Extraction Task. Sijia Liu, Majid Rastegar-Mojarad, Yanshan Wang, Liwei Wang, Feichen Shen, Sunyang Fu, Hongfang Liu. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  2. Efficient rule-based approaches for tagging named entities and relations in clinical text. Dongmin Kim, Soo-Yong Shin, Hee-Woong Lim and Sun Kim. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  3. Hybrid Approach for End-to-End Entity Recognition and Entity Linking using CRFs and Dependency Parsing. Anshik ., Vinit Gela, Sagar Madgi. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  4. A combined rule-based and statistical approach to family history extraction from unstructured text. Emily Tseng, Jacob Lee. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  5. Family History Information Extraction Via Joint Deep Learning. Xue Shi, Dehuan Jiang, Yuanhang Huang, Xiaolong Wang, Qingcai Chen, Jun Yan, Buzhou Tang. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  6. Family History Information Extraction with Neural Sequence Labeling Model. Feng-Duo Wang, Chen-Kai Wang, Hong-Jie Dai. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)

Task 2: Clinical Semantic Textual Similarity

  1. Overview of BioCreative/OHNLP Challenge 2018 Task 2: Clinical Semantic Textual Similarity. Yanshan Wang, Naveed Afzal, Sijia Liu, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Sunyang Fu, Hongfang Liu. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  2. Combining rich features and deep learning for finding similar sentences in electronic medical records. Qingyu Chen, Jingcheng Du, Sun Kim, W. John Wilbur and Zhiyong Lu. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  3. A Hybrid System for Clinical Semantic Textual Similarity. Ying Xiong, Shuai Chen, Yedan Shen, Xiaolong Wang, Qingcai Chen, Jun Yan, Buzhou Tang. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)
  4. Correcting the Common Discourse Bias in Linear Representation of Sentences using Conceptors. Tianlin Liu, Joao Sedoc, Lyle Ungar. Proceedings of BioCreative/OHNLP Challenge 2018. (pdf)



Organizers and Contact information

Majid Rastegar-Mojarad (mojarad.majid at mayo dot edu)

Sijia Liu (Liu.Sijia at mayo dot edu)

Yanshan Wang (Wang.Yanshan at mayo dot edu)

Naveed Afzal (Afzal.Naveed at mayo dot edu)

Liwei Wang (Wang.Liwei at mayo dot edu)

Feichen Shen (Shen.Feichen at mayo dot edu)

Sunyang Fu (fu.sunyang at mayo dot edu)

Hongfang Liu (Liu.Hongfang at mayo dot edu)

Feel free to contact us for more information, and join the group for future updates.