EHRSQL 2024

Reliable Text-to-SQL Modeling on Electronic Health Records

NAACL 2024 - Clinical NLP Shared Task

Motivation

Electronic Health Records (EHRs) are relational databases that store the entire medical histories of patients within hospitals. They record numerous aspects of patients' medical care, from admission and diagnosis to treatment and discharge. While EHRs are vital sources of clinical data, exploring them beyond a predefined set of queries or requests requires skills in query languages like SQL. To simplify access to EHR data, one straightforward strategy is to build a question-answering system, specifically leveraging text-to-SQL models that can automatically convert natural language questions into corresponding SQL queries and use the queries to retrieve answers.

The goal of this shared task is to build a reliable text-to-SQL model for an EHR database, specifically the MIMIC-IV Demo database [1]. The model should answer a question (by generating an accurate SQL query) only when it is confident in its prediction and abstain otherwise, regardless of whether the input question is intrinsically answerable or unanswerable. The input questions cover diverse topics relevant to clinical settings (e.g., patient demographics, vital signs, and disease survival rates) [2], as well as questions that are unanswerable given the database schema (e.g., asking about today's weather) or that go beyond SQL functionality (e.g., drawing a plot). Successfully solving this task will allow healthcare experts, including physicians, nurses, and researchers, to freely explore EHRs using natural language, significantly reducing the burden of retrieving and synthesizing information across multiple tables in EHRs.
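The answer-or-abstain behavior described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `patients` table is not the MIMIC-IV Demo schema, `toy_model` is a hypothetical stand-in for a real text-to-SQL model, the confidence threshold of 0.5 is arbitrary, and returning `None` is only one possible way to represent abstention. It is not the official baseline or the task's scoring code.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def toy_model(question: str) -> Tuple[str, float]:
    """Hypothetical text-to-SQL model: returns (sql, confidence).

    A real system would generate the SQL and estimate its own confidence;
    here both are canned so the example is self-contained.
    """
    canned = {
        "how many patients are there?": ("SELECT COUNT(*) FROM patients", 0.95),
        "what is the weather today?": ("SELECT weather FROM patients", 0.10),
    }
    # Unknown questions get an empty query with zero confidence.
    return canned.get(question.lower(), ("", 0.0))


def answer_or_abstain(
    conn: sqlite3.Connection,
    question: str,
    model: Callable[[str], Tuple[str, float]] = toy_model,
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> Optional[List[tuple]]:
    """Return query results when the model is confident, else None (abstain)."""
    sql, confidence = model(question)
    if not sql or confidence < threshold:
        return None  # abstain: no query, or confidence too low
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # abstain: the generated SQL does not run on this schema


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a toy table (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patients (subject_id INTEGER PRIMARY KEY, gender TEXT)"
    )
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "F")]
    )
    return conn
```

In this sketch, a confident prediction is executed and its rows are returned, while a low-confidence prediction, an empty prediction, or a query that fails to execute all lead to abstention. How confidence is actually estimated is left open here and is up to each participating system.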

[1] Johnson, Alistair, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. "MIMIC-IV" (version 2.2). PhysioNet (2023). https://physionet.org/content/mimiciv/2.2.

[2] Lee, Gyubok, Hyeonji Hwang, Seongsu Bae, Yeonsu Kwon, Woncheol Shin, Seongjun Yang, Minjoon Seo, Jong-Yeup Kim, and Edward Choi. "EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records." Advances in Neural Information Processing Systems 35 (2022): 15589-15601. https://github.com/glee4810/EHRSQL.

Registration, Dataset, and Evaluation

Registration

Accept the Terms and Conditions in the Codabench project (https://www.codabench.org/competitions/1889)
Comply with the dataset licenses: EHRSQL (CC-BY-4.0) and MIMIC-IV-Demo (Open Data Commons Open Database License v1.0)

Dataset: https://github.com/glee4810/ehrsql-2024

Evaluation: https://www.codabench.org/competitions/1889

Schedule

All deadlines are 11:59PM UTC-12:00 (Anywhere on Earth), unless stated otherwise

Registration opens: Monday January 29, 2024
Training and validation data release: Monday January 29, 2024
Test data release: Tuesday March 26, 2024
Run submission due: Thursday March 28, 2024 (11:59PM UTC)
Code submission and fact sheet deadline: Friday March 29, 2024
Final result release: Monday April 1, 2024
Paper submission period starts: Monday April 8, 2024
Paper submission due: Wednesday April 10, 2024
Notification of acceptance: Thursday April 18, 2024
Final versions of papers due: Wednesday April 24, 2024
Clinical NLP Workshop @ NAACL 2024: June 22, 2024, Mexico City, Mexico

Contact

For updates, join our Google group: https://groups.google.com/g/ehrsql-2024

Organizers

The organizers are from EdLab @ KAIST.