The data sources are selected to replicate a realistic hospital setting: practitioners can provide a few examples of the CRF they aim to automate and many examples of the clinical notes they handle, while researchers can find, from other sources, CRF examples that are to some degree similar to the use case at hand.
All datasets are released at this NLP-FBK HuggingFace collection*.
*The NLP-FBK organisation also hosts resources that are unrelated to the CRF filling task. The only collection that needs to be accessed is 'CRF:filling SharedTask @ CL4Health2026'.
The objective of the participants is to populate the Dyspnea CRF for the patients in the development and test datasets. The Dyspnea CRF is a list of 134 items, released here: NLP-FBK/dyspnea-valid-options. The Dyspnea CRF is in English.
To train systems to fill the Dyspnea CRF, participants are provided with three training datasets:
10 gold-standard pairs of clinical note + filled Dyspnea CRF
link: NLP-FBK/dyspnea-crf-train
description: Examples that showcase the task. They can be used for few-shot settings, as a basis for data augmentation, etc.
71 for English (80 for Italian) semi-automatically annotated pairs of clinical note + filled CRF
link: NLP-FBK/synthetic-crf-train
description: Intended as examples of how the CRF filling task can be performed in domains other than Dyspnea.
2667 unannotated clinical notes about patients with Dyspnea
link: NLP-FBK/dyspnea-clinical-notes
description: Data without CRF annotation. Intended to provide participants with knowledge about Dyspnea.
In addition, we release a dataset for development purposes:
80 gold-standard pairs of clinical note + filled Dyspnea CRF
link: NLP-FBK/dyspnea-crf-development
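As a minimal sketch, the datasets listed above could be loaded with the Hugging Face `datasets` library as follows. The role names used as dictionary keys and the `load_split` helper are hypothetical. The sketch also assumes that every repository exposes one split per language, named `en` and `it`, and that the `datasets` package is installed.

```python
# Repository ids as listed above; the keys are hypothetical labels chosen for this sketch.
DATASET_REPOS = {
    "gold_train": "NLP-FBK/dyspnea-crf-train",
    "synthetic_train": "NLP-FBK/synthetic-crf-train",
    "unannotated_notes": "NLP-FBK/dyspnea-clinical-notes",
    "gold_development": "NLP-FBK/dyspnea-crf-development",
}

# Assumption: each repository has one split per language, named "en" and "it".
LANGUAGES = ("en", "it")


def load_split(role, language="en"):
    """Load one language split of one dataset from the Hugging Face Hub."""
    if role not in DATASET_REPOS:
        raise KeyError(f"unknown dataset role: {role!r}")
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    # Imported lazily so the mapping above can be inspected without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_REPOS[role], split=language)
```

With network access, `load_split("gold_train", "it")` would then return the Italian split of the gold-standard training pairs, provided the assumptions above hold.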
The gold-standard pairs are the ones that describe the task at hand, while the unannotated clinical notes about Dyspnea patients and the semi-automatically annotated pairs are provided as extra data, potentially helpful in solving the task, but not strictly necessary.
The datasets for Dyspnea patients were collected from San Giovanni Bosco (SGB) Hospital, Turin, Italy, within the context of the eCream project.
[1] Pietro Ferrazzi, Alberto Lavelli, and Bernardo Magnini. 2025. Converting Annotated Clinical Cases into Structured Case Report Forms. In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 307–318, Vienna, Austria. Association for Computational Linguistics.
Clinical notes from the San Giovanni Bosco hospital have been anonymised to preserve privacy. All identifying information (family members, locations, names, phone numbers, etc.) has been replaced by placeholders.
DATA TRANSLATION
There are three data sources:
the gold-standard pairs of clinical note + filled Dyspnea CRF;
the clinical notes about patients with Dyspnea;
the semi-automatically annotated pairs of clinical note + filled CRF.
The semi-automatically annotated pairs are natively bilingual, while the gold-standard pairs and the unannotated clinical notes were collected in Italian and then translated into English:
the gold-standard clinical notes have been manually translated into English by professional translators;
the unannotated clinical notes have been automatically translated into English using GPT-5. Translation quality has been evaluated via back-translation.
Below, we show an example for each of the three training datasets.
NLP-FBK/dyspnea-crf-train: Gold-standard pairs of clinical note + filled Dyspnea CRF
This dataset contains the annotated training CRFs for the CRF:filling Shared Task at CL4Health2026. The clinical notes have been collected, anonymized and annotated at the San Giovanni Bosco (SGB) hospital, Turin, Italy. There are two splits, each representing a different language: en (English) and it (Italian).
Each example (10 in total) in the dataset is composed of:
document_id: clinical note identifier
clinical_note: the note reporting on the patient's clinical history
annotations: the CRF items with their ground_truth labels
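Since these pairs are meant to support few-shot settings, the sketch below shows one way a record with the three fields above could be turned into a few-shot prompt. It assumes that `annotations` is a list of dictionaries with the keys `item` and `ground_truth`; the `item` key, the prompt layout, the helper names and the toy record are all hypothetical and should be checked against the released data.

```python
def format_crf(annotations):
    """Render filled CRF items as 'item: value' lines."""
    # Assumption: each annotation is a dict with "item" and "ground_truth" keys.
    return "\n".join(f"{a['item']}: {a['ground_truth']}" for a in annotations)


def build_few_shot_prompt(examples, new_note):
    """Concatenate gold (note, CRF) pairs, then append the note still to be filled."""
    blocks = [
        f"Clinical note:\n{ex['clinical_note']}\nCRF:\n{format_crf(ex['annotations'])}"
        for ex in examples
    ]
    blocks.append(f"Clinical note:\n{new_note}\nCRF:")
    return "\n\n".join(blocks)


# Toy record with invented content, not taken from the dataset.
toy_record = {
    "document_id": "toy-1",
    "clinical_note": "Patient reports shortness of breath.",
    "annotations": [{"item": "example_item", "ground_truth": "yes"}],
}

prompt = build_few_shot_prompt([toy_record], "Patient denies chest pain.")
print(prompt)
```

The prompt ends with an open `CRF:` field, so a generative model would be expected to continue with the filled items for the new note.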
NLP-FBK/dyspnea-clinical-notes: Clinical notes about patients with Dyspnea
This dataset contains the unannotated training clinical notes for the CRF:filling Shared Task at CL4Health2026. The clinical notes have been collected and anonymized at the San Giovanni Bosco (SGB) hospital, Turin, Italy. There are two splits, each representing a different language: en (English) and it (Italian). English data has been automatically translated from Italian.
Each example (2667 in total) in the dataset is composed of:
document_id: clinical note identifier
clinical_note: the note reporting on the patient's clinical history
NLP-FBK/synthetic-crf-train: Semi-automatically annotated pairs of clinical note + filled CRF
This dataset contains the synthetic Case Report Forms built for 71/80 (English/Italian) patients with different medical conditions. Each example in the dataset is composed of:
document_id: patient/clinical note identifier coming from the original E3C dataset
clinical_note: the note reporting on the patient's clinical history
annotations: the CRF items with their ground_truth labels
crf_type: the identifier of the medical condition that defines the items in the CRF. Patients with the same crf_type have the exact same CRF items.
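Because patients with the same crf_type share the same CRF items, records can be grouped by crf_type to recover the item list of each CRF. The sketch below does this on toy records and raises an error if two records with the same crf_type disagree. As before, it assumes that `annotations` is a list of dictionaries with an `item` key; the helper name and the toy data are hypothetical.

```python
def items_by_crf_type(records):
    """Map each crf_type to its item names, checking they match across records."""
    seen = {}
    for rec in records:
        # Assumption: each annotation is a dict that names its CRF item under "item".
        items = frozenset(a["item"] for a in rec["annotations"])
        previous = seen.setdefault(rec["crf_type"], items)
        if previous != items:
            raise ValueError(f"inconsistent items for crf_type {rec['crf_type']!r}")
    return {crf_type: sorted(items) for crf_type, items in seen.items()}


# Toy records with invented content, not taken from the dataset.
toy_records = [
    {"crf_type": "type_a", "annotations": [{"item": "fever"}, {"item": "cough"}]},
    {"crf_type": "type_a", "annotations": [{"item": "cough"}, {"item": "fever"}]},
    {"crf_type": "type_b", "annotations": [{"item": "pain"}]},
]

crf_items = items_by_crf_type(toy_records)
print(crf_items)
```

A check of this kind can be useful before building one prompt template or one classifier head per crf_type.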