The data consists of 513 ultrasonography reports provided by a pediatric hospital in Argentina. The reports are unstructured and contain many orthographic and grammatical errors. They have been anonymized to remove patient IDs, names, and the enrollment numbers of the physicians.
Reports were annotated by clinical experts and then revised by linguists. Annotation guidelines and training were provided for both rounds of annotation. Automatic classifiers are expected to perform well in cases where human annotators strongly agree, and worse in cases that human annotators find difficult to identify consistently.
Annotations are provided in brat format.
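As an illustration of how such annotations can be read, here is a minimal sketch of a parser for brat standoff (.ann) content. It handles only text-bound entities (T lines) and binary relations (R lines). The sample annotation and the labels Finding, Location and Located are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    id: str
    label: str
    arg1: str
    arg2: str


def parse_brat(ann_text):
    """Parse entity (T) and relation (R) lines of a brat .ann file."""
    entities, relations = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("T"):
            # Format: ID <TAB> LABEL START END[;START END...] <TAB> TEXT
            ann_id, meta, text = line.split("\t", 2)
            label, offsets = meta.split(" ", 1)
            spans = [tuple(int(n) for n in frag.split())
                     for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, label, spans, text)
        elif line.startswith("R"):
            # Format: ID <TAB> LABEL Arg1:Tn Arg2:Tm
            ann_id, meta = line.split("\t")[:2]
            label, arg1, arg2 = meta.split()
            relations[ann_id] = Relation(
                ann_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1]
            )
    return entities, relations


# Hypothetical annotation of the text "quistes en riñón"
sample = (
    "T1\tFinding 0 7\tquistes\n"
    "T2\tLocation 11 16\triñón\n"
    "R1\tLocated Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_brat(sample)
```

Other brat annotation types (events, attributes, notes) are skipped by this sketch and would need their own branches.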
Further descriptions of the corpus can be found in the following publications:
[Cotik et al. 2017] Viviana Cotik, Darío Filippo, Roland Roller, Hans Uszkoreit, Feiyu Xu. Annotation of Entities and Relations in Spanish Radiology Reports. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), 2017.
[Cotik 2018] Viviana Cotik. Information extraction from Spanish radiology reports. PhD thesis, Chapter 5, specifically 5.3.3. Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, 2018.
Since reports are highly repetitive, half of the annotated corpus will be used as the test set for evaluation.
Two partitions are distinguished within the test set: a held-out partition, containing words that do not occur in the training corpus, and a same-sample partition without that restriction, in which out-of-vocabulary words may occur but their frequency is not deliberately skewed.
The held-out test partition has been created by identifying terms belonging to a given semantic field within the reports, and selecting all reports containing those terms.
A held-out development partition has been created with reports containing terms that occur neither in the training set nor in the held-out test set.
The annotated dataset is therefore composed as follows:
Training set (175 reports)
Same-sample development set (47 reports)
Held-out development set (45 reports)
Held-out test set (207 reports)
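The held-out selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function split_held_out, the example reports and the term list are hypothetical, and matching is done on lowercased whole words, which may differ from the procedure actually used to build the corpus.

```python
import re


def split_held_out(reports, held_out_terms):
    """Separate reports containing any held-out term from the rest.

    reports: dict mapping report id -> report text
    held_out_terms: terms from the chosen semantic field (assumed single words)
    Returns (held_out_ids, remaining_ids).
    """
    terms = {t.lower() for t in held_out_terms}
    held_out, remaining = [], []
    for report_id, text in reports.items():
        # \w matches accented letters and ñ in Python 3 str patterns
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & terms:
            held_out.append(report_id)
        else:
            remaining.append(report_id)
    return held_out, remaining


# Hypothetical reports and semantic-field terms
reports = {
    "r1": "Riñón derecho de forma y tamaño normal.",
    "r2": "Vesícula de paredes finas, sin litiasis.",
    "r3": "Hígado homogéneo. Vesícula alitiásica.",
}
terms = ["vesícula", "litiasis"]
held_out, remaining = split_held_out(reports, terms)
```

Because every report that mentions a held-out term is moved out, none of those terms can appear in the remaining reports, which is what makes them unseen at training time.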
To download the development and test corpora, fill in this form.
PadChest: A large chest x-ray image dataset with multi-label annotated reports (Spanish)
MIMIC Chest X-ray (MIMIC-CXR) Database (English)
CLEF eHealth 2020 – Task 1: Multilingual Information Extraction (CodiEsp)
IBERLEF eHealth-KD 2018 (TASS 2018)
Disability annotation on documents from the biomedical domain
Biomedical Abbreviation Recognition and Resolution (BARR)
Biomedical Abbreviation Recognition and Resolution 2nd Edition (BARR2)
Portuguese and Spanish RadLex http://archive.rsna.org/2007/5004975.html
Snomed CT terminology (see Argentinian edition 2020-11-30)
MeSH in Spanish
ICD10 in Spanish
Spanish NEGEX lexicon for the radiology domain
Spanish radiology embeddings based on skipgram and radiology reports (please cite [1] if you use the data)
Spanish embeddings
Spanish medical embeddings based on SCIELO and Wikipedia
[1] Javier Minces Müller. Detección de relaciones en informes medicos escritos en español. Graduate thesis, Departamento de Computación, FCEyN, Universidad de Buenos Aires. Advisor: Viviana Cotik.