Data

The data consists of 513 ultrasonography reports provided by a pediatric hospital in Argentina. The reports are unstructured and contain many orthographic and grammatical errors. They have been anonymized to remove patient IDs and the names and enrollment numbers of the physicians.

Reports were annotated by clinical experts and then revised by linguists. Annotation guidelines and training were provided for both rounds of annotation. Automatic classifiers are expected to perform well in cases where human annotators agree strongly, and worse in cases that human annotators find difficult to identify consistently.

Annotations are provided in brat format.
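For readers unfamiliar with brat, annotations live in standoff `.ann` files next to the report text: text-bound entities are lines starting with `T`, and binary relations are lines starting with `R`. The following is a minimal sketch of a reader for those two line types, not an official loader for this corpus. The function name `parse_ann` and the labels in the sample (`Finding`, `Location`, `Located`) are illustrative assumptions and may not match the labels used in the corpus.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    id: str
    label: str
    args: dict  # role name -> entity id, e.g. {"Arg1": "T1"}


def parse_ann(content):
    """Parse entity (T) and relation (R) lines of a brat .ann file.

    Other line types (attributes, notes, events) are skipped.
    """
    entities, relations = {}, {}
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            label, offsets = fields[1].split(" ", 1)
            # Discontinuous entities separate their fragments with ';'.
            spans = [tuple(int(x) for x in part.split())
                     for part in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(ann_id, label, spans, text)
        elif ann_id.startswith("R"):
            parts = fields[1].split()
            args = dict(p.split(":", 1) for p in parts[1:])
            relations[ann_id] = Relation(ann_id, parts[0], args)
    return entities, relations


# Hypothetical sample; labels are placeholders, not the corpus tag set.
SAMPLE = (
    "T1\tFinding 0 6\tquiste\n"
    "T2\tLocation 10 15\triñón\n"
    "R1\tLocated Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_ann(SAMPLE)
```

In practice the content would come from reading each `.ann` file with UTF-8 encoding, and the offsets index into the matching `.txt` file.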

Further descriptions of the corpus can be found in the following publications:

[Cotik et al. 2017] Viviana Cotik, Darío Filippo, Roland Roller, Hans Uszkoreit, Feiyu Xu. Annotation of Entities and Relations in Spanish Radiology Reports. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), 2017 [PDF]

[Cotik 2018] Viviana Cotik. Information extraction from Spanish radiology reports. PhD thesis, Chapter 5, specifically 5.3.3. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales, 2018. [PDF]

Training, Development and Test partitions of the corpus

Since reports are highly repetitive, half of the annotated corpus will be used as the test set for evaluation.

Held-out and Same-sample partitions

Two partitions are distinguished within the test set: a held-out partition containing words that do not occur in the training corpus, and a same-sample partition with no such restriction, in which out-of-vocabulary words may still occur but their frequency is not biased by the selection.
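One way to see the difference between the two partitions is to measure how many test tokens never appear in the training reports. The sketch below computes such an out-of-vocabulary rate under simple assumptions: lowercased tokens matched by `\w+`, and a hypothetical function name `oov_rate`. It illustrates the idea and is not the procedure used to build the corpus.

```python
import re

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return [tok.lower() for tok in TOKEN.findall(text)]


def oov_rate(train_texts, test_texts):
    """Fraction of test tokens that do not occur in the training texts."""
    train_vocab = {tok for text in train_texts for tok in tokenize(text)}
    test_tokens = [tok for text in test_texts for tok in tokenize(text)]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in train_vocab)
    return unseen / len(test_tokens)


# Toy example with made-up report snippets.
train = ["hígado normal"]
test = ["hígado normal", "bazo normal"]
rate = oov_rate(train, test)  # 1 unseen token ("bazo") out of 4
```

Under this measure, a held-out partition would be expected to show a higher rate than a same-sample partition drawn from the same reports as the training set.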

The held-out test partition has been created by identifying terms belonging to a given semantic field within the reports, and selecting all reports containing those terms.
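The selection described above can be sketched as follows, assuming the reports are available as a mapping from report ID to text and the semantic field is given as a list of terms. The function name `split_held_out` and the example terms are hypothetical; the terms actually used for the corpus are not stated here.

```python
import re


def split_held_out(reports, field_terms):
    """Split report IDs into (held_out, remaining).

    reports: dict mapping report id -> report text
    field_terms: terms of the chosen semantic field; any report that
    contains at least one of them as a whole word goes to held_out.
    """
    terms = [t for t in field_terms if t]
    if not terms:
        return set(), set(reports)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    held_out = {rid for rid, text in reports.items() if pattern.search(text)}
    return held_out, set(reports) - held_out


# Toy example with made-up report snippets and terms.
reports = {
    "r1": "Vesícula de paredes finas.",
    "r2": "Riñones de tamaño normal.",
    "r3": "Hígado homogéneo.",
}
held_out, remaining = split_held_out(reports, ["vesícula", "hígado"])
```

Matching on whole words avoids selecting a report only because a term appears inside a longer, unrelated word. Inflected forms would need to be listed explicitly or handled with lemmatization.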

A held-out development partition has been created from reports containing terms that occur neither in the training set nor in the held-out test set.

Thus the annotated dataset is constituted as follows:

Training corpus

Training set (175 reports)

Development corpus

Same-sample development set (47 reports)

Held-out development set (45 reports)

Test corpus

Held-out test set (207 reports)

To download the development and test corpora, fill out this form.

Additional resources

SNOMED CT terminology (see the Argentinian edition of 2020-11-30)

MeSH in Spanish

ICD10 in Spanish

Spanish NEGEX lexicon for the radiology domain

Embeddings and Language Models

Spanish radiology embeddings trained with skip-gram on radiology reports (please cite [1] if you use the data)

Spanish embeddings

Spanish medical embeddings based on SCIELO and Wikipedia

[1] Javier Minces Müller. Detección de relaciones en informes medicos escritos en español. Graduate thesis. Departamento de Computación, FCEyN, Universidad de Buenos Aires. Advisor: Viviana Cotik.
