SpanishCorpusforOutbreakDetection

A Corpus for Outbreak Detection of Diseases Prevalent in Latin America

(Corpus para detección de brotes de enfermedades prevalentes en Latinoamérica)

Dataset for named entity recognition and relation extraction on Spanish ProMED-mail articles

About the Dataset

We present an annotated corpus, which can be used for training and testing algorithms to automatically extract information about disease outbreaks from newspaper articles. The source for the corpus is Spanish articles from ProMED-mail [3], a publicly available reporting system for emerging diseases and outbreaks.

The corpus has been constructed with two main tasks in mind. The first is to extract entities about disease outbreaks, such as disease, date, geographic location, number of cases, host, origin (the reported cause of the disease), transmission form of the disease (e.g. bite), and modifiers such as negation and uncertainty terms. The second is to retrieve relations among two or three entities, for instance among: a disease and the geographic location where it occurs; the number of cases and a disease or a geographic location, date, and cause of a disease; a disease and the host it affects; and speculation and negation terms associated with some entity types.

This is the corpus used in [2] and [6]. Its construction and a previous version are described in [1]. It was generated as part of Antonella Dellanzo's undergraduate thesis [6].

Please cite our work ([1,2]) if you use the data.

Creation of the dataset

To construct the corpus, we downloaded articles from ProMED-mail [3], a reporting system dedicated to the rapid dissemination of information on epidemics of infectious diseases, among others. Articles published on ProMED-mail are edited by an interdisciplinary staff, based on journalistic notes from different media. Each ProMED-mail article consists of a title, a date, the main text, and metadata (such as the source and editor of the article).

Since titles of the articles are informative and easier to process and annotate than whole articles, we decided to work with:

1) only the title and the date of the article (from now on, the Title), and

2) the main text of the article, including the title and the date and excluding the metadata (from now on, PMA, for ProMED-mail article).

Thus, our corpus is composed of two corpora: one with only titles and one with complete texts (including titles).

A summary of the steps performed to obtain the corpus can be seen below:

Download of Spanish articles written between 23 August 2001 and 18 August 2020 that focus on issues reported in Spanish- and Portuguese-speaking Latin American countries and that mention at least one of the following pathologies: dengue, hantavirus, measles, Guillain-Barré syndrome, Zika or Chagas.
Data cleansing by
- 1) removal of metadata with regular expressions, and
- 2) data normalization (unification of: the date separator character, the date format, the decimal separator in numbers, and the name of countries, among others).
Creation and refinement of an annotation schema and annotation guidelines (see [1] for more information). Seven named entities, three modifiers and several binary and ternary relations among named entities were defined.
Selection and training of annotators.
Annotation with the brat rapid annotation tool [4] and refinement of the annotation criteria following the Model-Annotate-Model-Annotate (MAMA) cycle [7]. A subset of articles was annotated by two annotators in order to assess the inter-annotator agreement (IAA), which was calculated with Cohen’s Kappa coefficient [8]. For articles annotated by more than one annotator, the annotations of the most experienced annotator were kept.
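
As an illustration of the agreement measure mentioned in the last step, the following is a minimal sketch of Cohen's Kappa for two annotators who labelled the same items. The function name, the label names and the choice of token-level comparison are assumptions made for this example. The actual unit of comparison used for the corpus is described in [1].

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical token-level labels from two annotators (not taken from the corpus).
annotator_1 = ["Disease", "O", "Location", "O", "Date", "O"]
annotator_2 = ["Disease", "O", "Location", "Location", "Date", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 5 of 6 items and the chance agreement is 10/36, so Kappa is 10/13.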

The resulting corpus was analyzed to determine its characteristics, such as the number of entities and relations, and the average number of sentences per article. The results can be seen in [1] for a prior version of the corpus and in [2] for the final version.

Data Statements

In the following paragraphs, we describe the ethics and data statements of our corpus. Data statements were proposed by Bender and Friedman [9] to address bias and other critical issues that emerge when working with natural language processing.

The language used in the texts is the kind usually found in newspaper articles. ProMED-mail articles are shortened versions of the original articles. Although each Spanish-speaking Latin American country uses the language differently (e.g. different terms are often applied to the same objects), the articles use standard Spanish. No information is available about the demographics of ProMED-mail editors.

The annotation was carried out by eight native Spanish speakers from Peru and Argentina, where different varieties of Spanish are spoken. Nevertheless, we consider that this did not hinder an accurate understanding of the annotation criteria or of the ProMED-mail articles.

The annotation team was composed of five computer science master's students, one linguist, and two PhDs in computer science who are researchers in natural language processing with experience developing annotation criteria and annotating in different domains.

Annotators were not economically compensated.

Download

We provide the annotated PMA (ProMED-mail articles) in two flavors:

complete articles (complete ProMED-mail articles without metadata) and
Titles (also called the reduced dataset, i.e. only PMA titles and publication dates).

For each complete article or article Title there is an .ann file and a .txt file with the same name. The .txt file contains the original article, or its title and date, without metadata. The .ann file describes all annotated entities and relations, one per line. Each entity line contains an entity identifier, the entity type, the offsets where the entity can be found in the .txt file, and its value (e.g. Brasil in an entity of type Location). Each relation line contains a relation identifier, the relation type, and its arguments (the identifiers of the named entities that are related).
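
The following is a minimal sketch of how such a pair of files could be read, assuming the .ann files follow the usual tab-separated brat standoff layout (entity identifiers starting with T, relation identifiers starting with R). The function name, the sample sentence and the relation type OccursIn are hypothetical and only serve the example. The actual entity and relation types are the ones defined in the brat configuration files.

```python
def parse_ann(text):
    """Split brat standoff text into entity and relation records."""
    entities, relations = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            # Entity line: ID <TAB> TYPE START END <TAB> TEXT
            ann_type, span = fields[1].split(" ", 1)
            # Discontinuous spans, if any, are written as "s1 e1;s2 e2".
            offsets = [tuple(int(x) for x in part.split()) for part in span.split(";")]
            entities[ann_id] = {"type": ann_type, "offsets": offsets, "text": fields[2]}
        elif ann_id.startswith("R") and len(fields) >= 2:
            # Relation line: ID <TAB> TYPE ROLE:ID ROLE:ID ...
            ann_type, *args = fields[1].split()
            relations[ann_id] = {
                "type": ann_type,
                "args": [arg.split(":", 1)[1] for arg in args],
            }
        # Other line kinds (attributes, notes) are ignored in this sketch.
    return entities, relations


# Hypothetical .txt / .ann pair ("Dengue outbreak in Brazil").
sample_txt = "Brote de dengue en Brasil"
sample_ann = (
    "T1\tDisease 9 15\tdengue\n"
    "T2\tLocation 19 25\tBrasil\n"
    "R1\tOccursIn Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(sample_ann)
for ent in entities.values():
    start, end = ent["offsets"][0]
    # The offsets index into the .txt file, so the slice equals the entity value.
    print(ent["type"], sample_txt[start:end])
```

In practice, `sample_txt` and `sample_ann` would be replaced by the contents of a .txt file and the .ann file with the same name.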

Annotations can be visualized with the brat rapid annotation tool [4] by adding the configuration files provided, or they can be used independently of brat. For the latter case, we also provide the annotated articles in IOB2 format [5].
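
To illustrate how entity offsets relate to IOB2 tags, here is a minimal sketch that tags whitespace-separated tokens from a list of character spans. Whitespace tokenization, the function name and the sample sentence are assumptions made for this example. The tokenization and file layout of the IOB2 files we distribute may differ.

```python
import re


def to_iob2(text, entities):
    """Tag whitespace-separated tokens with IOB2 labels.

    `entities` is a list of (start, end, type) character spans.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, ent_type in entities:
            # A token belongs to an entity if the two spans overlap.
            if match.start() < end and match.end() > start:
                # The token that covers the span start gets B-, later ones I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{ent_type}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical title ("Dengue cases in Buenos Aires") with two entity spans.
text = "Casos de dengue en Buenos Aires"
spans = [(9, 15, "Disease"), (19, 31, "Location")]
tags = to_iob2(text, spans)
for token, tag in tags:
    print(token, tag)
```

The multi-token entity Buenos Aires is tagged B-Location followed by I-Location, and tokens outside any entity are tagged O.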

Both datasets are divided into Training and Testing datasets. We provide links to download the:

reduced annotated dataset (only titles and dates): training dataset [brat][BIO format], testing dataset [brat][BIO format]
complete articles: training dataset [brat][BIO format], testing dataset [brat][BIO format]
brat configuration files.

Please cite our work ([1,2]) if you use the data.

Acknowledgements

Besides the authors of this webpage, Daniel Yunior Lozano Barriga, Jonathan Jimmy Mollapaza Apaza, Daniel Alfredo Palomino Paucar, Fernando Schiaffino and Alexander Yanque Aliaga also participated in the annotation of the corpus. This work received financial support from CONCYTEC-PROCIENCIA under the call E041-01 [contract number 34-2018-FONDECYT-BM-IADT-SE].

References

[1] Dellanzo, Antonella; Cotik, Viviana; and Ochoa-Luna, Jose. "A Corpus for Outbreak Detection of Diseases Prevalent in Latin America." Proceedings of the 24th Conference on Computational Natural Language Learning. 2020. [PDF][BibTex]

[2] Event-based surveillance in Latin American diseases outbreaks: Information Extraction from a Novel Spanish Corpus. To be published. 2022.

[3] Carrion, Malwina and Madoff, Lawrence C. "ProMED-mail: 22 years of digital surveillance of emerging infectious diseases". Int Health. 2017 May 1;9(3):177-183. [PDF][BibTex]

[4] Stenetorp, Pontus, et al. "BRAT: a web-based tool for NLP-assisted text annotation." Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. 2012. [PDF][BibTex]

[5] IOB2 Format: Wikipedia article

[6] Detección de epidemias en textos periodísticos escritos en español. Antonella Dellanzo, Undergraduate Thesis, Supervisor: Viviana Cotik, Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales, 2021. [PDF][BibTex]

[7] Ide, Nancy and Pustejovsky, James, "Handbook of linguistic annotation". Springer. 2017 [Book][BibTex]

[8] Cohen, Jacob, "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1), pp 37--46. 1960. Sage Publications, Thousand Oaks, CA.

[9] Bender, Emily M and Friedman, Batya, "Data statements for natural language processing: Toward mitigating system bias and enabling better science". Transactions of the Association for Computational Linguistics. 6, pp 587--604, 2018, MIT Press.

Contact

For questions, e-mail Antonella Dellanzo (antodellanzo at gmail.com), Viviana Cotik (vcotik at dc.uba.ar), or José Ochoa-Luna (jeochoa at ucsp.edu.pe).