Data

The dataset contains approximately 9,034 5W1H annotations distributed across 190 news items. It has been split into 70% (6,934 5W1H annotations) for training and 30% (2,100 5W1H annotations) for testing. The corpus consists of Spanish text manually annotated following 5W1H, a journalistic technique that consists of annotating all the entities in a text that answer the questions What, Who, Where, When, Why and How. It was annotated by three expert annotators (two linguists and one sociologist), all three specialized in NLP and familiar with the annotation guideline created for this purpose.
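As an illustration of the train/test partition described above, the following is a minimal sketch of a reproducible 70/30 split. It assumes, hypothetically, that the split is made over news items rather than over individual annotations; the source does not say which. The reported annotation counts (6,934 and 2,100) correspond to roughly 77% and 23% of 9,034, which would be consistent with an item-level split, but that reading is an assumption. The function name `split_items`, the seed, and the integer ids are illustrative and are not part of the released files.

```python
import random


def split_items(item_ids, train_fraction=0.7, seed=0):
    """Shuffle the ids reproducibly and cut them into train and test lists."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the global state is untouched
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical ids for the 190 news items; the real files may use other identifiers.
news_ids = range(1, 191)
train_ids, test_ids = split_items(news_ids)
print(len(train_ids), len(test_ids))  # 133 57
```

Splitting by news item keeps all annotations of one article on the same side of the partition, so the share of annotations in each part depends on how many annotations each article carries.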

Spanish was chosen because, although it is the fourth most widely spoken language in the world and the second in number of native speakers, few Spanish corpora have been built to address the automatic reliability detection task in NLP.

With regard to the domain, the topics covered in this dataset include economy, sports, science, education, health, society, entertainment, politics and security.

Regarding the collection procedure, the 190 news items were collected manually and via web crawling. They come from digital newspapers including ABC, BBC News, CNN Spanish, El Espectador, El Financiero, El Mundo, El País, Huffpost, Marca, La Jornada, El Diestro, Eje21, Periodista Digital, Ok Diario, 20 Minutos and La Vanguardia, among others. During the construction of the dataset, news items that did not match the language, format, length, or other semantic characteristics desired for the corpus were filtered out.

Dataset files:

Subtask 1:

Subtask 2:
