Dataset

The corpus consists of 3,000 news items from various municipalities in the province of Alicante (Spain), covering topics such as sports, culture, leisure, and festivities. It was developed and validated by a team of experts in this field. Two versions of each article have been generated: a) a PL version, referred to in the corpus as the “facilitated version”, which follows the adaptation criteria but applies them less strictly than the E2R version, particularly in presentation, while still aiding text comprehension (related to Subtask 1); and b) an E2R version, which strictly complies with the UNE 153101 EX guidelines (AENOR 2018) in both language and presentation (related to Subtask 2).

The dataset is divided into 70% (2,100 news items) for training and 30% (900 news items) for testing. The news items are in Spanish because, as described by Instituto Cervantes (2022), although Spanish is the fourth most spoken language in the world and the second most widely spoken mother tongue, more corpora are needed to support the adaptation of texts into E2R and PL formats. Such corpora will enable NLP tools to address the challenges of accessibility and text comprehension more effectively.
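
For readers who want to reproduce split sizes of this kind on their own data, the following is a minimal sketch of a 70/30 partition. It is an illustration only: the page does not say how the official split was produced, so the shuffling, the seed, and the name `split_corpus` are all assumptions, and the integer ids stand in for real news items.

```python
import random


def split_corpus(item_ids, train_percent=70, seed=0):
    """Hypothetical helper: shuffle ids reproducibly, then cut into train/test.

    This is NOT the organizers' procedure, only one common way to obtain
    a 70/30 partition.
    """
    ids = list(item_ids)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(ids) * train_percent // 100
    return ids[:cut], ids[cut:]


# Placeholder ids 0..2999 stand in for the 3,000 news items.
train_ids, test_ids = split_corpus(range(3000))
print(len(train_ids), len(test_ids))  # 2100 900
```

Participants should rely on the training and test files as distributed rather than re-splitting the corpus, since the sketch above will not reproduce the official partition.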
