IberLEF 2025 task
Deep language comprehension is essential for capturing the semantic nuances and logical inferences behind natural language understanding. A major challenge in language comprehension is the development of new resources, which often require significant human intervention. One solution has been to reuse human Reading Comprehension collections, such as RACE (Lai et al. 2017), allowing researchers to compare the performance of automated systems against human benchmarks. However, most of these resources have been created primarily in English and often contain a substantial amount of training data similar to the test set, which can limit how well the results generalize as evidence of reasoning capabilities.
PROFE 2025 reuses the exams for Spanish proficiency evaluation developed by Instituto Cervantes over many years to evaluate human students. Automatic systems will therefore be evaluated under the same conditions as humans were: systems will receive a set of exercises with their corresponding instructions, without specific training material. We thus expect transfer learning approaches or the use of generative Large Language Models.
PROFE 2025 has three subtasks, one per exercise type. Teams can participate in any combination of them. Each subtask contains several exercises of the same type. The subtasks are:
Multiple choice subtask: each exercise includes a text and a set of multiple-choice questions about it, where only one answer per question is correct. Given a multiple-choice question, systems must select the correct answer among the candidates.
Matching subtask: each exercise contains two sets of texts. For each text in the second set, systems must find its best-matching text in the first set. Each text has exactly one correct match, but the first set can contain extra, unused texts (see the sketch after this list).
Filling the gap subtask: each exercise contains a text with several gaps, corresponding to textual fragments that have been removed and are presented in shuffled order as options. Systems must determine the correct position for each fragment. There is only one correct fragment per gap, but there may be more candidates than gaps.
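As an illustration of the matching constraint (a one-to-one correspondence with possible distractors), the following minimal Python sketch frames the subtask as an assignment problem. The similarity scorer score() is a hypothetical placeholder, not part of the task; any embedding- or LLM-based scorer could play that role.

    # Minimal sketch: one-to-one matching with extra distractor texts.
    # Assumes a hypothetical scorer `score(a, b)` returning a similarity
    # value (higher = better match).
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_texts(first_set, second_set, score):
        # Rows: texts to be matched (second set); columns: candidates
        # (first set, possibly including unused distractors).
        cost = np.array([[-score(b, a) for a in first_set]
                         for b in second_set])
        rows, cols = linear_sum_assignment(cost)  # maximizes total similarity
        # Map each second-set index to its chosen first-set index.
        return dict(zip(rows, cols))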
The different exercises open up research on how to approach them, for instance by adapting different prompts when using generative models.
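For instance, a generative model could be prompted per exercise type. The sketch below builds a simple Spanish prompt for the multiple-choice case; the instruction wording and the single-letter answer convention are illustrative assumptions, not an official format.

    # Minimal sketch of a multiple-choice prompt for a generative LLM.
    # The wording and the "answer with one letter" convention are
    # assumptions for illustration; adapt them per exercise type.
    def build_mc_prompt(text, question, options):
        letters = "ABCDEFGH"  # assumes at most 8 options
        listed = "\n".join(f"{letters[i]}) {opt}"
                           for i, opt in enumerate(options))
        return (
            "Lee el siguiente texto y responde a la pregunta eligiendo "
            "la única opción correcta. Responde solo con la letra.\n\n"
            f"Texto:\n{text}\n\n"
            f"Pregunta: {question}\n\n"
            f"Opciones:\n{listed}\n\n"
            "Respuesta:"
        )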
We will use the IC-UNED-RC-ES dataset created from real examinations at Instituto Cervantes. These exams were created by human experts to assess language proficiency in Spanish. We have already collected the exams and converted them to a digital format, which is ready to be used in the task. The dataset contains exams at different levels (from A1 to C2).
The complete dataset contains 282 exams with 855 exercises. The total number of evaluation points is 6146 (among 16570 options), distributed by exercise type as follows:
multiple-choice: 3544 responses
matching: 2309 responses
fill-the-gap: 293 responses
In PROFE 2025 we plan to use around 50% of the exams; the other 50% remains hidden for the second edition of PROFE.
We do not intend to distribute the gold standard, in order to prevent overfitting in post-campaign experiments and data contamination in LLMs.
See the submission section for examples of each task, the input and output formats, and the submission procedure.
We will use traditional accuracy (the proportion of correct answers) as the main evaluation measure. Systems will receive evaluation scores from two different perspectives:
At the question level, where correct answers are counted individually without grouping them.
At the exam level, where scores for each exam are considered. Each exam contains several exercises of different types. An exam is considered passed if its accuracy (the proportion of correct answers) is above 0.6. The proportion of passed exams is then given as a global score. This perspective will apply only to teams participating in all three subtasks.
In more detail, the exact evaluation per subtask is as follows:
Multiple choice subtask: we will measure accuracy as the proportion of questions correctly answered.
Matching subtask: we will measure accuracy as the proportion of texts correctly matched.
Fill in the gap subtask: we will measure accuracy as the proportion of gaps correctly filled.
We will use accuracy as the evaluation measure because there is only one correct option among the candidates, and because it is the measure applied to humans taking the same exams. Thus, we can compare the performance of automatic systems and humans under the same conditions.
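To make the two scoring perspectives concrete, here is a minimal Python sketch of both computations. The data layout (a list of exams, each a list of per-question booleans) is an assumption for illustration, not the official submission format.

    # Minimal sketch of the two evaluation perspectives.
    # `exams` is assumed to be a list of exams, each a list of booleans
    # marking whether each individual question was answered correctly.
    def question_level_accuracy(exams):
        # Correct answers counted individually, without grouping by exam.
        answers = [ok for exam in exams for ok in exam]
        return sum(answers) / len(answers)

    def exam_level_score(exams, pass_threshold=0.6):
        # An exam is passed when its own accuracy is above the threshold;
        # the global score is the proportion of passed exams.
        passed = [sum(exam) / len(exam) > pass_threshold for exam in exams]
        return sum(passed) / len(passed)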
A preliminary baseline using ChatGPT obtains the following results for each exercise type (note that different prompting can produce slightly different results):
Multiple choice accuracy: 0.64
Filling the gap accuracy: 0.43
Matching accuracy: 0.51
If you want to participate in PROFE 2025, please register by filling out this form.
Participants will be required to submit their runs, which will be evaluated on the test partitions of the corresponding corpora. A submission form will be made available on the corresponding date.
Participants are also asked to describe their systems in paper submissions.
See Important Dates.
Alvaro Rodrigo, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Anselmo Peñas, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Alberto Pérez, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Sergio Moreno, UNED NLP & IR Group (Universidad Nacional de Educación a Distancia)
Javier Fruns Jiménez, Instituto Cervantes
Inés Soria Pastor, Instituto Cervantes
Rodrigo Agerri, HiTz (Universidad del País Vasco, UPV/EHU)
This task is partially funded by the Spanish Research Agency (Agencia Estatal de Investigación), DeepInfo (PID2021-127777OB-C22) and DeepKnowledge (PID2021-127777OB-C21) projects (MCIU/AEI/FEDER, UE).