Evaluation metrics

Automatically assessing the quality of E2R texts remains a challenge (Alva-Manchego et al., 2021; Al Ajlouni et al., 2023). Since automatic text simplification (ATS) can be considered a form of monolingual translation, evaluation metrics developed for or widely used in Machine Translation (MT), such as BLEU (Papineni et al., 2002) and BERTScore (Zhang et al., 2020), have been adapted for this purpose. However, these metrics have limitations. For instance, BLEU relies on n-gram overlap with the references and does not explicitly evaluate semantic meaning. Alternatively, SARI (Xu et al., 2016) measures lexical paraphrasing by analyzing which n-grams the system output adds, deletes, or keeps, compared with both the source text and human references. However, because it also matches n-grams exactly, SARI penalizes valid simplifications that use synonyms the references do not contain. SAMSA (Sulem et al., 2018), on the other hand, focuses specifically on sentence splitting quality.
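
To illustrate the n-gram overlap limitation, the following minimal sketch computes a clipped n-gram precision, which is one component of BLEU and not the full metric (there is no brevity penalty and no combination across n-gram orders). The function names and the example sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is counted at most as often as it appears
    in the reference (the "clipping" used by BLEU).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example: both candidates keep the meaning of the reference.
reference = "the doctor will see you soon"
exact_copy = "the doctor will see you soon"
with_synonyms = "the physician will see you shortly"

print(clipped_precision(exact_copy, reference, n=1))
print(clipped_precision(with_synonyms, reference, n=1))
print(clipped_precision(with_synonyms, reference, n=2))
```

The candidate with synonyms scores lower than the exact copy even though it is an equally valid rewording, and the gap widens for bigrams, because every substituted word breaks the n-grams around it.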

In the context of Easy-to-Read (E2R) texts, comprehensive lexical and syntactic guidelines must be followed for a text to be considered properly adapted. Given the availability of a validated corpus created by E2R experts, we propose using two evaluation metrics to assess the generated texts:

1- Cosine Similarity

This metric will be used to measure the textual similarity between the participants’ generated texts and the reference texts created by APSA, an NGO specialized in text adaptation for people with disabilities, which collaborated in the creation of the corpus. Higher similarity scores indicate that the generated text aligns more closely with the human expert-adapted versions.
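
As a rough illustration, cosine similarity between two texts can be computed once each text is turned into a vector. This page does not state which representation the organizers use (word counts, TF-IDF, or sentence embeddings), so the sketch below assumes plain word-count vectors. The function names and the Spanish example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens (assumed representation)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical reference and system outputs.
reference = "El médico te verá pronto."
close_output = "El médico te verá muy pronto."
unrelated_output = "Mañana llueve en Madrid."

print(cosine_similarity(reference, close_output))
print(cosine_similarity(reference, unrelated_output))
```

With count vectors, the score ranges from 0 (no shared words) to 1 (same word distribution). An embedding-based variant would use the same cosine formula with different vectors.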

2- Fernández Huerta Readability Index

This readability metric, designed specifically for Spanish texts, adapts the Flesch Reading Ease formula for English. It scores a text from its average sentence length (words per sentence) and its average word length (syllables per word). A higher score indicates a text that is easier to read.
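
A minimal sketch of the index, assuming the commonly cited form of the formula, 206.84 - 60 × (syllables per word) - 1.02 × (words per sentence). The organizers' exact implementation may differ, for example in how syllables and sentences are counted. Spanish syllabification is not attempted here, so the function takes precomputed counts. The function name and the example counts are hypothetical.

```python
def fernandez_huerta(n_syllables, n_words, n_sentences):
    """Fernández Huerta readability score from raw counts.

    Assumed form: 206.84 - 60 * (syllables / words) - 1.02 * (words / sentences).
    Higher scores correspond to easier texts.
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    syllables_per_word = n_syllables / n_words
    words_per_sentence = n_words / n_sentences
    return 206.84 - 60.0 * syllables_per_word - 1.02 * words_per_sentence


# Hypothetical counts: the same 100 words and 200 syllables,
# written as 10 long sentences versus 20 short sentences.
long_sentences = fernandez_huerta(n_syllables=200, n_words=100, n_sentences=10)
short_sentences = fernandez_huerta(n_syllables=200, n_words=100, n_sentences=20)

print(round(long_sentences, 2))
print(round(short_sentences, 2))
```

Splitting the same content into shorter sentences raises the score, and so does choosing words with fewer syllables, which matches the direction of the E2R guidelines.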

Participants' submissions will be ranked based on a combined evaluation of these two metrics. The winning submissions will be those that achieve high similarity to expert-adapted texts while also maximizing readability according to the Fernández Huerta Index.
