Call for Participation

The competition is finished. The report describing the results and findings is available here. Thank you for participating!

Background

Predicting lexical complexity can enable systems to better guide a user to an appropriate text, or tailor a text to their needs. Complex Word Identification (CWI) is the task of identifying which words are likely to be considered complex by a given target population. CWI is an important task in many text simplification pipelines. Two competitions have been organized on this topic, CWI 2016 and CWI 2018.

The first edition of the CWI shared task was organized at SemEval 2016 (Paetzold and Specia, 2016). CWI 2016 provided participants with an English dataset in which words in context were annotated as non-complex (0) or complex (1) by a pool of human annotators. The goal was to predict this binary value for the target words in the test set. A post-competition analysis of the CWI 2016 results (Zampieri et al., 2017) showed how challenging CWI 2016 was, owing to both the distribution of its dataset (more testing than training instances) and its annotation (binary and aggregated).

The second edition of the CWI shared task was organized in 2018 at the BEA workshop (Yimam et al., 2018). CWI 2018 featured a multilingual (English, Spanish, German, and French) and multi-domain dataset. This time, predictions were evaluated not only in a binary classification setting, as in CWI 2016, but also in a probabilistic classification setting in which systems were asked to give the probability that a target word is complex in its particular context. Although CWI 2018 introduced an element of regression, the continuous complexity value of each word was calculated as the proportion of annotators who found the word complex (e.g., if 5 out of 10 annotators marked a word as complex, the word was given a score of 0.5). This measure therefore still relies on aggregating absolute binary judgments of complexity to produce a continuous value.
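
The proportion-based score described above can be illustrated with a minimal sketch. The function name and the input format (a list of 0/1 judgments per word) are assumptions made for illustration and are not taken from the CWI 2018 data release.

```python
from typing import Sequence


def proportion_complex(judgments: Sequence[int]) -> float:
    """Return the fraction of annotators who marked the word as complex (1).

    `judgments` is a hypothetical per-word list of binary labels,
    one label per annotator.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    if any(j not in (0, 1) for j in judgments):
        raise ValueError("judgments must be binary (0 or 1)")
    return sum(judgments) / len(judgments)


# The example from the text: 5 of 10 annotators marked the word as complex.
example = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(proportion_complex(example))  # 0.5
```

Because every input is a 0 or a 1, the score can only reflect how many annotators crossed the "complex" threshold, not how complex each of them found the word.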

LCP 2021 addresses some of these limitations by providing participants with an augmented version of CompLex (Shardlow et al., 2020), a multi-domain English dataset annotated with a 5-point Likert scale (1-5). The annotation model in CompLex treats complexity as a continuum rather than a binary feature. The goal of LCP 2021 is to predict this complexity score for each target word in context in the test set.
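
As a rough illustration of how Likert judgments can yield a continuous score, the sketch below averages per-annotator ratings after mapping them linearly onto [0, 1]. The linear mapping, the averaging step, and the function names are assumptions made for this example; this page does not specify how the CompLex scores are aggregated.

```python
from statistics import mean
from typing import Sequence


def likert_to_unit(rating: int) -> float:
    """Map a 1-5 Likert rating onto [0, 1] (assumed linear mapping)."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    return (rating - 1) / 4


def complexity_score(ratings: Sequence[int]) -> float:
    """Average the mapped ratings of all annotators for one target word."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(likert_to_unit(r) for r in ratings)


# Three hypothetical annotators rate the same word 1, 2 and 3.
print(complexity_score([1, 2, 3]))  # 0.25
```

Unlike a proportion of binary labels, each annotator here contributes a graded judgment, so two words that every annotator finds slightly difficult and very difficult, respectively, receive different scores.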


Data

We provide participants with an augmented version of CompLex, a multi-domain English dataset with sentences annotated using a 5-point Likert scale (1-5) described in Shardlow et al. (2020). The task is to predict the complexity value of words in context.

LCP 2021 is divided into two sub-tasks:

  • Sub-task 1: predicting the complexity score for single words;

  • Sub-task 2: predicting the complexity score for multi-word expressions.

Teams that participate in both sub-tasks will also be evaluated on their overall performance across sub-task 1 and sub-task 2.

Teams can use external resources (e.g. corpora, lexicons) in the competition (open submission).

The trial, training, and test data are available here.


Dates

  • Trial data available: July 31, 2020

  • Training data available: September 4, 2020

  • Test data available/Evaluation starts: January 11, 2021

  • Evaluation ends: January 20, 2021

  • Paper submission due: February 23, 2021

  • Notification to authors: March 29, 2021

  • Camera ready due: April 5, 2021

  • SemEval workshop: Summer 2021


References

Shardlow, M., Zampieri, M. and Cooper, M. (2020) CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI). pp. 57-62.

Paetzold, G. and Specia, L. (2016) SemEval 2016 Task 11: Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). pp. 560-569.

Yimam, S.M., Biemann, C., Malmasi, S., Paetzold, G., Specia, L., Štajner, S., Tack, A. and Zampieri, M. (2018) A Report on the Complex Word Identification Shared Task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (BEA). pp. 66-78.

Zampieri, M., Malmasi, S., Paetzold, G. and Specia, L. (2017) Complex Word Identification: Challenges in Data Annotation and System Performance. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017). pp. 59-63.