CAPITEL @ IberLEF 2020


Description

Within the framework of the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy signed an agreement to develop a linguistically annotated corpus of Spanish news articles, aimed at expanding the language resource infrastructure for the Spanish language. The resulting corpus, CAPITEL (Corpus del Plan de Impulso a las Tecnologías del Lenguaje), is composed of contemporary news articles obtained through agreements with a number of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with lemmas and Universal Dependencies-style POS tags and features), syntactic (following Universal Dependencies v2), and named entities.

The linguistic annotation of a subset of the CAPITEL corpus has been revised using a machine-annotation-followed-by-human-revision procedure. Manual revision was carried out by a team of graduate linguists using the Annotation Guidelines created specifically for CAPITEL. The revised annotations comprise about 1 million words for the named entity layer and roughly 250,000 words for the syntactic layer. Given the size of the corpus and the nature of the annotations, we propose two sub-tasks under the general umbrella task of CAPITEL @ IberLEF 2020, in which the revised subset of the CAPITEL corpus will be used in two challenges, namely:

(1) Named Entity Recognition and Classification and

(2) Universal Dependency Parsing.

Because of the ever-evolving nature of the NLP field and its associated shared task competitions, we deem it relevant to propose new challenges for the Spanish language to determine whether recent developments can push the boundaries of the current state of the art.

Sub-task 1: Named Entity Recognition and Classification in Spanish News Articles

Description

Information extraction tasks, formalized in the late 1980s, are designed to evaluate systems that capture pieces of information present in free text, with the goal of enabling better and faster access to information and content. One important type of such information is named entities (NEs) which, roughly speaking, are textual elements corresponding to names of people, places, organizations and others. Three processes can be applied to NEs: recognition (or identification), categorization (assigning a type according to a predefined set of semantic categories), and linking (disambiguating the reference).

Since their advent, NER tasks have had notable success, but despite the relative maturity of this subfield, work and research continue to evolve, and new techniques and models appear alongside challenging datasets in different languages, domains and textual genres. The aim of this sub-task is to challenge participants to apply their systems or solutions to the problem of identifying and classifying NEs in Spanish news articles. This two-stage process is referred to as NERC (Named Entity Recognition and Classification).

The following NE categories will be evaluated:

      • Person (PER)
      • Location (LOC)
      • Organization (ORG)
      • Other (OTH)

as defined in the Annotation Guidelines that will be shared with participants.

Linguistic Resources

A subset of the CAPITEL corpus will be provided (a maximum supporting data set of 1 million revised words is estimated). The supporting data will be randomly split into three subsets: training, development and test. The training set will comprise 50% of the corpus, whereas the development and test sets will roughly amount to 25% each. Together with the test set release, we will release an additional collection of documents (background set) to ensure that participating teams are not able to perform manual corrections, and also to encourage features such as scalability to larger data collections. All the data will be distributed tokenized, with named entities annotated in IOBES format.
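For illustration, the sketch below decodes entity spans from an IOBES tag sequence. Only the tag scheme (B-, I-, E-, S-, O plus a category label) is taken from the description above; the exact column layout of the released files may differ.

    # Illustrative sketch: decoding entity spans from an IOBES tag sequence.
    # The file layout of the released data may differ from this toy input.
    def iobes_to_spans(tags):
        """Return (start, end, category) tuples for one sentence's IOBES tags."""
        spans, start, category = [], None, None
        for i, tag in enumerate(tags):
            if tag.startswith("S-"):                    # single-token entity
                spans.append((i, i, tag[2:]))
                start, category = None, None
            elif tag.startswith("B-"):                  # entity begins
                start, category = i, tag[2:]
            elif tag.startswith("E-") and start is not None and tag[2:] == category:
                spans.append((start, i, category))      # entity ends
                start, category = None, None
            elif tag == "O":
                start, category = None, None
            # I- tags simply continue the current entity
        return spans

    print(iobes_to_spans(["B-ORG", "I-ORG", "E-ORG", "O", "S-LOC"]))
    # [(0, 2, 'ORG'), (4, 4, 'LOC')]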

Evaluation Metrics

The metrics used for evaluation will be the following:

  • Precision: The percentage of named entities in the system's output that are correctly recognized and classified.
  • Recall: The percentage of named entities in the test set that were correctly recognized and classified.
  • F-measure: The harmonic mean of Precision and Recall (macro averaged).

with the latter being the official evaluation score and the basis for the final ranking of the participating teams.
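As a rough illustration of the scoring, the sketch below computes entity-level precision, recall and macro-averaged F-measure from gold and predicted (start, end, category) spans. Exact span-and-type matching and averaging over the four categories are assumptions made here; the official evaluation script may handle edge cases differently.

    # Illustrative sketch of entity-level scoring; exact matching and macro
    # averaging over the four categories are assumptions, and the official
    # evaluation script may differ in its handling of edge cases.
    def macro_f1(gold, pred, categories=("PER", "LOC", "ORG", "OTH")):
        f1_scores = []
        for cat in categories:
            g = {e for e in gold if e[2] == cat}
            p = {e for e in pred if e[2] == cat}
            tp = len(g & p)                       # correctly recognized and classified
            precision = tp / len(p) if p else 0.0
            recall = tp / len(g) if g else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            f1_scores.append(f1)
        return sum(f1_scores) / len(f1_scores)

    gold = [(0, 2, "ORG"), (4, 4, "LOC")]
    pred = [(0, 2, "ORG"), (4, 4, "PER")]
    print(macro_f1(gold, pred))  # 0.25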

Sub-task 2: Universal Dependency Parsing of Spanish News Articles

Description

Dependency-based syntactic parsing has become popular in NLP in recent years. One of the reasons for this popularity is the transparent encoding of predicate-argument structures, which is useful in many downstream applications. Another reason is that it is better suited than phrase-structure grammars for languages with free or flexible word order.

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features and syntactic dependencies) across different human languages. Moreover, the UD initiative is an open community effort with over 200 contributors which has produced more than 100 treebanks in over 70 languages.

The aim of this sub-task is to challenge participants to apply their systems or solutions to the problem of Universal Dependency parsing of Spanish news articles as defined in the Annotation Guidelines for the CAPITEL corpus that will be shared with the participants.

Linguistic Resources

A subset of the CAPITEL corpus will be provided (a maximum supporting data set of 250,000 revised words is estimated). In addition to head and dependency relations in CoNLL-U format, this subset will be tokenized and annotated with lemmas and UD tags and features.
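For reference, the sketch below walks through a miniature fragment in the ten standard CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sentence, lemmas, features and relations are invented example values, not taken from the CAPITEL corpus itself.

    # Illustrative sketch of the CoNLL-U layout; the annotation values below
    # are invented examples, not actual CAPITEL data.
    SAMPLE = (
        "1\tLa\tel\tDET\t_\tDefinite=Def|Gender=Fem|Number=Sing|PronType=Art\t2\tdet\t_\t_\n"
        "2\tministra\tministro\tNOUN\t_\tGender=Fem|Number=Sing\t3\tnsubj\t_\t_\n"
        "3\tpresentó\tpresentar\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n"
        "4\tel\tel\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t5\tdet\t_\t_\n"
        "5\tplan\tplan\tNOUN\t_\tGender=Masc|Number=Sing\t3\tobj\t_\t_\n"
        "6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    )

    # The ten standard CoNLL-U columns.
    COLS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

    for line in SAMPLE.splitlines():
        row = dict(zip(COLS, line.split("\t")))
        print(row["FORM"], "->", row["HEAD"], row["DEPREL"])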

The entire CAPITEL supporting data will be randomly split into three subsets: training, development and test. The training set will comprise 50% of the corpus, whereas the development and test sets will roughly amount to 25% each. Together with the test set release, we will release an additional collection of documents (background set) to ensure that participating teams are not able to perform manual corrections, and also to encourage features such as scalability to larger data collections.

Evaluation Metrics

The metrics for the evaluation phase will be the following:

  • Unlabeled Attachment Score (UAS): The percentage of words that have the correct head.
  • Labeled Attachment Score (LAS): The percentage of words that have the correct head and dependency label.

with the latter being the official evaluation score and the basis for the final ranking of the participating teams.
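As a rough illustration, the sketch below computes UAS and LAS over aligned word lists. It assumes that gold and system analyses share the same tokenization; the official evaluation script may handle details (such as punctuation tokens) differently.

    # Illustrative sketch of UAS/LAS computation; identical tokenization of
    # gold and system output is assumed, and the official script may differ.
    def attachment_scores(gold, pred):
        """gold, pred: lists of (head, deprel) pairs, one pair per word."""
        uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
        las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
        return uas, las

    gold = [(2, "det"), (3, "nsubj"), (0, "root"), (5, "det"), (3, "obj"), (3, "punct")]
    pred = [(2, "det"), (3, "obj"), (0, "root"), (5, "det"), (3, "obj"), (3, "punct")]
    print(attachment_scores(gold, pred))  # (1.0, 0.8333...)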

Schedule for Both Sub-tasks

  • March, 15: Sample set, Evaluation script and Annotation Guidelines released.
  • March, 17: Training set released.
  • April, 1: Development set released.
  • April, 29: Test set released (includes background set).
  • May, 24: Systems output submissions.
  • May, 28: Results posted and Test set with GS annotations released.
  • June, 15: Working notes paper submission.
  • June, 22: Notification of acceptance (peer-reviews).
  • June, 30: Camera ready paper submission.
  • September: IberLEF 2020 Workshop.

Registration and Submissions

Please fill out the form on Codalab to register and submit results for NERC or UD Parsing.

Results

Sub-task 1: Named Entity Recognition and Classification in Spanish News Articles

Sub-task 2: Universal Dependency Parsing of Spanish News Articles

Organization Committee

  • David Pérez Fernández, PlanTL - Ministry of Economy, Spain.
  • Jordi Porta-Zamorano, Centro de Estudios de la RAE, Spain.
  • José-Luis Sancho-Sánchez, Centro de Estudios de la RAE, Spain.
  • Rafael-J. Ureña-Ruiz, Centro de Estudios de la RAE, Spain.
  • Doaa Samy, Instituto de Ingeniería del Conocimiento (PlanTL-GTO), Spain.
  • Luis Espinosa-Anke, School of Computer Science and Informatics, Cardiff University, UK.


Contact