Task & Data

TRAINING DATA IS NOW AVAILABLE


TEST DATA IS NOW AVAILABLE

Training data can be downloaded [HERE]

You can also download the test data from our [CODALAB PAGE]. Please register your team if you have not already done so.

Task Description

In this task, participants are given pairs of sentences and must indicate whether each pair holds a paraphrase relationship, i.e. classify each pair as a paraphrase (P), labelled 1, or not a paraphrase (NP), labelled 0.
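To make the expected input and output concrete, here is a minimal sketch of a system for this task. It is only an illustrative baseline, not an official one: the token-overlap (Jaccard) heuristic, the 0.5 threshold, the function names, and the example sentences are all assumptions made for illustration and are not taken from the task data.

```python
import re


def tokens(sentence):
    # Lowercase and extract word tokens; in Python 3, \w also matches
    # accented letters such as those used in Spanish.
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    # Token-overlap similarity in [0, 1]; two empty sentences score 0.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def classify_pair(a, b, threshold=0.5):
    # Returns 1 (paraphrase, P) or 0 (not a paraphrase, NP).
    # The threshold is an arbitrary assumption, not a tuned value.
    return 1 if jaccard(a, b) >= threshold else 0


# Hypothetical sentence pairs, invented for this example.
pairs = [
    ("El sushi es un plato de origen japonés.", "El sushi es un plato japonés."),
    ("El tequila se elabora a partir del agave.", "El kebab se sirve con pan."),
]
predictions = [classify_pair(a, b) for a, b in pairs]
print(predictions)
```

A lexical-overlap heuristic like this is likely to struggle with pairs that are reworded heavily, so it is best read as a description of the input/output contract (one 0/1 label per sentence pair) rather than as a competitive approach.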

Data Description

The corpus for this task is composed of sentence pairs extracted from more than 200 texts that were manually written using the literary creation, low paraphrase, high paraphrase, and no paraphrase methods. All the texts are based on an original text whose structure has been manually modified at different levels to produce a wide range of variations. This variety of paraphrases makes the corpus useful for research on Spanish because it reflects the richness of the language and the different ways an idea can be expressed.

There are 7 original texts in total, on the following topics: molecular cuisine, sushi, tequila, kebab, vegan food, food trucks, and food on the Mexican ofrenda. Each one is composed of an average of 25 phrases, which we call sentences. The original texts are manually paraphrased following the method proposed by Torres-Moreno et al. (2014) to build a German corpus, which is made up of three main sections: low paraphrase, high paraphrase, and no paraphrase.

The statistics for the training corpus are as follows: