Task description

PreTENS is articulated into the following two sub-tasks:

  • a binary classification sub-task, which consists in predicting the acceptability label assigned to each sentence of the test set;

  • a regression sub-task, which consists in predicting the average score assigned by human annotators on a seven-point Likert scale for the subset of data evaluated via crowdsourcing.

For each sub-task and each language, the dataset will be split into training and test sets:

  • for the binary classification sub-task, the training and test sets will be composed of 5,838 and 14,556 samples, respectively;

  • for the regression sub-task, 524 sentences will be provided for the training set and 1,009 for the test set.

The two sub-tasks are independent. Participants can decide to participate in just one of them, though we encourage participation in both.

In both sub-tasks, participants are free to use external resources, with the exception of lexical resources in which semantic relationships (including taxonomical ones) are manually marked, such as WordNet, BabelNet, etc. Any external resources used must be described in detail in the final report.

Data description

The task comprises datasets in three languages: English, Italian and French. The French and Italian datasets are slightly adapted translations of the English one.

Each dataset will contain 20,394 artificially generated sentences that exemplify constructions enforcing presuppositions on the taxonomic status of their arguments A and B, e.g. comparatives (I like A more than B), exemplifications (I like A, and in particular B), generalizations (I like A, and B in general) and others.

The argument nouns A and B are taken from 30 semantic categories (e.g. dogs, birds, mammals, cars, motorcycles, cutlery, clothes, trees, plastics...).
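
As an illustration of how such template-based sentences can be generated (the organisers' actual generation procedure is not documented here, so the following Python sketch is only an assumption built from the example patterns above):

    # Hypothetical template filling for the constructions listed above.
    templates = [
        "I like {A} more than {B}",           # comparative
        "I like {A}, and in particular {B}",  # exemplification
        "I like {A}, and {B} in general",     # generalization
    ]

    # Example argument pair: "trees" is a hypernym of "birches".
    pairs = [("trees", "birches"), ("birches", "trees")]

    for template in templates:
        for a, b in pairs:
            print(template.format(A=a, B=b))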

All sentences will be provided with an acceptability label, as in the following examples:

  • I like trees, and in particular birches 1

  • I like oaks, and in particular trees 0

where 1 stands for acceptable (i.e. the taxonomical relation is compatible with the construction at issue) and 0 stands for not acceptable (not compatible).
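
As an illustration, the labelled sentences can be loaded into a simple table for experimentation. The sketch below assumes pandas and a tab-separated file with two columns (sentence and label); the actual file names and column layout of the released data may differ.

    import pandas as pd

    # Hypothetical file name and layout: one sentence per row with its binary
    # acceptability label (1 = acceptable, 0 = not acceptable).
    train = pd.read_csv("train_binary.tsv", sep="\t", names=["sentence", "label"])

    print(train.head())
    print(train["label"].value_counts())  # balance of acceptable vs. not acceptable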

A subset of 1,533 sentences of the whole dataset, corresponding to about 5% of the total and representative of the patterns considered, was judged by human annotators via a crowdsourcing campaign on a seven-point Likert scale, ranging from 1 (not at all acceptable) to 7 (completely acceptable). In this case, the sentences will be provided with the average judgment they received, which can be affected by plausibility considerations, argument order and other factors. Examples of these sentences with their average judgments are:

  • I like politicians, an interesting type of farmer 1.42

  • I like governors, an interesting type of politician 6.16
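
The average judgment attached to each of these sentences is simply the mean of the individual crowdsourced ratings on the 1-7 scale, as in the following sketch (the annotator ratings shown are made up):

    # Made-up ratings from individual annotators on the seven-point Likert scale.
    ratings = [1, 1, 2, 1, 2, 1, 2]

    # The score released with each sentence is the mean of such ratings.
    average_judgment = sum(ratings) / len(ratings)
    print(round(average_judgment, 2))  # 1.43 for this hypothetical example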

The datasets are available on the task's GitHub repository (see the links below).

Evaluation measures and baselines

Two different evaluation metrics are defined:

  • for the binary classification sub-task, the evaluation metric will be based on Precision, Recall and F-measure;

  • for the regression sub-task, the evaluation metric will be based on Spearman's rank correlation coefficient between the task participants' predicted scores and the test set scores (a minimal scoring sketch for both metrics follows this list).
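
Both metrics can be computed with standard libraries. The sketch below uses scikit-learn and SciPy with placeholder gold and predicted values; the averaging mode chosen for the F-measure is an assumption, since the official scorer's exact configuration is not specified here.

    from scipy.stats import spearmanr
    from sklearn.metrics import precision_recall_fscore_support

    # Placeholder gold standard and system predictions.
    gold_labels = [1, 0, 1, 1, 0]
    pred_labels = [1, 0, 0, 1, 0]
    gold_scores = [1.42, 6.16, 3.50, 5.80]
    pred_scores = [2.00, 5.90, 3.10, 6.40]

    # Sub-task 1: Precision, Recall and F-measure on the binary labels.
    precision, recall, f_measure, _ = precision_recall_fscore_support(
        gold_labels, pred_labels, average="binary"
    )
    print(f"P={precision:.3f}  R={recall:.3f}  F={f_measure:.3f}")

    # Sub-task 2: Spearman's rank correlation between predicted and gold scores.
    rho, p_value = spearmanr(pred_scores, gold_scores)
    print(f"Spearman rho={rho:.3f}")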

Two different baselines are defined in our GitHub repository (linked below) and described in the notice at the end of this section.


Evaluation platforms:
Sub-task 1: https://codalab.lisn.upsaclay.fr/competitions/1292
Sub-task 2: https://codalab.lisn.upsaclay.fr/competitions/1290

Test data: https://github.com/shammur/SemEval2022Task3

Notice:

For each sub-task a separate baseline is defined: i) for the binary classification sub-task, a Linear Support Vector classifier using n-grams (up to length three) as input features; and ii) for the regression sub-task, a Linear Support Vector regressor with the same n-gram features. Participants can run the evaluation system and obtain results using different cross-validation configurations on the training set. Because the official test set contains additional constructions with the same presuppositional constraints, we have found that applying the baseline methods to the official test set yields results that are 10% to 20% lower than on the training set. This highlights the importance of achieving a high degree of syntactic generality on this task; for this reason, we encourage participants to test different cross-validation configurations on the training set.
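
For reference, a rough approximation of the two baselines can be sketched with scikit-learn: word n-grams up to length three as features, a Linear Support Vector classifier for sub-task 1 and a Linear Support Vector regressor for sub-task 2, evaluated with cross-validation. The training examples, labels and scores below are toy stand-ins, and the exact preprocessing and hyper-parameters of the official baselines may differ.

    from scipy.stats import spearmanr
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC, LinearSVR

    # Toy stand-ins for the released training data; replace with the official files.
    sentences = [
        "I like trees, and in particular birches",
        "I like oaks, and in particular trees",
        "I like vehicles, and in particular cars",
        "I like cars, and in particular vehicles",
    ]
    labels = [1, 0, 1, 0]          # binary acceptability labels (sub-task 1)
    scores = [6.2, 1.5, 6.0, 1.8]  # made-up average judgments (sub-task 2)

    # i) classification baseline: Linear SVC over word n-grams up to length three.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 3)), LinearSVC())
    print(cross_val_score(clf, sentences, labels, cv=2, scoring="f1"))

    # ii) regression baseline: Linear SVR with the same n-gram features,
    #     scored with Spearman's rank correlation as in the official evaluation.
    spearman = make_scorer(lambda y_true, y_pred: spearmanr(y_true, y_pred)[0])
    reg = make_pipeline(CountVectorizer(ngram_range=(1, 3)), LinearSVR())
    print(cross_val_score(reg, sentences, scores, cv=2, scoring=spearman))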