SemEval 2022 Task 2

Multilingual Idiomaticity Detection and Sentence Embedding

Announcements


The competition has now come to an end. We'd like to thank all participants for their hard work.

Results


The competition results are available on the results section of this website.


Published Task and System Description Papers

You can find an overview of the submitted papers in the task description paper: SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

You can see the list of published papers here.


Continue to experiment with the data

You can continue to experiment with the data on the task CodaLab website


Contents

Motivation

By and large, the use of compositionality of word representations has been successful in capturing the meaning of sentences. However, there is an important set of phrases — those which are idiomatic — which are inherently not compositional. Early attempts to represent idiomatic phrases in non-contextual embeddings involved the extraction of frequently occurring n-grams from text (such as “big fish”) before learning representations of the phrase based on their context (Mikolov et al., 2013). However, the effectiveness of this method drops off significantly as the length of the idiomatic phrase increases as a result of data sparsity (Cordeiro et al., 2016). More recent studies show that even state-of-the-art pre-trained contextual models (e.g. BERT) cannot accurately represent idiomatic expressions (Garcia et al., 2021).


Task Overview

Given this shortcoming in existing state-of-the-art models, this task (part of SemEval 2022) is aimed at detecting and representing multiword expressions (MWEs) which are potentially idiomatic phrases across English, Portuguese and Galician. We call these potentially idiomatic phrases because some MWEs, such as "wedding date", are not idiomatic (i.e. they are literal). This task consists of two subtasks, each available in two "settings".

Participants have the freedom to choose the subset of subtasks or settings they'd like to participate in (see the sections detailing each subtask for details). However, you cannot pick a subset of languages.

This task consists of two subtasks:

  1. Subtask A

A binary classification task aimed at determining whether a sentence contains an idiomatic expression.

  2. Subtask B

Subtask B is a novel task which requires models to output the correct Semantic Text Similarity (STS) scores between sentence pairs whether or not either sentence contains an idiomatic expression. Participants must submit STS scores which range between 0 (least similar) and 1 (most similar). This will require models to correctly encode the meaning of idiomatic phrases such that the encoding of a sentence containing an idiomatic phrase (e.g. Who will he start a program with and will it lead to his own swan song?) and the same sentence with the idiomatic phrase replaced by a (literal) paraphrase (e.g. Who will he start a program with and will it lead to his own final performance?) are semantically similar to each other and equally similar to any other sentence. (See details of subtask below).


Important Dates

Training data available: September 3, 2021

Evaluation start: January 10, 2022

Evaluation end: (TBC) January 31, 2022

Paper submissions due: (TBC) February 23, 2022

Notification to authors: March 31, 2022

Please join the task Google Group so you can receive the latest information on the task.


Getting Started

To familiarise yourself with this Task, we suggest:

  1. Join the Task mailing list so you receive regular updates on the task.

  2. Read through this page to understand the two Subtasks and their settings.

  3. Familiarise yourself with the timelines of this Task.

  4. Decide on the Subtask(s) and Setting(s) you intend to participate in. You can participate in any one (or more) of the following:

    1. Subtask A: Zero-shot

    2. Subtask A: One-shot

    3. Subtask B: Pre-train

    4. Subtask B: Fine-tune

  5. Step through the Google Colab Notebooks with the Baselines for each Subtask and Setting you wish to participate in so you understand the requirements.

  6. Submit the resultant baseline file to CodaLab so you are clear about the submission format.

  7. Start working on your own method for the Subtask(s) and Setting(s).

  8. Submit your results and Win : ) !!

Subtask A

This is a binary classification task that requires classifying sentences into either "Idiomatic" or "Literal". It will be available in each of the settings described below and in all three languages. Example sentences and associated labels are shown below.

Subtask A Settings

We provide two different settings to better test models' ability to generalise: zero-shot and one-shot.

In the "zero-shot" setting, MWEs (potentially idiomatic phrases) in the training set are completely disjoint from those in the test and development sets. In the "one-shot" setting, we include a one positive and one negative training examples for each MWE in the test and development sets. Note that the actual examples in the training data are different from those in the test and development sets in both settings. Please see the Section Data and Model Restrictions for details.

Participants can choose to participate in only one or both of these settings.

Note: Please see the submission format section for details on the format that should be used for submissions, as well as the expected file names and structure.

Example sentences and labels for Subtask A.

Please note that "Idiomatic" is assigned the label 0 in the dataset, and "non-idiomatic" (including proper nouns) is assigned the label 1.

Subtask A Data Description and Evaluation Metrics

This subtask is evaluated using the Macro F1 score between the gold labels and model predictions (see the details in the evaluation script).
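For reference, a minimal sketch of computing this metric with scikit-learn is shown below; the toy labels are illustrative only, and the official evaluation script on GitHub remains the authoritative implementation.

```python
# A minimal sketch of the Subtask A metric using scikit-learn.
# The gold/prediction values are toy data; 0 = "Idiomatic", 1 = "non-idiomatic".
from sklearn.metrics import f1_score

gold = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 1]

macro_f1 = f1_score(gold, predictions, average="macro")
print(f"Macro F1: {macro_f1:.4f}")
```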

The following sample illustrates the training data associated with Subtask A. The evaluation data is in the same format, but without the labels.

Each row of the data specifies the language and the potentially idiomatic MWE used to annotate that row. The "Target" is the sentence that contains this MWE; in addition, we provide the previous and next sentences for context. The label provides the annotation of that row: a label of 0 indicates "Idiomatic" and a label of 1 indicates "non-idiomatic" (including proper nouns).

There are two training files associated with this subtask: "train_zero_shot.csv" and "train_one_shot.csv". You may use only train_zero_shot.csv to train models for the zero-shot setting; however, you can use data from both files to train models for the one-shot setting.
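A minimal sketch of loading these files with pandas, respecting the restriction above, might look as follows; the "Label" column name is an assumption based on this description, so check the files on GitHub for the exact headers.

```python
import pandas as pd

zero_shot = pd.read_csv("train_zero_shot.csv")
one_shot = pd.read_csv("train_one_shot.csv")

# Zero-shot setting: train on train_zero_shot.csv only.
zero_shot_train = zero_shot

# One-shot setting: data from both files may be combined.
one_shot_train = pd.concat([zero_shot, one_shot], ignore_index=True)

print(zero_shot_train.columns.tolist())
print(one_shot_train["Label"].value_counts())  # "Label" column name assumed
```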

The evaluation data (dev and eval) is the same for both settings. You must use it to create a submission file as detailed in submission format.

Subtask B

Subtask B is a novel task which requires models to output the correct Semantic Text Similarity (STS) scores between sentence pairs whether or not either sentence contains an idiomatic expression. Participants must submit STS scores which range between 0 (least similar) and 1 (most similar). This will require models to correctly encode the meaning of idiomatic phrases such that the encoding of a sentence containing an idiomatic phrase (e.g. Who will he start a program with and will it lead to his own swan song?) and the same sentence with the idiomatic phrase replaced by a (literal) paraphrase (e.g. Who will he start a program with and will it lead to his own final performance?) are semantically similar to each other and equally similar to any other sentence.

Given that this is a new challenge, please read through the requirements carefully and do not hesitate to ask for clarification on the Google Group. In addition to the task description here, the Google Colab Notebook, which describes the baseline solution for this subtask in detail, will be helpful in understanding the requirements of this task. Please also see the data description section below.

To test a model's ability to generate sentence embeddings that accurately represent sentences regardless of whether or not they contain idiomatic expressions, we provide paraphrases of each possible meaning of a MWE and require models to output semantic similarity scores between sentences such that:

Case 1

The semantic similarity between a sentence with an idiom and that same sentence in which the idiom has been replaced by a paraphrase that incorrectly represents the meaning of the idiom in context is approximately equal to the semantic similarity score between a sentence where the idiom has been replaced by a paraphrase that correctly represents its meaning and one where it incorrectly represents the meaning.

i.e., Sim(Column 1, Column 3) is approximately equal to Sim(Column 2, Column 3) in Table 1 below.

The intuition here is that sentences containing a MWE and the correct paraphrase of that MWE (e.g. "big picture" and "whole situation" in row 2 of Table 1 below) should be equally similar to any other sentence (in this case the one with the incorrect replacement).

Case 2

The semantic similarity score between a sentence with an idiom and that same sentence in which the idiom has been replaced by a paraphrase that correctly represents the meaning of the idiom in context is approximately equal to one.

i.e., Sim(Column 1, Column 2) is approximately equal to 1 in Table 1 below.

Once again, the intuition here is that sentences containing a MWE and the correct paraphrase of that MWE (e.g. "big picture" and "whole situation" in row 2 of Table 1 below) should mean the same thing and so must have a similarity of 1.


Table 1: Illustration of Subtask B. Sentence embeddings must provide semantic similarity scores that are consistent when an idiom is replaced by a paraphrase of its meaning that is either correct or incorrect in context.
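To make the two cases concrete, the sketch below scores the "swan song" example with off-the-shelf sentence embeddings. This is not the official baseline: the library, model name and the "incorrect" paraphrase are all illustrative assumptions.

```python
# Illustrative only: cosine similarity of sentence embeddings via the
# sentence-transformers library (model name and incorrect paraphrase are
# assumptions, not taken from the task data).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

idiom     = "Who will he start a program with and will it lead to his own swan song?"
correct   = "Who will he start a program with and will it lead to his own final performance?"
incorrect = "Who will he start a program with and will it lead to his own song about swans?"

emb = model.encode([idiom, correct, incorrect], convert_to_tensor=True)

sim_idiom_incorrect   = util.cos_sim(emb[0], emb[2]).item()  # Case 1, left-hand side
sim_correct_incorrect = util.cos_sim(emb[1], emb[2]).item()  # Case 1, right-hand side
sim_idiom_correct     = util.cos_sim(emb[0], emb[1]).item()  # Case 2, should be close to 1

print(sim_idiom_incorrect, sim_correct_incorrect, sim_idiom_correct)
```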

Subtask B Settings

Subtask B will be available in two settings: "pre-train" and "fine-tune".

The "pre-train" setting requires models to be trained without use of idiom specific data (as in the pre-training of language models) and the fine-tune setting allows for fine-tuning models using the training data we provide.

We define pre-training as the training of a model on any task other than idiomatic Semantic Text Similarity; this can include "fine-tuning" on a different task, such as semantic text similarity on a dataset that does not contain idiomatic information. Fine-tuning includes training on any STS dataset which includes potentially idiomatic MWEs. Note that the sentence representations output by most pre-trained language models (such as BERT) cannot be used for STS without fine-tuning on an STS dataset.
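As an example of what this definition still permits in the pre-train setting, the sketch below fine-tunes a sentence encoder on generic (non-idiom-specific) STS pairs using the sentence-transformers API. The model name and the toy training pairs are assumptions; in practice you would load the train split of a standard STS dataset.

```python
# A sketch of fine-tuning on generic STS data (allowed in the pre-train
# setting under the definition above). Toy pairs stand in for the train
# split of a standard STS dataset; scores are normalised to the 0-1 range.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-multilingual-cased")  # example model name

train_examples = [
    InputExample(texts=["A man is playing a guitar.",
                        "Someone is playing an instrument."], label=0.8),
    InputExample(texts=["A dog runs in the park.",
                        "The stock market fell today."], label=0.0),
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
```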

The test set for each of the settings is the same, but the submission format clearly differentiates between pre-training and fine-tuning settings and participants can choose to participate in only one of these settings.

Please see the submission format section for details on the expected format, and the submission naming convention on CodaLab for the file names and structure expected in your submission.


Non-Contextual Models

While the terms "pre-training" and "fine-tuning" are generally used with contextual language models, you are free to use non-contextual models (such as GloVe or word2vec). Please make sure that you do not train the model on the training data provided when submitting results to the "pre-train" setting.


Subtask B Evaluation Metrics

The metric for this subtask is the Spearman rank correlation between a model's output STS scores for sentence pairs containing idiomatic expressions and its scores for the same pairs with the idiomatic expressions replaced by non-idiomatic paraphrases (which capture the correct meaning of the MWEs).
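A minimal sketch of this correlation, assuming you already have the two aligned lists of model scores, is shown below; the official evaluation script on GitHub performs the alignment itself and is the authoritative implementation.

```python
# Spearman rank correlation between two aligned sets of model scores
# (toy values): pairs containing the idiomatic MWE vs. the same pairs
# with the MWE replaced by its correct paraphrase.
from scipy.stats import spearmanr

scores_with_idiom      = [0.91, 0.40, 0.75, 0.33]
scores_with_paraphrase = [0.88, 0.42, 0.70, 0.35]

rho, _ = spearmanr(scores_with_idiom, scores_with_paraphrase)
print(f"Spearman rank correlation: {rho:.4f}")
```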

Subtask B: Evaluation Data

The evaluation data consists of sentence pairs, as in any Semantic Text Similarity (STS) dataset, with the addition of potential idioms or multiword expressions (MWEs). The MWE associated with sentence_1 is in the column MWE1 (None if there is no MWE in sentence_1) and that associated with sentence_2 is in the MWE2 column (again, None if there isn't one). Since this task is designed to ensure that a model is consistent, the STS scores associated with pairs of sentences containing MWEs must be the same as those for the same sentence pairs where the MWE is replaced by a paraphrase.

Consider the following snippet of Evaluation data:

You are required to submit Semantic Similarity Scores for the sentence pairs sentence_1 and sentence_2.

Notice how, if your model correctly interprets the meaning of "Swan Song", it should output the same STS score for the sentence pairs in Rows 1 and 2, and an STS score of 1 for the sentence pair in Row 3.

We evaluate models using Spearman's rank correlation coefficient between these two sets of scores (see details in section below).

The evaluation script (with the help of the gold labels) will automatically perform this evaluation. The evaluation script is available on the GitHub repository.

The evaluation data (dev and eval) is the same for both settings. You must use it to create a submission file as detailed in submission format.

Note: The training, development and evaluation splits do not contain a MWE in sentence_2 and so MWE2 will always be None. MWE2 is included for consistency and for flexibility in training data (also see note at end of next section).


Subtask B: Training Data

If you are participating in the fine-tune setting of Subtask B, you also have access to the training data. You must NOT use this training data when training models for your pre-train setting.

Again, since this task is designed to ensure that a model is consistent, the STS scores for training your model on a pair of sentences containing MWEs must come from your model's predictions on the same sentences where the MWE is replaced by a paraphrase.

The training data differs slightly between the two cases described above. Consider the following example of Case 1:

Before training your model, you must first predict the STS between alternative_1 and alternative_2 and use those values as the similarities between the corresponding sentence_1 and sentence_2. You might find the pre-processing scripts made available with the baseline useful for this purpose.
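A sketch of this preprocessing step is shown below. It assumes a pandas-readable train_data.csv with columns sentence_1, sentence_2, alternative_1, alternative_2 and a (possibly empty) sim column, plus a predict_sts() function supplied by your own model; the official pre-processing scripts in the baseline notebook are the authoritative version.

```python
# Fill in missing training targets (Case 1) with the model's own predictions
# on the alternative sentence pair. File and column names are assumptions.
import pandas as pd

def predict_sts(text_a: str, text_b: str) -> float:
    """Placeholder for your model's STS prediction,
    e.g. cosine similarity of sentence embeddings."""
    raise NotImplementedError

train = pd.read_csv("train_data.csv")

for idx, row in train.iterrows():
    if pd.isna(row.get("sim")):  # Case 1: no gold similarity provided
        train.at[idx, "sim"] = predict_sts(row["alternative_1"],
                                           row["alternative_2"])
    # Case 2 rows already carry a similarity score and are used as-is.

train.to_csv("train_data_with_targets.csv", index=False)
```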

Now let's explore the sample data below (Case 2):

In such cases, where the similarity score is provided, you can directly use that similarity score to train your model.

We also include some standard STS evaluation data to make sure that models continue to perform (reasonably) well on standard Semantic Text Similarity (we want to avoid "overfitting" on the MWE dataset). This data is included in the evaluation data and is drawn from the STSBenchmark dataset in English and the ASSIN and ASSIN 2 datasets in Portuguese. For this reason, you must NOT use the development and test splits of any of these datasets in your training (see also Data and Model Restrictions for more details).

NOTE: We strongly suggest you step through the Google Colab notebook we use to generate the baseline for this task. The fine-tune data generation section of the Colab notebook which describes preprocessing the training data might be especially useful.

NOTE: MWE2 is provided to allow you the flexibility of augmenting the training data to include pairs where both sentence_1 and sentence_2 contain potentially idiomatic MWEs. The training, development and evaluation splits do not contain a MWE in sentence_2 and so MWE2 will always be None. This will also be the case for the Test split which will be released in January.

Data and Model Restrictions:

Subtask A

Please note that you must NOT use the one-shot data in training models for the zero-shot setting. You can, however, use the zero-shot data when training models for the one-shot setting.

Please do NOT use training data other than that provided [train_one_shot.csv, train_zero_shot.csv] for this subtask. We have carefully designed the tasks to differentiate between the zero-shot and one-shot settings, and we also want to be able to directly compare models, which we will not be able to do if you use your own training data.

Subtask B

In the pre-train setting of Subtask B, you must NOT use the training data provided. Note that sentence representations output by most pre-trained language models (such as BERT) cannot be used for STS without fine-tuning on an STS dataset and so fine-tuning on the training split of standard STS datasets (including those listed below) is allowed.

Since we want to ensure that models continue to perform well on standard Semantic Text Similarity, we include STS data in our evaluation.

WHEN TRAINING YOUR MODELS, DO NOT USE DATA FROM THE DEVELOPMENT OR TEST SPLITS OF THE FOLLOWING STS DATASETS:

  1. STSBenchmark dataset in English

  2. ASSIN and ASSIN 2 datasets in Portuguese

Model Restrictions

You are required to use the same method/model for all samples in a particular setting (with the exception of different languages). You must NOT filter the evaluation data and use different methods for different parts of it, except by language and setting. You CAN use different methods for each of the two settings and a different method for each language within a setting. You can NOT use a different method for the STS subsection and the idiomatic subsection.

Data Restrictions Summary:

  1. Subtask A, Zero-shot: Use "train_zero_shot.csv" only. Do NOT use your own training data or the one-shot data.

  2. Subtask A, One-shot: Use "train_zero_shot.csv" and "train_one_shot.csv" only. Do not use your own training data.

  3. Subtask B, pre-train: Use any training data that you generate which does NOT explicitly consider MWEs or idiomaticity for the assignment of STS scores (no idiom specific data). An example of data that you can NOT use is any data that is similar to the fine-tune setting's training data. For clarity: You are allowed to use sentences containing MWEs (e.g. for pre-training as in the dataset paper) as long as you do not include associated STS scores.

  4. Subtask B, fine-tune: Use train_data.csv and any training data you generate. For clarity: You are allowed to add your own sentence pairs with associated STS scores for this setting.


Using results of Subtask A in Subtask B

If you choose to make use of the models you've developed for Subtask A in methods you develop for Subtask B (such as, for example, first classifying a sentence and encoding those which a model classifies as idiomatic in a different way), you can only use this approach in the fine-tune setting and not in the pre-train setting, as it counts as the use of a model trained on Subtask A data.

Please note that this is contrary to the experimental settings in the "AStitchInLanguageModels" dataset paper (see section Dataset below for details) where they do make use of models from (the equivalent of) Subtask A in the pre-train setting of (the equivalent of) Subtask B. In this SemEval task, this is explicitly not allowed.

Dataset

The data associated with this task is available on Github.

The data and tasks were adapted from the dataset "AStitchInLanguageModels" described in the (Findings of) EMNLP 2021 paper: "AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models" (Code).

You can find the pre-recording of the conference talk describing this work HERE.

Data Distribution over Languages

English and Portuguese each have an average of about 20 (min 10, max 30) naturally occurring sentences per MWE, drawn from 223 MWEs (163 train and 60 dev) and 113 MWEs (73 train and 40 dev) respectively. As we aim to test models' ability to transfer learning across languages, Galician has no associated training or development sets.

The test data will be drawn from 50 MWEs associated with each language and one-shot training data associated with these MWEs will be released with the test data.

Trial Data

The trial data is available on Github. Please see the data description sections for Subtask A and Subtask B for details on the data format and evaluation metrics. This data is aimed at providing participants with a clearer understanding of what to expect.

Training Data

The training data is now available on Github. Please see the data description sections for Subtask A and Subtask B for details on the data format.

IMPORTANT: We provide three data splits: a "train" split to train on, a "dev" split with the associated gold labels so you can evaluate using the evaluation scripts on the GitHub page, and an "eval" split without the gold labels that you must use to generate your submissions on CodaLab. The eval split must not be confused with the "test" split that will be released in January.

Participate on CodaLab

The CodaLab competition website for this task is: https://competitions.codalab.org/competitions/34710

During the practice phase, submit the results for the "eval" split to CodaLab. (You can use the dev split to optimise your models.)

We will release the test data to use for your evaluation submission to CodaLab in January.

Submission Format

Your submission must consist of a zip file of a single folder called "submission" containing a separate file for each subtask as follows:

.
└── submission
    ├── task2_subtaska.csv
    └── task2_subtaskb.csv

Note that you may choose to include only one file if you are participating in only one subtask. If you are participating in only one setting of a subtask, you must still use the same template, but simply leave the entries associated with the setting you are not participating in blank.

The evaluation data (dev and eval) is the same for both settings. The submission format defined above provides a way of including your results from both settings in the same submission file. The IDs in the dev and eval splits match those in the submission file and are repeated twice: once for each setting.
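A small helper sketch for packaging a submission in this structure is shown below; it assumes your own code has already written the CSV files into a local "submission" folder and simply handles the layout and zipping.

```python
# Package the submission folder into a zip with the structure shown above.
import os
import zipfile

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in ("task2_subtaska.csv", "task2_subtaskb.csv"):
        path = os.path.join("submission", name)
        if os.path.exists(path):  # include only the subtask(s) you take part in
            zf.write(path, arcname=os.path.join("submission", name))
```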

You might find it helpful to step through the Baseline Google Colab Notebook which provides a function to insert your results into the relevant submission format file.

Baselines and preprocessing scripts

We provide strong baselines for each subtask (and setting) based on the results presented in the paper: "AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models".

These Colab notebooks also provide details on the submission format.

You are strongly encouraged to run through the notebooks associated with the subtask(s) you are participating in.

Subtask A (Baseline and Preprocessing)

Subtask B (Baseline and Preprocessing)

Organisers

Contact us at semeval-2022-task-2-mwe-organisers-group@sheffield.ac.uk

Harish Tayyar Madabushi

Deep Learning for NLP, Deep Contextual Meaning Representations, Question Answering, Integrating Cognitive and Psycholinguistic Information into Deep Learning

https://www.harishtayyarmadabushi.com/

Edward Gow-Smith

Cross-Domain Idiomatic Multiword Representations for Natural Language Processing


Marcos Garcia

Lexical semantics, automatic identification and classification of multiword expressions, and multilingual NLP.

http://www.grupolys.org/~marcos/index.html

Carolina Scarton

Online content verification (misinformation detection), Personalised NLP, Text simplification, Machine Translation, Quality estimation of machine translation, Document-level evaluation of NLP tasks outputs

https://carolscarton.github.io/

Marco Idiart

Textual Simplification of Complex Expressions, Cognitive Computational Models of Natural Languages, Analysis and Integration of MultiWord Expressions in Speech and Translation.

http://www.if.ufrgs.br/~idiart/

Aline Villavicencio

Lexical semantics, multilinguality, and cognitively motivated NLP. This work includes techniques for Multiword Expression treatment using statistical methods and distributional semantic models, and applications like Text Simplification and Question Answering.

https://sites.google.com/view/alinev

SIGLEX-MWE

This task is endorsed by SIGLEX-MWE, a section of the Special Interest Group on the Lexicon (SIGLEX) of the Association for Computational Linguistics (ACL). We'd like to thank SIGLEX-MWE, in particular: