Evaluation

We use multiple metrics to measure system behaviour, arranged into three main categories: performance, efficiency, and manual annotations.

Performance metrics 

Performance metrics are intended to measure how well systems accomplish the proposed task in terms of prediction quality. Each submission will be evaluated with the following metrics:

These metrics evaluate the semantic similarity between two sentences (a reference and a candidate). To compute semantic similarity, embeddings of the two texts are first obtained using the XLM-RoBERTa model.
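As a minimal sketch of this procedure, the snippet below obtains XLM-RoBERTa embeddings for a reference and a candidate and compares them with cosine similarity. The mean pooling and cosine comparison are illustrative assumptions; the official metric definitions listed above are authoritative.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch only: mean pooling and cosine similarity are assumptions,
# not necessarily the exact aggregation used by the official metrics.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding for `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the last hidden layer.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def semantic_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between the embeddings of reference and candidate."""
    ref_emb, cand_emb = embed(reference), embed(candidate)
    return torch.nn.functional.cosine_similarity(ref_emb, cand_emb, dim=0).item()

print(semantic_similarity("Reference counter-narrative.", "Candidate counter-narrative."))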

Efficiency metrics 

Efficiency metrics are intended to measure the impact of a system in terms of the resources required and its environmental footprint. We want to recognize systems that can perform the task with minimal resource demands. This will allow us, for instance, to identify technologies that could run on a mobile device or a personal computer, as well as those with the lowest carbon footprint. Each submission must contain the following information:

A notebook with sample code to collect this information is available here.
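For orientation, the sketch below shows one way such information could be collected, assuming the codecarbon package for emission estimates and a simple wall-clock timer. The function generate_counternarratives is a hypothetical placeholder for a participant's own pipeline, and the recorded fields may differ from those required by the official notebook.

import time
from codecarbon import EmissionsTracker

def generate_counternarratives(hate_speech_inputs):
    # Hypothetical placeholder for the participant's own generation pipeline.
    return ["..." for _ in hate_speech_inputs]

tracker = EmissionsTracker()
tracker.start()
start = time.time()

outputs = generate_counternarratives(["example hate speech"])

elapsed_seconds = time.time() - start
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"Runtime: {elapsed_seconds:.2f} s")
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")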

Manual annotation metrics 

At the end of the evaluation campaign, we will draw a random sample. This sample consists of a selection of a few Hate-Speech-Counternarrative pairs from the participants' submissions (it includes pairs with both high and low scores on our performance metrics). We will then apply manual annotation metrics to this sample in order to measure how well the proposed automatic metrics align with manual annotation. Each pair in the random sample must be evaluated by human annotators with the same metrics as the CONAN-MT-SP Corpus: