Evaluation

We use multiple metrics to measure system behaviour, arranged into three main categories: performance, efficiency, and manual annotations.

Performance metrics 

Performance metrics are intended to measure how well systems accomplish the proposed task in terms of prediction quality. Each submission will be evaluated with the following metrics:

These metrics evaluate the semantic similarity between two sentences (a reference and a candidate). To compute semantic similarity, embeddings of the two texts are first obtained using the XLM-RoBERTa model.
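As a minimal sketch of this procedure, the snippet below obtains XLM-RoBERTa embeddings for a reference and a candidate and compares them with cosine similarity. The mean pooling and cosine comparison are illustrative assumptions; the official metric definitions listed above are authoritative.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch only: mean pooling and cosine similarity are assumptions,
# not necessarily the exact aggregation used by the official metrics.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding for `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the last hidden layer.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def semantic_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between the embeddings of reference and candidate."""
    ref_emb, cand_emb = embed(reference), embed(candidate)
    return torch.nn.functional.cosine_similarity(ref_emb, cand_emb, dim=0).item()

print(semantic_similarity("Reference counter-narrative.", "Candidate counter-narrative."))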

Efficiency metrics 

Efficiency metrics are intended to measure the impact of a system in terms of the resources required and its environmental footprint. We want to recognize systems that can perform the task with minimal resource demands. This will allow us, for instance, to identify technologies that could run on a mobile device or a personal computer, as well as those with the lowest carbon footprint. Each submission must contain the following information:

A notebook with sample code to collect this information is available here.
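For orientation, the sketch below shows one way such information could be collected, assuming the codecarbon package for emission estimates and a simple wall-clock timer. The function generate_counternarratives is a hypothetical placeholder for a participant's own pipeline, and the recorded fields may differ from those required by the official notebook.

import time
from codecarbon import EmissionsTracker

def generate_counternarratives(hate_speech_inputs):
    # Hypothetical placeholder for the participant's own generation pipeline.
    return ["..." for _ in hate_speech_inputs]

tracker = EmissionsTracker()
tracker.start()
start = time.time()

outputs = generate_counternarratives(["example hate speech"])

elapsed_seconds = time.time() - start
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"Runtime: {elapsed_seconds:.2f} s")
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")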

Manual annotation metrics 

At the end of the evaluation campaign, we will draw a random sample. This sample consists of a selection of a few Hate-Speech-Counternarrative pairs from the participants' submissions (it includes pairs with both high and low scores on our performance metrics). We will then apply manual annotation metrics to this sample in order to measure how well the proposed automatic metrics align with manual annotation. Each pair in the random sample must be evaluated by human annotators with the same metrics as the CONAN-MT-SP Corpus: