Evaluation
We use multiple metrics to measure the systems' behaviour, arranged in two main categories: performance and efficiency.
Performance metrics
Performance metrics are intended to measure how well systems achieve the proposed task in terms of prediction quality. We differentiate three types of performance measures: classification performance (binary classification), regression performance (regression errors and correlation), and latency-based measures (to promote systems that detect earlier).
Binary classification: accuracy, and micro- and macro-averaged precision, recall, and F1-score.
Regression metrics: root mean squared error (RMSE), Pearson correlation coefficient and P@k.
Latency-based: the early risk detection error (ERDE) metric or its variants.
The latency-based evaluation is only applied to the binary classification subtasks (task1a, task2a and task3a) and the multiclass classification subtask (task2c). Only the first positive decision for a subject is taken into account: if a subject is evaluated as positive in one round and as negative in a later round, the later evaluation is ignored, because the positive decision has already been sent.
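For reference, the original ERDE formulation (Losada and Crestani, 2016), on which the variants build, combines classification costs with a latency penalty; the exact costs and variant used in this campaign may differ:

\mathrm{ERDE}_o(d, k) =
\begin{cases}
c_{fp} & \text{if } d \text{ is positive and the subject is negative (false positive)}\\
c_{fn} & \text{if } d \text{ is negative and the subject is positive (false negative)}\\
lc_o(k)\, c_{tp} & \text{if } d \text{ is positive and the subject is positive (true positive)}\\
0 & \text{otherwise (true negative)}
\end{cases}
\qquad
lc_o(k) = 1 - \frac{1}{1 + e^{\,k - o}}

where d is the decision issued for a subject, k is the number of that subject's writings processed before the decision, and o is a parameter that controls how quickly the latency penalty grows for correct positive decisions.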
Efficiency metrics
Efficiency metrics are intended to measure the impact of the system in terms of the resources it needs and its environmental impact. We want to recognize those systems that are able to perform the task with minimal demand for resources. This will allow us, for instance, to identify technologies that could run on a mobile device or a personal computer, along with those with the lowest carbon footprint. To this end, each submission (each prediction sent to the server) must contain the following information:
Total RAM needed
Total % of CPU usage
Floating Point Operations per Second (FLOPS)
Total time to process (in milliseconds)
CO2 emissions in kg. For this, the CodeCarbon tool will be used.
A notebook with sample code to collect this information is available here.
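As an illustration, the minimal sketch below (assuming the codecarbon and psutil packages are installed; predict_round is a hypothetical stand-in for a team's own pipeline) shows one way to collect these measurements around a prediction step. FLOPS depend on the model architecture and have to be estimated separately.

# Minimal sketch: measuring resource usage and CO2 emissions around a prediction step.
# Assumes the `codecarbon` and `psutil` packages are installed; `predict_round` is a
# placeholder for the team's own prediction pipeline. FLOPS are not measured here.
import time
import psutil
from codecarbon import EmissionsTracker

def measure_prediction(predict_round, messages):
    tracker = EmissionsTracker(save_to_file=False)   # estimates kg of CO2-equivalent
    process = psutil.Process()

    tracker.start()
    start = time.perf_counter()
    predictions = predict_round(messages)            # the team's own model goes here
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    emissions_kg = tracker.stop()                    # kg of CO2 emitted during the call

    stats = {
        "ram_mb": process.memory_info().rss / 1024 ** 2,   # RAM used by the process
        "cpu_percent": psutil.cpu_percent(interval=None),  # % CPU usage since last call
        "time_ms": elapsed_ms,                             # total processing time
        "co2_kg": emissions_kg,                            # CodeCarbon estimate
    }
    return predictions, stats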
Server
When registering for the task, each team will be provided with a token identifier to be used throughout this evaluation campaign. With this token, teams will be able to download the trial, training, and test datasets as well as send their predictions.
Teams can download the trial and training data with the token; the test data, however, is served iteratively. The team has to connect to our server using its token, and the server will provide user writings round by round: each request is answered with one new writing per user. The server only provides the writings of the current round, so the team should keep a record of all users and their writings across rounds. After each request, the team has to send back to the server its prediction for each user.
NOTE: The server will be opened on the dates indicated so that the evaluation phase can be simulated with the trial data, even though the trial data can also be downloaded directly.
Download Trial Data and Train Data
Trial and training datasets will be available according to the dates indicated. To obtain them, teams must access the server at the address given, using the token identifier provided, and extract the data corresponding to the tasks in which they are signed up.
Trial data:
For Task1 (1a, 1b)
Trial data for Task1 (1a, 1b): URL/task1/download_trial/{token}
Golden truth for each subtask: URL/{subtask}/download_trial/{token}
Replace {subtask} with the corresponding subtask: task1a, task1b.
Replace {token} with the token identifier provided.
For Task2 (2a, 2b, 2c, 2d)
Trial data for Task2 (2a, 2b, 2c, 2d): URL/task2/download_trial/{token}
Golden truth for each subtask: URL/{subtask}/download_trial/{token}
Replace {subtask} with the corresponding subtask: task2a, task2b, task2c, task2d.
Replace {token} with the token identifier provided.
Train data:
For Task1 (1a, 1b)
Train data for Task1 (1a, 1b): URL/task1/download_train/{token}
Golden truth for each subtask: URL/{subtask}/download_train/{token}
Replace {subtask} with the corresponding subtask: task1a, task1b.
For Task2 (2a, 2b, 2c, 2d)
Train data for Task2 (2a, 2b, 2c, 2d): URL/task2/download_train/{token}
Golden truth for each subtask: URL/{subtask}/download_train/{token}
Replace {subtask} with the corresponding subtask: task2a, task2b, task2c, task2d.
Remember that the trial and train data are different sets, so it is recommended to merge them to obtain a larger amount of training data.
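As an illustration, a minimal download sketch in Python (the requests library is assumed; URL, the token value, and the output file names are placeholders to adapt to your setup):

# Minimal sketch: downloading trial/train data and golden truth with the team token.
# URL, TOKEN and the output file names are placeholders; the actual file format
# depends on what the server returns.
import requests

URL = "https://example.org"      # base address given by the organisers (placeholder)
TOKEN = "your-token-identifier"  # token received when registering (placeholder)

def download(path, out_file):
    response = requests.get(f"{URL}/{path}/{TOKEN}")
    response.raise_for_status()
    with open(out_file, "wb") as f:
        f.write(response.content)

# Trial and train data for Task1, plus the golden truth of subtask 1a
download("task1/download_trial", "task1_trial.dat")
download("task1/download_train", "task1_train.dat")
download("task1a/download_train", "task1a_train_gold.dat")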
Download Test Data
After the evaluation, the test dataset and the test gold labels are available in the same way as the trial and train data. To obtain them, teams must access the server at the address given, using the token identifier provided, and extract the data corresponding to the tasks in which they are signed up.
Test data:
For Task1 (1a, 1b)
Test data for Task1 (1a, 1b): URL/task1/download_test/{token}
Golden truth for each subtask: URL/{subtask}/download_test/{token}
Replace {subtask} with the corresponding subtask: task1a, task1b.
Replace {token} with the token identifier provided.
For Task2 (2a, 2b, 2c, 2d)
Test data for Task2 (2a, 2b, 2c, 2d): URL/task2/download_test/{token}
Golden truth for each subtask: URL/{subtask}/download_test/{token}
Replace {subtask} with the corresponding subtask: task2a, task2b, task2c, task2d.
Replace {token} with the token identifier provided.
GET Test Data (and trial data)
For the first GET request, the server outputs the first message of each user. To send a GET request:
Trial server: URL/{task}/getmessages_trial/{token}
Test server: URL/{task}/getmessages/{token}
The output format is the following:
[
{
"id_message": 123,
"round": 1,
"nick": "subject1",
"message": "...",
"date": "..."
},
{
"id_message": 134,
"round": 1,
"nick": "subject10",
"message": "...",
"date": "..."
},
...
]
Attributes:
id_message: internal identifier of the writing
round: the round number (starting at 1; the total number of rounds is not known in advance)
nick: the subject's alias
message: the subject's writing
date: the date of the writing, in 'YYYY-MM-DD HH:MM:SS' format
The first round contains all users in the collection (because all users have at least one message). However, after a few rounds, some users will disappear from the server's response. For example, a user with 10 messages will only appear in the first 10 rounds. Furthermore, the server does not tell the teams that a given writing is the last one in a user's thread; teams will detect the last round when they receive an empty list from the server.
After each request, the team has to run its own prediction pipeline and send back to the server its prediction for each individual. The server will always provide the next round of writings, regardless of whether it has received the responses (for all users and all runs) of the current round.
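As an illustration, a minimal sketch of the GET loop in Python (requests assumed; URL and TOKEN are placeholders), which accumulates the writings of each user until the server returns an empty list:

# Minimal sketch of the round-by-round GET loop.
# URL and TOKEN are placeholders; the prediction pipeline is left out.
import requests
from collections import defaultdict

URL = "https://example.org"       # base address given by the organisers (placeholder)
TOKEN = "your-token-identifier"   # team token (placeholder)

history = defaultdict(list)       # keep every writing of every user across rounds

while True:
    response = requests.get(f"{URL}/task1/getmessages/{TOKEN}")
    response.raise_for_status()
    writings = response.json()
    if not writings:              # empty list: no more rounds
        break
    for w in writings:
        history[w["nick"]].append(w)   # store the new writing for this user
    # ... run the prediction pipeline on `history` and POST the predictions here ...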
POST predictions of Test Data (and trial data)
Each team has a limited number of three runs for each subtask it participates in. To submit predictions, the team needs to send a POST request. A team that chooses not to use all three runs can skip a run by not sending the corresponding POST request and simply making the next GET request to proceed to the next round. To send a POST request:
Trial server: URL/{subtask}/submit_trial/{token}/{run}
Test server: URL/{subtask}/submit/{token}/{run}
For each subtask, the prediction file to be sent has a different format.
For the subtasks associated with binary classification (task1a, task2a, and task3a), predictions will be 0 for “control” (negative) or 1 for “suffer” (positive). The structure would be as follows:
[
{
"predictions":
{
"subject1": 1,
"subject10": 0,
...
},
"emissions":
{
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188
}
}
]
For the subtasks associated with simple regression (task1b, task2b and task3b), predictions provide the probability that a user suffers: a value of 0 means 100% negative and a value of 1 means 100% positive. The structure would be as follows:
[
{
"predictions":
{
"subject1": 0.2,
"subject10": 0.9,
...
},
"emissions":
{
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188
}
}
]
For the subtask associated with multiclass classification (task2c), the predictions must be equal to one of the marked labels (“suffer+against”, “suffer+in favour”, “suffer+other”, “control”). The structure would be as follows:
[
{
"predictions":
{
"subject1": "control",
"subject10": "suffer+against",
...
},
"emissions":
{
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188
}
}
]
For the subtask associated with multi-output regression (task2d), the predictions provide the probability that a user belongs to each of the marked labels (“suffer+against”, “suffer+in favour”, “suffer+other”, “control”). For each user, the probabilities should add up to 1. The structure would be as follows:
[
{
"predictions": {
"subject1": {
"suffer+against": 0.5,
"suffer+in favour": 0.1,
"suffer+other": 0.2,
"control": 0.2
},
"subject10": {
"suffer+against": 0.1,
"suffer+in favour": 0.2,
"suffer+other": 0.1,
"control": 0.6
},
...
},
"emissions": {
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188
}
}
]
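As an illustration, a minimal sketch of one POST submission for a binary classification subtask in Python (requests assumed; URL, TOKEN, and the prediction and emissions values are placeholders taken from the example above):

# Minimal sketch of a POST submission for a binary classification subtask (run 0).
# URL, TOKEN and the values below are placeholders.
import requests

URL = "https://example.org"
TOKEN = "your-token-identifier"
RUN = 0

payload = [{
    "predictions": {"subject1": 1, "subject10": 0},
    "emissions": {
        "duration": 0.01, "emissions": 3.67552e-08,
        "cpu_energy": 8.120029e-08, "gpu_energy": 0, "ram_energy": 5.1587e-12,
        "energy_consumed": 8.1205e-08, "cpu_count": 1, "gpu_count": 1,
        "cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz", "gpu_model": "1 x Tesla T4",
        "ram_total_size": 12.681198120117188,
    },
}]

response = requests.post(f"{URL}/task1a/submit/{TOKEN}/{RUN}", json=payload)
response.raise_for_status()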
To facilitate participation, we have prepared an example of a client application that communicates with the server. The notebook is here.
IMPORTANT NOTE 1:
We have provided a repository where you can find the evaluation script and the rest of the scripts related to the competition.
IMPORTANT NOTE 2:
The procedure is the same for all tasks. However, since the same dataset is used by all subtasks of a task, every GET request advances the round for the whole task, whichever subtask it was made for, because all subtasks read from the same thread of writings.
Example: If a team participates in subtasks 1a and 1b, it may make only one GET request per round, before submitting its predictions for all the subtasks it participates in. With 3 runs, this team will have to make a total of 1 GET request and 3 POST requests per subtask in each round.
GET request:
URL/task1/getmessages/{token}
POST request:
URL/task1a/submit/{token}/0
URL/task1a/submit/{token}/1
URL/task1a/submit/{token}/2
URL/task1b/submit/{token}/0
URL/task1b/submit/{token}/1
URL/task1b/submit/{token}/2
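Put together, one round for this team could look like the following sketch (same placeholders as above; build_payload is a hypothetical helper that assembles the predictions and emissions of a given run):

# Minimal sketch of one full round for a team in subtasks 1a and 1b with 3 runs:
# exactly one GET for the task, then one POST per subtask and run.
import requests

URL = "https://example.org"       # placeholder
TOKEN = "your-token-identifier"   # placeholder

writings = requests.get(f"{URL}/task1/getmessages/{TOKEN}").json()  # advances the round

for subtask in ("task1a", "task1b"):
    for run in (0, 1, 2):
        payload = build_payload(subtask, run, writings)  # hypothetical helper
        requests.post(f"{URL}/{subtask}/submit/{TOKEN}/{run}", json=payload)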
If a team does not submit their predictions using the POST request before making a new GET request, their round will automatically advance and any previous predictions will be invalid. The team must submit predictions for each round using the POST request to make sure their runs are considered valid.