Performance metrics are intended to measure how well systems achieve the proposed task in terms of prediction quality. We distinguish two types of measures: classification performance and latency-based measures.
Task 1: ERDE, MAE, RMSE, F1-macro,...
Task 2: Cohen's kappa, to measure agreement between the expert and the participant's system in this setting, and accuracy.
Efficiency metrics are intended to measure the impact of the system in terms of resources needed and environmental issues. We want to recognize those systems that are able to perform the task with minimal demand for resources. This will allow us to, for instance, identify those technologies that could run on a mobile device or a personal computer, along with those with the lowest carbon footprint. To this end, each submission (each prediction sent to the server) must contain the following information:
Total RAM needed
Total % of CPU usage
Floating Point Operations per Second (FLOPS)
Total time to process (in milliseconds)
CO2 emissions in kg. For this, the CodeCarbon tool will be used.
A notebook with sample code to collect this information is available here.
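The efficiency block can be assembled around any prediction call. The sketch below is an assumption about how this might look, not the official notebook: `build_emissions_record` and `track` are hypothetical helpers, the field names follow the "emissions" object shown in the submission examples later in this document, and CodeCarbon is imported lazily so the code still runs (timing only) when the package is not installed.

```python
import time

def build_emissions_record(duration_s, emissions_kg, cpu_energy=0.0,
                           gpu_energy=0.0, ram_energy=0.0):
    """Assemble the per-submission efficiency block (field names as in the
    submission examples in this document)."""
    return {
        "duration": duration_s,
        "emissions": emissions_kg,   # kg of CO2, as reported by CodeCarbon
        "cpu_energy": cpu_energy,
        "gpu_energy": gpu_energy,
        "ram_energy": ram_energy,
        "energy_consumed": cpu_energy + gpu_energy + ram_energy,
    }

def track(fn, *args, **kwargs):
    """Run fn under CodeCarbon if installed; otherwise just time it."""
    tracker = None
    try:
        from codecarbon import EmissionsTracker  # pip install codecarbon
        tracker = EmissionsTracker(log_level="error")
        tracker.start()
    except Exception:
        tracker = None                # fall back to wall-clock timing only
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    duration = time.perf_counter() - t0
    emissions_kg = (tracker.stop() or 0.0) if tracker else 0.0
    return result, build_emissions_record(duration, emissions_kg)
```

`track` wraps one round of predictions and returns both the model output and the efficiency record to attach to the submission.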
Each team will be provided with a token identifier when registering for the task; this token is used throughout the evaluation campaign. With it, teams can retrieve the trial and test data and send their predictions.
Teams can download the trial data with the token; the test data, however, will be served by the server. The team has to connect to our server using its token, and the server will iteratively provide user writings in rounds: each request is answered with a server response containing one new writing per user. The server only provides the writings for the current round, so the team should keep a record of all users and their writings in every round. After each request, the team has to send back to the server its prediction for each user.
NOTE: The server will be open on the dates indicated so that the evaluation phase can be simulated with the trial data, even though the trial data can also be downloaded directly.
The datasets will be available on the dates indicated. To extract them, teams must access the server at the address given, using the token identifier provided, to obtain the data for the tasks they are signed up for.
Trial data:
For Task1: URL/task1/download_trial/{token}
For Task2: URL/task2/download_trial/{token}
For the first GET request, the server returns the first message of each user. To send a GET request:
Trial server: URL/{task}/getmessages_trial/{token}
Test server: URL/{task}/getmessages/{token}
Replace {task} with one of these strings: "task1" or "task2".
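The endpoint patterns above can be exercised with any HTTP client; a minimal sketch using only the Python standard library follows. `BASE_URL` is a placeholder for the server address provided at registration, and `round_url`/`get_round` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://example.org"   # placeholder: replace with the address provided

def round_url(task, token, trial=False):
    """Build the getmessages endpoint for "task1" or "task2"."""
    endpoint = "getmessages_trial" if trial else "getmessages"
    return f"{BASE_URL}/{task}/{endpoint}/{token}"

def get_round(task, token, trial=False):
    """GET one round of user writings and parse the JSON response."""
    with urlopen(round_url(task, token, trial)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```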
The output format is the following:
{
"session_10": {
"round": 1,
"patient_input": "Hola. Pues llevo desde hace mucho sintiendome con mucha ansiedad por basicamente casi todo"
},
"session_14": {
"round": 1,
"patient_input": "Estoy un poco agotada. Pero cuando paso la situación me dejan aliviada porque siento que al hacerlo he evitado que pasara algo malo."
},
...
}
Attributes:
id: internal identifier of the session (each session belongs to a different user)
round: the number of the round (from 1 to unknown)
patient_input: the patient's message
The first round contains all users in the collection (because all users have at least one message). However, after a few rounds, some users will disappear from the server's response. For example, a user with 10 messages will only appear in the first 10 rounds. Furthermore, the server does not inform the teams that a given user writing is the last one in the user's thread. The last round will be detected by the teams when they receive an empty list from the server.
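The round loop described above can be sketched as follows. `get_round` and `submit_predictions` are hypothetical stand-ins for the GET and POST calls; the loop keeps every user's writings across rounds and stops when the server returns an empty response.

```python
from collections import defaultdict

def run_rounds(get_round, submit_predictions):
    """Accumulate writings per session until the server's response is empty."""
    history = defaultdict(list)          # session id -> list of writings
    round_no = 0
    while True:
        batch = get_round()              # e.g. {"session_10": {...}, ...}
        if not batch:                    # empty response marks the last round
            break
        round_no += 1
        for session_id, item in batch.items():
            history[session_id].append(item["patient_input"])
        submit_predictions(round_no, history)
    return history
```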
After each request, the team has to run its own prediction pipeline and give back to the server its prediction about each individual.
{
"session_10": {
"round": 1,
"patient_input": "Hola. Pues llevo desde hace mucho sintiendome con mucha ansiedad por basicamente casi todo",
"option_1": "Eso debe ser muy abrumador. ¿Puedes decirme un poco más sobre lo que te causa ansiedad?",
"option_2": "Entiendo. ¿Podrías describir un poco más cuándo aparece esa ansiedad?",
"option_3": "Entiendo que la ansiedad te ha estado acompañando durante mucho tiempo."
},
"session_14": {
"round": 1,
"patient_input": "Estoy un poco agotada. Pero cuando paso la situación me dejan aliviada porque siento que al hacerlo he evitado que pasara algo malo.",
"option_1": "Entiendo. Parece que esas estrategias te dan un cierto sentido de control y te alivian en el momento, pero también te exigen mucha energía. ",
"option_2": "Entiendo. Ese sentimiento de alivio después de haber pasado la situación es un mecanismo de supervivencia muy común.",
"option_3": "Comprendo. Esa sensación de alivio es muy reconfortante, pero después, por lo que me has contado, continúas sintiéndote mal, incluso se intensifican. "
},
...
}
Attributes:
id: internal identifier of the session (each session belongs to a different user)
round: the number of the round (from 1 to unknown)
patient_input: the patient's message
option_{1,2,3}: the three possible response options for the therapist in that context
As in Task 1, the first round contains all users, users disappear from the response once their threads end, and the last round is detected when the server returns an empty list. After each request, the team has to run its own prediction pipeline and send back to the server its prediction for each individual.
Each team has a limited number of three runs for each subtask it participates in. To submit predictions, the team sends a POST request. A team that chooses not to use all three runs can skip a run by omitting the corresponding POST request and simply making the next GET request to proceed to the next round. To send a POST request:
Trial server: URL/{task}/submit_trial/{token}/{run}
Test server: URL/{task}/submit/{token}/{run}
For each subtask, the prediction file to be sent has a different format.
For Task 1, systems are required to predict questionnaire responses at the item level for each patient turn. For every patient turn, the system must return three sets of predictions, corresponding to the following questionnaires: GAD-7 → 7 values, PHQ-9 → 9 values, CompACT-10 → 10 values. Each predicted value must correspond to a valid response option in the respective questionnaire:
GAD-7 and PHQ-9: Integer values in the range [0, 3]
CompACT-10: Integer values in the range [0, 6]
The number of values must exactly match the number of items in the questionnaire.
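These constraints can be checked locally before submission. The sketch below is a hypothetical validator (not part of the official tooling); the questionnaire keys and the item counts and value ranges are taken from the rules above.

```python
# questionnaire -> (number of items, maximum valid integer response)
LIMITS = {"GAD-7": (7, 3), "PHQ-9": (9, 3), "CompACT-10": (10, 6)}

def validate_prediction(pred):
    """Return True iff pred has the right item counts and value ranges."""
    for name, (n_items, max_val) in LIMITS.items():
        values = pred.get(name, [])
        if len(values) != n_items:
            return False
        if not all(isinstance(v, int) and 0 <= v <= max_val for v in values):
            return False
    return True
```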
[
{
"predictions":[
{
"id": "session_10",
"round": 1,
"prediction": {
"GAD-7": [3, 0, 1, 2, 2, 3, 1],
"PHQ-9": [0, 2, 3, 1, 2, 3, 0, 2, 1],
"CompACT-10": [1, 4, 6, 2, 5, 3, 0, 1, 2, 4]
}
},
{
"id": "session_14",
"round": 1,
"prediction": {
"GAD-7": [3, 0, 1, 2, 2, 3, 1],
"PHQ-9": [0, 2, 3, 1, 2, 3, 0, 2, 1],
"CompACT-10": [1, 4, 6, 2, 5, 3, 0, 1, 2, 4]
}
},
...
],
"emissions":
{
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188,
"country_iso_code": "USA"
}
}
]
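A run can then be packaged and POSTed as in the example above. This is a sketch under assumptions: `BASE_URL` is a placeholder for the address provided, and `build_submission`/`submit` are hypothetical helper names; the payload shape mirrors the Task 1 example.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://example.org"   # placeholder: replace with the address provided

def build_submission(predictions, emissions):
    """Wrap per-session predictions and the CodeCarbon block as one run payload."""
    return [{"predictions": predictions, "emissions": emissions}]

def submit(task, token, run, payload):
    """POST one run (0, 1 or 2) to the test server and return the HTTP status."""
    req = Request(
        f"{BASE_URL}/{task}/submit/{token}/{run}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status
```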
For Task 2, the prediction is one of "option_1", "option_2" or "option_3". The structure is as follows:
[
{
"predictions": [
{
"id": "session_10",
"round": 1,
"prediction": "option_1"
},
{
"id": "session_14",
"round": 1,
"prediction": "option_3"
},
...
],
"emissions":
{
"duration": 0.01,
"emissions": 3.67552e-08,
"cpu_energy": 8.120029e-08,
"gpu_energy": 0,
"ram_energy": 5.1587e-12,
"energy_consumed": 8.1205e-08,
"cpu_count": 1,
"gpu_count": 1,
"cpu_model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
"gpu_model": "1 x Tesla T4",
"ram_total_size": 12.681198120117188,
"country_iso_code": "USA"
}
}
]
To facilitate participation, we have prepared an example of a client application that communicates with the server. The notebook is here.
NOTE 1:
We have provided a repository where you can find the evaluation script for 2023 and the rest of the scripts related to the competition.
NOTE 2:
Each time a team makes a GET request to the server, it is provided with the data for the round it is currently in. For the data to be updated to the next round, the team has to make all of its POST requests, sending its predictions (or empty predictions if it does not want to use all 3 runs).
However, if a team participates in both Task 1 and Task 2, since the same dataset is used, the team must send six POST requests to update the round. Example:
GET request: URL/task1/getmessages/{token} # team get round 1 data for task 1
POST request:
URL/task1/submit/{token}/0
URL/task1/submit/{token}/1
URL/task1/submit/{token}/2
GET request: URL/task2/getmessages/{token} # team get round 1 data for task 2
POST request:
URL/task2/submit/{token}/0
URL/task2/submit/{token}/1
URL/task2/submit/{token}/2
GET request: URL/task1/getmessages/{token} # team get round 2 data
The team must submit predictions for each round using the POST request to make sure their runs are considered valid.
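The GET/POST ordering above can be sketched as one round-advancing step for a team registered in both tasks. `get_round`, `submit` and `predict` are hypothetical callables wrapping the endpoints shown in the example; all three runs are POSTed for each task before moving on.

```python
def advance_round(get_round, submit, predict):
    """One full round for a dual-task team: GET, then POST runs 0-2, per task."""
    for task in ("task1", "task2"):
        batch = get_round(task)
        if not batch:
            return False                 # empty response: collection exhausted
        for run in (0, 1, 2):
            # Empty predictions are still POSTed if a run is unused.
            submit(task, run, predict(task, run, batch))
    return True
```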