Task 1 Data

Dataset

The task corpus comprises tweets in Spanish and English posted by diplomatic authorities of four international actors: China, Russia, the United States, and the European Union. These authorities include government accounts, embassies, ambassadors, and other diplomatic profiles such as consuls and missions.

Task 1: propaganda identification and characterization

Task 1 of DIPROMATS 2024 comprises two annotated datasets, one of tweets in English and one of tweets in Spanish. The tweets were collected through the Twitter API for Academic Research and were published between January 1, 2020 and March 11, 2021; the latter date is the first anniversary of the declaration of the COVID-19 pandemic.

The dataset in Spanish includes 9,591 tweets published by 135 authorities and distributed as follows:

China: 2,997 tweets from 25 authorities
Russia: 1,391 tweets from 22 authorities
European Union: 2,465 tweets from 48 authorities
United States: 2,738 tweets from 40 authorities

The English dataset contains 12,012 tweets from 619 authorities, with the following distribution:

China: 3,022 tweets from 106 authorities
Russia: 2,690 tweets from 114 authorities
European Union: 2,916 tweets from 186 authorities
United States: 3,114 tweets from 216 authorities

We split the data using a temporal criterion: for each dataset, we chose the date that divides the positive tweets in a 70/30 proportion, where the 70% subset contains the oldest tweets and the 30% subset the newest. The older subset is the training set and the newer subset is the test set. Test data will be kept private to prevent overfitting in post-campaign experiments.
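
The temporal split described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the organizers' actual procedure: the `Tweet` record, its `is_positive` field, and the `temporal_split` function are hypothetical names, "positive" is assumed to mean a tweet annotated as containing propaganda, and the handling of rounding and of tweets sharing the cutoff date is a choice made for this example.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Tweet(NamedTuple):
    # Hypothetical record layout; the real dataset fields may differ.
    tweet_id: int
    created: date
    is_positive: bool  # assumed: True if annotated as propaganda


def temporal_split(
    tweets: List[Tweet], train_fraction: float = 0.7
) -> Tuple[date, List[Tweet], List[Tweet]]:
    """Pick the date that leaves about `train_fraction` of the positive
    tweets before it, then split ALL tweets at that date."""
    positive_dates = sorted(t.created for t in tweets if t.is_positive)
    if not positive_dates:
        raise ValueError("no positive tweets to split on")

    # Index of the first positive tweet assigned to the newest subset.
    k = round(train_fraction * len(positive_dates))
    k = min(max(k, 0), len(positive_dates) - 1)
    cutoff = positive_dates[k]

    # Tweets strictly older than the cutoff form the training set;
    # tweets on or after the cutoff form the test set.
    train = [t for t in tweets if t.created < cutoff]
    test = [t for t in tweets if t.created >= cutoff]
    return cutoff, train, test


# Toy data: one tweet per day for 20 days, positive on even days.
demo = [Tweet(i, date(2020, 1, i), i % 2 == 0) for i in range(1, 21)]
cutoff, train, test = temporal_split(demo)
print(cutoff, len(train), len(test))
```

Because the cutoff is a date rather than a tweet index, the resulting proportion of positive tweets is only approximately 70/30 when several positive tweets share the cutoff date.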
