Task 2. Automatic Detection of Narratives from Diplomats of Major Powers
Evaluation
Teams will submit their responses to the organisers, who will score them against the gold-standard and publish the results on a leaderboard.
The gold-standard will not be made publicly available, to avoid LLM contamination and to preserve future evaluations on the dataset.
For every tweet in the test set, systems must return a binary answer <yes, no> for each of the 24 narratives, indicating whether the tweet promotes the narrative or not.
However, the gold-standard (and the few-shot development set) will be annotated with three values:
Yes, when the reading of the tweet is clearly in favour of the narrative. It is one of its main communicative intentions.
Leaning, when the narrative is not a primary communicative intention but some reading of the tweet may support it. In other words, the narrative could be a secondary communicative intention.
No, when the tweet is completely unrelated to the narrative, or doesn't support it in any reading.
Therefore, since the system responses are binary, there will be two main evaluation measures depending on how the Leaning cases are considered.
Evaluation measures
Strict F1, measuring system performance in identifying tweets that have the narrative as a primary communicative intention. Strict F1 will be calculated from the precision and recall over class Yes for each narrative. In other words, the Leaning cases in the gold-standard will be considered as class No.
Lenient F1, measuring system performance in identifying tweets that have the narrative as a primary or secondary communicative intention. Lenient F1 will be calculated from the precision and recall over classes Yes and Leaning for each narrative. Each case labelled as Leaning in the gold-standard will be counted as Yes or No, whichever agrees with the response of the system under evaluation. Lenient F1 can therefore be regarded as an upper bound on system performance.
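The two measures can be illustrated with a minimal Python sketch for a single narrative. It is not the official scorer: the function name f1_for_narrative, the label strings "yes", "leaning" and "no", and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration. How per-narrative scores are aggregated into a final score is not stated above and is left out.

```python
def f1_for_narrative(gold, pred, mode="strict"):
    """F1 over the positive class for one narrative (illustrative sketch).

    gold: list of "yes" / "leaning" / "no" labels (assumed encoding).
    pred: list of booleans, True when the system answers yes.
    mode: "strict" treats Leaning as No; "lenient" lets each Leaning
          case take whichever label agrees with the system response.
    """
    if mode not in ("strict", "lenient"):
        raise ValueError("mode must be 'strict' or 'lenient'")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == "leaning":
            # Strict: Leaning counts as No. Lenient: Leaning copies the
            # system answer, so it can never produce an error.
            gold_positive = p if mode == "lenient" else False
        else:
            gold_positive = (g == "yes")

        if p and gold_positive:
            tp += 1
        elif p and not gold_positive:
            fp += 1
        elif not p and gold_positive:
            fn += 1

    # Assumption: undefined precision/recall/F1 are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data for one narrative (invented for this example).
gold = ["yes", "leaning", "leaning", "no", "yes"]
pred = [True, True, False, True, False]
strict = f1_for_narrative(gold, pred, mode="strict")
lenient = f1_for_narrative(gold, pred, mode="lenient")
```

On this toy data the strict score is lower than the lenient one, because the system's yes on the first Leaning tweet counts as a false positive under the strict reading and as a true positive under the lenient one.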
Baselines
The organisers will provide the results of a zero-shot Mixtral 8x7B model with 4-bit quantization.
Input format
Input data will be in JSON format, with a list of narratives and a list of tweets, as in the following example:
{
"narratives":[
{
"n_id":"CH1",
"country":"China",
"title":"The West is immoral, hostile and decadent.",
"description":"Tweets depict the West, primarily the US, as immoral and hostile, positioning. China may appear as a victim of Western reckless behavior."
},
... ],
"dataset":[
{
"id":0,
"country":"China",
"tweet_id":"1303763377999749121",
"text":"Difamación y calumnias, intervención en los asuntos ajenos, sanciones unilaterales, provocación militar, la ineficiencia contra covid19, doble rasero d derechos humanos y aislacionismo constituyen los ¨Siete pecados capitales estadounidenses contra China¨",
"username":"embchinacuba",
"UTC":"2020-09-09 18:33:06+00:00",
"Tweet Type":"Tweet",
"rt&fav":11
},
... ]
}
Output format
The output data will be in JSON format and will consist of a list of tweets, each indicating its language (en for English, es for Spanish) and the narratives it promotes.
{
"responses": [
{
"language": "es",
"id": 345,
"country":"China",
"tweet_id":"1303763397491",
"narratives": ["CH1", "CH3"]
},
{
"language": "en",
"id": 456,
"country":"China",
"tweet_id":"3763377749121",
"narratives": []
},
...
]
}
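A minimal sketch of how a submission in this format might be serialized with Python's standard json module is shown here. The helper name build_submission and the shape of its input (one dictionary per tweet that already carries the language and the predicted narrative identifiers) are assumptions for illustration, not part of the task definition. Using json.dumps guarantees valid JSON, for instance without trailing commas.

```python
import json


def build_submission(predictions):
    """Serialize per-tweet predictions to the output format (sketch).

    predictions: iterable of dicts with the keys "language", "id",
    "country", "tweet_id" and "narratives" (hypothetical input shape).
    """
    responses = []
    for p in predictions:
        responses.append({
            "language": p["language"],           # "en" or "es"
            "id": p["id"],
            "country": p["country"],
            "tweet_id": p["tweet_id"],
            # Empty list when the tweet promotes no narrative.
            "narratives": list(p["narratives"]),
        })
    # ensure_ascii=False keeps any non-ASCII characters readable.
    return json.dumps({"responses": responses}, ensure_ascii=False, indent=2)


# Toy predictions mirroring the example above.
submission = build_submission([
    {"language": "es", "id": 345, "country": "China",
     "tweet_id": "1303763397491", "narratives": ["CH1", "CH3"]},
    {"language": "en", "id": 456, "country": "China",
     "tweet_id": "3763377749121", "narratives": []},
])
```

Keeping tweet_id as a string, as in the examples above, avoids precision loss in tools that read large integers as floating-point numbers.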
Submission procedure
(TBA)