Tasks 1 & 5: Check-Worthiness

Don't forget to register through the CLEF 2020 Lab Registration before 26 April 2020, using this link. Otherwise, your submission will NOT be considered!

Task 1: Tweet Check-Worthiness

Definition

Task Definition: Given a topic and a stream of potentially related tweets, rank the tweets according to their check-worthiness for the topic. This task will run in English and Arabic.

Tweet check-worthiness: A check-worthy tweet is one that includes a claim of interest to a large audience (especially journalists), a claim that might have a harmful effect, and so on.

Evaluation

This task is evaluated as a ranking task. The ranked list per topic will be evaluated using ranking evaluation measures (MAP and P@k for k = 5, 10, ..., 30). The official measure is P@30 for the Arabic dataset and MAP for the English dataset.
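For reference, these measures can be computed as in the minimal sketch below, which assumes binary relevance labels for the ranked tweets. The official scorer may differ in details (e.g., how unjudged tweets or ties are handled), so treat this only as an illustration of the measures' definitions.

```python
def precision_at_k(ranked_labels, k):
    """P@k: fraction of the top-k ranked items that are relevant.
    ranked_labels: 0/1 relevance labels in ranked order."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """AP: mean of P@i over the positions i where a relevant item occurs,
    normalized here by the number of relevant items retrieved (a common
    convention). MAP is the mean of AP over all topics."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0
```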

Submission Runs

Each team can submit up to 2 manual and 4 automatic runs as follows:

  • For each run, you will have to explicitly indicate if it is “external” (i.e., uses external data) or not.
  • Pre-trained models (not labelled for fact checking, e.g., embeddings or word statistics) are not considered external.
  • At least one of the runs must be automatic and must not use external data.

Submission Format

Submit one separate results file per run. Tweets per topic must be sorted by rank (from rank 1 to n). For each run, use the following format.

Arabic Dataset

The results file should include a ranking of the top 500 check-worthy claims per topic. It must include one tab-separated line per tweet, formatted as follows:

topicID  rank  tweetID  score  runID
CT20-AR-05  1  1219151214690041857  0.74  teamXrun1
CT20-AR-05  2  1217636592908689409  0.20  teamXrun1
CT20-AR-05  3  1218603003755798529  0.15  teamXrun1
...

where score is a number indicating the check-worthiness of the tweet, rank is the tweet's position when sorted by score, and runID is a unique identifier for one of the team's runs.
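As a sketch, a run file in this format could be produced with a helper like the one below. Note that write_run is a hypothetical function written for illustration, not part of any official toolkit, and the topic and run IDs used are just examples.

```python
def write_run(path, topic_id, scored_tweets, run_id, max_rank=500):
    """Append one topic's ranking to a run file in the required
    tab-separated format: topicID, rank, tweetID, score, runID.
    scored_tweets: iterable of (tweet_id, score) pairs, in any order."""
    ranked = sorted(scored_tweets, key=lambda t: t[1], reverse=True)[:max_rank]
    with open(path, "a", encoding="utf-8") as f:
        for rank, (tweet_id, score) in enumerate(ranked, start=1):
            f.write(f"{topic_id}\t{rank}\t{tweet_id}\t{score:.2f}\t{run_id}\n")
```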

You can find the Arabic dataset and more details about it here.

English Dataset

The results file should include a ranking of the top 1000 check-worthy claims per topic. It must include one tab-separated line per tweet, formatted as follows:

topicID  tweetID  score  runID
covid-19 1235648554338791427 0.39 Model_1
covid-19 1235287380292235264 0.61 Model_1
covid-19 1236020820947931136 0.76 Model_1
...

where score is a number indicating the check-worthiness of the tweet, and runID is a unique identifier for one of the team's runs.
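Before submitting, it can help to sanity-check the file format. The sketch below assumes exactly four tab-separated columns per line (topicID, tweetID, score, runID) and the 1000-tweet-per-topic limit above; validate_run is a hypothetical helper, not an official scorer.

```python
def validate_run(path, max_per_topic=1000):
    """Check a four-column tab-separated run file.
    Returns a dict of per-topic line counts; raises ValueError on bad lines."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 4:
                raise ValueError(f"line {n}: expected 4 fields, got {len(parts)}")
            topic, _tweet_id, score, _run_id = parts
            float(score)  # raises ValueError if score is not numeric
            counts[topic] = counts.get(topic, 0) + 1
            if counts[topic] > max_per_topic:
                raise ValueError(f"line {n}: over {max_per_topic} tweets for {topic}")
    return counts
```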

You can find the English dataset and more details about it here.

Task 5: Debate Check-Worthiness

Definition

Task Definition:

Given a political debate or a transcribed speech, segmented into sentences and with speakers annotated, identify which sentences should be prioritized for fact-checking. This is a ranking task: systems must produce a score per sentence, according to which the ranking will be performed. This task will run in English.

Here is an example:

CLINTON: I think my husband did a pretty good job in the 1990s.
CLINTON: I think a lot about what worked and how we can make it work again...
TRUMP: Well, he approved NAFTA...

Whereas Hillary Clinton discusses the job carried out by Bill Clinton in the past, Donald Trump fires back with a claim that is worth checking: that Bill Clinton approved NAFTA. Whether he did or not is definitely worth checking!

Let us look at another example:

CLINTON: Take clean energy
CLINTON: Some country is going to be the clean-energy superpower of the 21st century.
CLINTON: Donald thinks that climate change is a hoax perpetrated by the Chinese.
CLINTON: I think it's real.
TRUMP: I did not.

Checking whether Donald Trump's thoughts about climate change are as claimed is definitely worth doing as well!

Evaluation

This task is evaluated as a ranking task. Every line of the debate needs to have a score assigned; the resulting ranking will then be evaluated using ranking evaluation measures (MAP and P@k for k = 5, 10, ..., 30). The official measure is MAP.

Submission Format

Submit one separate results file per run. Lines per debate must appear in sequential order. For each run, use the following format.

line_number score
1 0.9056 
2 0.6862 
3 0.7665
...

Your result file MUST contain scores for all lines of the input file. Otherwise the scorer will return an error and no score will be computed.
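A simple way to guarantee full coverage is to emit exactly one score per input line, as in the sketch below. Here write_debate_run is a hypothetical helper and the score values are purely illustrative.

```python
def write_debate_run(path, scores):
    """Write one 'line_number score' pair per debate sentence,
    1-indexed and covering every line of the input file."""
    with open(path, "w", encoding="utf-8") as f:
        for i, score in enumerate(scores, start=1):
            f.write(f"{i} {score:.4f}\n")
```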

Please check all the details here.