Evaluation

Task 1

The initial gold standard for the tasks is generated through human annotation. We also plan to employ a pooling mechanism, i.e., manually assessing the top-ranked tweets from all runs submitted to the track (as is commonly done in TREC tracks).

Standard IR measures such as Precision, Recall, MAP, and F-score will be used to evaluate the runs. In Task 1, higher credit will be given to runs that identify a larger number of claim/fact-checkable tweets.
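As a rough illustration of how these measures behave (this is not the official scorer), the Python sketch below computes Precision, Recall, and Average Precision for a single run's ranked list of tweet IDs against a gold-standard set of fact-checkable tweets; MAP is then the mean of the Average Precision values across topics. The function name and input format are illustrative assumptions.

# Unofficial sketch: Precision, Recall, and Average Precision for one run,
# given its ranked tweet IDs and the gold-standard set of fact-checkable tweets.
# MAP is the mean of the Average Precision values over all topics.
def precision_recall_ap(ranked_ids, relevant_ids):
    relevant_ids = set(relevant_ids)
    hits = 0
    precisions_at_hits = []  # precision at each rank where a relevant tweet is found
    for rank, tweet_id in enumerate(ranked_ids, start=1):
        if tweet_id in relevant_ids:
            hits += 1
            precisions_at_hits.append(hits / rank)
    precision = hits / len(ranked_ids) if ranked_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    average_precision = sum(precisions_at_hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall, average_precision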

Task 2

Output Format

Each run submitted by a team should consist of one CSV file containing two columns: the tweet ID and the class predicted by your classifier. The first few rows of a sample output file are shown below:

id,pred
1325682517148569600,AntiVax
1325768441370800128,Neutral
1325770677580918785,Neutral
1325770986571096064,ProVax
...
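For reference, a minimal Python sketch that writes a prediction file in this format is given below. The file name run1.csv and the predictions mapping are placeholders, not prescribed names.

import csv

# Hypothetical predictions: tweet ID -> predicted class label.
predictions = {
    "1325682517148569600": "AntiVax",
    "1325768441370800128": "Neutral",
}

with open("run1.csv", "w", newline="") as f:   # "run1.csv" is a placeholder file name
    writer = csv.writer(f)
    writer.writerow(["id", "pred"])            # header row, as in the sample above
    for tweet_id, label in predictions.items():
        writer.writerow([tweet_id, label])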


Evaluation

The participating teams will be ranked by their performance on the test dataset. Overall accuracy and the macro-averaged F1 score over the three classes (AntiVax, Neutral, ProVax) will be used as the evaluation metrics.
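As an unofficial illustration, these two metrics can be computed with scikit-learn as sketched below, assuming gold and pred are lists of labels aligned by tweet ID (the toy values are not real data).

from sklearn.metrics import accuracy_score, f1_score

# Toy example: gold and predicted labels aligned by tweet ID.
gold = ["AntiVax", "Neutral", "ProVax", "Neutral"]
pred = ["AntiVax", "ProVax", "ProVax", "Neutral"]

accuracy = accuracy_score(gold, pred)
macro_f1 = f1_score(gold, pred, average="macro")   # unweighted mean of the per-class F1 scores
print(f"Accuracy: {accuracy:.3f}  Macro-F1: {macro_f1:.3f}")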