DATA

To get the datasets, please send a scanned copy of the duly filled-in "organizational-access form" found here to irmidis.fire2021@gmail.com. Please mention "IRMiDis FIRE 2021" in the form, and include the team name, along with the name, affiliation, and email ID of each participant, in the email.

Task 1

The data will contain around 11,000 microblogs (tweets) posted on Twitter during the Nepal earthquake in April 2015. Along with the dataset, a sample of a few fact-checkable (claim) tweets and non-fact-checkable tweets will also be provided to the participating teams. The dataset will be provided as text files in the following format: Tweetid <||> Tweettext

Example:

592568567247212544<||>RT @NewEarthquake: 4.7 earthquake, 25km S of Kodari, Nepal. Apr 26 13:21 at epicenter (21m ago, depth 10km). http://t.co/wWqiWAQ4zr
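Since the Task 1 files use the "<||>" string as the field separator, each line can be parsed with an ordinary string split. A minimal sketch in Python (the helper name parse_line is ours, for illustration):

```python
# Parse a Task 1 line of the form: Tweetid <||> Tweettext
# Splitting on "<||>" at most once keeps the tweet text intact
# even if it happens to contain further "<" or "|" characters.

def parse_line(line: str) -> tuple[str, str]:
    tweet_id, tweet_text = line.rstrip("\n").split("<||>", 1)
    return tweet_id.strip(), tweet_text.strip()

line = ("592568567247212544<||>RT @NewEarthquake: 4.7 earthquake, "
        "25km S of Kodari, Nepal. Apr 26 13:21 at epicenter "
        "(21m ago, depth 10km). http://t.co/wWqiWAQ4zr")
tweet_id, text = parse_line(line)
print(tweet_id)  # 592568567247212544
```

The .strip() calls make the parser tolerant of optional whitespace around the separator, so both "Tweetid <||> Tweettext" and "Tweetid<||>Tweettext" layouts are handled.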


Task 2

We will be providing two data files -

  1. Train dataset: Cotfas et al. [1] provided a dataset containing stances of tweets towards COVID-19 vaccines, crawled between November and December 2020. From this dataset, we are providing 2,792 crawled tweet texts along with the tweet IDs and the class labels. The original dataset can be found at this link: [Click here].

  2. Test dataset: We crawled tweets between March and December 2020 using various vaccine-related keywords. Each tweet was annotated by three crowdworkers. For 1,600 tweets there was at least majority agreement, i.e., at least 2 of the 3 annotators provided the same label. The test dataset comprises these 1,600 tweets; for each tweet, we provide the tweet ID along with the tweet text.

Note: some of the tweets in the test dataset had unanimous agreement among the 3 annotators, while the others had only majority agreement, i.e., 2 of the 3 annotators gave the same label and the third gave a different one. Tweets with majority (but not unanimous) agreement are likely to be more subjective and hence more difficult to classify automatically.

Any standard CSV reader can be used to read the datasets. For example, in Python, the standard “csv” library or the “pandas” library can be used to read the files.
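For instance, the standard "csv" module can load the Task 2 files as shown below. This is a sketch using a small in-memory stand-in for a data file; the column names (tweet_id, tweet_text) are our assumption for illustration, so check the actual header row of the released files:

```python
import csv
import io

# A tiny in-memory stand-in for one of the Task 2 CSV files.
# Column names here are illustrative; the real files may differ.
sample = """tweet_id,tweet_text
1001,"Vaccines are now available in my city"
1002,"Not sure I trust this new vaccine yet"
"""

# csv.DictReader maps each data row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["tweet_id"])  # 1001
```

To read an actual file, replace io.StringIO(sample) with open("filename.csv", newline="", encoding="utf-8"); pandas.read_csv("filename.csv") achieves the same in one call.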


Use of other data for training classifiers:

  • Participating teams are free to use other attributes of the tweets (apart from the text) if they wish. Specifically, participants may crawl the tweets by their tweet IDs using the Twitter API, and then use other features such as the user profiles.

  • Participating teams are also allowed to use other datasets for training purposes, such as the dataset by Müller et al. [2], which contains vaccine-related tweets labelled before the COVID-19 era. The dataset can be found at this link: [Click here].

Note: If attributes other than the text or additional datasets are used for the classification, this should be clearly stated in the “working notes” submitted by the participating teams.


References:

  • [1] Cotfas, L. A., Delcea, C., Roxin, I., Ioanăş, C., Gherai, D. S., & Tajariol, F. (2021). The Longest Month: Analyzing COVID-19 Vaccination Opinions Dynamics From Tweets in the Month Following the First Vaccine Announcement. IEEE Access, 9, 33203-33223.

  • [2] Müller, M. M., & Salathé, M. (2019). Crowdbreaks: Tracking health trends using public social media data and crowdsourcing. Frontiers in public health, 7, 81.