Data


Datasets for the competition

We have developed a new dataset for the Claim Span Identification (CSI) task, called HECSI (Hindi-English Claim-Span Identification), containing about 16K tweets in English and Hindi. The English part contains about 8K anti-vaccine posts (tweets) about COVID-19 vaccines. Detecting such anti-vaccine claims is important for understanding people's concerns about vaccines, which in turn can help improve vaccine adoption. The Hindi part contains about 8K harmful social media posts (fake news, hate speech, etc.). Detecting fake or hateful claims is important in order to counter them. The dataset has been annotated by human workers, who marked the minimal span(s) that represent the claim(s).

The dataset contains several posts with multiple claim spans and also posts without any claim spans, which makes the task more challenging.

Note that, due to the nature of the posts, the datasets may contain profanity and abusive language directed at individuals and organizations.

Train and Validation sets

Two train sets and two validation sets will be provided after a team registers for the competition.

Each data split will be provided as a JSON file containing a list of dictionaries, one per sample. The dataset files can be opened in any text editor to inspect the data structure.
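For participants who prefer to inspect a split programmatically, the following is a minimal sketch in Python. It assumes only what is stated above, namely that a split is a JSON list of dictionaries. The helper names load_split and key_counts and the file name in the usage comment are hypothetical, and the field names inside each sample are not specified here, so the sketch discovers them rather than assuming them.

```python
import json
from collections import Counter
from pathlib import Path


def load_split(path):
    """Load one data split and check that it is a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        samples = json.load(f)
    if not isinstance(samples, list):
        raise ValueError("expected a top-level JSON list")
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {i} is not a dictionary")
    return samples


def key_counts(samples):
    """Count how often each key occurs, to discover which fields are used."""
    return Counter(key for sample in samples for key in sample)


# Usage (the file name is a placeholder, not the actual name of a released file):
# samples = load_split("train.json")
# print(len(samples), "samples")
# print(key_counts(samples))
```

Reading with UTF-8 encoding matters for the Hindi posts, and checking the key counts first is a quick way to spot samples that lack a field, such as posts without any claim span.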

Test sets

The test sets will be released later (see the "Important Dates" page for the schedule). Three test sets will be provided, and teams will need to submit their predictions on the test set(s).

Data access and usage policy

Data will be shared by email after registration. By registering for this competition, every participant agrees to use the data only for non-profit academic and research purposes.

Resources for beginners