The shared task dataset contains 297,704 Twitter reply pairs: original tweets and their response tweets (some of which include an animated GIF). Each response tweet includes both a reply text and an mp4 field, which holds the file name of the MP4 version of the original GIF. A label is provided at the end of every sample, revealing whether the original tweet is fake news or not. Positive samples are tweets with the hashtag #fakenews; negative samples are taken from last year's EmotionGIF dataset.
For more GIF information, see the example on the right: it shows an original tweet ("Tomorrow looks like...") together with a response tweet. The response includes both a reply text ("Hell yeah") and an animated GIF, which in this case belongs to the "applause" category.
The dataset is split into the following three files:
train.json: 168,522 samples with gold labels, to be used for training the model.
dev.json: 40,487 unlabeled samples used for practice.
eval.json: 88,665 unlabeled samples used for evaluation.
The files are all in JSON Lines format (each line in the file is a JSON value, representing one sample).
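Since each line of the files is a standalone JSON value, the data can be loaded with a few lines of standard-library Python. The following is a minimal sketch; the helper name `load_jsonl` is our own, not part of the task's tooling.

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples

# e.g. train = load_jsonl("train.json")
```

The same function works for all three files, since they share the format.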
Each sample includes the text of the original tweet, and information about the response: the reply text, the category(ies) of the GIF response, and the label of the tweet.
Here is the description of the JSON fields in each sample:
idx: running index of the samples, ranging from 0 to 44,000.
text: the text of the original tweet; may include mentions (@user), hashtags (#example), emojis, etc. Emojis are represented as Unicode characters, so participants may try to extract information from them.
context_idx: the index of the response tweet within the given idx. The same idx may appear with several different context_idx values, each referring to a different response to the same original tweet.
reply: the text content of the response tweet. In cases where the reply only contained a GIF response, this field will be an empty string ("reply": "").
categories: the GIF category (or categories) of the GIF response included in the reply tweet, drawn from a list of 44 categories. Not all replies contain a GIF response, so this field may be empty.
mp4: the file name of the MP4 version of the animated GIF response; a ZIP file with all MP4s is available for download. The MP4 files are provided for completeness only; we do not expect participants to use the video files as part of their model features.
The training data also contains the gold label:
label: the label of the tweet. "fake" means the original tweet is fake news; "real" means it is true news. Consequently, samples with the same idx (the same text) share the same label.
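Because the label is attached at the reply level but belongs to the original tweet, it can be convenient to regroup the samples per idx before training. A minimal sketch (the helper name `group_by_tweet` and the toy samples are our own, for illustration only):

```python
from collections import defaultdict

def group_by_tweet(samples):
    """Group reply-level samples by the original tweet's idx.

    Every sample with the same idx shares the same text and, in the
    training data, the same label, so the label can be read off any
    member of the group.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[s["idx"]].append(s)
    return groups

# Toy data mimicking the field layout described above:
samples = [
    {"idx": 0, "context_idx": 0, "text": "t", "reply": "a", "label": "fake"},
    {"idx": 0, "context_idx": 1, "text": "t", "reply": "b", "label": "fake"},
    {"idx": 1, "context_idx": 0, "text": "u", "reply": "c", "label": "real"},
]
groups = group_by_tweet(samples)
```

Here `groups[0]` collects both replies to the first original tweet, and both carry the label "fake".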
Here is a sample line:
{"idx": 43999, "text": "BVTK received today, from the United States Patent and Trademark Office, a “Notice of Allowance” for the Ecrypt One patent application. The patent will issue in about 5-6 weeks.", "categories": ["popcorn"], "context_idx": 0, "reply": "it's about time the SEC finally goes after this POS company & CEO.", "mp4": "d3bfec97b06c468fbb795bf31fffd0eb.mp4", "label": "real"}
In the example above, the text of the original tweet is "BVTK received today, from the United States Patent and Trademark Office, a “Notice of Allowance” for the Ecrypt One patent application. The patent will issue in about 5-6 weeks.", and the response included the text "it's about time the SEC finally goes after this POS company & CEO." as well as an animated GIF belonging to the category "popcorn". The GIF is available as the MP4 file d3bfec97b06c468fbb795bf31fffd0eb.mp4. The tweet is labeled as real news.
The evaluation datasets are also in JSON Lines format, identical to the training data except that the label field is missing. You will need to predict the label and add the label field in your submission files. Read about the submission format.
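A submission can therefore be produced by attaching a predicted label to each unlabeled sample and writing the result back out as JSON Lines. A minimal sketch, assuming the exact submission layout is as described above (the helper name `write_predictions` is our own):

```python
import json

def write_predictions(samples, labels, out_path):
    """Attach a predicted label ("fake" or "real") to each sample and
    write one JSON object per line, mirroring the input format."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sample, label in zip(samples, labels):
            out = dict(sample)      # copy so the input is left untouched
            out["label"] = label    # the field missing from eval.json
            f.write(json.dumps(out, ensure_ascii=False) + "\n")
```

Check the official submission-format page before submitting, as it may require only a subset of the fields.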