Data

[LATEST UPDATE: The labelled test set for evaluation is available]

The dataset for the shared task includes social media posts and news articles, each containing both a textual and a visual component, concerning one or more real-world events that are known to have been the subject of fake news. We focus on the Ukrainian-Russian war, which started in February 2022.
The dataset is divided into two sub-datasets, one for each subtask.

Data Download

For each subtask we provide a TSV file containing the list of IDs and URLs for each data point and its label.
Below are the links to download forms for each subtask:

Subtask 1 - Multimodal Fake News Detection

Subtask 2 - Cross-modal relations in Real and Fake News

*Subsets including data points for which we have predictions from all the participants

The data format is shown in the example below.

Subtask 1 dataset.

Subtask 2 dataset.

Download script

Participants can download the actual data including texts and images using the provided download script:

download script

The script requires several additional libraries. A requirements.txt file is provided with the necessary libraries:

requirements.txt

[IMPORTANT: the script has been updated to also work with the test dataset. Please download the latest version of the script and follow the instructions below.]

The script is written in Python 3 and can be run from the command line. It has three parameters. Usage is shown and exemplified below.

usage: download_data.py [-h] -i INPUTFILE [-o OUTPUTDIR] [-l]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFILE, --inputFile INPUTFILE
                        path to input files with IDs and URLs
  -o OUTPUTDIR, --outputDir OUTPUTDIR
                        path to output directory for data and media.
                        If not provided uses current working directory.
  -l, --labelled        whether the input file has labels or not (optional, False if not included)

For example, if you want to download the test data for Subtask 1 you could use something like this:

python3 download_data.py \
--inputFile /yourusername/multifakedetective/data/MULTI-Fake-Detective_Task1_TEST_IDs.tsv \
--outputDir /yourusername/multifakedetective/data/actual_data/

Note that in this example the --labelled parameter is False (i.e., not specified) because the test data during the evaluation window is provided without labels.

The script will write the new files to the path specified by the --outputDir parameter (or to the current working directory if no path is provided).
The script will generate:

a TSV file. The name of the file is that of the input file, with "_IDs" replaced by "_Data";
a Media/ directory including all media related to the data points in the TSV file.
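
As a rough illustration of the naming rule above, the following sketch computes where the output TSV and the Media/ directory would be expected to appear for a given input file. The helper name expected_outputs and the relative output directory are hypothetical and are not part of the download script; only the "_IDs" to "_Data" replacement and the Media/ directory name come from the description above.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(input_file: str, output_dir: str = ".") -> Tuple[Path, Path]:
    """Return the (TSV path, media directory) described for the download script.

    Hypothetical helper: it only mirrors the documented naming rule and does
    not call or inspect the download script itself.
    """
    out = Path(output_dir)
    # Documented rule: the output name is the input name with "_IDs" replaced by "_Data".
    tsv_name = Path(input_file).name.replace("_IDs", "_Data")
    return out / tsv_name, out / "Media"


tsv_path, media_dir = expected_outputs(
    "MULTI-Fake-Detective_Task1_TEST_IDs.tsv", "actual_data"
)
print(tsv_path.name)   # MULTI-Fake-Detective_Task1_TEST_Data.tsv
print(media_dir.name)  # Media
```

A check like this can be handy for verifying that a download finished before starting any further processing.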

Participants can use their own code to download the data if they prefer. Note however that for the evaluation we will consider the IDs we provided for each data point.

Below we describe the generated TSV files for each Subtask.

Subtask 1

The TSV file for Subtask 1 will contain tweets and news articles, from now on referred to as data points.
For the training set, currently all of the 1057 annotated data points are available to download.
For the official test set, currently all of the 199 annotated data points are available to download.
For the additional test set, currently all of the 221 annotated data points are available to download.
The size of the dataset may vary depending on when participants download the data.

The TSV includes the following:

ID: a unique identifier for the data point (either a tweet or a news article)
URL: the URL of the data point
Date: the creation date of the data point (note that newspaper articles may not be provided with this information)
Type: either article or tweet, depending on the data point
Text: the full text of the data point
Media: names of the image files associated with the data point
Label (excluded from the test sets during the evaluation window): a numerical label representing one of the four possible labels (see task description). Specifically:
- Certainly Fake: 0
- Probably Fake: 1
- Probably Real: 2
- Certainly Real: 3
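
The column list and label codes above can be read with a few lines of Python. This is a minimal sketch that assumes the generated TSV starts with a header row using exactly the column names listed above and follows the default quoting rules of Python's csv module; neither detail is confirmed here. The sample row is made up purely for illustration.

```python
import csv
import io

# Label codes for Subtask 1, as listed above.
SUBTASK1_LABELS = {
    0: "Certainly Fake",
    1: "Probably Fake",
    2: "Probably Real",
    3: "Certainly Real",
}


def read_rows(handle):
    """Read a tab-separated file and add a human-readable LabelName field.

    Assumes a header row with the documented column names. LabelName is None
    when the Label column is missing or empty (e.g. unlabelled test data).
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row.get("Label")
        row["LabelName"] = SUBTASK1_LABELS[int(raw)] if raw not in (None, "") else None
        rows.append(row)
    return rows


# Hypothetical in-memory sample; real files come from the download script.
sample = io.StringIO(
    "ID\tURL\tDate\tType\tText\tMedia\tLabel\n"
    "123\thttps://example.org/a\t\tarticle\tplaceholder text\t123_.jpg\t0\n"
)
rows = read_rows(sample)
print(rows[0]["LabelName"])  # Certainly Fake
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the handle to read_rows; the encoding is likewise an assumption.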

The "Media" directory will include all the images listed in the Media column of the TSV file. Image files are named <ID>_<number>.jpg, where ID is the unique identifier given in the TSV file and number is an increasing value used only when the data point has more than one associated image. For example, if the data point with ID 123 has two or more associated images, they will be named 123_1.jpg, 123_2.jpg and so on. Conversely, if the data point has only one associated image, the number is omitted and the file name is 123_.jpg. The Media column lists the file names as a single comma-separated string, for example "123_1.jpg,123_2.jpg" for two images or "123_.jpg" for one image.
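
The naming convention can be sketched in Python as follows. The function names media_paths and expected_names are hypothetical helpers written for illustration; they are not taken from the download script, and the sketch assumes the Media column contains no spaces or quoting beyond what is described above.

```python
from pathlib import Path
from typing import List


def media_paths(media_field: str, media_dir: str = "Media") -> List[Path]:
    """Split a Media column value into paths inside the media directory."""
    names = [name.strip() for name in media_field.split(",") if name.strip()]
    return [Path(media_dir) / name for name in names]


def expected_names(data_id: str, count: int) -> List[str]:
    """Build the file names the convention implies for `count` images."""
    if count == 1:
        # A single image carries no number: <ID>_.jpg
        return [f"{data_id}_.jpg"]
    # Several images are numbered from 1: <ID>_1.jpg, <ID>_2.jpg, ...
    return [f"{data_id}_{i}.jpg" for i in range(1, count + 1)]


print([p.name for p in media_paths("123_1.jpg,123_2.jpg")])  # ['123_1.jpg', '123_2.jpg']
print(expected_names("123", 1))  # ['123_.jpg']
```

Comparing media_paths against the files actually present in the directory is one way to spot images that failed to download.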

A sample of the resulting dataset for Subtask 1 is provided below. Images are not shown in order to avoid copyright infringement.

Output of the download script for Subtask 1 (TSV file).

Subtask 2

The TSV file for Subtask 2 will contain tweets and news articles, from now on referred to as data points.
For the training set, currently all of the 1350 annotated data points are available to download.
For the official test set, currently all of the 227 annotated data points are available to download.
For the additional test set, currently all of the 246 annotated data points are available to download.
The size of the dataset may vary depending on when participants download the data.

The TSV includes the following:

ID: a unique identifier for the data point (either a tweet or a news article)
URL: the URL of the data point
Date: the creation date of the data point (note that newspaper articles may not be provided with this information)
Type: either article or tweet, depending on the data point
Text: the full text of the data point
Media: names of the image files associated with the data point
Label (excluded from the test sets during the evaluation window): a numerical label representing one of the three possible labels (see task description). Specifically:
- Misleading: 0
- Unrelated: 1
- Not Misleading: 2
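
With the Subtask 2 codes above, a quick class-distribution check might look like the sketch below. It assumes the Label values have already been read from the TSV as strings or integers; the function name label_distribution and the input list are made up for illustration.

```python
from collections import Counter

# Label codes for Subtask 2, as listed above.
SUBTASK2_LABELS = {0: "Misleading", 1: "Unrelated", 2: "Not Misleading"}


def label_distribution(raw_labels):
    """Count how often each Subtask 2 label occurs, including zero counts."""
    counts = Counter(SUBTASK2_LABELS[int(value)] for value in raw_labels)
    return {name: counts.get(name, 0) for name in SUBTASK2_LABELS.values()}


# Hypothetical label values, not taken from the dataset.
dist = label_distribution(["0", "2", "2", "1"])
print(dist)  # {'Misleading': 1, 'Unrelated': 1, 'Not Misleading': 2}
```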

The "Media" directory will include all the images listed in the Media column of the TSV file. Image files are named <ID>_<number>.jpg, where ID is the unique identifier given in the TSV file and number is an increasing value used only when the data point has more than one associated image. For example, if the data point with ID 123 has two or more associated images, they will be named 123_1.jpg, 123_2.jpg and so on. Conversely, if the data point has only one associated image, the number is omitted and the file name is 123_.jpg. The Media column lists the file names as a single comma-separated string, for example "123_1.jpg,123_2.jpg" for two images or "123_.jpg" for one image.

A sample of the resulting dataset for Subtask 2 is provided below. Images are not shown in order to avoid copyright infringement.

Output of the download script for Subtask 2 (TSV file).

Note that while the two datasets are kept separate for the task, some of the data points (i.e., tweets, news articles, and their associated media) may be used for both subtasks. In these cases, the ID associated with the data point will be the same across the two subtasks.
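
Because shared data points keep the same ID, the overlap between the two subtasks can be found with a set intersection. The sketch below assumes both TSV files have a header row with an ID column; the in-memory samples and their IDs are invented for illustration.

```python
import csv
import io


def read_ids(handle):
    """Collect the values of the ID column from a tab-separated file."""
    return {row["ID"] for row in csv.DictReader(handle, delimiter="\t")}


# Hypothetical stand-ins for the Subtask 1 and Subtask 2 TSV files.
task1 = io.StringIO("ID\tURL\n101\thttps://example.org/a\n102\thttps://example.org/b\n")
task2 = io.StringIO("ID\tURL\n102\thttps://example.org/b\n103\thttps://example.org/c\n")

shared = read_ids(task1) & read_ids(task2)
print(sorted(shared))  # ['102']
```

Knowing the overlap helps avoid downloading the same media twice.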

Copyright

The downloaded dataset includes tweets and news articles. The provided download script performs a coarse-grained anonymization of the data, e.g. by not providing author information for the data.

Upon download, participants are asked to agree not to share the material they receive, both during and after the competition. The data for the MULTI-Fake-Detective tasks is to be used ONLY for research purposes. By receiving the data, participants implicitly agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy for academic researchers.

Content Warning

We do not accept responsibility for the contents of the dataset. Downloaded texts and images may include copyrighted material and sensitive content.

The downloaded data and the provided labels do not reflect in any way the social and political views of the task organizers.