The task is structured as a multi-class classification problem in a multimodal setting. The problem is defined as follows: given a piece of content c = ⟨ t, v ⟩ which includes a textual component t and a visual component v (i.e., an image), assign it one of the following labels on a scale: "Certainly Fake", "Probably Fake", "Probably Real", "Certainly Real".
The labels refer to the informational content as a whole, not to its individual components: a fake piece of news that includes a real image (e.g., in a misleading context) is still probably (or certainly) fake, and vice versa.
The labels can be interpreted as follows:
Certainly Fake: news that is most certain to be fake, whatever the context.
Probably Fake: news that is still likely to be fake, but may include some real information or at the very least be somewhat credible.
Probably Real: news that is very credible but still retains some degree of uncertainty with regard to the information provided.
Certainly Real: news that is most certain to be real and incontestable, whatever the context.
Either of the two components, or both, can be leveraged to make the final prediction. Participants are encouraged to develop multimodal models for the task.
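As an illustration, a Task 1 instance and its label set could be represented in Python as sketched below; the field names and file paths are purely hypothetical, and the actual structure is defined by the released data files.

from dataclasses import dataclass

TASK1_LABELS = ["Certainly Fake", "Probably Fake", "Probably Real", "Certainly Real"]

@dataclass
class NewsItem:
    text: str        # textual component t
    image_path: str  # visual component v (path to the image file)
    label: str       # one of TASK1_LABELS

example = NewsItem(
    text="...",                      # placeholder text
    image_path="media/example.jpg",  # hypothetical image path
    label="Probably Fake",
)
assert example.label in TASK1_LABELS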
Training and test datasets, along with examples, are available in the Data section.
The task aims to assess how the two modalities (i.e., textual and visual) relate to each other in the context of fake and real news. The goal is to understand how images and texts in fake and real news can lead to misleading interpretations of the content conveyed by the other modality and by the news item as a whole. For example, a picture of the Pope smiling with his hands up, paired with the text "Even the Pope is cheering for Juventus' defeat in the Champions League"; or, vice versa, a news item reporting on climate change protests accompanied by an image of rubbish left at the same place, alluding to the fact that it is the protesters' fault.
The task is formulated as a three-class classification problem, and is defined as follows: given a piece of content c = ⟨ t, v ⟩ which includes a textual component t and a visual component v, decide whether their combination is misleading or not misleading in the interpretation of the information provided by either component, or whether the two are unrelated.
The three classes are to be interpreted as follows:
Misleading: The image or the text misleads the interpretation of the information carried by the other modality, or of the content as a whole.
Not Misleading: The image and the text are related to each other, support the overall information provided, and are not used with a misleading intent.
Unrelated: The image and the text are not related to each other in any meaningful way. This absence of a relationship does not alter the interpretation of the information in any way.
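For reference, the Task 2 label set could be encoded as follows; the integer mapping is an arbitrary choice for illustration, and the exact label spellings should be checked against the released data.

TASK2_LABELS = ["Misleading", "Not Misleading", "Unrelated"]
TASK2_LABEL2ID = {label: i for i, label in enumerate(TASK2_LABELS)}
ID2TASK2_LABEL = {i: label for label, i in TASK2_LABEL2ID.items()}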
Training and test datasets, along with examples, are available in the Data section.
Evaluation of participating systems will be conducted in terms of model performance on a test set.
For each task, we release two different test sets:
The Official Test Set includes data from a later time window than that of the training set. This will challenge the systems to classify fake news and misleading content in a more realistic scenario. Providing predictions on the Official Test Set for at least one of the two subtasks is mandatory to participate.
The Additional Test Set includes data from the same time window as the training set. This will give us a clearer picture of how resilient participating systems are to changes in context over time. Providing predictions on the Additional Test Set is optional.
Systems will be evaluated in terms of accuracy, and micro-, macro- and weighted average precision, recall, and F1-Score.
Participants are asked to indicate the model to be considered for the final ranking by marking the corresponding prediction file as Primary.
See the Guidelines and the Submission page for more details.
The metric that will be used to rank participating systems is weighted average F1-score obtained on the Official Test Set by runs marked as Primary.
The same evaluation procedure and criteria will be applied to both the subtasks.
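These metrics can be computed with scikit-learn; below is a minimal sketch, assuming the gold and predicted labels of one subtask are available as lists of strings (the values shown are placeholders).

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

gold = ["Certainly Fake", "Probably Real", "Certainly Real"]  # placeholder gold labels
pred = ["Probably Fake", "Probably Real", "Certainly Real"]   # placeholder predictions

print("accuracy:", accuracy_score(gold, pred))
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(gold, pred, average=avg, zero_division=0)
    print(f"{avg}: P={p:.3f} R={r:.3f} F1={f1:.3f}")

# The ranking metric: weighted average F1-score
print("weighted F1:", f1_score(gold, pred, average="weighted"))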
The evaluation script is available to download:
The script is written in Python 3 and can be run via the CLI. It takes three parameters. Usage is shown and exemplified below:
usage: evaluation.py [-h] [-g GOLDFILE] [-p PREDICTIONSFILE] [-t TASK]
optional arguments:
-h, --help show this help message and exit
-g GOLDFILE, --goldfile GOLDFILE
path to gold labelled file
-p PREDICTIONSFILE, --predictionsfile PREDICTIONSFILE
path to predictions file
-t TASK, --task TASK task number (either 1 or 2)
Note that the predictions file is expected to be a TSV file with no header.
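A minimal sketch of producing such a file with pandas is shown below; it assumes each row carries an item ID and a predicted label, but this two-column layout is an assumption, so mirror the format of the released gold files.

import pandas as pd

# Hypothetical IDs and predictions; written with no header and tab separation, as required.
predictions = pd.DataFrame({
    "id": ["123", "456"],
    "label": ["Certainly Fake", "Probably Real"],
})
predictions.to_csv("SUBTASK1-myteam-myrun.tsv", sep="\t", header=False, index=False)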
For example, if you want to evaluate your performance on the official test set for task 1 you would do something like:
python3 evaluation.py \
-g /multifakedetective/DATA/TEST/OFFICIAL/Task1/MULTI-Fake-Detective_Task1_TEST_IDs_LABELLED_FILTERED.tsv \
-p /yourusername/multifakedetective/predictions/SUBTASK1-myteam-myrun.tsv \
-t 1
Since we are distributing only IDs, not all participants may have access to the exact same test datasets (e.g., due to the removal of articles/tweets or download issues).
To ensure fair competition, we plan to rank the systems on the subsets of the test sets for which all the participants provided a label.
We invite all participants to download the test data as soon as possible, in order to use as many data points as possible for the evaluation.
Participants can contact the task organizers (see Contacts) if they have serious trouble downloading the data.
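To check how much of the distributed test data was actually retrieved, a quick comparison against the released ID list might look like the following; the file and column names here are hypothetical.

import pandas as pd

# Adapt the file/column names to the released ID file and your local download.
distributed_ids = set(pd.read_csv("Task1_TEST_IDs.tsv", sep="\t")["id"])
downloaded_ids = set(pd.read_csv("Task1_TEST_downloaded.tsv", sep="\t")["id"])

missing = distributed_ids - downloaded_ids
print(f"retrieved {len(downloaded_ids)}/{len(distributed_ids)} items; {len(missing)} missing")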
Participating systems will also be compared against a baseline for each of the subtasks.
We provide two Jupyter notebooks detailing the training and evaluation of the baseline models.
Each notebook describes the baseline for one of the tasks.
Notebooks:
Baseline Models for MULTI-Fake-DetectiVE Task 1: Multimodal Fake news Detection
Baseline Models for MULTI-Fake-DetectiVE Task 2: Cross-modal relations in Fake and Real News
All baseline models are based on the SVM Classifier from SciKit-Learn.
We include three different baselines for each task:
text-only model - The model is trained only on textual features, extracted with a BERT model
image-only model - The model is trained only on image features extracted with a ResNet-18 model
multi-modal model - The model is trained on a concatenation of text and image features described above
The baseline models use only the training data for both training and evaluation. Stratification is used to split the training dataset into training and validation sets, with an 80%-20% split. Baseline models evaluated on the test set will be released at the end of the evaluation window.
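A condensed sketch of this recipe is shown below: BERT text features and ResNet-18 image features are concatenated and fed to a scikit-learn SVM, with a stratified 80%-20% train/validation split. The specific checkpoint names, pooling strategy, and hyperparameters are assumptions; the released notebooks remain the reference implementation.

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel
from torchvision import models, transforms
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Assumed checkpoints; the notebooks define the actual ones.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classification head to expose 512-d features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def text_features(text):
    enc = tokenizer(text, truncation=True, return_tensors="pt")
    return bert(**enc).last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token embedding

@torch.no_grad()
def image_features(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return resnet(img).squeeze(0)

def multimodal_features(text, image_path):
    # Concatenate the 768-d text vector with the 512-d image vector.
    return torch.cat([text_features(text), image_features(image_path)])

def train_baseline(texts, image_paths, labels):
    # texts, image_paths, labels are assumed to come from the released training file.
    X = torch.stack([multimodal_features(t, p) for t, p in zip(texts, image_paths)]).numpy()
    X_train, X_val, y_train, y_val = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = SVC().fit(X_train, y_train)
    val_f1 = f1_score(y_val, clf.predict(X_val), average="weighted")
    return clf, val_f1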
Participants should be able to run the notebooks independently, provided that the dependencies are available. Specifically, the additional required libraries are:
Pandas
Numpy
SciKit-Learn
PyTorch
TorchVision
Transformers
Matplotlib
Note that to execute the notebooks on local data, participants have to manually change the path to the data (i.e., training file and media directory) in the notebooks. Instructions on where to do so are provided within the notebooks.
The input data format is the same as that obtained by using the data download script.