We will release the evaluation script in this repo. Feel free to use it.
Both subtasks will be evaluated and ranked using macro F1-score.
Measures such as accuracy, precision, recall, and other fine-grained metrics computed from true/false positives and negatives will also be reported, but ONLY for analysis purposes in the task overview paper.
Statistical significance testing will be performed for both subtasks.
There will be one leaderboard per subtask, so you can participate in either subtask or in both.
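Until the official script is released, the ranking metric can be reproduced with scikit-learn. The snippet below is a minimal sketch assuming gold and predicted labels are available as parallel lists of label strings; the label names are placeholders, not the official label set.

```python
# Minimal sketch of the ranking metric (macro F1), assuming gold and predicted
# labels are parallel lists of label strings. The label names below are
# placeholders, not the official label set.
from sklearn.metrics import classification_report, f1_score

gold = ["human", "generated", "generated", "human", "generated"]
pred = ["human", "generated", "human", "human", "generated"]

# Macro F1 averages the per-class F1 scores, so every class weighs equally
# regardless of its frequency in the test set.
print(f"macro F1: {f1_score(gold, pred, average='macro'):.4f}")

# Accuracy, precision, and recall per class: for analysis only, not ranking.
print(classification_report(gold, pred, digits=4))
```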
Baselines 📈
A set of functional, uni-modal, and multi-modal models based on pretrained text and image encoders will be provided:
Functional baselines:
o Majority: always predicts the most frequent label in the test dataset
o Random: uniform random prediction
Uni-modal models: each modality is treated independently, and the final decision is the combination of the two per-modality predictions, e.g., if f(image)=generated and g(text)=human, then the prediction is image-generated. We plan to use ViT (Dosovitskiy et al., 2020) for image classification, and XLM-RoBERTa (Conneau et al., 2020) and Multilingual E5 (Wang et al., 2024) for text classification.
Multi-modal models: late fusion of pretrained vision encoders, such as those from CLIP (Radford et al., 2021), and text encoders, such as XLM-RoBERTa and Multilingual E5. A sketch of both baseline families is given after this list.
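As a rough illustration of these two baseline families (not the official implementation), the sketch below combines per-modality decisions as in the uni-modal example above and builds a feature-level late fusion of a CLIP vision encoder and XLM-RoBERTa. The checkpoints, label names, pooling strategy, and classifier head are assumptions made for the example.

```python
# Illustrative sketch of the uni-modal decision combination and a multi-modal
# late-fusion baseline. Checkpoints, label names, pooling, and the classifier
# head are assumptions for this example, not the official baseline code.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer, CLIPVisionModel

# --- Uni-modal: combine independent per-modality decisions -----------------
def combine_decisions(image_label: str, text_label: str) -> str:
    # e.g., f(image)=generated and g(text)=human -> "image-generated"
    if image_label == "generated" and text_label == "generated":
        return "fully-generated"
    if image_label == "generated":
        return "image-generated"
    if text_label == "generated":
        return "text-generated"
    return "fully-human"

# --- Multi-modal: late fusion of pretrained text and vision encoders -------
TEXT_ENCODER = "xlm-roberta-base"            # or "intfloat/multilingual-e5-base"
VISION_ENCODER = "openai/clip-vit-base-patch32"

tokenizer = AutoTokenizer.from_pretrained(TEXT_ENCODER)
text_encoder = AutoModel.from_pretrained(TEXT_ENCODER)
image_processor = AutoImageProcessor.from_pretrained(VISION_ENCODER)
vision_encoder = CLIPVisionModel.from_pretrained(VISION_ENCODER)

@torch.no_grad()
def fuse(image: Image.Image, text: str) -> torch.Tensor:
    toks = tokenizer(text, truncation=True, return_tensors="pt")
    text_emb = text_encoder(**toks).last_hidden_state.mean(dim=1)   # mean pooling
    pixels = image_processor(images=image, return_tensors="pt")
    image_emb = vision_encoder(**pixels).pooler_output              # pooled [CLS]
    return torch.cat([text_emb, image_emb], dim=-1)                 # late fusion

# A linear head over the fused embedding, trained with cross-entropy on the
# task's training split (4 is a placeholder for the number of labels).
classifier = torch.nn.Linear(
    text_encoder.config.hidden_size + vision_encoder.config.hidden_size, 4
)
```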
All these baselines will use default hyperparameters, so participants can explore alternative hyperparameter configurations of these models or entirely new approaches and models. Both classical and modern machine learning and deep learning approaches are expected and welcome.
References 📜
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020, July). Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440-8451).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). Multilingual E5 Text Embeddings: A Technical Report.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning.