MiMIC 🎭 

 Welcome 🤗 

MiMIC: Multi-Modal AI Content Moderation is organized under the umbrella of the Iberian Languages Evaluation Forum (IberLEF) 2025, which will be held in Zaragoza, Spain, in September 2025.

Introduction 📜

Generative AI has become a powerful and widely accessible tool, offering countless opportunities for innovation. At the same time, it lowers the barrier for creating and spreading fake content on a large scale. This includes disinformation, propaganda, harmful material, fake reviews, deepfakes, and combinations of these, often used for malicious purposes.

Text and image generation models have gained increasing popularity. Trained on curated internet datasets, Large Language Models (LLMs) such as GPT, Llama, and Claude can now generate fluent, coherent, and plausible-looking text. Likewise, diffusion-based image generation models such as MidJourney, DALL-E, and Stable Diffusion can produce realistic-looking images that often fool the human eye. The same holds for other modalities such as video, speech, and code, with models like Sora, StyleTTS2, and Codex, and even in multi-modal setups with models like GPT-4o.

The objective of this shared task is to study the problem of detecting whether a text-image pair is partially or completely generated. Moreover, we hope to understand whether a multi-modal framing can improve the performance of generated content detectors. 

Can you spot whether this (image, text) pair has been automatically generated? 🤔

Text: The bathing machine was a popular device from the 18th century to the early 20th century, which allowed people to change out of their everyday clothes, put on bathing suits, and enter the sea at the beaches. Bathing machines were wooden carts with roofs and walls that were rolled into the sea. The image illustrates "sea bathing in central Wales, c. 1800. Several bathing machines can be seen." 

Solution: The image has been generated, which is easily identifiable by reading the text, which mentions “wooden carts with roofs and walls”. Bathing machines of this kind are clearly not what is depicted in the image, a clear indication that the image generation model misunderstood the description. The text is real: it provides descriptive details and uses of bathing machines within a historical context, using a natural sentence structure that mirrors how a human expert would describe the entity. In contrast, LLMs tend to generate captions like “an interior space with ornate, antique-style cabinetry or furniture lining both sides of a long, narrow room”.
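
To make the multi-modal framing concrete, here is a minimal, unofficial baseline sketch: CLIP image and text embeddings are concatenated and passed to a small classification head that predicts which parts of the pair are generated. It assumes the Hugging Face transformers and torch packages and the public openai/clip-vit-base-patch32 checkpoint; the label set, file name, and classifier head are illustrative and are not part of the official task definition.

```python
# Hypothetical baseline sketch (not the official setup): fuse CLIP image and
# text embeddings and classify which parts of the (image, text) pair are generated.
# Requires: pip install torch transformers pillow
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative label set; the official task labels may differ.
LABELS = ["fully human", "generated text", "generated image", "fully generated"]

# Simple fusion head over concatenated image/text embeddings; in practice it
# would be trained on the task's training split.
classifier = nn.Linear(clip.config.projection_dim * 2, len(LABELS))

def predict(image_path: str, text: str) -> str:
    """Return the predicted label for one (image, text) pair."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[text], images=[image], return_tensors="pt",
                       truncation=True, padding=True)
    with torch.no_grad():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        fused = torch.cat([image_emb, text_emb], dim=-1)  # late fusion by concatenation
        logits = classifier(fused)
    return LABELS[logits.argmax(dim=-1).item()]

# Example with a hypothetical local file:
# predict("bathing_machines.jpg", "The bathing machine was a popular device ...")
```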

Relevance & Novelty 🎯

Some text and image generation models have been publicly released as pre-trained checkpoints in model hubs or through free or paid APIs. This availability boosts research and makes it possible to build cutting-edge applications on top of these technologies. However, users or bots with unethical goals can also use them to spread untruthful news, reviews, or opinions as text and images. Thus, there is great interest in detecting automatically generated content for content moderation purposes, including fake news, deepfake, and bot detection, as well as for technical research. It is especially relevant for AI and NLP research hubs and for companies with AI Safety efforts, where detecting generated content is paramount to understanding the impact of these models on society and developing strategies to moderate their use.

Previous efforts in detecting machine-generated content have primarily concentrated on identifying machine-generated text and low-diversity synthetic images (faces, simple objects, etc.). However, generative AI now goes well beyond producing human-looking text and faces, which challenges much of the existing work on detecting and moderating machine-generated content. Moreover, multimodality has not been extensively explored in the literature: most existing datasets for machine-generated content detection focus on a single modality, disregarding valuable information across modalities. The few multi-modal datasets that have been proposed mainly contain human faces and English text, which limits their scope of application.

The proposed shared task, MiMIC, aims to boost research and technology development for multi-modal machine-generated content detection (image-text) in English and Spanish. To our knowledge, no available multi-modal dataset for machine-generated content detection covers the Spanish language; thus, MiMIC can greatly benefit the content moderation community in Spanish-speaking contexts.

We provide new data collections with high impact for future research on AI content moderation, offering the potential to develop robust models for the content moderation industry. These datasets will also be valuable for the community to develop models beyond the scope of current tasks. Such technologies can later be used to identify the true source of a piece of content, reinforcing user trust and reducing the impact of disinformation, propaganda, and spam campaigns created with generative AI.

Expected Target Community 🫂

This shared task is proposed to incentivize broad interest in detecting automatically generated content, a relevant and novel area that will become increasingly necessary given the speed at which new content generation models appear. Thus, we do not constrain the target community, and we welcome participants ranging from individual practitioners to companies.