IberAutextification 🤖👩🏻
Welcome 🤗
The IberAuTexTification: Automated Text Identification on Languages of the Iberian Peninsula shared task will take place as part of IberLEF 2024, the 6th Workshop on Iberian Languages Evaluation Forum at the SEPLN 2024 Conference, which will be held in Valladolid, Spain on the 26th of September, 2024.
IberAuTexTification is the second edition of the AuTexTification shared task at IberLEF 2023 (Sarvazyan et al., 2023). From Genaios and the UPV, we extend the previous task along three dimensions: more models, more domains, and more languages of the Iberian Peninsula (in a multilingual fashion), aiming to build more generalizable detectors and attributors. In this task, participants must develop models that exploit clues about linguistic form and meaning to identify automatically generated text from a wide variety of models, domains, and languages. We plan to include LLMs such as GPT-3.5, GPT-4, LLaMA, Coral, Command, Falcon, and MPT, among others; new domains such as essays and dialogues; and the most prominent languages of the Iberian Peninsula: Spanish, Catalan, Basque, Galician, Portuguese, and English (as spoken in Gibraltar).
A novelty of this edition is its multilingual (languages of the Iberian Peninsula such as Spanish, English, Catalan, Galician, Basque, and Portuguese), multi-domain (news, reviews, emails, essays, dialogues, Wikipedia, WikiHow, tweets, etc.), and multi-model (GPT, LLaMA, Mistral, Cohere, Anthropic, MPT, Falcon, etc.) setup: participants must determine whether a text has been automatically generated or not and, if generated, identify the model that generated it.
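Both subtasks can be framed as text classification. The following is a minimal baseline sketch, assuming scikit-learn is available; the toy texts and labels are purely illustrative and are not task data. Character n-grams are a common choice here because they capture form-level clues across languages without language-specific tokenization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Subtask 1: binary detection (human vs. generated).
# Toy training data, for illustration only.
train_texts = [
    "Compré el robot aspirador porque estaba de oferta.",
    "La compra fue una decisión impulsiva motivada por un precio atractivo.",
    "jaja sí, al final lo pillé baratísimo, ni lo pensé",
    "En resumen, se trata de un modelo de gama media que cumple las expectativas.",
]
train_labels = ["human", "generated", "human", "generated"]

# Character n-gram TF-IDF features feeding a linear classifier.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(train_texts, train_labels)
print(detector.predict(["Es cierto que limpia mejor y mapea la casa."]))

# Subtask 2 (model attribution) uses the same pipeline with model
# names (e.g. "gpt-4", "llama") as labels instead of human/generated.
```

This is only a starting point; competitive systems will likely rely on fine-tuned multilingual transformers rather than lexical features.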
To foster engagement and reward dedication, we will award 500€, sponsored by Genaios, to the best participant in each subtask. We look forward to your participation and wish you good luck in the competition! 🍀
To reduce the workload of participating in the task, we provide a repository with baselines, evaluation code, and format checkers. The code is designed with extensibility in mind, so you can use it as a basis for developing your own models and get some functionality for free, such as CLI endpoints, caching, and config handling.
Introduction
The new generation of Large Language Models (LLMs) for text generation has surged and become established in the workflows of individuals and companies for many different tasks. These LLMs are trained to predict the next word on large-scale, curated datasets of Internet text and can now generate fluent, coherent, and plausible-looking text that can be mistaken for human writing, not only in English, but in many other languages, including those of the Iberian Peninsula. The most recent and prominent examples are the GPT family of language models developed by OpenAI, including GPT-3 (Brown et al., 2020), GPT-3.5 (Ouyang et al., 2022), and GPT-4 (OpenAI, 2023); the LLaMA family from Meta (Touvron et al., 2023); and PaLM from Google (Google, 2023). Besides, many other companies are developing their own text generation solutions, such as Mistral (Jiang et al., 2023), Cohere (Cohere, 2023), Anthropic (Anthropic, 2023), Falcon (Almazrouei et al., 2023), and Mosaic (Mosaic ML, 2023), while the research community is continually working with open-source base models to further improve their capabilities (Huggingface, 2023).
Considering the current impact and popularity of these technologies for developing innovative applications, it is reasonable to assume that some people will use them to produce machine-generated text (MGT) for malicious purposes, such as disinformation campaigns that spread fake news or polarized opinions, or to increase the credibility of phishing campaigns. Besides, the number of LLMs is continually growing, and most of them are available through pre-trained checkpoints in model hubs or through free or subscription-based APIs, which increases the number of potential malicious users. Such users can attack across many different languages, domains, models, and generation strategies, making moderation harder. It is therefore necessary to develop effective and generalizable content moderation strategies for MGT, based on (i) detecting whether a text has been generated by an LLM, and (ii) identifying the model that generated an MGT for further forensic purposes.
There is great interest in addressing these tasks (Deng et al., 2022; Tourille et al., 2022), and they are especially relevant for AI and NLP research hubs and for companies that see MGT as an imminent threat to their reputation. Companies are highly interested in detecting automatically generated content aimed at boosting or damaging the reputation of products and brands, and in verifying the content of news or statements. This shared task aims to foster R&D in generated-text identification so that it progresses at the same pace as generation technologies. Otherwise, companies and other entities would struggle to deal with generated, malicious content such as spam, fake opinions, and reviews. The new generation of opinion-spam techniques requires a corresponding new generation of counter-techniques to moderate it.
What has already been done?
It has been shown empirically that MGT from specific models and domains can be detected automatically with very high accuracy (Jawahar et al., 2020; Bakhtin et al., 2019; Sarvazyan et al., 2023). However, this setting is not realistic: in the wild, the space of possible model-domain-language combinations is large, so generalizable detection and attribution models are required. In general, LLMs introduce statistical artifacts and recurring structures into their texts, which can be used as clues, either by humans or by discriminator systems, to identify automatically generated text to some extent (Ippolito et al., 2020). For instance, some language models produce syntactic or grammatical artifacts, while most struggle with factual hallucinations and reasoning (Liu et al., 2022) due to their inability to memorize all facts (Borgeaud et al., 2022) and to outdated factual knowledge (Dhingra et al., 2022). Besides, the research community has found that existing detectors exhibit poor cross-domain accuracy, that is, they do not generalize to different writing styles and publication formats, whereas their in-domain accuracy is near-perfect (Jawahar et al., 2020; Bakhtin et al., 2019).
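The idea of statistical artifacts as clues can be illustrated with a toy scoring scheme: rate texts by their average per-character log-probability under a simple unigram model fit on human-written reference text, and compare scores against a threshold. Real detectors use neural language-model perplexity rather than unigram counts; everything below (reference string, threshold idea) is an illustrative assumption, not the task's method.

```python
import math
from collections import Counter

# Toy "human reference" corpus; a real system would use a large corpus.
reference = "the quick brown fox jumps over the lazy dog and runs home"
counts = Counter(reference)
total = sum(counts.values())

def avg_logprob(text, alpha=1.0):
    """Mean log-probability per character, with add-alpha smoothing."""
    vocab = len(counts) + 1  # +1 to reserve mass for unseen characters
    score = 0.0
    for ch in text:
        p = (counts.get(ch, 0) + alpha) / (total + alpha * vocab)
        score += math.log(p)
    return score / max(len(text), 1)

# Texts whose character distribution matches the reference score higher;
# a detector would flag texts whose score falls outside an expected range.
print(avg_logprob("the dog runs"))
print(avg_logprob("zzzz qqqq xxxx"))
```

With neural LMs the analogous quantity is token-level log-likelihood (or its exponent, perplexity), and generated text often scores anomalously high under the generating model, which is the distributional signal detectors exploit.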
Can you spot generated text? 🤔
Could you spot whether the following texts have been automatically generated? Keep in mind that automatically generated text may show factual, grammatical, or coherence artifacts (Massarelli et al., 2020), along with statistical abnormalities that make the distributions of automatic and human texts differ (Ippolito et al., 2020), despite having well-shaped form (Bender et al., 2020).
Adquirí el iRobot Roomba 981 en una oferta irresistible, aunque no tenía una necesidad inmediata de reemplazar mi antiguo Roomba. Desde el primer uso, noté mejoras sustanciales en su rendimiento de limpieza. La función de mapeo de la casa añade un nivel de eficiencia que mi antiguo modelo no tenía, permitiendo una cobertura completa del espacio sin repetir áreas innecesariamente.
La compra del iRobot Roomba 981 va ser una decisió impulsiva motivada per un preu atractiu, tot i que ja tenia una Roomba anterior. La veritat és que ofereix una neteja més eficient i la funció de mapatge de la casa és realment pràctica. No obstant això, la seva limitació rau en la seva incapacitat per netejar habitacions específiques, ja que ho fa de manera indiscriminada. A més, cal destacar que no és precisament silenciosa. En resum, es tracta d'una Roomba de gamma mitjana que compleix amb les expectatives, tot i que amb algunes limitacions notables.
iRobot Roomba 981 erosi nuen beharrik gabe, nire Roomba aldatu beharra ez nuen, baina prezio ona aurkitu nuen, benetan hain hobeto garbitzen duela eta etxea mapeatzen duela ikusi nuen. Hala ere, ezin da esan nahi horrela garbitzen duela gelarik edo besteak, indarrean guztia garbitzen du, eta, gainera, ez da hain sendoa. Azken batean, ertaineko Roomba bat da.
A compra do iRobot Roomba 981 foi unha decisión innecesaria xa que non tiña intención de cambiar a miña Roomba actual, pero atopeina a un bo prezo. É certo que limpa moito mellor e fai un mapeo da casa, aínda que non se pode dicir que limpe unha habitación ou outra de forma selectiva, xa que limpa todo de xeito indiscriminado. Ademais, non é silenciosa. En resumo, trátase dun modelo de Roomba de rango intermedio.
I purchased the iRobot Roomba 981 without the need to replace my existing Roomba, simply because I found it at a good price. It is true that it cleans much better and maps the house efficiently. However, it lacks the ability to clean specific rooms selectively; instead, it cleans the entire space forcefully. Additionally, it is not a quiet device. In conclusion, it's a mid-range Roomba that delivers enhanced cleaning performance, yet with some limitations regarding targeted cleaning and noise level.
References
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., & Fiedel, N. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., & Manica, M. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Jawahar, G., Abdul-Mageed, M., & Lakshmanan, L. (2020). Automatic Detection of Machine Generated Text: A Critical Survey. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 2296–2309). International Committee on Computational Linguistics.
Deng, R., & Duzhin, F. (2022). Topological Data Analysis Helps to Improve Accuracy of Deep Learning Models for Fake News Detection Trained on Very Small Training Sets. Big Data and Cognitive Computing, 6(3).
Tourille, J., Sow, B., & Popescu, A. (2022). Automatic Detection of Bot-Generated Tweets. In Proceedings of the 1st International Workshop on Multimedia AI against Disinformation (pp. 44–51). Association for Computing Machinery.
Rodriguez, J., Hay, T., Gros, D., Shamsi, Z., & Srinivasan, R. (2022). Cross-Domain Detection of GPT-2-Generated Technical Text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1213–1233). Association for Computational Linguistics.
Ippolito, D., Duckworth, D., Callison-Burch, C., & Eck, D. (2020). Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1808–1822). Association for Computational Linguistics.
Uchendu, A., Le, T., Shu, K., & Lee, D. (2020, November). Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 8384-8395).
Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., & Smith, N. A. (2021). All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 7282–7296). Association for Computational Linguistics.
Ethayarajh, K., & Jurafsky, D. (2022). How human is human evaluation? Improving the gold standard for NLG with utility theory. arXiv preprint arXiv:2205.11930.
Dugan, L., Ippolito, D., Kirubarajan, A., & Callison-Burch, C. (2020). RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 189–196). Association for Computational Linguistics.
Massarelli, L., Petroni, F., Piktus, A., Ott, M., Rocktäschel, T., Plachouras, V., Silvestri, F., & Riedel, S. (2020). How Decoding Strategies Affect the Verifiability of Generated Text. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 223–235). Association for Computational Linguistics.
Bender, E., & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5185–5198). Association for Computational Linguistics.