Why is there no training data?
The task is to develop a system that can detect text generated by different LLMs, across different genres. Hence, the setup is not like the Machine Learning shared tasks you may be used to, with train, development, and test data from the same distribution. It is not even necessary to use Machine Learning: you could develop a system based on linguistic or statistical properties that discriminate between human-written and LLM-generated text.
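As a purely illustrative example of such a non-ML approach, the sketch below labels a text using two simple surface statistics. The features, the thresholds, and the labels are hypothetical choices made for this illustration; they are not a provided baseline and have not been tuned on any shared-task data.

```python
# Minimal sketch of a heuristic detector based on simple statistical text
# properties. Thresholds are purely illustrative (hypothetical values).

import re


def type_token_ratio(text: str) -> float:
    """Ratio of unique tokens to total tokens (a rough lexical-diversity measure)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text: str) -> float:
    """Average number of tokens per sentence, using a crude sentence split."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(re.findall(r"\w+", s)) for s in sentences) / len(sentences)


def predict(text: str) -> str:
    """Label a text as 'human' or 'machine' using hand-picked cut-offs."""
    ttr = type_token_ratio(text)
    msl = mean_sentence_length(text)
    # Hypothetical heuristic: low lexical diversity combined with long,
    # uniform sentences is treated as a weak signal of machine generation.
    if ttr < 0.45 and msl > 20:
        return "machine"
    return "human"


if __name__ == "__main__":
    sample = "This is a short example text. It only serves to show the interface."
    print(predict(sample))
```

Any real submission would of course replace these toy features with whatever linguistic or statistical signals you find discriminative.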
What can we use in the system we submit?
You cannot access external LLMs (or other APIs) from within your submitted system.
In principle there is no limit to system size, but we should be able to run it locally (check with us if in doubt).
You can use existing closed datasets to train your model (but be ready to report on their properties).
Are we expected to make our final dataset and model open?
That would be great, but it is not a requirement for participation. The system and the data used should, however, be described in detail in the report.
Which information do we get about the test data?
The development data is representative of part of the test data; it is not meant as a complete validation set for parameter optimization, but merely as a sneak peek. It can help you check whether your model already generalizes well across several genres and prompt types (see the sketch after this answer).
The test data will additionally contain other text genres (poetry and a mystery genre) and text generated by an additional open-source model.
You will not get access to the prompts used. The human-written texts existed before the generated texts and were not written specifically for the shared task or according to specific instructions.
There is no information on whether the human authors used word-processing assistance or whether their texts were edited.
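One way to use the development data as described above is to break your scores down per genre. The sketch below assumes a hypothetical file `dev.csv` with columns `text`, `genre`, and `label`; these names are assumptions for illustration, not the actual shared-task data layout, so adapt them to the release you receive.

```python
# Sketch of a per-genre accuracy check on the development data.
# File name and column names ("text", "genre", "label") are assumptions
# about the data format, not the official shared-task layout.

import csv
from collections import defaultdict


def predict(text: str) -> str:
    """Placeholder for your own detector; should return 'human' or 'machine'."""
    return "human"


def per_genre_accuracy(path: str) -> dict:
    """Compute detection accuracy separately for each genre in the file."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            genre = row["genre"]
            total[genre] += 1
            if predict(row["text"]) == row["label"]:
                correct[genre] += 1
    return {genre: correct[genre] / total[genre] for genre in total}


if __name__ == "__main__":
    for genre, acc in per_genre_accuracy("dev.csv").items():
        print(f"{genre}: {acc:.2%}")
```

Large gaps between genres in such a breakdown would suggest the model is overfitting to the genres it has seen rather than to LLM "fingerprints" in general.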
What kinds of prompts are being used to generate the test data?
The prompts used to generate the test data were created by three different members of the organizing committee and cover a range of potential use cases. Some prompts are generic in nature, giving only a topic and some instructions. However, there are also prompts that are more adversarial in nature: they are meant to be as misleading as possible to detection models, e.g. by giving concrete examples of human writing style to emulate.
As a result, it is perfectly possible that generated texts still contain some traces of human-written text: the main challenge is therefore to detect the "fingerprints" of LLMs in whatever shape or form they appear. We believe this is representative of the challenging nature of the detection task and will reward the most robust detection models.
How will the results be disseminated?
We plan a special session in the CLIN33 program for the shared task, in which participating teams can present their solutions. As with previous CLIN shared tasks, we could also publish shared task proceedings with short papers describing the systems. We are open to suggestions.