In the interest of transparency, and with inspiration from this paper by Carla Parra Escartín, Teresa Lynn, Joss Moorkens, and Jane Dunne, we would like to lay out some information about the Shared Task: participation, equity, and the evaluation criteria.
Anyone can participate, though in the interest of transparency, results from teams whose members overlap with the organizers or with the annotators of any of the included datasets will be marked as such once the results are published. If one of the authors of your system is an organizer, an annotator of one or more of the datasets, or both, you must state this in a footnote in the camera-ready version of your paper.
Participation itself is free of charge and anyone can submit a system; however, you must register for the CODI 2025 workshop in order to submit a paper and have your system evaluated. You can also send an e-mail to disrpt_chairs@googlegroups.com and we will add you to the distribution list for Shared Task updates. (Alternatively, if you cannot successfully send an e-mail to the address above, please send a note to Chloé Braud [Chloe.Braud@irit.fr] or Yang Janet Liu [y.liu1@lmu.de] so we can add you from our end.) We also recommend subscribing to the Shared Task repository on GitHub at https://github.com/disrpt/sharedtask2025.
We believe that evaluation and analysis are very important, and therefore we require all systems to be accompanied by a paper using the ACL template. All papers will be submitted to the CODI workshop and marked as Shared Task papers. Accepted papers will be published in the Shared Task section of the proceedings of the workshop.
During paper submission, authors will be asked to provide a link to their system, including all necessary resources that are not trivially available (for example, there is no need to provide pre-trained models that can be downloaded from Hugging Face). All systems must include code to retrain the system from scratch, so that evaluators can test aspects of the system’s performance and reproduce reported scores, as well as a detailed README file explaining how to train the system. Systems which cannot be run in the evaluation phase will not be accepted.
Please also make sure to set random seeds to keep performance as reproducible as possible!
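For example, a minimal seeding sketch for a PyTorch-based system (adapt to whatever framework you actually use) might look like this:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness for reproducibility."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs (all devices)
    # Optional: trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```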
There are five overall rankings which will be published at the end of the Shared Task:
Discourse Unit Segmentation - from tokenized text (.tok files for non-PDTB-style corpora)
Discourse Unit Segmentation - from treebanked data (.conllu files for non-PDTB-style corpora)
Connective Detection - from tokenized text (.tok files for PDTB-style corpora)
Connective Detection - from treebanked data (.conllu files for PDTB-style corpora)
Relation Classification - from treebanked data (.rels files for all corpora)
The overall system ranking in each category will be determined by the macro-average score across treebanks, where each treebank's score is computed with the micro-averaged metric.
For segmentation, the micro-averaged positive-class F-score is used as the metric for each treebank, while for relation classification simple accuracy is used. In all cases, the official Shared Task scorers, available from the GitHub repository, will be used.
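To make the ranking computation concrete, here is a simplified sketch (not the official scorer; the corpus names and scores below are purely hypothetical):

```python
from typing import Dict, List, Tuple


def micro_f1_positive(pairs: List[Tuple[int, int]]) -> float:
    """Micro-averaged F-score for the positive class over all tokens in one treebank.

    `pairs` holds (gold, predicted) labels, where 1 = boundary/connective and 0 = other.
    """
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def overall_ranking_score(per_treebank_scores: Dict[str, float]) -> float:
    """Macro average: the unweighted mean of the per-treebank (micro-averaged) scores."""
    return sum(per_treebank_scores.values()) / len(per_treebank_scores)


# Purely hypothetical per-treebank scores for one track:
scores = {"corpus_a": 0.93, "corpus_b": 0.88, "corpus_c": 0.85}
print(f"Overall ranking score: {overall_ranking_score(scores):.4f}")  # 0.8867
```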
Please note that only systems participating in the closed track will be ranked, and their total parameter count must be under 4 billion.
During system results reproduction, we will only train your system on the DISRPT tasks for the tracks you have submitted to. This means that you cannot incorporate training steps on tasks other than the ones in the Shared Task. However, as neural language models are themselves pre-trained on a variety of natural language tasks, you may use any publicly available models (e.g. transformer word embeddings), which may have been trained on any number of tasks. Needless to say, none of those tasks should involve accessing the Shared Task test data or its source datasets, nor test data from previous editions of the DISRPT Shared Task.
If you choose to train your own embeddings for the Shared Task, you must make them publicly available so that we can easily reproduce your system, for example by uploading them to Hugging Face. Your model must be available at the time of system submission. If you would like to share your pretrained embeddings in another way, please contact the organizers to confirm that it is acceptable.
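As one possible sketch of sharing via the huggingface_hub library (the repository name and local path below are placeholders):

```python
from huggingface_hub import HfApi, create_repo

repo_id = "your-username/disrpt2025-embeddings"  # placeholder name

# Create the public repository if it does not exist yet.
create_repo(repo_id, repo_type="model", exist_ok=True)

# Upload the directory containing your pretrained embeddings / model files.
api = HfApi()
api.upload_folder(
    folder_path="path/to/your/embeddings",  # placeholder path
    repo_id=repo_id,
    repo_type="model",
)
```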
As noted above, we will only train systems on the tasks you submit to. However, if you wish to augment the data in some way, you can submit an additional folder with data in the Shared Task format and we will train your system on that data as well. The volume of augmented data may not exceed the size of the Shared Task dataset itself, and you may not use the test or development sets in any way to create the augmented data.
No, training on the dev data is not allowed. The final scores of a system used in the overall ranking must be obtained from a model trained solely on the training data. (You may train on the dev data as an experiment and report the resulting scores in your paper, but such results will not be considered or reported as the official scores of the system in the overall ranking.) Of course, you can still use the dev data for hyperparameter optimization, error analysis, etc. - just not for system training.
Yes. We believe that negative results can move the field forward, especially when they are accompanied by insightful analysis of why a certain approach does not work. However, negative results are not guaranteed to be accepted, and will be reviewed based on the contribution their analyses can make to the field.
Yes. You may ask for your results to either not appear at all in the overall ranking, or to appear anonymously.
System outputs should be formatted identically to the gold standard data in each format. The official scorer also expects this format, so if it runs correctly and outputs the score you expect, your system output should be fine.
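As an additional sanity check before submission, you could verify that your output is line-aligned with the gold file and differs only in the label column (a rough sketch, assuming tab-separated files with the label in the final column; the file names are placeholders, and the official scorer remains the authoritative check):

```python
def check_alignment(gold_path: str, pred_path: str) -> None:
    """Rough format check: same lines as gold, differing only in the final (label) column."""
    with open(gold_path, encoding="utf-8") as g, open(pred_path, encoding="utf-8") as p:
        gold_lines = [line.rstrip("\n") for line in g]
        pred_lines = [line.rstrip("\n") for line in p]
    assert len(gold_lines) == len(pred_lines), "Line counts differ"
    for i, (gl, pl) in enumerate(zip(gold_lines, pred_lines), start=1):
        if gl.startswith("#") or not gl.strip():
            # Comment and blank lines should be carried over unchanged.
            assert gl == pl, f"Line {i}: comment/blank line was altered"
        else:
            # Token lines: every column except the last (the label) should match the gold file.
            assert gl.split("\t")[:-1] == pl.split("\t")[:-1], f"Line {i}: non-label columns differ"


# Hypothetical file names:
check_alignment("eng.example_dev.tok", "predictions/eng.example_dev.tok")
```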
We are aware that not all teams may have LDC subscriptions. In the interest of promoting equity regardless of access and funding status, we will evaluate submitted systems on the closed LDC datasets for you, even if you cannot test your system on these datasets yourself. We will report the scores to authors so that they can add them to their papers for the camera-ready version.
We believe that equity is an important part of the shared task and while we cannot make computing resources available to participants, we are considering reporting a score for the best non-neural system in each category (depending on whether such systems are submitted).
You can find the relevant resources (e.g. papers, annotation manuals etc.) in the README.md in each data directory in the GitHub repository.