Task Description:
The SCI-CHAT shared task will focus on simulating intelligent conversations: participants will be asked to submit (access to the APIs of) automated dialogue agents with the aim of carrying out nuanced conversations over multiple dialogue turns. Participating systems will be evaluated interactively in a live human evaluation. All data acquired within the context of the shared task (i.e., model and user interactions) will be made public, providing an important resource for improving metrics and systems in this research area. Please note that there is no obligation for participating systems to be made publicly available.
Participating Models:
To promote accessibility and encourage participation, participants may use any model, pre-trained or not, for the task; we provide a baseline model in the form of DialoGPT-Medium fine-tuned on Freakonomics podcast transcripts (see dataset below) in our Git repository. Participants are allowed to use pre-trained models that are not freely accessible to the public, but to ensure fairness, they must inform the organizers of this.
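For orientation, the sketch below shows roughly how such a baseline can be queried through the Hugging Face transformers library. Note that "microsoft/DialoGPT-medium" is the public base checkpoint; the fine-tuned podcast baseline itself should be obtained from our Git repository, and the decoding settings shown here are illustrative assumptions rather than the baseline's official configuration.

# Illustrative sketch: querying a DialoGPT-style model with Hugging Face transformers.
# "microsoft/DialoGPT-medium" is the public base checkpoint; the fine-tuned podcast
# baseline is distributed through the shared-task Git repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/DialoGPT-medium"  # swap in the fine-tuned baseline checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def respond(history: str) -> str:
    """Generate a single reply given the dialogue history as a flat string."""
    input_ids = tokenizer.encode(history + tokenizer.eos_token, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=60,                    # illustrative decoding settings
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the input context.
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(respond("New technologies always scare us. Is AI any different?"))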
Dataset (Dialogue Corpora):
Participants are encouraged to use the "Podcast" dataset (textual transcripts) for their experiments. The Podcast dataset is free and publicly accessible at https://freakonomics.com/. To ensure data consistency, please refer to the Git repository for a Python script that crawls the podcast data. The repository also includes instructions for installing the prerequisite libraries required by the script. A sample podcast transcript is also available in the Git repository for participants' reference.
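Purely as an illustration of the kind of crawl the provided script performs, a minimal sketch using requests and BeautifulSoup is given below; the episode URL and HTML selector are placeholders (assumptions about the page layout), so please use the official script in the Git repository to obtain data consistent with other participants.

# Hypothetical illustration only -- use the official crawler in the Git repository
# to obtain the consistent, shared version of the Podcast dataset.
import requests
from bs4 import BeautifulSoup

EPISODE_URL = "https://freakonomics.com/podcast/example-episode/"  # placeholder URL

def fetch_transcript(url: str) -> str:
    """Download an episode page and return its visible transcript text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The selector below is an assumption about the page markup; the official
    # script handles the real layout.
    paragraphs = soup.select("div.transcript p") or soup.find_all("p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)

print(fetch_transcript(EPISODE_URL)[:500])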
Participants are permitted to use any data for system training, including the provided "Podcast" dataset as well as other available datasets such as PersonaChat, Switchboard, and MultiWOZ. However, we require participants to acknowledge all datasets used in their system description paper.
Evaluation Criteria:
This evaluation process aims to provide valuable insights into the performance of AI systems in generating human-like conversation. Human assessment will provide the primary/official results of the competition and will be carried out using the Direct Assessment method adapted for open-domain dialogue described in Ji et al. (2022).
Evaluation is based on the content and relevance of the responses generated by the model. During human evaluation, judges will be provided with a specific topic (e.g., "New Technologies Always Scare Us. Is AI Any Different?") and will be encouraged to discuss that topic with the participating systems.
System submission:
To be included in the competition, participants should submit, via the Google form, an API that can be used in a live evaluation of models by January 20th, 2024. Optionally, participants can submit their API earlier (by January 13th, 2024); this will help us test our access to the API and resolve any formatting issues early. Participating models will not be officially evaluated until after January 20th, so participants may still update their models in the meantime. Any updates to a model after this deadline will result in its disqualification.
The submission link will be available soon.
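The exact request/response schema expected in the live evaluation will be communicated together with the submission form. Purely as an illustration, a minimal dialogue endpoint might look like the sketch below (built here with FastAPI, with the endpoint path and field names chosen as assumptions rather than the official format).

# Illustrative sketch only: the official API schema will be specified by the organizers.
# The endpoint path ("/respond") and field names ("history", "response") are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class DialogueRequest(BaseModel):
    history: list[str]  # alternating user/system turns, oldest first

class DialogueResponse(BaseModel):
    response: str

@app.post("/respond", response_model=DialogueResponse)
def respond(request: DialogueRequest) -> DialogueResponse:
    # Replace this stub with a call to the participating dialogue model.
    last_turn = request.history[-1] if request.history else ""
    return DialogueResponse(response=f"(placeholder reply to: {last_turn})")

# Run locally with, e.g.: uvicorn api:app --host 0.0.0.0 --port 8000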
System description submission:
In order to be included in the competition results, all participants are expected to submit a system description paper of at most 8 pages (excluding references and bibliography). System description papers follow the direct submission route.
Direct paper submissions must be made through the SoftConf submission link:
https://softconf.com/eacl2024/SCI-CHAT-2024/
System submission (API) deadline: January 20th, 2024
System description paper via SoftConf: January 26th, 2024
References:
Ji, Tianbo, Yvette Graham, Gareth Jones, Chenyang Lyu, and Qun Liu (2022). Achieving Reliable Human Evaluation of Open-domain Dialogue Systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.