The First Workshop on

Evaluations and Assessments of Neural Conversation Systems (EANCS)

Co-located with EMNLP 2021

Task Description

Dialogue Off-Policy Evaluation (OPE) is the task of estimating human evaluation scores for dialogue systems without real interaction between humans and the dialogue systems. We provide a shared task by releasing a new benchmark dataset for comparing evaluation methods.


Datasets

OPE is usually done with experience data, i.e., historical conversations between humans and dialogue systems. The dialogue systems used to collect the experience data are also called behavior models or behavior agents.

Note that the experience data can differ from the training data, which usually contains only high-quality human-to-human conversations. Because human-to-human conversations do not cover the failure modes of dialogue systems, they are not appropriate for OPE.

We provide experience data and pre-trained dialogue agents for two tasks, ConvAI2 and AirDialogue, in the GitHub repo, along with further details.
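
As a rough illustration of how the experience data might be consumed, here is a minimal sketch of the record structure and the OPE interface. The field and function names below (LoggedConversation, estimate_agent_score) are hypothetical, not the released format; see the GitHub repo for the actual schema.

    # Hypothetical sketch of experience data and the OPE interface;
    # the actual file format is documented in the GitHub repo.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LoggedConversation:
        turns: List[str]        # alternating human / agent utterances
        behavior_agent: str     # id of the behavior agent that produced this log
        human_score: float      # human evaluation of the whole conversation

    def estimate_agent_score(experience: List[LoggedConversation],
                             target_agent) -> float:
        """Return an estimate of the target agent's agent-level human score,
        using only logged conversations (no new human interaction)."""
        raise NotImplementedError  # this is what a submission implements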


Evaluation Metrics

For each type of human evaluation score, we will compare the estimated and ground-truth human evaluation scores for different agents using Pearson's correlation and Spearman's correlation.
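
For concreteness, the comparison can be computed with standard routines such as scipy.stats. The score values below are hypothetical examples, one per evaluated agent.

    # Agreement between estimated and ground-truth agent-level scores.
    # `estimated` and `ground_truth` are hypothetical example values, one per agent.
    from scipy.stats import pearsonr, spearmanr

    estimated    = [0.62, 0.48, 0.71, 0.55]   # OPE estimates for four agents
    ground_truth = [0.60, 0.50, 0.75, 0.52]   # human evaluation scores

    pearson_r, _  = pearsonr(estimated, ground_truth)
    spearman_r, _ = spearmanr(estimated, ground_truth)
    print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")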


Leaderboard

You can submit your result to the leaderboard here: https://competitions.codalab.org/competitions/33769


Baseline System

A random baseline and a starting kit are available on GitHub.


Paper Submission

We encourage each team to submit a paper describing their system to the workshop so that it can be included in the workshop proceedings. Submissions may be either long papers (8 pages) or short papers (4 pages) and should follow the EMNLP 2021 templates and formatting requirements. The submission site is: https://www.softconf.com/emnlp2021/EANCS/.


Q&A

Q: Are we considering per-conversation / per-turn / agent-level evaluation?

A: Each conversation is evaluated as a whole at the end, e.g., the task completion score or the fluency score of the entire conversation. We aim to evaluate the performance of dialogue systems at the agent level, i.e., the expected (averaged) per-conversation score. This is the same setting as [1].
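
In other words, assuming per_conversation_scores holds the human scores of the evaluated conversations with a given agent (hypothetical values below), the agent-level score is simply their mean:

    # Agent-level score = average of per-conversation human scores (hypothetical values).
    per_conversation_scores = [0.8, 0.5, 1.0, 0.7]
    agent_level_score = sum(per_conversation_scores) / len(per_conversation_scores)
    print(agent_level_score)  # 0.75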

Q: Can we use additional data?

A: Yes, you can use other datasets / pre-trained models, except for the new human-model / human-human chat logs for ConvAI2/AirDialogue. The only human chat logs that can be used are: 1) the original agent training data; and 2) the data within each target agent's folder (see details here).

Q: Is it possible not to use the experience data?

A: Yes. For example, you can use self-play evaluation (see the example for AirDialogue); a rough sketch of such a loop is shown below.
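
The sketch below shows what self-play evaluation could look like for a goal-oriented task such as AirDialogue. The agent/simulator interfaces and the reward function are hypothetical placeholders, not the actual starter-kit API.

    # Hypothetical self-play loop: the target agent talks to a simulated user
    # (e.g. a pre-trained customer bot) and the resulting dialogues are scored.
    def self_play_score(target_agent, user_simulator, scenarios, max_turns=20):
        scores = []
        for scenario in scenarios:
            dialogue = []
            for _ in range(max_turns):
                user_utt = user_simulator.respond(dialogue, scenario)
                dialogue.append(("user", user_utt))
                agent_utt = target_agent.respond(dialogue)
                dialogue.append(("agent", agent_utt))
                if scenario.is_finished(dialogue):
                    break
            scores.append(scenario.reward(dialogue))  # e.g. task completion
        return sum(scores) / len(scores)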

Q: There are four settings. What's the difference?

A: For each task, we consider four settings: 1) the full experience data; 2) & 3) the experience data subsampled to 50% and 10%; and 4) filtered experience data. In settings 2) and 3), we randomly remove part of the experience data to test the effectiveness of OPE algorithms when the experience coverage is low. Due to the similarity between the pre-trained models, the behavior of the target model and the behavior model may be very similar for some conversations. In setting 4), we filter out these highly overlapping conversations to mimic settings where the target model and the behavior model are not very similar.
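
For example, the subsampled settings can be thought of as random draws from the full experience data. This is a hedged sketch; `experience` below is a placeholder list standing in for the logged conversations.

    # Hypothetical illustration of settings 2) and 3): keep a random fraction
    # of the experience data to simulate low coverage.
    import random

    experience = list(range(1000))  # placeholder for the logged conversations

    def subsample(data, fraction, seed=0):
        rng = random.Random(seed)
        return rng.sample(data, int(len(data) * fraction))

    half_experience  = subsample(experience, 0.5)   # setting 2)
    tenth_experience = subsample(experience, 0.1)   # setting 3)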

Contacts

For further questions regarding the workshop shared task, please contact our shared task chair, Dr. Haoming Jiang (jianghm.ustc@gmail.com).

References

[1] Jiang, Haoming, et al. "Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach."

[2] Wei, Wei, et al. "Airdialogue: An environment for goal-oriented dialogue research."

[3] See, Abigail, et al. "What makes a good conversation? How controllable attributes affect human judgments."