Italian DIALogue systems evaluation
Motivations and Background
Conversational Agents are among the most impressive demonstrations of the recent resurgence of Artificial Intelligence. There is now high expectation for a new generation of dialogue systems that can naturally interact with and assist humans in a number of scenarios, including virtual coaches, personal assistants and automatic help desks. However, despite the growing commercial interest in various task-oriented conversational agents, there is still a general lack of methodologies for their evaluation.
This evaluation exercise targets both private companies and public institutions willing to assess the quality of a dialogue system that uses Italian as the language of interaction and that operates with a high degree of autonomy on a specific task. Examples include a conversational banking front-end, a virtual assistant for managing appointments, a chat-bot suggesting sport activities, a dialogue system for travel booking, and a chat-bot helping users buy products on an e-commerce website.
The task intends to develop and apply evaluation protocols for the quality assessment of dialogue systems for the Italian language. We target the evaluation of existing task-oriented dialogue systems (both industrial systems and academic prototypes) that are in operation at the date of the test period (September 2018). Through the application of an evaluation protocol, which will be published well in advance of the test phase, a number of standard and internationally recognized metrics will be applied to assess the quality of each dialogue system.

The output of the evaluation will not be a ranking, which would not be meaningful given the potential heterogeneity of the application domains and interaction modalities submitted by participants. Rather, we will provide a qualitative assessment of each participating system, based on a detailed and coherent set of technological and interactive characteristics of the system. As a side effect of the evaluation exercise, we expect a public discussion of the applied protocols, which could eventually lead to their improvement for a future evaluation.

We will guarantee a proper level of anonymization both of the data produced through interaction with the dialogue systems and of the evaluation results. Given the peculiar nature of the evaluation, which will be carried out by human assessors, this task requires neither training nor test data.