Spoken language understanding (SLU) is a key component of a spoken dialogue system (SDS), parsing users' utterances into corresponding semantic concepts. For example, the utterance ``Show me flights from Boston to New York" can be parsed into (fromloc.city_name=Boston, toloc.city_name=New York). Building a robust semantic parser for a multi-turn task-oriented spoken dialogue system is challenging, as it faces three main problems: the variety of spoken language expression, the uncertainty of automatic speech recognition (ASR), and adaptation to new dialogue domains.
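To make the target representation concrete, the snippet below sketches one possible way to store such a parse as slot-value pairs. The variable names and dictionary layout are illustrative, not the challenge's official data format.

```python
# Illustrative sketch (not the official data format): an utterance and its
# semantic parse represented as slot-value pairs.
utterance = "Show me flights from Boston to New York"

semantic_parse = {
    "fromloc.city_name": "Boston",
    "toloc.city_name": "New York",
}

for slot, value in semantic_parse.items():
    print(f"{slot} = {value}")
# fromloc.city_name = Boston
# toloc.city_name = New York
```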
Firstly, compared to written language, spoken language is much harder for a language understanding system to handle, since it contains more complex linguistic phenomena: unnecessary repetitions, false starts, repairs and other disfluencies. These phenomena make it difficult to build a semantic parser. Furthermore, spoken language processing usually relies on automatic speech recognition (ASR) to convert speech to text, and ASR errors make SLU even more challenging. To improve robustness to ASR errors, audio information is essential. A dialogue example is shown below.
Example of an ASR 1-best (top hypothesis), transcription and semantic annotation for a dialogue in the "music searching" domain. It shows how speech recognition errors can severely degrade spoken language understanding.
Secondly, it is also hard to obtain enough labelled data for a new dialogue domain, since data collection and annotation in the flow of a dialogue are very expensive and time-consuming. Therefore, domain adaptation of SLU becomes important: a semantic parser is trained on some source domains and then adapted to the target domain. Different from the music domain in the table above, a dialogue example from the video domain is shown in the table below.
Example of an ASR 1-best (top hypothesis), transcription and semantic annotation for a dialogue in the "video searching" domain.
To fully investigate these problems and promote the application of spoken dialogue systems, we will release a multi-turn task-oriented Chinese spoken dialogue dataset (as shown in the table below) and organize the first open, audio-text based Chinese Task Oriented Spoken Language Understanding Challenge. The challenge consists of two sub-challenges.
Statistics of the CATSLU dataset.
The first sub-challenge is to build a slot-filling system in a single domain. A large number of training dialogues related to music search and map navigation will be released (20% of the utterances will be randomly selected as test data). The data was collected from real-world dialogues between users and a deployed spoken dialogue system (human-computer interaction). Both audio and text information are important for understanding users, so audio features will be provided in addition to text features.
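As a starting point for this sub-challenge, the sketch below shows a minimal text-only BiLSTM slot tagger in PyTorch. It is only an illustration of one possible model: the vocabulary size, tag inventory and BIO tagging scheme are assumptions, and the provided audio features could, for example, be concatenated to the token embeddings.

```python
# Minimal BiLSTM slot-tagging sketch (an assumed baseline, not the official
# challenge model). Inputs are padded token ids of ASR hypotheses; outputs
# are per-token tag logits over an assumed BIO slot-tag inventory.
import torch
import torch.nn as nn

class BiLSTMSlotTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, num_tags)
        emb = self.embedding(token_ids)
        hidden, _ = self.lstm(emb)
        return self.out(hidden)

# Hypothetical usage on one padded batch of two ASR hypotheses.
model = BiLSTMSlotTagger(vocab_size=5000, num_tags=20)
batch = torch.randint(1, 5000, (2, 12))    # fake token ids
logits = model(batch)
predicted_tags = logits.argmax(dim=-1)     # (2, 12) predicted tag ids
```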
The second sub-challenge is to adapt an SLU model from the source domains to a target domain. We set music and map as source domains, while video and weather serve as target domains (20% of the utterances will be randomly selected as seed data and the rest is used for evaluation). Participants can use the seed data plus the music and map data from the first sub-challenge for adaptive training.
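One common recipe for this setting, sketched below, is to first train a slot tagger on the source-domain data and then fine-tune the same parameters on the target-domain seed data. This is only an illustration under the assumption of a PyTorch tagger like the one above; the loader names, learning rates and epoch counts are hypothetical.

```python
# Hedged sketch of the adaptation recipe: pretrain on the source domains
# (music + map), then fine-tune on the small seed set of the target domain
# (video or weather). The model and data loaders are assumed to exist.
import torch

def run_training(model, loader, loss_fn, lr, epochs):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for token_ids, tag_ids in loader:
            optimizer.zero_grad()
            logits = model(token_ids)                        # (B, T, num_tags)
            loss = loss_fn(logits.transpose(1, 2), tag_ids)  # CE over tags
            loss.backward()
            optimizer.step()

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)

# Stage 1: source-domain training on the music + map data.
# run_training(model, source_loader, loss_fn, lr=1e-3, epochs=10)

# Stage 2: adaptation -- fine-tune on the 20% seed utterances of the target
# domain, typically with a smaller learning rate.
# run_training(model, target_seed_loader, loss_fn, lr=1e-4, epochs=5)
```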
Data is split into train/development/test sets. Two baseline systems and evaluation scripts are provided: one is a simple rule-based system using string matching, and the other is based on neural networks. Results on the development set are as follows. For more details, please refer to the handbook below.
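To give a sense of what a string-matching, rule-based baseline does, the sketch below emits a (slot, value) pair whenever a value from a slot ontology appears verbatim in the ASR 1-best. The slot names and values are a toy example and are not taken from the released baseline scripts.

```python
# Illustrative sketch of a string-matching, rule-based SLU baseline: scan the
# ASR 1-best hypothesis for any value listed in a (toy) slot ontology.
ontology = {
    "singer_name": ["周杰伦", "林俊杰"],
    "song_name": ["晴天", "七里香"],
}

def rule_based_parse(asr_1best, ontology):
    """Return all (slot, value) pairs whose value occurs in the hypothesis."""
    parse = []
    for slot, values in ontology.items():
        for value in values:
            if value in asr_1best:
                parse.append((slot, value))
    return parse

print(rule_based_parse("我想听周杰伦的晴天", ontology))
# [('singer_name', '周杰伦'), ('song_name', '晴天')]
```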