Spoken Utterances Guiding Chef’s Assistant Robots


In the last few years, human-machine interaction systems have been in the spotlight in both computer science and linguistics, resulting in many applications such as virtual assistants and conversational agents. Using such Artificial Intelligence technologies in domestic environments is increasingly becoming a reality. Further research is needed to make these systems more intelligent. As has been the case with Apple Siri and Google Assistant, recent approaches have transformed earlier dialogue systems into direct action actuators, removing or reducing as much as possible the clarification requests that may arise in the presence of ambiguous commands. In this view, Spoken Language Understanding (SLU) is nowadays one of the major challenges of the field.

Task Description and Data Annotation

The SUGAR task's goal is to train a voice-controlled robotic agent to act as a cooking assistant. For this purpose, a training corpus of spoken commands is collected and annotated. To collect the corpus, we designed a 3D virtual environment that reconstructs and simulates a real kitchen, where users can interact with a robot that receives the commands it must perform in order to complete some recipes. Users’ orders are inspired by silent cooking videos shown in the 3D scene, thus ensuring the naturalness of the spoken production. Videos are segmented into elementary portions (frames) and proposed sequentially to the speakers, who utter a single sentence after each frame. In other words, speakers watch a video portion and then instruct the robot to reproduce what is seen in it. The collected corpus thus consists of a set of commands whose meaning derives from the various combinations of actions, items (i.e. ingredients), tools and different modifiers. Audio files will be captured in a real acoustic environment, with a microphone placed about 1 m from the speakers. In the resulting corpus, the speakers’ voice in each audio file is segmented into sentences representing isolated commands.

Actions will be represented as a finite set of predicates accepting an open set of parameters. For example, the action of putting may refer to a pot being placed on the fire:

put(pot, fire)

or to an egg being put in a bowl

put(egg, bowl)

The annotation process determines the action predicate that best corresponds to each command. The training set consists of pairs of audio files and predicate descriptions, where the predicate serves as an interpretation of the intention to be carried out by the robot. In these scenarios, each audio file will always be mapped to a single interpretative predicate.
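To make the predicate notation concrete, the following minimal sketch parses a flat predicate string such as put(pot, fire) into an action name and a tuple of parameters. The names Predicate and parse_predicate are hypothetical and not part of the task materials, and the sketch assumes that parameters are plain comma-separated tokens without nesting.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predicate:
    """Hypothetical container for one annotated command."""
    action: str             # drawn from the finite set of actions
    args: Tuple[str, ...]   # open set of parameters (items, tools, ...)


# Assumption: predicates are flat, e.g. "put(pot, fire)"; nesting is not handled.
_PATTERN = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_predicate(text: str) -> Predicate:
    """Split 'action(arg1, arg2, ...)' into its action and parameters."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a predicate: {text!r}")
    action, raw_args = match.groups()
    # Drop empty pieces so that "stop()" yields an empty parameter tuple.
    args = tuple(a.strip() for a in raw_args.split(",") if a.strip())
    return Predicate(action, args)


if __name__ == "__main__":
    print(parse_predicate("put(pot, fire)"))
    print(parse_predicate("put(egg, bowl)"))
```

Under these assumptions, a training example could be stored as a pair of an audio file path and a Predicate instance.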

The training set consists of 1500 utterances produced by 20 different speakers and annotated by experts; it will be handed out in May 2018. The test set consists of 500 audio files containing uttered commands and will be made available in September 2018.

Unlike in the training set, actions in the test set will not be expressed in the regular sequence that leads to the completion of a given recipe: isolated commands will be randomly extracted from the original sequences. The individual actions will be the same ones found in the training set, while the objects to which they are applied will vary (i.e. different recipes, ingredients, tools...). For each recorded command, a file describing the context in which the command was issued will be provided. The context file will contain a trace of the previous history of the interaction, composed of utterances and previous actions taken by the robot. The context will be represented as the correct command history, going back up to 3 previous steps.
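The bounded context described above can be sketched as a sliding window over the interaction history. This is only an illustration: the class InteractionContext, its method names and the pairing of each utterance with a predicate string are assumptions, since the actual format of the context file is not specified here.

```python
from collections import deque
from typing import Deque, List, Tuple

MAX_CONTEXT_STEPS = 3  # the text states a history of up to 3 previous steps

# Hypothetical step layout: (utterance transcript, predicate executed by the robot)
Step = Tuple[str, str]


class InteractionContext:
    """Keeps only the most recent steps of an interaction."""

    def __init__(self) -> None:
        # A deque with maxlen discards the oldest step automatically.
        self._history: Deque[Step] = deque(maxlen=MAX_CONTEXT_STEPS)

    def record(self, utterance: str, action: str) -> None:
        """Append a completed step to the history."""
        self._history.append((utterance, action))

    def snapshot(self) -> List[Step]:
        """Return the context as it would accompany the next command."""
        return list(self._history)


if __name__ == "__main__":
    ctx = InteractionContext()
    ctx.record("take the pot", "take(pot)")
    ctx.record("put it on the fire", "put(pot, fire)")
    print(ctx.snapshot())
```

The example utterances are invented for illustration and do not come from the corpus.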


Participants will be evaluated on the basis of correctly interpreted commands, represented in the form of predicates.

The evaluation protocol will cover the following possibilities:

• The proposed system correctly detects the requested action and all its parameters

• The proposed system asks for repetition

• The proposed system correctly detects the requested action but it assigns the wrong parameters

• The proposed system misses the action

Participants are allowed to ask for repetitions so that they are not forced to provide an answer in uncertain conditions. In this case, the evaluation protocol will assign a weaker penalty than the one applied for wrong parameters or a missed action. The collected corpus will not, however, contain situations in which the system asks for repetitions. The details of the evaluation procedure will be designed to align the considered scenario with metrics reported in the literature.
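The four cases of the evaluation protocol can be sketched as a small classification routine. Everything named here is hypothetical: the Outcome labels, the use of None to signal a repetition request, and above all the numeric penalties, which are placeholders chosen only to respect the stated ordering (a repetition request costs less than wrong parameters or a missed action). The official weights are not given in this description.

```python
from enum import Enum
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical predicate layout: (action, parameters)
Predicate = Tuple[str, Tuple[str, ...]]


class Outcome(Enum):
    CORRECT = "correct action and parameters"
    REPETITION = "system asked for repetition"
    WRONG_PARAMETERS = "correct action, wrong parameters"
    MISSED_ACTION = "wrong action"


def classify(gold: Predicate, predicted: Optional[Predicate]) -> Outcome:
    """Map one system answer onto the four cases of the protocol."""
    if predicted is None:  # assumption: None means "please repeat"
        return Outcome.REPETITION
    if predicted[0] != gold[0]:
        return Outcome.MISSED_ACTION
    if predicted[1] != gold[1]:
        return Outcome.WRONG_PARAMETERS
    return Outcome.CORRECT


# Placeholder values, NOT the official ones; only the ordering follows the text.
ILLUSTRATIVE_PENALTIES: Dict[Outcome, float] = {
    Outcome.CORRECT: 0.0,
    Outcome.REPETITION: 0.5,
    Outcome.WRONG_PARAMETERS: 1.0,
    Outcome.MISSED_ACTION: 1.0,
}


def total_penalty(outcomes: Iterable[Outcome],
                  penalties: Dict[Outcome, float] = ILLUSTRATIVE_PENALTIES) -> float:
    """Sum the penalties over a sequence of classified answers."""
    return sum(penalties[o] for o in outcomes)


if __name__ == "__main__":
    gold = ("put", ("egg", "bowl"))
    answers = [("put", ("egg", "bowl")), None, ("put", ("egg", "pot")), ("take", ("egg",))]
    outcomes = [classify(gold, a) for a in answers]
    print([o.name for o in outcomes], total_penalty(outcomes))
```

Passing the penalties as a parameter keeps the classification independent of whatever weighting the final evaluation procedure adopts.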