Team NICT for WRS Virtual Space

Motivation

In our aging society, the working-age population that can physically support people with disabilities is insufficient, and in some cases family members are forced to quit their jobs to provide care. As a solution to these social problems, domestic service robots can improve the Quality of Life (QoL) of people with physical disabilities, free their families from time constraints, and thereby improve the productivity of society as a whole.


Team Members

Our team NICT has solid experience in domestic service robot competitions. As one of the founders of team eR@sers, we won the RoboCup@Home league in 2008 and 2010 and placed second in 2009 and 2012. We also placed second in the RoboCup@Home Domestic Standard Platform League in 2017. In parallel, we won the RoboCup Japan Open @Home league every year from 2008 to 2012. Moreover, we won first place in the Toyota HSR Hackathon 2015. We are regularly involved in RoboCup@Home both as participants and as organizers.

Team NICT has the following members:

Team Leader: Komei Sugiura

Staff: Aly Magassouba, Yusuke Omori

Students: Trinh Quoc Anh



Scientific Contributions

Recently, domestic service robot (DSR) hardware has been standardized and many studies have been conducted on such platforms. However, the communication ability of most DSRs is still very limited. Team NICT relies principally on multimodal language understanding (MLU) to overcome this limitation, as reflected in many of our publications.

Understanding Ambiguous Commands

In the Handy Man task, natural-language instructions are given to the robot. The difficulty lies in understanding ambiguous instructions. We developed several methods to disambiguate users' instructions during manipulation tasks.

In this work, we addressed the case where the instruction does not contain any verb (e.g., ``Bottle please''), which requires the robot to predict the feasibility of the possible physical actions from the initial instruction alone.

Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, it is unclear whether the predicted action is physically executable. In this work, we introduced a grounded instruction understanding method that estimates appropriate objects given an instruction and a situation, using a classifier based on Generative Adversarial Nets that operates on latent representations.

In a context similar to the WRS ``Handy Man'' task, we addressed the case where the target area in which to place an object is ambiguous, that is, when the instruction does not specify it. For instance, ``Put away the milk and cereal.'' is a natural instruction in a daily-life environment, yet it leaves the target area ambiguous. Conventionally, such an instruction can be disambiguated through a dialogue system, but at the cost of time and cumbersomeness. Instead, we propose an MLU approach in which the instruction is disambiguated from the robot state and the environment context. Deep learning techniques are used to predict task feasibility according to the HSR's physical limitations and the space available on the different target areas. We developed the MultiModal Classifier Generative Adversarial Network (MMC-GAN) to predict the likelihood of the different target areas considering the robot's physical limitations and the clutter of the targets.
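To make the idea concrete, the following is a minimal sketch of the kind of multimodal classifier that this approach builds on: it fuses a linguistic embedding of the instruction with a visual/robot-state embedding and outputs a likelihood over candidate target areas. The feature dimensions, layer sizes, and the number of target areas are illustrative assumptions, not the published MMC-GAN architecture.

```python
# Minimal sketch (not the published MMC-GAN): a multimodal classifier that
# fuses a linguistic embedding of the instruction with a visual/robot-state
# embedding and scores candidate target areas. All dimensions are assumptions.
import torch
import torch.nn as nn

NUM_TARGET_AREAS = 5      # e.g., shelf, table, bin, ... (assumed)
LANG_DIM = 300            # instruction embedding size (assumed)
VISUAL_DIM = 512          # scene/robot-state feature size (assumed)

class MultimodalTargetClassifier(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Project each modality into a shared latent space.
        self.lang_encoder = nn.Sequential(nn.Linear(LANG_DIM, latent_dim), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(VISUAL_DIM, latent_dim), nn.ReLU())
        # Fuse the latent representations and classify target areas.
        self.classifier = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, NUM_TARGET_AREAS),
        )

    def forward(self, lang_feat, visual_feat):
        z = torch.cat([self.lang_encoder(lang_feat),
                       self.visual_encoder(visual_feat)], dim=-1)
        return self.classifier(z)          # raw scores per target area

if __name__ == "__main__":
    model = MultimodalTargetClassifier()
    lang = torch.randn(1, LANG_DIM)        # stand-in for an instruction embedding
    vision = torch.randn(1, VISUAL_DIM)    # stand-in for scene/state features
    probs = torch.softmax(model(lang, vision), dim=-1)
    print(probs)                           # likelihood of each candidate target area
```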

Left: Rospeex's architecture. Right: A human subject executing the motion according to the command generated by ELLR.

To build conversational robots, roboticists are required to have deep knowledge of both robotics and spoken dialogue systems. Although they can use existing cloud services built for other applications, e.g., voice search, it is difficult to share the robotics-specific speech corpora obtained as server logs, because they get buried in non-robotics-related logs. Building a cloud platform dedicated to the robotics community benefits not only individual robot developers but also the community as a whole, since the log corpus it collects can be shared. This is challenging because a wide variety of functionalities is needed, ranging from a stable cloud platform to high-quality multilingual speech recognition and synthesis engines, as shown in the above left-hand figure. Rospeex was originally developed as a speech module for team eR@sers in RoboCup@Home competitions, and has been used by 50,000 unique users as of January 2018.

Active-Learning-Based Sentence Generation

The central issue in Human Navigation is strongly related to Natural Language Generation (NLG) studies. We have already built a language generation method using active learning. In a human-robot spoken dialogue, a robot may misunderstand an ambiguous command from a user, such as ``Bring me a cup,'' potentially resulting in an accident. The command is ambiguous when there are multiple cups, and asking a confirmation question such as ``Do you mean the blue cup on the table?'' can decrease the risk. However, although asking a confirmation question before every motion execution would decrease the risk of such failures, the user finds it more convenient if confirmation questions are not asked in trivial situations. In this work, we proposed an active-learning-based dialogue management method for multimodal dialogues. The interactions were based on multimodal information such as speech, motion, and visual information. The above right-hand figure shows a sample utterance generated by the method. The method has a user model that predicts how likely a generated robot utterance is to be understood by the user. Bayesian logistic regression (BLR) is used to learn the user model. Active learning is used to select the utterances that most effectively train the user model; as the active learning method, we used Expected Log Loss Reduction (ELLR). From the user model, we also obtain an expected utility that is used to decide whether to ask a confirmation question.
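As a rough illustration of the selection criterion, the sketch below scores each candidate utterance by the expected log loss over an unlabeled pool after hypothetically adding that utterance with each possible label, weighted by the current predictive probabilities. A plain scikit-learn logistic regression stands in for the BLR user model, and the feature vectors and pool are synthetic.

```python
# Minimal sketch of Expected Log Loss Reduction (ELLR) for selecting which
# utterance to present next. A plain logistic regression stands in for the
# Bayesian logistic regression user model; all data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Labeled history: utterance features -> whether the user understood (1) or not (0).
X_train = rng.normal(size=(20, 4))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

# Unlabeled pool of candidate utterances the robot could generate next.
X_pool = rng.normal(size=(30, 4))

def expected_log_loss(X_l, y_l, X_p):
    """Train on (X_l, y_l) and return the expected log loss over the pool."""
    model = LogisticRegression().fit(X_l, y_l)
    p = model.predict_proba(X_p)
    # Expected log loss: -sum_y p(y|x) log p(y|x), averaged over the pool.
    return float(np.mean(-np.sum(p * np.log(p + 1e-12), axis=1)))

base_model = LogisticRegression().fit(X_train, y_train)
scores = []
for x in X_pool:
    p_y = base_model.predict_proba(x.reshape(1, -1))[0]
    # For each hypothetical label, retrain and measure pool loss, then average.
    loss = sum(
        p_y[y] * expected_log_loss(
            np.vstack([X_train, x]), np.append(y_train, y), X_pool)
        for y in (0, 1)
    )
    scores.append(loss)

best = int(np.argmin(scores))
print("Index of the next utterance to present:", best)
```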

Dialogue-oriented Text to Speech

Text-to-Speech (TTS) systems are usually optimized for text reading rather than for communication. Here we developed non-monologue speech synthesis for robots. We collected a speech corpus in a non-monologue style in which two professional voice talents read scripted dialogues, and published it for free for research purposes; it is the largest corpus for dialogue-oriented TTS. Hidden Markov models (HMMs) were then trained on the corpus and used for speech synthesis. Our method was shown to outperform conventional text-reading TTS.

All these methods were designed to be applied in real-world situations. Furthermore, there are strong similarities between our work on resolving ambiguous language instructions during manipulation tasks and the WRS tasks: the WRS challenges give us an opportunity to combine and extend these methods.

Development

Multimodal Language Understanding

Left: Ranking of the likelihood of targets predicted by our MLU module in a real-world situation. Right: A rule-based sentence generation example in the Human Navigation task.

The HandyMan task is closely related to the General Purpose Service Robot (GPSR) task in RoboCup@Home, where a random sentence generator produces commands to retrieve objects. In 2010, we developed and published the first version of the GPSR sentence generator, so we have extensive experience with sentence generation. However, in contrast to the rule-based GPSR generator, the generator used in WRS is likely to produce more natural and unconstrained sentences. Therefore, the main focus of the HandyMan task is on understanding unconstrained and ambiguous commands to retrieve objects.

We have already developed several MLU modules. These MLU methods use linguistic features, namely the operator's instruction and the robot context. Additionally, visual features are used to disambiguate the user's instructions.

A ranking is then computed based on the feasibility of the task, as illustrated in the above left-hand figure, for the case where the robot should place an object in a given environment. These MLU methods will be extended to map objects to the different target areas of the environment. This mapping should characterize the likelihood of an object being on a given piece of furniture or in a given area.
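A minimal sketch of the kind of object-to-area mapping mentioned above is shown below, assuming hand-specified prior likelihoods that are combined with an MLU feasibility score per area; the object names, area names, and numbers are illustrative only.

```python
# Minimal sketch (illustrative numbers): combine a prior over where an object
# is usually placed with the MLU feasibility score for each candidate area.
PLACEMENT_PRIOR = {
    "cup":    {"kitchen_shelf": 0.6, "dining_table": 0.3, "trash_bin": 0.1},
    "cereal": {"kitchen_shelf": 0.7, "dining_table": 0.2, "trash_bin": 0.1},
}

def rank_target_areas(obj, feasibility):
    """Rank areas by prior * feasibility, where `feasibility` maps
    area name -> MLU feasibility score in [0, 1]."""
    prior = PLACEMENT_PRIOR.get(obj, {})
    scores = {area: prior.get(area, 0.0) * feas
              for area, feas in feasibility.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: the shelf is the likely place but is cluttered, so its feasibility is low.
print(rank_target_areas("cereal",
                        {"kitchen_shelf": 0.2,
                         "dining_table": 0.9,
                         "trash_bin": 0.8}))
```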

Sentence Generation

In Human Navigation, a human referee is asked to retrieve objects. The main focus is on NLG to describe actions, target objects, and target locations. As described in the Scientific Contributions section, we have already built a language generation method using active learning. For WRS, we will also build two more sentence generation methods. The first one is a simple rule-based generator, shown in the above right-hand figure and sketched below. The second one generates linguistic descriptions from multimodal inputs using an encoder-decoder network with a bi-directional LSTM and a convolutional neural network (CNN). To train the models for command generation, we are building a large-scale corpus; its main cost lies in multiple annotators working for half a year. We will compare the above three methods and use the best one in the competition.
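The rule-based generator can be as simple as filling slot templates from the task description; the sketch below illustrates that style (the templates and slot names are our own illustration, not the competition's).

```python
# Minimal sketch of rule-based sentence generation for Human Navigation.
# The templates and slot names are illustrative assumptions.
import random

TEMPLATES = [
    "Please take the {obj} on the {source} and put it on the {destination}.",
    "Go to the {source}, grab the {obj}, and carry it to the {destination}.",
]

def generate_instruction(obj, source, destination):
    """Fill a randomly chosen template with the task slots."""
    template = random.choice(TEMPLATES)
    return template.format(obj=obj, source=source, destination=destination)

print(generate_instruction("blue cup", "kitchen table", "white shelf"))
```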

Object Detection

For object detection and classification, we use You Only Look Once (YOLO) for real-time object detection. The system segments images into regions and predicts bounding boxes and object probabilities with a single CNN. By training YOLO on the specific objects used in our work, we obtain the results illustrated in the figure below. Currently, in our architecture, YOLO runs at 25 fps on a laptop PC. We plan to train the YOLO model on the objects in SIGVerse specified in the rulebook.

Left: YOLO detection and categorization of objects. Training with objects in SIGVerse is ongoing. Right: Pose estimation result for a pointing gesture.
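For reference, the sketch below shows one way a Darknet-trained YOLO model can be run, here through OpenCV's DNN module; the configuration and weight file names, input size, and thresholds are assumptions, and our actual pipeline may load YOLO differently.

```python
# Minimal sketch of running a Darknet-trained YOLO model through OpenCV's DNN
# module. File names, input size, and thresholds are assumptions.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolo.weights")  # hypothetical files
out_names = net.getUnconnectedOutLayersNames()

def detect(image, conf_threshold=0.5, nms_threshold=0.4):
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences, class_ids = [], [], []
    for output in net.forward(out_names):
        for row in output:
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence < conf_threshold:
                continue
            # YOLO outputs box centers and sizes relative to the image size.
            cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return [(class_ids[i], confidences[i], boxes[i]) for i in np.array(keep).flatten()]

if __name__ == "__main__":
    frame = cv2.imread("scene.png")  # hypothetical input image
    for class_id, conf, (x, y, bw, bh) in detect(frame):
        print(class_id, round(conf, 2), (x, y, bw, bh))
```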

Pose Estimation

The main focus of the Interactive Clean Up task is reference resolution using pose estimation. For example, the robot has to disambiguate ``put that there'' in 3D. To detect the avatar's pointing gesture, we need to know the pointing direction. Based on the RGB camera image, we estimate the avatar's pose using OpenPose. The above right-hand figure shows the pose estimation result for the avatar.

The pointed object can be estimated by combining OpenPose and YOLO: this module predicts the likelihood of each detected object lying in the pointed direction in 2D space. Although disambiguation in 2D might be sufficient, we are also planning to extend this to 3D space.
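A minimal sketch of the 2D combination under simple assumptions: the pointing direction is taken as the ray from the elbow keypoint through the wrist keypoint returned by OpenPose, and each YOLO bounding box is scored by how well the direction from the wrist to its center aligns with that ray. The keypoint choice and the softmax-style scoring are illustrative, not our exact module.

```python
# Minimal sketch: score YOLO detections by how well they lie along the 2D
# pointing ray defined by the elbow and wrist keypoints from OpenPose.
# The keypoint choice and the softmax-style scoring are illustrative assumptions.
import numpy as np

def pointing_scores(elbow, wrist, box_centers):
    """Return a likelihood for each detected object being the pointed one.

    elbow, wrist: (x, y) pixel coordinates of the pointing arm keypoints.
    box_centers: array of shape (N, 2) with YOLO bounding-box centers.
    """
    elbow, wrist = np.asarray(elbow, float), np.asarray(wrist, float)
    ray = wrist - elbow
    ray /= np.linalg.norm(ray)
    to_objects = np.asarray(box_centers, float) - wrist
    to_objects /= np.linalg.norm(to_objects, axis=1, keepdims=True)
    # Cosine of the angle between the pointing ray and each object direction.
    cos_angle = to_objects @ ray
    # Turn alignment into a normalized likelihood (sharper for better alignment).
    logits = 10.0 * cos_angle
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Example with three detected objects; the second lies closest to the ray.
print(pointing_scores(elbow=(320, 240), wrist=(360, 250),
                      box_centers=[(500, 150), (520, 290), (100, 300)]))
```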


Motion

The motion and navigation systems are directly based on standard path planning and navigation stacks in ROS.
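For reference, navigation goals can be sent to the standard ROS stack through the move_base action interface; the sketch below is a minimal rospy client, where the `map` frame name and the example goal pose are assumptions about our configuration.

```python
#!/usr/bin/env python
# Minimal sketch of sending a navigation goal to the standard ROS move_base
# action server. The frame name and the example pose are assumptions.
import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def send_goal(x, y, orientation_w=1.0):
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = orientation_w  # identity orientation by default

    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state()

if __name__ == "__main__":
    rospy.init_node("navigation_goal_example")
    state = send_goal(1.0, 0.5)   # example goal in the map frame
    rospy.loginfo("move_base finished with state %d", state)
```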