Learning Visual-Audio Representations for Voice-Controlled Robots

Peixin Chang, Shuijing Liu, D. Livingston McPherson, and Katherine Driggs-Campbell

Human-Centered Autonomy Lab

University of Illinois, Urbana-Champaign


Published at IEEE International Conference on Robotics and Automation (ICRA), 2023

[Paper]    [Code]    [Slides]    [Poster]

Abstract

Based on recent advancements in representation learning, we propose a novel pipeline for task-oriented voice-controlled robots with raw sensor inputs. Previous methods rely on a large number of labels and task-specific reward functions. Not only is such an approach difficult to improve after deployment, but it also generalizes poorly across robotic platforms and tasks. To address these problems, our pipeline first learns a visual-audio representation (VAR) that associates images and sound commands. The robot then learns to fulfill the sound command via reinforcement learning, using the reward generated by the VAR. We demonstrate our approach with various sound types, robots, and tasks. We show that our method outperforms previous work with far fewer labels. We show in both simulated and real-world experiments that the system can self-improve in previously unseen scenarios given a reasonable number of newly labeled data.

Motivation

While previous works have made noticeable progress, fine-tuning these methods is cost-prohibitive for non-experts. Ideally, when an intelligent voice-controlled robot encounters unseen scenarios such as new speakers or new room layouts, it should be customizable, continually improving its interpretation of language and its skills through interaction with non-experts in daily life. In practice, however, model performance degrades under domain shift, and it is usually impractical for non-experts to improve the system themselves. We propose the visual-audio representation (VAR), the first representation that unifies automatic speech recognition (ASR), natural language understanding (NLU), and the grounding module.

Visual-audio representation

The VAR is a three-branch Siamese network optimized with a triplet loss. The latent space of the VAR is a unit hypersphere on which the embeddings of images and sound commands with the same intent lie closer to each other than to those of other intents. We first collect visual-audio triplets of the form (I, S+, S-), where I is the current RGB image from the robot's camera, S+ is the positive sound command, and S- is the negative sound command. We then encode both the auditory and visual modalities into a joint latent space using the triplet loss, as sketched below.
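For concreteness, here is a minimal PyTorch sketch of this triplet training, assuming simple convolutional encoders, spectrogram inputs for sound, and a hypothetical `triplet_loader` that yields (I, S+, S-) batches; the actual VAR architecture and training details in the paper may differ.

```python
# Minimal sketch of VAR triplet training (illustrative, not the paper's exact networks).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Maps an RGB image I to a unit-norm embedding f_I(I)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, img):
        # Normalize so embeddings live on the unit hypersphere.
        return F.normalize(self.backbone(img), dim=-1)

class AudioEncoder(nn.Module):
    """Maps a sound-command spectrogram S to a unit-norm embedding f_S(S)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, spec):
        return F.normalize(self.backbone(spec), dim=-1)

f_I, f_S = ImageEncoder(), AudioEncoder()
# Triplet loss pulls f_I(I) toward f_S(S+) and pushes it away from f_S(S-).
triplet_loss = nn.TripletMarginLoss(margin=0.4)
optimizer = torch.optim.Adam(list(f_I.parameters()) + list(f_S.parameters()), lr=1e-4)

for img, pos_cmd, neg_cmd in triplet_loader:  # hypothetical loader of (I, S+, S-)
    anchor = f_I(img)        # image embedding (anchor)
    positive = f_S(pos_cmd)  # matching sound command
    negative = f_S(neg_cmd)  # non-matching sound command
    loss = triplet_loss(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```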


Reinforcement learning with visual-audio representation

We model a robotic task as a Markov Decision Process. Given the current image I_t, the sound command S_g, and the robot state M_t, the state is x_t = [I_t, f_I(I_t), f_S(S_g), M_t], and the reward is the similarity between the current image and the sound command, r_t = f_I(I_t) · f_S(S_g). We use PPO to train the policy network.
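The sketch below illustrates how a frozen VAR could supply the per-step state and reward to the RL loop; the function and variable names are placeholders, and the state is shown as a dictionary rather than the exact concatenation used in the paper.

```python
# Illustrative sketch: frozen VAR encoders generate the RL state x_t and reward r_t.
import torch

@torch.no_grad()
def var_state_and_reward(f_I, f_S, image_t, sound_goal, robot_state_t):
    """Build x_t = [I_t, f_I(I_t), f_S(S_g), M_t] and r_t = f_I(I_t) · f_S(S_g)."""
    img_emb = f_I(image_t.unsqueeze(0)).squeeze(0)     # f_I(I_t), unit norm
    snd_emb = f_S(sound_goal.unsqueeze(0)).squeeze(0)  # f_S(S_g), unit norm
    # Both embeddings lie on the unit hypersphere, so their dot product is the
    # cosine similarity and serves as the dense reward.
    reward = torch.dot(img_emb, snd_emb).item()
    state = {
        "image": image_t,              # raw observation I_t
        "image_emb": img_emb,          # f_I(I_t)
        "sound_emb": snd_emb,          # f_S(S_g)
        "robot_state": robot_state_t,  # M_t (e.g., joint angles)
    }
    return state, reward
```

Because the reward comes from the learned VAR rather than a hand-crafted, task-specific function, the same policy-training recipe carries over to new sound types, robots, and tasks.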

Fine-tuning

Visualization of the VAR

Visualizations of the VAR in the Kuka and the TurtleBot environments. The colors indicate the ground truth labels of sound and image data. The embeddings of images and sound are marked by circles and triangles, respectively. The black stars are the vector embeddings of the 8 images. For the Kuka environment, Block 1 is the leftmost block. The “Empty” class consists of empty sounds and images of the gripper tip above none of the blocks. 

The VAR maps images with ambiguous labels to meaningful locations on the spheres.

Task Execution

Our method works with various types of sound commands (e.g., environmental sound and speech) and robots (e.g., manipulators and mobile robots). It also reduces the sim-to-real gap when we deploy the learned policy on a real Kinova Gen3 robot arm.

Demo

ICRA 2023 Presentation