A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots

Peixin Chang, Shuijing Liu, Tianchen Ji, Neeloy Chakraborty,

  Kaiwen Hong, and Katherine Driggs-Campbell

Human-Centered Autonomy Lab

University of Illinois, Urbana-Champaign


Published in Conference on Robot Learning (CoRL), 2023

[Paper]    [Code]    [Slides]    [Poster]

Abstract

A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios given fewer new labeled data, while still achieving better performance than previous methods.

Motivation

Learning-based language-grounding agents have been proposed to perform tasks according to visual observations and text or speech instructions. However, these approaches often fail to fully address a common problem of learning-based methods: performance degradation in a novel target domain, such as the real world. Fine-tuning the models in the real world is often expensive because it requires expertise, extra equipment, and large amounts of labeled data, none of which non-expert users in everyday environments can easily provide. Without enough domain expertise or abundant labeled data, how can we allow users to customize such robots to their domains with minimal supervision?

We first learn a joint Visual and Audio Representation that is Data-efficient and can be Intuitively Fine-tuned (Dif-VAR). In the second stage, we use the representation to compute intrinsic reward functions and learn various robot skills with reinforcement learning (RL), without any reward engineering. When the robot is deployed in a new domain, the fine-tuning stage is data-efficient in terms of label usage and natural for non-experts, since users only need to provide a relatively small number of images and their corresponding sounds.

Dif-VAR

The Dif-VAR is a two-branch network optimized with the supervised contrastive (SupCon) loss. The latent space of the Dif-VAR is a unit hypersphere, on which embeddings of images and sounds that share an intent lie closer to each other than to embeddings of other intents. We first collect visual-audio pairs of the form (I, S, y), where I is the current RGB image from the robot's camera, S is the sound command, and y is the intent ID. We then encode both the auditory and visual modalities into the joint latent space using the SupCon loss.
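As a rough illustration, the sketch below shows how the two-branch encoder and the SupCon loss over (I, S, y) pairs might be written in PyTorch. The module names, embedding dimension, and temperature are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifVAR(nn.Module):
    """Two-branch network: an image branch f_I and a sound branch f_S that map
    their inputs onto a shared unit hypersphere (module names are hypothetical)."""
    def __init__(self, image_encoder, sound_encoder):
        super().__init__()
        self.f_I = image_encoder   # maps an RGB image to an embed_dim vector
        self.f_S = sound_encoder   # maps a sound command to an embed_dim vector

    def encode_image(self, images):
        return F.normalize(self.f_I(images), dim=-1)   # project onto the unit hypersphere

    def encode_sound(self, sounds):
        return F.normalize(self.f_S(sounds), dim=-1)


def supcon_loss(z_img, z_snd, intent_ids, temperature=0.1):
    """Supervised contrastive loss over the joint batch of image and sound
    embeddings; samples that share an intent ID are treated as positives."""
    z = torch.cat([z_img, z_snd], dim=0)                 # (2N, D), unit-norm rows
    y = torch.cat([intent_ids, intent_ids], dim=0)       # (2N,)
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)               # exclude self-similarity terms
    pos = ((y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```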

Reinforcement learning with Dif-VAR

We model a robotic task as a Markov Decision Process. Given the current image $I_t$, the sound command $S_g$, and the robot state $M_t$, the state is $x_t = [I_t, f_I(I_t), f_S(S_g), M_t]$, and the reward is the similarity between the current image and the sound command, $r_t = f_I(I_t) \cdot f_S(S_g)$.
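A minimal sketch of how the state and intrinsic reward could be assembled from the Dif-VAR embeddings at one timestep (function and variable names are hypothetical):

```python
import torch

def build_state_and_reward(dif_var, image_t, sound_g, robot_state_t):
    """Form x_t = [I_t, f_I(I_t), f_S(S_g), M_t] and the intrinsic reward
    r_t = f_I(I_t) . f_S(S_g), a cosine similarity since both are unit-norm."""
    with torch.no_grad():
        z_img = dif_var.encode_image(image_t.unsqueeze(0)).squeeze(0)  # f_I(I_t)
        z_snd = dif_var.encode_sound(sound_g.unsqueeze(0)).squeeze(0)  # f_S(S_g)
    x_t = (image_t, z_img, z_snd, robot_state_t)   # observation tuple for the policy
    r_t = torch.dot(z_img, z_snd).item()           # similarity-based intrinsic reward
    return x_t, r_t
```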

We propose a hierarchical system that contains multiple policy networks, each designed to fulfill a subset of intents in an environment. The Dif-VAR selects a policy by measuring the similarity between the sound command and the cluster centroid of each intent. We use PPO to train each policy network.
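A sketch of the policy-selection step, assuming the per-intent cluster centroids of sound embeddings have been precomputed and unit-normalized (names are hypothetical):

```python
import torch

def select_policy(dif_var, sound_g, intent_centroids, policies):
    """Route a sound command to the policy whose intent centroid is most
    similar to the command embedding.

    intent_centroids: (K, D) unit-norm centroids, one per intent.
    policies:         list of K trained policy networks.
    """
    z_snd = dif_var.encode_sound(sound_g.unsqueeze(0)).squeeze(0)  # f_S(S_g)
    sims = intent_centroids @ z_snd        # cosine similarity to each intent centroid
    intent_id = int(torch.argmax(sims))    # most similar intent
    return policies[intent_id], intent_id
```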

Fine-tuning

After deployment, non-experts fine-tune the system in three steps (a code sketch follows this list):

1. Non-experts collect new visual-audio pairs by taking images in the target domain and recording the corresponding sound commands.
2. The Dif-VAR is updated with the new pairs using the same contrastive objective.
3. From the updated reward, the policies are fine-tuned with RL in the new domain.
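A sketch of the representation update with the user-collected pairs, reusing the SupCon loss from the earlier sketch. The optimizer, learning rate, batch size, and epoch count are placeholder choices, and the sounds are assumed to be batched into fixed-size tensors (e.g., spectrograms).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def finetune_dif_var(dif_var, new_images, new_sounds, new_intents,
                     epochs=10, lr=1e-4, batch_size=32):
    """Update the representation with the small set of user-collected
    (image, sound, intent) pairs using the same SupCon objective."""
    loader = DataLoader(TensorDataset(new_images, new_sounds, new_intents),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(dif_var.parameters(), lr=lr)
    for _ in range(epochs):
        for imgs, snds, ys in loader:
            z_img = dif_var.encode_image(imgs)
            z_snd = dif_var.encode_sound(snds)
            loss = supcon_loss(z_img, z_snd, ys)   # SupCon loss from the sketch above
            opt.zero_grad()
            loss.backward()
            opt.step()
    return dif_var
```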

Common questions

What are the benefits of using audio commands?

We list some of the benefits of using audio commands for robotic tasks.


Does Dif-VAR use any pre-trained foundation models in the experiments?

In our experiments, we did not use any pre-trained large models for Dif-VAR. This choice was intentional: it shows that the effectiveness of our method does not come from pre-trained foundation models or specific network architectures. In practice, however, the image branch and the sound branch of the Dif-VAR can be replaced with appropriate large pre-trained models, as sketched below.
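For illustration only, here is one possible way to swap in pretrained backbones, reusing the DifVAR wrapper from the earlier sketch. The specific models (ResNet-18, wav2vec 2.0), pooling, and dimensions are our assumptions, not choices made in the paper.

```python
import torch.nn as nn
import torchvision
import torchaudio

EMBED_DIM = 128  # shared latent dimension (illustrative)

# Image branch: ImageNet-pretrained ResNet-18 with a linear projection head.
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
resnet.fc = nn.Identity()                       # expose the 512-d backbone features
image_encoder = nn.Sequential(resnet, nn.Linear(512, EMBED_DIM))

# Sound branch: pretrained wav2vec 2.0 with temporal pooling and a projection head.
class Wav2Vec2Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        self.proj = nn.Linear(768, EMBED_DIM)

    def forward(self, waveforms):               # (batch, samples) at 16 kHz
        feats, _ = self.backbone(waveforms)     # (batch, frames, 768)
        return self.proj(feats.mean(dim=1))     # average over time, then project

dif_var = DifVAR(image_encoder, Wav2Vec2Encoder())
```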


Broader impact

We discuss the relevance of our work and some challenging open problems for the related communities.


Demo