A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
Peixin Chang, Shuijing Liu, Tianchen Ji, Neeloy Chakraborty,
Kaiwen Hong, and Katherine Driggs-Campbell
University of Illinois, Urbana-Champaign
Published in Conference on Robot Learning (CoRL), 2023
Abstract
A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, rather than from engineers. Previous methods are either difficult to improve continuously after deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can continually self-improve in previously unseen scenarios using fewer new labeled data, while still achieving better performance than previous methods.
Motivation
Learning-based language grounding agents have been proposed to perform tasks according to visual observations and text/speech instructions. However, these approaches often fail to solve a common problem in learning-based methods: performance degradation in a novel target domain, such as the real world. Fine-tuning the models in the real world is often expensive due to the requirement of expertise, extra equipment, and large amounts of labeled data, none of which can be easily provided by non-expert users in everyday environments. Without domain expertise or abundant labeled data, how can we allow users to customize such robots to their domains with minimal supervision?
Our approach has two stages. In the first stage, we learn a joint Visual and Audio Representation that is Data-efficient and can be Intuitively Fine-tuned (Dif-VAR). In the second stage, we use the representation to compute intrinsic reward functions to learn various robot skills with reinforcement learning (RL), without any reward engineering. When the robot is deployed in a new domain, the fine-tuning stage is data-efficient in terms of label usage and is natural for non-experts, since users only need to provide a relatively small number of images and their corresponding sounds.
Dif-VAR
The Dif-VAR is a two-branch network optimized with the supervised contrastive loss (SupCon). The latent space of the Dif-VAR is a unit hypersphere, such that the embeddings of images and audio clips of the same intent are closer to each other than to those of other intents. We first collect visual-audio pairs of the form (I, S, y), where I is the current RGB image from the robot's camera, S is the sound command, and y is the intent ID. Then, we encode both the auditory and visual modalities into a joint latent space using the SupCon loss.
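The training step above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the `Encoder` stand-ins, layer sizes, and input dimensions are all assumptions, and only the SupCon objective over a joint image/sound batch reflects the described method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in branch mapping an input vector to a unit-norm embedding."""
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        # Project onto the unit hypersphere, as in the Dif-VAR latent space.
        return F.normalize(self.net(x), dim=-1)

def supcon_loss(embs, labels, temperature=0.1):
    """Supervised contrastive loss: embeddings sharing an intent ID are
    pulled together; all others are pushed apart."""
    sim = embs @ embs.T / temperature                       # pairwise cosine similarity
    self_mask = torch.eye(embs.size(0), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float('-inf'))         # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Average log-probability of the positives for each anchor.
    log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    return (-log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

# One training step on a toy batch of (image, sound, intent) triples.
f_I, f_S = Encoder(in_dim=128), Encoder(in_dim=40)
images, sounds = torch.randn(8, 128), torch.randn(8, 40)
intents = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
embs = torch.cat([f_I(images), f_S(sounds)], dim=0)   # joint latent space
labels = torch.cat([intents, intents])                # same intent ID across modalities
loss = supcon_loss(embs, labels)
loss.backward()
```

Because both branches share one latent space and one label set, an image and a sound command with the same intent ID act as positives for each other, which is what lets the dot product later serve as a reward.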
Reinforcement learning with Dif-VAR
We model a robotic task as a Markov Decision Process. Given the current image I_t, the sound command S_g, and the robot state M_t, the state is x_t = [I_t, f_I(I_t), f_S(S_g), M_t], and the reward is the similarity between the current image and the sound command: r_t = f_I(I_t) · f_S(S_g).
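A minimal sketch of the reward computation, using stand-in unit-norm vectors in place of the trained encoder outputs (the embeddings and their dimension here are illustrative). Since all embeddings lie on the unit hypersphere, the dot product is the cosine similarity, so an image that matches the commanded intent yields a reward near 1.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def reward(img_emb, cmd_emb):
    """r_t = f_I(I_t) . f_S(S_g): similarity of the current image and the goal command."""
    return float(np.dot(img_emb, cmd_emb))

rng = np.random.default_rng(0)
img = normalize(rng.normal(size=32))                     # stand-in for f_I(I_t)
cmd_match = normalize(img + 0.1 * rng.normal(size=32))   # command matching the scene
off = rng.normal(size=32)
cmd_other = normalize(off - np.dot(off, img) * img)      # unrelated command (orthogonal)

# reward(img, cmd_match) is close to 1; reward(img, cmd_other) is close to 0.
```

As the robot's observation approaches the goal described by the command, the two embeddings align and the reward increases, so no hand-crafted reward is needed.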
We propose a hierarchical system that contains multiple policy networks, each designed to fulfill a subset of intents in an environment. The Dif-VAR selects a policy by measuring the similarity between the sound command and the cluster centroid of each intent. Each policy network is trained with PPO.
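The policy-selection step can be sketched as follows. This is a toy illustration under assumed names (`centroids`, `select_policy`): the centroids here are a trivial identity basis in a 3-D latent space, whereas in the real system they would be the per-intent cluster centroids of the Dif-VAR embeddings.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def select_policy(cmd_emb, centroids):
    """Return the index of the intent whose cluster centroid is most similar
    to the command embedding (all vectors unit-norm, so dot = cosine)."""
    sims = centroids @ cmd_emb
    return int(np.argmax(sims))

# Toy setup: three intent centroids and a command lying closest to intent 2.
centroids = np.eye(3)
cmd = normalize(np.array([0.10, 0.05, 0.99]))
policy_id = select_policy(cmd, centroids)   # dispatch to the corresponding policy
```

The selected index would then route execution to the PPO-trained policy responsible for that intent cluster.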
Fine-tuning
After deployment, non-experts collect visual-audio pairs by
taking photos to get I
saying the intent to get S
The Dif-VAR is then updated with the new pairs.
With the updated reward,
the agent samples a collected sound command as the goal and self-improves its skill
no reward engineering or state estimation is required
only an RGB camera and a microphone are needed
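The deployment-time loop above can be sketched as the control flow below. Every name here is a placeholder stub standing in for the real components (the Dif-VAR update and the PPO update); only the ordering of the steps is meant to be illustrative.

```python
def self_improve(new_pairs, update_representation, sample_goal, rl_update, n_episodes=2):
    """Fine-tune the representation on user-collected pairs, then let the agent
    practice: each episode samples a collected sound command as the goal and
    uses the updated representation as the intrinsic reward."""
    reward_fn = update_representation(new_pairs)   # re-fit Dif-VAR on the new pairs
    for _ in range(n_episodes):
        goal = sample_goal(new_pairs)              # a collected command becomes the goal
        rl_update(goal, reward_fn)                 # e.g. a PPO step with r_t = f_I(I_t).f_S(goal)

# Toy stubs that record the call order; a real system would plug in trained models.
log = []
pairs = [("img_a", "snd_a", 0), ("img_b", "snd_b", 1)]
self_improve(
    pairs,
    update_representation=lambda d: log.append("update") or (lambda img, snd: 0.0),
    sample_goal=lambda d: d[0][1],
    rl_update=lambda goal, r: log.append(("rl", goal)),
)
```

Note that the representation is updated once per batch of new pairs, before any policy updates, so the agent always improves against the freshest reward.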
Common questions
What are the benefits of using audio commands?
We list some of the benefits of using audio commands for robotic tasks.
Simplified fine-tuning. Our method does NOT assume that a sound-to-text module is perfect. In practice, the transcribed text from ASR and similar sound recognition modules is not always perfect and can be erroneous. Fine-tuning the ASR and the subsequent intent-understanding module requires expertise and ground-truth transcriptions, which are difficult for untrained non-expert users to provide. In contrast, by folding the sound processing unit into the representation, Dif-VAR allows end-to-end, visually guided updates of the sound recognition using the user's own voice and knowledge, without the need for transcriptions.
Reduced intermediate errors and better performance. Our choice of skipping sound-to-text transcription is inspired by recent research in end-to-end spoken language understanding (E2E-SLU). An E2E-SLU system extracts user intent directly from speech without intermediate text. Such systems outperform the ASR+NLU pipeline, which is consistent with our results for the ANR baseline. This work aims to introduce this end-to-end intent understanding to the robotics community.
Wider applications. In our work, voice commands are not limited to speech or language. Our work has the potential to be applied to, but is not limited to, the following settings:
Understanding environmental sounds, such as alarm clocks and dog barking.
Interpreting emotions and background noise in speech that may indicate a command.
Safety and privacy: the robot responds only to the voice of a specific person.
Multilingual understanding: Dif-VAR learns the underlying meaning and association between images and audio, so no translation is needed.
Does Dif-VAR use any pre-trained foundation models in the experiments?
In our experiments, we did not use any pre-trained large models for Dif-VAR. This is intentional: it shows that the effectiveness of our method does not come from pre-trained foundation models or specific network architectures. In practice, however, the image branch and the sound branch of the Dif-VAR can be replaced with appropriate large pre-trained models.
Broader impact
We outline how our work relates to several challenging problems in the research community.
Domain Adaptation/Sim2real: We introduce a voice-guided reward function that could minimize the impact of domain shift after the deployment of a robotic system.
RL reward engineering: Dif-VAR offers a way to learn a reward function that reflects the success and failure of an agent without any reward engineering, which is a major pain point for RL, especially real-world RL. In addition, the reward function can be fine-tuned efficiently and intuitively in the real world.
Continual/Life-Long Learning: Dif-VAR is designed to keep improving the quality of the reward function after it receives more data from the end-users, which leads to higher performance of the downstream RL agent.
Human-in-the-loop Learning: Our pipeline aims to train an accurate reward with minimum cost by integrating human knowledge and experience while requiring minimal domain expertise. Thus, our method has the potential to be improved by the large-scale data collected from the general public.
End-to-end Spoken Language Understanding: Our work introduces the idea of end-to-end spoken language understanding to the robotics community, which encourages the cooperation of robotics and speech communities.
Multi-modal Learning: Our method can be applied to other modalities and provide a reward function for other goal-based multi-modal robot tasks.