Robot Sound Interpretation

People interpret sounds they hear and interact with the world according to their interpretations. When a human maps a sound to a meaning, the brain forms a conceptual representation of that sound. Can we mimic how humans interpret sound using AI-based methods? We explore whether sound commands can be interpreted directly by robots for vision-based decision making, rather than being transcribed into text and symbols. State-of-the-art systems use automatic speech recognition (ASR) to translate sound to text and then use language models to process the text. In contrast, our model actively learns and builds its own numerical interpretation of sounds. We call this process Robot Sound Interpretation (RSI).

The figure above shows the difference between voice-controlled robots and RSI.

Scenarios

Given a single-word sound command, the Kuka (an arm robot) needs to move its gripper tip to the area right above the red block corresponding to the command, and the TurtleBot (a mobile robot) needs to explore the arena and approach the corresponding object among four objects, using images from a single RGB camera and robot state information.

Problem Formulation

We model this interaction as a Markov Decision Process (MDP). At each time step t, the state x_t = {S, I_t, M_t} of the agent consists of three parts: a one-time sound feature S representing the command, an image I_t from its camera, and a robot state vector M_t, which includes information such as the manipulator's end-effector location or the locomotor's odometry. The agent then takes an action according to its policy π parameterized by θ. In return, the agent receives a reward and transitions to the next state.

The process continues until t exceeds the maximum episode length T and the next episode starts.
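As a concrete illustration, below is a minimal sketch of how such an environment could expose the state x_t = {S, I_t, M_t} through a Gymnasium-style interface. The class name, observation shapes, action space, and reward placeholder are illustrative assumptions, not the actual simulation used in this work.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class RSIEnvSketch(gym.Env):
    """Hypothetical RSI environment exposing the state x_t = {S, I_t, M_t}."""

    def __init__(self, max_episode_len=200):
        super().__init__()
        self.T = max_episode_len  # maximum episode length
        self.observation_space = spaces.Dict({
            "sound": spaces.Box(-np.inf, np.inf, shape=(40, 100)),          # sound feature S (e.g. spectrogram frames; assumed shape)
            "image": spaces.Box(0, 255, shape=(96, 96, 3), dtype=np.uint8),  # camera image I_t (assumed resolution)
            "robot": spaces.Box(-np.inf, np.inf, shape=(7,)),                # robot state vector M_t (assumed dimension)
        })
        self.action_space = spaces.Box(-1.0, 1.0, shape=(3,))               # e.g. end-effector or wheel commands (assumed)
        self.t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self._get_obs(), {}

    def step(self, action):
        self.t += 1
        reward = self._compute_reward(action)   # task-specific reward (placeholder)
        truncated = self.t >= self.T            # episode ends when t exceeds T
        return self._get_obs(), reward, False, truncated, {}

    def _get_obs(self):
        # Placeholder observations; a real environment would render the camera
        # image and read the robot state from the simulator.
        return {k: np.zeros(s.shape, dtype=s.dtype)
                for k, s in self.observation_space.spaces.items()}

    def _compute_reward(self, action):
        return 0.0
```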

Network Architecture

The network has three components: a sound encoder (blue), a visual-motor integrator (yellow and green), and a policy learner (purple).

The sound encoder accepts sound features and outputs a vector which we call the "concept vector". The information contained in the concept vector is the agent's own interpretation of the sound and is learned through its interaction with the world. The concept vector mimics the conceptual representation formed in a human's brain when mapping sound to meaning. The sound feature S is encoded only at the first timestep of an episode, and the generated concept vector is cached and reused for the remaining timesteps of the episode (a minimal caching sketch follows the list below). There are two reasons for doing so:

1. Even though the sound is transient, the concept can persist;

2. Caching the vector helps the model achieve real-time performance, since the computation-heavy biLSTM and attention layers do not run after the first timestep.
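The sketch below illustrates one way such an encoder and its per-episode cache could be implemented. It assumes a PyTorch biLSTM with a simple frame-level attention; the layer sizes, class name, and caching mechanism are our own assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class SoundEncoderSketch(nn.Module):
    """Hypothetical sound encoder: biLSTM + attention over sound-feature frames."""

    def __init__(self, feat_dim=40, hidden_dim=128, concept_dim=32):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # one attention score per frame
        self.proj = nn.Linear(2 * hidden_dim, concept_dim)
        self._cached_concept = None                # concept vector cached for the episode

    def forward(self, sound):                      # sound: (batch, frames, feat_dim)
        h, _ = self.bilstm(sound)                  # (batch, frames, 2*hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over frames
        context = (w * h).sum(dim=1)               # weighted sum of frame encodings
        return self.proj(context)                  # the "concept vector"

    def encode_once(self, sound, first_step):
        # Run the heavy biLSTM/attention only at the first timestep of the
        # episode; afterwards, return the cached concept vector.
        if first_step or self._cached_concept is None:
            self._cached_concept = self.forward(sound)
        return self._cached_concept
```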

We add an auxiliary loss, L_s, for feature extraction.

The output from the CNN is shared by two branches. The upper branch, denoted U, is combined with the feature from the robot state vector; it is mainly for visual-motor skills. The LSTM is useful in a partially observable environment. The lower branch merges with the information from the sound encoder. L_o is a multi-label classification loss that predicts which objects are in the view, and L_t is a binary classification loss that predicts whether the target is in the view. L_o and L_t are only needed for the TurtleBot scenario.
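For concreteness, here is a hedged sketch of how the two branches could sit on top of a shared CNN feature. The layer shapes, the placement of the LSTM in the upper branch, and the exact auxiliary heads for L_o and L_t are illustrative assumptions; the real network follows the figure above.

```python
import torch
import torch.nn as nn

class VisualMotorIntegratorSketch(nn.Module):
    """Hypothetical two-branch integrator on top of a shared CNN feature."""

    def __init__(self, cnn_dim=256, robot_dim=7, concept_dim=32,
                 hidden_dim=128, num_objects=4):
        super().__init__()
        self.cnn = nn.Sequential(                   # shared visual backbone (assumed layout)
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(cnn_dim), nn.ReLU(),
        )
        # Upper branch U: visual feature + robot state, followed by an LSTM
        # for partial observability.
        self.robot_fc = nn.Linear(robot_dim, 64)
        self.lstm = nn.LSTM(cnn_dim + 64, hidden_dim, batch_first=True)
        # Lower branch: visual feature merged with the concept vector,
        # with auxiliary heads for L_o and L_t (TurtleBot only).
        self.merge_fc = nn.Linear(cnn_dim + concept_dim, hidden_dim)
        self.obj_head = nn.Linear(hidden_dim, num_objects)  # multi-label: objects in view (L_o)
        self.tgt_head = nn.Linear(hidden_dim, 1)             # binary: target in view (L_t)

    def forward(self, image, robot_state, concept, lstm_state=None):
        v = self.cnn(image)                                              # shared CNN feature
        upper = torch.cat([v, torch.relu(self.robot_fc(robot_state))], dim=-1)
        upper, lstm_state = self.lstm(upper.unsqueeze(1), lstm_state)
        lower = torch.relu(self.merge_fc(torch.cat([v, concept], dim=-1)))
        obj_logits = self.obj_head(lower)                                # for L_o
        tgt_logit = self.tgt_head(lower)                                 # for L_t
        features = torch.cat([upper.squeeze(1), lower], dim=-1)          # fed to the policy learner
        return features, obj_logits, tgt_logit, lstm_state
```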

We use proximal policy optimization (PPO), a model-free, on-policy policy gradient algorithm, for the policy learner. Eight instances of the environment run in parallel for fast and stable learning, and the entire history from 4 of the 8 instances is used for each update.
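As a rough sketch of such a setup, the snippet below trains the hypothetical RSIEnvSketch from earlier with Stable-Baselines3's PPO and 8 parallel environments. This is an off-the-shelf stand-in, not the authors' implementation; in particular, it does not reproduce the detail of updating from the history of only 4 of the 8 instances, and all hyperparameters shown are placeholders.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # 8 environment instances running in parallel.
    vec_env = SubprocVecEnv([lambda: RSIEnvSketch() for _ in range(8)])

    model = PPO(
        "MultiInputPolicy",   # handles Dict observations (sound, image, robot)
        vec_env,
        n_steps=256,          # rollout length per environment before each update (assumed)
        batch_size=512,
        gamma=0.99,
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)
```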

Demo

The video below shows a demo of our work: