The network has three components: a sound encoder (blue), a visual-motor integrator (yellow and green), and a policy learner (purple).
The sound encoder accepts sound features and outputs a vector we call the “concept vector”. The information contained in the concept vector is the agent’s own interpretation of the sound, learned through its interaction with the world. The concept vector mimics the conceptual representation formed in the human brain when a sound is mapped to a meaning. The sound feature S is encoded only at the first timestep of the episode, and the resulting concept vector is cached and shared across the remaining timesteps of the episode (a sketch of this caching follows below). There are two reasons for doing so:
1. Even though the sound is transient, the concept can persist;
2. Caching the vector helps the model achieve real-time performance, since the computation-heavy biLSTM and attention layers do not run after the first timestep.
We add an auxiliary loss, Ls, to aid feature extraction.
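As a concrete illustration, below is a minimal PyTorch sketch of the encode-once-then-cache behaviour. The layer sizes, the additive attention form, and names such as `sound_dim` and `concept_dim` are assumptions for illustration rather than the paper's exact configuration, and the auxiliary loss Ls is omitted because its exact form is not specified here.

```python
import torch
import torch.nn as nn

class SoundEncoder(nn.Module):
    def __init__(self, sound_dim=40, hidden_dim=128, concept_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(sound_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # additive attention scores
        self.proj = nn.Linear(2 * hidden_dim, concept_dim)
        self._cached = None                        # concept vector cache

    def forward(self, sound=None):
        # After the first timestep, reuse the cached concept vector so the
        # computation-heavy biLSTM and attention layers do not run again.
        if self._cached is not None:
            return self._cached
        h, _ = self.bilstm(sound)                  # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)     # (B, T, 1) attention weights
        context = (w * h).sum(dim=1)               # attention-pooled summary
        self._cached = self.proj(context)          # (B, concept_dim)
        return self._cached

    def reset(self):
        # Call at the start of each episode so the new sound is encoded.
        self._cached = None
```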
The output from the CNN is shared by two branches. The upper branch, denoted U, is combined with the features from the robot state vector; it is mainly responsible for visual-motor skills, and its LSTM helps in a partially observable environment. The lower branch merges with the information from the sound encoder. Lo is a multi-label classification loss that predicts which objects are in the view, and Lt is a binary classification loss that predicts whether the target is in the view. Lo and Lt are only needed in the TurtleBot scenario; a sketch of the two branches is given below.
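The following sketch illustrates how the two branches and the auxiliary losses Lo and Lt could be wired up. The fusion by concatenation, the feature sizes, and names such as `num_objects` are assumptions for illustration; the figure defines the actual layout.

```python
import torch
import torch.nn as nn

class VisualMotorIntegrator(nn.Module):
    def __init__(self, feat_dim=256, state_dim=10, concept_dim=64,
                 num_objects=8, hidden_dim=128):
        super().__init__()
        # Upper branch U: CNN features fused with the robot state vector.
        self.upper = nn.Linear(feat_dim + state_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Lower branch: CNN features fused with the concept vector.
        self.lower = nn.Linear(feat_dim + concept_dim, hidden_dim)
        self.obj_head = nn.Linear(hidden_dim, num_objects)  # Lo: objects in view
        self.tgt_head = nn.Linear(hidden_dim, 1)            # Lt: target in view

    def upper_forward(self, cnn_feat, robot_state, lstm_state=None):
        u = torch.relu(self.upper(torch.cat([cnn_feat, robot_state], dim=-1)))
        # The LSTM carries memory across timesteps for partial observability.
        out, lstm_state = self.lstm(u.unsqueeze(1), lstm_state)
        return out.squeeze(1), lstm_state

    def aux_losses(self, cnn_feat, concept, obj_labels, tgt_label):
        z = torch.relu(self.lower(torch.cat([cnn_feat, concept], dim=-1)))
        # Lo: multi-label classification, one independent sigmoid per object.
        lo = nn.functional.binary_cross_entropy_with_logits(
            self.obj_head(z), obj_labels)
        # Lt: binary classification on a single logit.
        lt = nn.functional.binary_cross_entropy_with_logits(
            self.tgt_head(z).squeeze(-1), tgt_label)
        return lo, lt
```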
We use proximal policy optimization (PPO), a model-free, on-policy policy gradient algorithm, for the policy learner. Eight instances of the environment run in parallel for fast and stable learning, and the entire history from 4 of the 8 instances is used for each update.
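For concreteness, here is a hedged sketch of the PPO clipped objective and the update scheme. The clipping value of 0.2 is the common default rather than a confirmed setting, and the random choice of 4 instances is an assumption; the text does not specify how the 4 are selected.

```python
import random
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Standard PPO clipped surrogate objective (Schulman et al., 2017);
    # clip_eps=0.2 is the common default, not a confirmed setting here.
    ratio = torch.exp(logp_new - logp_old)
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr_unclipped, surr_clipped).mean()

def select_update_batch(trajectories):
    # 8 parallel instances collect experience; each update consumes the
    # entire history of 4 of them. Random selection is an assumption;
    # the text does not say how the 4 instances are chosen.
    assert len(trajectories) == 8
    return random.sample(trajectories, 4)
```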