Interpretable Latent Spaces for Learning from Demonstration
Yordan Hristov, Alex Lascarides, Subramanian Ramamoorthy
University of Edinburgh
Conference on Robot Learning 2018, Zürich, Switzerland
Abstract: Effective human-robot interaction, such as in robot learning from human demonstration, requires the learning agent to be able to ground abstract concepts (such as those contained within instructions) in a corresponding high-dimensional sensory input stream from the world. Models such as deep neural networks, with high capacity through their large parameter spaces, can be used to compress the high-dimensional sensory data to lower-dimensional representations. These low-dimensional representations facilitate symbol grounding, but there is no guarantee that the representation will be human-interpretable. We propose a method which utilises the grouping of user-defined symbols and their corresponding sensory observations in order to align the learnt compressed latent representation with the semantic notions contained in the abstract labels. We demonstrate this through experiments with both simulated and real-world object data, showing that such alignment can be achieved in a process of physical symbol grounding.
Modified dSprites Dataset
The controlled data-generative factors of variation of the modified dSprites dataset (see above) make it suitable for exploring how the two baselines compare to the proposed full model. The dataset contains 1800 images: 72 objects (see above), each with spatial x/y variations in the image. We perform two experiments on the same underlying dataset - experiment 1 and experiment 2 - with different sets of symbols, in order to demonstrate how the user's preference is encoded in the latent space.
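The stated counts imply 25 spatial variants per object (72 × 25 = 1800). A minimal sketch of how such spatial variation could be generated is shown below; the 5×5 grid of positions, the 64×64 canvas, and the function name `spatial_variants` are assumptions for illustration, not the paper's actual generation code.

```python
import numpy as np

def spatial_variants(sprite, canvas_size=64, grid=5):
    """Place one sprite at a grid x grid lattice of x/y offsets.

    Hypothetical reconstruction: 72 objects x 25 (5x5) positions
    matches the stated 1800 images; the actual offsets used in the
    dataset are not specified in the text.
    """
    h, w = sprite.shape
    # Evenly spaced top-left corners that keep the sprite in frame.
    offsets = np.linspace(0, canvas_size - max(h, w), grid).astype(int)
    images = []
    for y in offsets:
        for x in offsets:
            canvas = np.zeros((canvas_size, canvas_size), dtype=sprite.dtype)
            canvas[y:y + h, x:x + w] = sprite
            images.append(canvas)
    return np.stack(images)

# One 16x16 dummy sprite -> 25 positional variants on a 64x64 canvas.
sprite = np.ones((16, 16), dtype=np.float32)
variants = spatial_variants(sprite)
print(variants.shape)       # (25, 64, 64)
print(72 * len(variants))   # 1800
```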
Real Objects Dataset
In order to demonstrate the application of the framework to real-world human-robot interaction scenarios, a second dataset of objects on a table-top is gathered from a human demonstration. The task the human performs is to separate a set of observed objects by their function - juggling balls vs orbs - and then by their color - red vs yellow vs blue. Lego blocks and whiteboard pins are also present in the scene, but they are not manipulated and no label information about them is given by the expert.
At test time the agent has to repeat the task, with new, previously unobserved objects present in the scene - green objects and a yellow rubber duck. Each object image is augmented, resulting in a dataset of 3000 images of 15 objects (see above) with added spatial variation. With this dataset we perform a third experiment - experiment 3.
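The counts imply 200 augmented views per object (15 × 200 = 3000). A minimal sketch of spatial augmentation under that assumption follows; the random-shift strategy, the shift range, and the function name `augment` are illustrative choices, since the text does not specify the augmentation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_views=200, max_shift=8):
    """Generate n_views randomly shifted copies of an object image.

    Illustrative only: 15 objects x 200 views reproduces the stated
    3000-image dataset; the actual augmentation used is an assumption.
    """
    views = []
    for _ in range(n_views):
        # Random circular shift in y and x adds spatial variation.
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        views.append(np.roll(image, (dy, dx), axis=(0, 1)))
    return np.stack(views)

# One dummy 64x64 RGB object crop -> 200 shifted views.
crop = rng.random((64, 64, 3))
views = augment(crop)
print(views.shape)       # (200, 64, 64, 3)
print(15 * len(views))   # 3000
```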