Interpretable Latent Spaces for Learning from Demonstration

Yordan Hristov, Alex Lascarides, Subramanian Ramamoorthy

University of Edinburgh

Conference on Robot Learning 2018, Zürich, Switzerland

Abstract: Effective human-robot interaction, such as in robot learning from human demonstration, requires the learning agent to be able to ground abstract concepts (such as those contained within instructions) in a corresponding high-dimensional sensory input stream from the world. Models such as deep neural networks, with high capacity through their large parameter spaces, can be used to compress the high-dimensional sensory data to lower dimensional representations. These low-dimensional representations facilitate symbol grounding, but do not guarantee that the representation will be human-interpretable. We propose a method which utilises the grouping of user-defined symbols and their corresponding sensory observations in order to align the learnt compressed latent representation with the semantic notions contained in the abstract labels. We demonstrate this through experiments with both simulated and real-world object data, showing that such alignment can be achieved in a process of physical symbol grounding.

Code: https://github.com/yordanh/interp_latent_spaces

Paper: https://arxiv.org/abs/1807.06583v2
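
The alignment idea described in the abstract can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the authors' implementation (see the repository linked above for that): it assumes 64x64 grayscale inputs, a 2-D latent space, and one small classifier per latent axis, one for each user-defined label group, so that a β-VAE-style objective is augmented with per-axis classification losses. The names (`LabelAlignedVAE`, `loss_fn`), layer sizes, and loss weights are all illustrative.

```python
# Minimal sketch (not the authors' code): a VAE whose individual latent
# dimensions are encouraged to encode user-defined label groups via
# per-axis auxiliary classifiers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAlignedVAE(nn.Module):
    def __init__(self, latent_dim=2, group_sizes=(3, 2)):
        super().__init__()
        # group_sizes[i] = number of symbols in label group i (illustrative);
        # each group is associated with exactly one latent axis.
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 64 * 64), nn.Sigmoid())
        # One tiny classifier per latent axis / label group.
        self.classifiers = nn.ModuleList(nn.Linear(1, k) for k in group_sizes)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        recon = self.dec(z)
        # Each classifier only sees "its" latent dimension.
        logits = [clf(z[:, i:i + 1]) for i, clf in enumerate(self.classifiers)]
        return recon, mu, logvar, logits

def loss_fn(x, recon, mu, logvar, logits, labels, beta=4.0, gamma=10.0):
    # labels[:, i] holds the symbol index (long) for group i, or -1 if unlabelled.
    rec = F.binary_cross_entropy(recon, x.flatten(1), reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    cls = x.new_zeros(())
    for i, lg in enumerate(logits):
        mask = labels[:, i] >= 0  # only labelled examples contribute
        if mask.any():
            cls = cls + F.cross_entropy(lg[mask], labels[:, i][mask])
    return rec + beta * kld + gamma * cls
```

Unlabelled observations (such as the distractor objects described below) simply contribute nothing to the classification term, which is one plausible way to handle partially labelled demonstrations.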

Modified dSprites Dataset

The controlled generative factors of variation in the modified dSprites dataset (see above) make it suitable for exploring how the two baselines compare to the proposed full model. The dataset contains 1800 images: 72 objects (see above), each rendered with spatial x/y variation in the image. We perform two experiments, experiment 1 and experiment 2, on the same underlying dataset but with different sets of symbols, in order to demonstrate how the user's preference is encoded in the latent space.
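
Because the original dSprites archive exposes its generative factors explicitly, a controlled subset like the one above can be assembled by filtering on them. The sketch below is only an illustration of that filtering step: the filename, factor values, and grid spacing are placeholders, not the exact configuration behind the 1800-image subset, and the colour modification itself is not shown.

```python
# Hedged sketch: subsampling the public dSprites archive by its generative
# factors. The factor values below are placeholders.
import numpy as np

data = np.load("dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz",
               allow_pickle=True, encoding="latin1")
imgs = data["imgs"]                # (737280, 64, 64) binary images
factors = data["latents_classes"]  # columns: color, shape, scale,
                                   # orientation, posX, posY

# Example selection: a couple of scales, a fixed orientation,
# and a sparse grid of x/y positions (placeholder values).
keep = (
    np.isin(factors[:, 2], [0, 3])                  # scale
    & (factors[:, 3] == 0)                          # orientation fixed
    & np.isin(factors[:, 4], np.arange(0, 32, 8))   # posX grid
    & np.isin(factors[:, 5], np.arange(0, 32, 8))   # posY grid
)
subset_imgs = imgs[keep]
subset_factors = factors[keep]
print(subset_imgs.shape)
```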

Real Objects Dataset

In order to demonstrate the application of the framework to real-world human-robot interaction scenarios, a second dataset of objects on a table-top is gathered from a human demonstration. The task the human performs is to separate a set of observed objects by their function (juggling balls vs orbs) and then by their colour (red vs yellow vs blue). Lego blocks and whiteboard pins are also present in the scene, but they are not manipulated and no label information about them is given by the expert.

At test time the agent has to repeat the task with new, previously unobserved objects present in the scene: green objects and a yellow rubber duck. Each object image is augmented, resulting in a dataset of 3000 images of 15 objects (see above) with added spatial variation. With this dataset we perform a third experiment, experiment 3.
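
The augmentation step, expanding 15 object crops to 3000 images by adding spatial variation, could in principle look like the sketch below. The copy count matches the dataset size above, but the jitter range and the use of plain translation are illustrative assumptions, not the actual augmentation pipeline.

```python
# Hedged sketch: expanding a small set of object crops with random spatial
# jitter. 15 source images * 200 augmented copies = 3000 images.
import numpy as np

def random_translate(img, max_shift=8, rng=None):
    """Shift an (H, W, C) image by a random offset, padding with zeros."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Destination and source slices for a shift of (dy, dx).
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def augment(images, copies_per_image=200, max_shift=8, seed=0):
    """Return a stacked array of jittered copies of each input image."""
    rng = np.random.default_rng(seed)
    return np.stack([random_translate(img, max_shift, rng)
                     for img in images
                     for _ in range(copies_per_image)])
```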

Visualisation of the two latent axes, Z0 and Z1, which after the learning process capture the visual variation corresponding to the concept groups of the user-defined labels (one such visualisation per experiment).

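
Such axis visualisations can be reproduced in spirit by decoding a regular grid of latent points, sweeping one dimension at a time. The sketch below assumes a trained model with a 2-D latent space and the decoder interface from the model sketch above (`model.dec`); the grid range and resolution are arbitrary.

```python
# Hedged sketch: visualise what each latent axis (Z0, Z1) encodes by
# decoding a grid of latent points with a trained LabelAlignedVAE.
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_latent_traversal(model, z_range=3.0, steps=8):
    zs = torch.linspace(-z_range, z_range, steps).tolist()
    fig, axes = plt.subplots(steps, steps, figsize=(8, 8))
    for i, z0 in enumerate(zs):        # rows sweep Z0
        for j, z1 in enumerate(zs):    # columns sweep Z1
            z = torch.tensor([[z0, z1]])
            img = model.dec(z).view(64, 64)
            axes[i, j].imshow(img, cmap="gray")
            axes[i, j].axis("off")
    fig.suptitle("Latent traversal: rows sweep Z0, columns sweep Z1")
    plt.show()
```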