Supplements Experimental Results in Section 5 of the Paper.
Overview of Experiments
We clarify some of the details around our "language-only" baseline. This baseline, which we refer to as an imitation learning baseline, is learned directly via behavioral cloning from pairs of language instructions and corresponding demonstrations.
Looking at the results in our paper, one might reasonably ask why the performance of the language-only baseline is so low compared to LILA's shared autonomy approach. In our paper, we hinted at an answer – the low sample efficiency of imitation approaches – and even attempted to mitigate this by training our imitation learning models on 2x the demonstrations as LILA, with 3x the data augmentation! This still isn't enough to get good performance, so we explore this sample efficiency hypothesis further here.
We provide additional experiments and technical intuition below. We first present empirical results that:
Show videos of LILA and Imitation Learning in an apples-to-apples comparison, given the same number of demonstrations and data augmentation.
Show the behavior of Imitation Learning as we drastically increase data augmentation (noise-augmenting states during training).
Show the behavior of Imitation Learning as we drastically increase the number of demonstrations seen per task.
We also:
Provide a sketch of a technical argument, grounded in low-level implementation details, that demonstrates why the sample efficiency of Imitation Learning may be worse than expected.
We frame this argument relative to other language-conditioned robotics baselines grounded in simulation, such as this paper by Stepputtis et al. and other work around simulated vision-and-language navigation and manipulation. (It's worth noting the # of demos in these latter datasets is in the 100K range, not the < 100 we have in our work!)
Additional Imitation Learning Experiments
We provide empirical results to better characterize the behavior of the Imitation Learning baseline. Each of the following rows presents results with various versions of imitation learning, first establishing a baseline, then ramping up the amount of data augmentation and the number of training demonstrations, respectively.
Training Details:
We sample 3 of the tasks from the paper (to make demonstration collection a little less time-consuming):
Pick Cereal: Pick up the green cereal bowl, and put it on the tray.
Pick Fruit Basket: Pick up the fruit basket and place it on the tray.
Pour Cup: Pick up the blue cup with marbles, and pour the marbles into the black coffee mug.
We provide a reference point first: LILA and Imitation Learning trained on the same number of demonstrations.
For LILA, the controller (in these videos) is one of the authors; this is just for convenience -- please refer to the paper and main supplemental page for the user study results.
We assume the same language instruction for each task throughout, displayed underneath the reference videos.
All other details are the same as the paper unless otherwise specified.
Imitation Learning w/ 10 Demos per Task & 3x Data Augmentation
Recap of Data Augmentation Procedure: For each state (joint angles) s_1 in a demonstration, randomly add noise drawn from N(0, 0.01) and compute the action (joint velocity) as s_2 - noise(s_1); you then add this (noised state, action) pair to your dataset. You can do this K times (repeatedly noising states). This is a strong form of data augmentation, as it re-computes actions based on the noised state, similar in fashion to DART.
By default (for LILA, and the above Imitation Learning videos), K = 1. In the following, K = 3, effectively showing the Imitation Learning model 3x more unique (state, action) pairs than LILA sees.
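For concreteness, here is a minimal sketch of this augmentation in NumPy. The names (`demo_states`, `K`, `noise_std`) are illustrative rather than taken from our codebase, and we interpret N(0, 0.01) as a standard deviation of 0.01; adjust if it denotes the variance.

```python
import numpy as np

def augment_demo(demo_states, K=3, noise_std=0.01):
    """Noise-augment one demonstration of joint-angle states.

    demo_states: array of shape (T, 7) of joint angles over the demonstration.
    Returns (noised state, action) pairs, where the action is recomputed from
    the *noised* state (DART-style), so the policy learns to correct for the
    perturbation rather than simply replaying the original velocities.
    """
    pairs = []
    for _ in range(K):  # repeat the noising pass K times
        for t in range(len(demo_states) - 1):
            noised = demo_states[t] + np.random.normal(0.0, noise_std, size=demo_states[t].shape)
            action = demo_states[t + 1] - noised  # s_2 - noise(s_1)
            pairs.append((noised, action))
    return pairs
```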
Imitation Learning w/ 10 Demos per Task & 5x Data Augmentation
We stretch this further, and set K = 5 and run the same experiment. Imitation Learning now sees 5x more unique (state, action) pairs.
Imitation Learning w/ 20 Demos per Task & 3x Data Augmentation
Data augmentation alone does not suffice. As in the paper, we double the number of demonstrations, keep triple the data augmentation, and see how the imitation learning model performs.
Imitation Learning w/ 30 Demos per Task & 3x Data Augmentation
Punchline of Robot Experiments
Despite the number of demonstrations we collected and the amount of data augmentation used, this standard imitation learning baseline (behavioral cloning) is still not able to achieve complete task success. It's successfully grounding the language instructions, as evidenced by the motions (especially as augmentation and the number of demonstrations increase). Semantically, the robot is attempting to perform the correct task – reaching for the cereal bowl, or trying to grasp the cup; it knows where everything is, and what the user's objective is. It just fails to execute on it.
As minor errors accumulate and the robot starts to drift out of distribution, the decision points are no longer clear, and it fails to execute the complete motion – stopping *just before* reaching the objects. At a coarse level, it's clear that the method is capable, but it may require much more data (possibly with states sampled at a much higher frequency) to work. While we think there is some way to get this to work, we hope this shows that the benefit of using latent actions style models is clear: not only do they require far less data, they are also more robust (as we see across the rows here, the behavior of the imitation learning model isn't predictable; much is left to randomness!).
In the following, we present an argument that illustrates why the gap between Imitation Learning and LILA may be *even larger* than expected in our real-robot setting, due to imperfect communication between the top-level Python process (housing the learned models) and the low-level C++ robot controller. We'll do this by walking through a simplified trajectory collection + imitation learning example!
Consider the sinusoidal trajectory shown, traced by a simplified end-effector through 2D space. The robot's motion is continuous, but we can sample states at fixed intervals when collecting this "demonstration".
Notably, this is an assumption present implicitly in most simulators (fixed frame rate, or fixed control iterations/sec) – this is also usually true when operating with discrete action spaces (move forward/right/left).
However, this fixed interval assumption may not be true when collecting data on a real robot, depending on the implementation. More on this in a bit, but for now, let's assume our demonstration data consists of evenly spaced (state, action) pairs with this fixed interval between samples.
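To make the toy example concrete, the sketch below (purely illustrative; the path, constants, and names are not from our stack) samples a sinusoidal 2D end-effector path at a fixed interval and records (state, action) pairs, where the action is simply the displacement to the next sampled state:

```python
import numpy as np

DT = 0.1  # fixed sampling interval -- the implicit "simulator" assumption

def true_position(t):
    """Continuous end-effector path: move right while oscillating vertically."""
    return np.array([t, np.sin(t)])

# Sample the continuous motion at a fixed interval to build the "demonstration."
times = np.arange(0.0, 2 * np.pi, DT)
states = np.stack([true_position(t) for t in times])
actions = states[1:] - states[:-1]       # action = displacement to the next state
demo = list(zip(states[:-1], actions))   # evenly spaced (state, action) pairs
```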
Consider training an imitation learning agent via behavioral cloning, where the policy is parameterized as a neural network. It's not clear what a NN will do given a state outside its training distribution (it could be arbitrarily bad), but to simplify, let's assume that given a new state, this policy will predict an action by retrieving the "nearest neighbor" from its training demo.
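Under that simplification, the "policy" just looks up the closest training state and replays its recorded action. Continuing the toy example above:

```python
import numpy as np

class NearestNeighborPolicy:
    """Straw-man stand-in for a behaviorally-cloned network: retrieve the
    closest training state and replay the action recorded for it."""

    def __init__(self, demo):
        self.states = np.stack([s for s, _ in demo])
        self.actions = np.stack([a for _, a in demo])

    def act(self, state):
        idx = np.argmin(np.linalg.norm(self.states - state, axis=1))
        return self.actions[idx]

policy = NearestNeighborPolicy(demo)  # `demo` from the sketch above
```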
Assume there's some noise in the reset (this is realistic – there's always *some* noise in the initial joint states, and simulators like MuJoCo and PyBullet model it as well). If you roll out the imitation learning policy, you get behavior like that shown – critically, assuming the same constant sampling rate, the state error grows minimally over time.
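Continuing the toy example, a rollout with a perturbed reset but the same fixed interval looks like the sketch below (the noise scales are illustrative, not measured values):

```python
import numpy as np

def rollout(policy, start, steps=60, jitter=0.0, seed=0):
    """Roll out the nearest-neighbor policy from a (possibly noisy) start state.

    jitter=0.0 corresponds to the fixed-interval assumption; jitter > 0 scales
    each executed action, mimicking actions applied too early or too late.
    """
    rng = np.random.default_rng(seed)
    state = np.array(start, dtype=float)
    trajectory = [state.copy()]
    for _ in range(steps):
        action = policy.act(state)
        state = state + (1.0 + rng.normal(0.0, jitter)) * action
        trajectory.append(state.copy())
    return np.stack(trajectory)

# Noisy reset, but the same constant sampling rate as the demonstration:
noisy_start = states[0] + np.random.normal(0.0, 0.05, size=2)
fixed_rate_traj = rollout(policy, noisy_start, jitter=0.0)  # error tends to stay small
```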
Most simulators for continuous robotics are implemented this way – notably those in related work like that of Stepputtis et al., and in traditional work in Shared Autonomy (for example, "Shared Autonomy via Deep Reinforcement Learning" and "Residual Policy Learning for Shared Autonomy"), which use the simulated Lunar Lander environment (with a fixed frame rate) and a discrete-action quadcopter environment, both exhibiting this property.
For continuous-state, continuous-action robotics grounded in a real-world robot, this assumption does not hold, which leads to the following point!
With our implementation on a 7-DoF Franka Emika Panda robot, we noticed that despite our best efforts, we are not able to ensure states/actions are dispatched at a constant sampling rate.
The result (based on our straw-man nearest-neighbor NN argument from above – though note that the real-world behavior, especially in higher dimensions, will be much worse!) is as shown on the right.
With slippage in the read/publish times of states and actions, we can read states too early or too late, execute a "bad" action, and cascade to arbitrarily bad final states over the course of execution (even completely breaking in the middle of execution).
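In the same toy model, we can mimic this slippage by scaling each executed action by a random factor (as if the action were applied for a slightly wrong duration). The 0.3 jitter level below is arbitrary, chosen only for illustration:

```python
import numpy as np

# Same policy and noisy reset as above, but with jittered timing.
variable_rate_traj = rollout(policy, noisy_start, jitter=0.3)

# Rough measure of drift: distance from each visited state to the closest
# demonstrated state. Under a fixed interval this tends to stay small; with
# jitter it can grow until the nearest-neighbor lookup returns actions from
# the wrong phase of the motion, and the rollout never recovers.
drift = np.min(
    np.linalg.norm(variable_rate_traj[:, None, :] - states[None, :, :], axis=-1),
    axis=1,
)
```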
The reasoning for this slippage is simple: with both publish-subscribe approaches and native lightweight socket approaches (which our implementation uses to establish communication between our high-level learning code in Python and the libfranka control code in C++), we cannot ensure that messages are read and published over the network at a set frequency, especially in the presence of other heavy processes on the same hardware – like running inference with a NN (especially a language-conditioned one that uses a large model like BERT). Details like this feel under-represented in HRI research, and implementation best practices in light of these claims warrant further research and discussion. We're happy to provide more detail about our robotics stack if that would be helpful!
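One simple way to observe this is to time the high-level control loop itself. The sketch below is hypothetical (it is not our actual stack, and `infer_fn` stands in for whatever work happens each step, e.g., a forward pass of the language-conditioned model plus socket reads/writes); it measures how much the realized per-step period deviates from a target rate:

```python
import time
import numpy as np

def measure_loop_jitter(infer_fn, target_hz=20, steps=200):
    """Measure how far a Python control loop drifts from its target period
    when inference and I/O run inside the loop."""
    target_dt = 1.0 / target_hz
    periods = []
    last = time.perf_counter()
    for _ in range(steps):
        infer_fn()  # stand-in for NN inference + state read / action publish
        # Sleep whatever time remains in this period (if any is left).
        elapsed = time.perf_counter() - last
        time.sleep(max(0.0, target_dt - elapsed))
        now = time.perf_counter()
        periods.append(now - last)
        last = now
    periods = np.array(periods)
    return periods.mean(), periods.std(), periods.max()
```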
There are two remaining questions that are worth answering:
Imitation works in practice, so why not here? – This is true! IL approaches can absolutely work (especially in the fixed-interval paradigm) provided they sample enough demonstrations (have enough data) to cover this arbitrary shift. As you collect more demonstrations, you end up densely sampling the "in-between" points in the above trajectories (and compensating for the distribution of initial-state noise) – this takes a lot of data even for simple simulated tasks learned end-to-end (e.g., the above vision-and-language navigation work), and it requires even more data on a real robot to compensate for the noisy communication!
Why does LILA work? – LILA, and latent action approaches generally, fundamentally provide a human user control over the robot's behavior. Even if the communication is noisy, provided the demonstration data is diverse enough, the model can expose useful control axes that prevent the robot from getting too far out of distribution. Think of a latent actions model as doing a version of PCA: it's learning to fit the "high-variance" components of the data; one of those components (1-DoF) is the component necessary to "do the demonstration," while another component could fit noise in those demonstrations, allowing users to correct during execution (e.g., in the above 2D plot, moving the end-effector up/down so it returns to "in-distribution" states).
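To make the PCA analogy concrete, here is a minimal sketch (illustrative only; LILA's actual latent actions model is learned, not PCA) that fits principal axes to demonstration actions and maps a low-DoF user input onto them:

```python
import numpy as np

def fit_latent_axes(actions, n_axes=2):
    """PCA over demonstration actions: the top principal components capture
    the high-variance directions of motion seen in the demos."""
    mean = actions.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(actions - mean, full_matrices=False)
    return mean, vt[:n_axes]

def decode(z, mean, axes):
    """Map a low-DoF user input z (e.g., joystick axes) to a full robot action."""
    return mean + z @ axes

# actions: (N, 7) array of joint-velocity labels from demonstrations (illustrative).
# mean, axes = fit_latent_axes(actions)
# robot_action = decode(np.array([0.8, -0.1]), mean, axes)  # 2-DoF user input
```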
Punchline of Technical Argument
Due to fundamental problems with continuous control, we've found that language-conditioned imitation learning requires a large number of samples per task to work robustly.
Our existing language-conditioned imitation learning baseline (behavioral cloning) is standard and state-conditioned; however, as the above argument lays out, it suffers from these fundamental problems.
LILA, as a shared autonomy approach, is able to compensate for this by providing users a controllable space which they can use to recover from arbitrary errors induced by communication and/or inherent noise. Indeed, the ability of latent-actions style approaches to remain incredibly sample efficient even in complex, real-world deployments is a further benefit of using them.