No-Language Baseline (Pure Latent Actions)

Augments the Experimental Setup in Section 5 of the paper, justifying the omission of a no-language baseline from the user study.

Overview of Experiments

A main contribution of our paper is the addition of language, which enables prior work on learning latent actions to operate in multi-task environments. First, we note that the supplement already includes a comparison with a no-language baseline (Figure 3).

Specifically, we visualized trajectories comparing the original method for learning latent actions (without language) against LILA on two tasks, demonstrating the extremely poor performance of the no-language baseline. Videos of controlling the robot with this baseline can be seen below:

no-lang-grab-the-cereal-bowl (1).mp4

User Intent: "Pick up the fruit basket and put it on the tray"

no-lang-pour-blue-cup-into-the-black-mug (1).mp4

User Intent: "Put the banana in the purple basket"

This method for learning latent actions from state alone (no language!) is a vacuous baseline, and as such we omit it from our user study.

Specifically:

  1. It can be trivially shown that, without any additional input, latent actions cannot be useful in multi-task environments, as they are limited by the total degrees of freedom of the controller (shown formally and with experiments below).

  2. Adding language is a rich way of providing the additional input necessary for multi-task behavior: it is at minimum equivalent to adding control degrees of freedom, while being far more natural and easier to scale to the many-task setting.

  3. The goal of our user study is to compare methods that could plausibly be useful for successful task completion. If included, the suggested baseline would be uninformative, as it would be impossible for any user to achieve a task completion rate above 0%.

While the purpose of Figure 2 (reproduced) was to demonstrate these points theoretically, we also trained no-language latent action models on the cross example to demonstrate these points empirically; we will include these results in the updated paper.

Formal Argument & Additional Experiments

Figure 2 from our paper shows a simple "cross" example: starting from the mid-point, the goal is to be able to navigate in all 4 possible directions.
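
For concreteness, a minimal 2-D stand-in for this cross environment is sketched below; the class name, dynamics, and step size are illustrative assumptions rather than the exact setup from the paper.

```python
import numpy as np

class CrossEnv:
    """Minimal 2-D "cross" world: the agent starts at the mid-point (origin)
    and should reach one of 4 goals (right, left, up, down). Illustrative only."""

    GOALS = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

    def __init__(self, max_action=0.1):
        self.max_action = max_action
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)
        return self.state.copy()

    def step(self, action):
        # Clip the commanded 2-D displacement and integrate the position.
        action = np.clip(action, -self.max_action, self.max_action)
        self.state = self.state + action
        return self.state.copy()
```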

As the standard latent actions framework takes in only the controller inputs, it is impossible for a user to succeed with just 1 degree of freedom (DoF). To solve this task, we need an additional axis to condition on (at least 2 DoF)!
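
To make this counting argument concrete, consider the following sketch (the notation here is ours, introduced for this response rather than taken from the paper). A latent action model decodes the user's low-DoF input into a robot action conditioned only on state: $a_t = f_\theta(s_t, z_t)$ with $z_t \in \mathbb{R}^k$. At the cross mid-point $s_0$, the set of commandable actions is the image $\{f_\theta(s_0, z) : z \in \mathbb{R}^k\}$, which has dimension at most $k$. For $k = 1$, and assuming the near-linear decoders typical of latent action models, this image is a single axis: the user can choose between two antipodal directions, and can therefore disambiguate at most 2 of the 4 arms. Adding a second latent dimension ($k = 2$) spans the plane, while conditioning on an utterance $\ell$ gives $a_t = f_\theta(s_t, z_t, \ell)$, letting each instruction re-orient the single axis; either change makes all 4 arms reachable.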

To provide empirical evidence for the above point, we train 3 different latent action models on the cross disambiguation task: a 1-DoF no-language baseline, LILA (our proposed approach) with 1 DoF, and a 2-DoF no-language baseline. Each model is trained on a dataset of 100 demonstrations collected across the 4 tasks. We then visualize movement trajectories by controlling the latent action (z for 1 DoF; z1 and z2 for 2 DoF).
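
For reference, the sketch below shows the kind of conditional autoencoder this experiment trains; the MLP architecture, layer sizes, and the 768-dim language embedding are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Conditional autoencoder over actions: encode (state, action) -> z,
    decode (state, z[, language]) -> action. Illustrative sketch only."""

    def __init__(self, state_dim=2, action_dim=2, z_dim=1, lang_dim=0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, z_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + z_dim + lang_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, state, action, lang=None):
        z = self.encoder(torch.cat([state, action], dim=-1))
        inputs = [state, z] if lang is None else [state, z, lang]
        return self.decoder(torch.cat(inputs, dim=-1))

# The three variants compared above (768 is an assumed embedding size):
no_lang_1dof = LatentActionModel(z_dim=1)
lila_1dof = LatentActionModel(z_dim=1, lang_dim=768)
no_lang_2dof = LatentActionModel(z_dim=2)

def train(model, loader, use_lang=False, epochs=50):
    """Action-reconstruction training over (state, action, lang) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for state, action, lang in loader:
            pred = model(state, action, lang if use_lang else None)
            loss = nn.functional.mse_loss(pred, action)
            opt.zero_grad()
            loss.backward()
            opt.step()
```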

As expected, without any additional input such as language, a 1-DoF controller is incapable of task disambiguation, with a clear 0% task success rate in our simple cross setting. With either our 1-DoF controller with language input or a 2-DoF controller, task disambiguation is possible, highlighting the necessity of additional information of some modality. However, note that in large multi-task environments with dozens of tasks, re-designing a controller can be difficult; language is a much more flexible and natural way to add this information.
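
The trajectory visualizations behind this comparison can be produced by sweeping the latent input and rolling out the decoder; below is a minimal sketch using the illustrative CrossEnv and LatentActionModel classes above.

```python
import numpy as np
import torch

def rollout(model, env, z, lang=None, steps=20):
    """Roll out the decoder under a fixed latent command z (illustrative)."""
    states = [env.reset()]
    for _ in range(steps):
        s = torch.as_tensor(states[-1], dtype=torch.float32)
        inputs = [s, z] if lang is None else [s, z, lang]
        with torch.no_grad():
            action = model.decoder(torch.cat(inputs, dim=-1)).numpy()
        states.append(env.step(action))
    return np.stack(states)

# Sweep the 1-DoF latent over [-1, 1]: without language, every rollout
# collapses onto a single axis, so at most 2 of the 4 arms are reachable.
env = CrossEnv()
for z_val in np.linspace(-1.0, 1.0, 9):
    traj = rollout(no_lang_1dof, env, torch.tensor([z_val], dtype=torch.float32))
    # plot traj[:, 0] against traj[:, 1] to visualize the trajectory
```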

We include these results showing the limitations of no-language approaches in our revision. However, we emphasize that because the goal of our user study was to evaluate methods for achieving multiple tasks, we believed it would be inappropriate to include a method that is fundamentally incapable of being useful.

Punchline of No-Language Baseline

In the supplement to our paper, we include two trajectory visualizations that we hoped would show that the no-language baseline, by "averaging" the demonstrations across all tasks, creates a 2-DoF latent action space that is unintuitive, incapable of making task progress, and generally a vacuous baseline.

Above, we show not only additional evidence (videos of the no-language baseline and the unintuitive behavior it generates), but also a technical argument that, without extra conditioning information (e.g., language describing the task), a 2-DoF latent space is formally not expressive enough to yield task-solving behavior, justifying our decision to omit this baseline from our user study.