The grand aim of having a single robot that can manipulate arbitrary objects in diverse settings is at odds with the scarcity of robotics datasets. Acquiring and growing such datasets is strenuous due to manual effort, operational costs, and safety challenges. A path toward such a universal agent requires an efficient framework capable of extracting generalization from a reasonable data budget. In this paper, we develop an efficient framework (MT-ACT) for training universal agents capable of multi-task manipulation skills using (a) semantic augmentations that can rapidly multiply existing datasets and (b) action representations that can extract performant policies from small yet diverse multi-modal datasets without overfitting. In addition, reliable task conditioning and an expressive policy architecture enable our agent to exhibit a diverse repertoire of skills in novel situations specified using task commands. Using merely 7500 demonstrations, we are able to train a single agent capable of 12 unique skills and demonstrate its generalization over 30 tasks spread across common daily activities in diverse kitchen scenes. On average, MT-ACT outperforms prior methods by over 40% in unseen situations, while being more sample efficient.
MT-ACT agent capable of 12 manipulation skills, demonstrated in hundreds of scenes
Summary of the MT-ACT framework showing the two main stages: 1) scene augmentation for multiplying data, and 2) learning efficient action representations for ingesting multi-modal, multi-task data into a single agent.
Data augmentations that we develop to rapidly multiply scarce robot datasets with semantic scene variations
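The augmentation stage can be pictured as compositing semantic variations into existing frames so that one collected demonstration yields many visually distinct training examples. The sketch below is a minimal illustration of this idea, not the paper's actual pipeline: it assumes an object/background mask and a set of replacement appearances are already available (in practice these would come from segmentation and generative inpainting models), and the function names are hypothetical.

```python
import numpy as np

def semantic_augment(frame, mask, replacement):
    """Replace the masked region of a frame with new semantic content.

    frame:       (H, W, 3) uint8 image from a robot demonstration
    mask:        (H, W) boolean array marking the region to vary
    replacement: (H, W, 3) uint8 image providing the new appearance
    """
    out = frame.copy()
    out[mask] = replacement[mask]  # composite the new content into the scene
    return out

def multiply_dataset(frames, masks, replacements):
    """Each (frame, mask) pair yields one augmented copy per replacement,
    multiplying the dataset without collecting new robot data."""
    return [semantic_augment(f, m, r)
            for f, m in zip(frames, masks)
            for r in replacements]
```

With N frames and K candidate replacements, this produces N x K augmented frames that all share the same action labels as their source demonstrations, which is what lets scarce robot data be multiplied cheaply.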
We study the MT-ACT agent across four axes of generalization, visualized in the figure on the left.
As can be seen, solving increasing levels of generalization requires the policy to be highly robust.
The figure on the left compares our proposed MT-ACT policy representation against several imitation learning architectures. For this result, we use environment variations that include only object pose changes and some lighting changes. Similar to previous works, we refer to this as L1 generalization. From our results we can clearly see that using action chunking to model sub-trajectories significantly outperforms all baselines, reinforcing the effectiveness of our proposed policy representation for sample-efficient learning.
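Action chunking amounts to querying the policy once for a short sequence of future actions and executing them before re-querying, rather than predicting a single action per step. The sketch below illustrates this execution loop under stated assumptions: `policy` is a stand-in callable returning a `(CHUNK, action_dim)` array, the chunk size is an illustrative value, and the environment-feedback step is elided, so this is not the paper's implementation.

```python
import numpy as np

CHUNK = 8  # number of future actions predicted per policy query (illustrative)

def rollout_with_chunking(policy, obs, horizon):
    """Roll out a policy that predicts action chunks (sub-trajectories).

    policy(obs) -> (CHUNK, action_dim) array of future actions.
    Querying once per CHUNK steps shortens the effective decision horizon,
    which helps reduce compounding errors in imitation learning.
    """
    actions = []
    t = 0
    while t < horizon:
        chunk = policy(obs)              # predict the next CHUNK actions
        for a in chunk[: horizon - t]:
            actions.append(a)            # execute open-loop within the chunk
            t += 1
        # obs would be refreshed from the environment here before re-querying
    return np.stack(actions)
```

Because the policy commits to a sub-trajectory at a time, it is queried `horizon / CHUNK` times instead of `horizon` times, which is one intuition for the sample-efficiency gains reported above.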
The figure on the left shows how each method performs on the different levels of generalization we test our approach on (visualized in the figure above). In a rigorous evaluation study, we observe that MT-ACT significantly outperforms all other methods, especially on the harder generalization levels (L3). Interestingly, while most baseline methods perform comparatively well for L1 generalization (max: 30%), their performance drops rapidly (max: 3%) for L3 generalization.
We also evaluate how RoboAgent performs with increasing levels of semantic augmentation, studied on one activity (5 skills). The figure on the left shows that with more data (i.e., more augmentations per frame), performance improves significantly across all generalization levels. Importantly, the gain is much larger for the harder tasks (L3 generalization).
Here, we compare MT-ACT (bottom row) with MT-ACT without augmentations (top row). By virtue of being trained with diverse semantic augmentations, MT-ACT generalizes to different scenes with significant variations.
L2 generalization
L4 generalization
L3 generalization
L2 generalization
Here we show live resetting, wherein a human completely resets the scene and the learned policy must perform the same skill as before. This shows that our learned policy is highly robust to different distractors in the scene.
Here, in the video on the right, we show results for manually perturbing the scene while the agent is performing the task. This result shows that even though a human hand was never seen by the agent during training, the agent is robust to occlusion from the hand and to task perturbations performed by the human.
Extreme generalization and robustness to geographically different locations