The grand aim of having a single robot that can manipulate arbitrary objects in diverse settings is at odds with the scarcity of robotics datasets. Acquiring and growing such datasets is strenuous due to manual effort, operational costs, and safety challenges. A path toward such a universal agent requires an efficient framework capable of extracting generalization from a reasonable data budget. In this paper, we develop an efficient framework (MT-ACT) for training universal agents capable of multi-task manipulation skills using (a) semantic augmentations that can rapidly multiply existing datasets and (b) action representations that can extract performant policies from small yet diverse multi-modal datasets without overfitting. In addition, reliable task conditioning and an expressive policy architecture enable our agent to exhibit a diverse repertoire of skills in novel situations specified using task commands. Using merely 7500 demonstrations, we are able to train a single agent capable of 12 unique skills and demonstrate its generalization over 30 tasks spread across common daily activities in diverse kitchen scenes. On average, MT-ACT outperforms prior methods by over 40% in unseen situations, while being more sample efficient.
MT-ACT agent capable of 12 manipulation skills, demonstrated in hundreds of scenes
Summary of the MT-ACT framework showing the two main stages: 1) scene augmentation for multiplying data, and 2) learning efficient action representations for ingesting multi-modal, multi-task data into a single agent.
Data augmentations that we develop to rapidly multiply scarce robot datasets with semantic scene variations
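The augmentation stage can be pictured as compositing semantic variations into existing frames so that one collected demonstration yields many visually distinct training examples. The sketch below is a minimal illustration of this idea, not the paper's actual pipeline: it assumes an object/background mask and a set of replacement appearances are already available (in practice these would come from segmentation and generative inpainting models), and the function names are hypothetical.

```python
import numpy as np

def semantic_augment(frame, mask, replacement):
    """Replace the masked region of a frame with new semantic content.

    frame:       (H, W, 3) uint8 image from a robot demonstration
    mask:        (H, W) boolean array marking the region to vary
    replacement: (H, W, 3) uint8 image providing the new appearance
    """
    out = frame.copy()
    out[mask] = replacement[mask]  # composite the new content into the scene
    return out

def multiply_dataset(frames, masks, replacements):
    """Each (frame, mask) pair yields one augmented copy per replacement,
    multiplying the dataset without collecting new robot data."""
    return [semantic_augment(f, m, r)
            for f, m in zip(frames, masks)
            for r in replacements]
```

With N frames and K candidate replacements, this produces N x K augmented frames that all share the same action labels as their source demonstrations, which is what lets scarce robot data be multiplied cheaply.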
We study the MT-ACT agent across four axes of generalization, visualized in the figure on the left.
As can be seen, solving increasing levels of generalization requires the policy to be highly robust.
The figure on the left compares our proposed MT-ACT policy representation against several imitation learning architectures. For this result, we use environment variations that include only object pose changes and some lighting changes. Similar to previous works, we refer to this as L1 generalization. From our results we can clearly see that using action chunking to model sub-trajectories significantly outperforms all baselines, reinforcing the effectiveness of our proposed policy representation for sample-efficient learning.
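Action chunking amounts to querying the policy once for a short sequence of future actions and executing them before re-querying, rather than predicting a single action per step. The sketch below illustrates this execution loop under stated assumptions: `policy` is a stand-in callable returning a `(CHUNK, action_dim)` array, the chunk size is an illustrative value, and the environment-feedback step is elided, so this is not the paper's implementation.

```python
import numpy as np

CHUNK = 8  # number of future actions predicted per policy query (illustrative)

def rollout_with_chunking(policy, obs, horizon):
    """Roll out a policy that predicts action chunks (sub-trajectories).

    policy(obs) -> (CHUNK, action_dim) array of future actions.
    Querying once per CHUNK steps shortens the effective decision horizon,
    which helps reduce compounding errors in imitation learning.
    """
    actions = []
    t = 0
    while t < horizon:
        chunk = policy(obs)              # predict the next CHUNK actions
        for a in chunk[: horizon - t]:
            actions.append(a)            # execute open-loop within the chunk
            t += 1
        # obs would be refreshed from the environment here before re-querying
    return np.stack(actions)
```

Because the policy commits to a sub-trajectory at a time, it is queried `horizon / CHUNK` times instead of `horizon` times, which is one intuition for the sample-efficiency gains reported above.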
The figure on the left shows how each method performs on the different levels of generalization we test our approach on (visualized in the figure above). In a rigorous evaluation study, we observe that MT-ACT significantly outperforms all other methods, especially on the harder generalization levels (L3). Interestingly, while most baseline methods perform comparatively well for L1 generalization (max: 30%), their performance drops rapidly (max: 3%) for L3 generalization.
We also evaluate how RoboAgent performs with increasing levels of semantic augmentation, studied on one activity (5 skills). The figure on the left shows that with more data (i.e., more augmentations per frame), performance improves significantly across all generalization levels. Importantly, the gain is much larger for the harder tasks (L3 generalization).
Here, we compare MT-ACT (bottom row) with MT-ACT without augmentations (top row). By virtue of being trained with diverse semantic augmentations, MT-ACT generalizes to different scenes with significant variations.
L2 generalization
L4 generalization
L3 generalization
L2 generalization
Here we show live resetting, wherein a human completely resets the scene and the learned policy must perform the same skill as before. This shows that our learned policy is highly robust to different distractors in the scene.
Here, in the video on the right, we show results for manually perturbing the scene while the agent is performing the task. This result shows that even though a human hand was never seen by the agent during training, the agent is robust to occlusion from the hand and to task perturbations performed by the human.
Extreme generalization and robustness to geographically different locations