Norman Di Palo¹ ², Leonard Hasenclever*², Jan Humplik*², Arunkumar Byravan*²
Imperial College London¹, Google DeepMind²
*senior authors
We propose Diffusion Augmented Agents, a framework based on the interplay between large language models, vision language models, and diffusion models that improves transfer learning and efficient exploration in embodied agents.
We address the problem of sample efficiency when training instruction-following embodied agents with reinforcement learning in a lifelong setting, where rewards may be sparse or absent. Our framework leverages a large language model (LLM), a vision language model (VLM), and a pipeline that uses image diffusion models for temporally and geometrically consistent conditional video generation to hindsight relabel the agent's past experience. Given a video-instruction pair and a target instruction, we ask the LLM whether our diffusion model could transform the video into one consistent with the target instruction, and, if so, we apply this transformation. We use such hindsight data augmentation to decrease 1) the amount of data needed to fine-tune a VLM that acts as a reward detector, and 2) the amount of reward-labelled data needed for RL training. The LLM orchestrates this process, making the entire framework autonomous and independent of human supervision, and hence particularly suited to lifelong reinforcement learning scenarios. We empirically demonstrate gains in sample efficiency when training in simulated robotics environments, including manipulation and navigation tasks, showing improvements in learning reward detectors, transferring past experience, and learning new tasks, key abilities for efficient, lifelong learning agents.
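As a rough illustration of the hindsight relabelling loop described above, the Python sketch below shows how an LLM gate and a diffusion editor could be combined; the names (`Episode`, `hindsight_augment`, the `llm` and `diffusion_pipeline` callables) are illustrative placeholders rather than our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    frames: list          # RGB observations of one trajectory
    actions: list         # actions the agent took
    instruction: str      # instruction the episode was collected under

def hindsight_augment(buffer, target_instruction, llm, diffusion_pipeline):
    """Relabel past episodes as (synthetic) demonstrations of a new task.

    For every stored episode, an LLM is asked whether an image-editing
    diffusion model could plausibly turn its frames into a video consistent
    with the target instruction (e.g. by swapping the manipulated object).
    If so, the frames are edited and the episode is kept with the new
    instruction, reusing the original actions.
    """
    augmented = []
    for episode in buffer:
        prompt = (
            f"Original instruction: {episode.instruction}\n"
            f"Target instruction: {target_instruction}\n"
            "Could an image-editing diffusion model turn a video of the first "
            "task into a plausible video of the second? Answer yes or no and, "
            "if yes, describe the object edit to apply."
        )
        answer = llm(prompt)  # e.g. "yes: replace the red cube with a blue cube"
        if answer.strip().lower().startswith("yes"):
            edited_frames = diffusion_pipeline(episode.frames, edit=answer)
            augmented.append(Episode(edited_frames, episode.actions, target_instruction))
    return augmented
```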
We test our method on two robotics environments: a stacking environment and a room navigation environment.
[Videos: RGB Stacking, Room]
We propose a diffusion pipeline that improves geometrical and temporal consistency and can modify objects in videos. On the left are examples of real observations; on the right, diffusion-augmented observations.
[Video panels: Real vs. Diffusion-Augmented]
Here are some example augmentations applied to data from the Open X-Embodiment Dataset and other sources.
[Video pairs: Original vs. Diffusion-Augmented]
Here we highlight the geometrical and temporal consistency of our pipeline on a series of real-world videos.
[Video panels: Original vs. Diffusion-Augmented]
Here we highlight the effect of the geometrical and temporal consistency components of our method by removing each of them on some example videos.
[Video panels: Original | Geom + Time | Geom, No Time | No Geom, No Time]
We show here some additional results on background/room augmentations that we did not include in our submission, but which indicate directions for future work. When our diffusion pipeline is further augmented with optical flow estimation (a rough sketch of this flow-based warping follows the videos below), we can modify entire rooms while preserving their geometrical structure, and can therefore go from low-fidelity simulations to photorealistic scenes.
[Videos: Simulator Observations vs. Diffusion-Augmented; NeRF Observations vs. Diffusion-Augmented]
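As a rough sketch of how optical flow estimation can support such room augmentations, the code below uses an off-the-shelf RAFT model from torchvision to warp a previously augmented frame into the current frame's geometry, so a per-frame diffusion edit can be blended or reused consistently over time; the function and variable names are illustrative, not our actual pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Off-the-shelf optical flow model (RAFT) from torchvision.
weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def warp_previous_edit(prev_edited, prev_frame, curr_frame):
    """Warp the previously edited frame into the current frame's geometry.

    All inputs are float tensors of shape (1, 3, H, W) in [0, 1], with H and W
    divisible by 8 (a RAFT requirement). The returned warped edit can be
    blended with the current frame's diffusion output to keep the room edit
    temporally stable.
    """
    # Backward flow: for each pixel of the current frame, where it was in the
    # previous frame.
    src, dst = preprocess(curr_frame, prev_frame)
    flow = raft(src, dst)[-1]                       # (1, 2, H, W), finest estimate

    _, _, h, w = flow.shape
    # Base sampling grid in pixel coordinates (channel 0 = x, channel 1 = y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)
    coords = base + flow

    # Normalize to [-1, 1] and sample the previous edited frame.
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)               # (1, H, W, 2) as grid_sample expects
    return F.grid_sample(prev_edited, grid, align_corners=True)
```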
Limitations
While our work demonstrates the positive impact that foundation and generative models can have on embodied AI, our current approach has a series of limitations that can be addressed by future research.
1) Temporal consistency is obtained by fixing noise maps and adding frame-level cross-attention to single-frame diffusion models such as Stable Diffusion 1.5 (see the first sketch after this list). While this technique led to convincing results, as shown above, the future use of video diffusion models could bring the visual fidelity to an even higher standard.
2) To achieve geometrical consistency, we feed a ControlNet a series of visual conditioning inputs derived from the RGB observation, such as depth maps, normal maps, Canny edges, and segmentation maps. To mimic a real-world scenario, we obtain all of these from off-the-shelf models (see the second sketch after this list), which can however produce noisy or incorrect estimates. Performance is therefore sometimes bottlenecked by these models (e.g. in the videos above, some visual errors due to incorrect segmentation can be seen). However, monocular depth estimation, segmentation, and related methods have steadily improved over the years, and we therefore believe the errors stemming from incorrect outputs of these models will reduce substantially over time.
3) We currently apply diffusion augmentation by visually modifying an object into another one with the same geometrical structure. This constraint is important because it allows us to assume that the original trajectory of actions would also work on the new object. However, recent improvements in 3D object generation may make it possible to generate and use objects that are geometrically similar but not identical, enabling the method to learn even more general behaviors from the same data.
4) When robot manipulators grasp objects, the objects may become strongly occluded; in that case our pipeline cannot recognise them and therefore cannot modify them into different objects via diffusion.
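To make the temporal-consistency mechanism of point 1) concrete, here is a minimal, self-contained PyTorch toy sketch of the two ingredients: a shared initial noise map and cross-frame attention in which every frame attends to the keys and values of an anchor frame. It illustrates the idea only; it is not our Stable Diffusion 1.5 integration, and the tensor shapes are arbitrary.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Attention variant where keys/values come from an anchor frame.

    x has shape (frames, tokens, dim): each frame's queries attend to the
    keys and values of frame 0, tying the appearance of all frames to the
    anchor and improving temporal consistency.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        frames = x.shape[0]
        anchor = x[0:1].expand(frames, -1, -1)      # keys/values from frame 0
        out, _ = self.attn(query=x, key=anchor, value=anchor)
        return out

# Ingredient 1: the SAME initial noise map for every frame, so the per-frame
# diffusion edits all start from an identical latent.
num_frames, latent_shape = 8, (4, 64, 64)
shared_noise = torch.randn(1, *latent_shape).expand(num_frames, -1, -1, -1)

# Ingredient 2: cross-frame attention over the flattened per-frame latents.
tokens = shared_noise.flatten(2).permute(0, 2, 1)   # (frames, 64*64, 4)
layer = CrossFrameAttention(dim=4, heads=1)
consistent = layer(tokens)
print(consistent.shape)                              # torch.Size([8, 4096, 4])
```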
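Point 2) can similarly be illustrated with a minimal multi-condition ControlNet sketch using the Hugging Face diffusers library; the checkpoint names, conditioning images, prompt, and conditioning scales below are assumptions for illustration (a normal-map ControlNet could be added the same way), and the sketch omits the temporal-consistency components described above.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Publicly available ControlNets for Stable Diffusion 1.5; example checkpoints,
# not necessarily the ones we used.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning images produced by off-the-shelf monocular depth, edge and
# segmentation models applied to the original RGB observation (placeholder paths).
depth_map = load_image("depth.png")
canny_edges = load_image("canny.png")
seg_map = load_image("seg.png")

edited = pipe(
    prompt="a robot arm stacking a blue cube on a green cube",
    image=[depth_map, canny_edges, seg_map],
    controlnet_conditioning_scale=[1.0, 0.8, 0.8],   # illustrative weights
    num_inference_steps=30,
).images[0]
edited.save("augmented_frame.png")
```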
Models and Main Hyperparameters