Weiyao Wang and Gregory D. Hager
Johns Hopkins University
Deploying machine learning algorithms for robot tasks in real-world applications presents a core challenge: overcoming the domain gap between the training and the deployment environment. This is particularly difficult for visuomotor policies that use high-dimensional images as input. A common way to tackle this issue is domain randomization, which aims to broaden the span of the training distribution to cover the test-time distribution. However, this approach is only effective when the domain randomizations encompass the actual shifts in the test-time distribution. We instead take a different approach: we use a single demonstration (a prompt) to learn a policy that adapts to the target test-time environment. Our proposed framework, PromptAdapt, leverages the Transformer architecture's capacity to model sequential data to learn demonstration-conditioned visual policies, allowing in-context adaptation to a target domain that is distinct from the training domain. Our experiments in both simulation and real-world settings show that PromptAdapt is a strong domain-adapting policy that outperforms baseline methods by a large margin under a range of domain shifts, including variations in lighting, color, texture, and camera pose.
Left: We first train a teacher policy using privileged ground-truth state information.
Right: We then distill the learned policy into a visual-input-only student policy through imitation learning. For the student policy, our PromptAdapt architecture leverages a Transformer network to condition on a single demonstration, efficiently adapting to the target domain at test time. The same domain randomization function is applied to the demonstration images and the per-step observations, enabling demonstration-conditioned adaptation to changes in visual appearance.
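As a concrete illustration of the demonstration-conditioned student policy described above, a minimal sketch is given below, assuming a PyTorch implementation. The module names, layer sizes, and the CNN-plus-Transformer layout are illustrative assumptions, not the exact PromptAdapt architecture.

```python
# Minimal sketch of a demonstration-conditioned visual policy (assumed layout).
import torch
import torch.nn as nn

class PromptAdaptPolicy(nn.Module):
    def __init__(self, action_dim, embed_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # Shared image encoder for demonstration frames and the current observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, demo_frames, obs):
        # demo_frames: (B, T, 3, H, W) prompt demonstration; obs: (B, 3, H, W).
        B, T = demo_frames.shape[:2]
        demo_tokens = self.encoder(demo_frames.flatten(0, 1)).view(B, T, -1)
        obs_token = self.encoder(obs).unsqueeze(1)
        # The observation token attends to the demonstration tokens, giving
        # in-context adaptation to the target domain's visual appearance.
        tokens = torch.cat([demo_tokens, obs_token], dim=1)
        out = self.transformer(tokens)
        return self.action_head(out[:, -1])  # action from the observation token
```

In this sketch, distillation would regress the student's predicted action toward the teacher's action, with the same randomization function applied to both the prompt demonstration frames and the current observation, as described in the caption above.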
Real-world experiment rollouts under sim-to-real settings: UR5 Reach (ID), UR5 Reach (OD), UR5 Push (ID), and UR5 Push (OD).