Overview Video (4m 43s)
Abstract
Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies.
Improves Various SOTA Policy Models
Our framework, Policy Decorator, improves various state-of-the-art policy models, such as Behavior Transformer and Diffusion Policy, boosting their success rates to nearly 100% on challenging robotic tasks. It also significantly outperforms top-performing baselines from both finetuning and non-finetuning method families.
Combines Advantages of Base Policy and Online Learning
An intriguing property of Policy Decorator is its ability to combine the advantages of the base policy and online learning.
The offline-trained base policies can reproduce the natural and smooth motions recorded in demonstrations but may have suboptimal performance.
Policy Decorator (ours) achieves remarkably high success rates while preserving the favorable attributes of the base policy.
Policies solely learned by RL, though achieving good success rates, often exhibit jerky actions, rendering them unsuitable for real-world applications.
Method Overview
Policy Decorator learns a residual policy via reinforcement learning with sparse rewards, and implements a set of controlled exploration mechanisms on top of it.
Controlled exploration (Progressive Exploration Schedule + Bounded Residual Actions) lets the agent (base policy plus residual policy) continuously receive sufficient success signals while exploring the environment.
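For concreteness, the sketch below shows one way the two mechanisms could compose at action time. It is a minimal illustration rather than the paper's implementation: the function names, the linear ramp, and the tanh-based bounding are assumptions.

```python
import numpy as np

def policy_decorator_action(obs, base_policy, residual_policy, alpha, step, H, rng):
    """Compose the executed action from the frozen base policy and the learned
    residual policy under controlled exploration (illustrative sketch).

    alpha : bound on the residual action magnitude (Bounded Residual Actions)
    H     : number of env steps over which the probability of applying the
            residual ramps from 0 to 1 (Progressive Exploration Schedule)
    """
    a_base = base_policy(obs)                       # frozen, offline-trained model
    a_res = alpha * np.tanh(residual_policy(obs))   # residual kept within [-alpha, alpha]

    # Early in training, mostly execute the base policy alone so the agent keeps
    # receiving success signals; gradually hand control to base + residual.
    p_residual = min(1.0, step / H)
    if rng.random() < p_residual:
        return a_base + a_res
    return a_base
```

In this sketch, only the residual policy is updated by the sparse-reward RL algorithm; the offline-trained base model is treated as a frozen black box.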
Ablation Study
We examined the relative importance of Policy Decorator's main components: 1) residual policy learning, 2) the progressive exploration schedule, and 3) bounded residual actions. We thoroughly evaluated all possible combinations of these components; results are shown in the figure on the right.
Conclusion: Each component greatly contributes to the overall performance, both individually and collectively. While residual policy learning establishes the foundation of our framework, using it alone does not sufficiently improve the base policy. Bounded residual action is essential for effective residual policy learning, and the progressive exploration schedule further enhances sample efficiency.
The hyperparameter H controls the rate at which we switch from the base policy to the residual policy.
Conclusion: If H is too small, aggressive exploration can cause training to fail; if H is too large, sample efficiency suffers. Tuning H therefore improves sample efficiency and keeps training stable, but a large H is generally a safe choice when sample efficiency is not the primary concern.
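As a rough illustration (assuming the linear ramp from the sketch above; the H values here are hypothetical), a smaller H hands control to the combined action sooner:

```python
def p_residual(step, H):
    # Linear ramp assumed in the sketch above: probability of applying the residual.
    return min(1.0, step / H)

for H in (200_000, 1_000_000):   # hypothetical values
    probs = [round(p_residual(s, H), 2) for s in (50_000, 200_000, 1_000_000)]
    print(f"H={H}: {probs}")
# H=200000:  [0.25, 1.0, 1.0]  -> faster hand-off, more aggressive exploration
# H=1000000: [0.05, 0.2, 1.0]  -> gentler hand-off, slower to fully take effect
```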
The hyperparameter α determines the maximum adjustment that can be made by the residual policy.
Conclusion: α is crucial to performance. If α is too small, the residual cannot adjust the base policy enough and final performance suffers; if α is too large, sample efficiency during training degrades.
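The snippet below illustrates the bound; the tanh-squash-and-scale form and the numeric values are assumptions for illustration only.

```python
import numpy as np

def bound_residual(raw_residual, alpha):
    # One way to enforce the bound (an assumption here): squash with tanh,
    # then scale, so the residual always lies in [-alpha, alpha].
    return alpha * np.tanh(raw_residual)

raw = np.array([3.0, -0.5, 10.0])           # unbounded residual-network output
print(bound_residual(raw, alpha=0.1))       # ≈ [ 0.0995, -0.0462,  0.1   ]
print(bound_residual(raw, alpha=0.5))       # ≈ [ 0.4975, -0.2311,  0.5   ]
```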