VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning

Che Wang, Xufang Luo, Keith Ross, Dongsheng Li

NYU | NYU Shanghai | Microsoft Research Asia

NeurIPS 2022 paper

Code at github.com/microsoft/VRL3 

TL;DR: 

We combine ImageNet pretraining, offline RL, and online RL to massively improve sample efficiency on Adroit, by up to 12x (24x finetuned) compared to the previous SOTA, with 10x faster computation and 3x fewer parameters.

Main Idea: 

To achieve strong sample efficiency, fully exploit all the data available.

Visual robotic control tasks such as Adroit can be very challenging for 3 major reasons: 1) visual input, 2) sparse reward, and 3) high-dimensional action space. VRL3 uses 3 training stages to fully exploit 3 types of data sources, as shown in the figure above: stage 1 pretrains on non-RL data (ImageNet), stage 2 trains on offline RL data (demonstrations), and stage 3 finetunes with online RL data.

Although using pretraining to improve performance is not new, VRL3 is the first work that successfully combines non-RL, offline RL, and online RL data to achieve this level of performance on robotic manipulation tasks.
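Below is a minimal structural sketch of the three-stage recipe in Python. All names (Agent, the stage functions, their arguments) are illustrative placeholders, not the released VRL3 API; the actual implementation is at github.com/microsoft/VRL3.

```python
# Structural sketch only: each stage body is a placeholder for the real training code.
from dataclasses import dataclass
from typing import Any

@dataclass
class Agent:
    encoder: Any = None   # convolutional visual encoder (pretrained in stage 1)
    actor: Any = None     # policy network
    critic: Any = None    # Q networks

def stage1_pretrain_encoder(agent: Agent) -> Agent:
    """Stage 1: non-RL pretraining of the encoder, e.g. ImageNet classification."""
    # agent.encoder = train_on_imagenet(...)   # placeholder
    return agent

def stage2_offline_rl(agent: Agent, demos: Any) -> Agent:
    """Stage 2: BC + conservative offline RL updates on a small demonstration set."""
    # for batch in demos: conservative_update(agent, batch)   # placeholder
    return agent

def stage3_online_rl(agent: Agent, env: Any, num_frames: int) -> Agent:
    """Stage 3: standard online RL (DrQv2-style backbone) with a safe Q target."""
    # for step in range(num_frames): online_update(agent, env)   # placeholder
    return agent

def train_vrl3(env: Any, demos: Any, num_frames: int) -> Agent:
    agent = stage1_pretrain_encoder(Agent())          # exploits non-RL data
    agent = stage2_offline_rl(agent, demos)           # exploits offline RL data
    agent = stage3_online_rl(agent, env, num_frames)  # exploits online RL data
    return agent
```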

Main Results: 

The figure below shows a performance comparison with RRL (the previous SOTA), DrQv2fD (DrQv2 with demonstrations in the buffer), and FERM (our implementation, which performs much better on Adroit than its original codebase). VRL3 has the best overall performance. FERM and DrQv2fD can learn the first 3 tasks but fail completely on the hardest task, Relocate.

On Adroit, VRL3 achieves a new level of SOTA sample efficiency (780% better on average), parameter efficiency (3 times better), and computation efficiency (10 times faster to solve the hardest Relocate task). With a wider encoder, we reach up to 2440% better sample efficiency than RRL on Relocate. VRL3 is also competitive on DMC (details in the paper).

On the hardest Relocate task with a single V100 GPU, the prior SOTA takes 11M data points and 50-60 hours to reach 90% success; VRL3 takes only 0.9M data points (0.45M finetuned) and 5-6 hours.

Summary of Insights: 

Core Design Decisions: 

The table below shows how the core design decisions of VRL3 compare to those of other popular methods. VRL3 fully utilizes non-RL, offline RL, and online RL datasets. DA refers to data augmentation, Con to contrastive learning, and Offline to offline RL. Note that offline RL data can be used to learn the representation (encoder), the task (actor, critic), or both.

Effect of Each Stage: 

The figure below shows (averaged over 4 Adroit tasks):

(a) Effect of each training stage. The numbers after "S" indicate which stages are enabled. S1 can make learning faster. S2 is critical due to the sparse-reward setting. S3 is always enabled to achieve non-trivial performance.

(b) Effect of enabling encoder training in each stage. We achieve the best results when all 3 stages are utilized. Task-specific features are more useful; however, when RL data is scarce, non-RL pretraining provides better results despite the large domain gap (S3 > S1 > S2).

Stage Transitions: 

Stage 1 -> Stage 2: we use convolutional channel expansion (CCE), a simple technique that expands the first conv layer of the encoder so that it can take in multiple stacked video frames in the RL task (details in the paper).
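A minimal PyTorch sketch of this kind of channel expansion, assuming the stage-1 encoder was pretrained on single 3-channel RGB images and the RL task stacks several frames. The tiling-and-rescaling initialization below is an illustrative choice, not necessarily the exact CCE procedure (see the paper for details):

```python
import torch
import torch.nn as nn

def expand_first_conv(conv: nn.Conv2d, num_frames: int) -> nn.Conv2d:
    """Expand a pretrained conv layer so it accepts num_frames stacked RGB frames."""
    new_conv = nn.Conv2d(
        in_channels=conv.in_channels * num_frames,
        out_channels=conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Tile the pretrained kernels across the stacked frames and rescale so the
        # initial layer output stays at roughly the same magnitude as before.
        new_conv.weight.copy_(conv.weight.repeat(1, num_frames, 1, 1) / num_frames)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Hypothetical usage (attribute name is illustrative):
# encoder.conv1 = expand_first_conv(encoder.conv1, num_frames=3)
```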

Stage 2 -> Stage 3: we use the Safe Q target technique, which simply constrains the maximum Q value. This allows a smooth offline-to-online transition with minimal modification to the backbone algorithm.
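A minimal sketch of the idea of capping the bootstrapped target, assuming per-step rewards are reshaped into [0, 1] and a discount of 0.99; the function and argument names are illustrative, and the exact formulation used in VRL3 is given in the paper:

```python
import torch

def safe_q_target(reward: torch.Tensor, next_q: torch.Tensor,
                  gamma: float = 0.99, r_max: float = 1.0) -> torch.Tensor:
    """Bootstrapped TD target capped at the largest Q value the reward scale allows."""
    q_max = r_max / (1.0 - gamma)      # = 100 when r_max = 1 and gamma = 0.99
    target = reward + gamma * next_q   # standard TD target
    return target.clamp(max=q_max)     # the "safe" cap limits value overestimation
```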

The figure below shows (averaged over 4 Adroit tasks):

(a) Effect of BC and offline RL updates in stage 2. Only applying BC or disabling stage 2 training entirely leads to poor performance. 

(b) and (c) Taking naive RL updates in stage 2 (S2 Naive) leads to severe overestimation, which can be mitigated by safe Q (S2 Naive Safe). Taking conservative updates in stage 3 (S3 Cons) leads to underestimation. VRL3 gives the best performance with conservative updates only in stage 2, plus safe Q. Note that in our setting we use a discount of 0.99 and reshape the per-step maximum reward to 1, so the maximum reasonable Q value should be around 100 (see the short derivation after this list).

(d) Effect of different encoder learning rate scales (e.g., 0.01 means the encoder's learning rate is 100 times smaller than that of the policy and Q networks).
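For reference, the "maximum reasonable Q value of around 100" mentioned in (b) and (c) follows directly from the discount and the reshaped reward:

$$ Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \le \sum_{t=0}^{\infty} \gamma^t r_{\max} = \frac{r_{\max}}{1 - \gamma} = \frac{1}{1 - 0.99} = 100. $$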

Long-Term Performance:

The figure below compares VRL3 to the long-term performance of RRL. VRL3 has much stronger short-term sample efficiency and slightly stronger long-term performance. RRL results are provided by its authors.

Hyperparameter Sensitivity and Robustness: 

If we were to apply VRL3 to a different task, how do we know which hyperparameters to finetune? Which components are important, which are robust, and which are sensitive and should be tuned first?

The following 2 figures show an extensive hyperparameter study. For each figure, the caption gives the type of hyperparameter, the x-axis shows different hyperparameter values, and the y-axis shows the average success rate over the first 1M frames of training, averaged over three seeds. Each dot is the average success rate of one seed, and the error bar shows one standard deviation. We use (S) to denote a critical hyperparameter that is relatively sensitive and should be tuned with high priority when VRL3 is applied to a new task; (R) denotes a critical hyperparameter that is robust or easy to tune.

Note that the encoder learning rate scale is the only important and sensitive hyperparameter introduced by VRL3, while the other 2 are from the backbone algorithm. 
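A minimal PyTorch sketch of what the encoder learning rate scale means in practice: the pretrained encoder gets its own optimizer with a scaled-down learning rate, while the actor and critic use the base learning rate. The names and default values below are illustrative, not the released configuration:

```python
import torch

def make_optimizers(encoder, actor, critic, base_lr=1e-4, encoder_lr_scale=0.01):
    """Separate optimizers; the encoder learns encoder_lr_scale times slower."""
    encoder_opt = torch.optim.Adam(encoder.parameters(), lr=base_lr * encoder_lr_scale)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=base_lr)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=base_lr)
    return encoder_opt, actor_opt, critic_opt
```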

Competitive on DMC: 

Our backbone algorithm is DrQv2 (thanks to the authors for providing a very clean and efficient codebase), which is a SOTA algorithm on DMC. Since we make only the minimal required modifications to the backbone to produce VRL3, it makes sense that VRL3 should still work on DMC.

The following figures show a performance comparison on 24 DMC tasks. Here VRL3, DrQv2fD, and FERM use 25K stage 2 data points (collected by a trained DrQv2 agent).

Although VRL3 has the best overall performance, we should not consider it a better method than DrQv2 on DMC, because DMC is an online RL benchmark. These results only aim to show that VRL3 still works on the popular DMC benchmark and achieves the same level of SOTA performance.

Also note that the performance gap between VRL3 and FERM/DrQv2fD is smaller than on Adroit. Further investigation is needed to determine how much we can benefit from pretraining on realistic image data in tasks with less realistic visuals, such as the Atari games.

For a ton of other interesting results and technical details, please check out the paper.