A Workflow for Offline Model-Free Robotic RL

{Aviral Kumar*, Anikait Singh*}, Stephen Tian, Chelsea Finn, Sergey Levine
UC Berkeley, Stanford University (*Equal Contribution)

Contact emails: {aviralk, asap7772}@berkeley.edu

TL;DR: We propose a set of metrics and guidelines for tuning certain aspects of offline RL methods for robotics tasks.

Main Paper + Appendix: Paper Link

DR3 used in this paper: DR3 preprint

Project Summary Video

Details, a discussion of the method, and experimental results are provided below on this webpage.

Details of the training datasets (example trajectories) are on the "Dataset Details" page of this website.

Our Proposed Workflow

Our approach chooses conservative Q-learning (CQL) as the base algorithm and characterizes the behavior of the method as either underfitting or overfitting. Underfitting occurs when the training error (the temporal difference error or the value of the regularizer that CQL adds) is large and does not decrease over the course of training. Overfitting occurs when the average Q-value learned by the algorithm initially increases over the course of training and then decreases with additional gradient steps. When overfitting is detected, we propose to perform early stopping (policy selection) near the peak in Q-values and to apply a capacity-decreasing regularizer (e.g., a variational information bottleneck) to address the overfitting. When underfitting is detected, we propose to utilize higher-capacity networks or apply capacity-increasing regularization.
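As a rough illustration, these decision rules can be written as a simple check over logged training statistics (average Q-values and TD errors at each checkpoint). The Python sketch below is illustrative only, not the code used in the paper; the `diagnose_run` helper and its thresholds are assumptions.

```python
import numpy as np

def diagnose_run(avg_q_values, td_errors, td_threshold=1.0, rel_drop=0.05):
    """Classify a CQL run as overfitting, underfitting, or neither, given
    per-checkpoint training statistics (the thresholds here are illustrative)."""
    q = np.asarray(avg_q_values, dtype=float)
    td = np.asarray(td_errors, dtype=float)

    # Underfitting: the TD error (or the CQL regularizer value) stays large
    # and does not decrease over the course of training.
    if td[-1] >= td_threshold and td[-1] >= td[0]:
        return "underfitting"

    # Overfitting: average Q-values rise to a peak and then drop noticeably
    # with further gradient steps.
    peak = int(np.argmax(q))
    if peak < len(q) - 1 and q[-1] < q[peak] - rel_drop * abs(q[peak]):
        return "overfitting"

    return "neither"
```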

A schematic of our proposed workflow for deciding regularization, model capacity, and policy selection for CQL.

Tuning CQL in Real-World Domains

Scenario #1. Real WidowX pick and place task

In this task, we follow our workflow and first inspect the learned Q-values and the TD error, and find that the base CQL algorithm is overfitting.

Policy selection: Guided by our proposed workflow, we deploy the policy checkpoint at 50K gradient steps, since this checkpoint appears right after the peak in Q-values.
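A minimal sketch of this selection rule, assuming per-checkpoint logs of the average Q-values (the smoothing window and the `select_checkpoint` helper are illustrative assumptions):

```python
import numpy as np

def select_checkpoint(checkpoint_steps, avg_q_values, smooth=3):
    """Return the gradient step of the checkpoint just after the peak in
    (lightly smoothed) average Q-values."""
    q = np.convolve(avg_q_values, np.ones(smooth) / smooth, mode="same")
    peak = int(np.argmax(q))
    # Pick the first checkpoint right after the smoothed peak.
    return checkpoint_steps[min(peak + 1, len(checkpoint_steps) - 1)]
```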

Correcting overfitting via regularization: We also correct for overfitting using the VIB regularizer, which attains the Q-values shown in brown in the figure above. Note that the Q-values no longer decrease, so we can utilize any checkpoint for evaluation.
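For reference, a VIB correction of this kind amounts to passing the encoder features through a stochastic bottleneck and adding a KL term to the CQL objective. The PyTorch sketch below shows one common way to implement such a bottleneck; the layer sizes, clamp range, and KL weight are assumptions rather than the exact settings used here.

```python
import torch
import torch.nn as nn

class VIBBottleneck(nn.Module):
    """Variational information bottleneck over the encoder features that feed
    the Q-function (an illustrative sketch)."""

    def __init__(self, feature_dim, latent_dim=64):
        super().__init__()
        self.mu = nn.Linear(feature_dim, latent_dim)
        self.log_std = nn.Linear(feature_dim, latent_dim)

    def forward(self, features):
        mu = self.mu(features)
        log_std = self.log_std(features).clamp(-5.0, 2.0)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        # KL( N(mu, std^2) || N(0, I) ), averaged over the batch
        kl = 0.5 * (mu.pow(2) + std.pow(2) - 2.0 * log_std - 1.0).sum(-1).mean()
        return z, kl

# In the training loop, the KL term is added to the CQL objective:
#   total_loss = cql_loss + beta * kl   # e.g. beta = 1e-2 (an assumption)
```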

As shown in the plots on the right, while base CQL at 200K gradient steps (top) attains a 3/9 success rate (orange dashed line), using policy selection (at the green dashed line in the top plot) improves the success rate to 7/9, and using the information bottleneck (bottom plot) gives success rates of 7/9-8/9 for any checkpoint.

Below we show the behaviors learned by our overfitting correction (CQL + VIB), which attains a success rate of 8/9.

Runs for tuned CQL (with overfitting correction via VIB):

Scenario #2. Sawyer Manipulation tasks (put lid on pot, open drawer)

In this scenario, the goal is to train a Sawyer robot to open a drawer and to put a lid on a pot, both in the presence of visually diverse distractor objects. We first run base CQL on these tasks and find that the Q-values do not first increase and then decrease, indicating an absence of overfitting. Instead, the Q-values only increase (dashed lines in the plot on the top right) and the temporal difference error is large and increasing (dashed lines in the plot on the bottom right), which indicates that CQL is underfitting. To mitigate underfitting, we utilize a larger ResNet architecture (solid lines in both plots) and observe that this stabilizes the Q-values and leads to small TD errors.
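As a rough sketch of this capacity increase, one can swap the image encoder for a ResNet trunk. The snippet below uses a torchvision ResNet-18 purely for illustration; the exact architecture used in the paper may differ.

```python
import torch.nn as nn
from torchvision.models import resnet18

def make_resnet_encoder(out_dim=256):
    """Build a higher-capacity image encoder from a ResNet-18 trunk
    (illustrative; not necessarily the architecture used in the paper)."""
    trunk = resnet18(weights=None)    # train from scratch on robot images
    feat_dim = trunk.fc.in_features   # 512 for ResNet-18
    trunk.fc = nn.Identity()          # expose the penultimate features
    return nn.Sequential(trunk, nn.Linear(feat_dim, out_dim), nn.ReLU())
```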

To assess the efficacy of our workflow, we evaluate the resulting policies for the base CQL algorithm and for the tuned CQL algorithm with the higher-capacity ResNet architecture, and find that using the ResNet architecture improves the performance of CQL from 0/12 successes on either task to a 9/12 success rate on the put-lid-on-pot task and an 8/12 success rate on the open-drawer task. We present sample evaluations on these tasks below, both for base CQL and for our tuned CQL.

Base CQL

Tuned CQL (w/ ResNet)

Tuned CQL (w/ ResNet)

Tuning CQL in Simulated Robotic Environments

Scenario #1. Tasks with varying amounts of offline data.

We consider two simulated domains with varying numbers of trajectories (50, 100, 500, 10K) and run base CQL on each. We find an initially increasing and then decreasing trend in the Q-values (second row of the plot on the right), indicative of overfitting. Our workflow suggests selecting the policy checkpoint that occurs near the peak in Q-values, indicated by the light-colored vertical lines. For evaluation purposes only, we plot the performance of every policy checkpoint in the top row and find that the checkpoint picked by our policy selection criterion indeed attains good performance compared to other checkpoints in the same run, validating the effectiveness of this guideline.

Since these runs exhibit overfitting, we apply a variational information bottleneck (VIB) regularizer and find that not only does this address the drop in Q-values with more training (shown in the plots below), but it also leads to much better and more stable performance, validating the effectiveness of adding a capacity-decreasing regularizer to combat overfitting.

Scenario #2. Pick and place with multiple objects.

In this scenario, we modify the pick and place task to utilize many more objects (1, 5, 10, 20, 35). We find that for runs with more objects, the TD error is large in absolute terms, as shown in the top row of the plots below (>= 1.0 on average), while the Q-values exhibit a stable, non-decreasing trend with more training, which implies that the method is underfitting.

To address underfitting in the case of 35 objects, we utilize a higher-capacity ResNet architecture to represent the policy and add a DR3 regularizer (details of DR3 here). In this case (bottom row in the plots above), we find that the TD error drops below 1.0 (compare the red and blue TD error lines), so our workflow suggests using this bigger architecture together with the DR3 penalty. For visualization purposes, we plot the performance of the policy over the course of training and find that (1) the performance is stable throughout training, and (2) the performance is highest for the underfitting correction that reduces the TD error the most (i.e., the red line, DR3 + ResNet). This indicates the efficacy of our workflow in detecting and addressing underfitting.
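For readers who want a concrete picture of the DR3 penalty, the sketch below follows the feature dot-product form described in the DR3 preprint: it penalizes the dot product between the penultimate-layer features of the current and next state-action pairs used in the Bellman backup. The weighting coefficient shown is an assumption.

```python
import torch

def dr3_penalty(phi_sa, phi_next_sa):
    """DR3-style explicit regularizer: dot product between the penultimate-layer
    features of (s, a) and of (s', a') from the Bellman backup, averaged over
    the batch (a sketch following the DR3 preprint)."""
    return (phi_sa * phi_next_sa).sum(dim=-1).mean()

# Added to the training objective alongside the CQL loss:
#   total_loss = cql_loss + c0 * dr3_penalty(phi_sa, phi_next_sa)  # e.g. c0 = 0.03
```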