Perceiver-Actor-Critic

Offline Actor-Critic Reinforcement Learning Scales to Large Models

Jost Tobias Springenberg*, Abbas Abdolmaleki*, Jingwei Zhang*, Oliver Groth*, Michael Bloesch*, Thomas Lampe*,
Philemon Brakel*, Sarah Maria Elisabeth Bechtle*, Steven Kapturowski*, Roland Hafner, Nicolas Heess, Martin Riedmiller
(*) core contributors

Key Contributions

Apply offline RL to scalable transformer networks without additional costs.
Efficient action and value sampling using Perceiver-style cross-attention.
Seamless transition between behavioral cloning (BC) and reinforcement learning (RL) at any stage of the training pipeline.
Indications that model performance of offline RL scales better with increasing compute than pure BC.

State-of-the-art low-level control

Multi-modal: processes vision, proprioception and language simultaneously
Scalable: up to 1B parameters
Fast: 20 Hz inference speed on a local GPU (RTX 3090)
General: solves 78 continuous control tasks with 78% average success rate
Self-improving: able to achieve ‘task mastery’ (>90% success rate on a target task) using iterative offline RL

Context

Large transformers have been used for control tasks via behavioral cloning (BC), similar to how large language models are pre-trained. In real-world application domains like robotics, however, expert data is often scarce. Offline reinforcement learning (RL), in contrast to BC, can use non-expert data, but it is not clear how well offline actor-critic methods in particular scale to larger models. For this project, we set out to investigate this question and perform a scaling analysis. We found that our proposed offline actor-critic system does indeed scale to large models and has several advantages over previous BC-based methods.

Approach

Our training objective is a combination of offline RL (based on the MPO [1] algorithm) and behavioral cloning terms. The latter is used to ensure stable training and a scaling parameter allows us to interpolate smoothly between BC and RL as is illustrated by the following equation:

Or in more detail:
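As a hedged sketch, an objective of the kind described above (an MPO-style policy improvement term, a BC term and a critic term, with 𝛼 interpolating between RL and BC) can be written as follows. The notation is an assumption for illustration and may differ from the paper's exact formulation: D is the dataset, τ the task description, θ' the parameters of a target network, w the improvement weights with temperature η, and Γ the target distribution over Q-values.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(s_t, a_t, \tau) \sim \mathcal{D}} \Big[
    (1 - \alpha)\, \mathbb{E}_{a' \sim \pi_{\theta'}}\big[ w(a', s_t, \tau)\, \log \pi_\theta(a' \mid s_t, \tau) \big]
    + \alpha\, \log \pi_\theta(a_t \mid s_t, \tau)
    + \beta\, \mathbb{E}_{q \sim \Gamma_{\theta'}}\big[ \log p_\theta(q \mid s_t, a_t, \tau) \big]
\Big],
\qquad
w(a', s_t, \tau) \propto \exp\!\big( Q_{\theta'}(s_t, a', \tau) / \eta \big)
```

Under this reading, 𝛼=1 leaves only the BC and critic terms, while 𝛼=0 leaves only the policy improvement and critic terms.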

A schematic overview of our model architecture can be seen in the figure below:

Our actor-critic architecture outputs both action and value predictions. To process multiple input modalities, we combine cross- and self-attention blocks as in a Perceiver [2] model. This allows for an elegant integration of encodings of proprioception, images and language data while avoiding the quadratic compute complexity of a standard transformer. The action encodings are cross-attended to at the later stages of processing to ensure that they can sufficiently influence the value predictions. The cross-attentions at the decoding stage allow the actions and values to be predicted in a single step (in contrast to prior transformer-based approaches in which the actions were predicted one-by-one as a sequence) and are also particularly useful when evaluating values for multiple input actions in parallel. To make single-step prediction work across domains whose actions and observations have different dimensions, we zero-pad actions and observations.
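The following is a minimal NumPy sketch of the two ideas in this paragraph: a small set of latents cross-attending to a long input sequence, and a decoder that handles several candidate actions in one step. It is not the authors' implementation. The attention has a single head and no learned projections, and `d_model`, `num_latents` and `num_actions` are assumed values chosen for illustration; only the ~2.6K token count and the 38 action dimensions come from the text.

```python
import numpy as np


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)              # (num_queries, num_context)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context                                # (num_queries, d)


def zero_pad(vector, width):
    """Right-pad a 1-D vector with zeros up to a shared width."""
    padded = np.zeros(width, dtype=vector.dtype)
    padded[: vector.shape[0]] = vector
    return padded


rng = np.random.default_rng(0)
d_model = 16        # assumption
num_tokens = 2600   # roughly the per-timestep token count quoted in the text
num_latents = 32    # assumption, not a value from the source
num_actions = 8     # candidate actions handled in parallel (assumption)

tokens = rng.normal(size=(num_tokens, d_model))
latents = rng.normal(size=(num_latents, d_model))

# Encoder: the latents read from all input tokens.
latents = cross_attention(latents, tokens)

# Decoder: one query per candidate action reads from the latents in a single step.
action_queries = rng.normal(size=(num_actions, d_model))
decoded = cross_attention(action_queries, latents)

cross_pairs = num_latents * num_tokens   # attention scores computed by the encoder
self_pairs = num_tokens * num_tokens     # scores full self-attention would compute

# A 5-dimensional action padded to the largest action size mentioned (38).
padded_action = zero_pad(np.arange(1.0, 6.0), 38)
```

With these assumed sizes the encoder computes 83,200 attention scores, against 6,760,000 for full self-attention over the same tokens. That linear-versus-quadratic gap in the input length is what the paragraph refers to.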

The efficiency of the cross-attention blocks and single-step formulation of action prediction allows our models to process high-dimensional data at speeds that are still practical enough for real robotics settings. For example, we were able to run a 1B model processing ~2.6K tokens per timestep at 20 Hz controlling embodiments with up to 38 DoFs. In contrast, the Q-Transformer architecture [3] is about 30x smaller and can only do ~5 Hz due to its quadratic complexity for input processing and output decoding.

We train models of various sizes that we refer to by identifiers ranging from XXS (32M) to L (988M). We also train two different versions of the model: one in which the action-value function is estimated (PAC) and one in which the state value function is estimated (PAC+V).

Results and Videos

Our data set contains a wide variety of task domains:

GATO Control Suite data [4] consists of records of 32 diverse continuous control tasks (featuring up to 223 proprioceptive and 38 action dimensions) being solved with an RL algorithm from scratch.
CHEF data [5] consists of records of a 5-DoF Sawyer robot learning to stack two objects across five different object sets in simulation and in the real world from scratch using an RL algorithm as well.
RoboCat Tower + Pyramid datasets [6] consist of data from a 7-DoF Panda robot learning to build three-object towers and pyramids with object sets from the RGB Stacking benchmark [7] using an RL algorithm.
RoboCat Insertion (based on NIST [8]) features the same robot inserting three differently sized gears onto pegs, but is collected by human teleoperators.

In total, we train our models on a data mix of 4.024M episodes, containing a total number of 2.84T tokens across all modalities (vision, proprioception, language and action).

Here are a couple of videos comparing the behaviors of PAC and BC models on these task domains:

Videos: BC (set2), RL (set2); BC (set3), RL (set3); BC (set1), RL (set1).

Policy success rates across tasks in each task family for 100 evaluations per task. The average success rate in the training data is reported as pD. For GATO: Control, the percentage of achieved expert average reward and the standard-error-based 95% CIs are reported. For all other task families, the average success rates and their corresponding Wilson score intervals for α=0.05 are reported. Best results (within CI of the best mean) in each row are bold († cited from [4]; ★ cited from [6]).

From the table above we want to highlight the following results:

GATO / RC is outperformed by all the other methods on all tasks, except for RC:Pyramid, where performance is on par. Since this includes experiments where our architecture was also trained with BC, this result indicates that the difference may be caused by the changes in the architectural backbone (e.g., the input cross-attention).
PAC shines when the data has a low average success rate, as can be seen for the CHEF:sim task where the average success rate is only 28%. This highlights that our method fulfills one of the main promises of offline-RL: it can learn successful policies even from severely sub-optimal data.
𝛼-PAC, which uses a different alpha for each dataset based on the proportion of successful episodes in each of them, achieves the best performance on almost all task domains.
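The Wilson score intervals mentioned in the table caption can be computed with the standard formula. The sketch below uses only the Python standard library; the function name and the example counts (78 successes out of 100 evaluations) are illustrative assumptions, not values taken from the table.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, alpha=0.05):
    """Wilson score interval for a binomial success rate at level 1 - alpha."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for alpha = 0.05
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials ** 2))
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 78 successes in 100 evaluation episodes.
low, high = wilson_interval(successes=78, trials=100)
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and remains informative when the observed success rate is 0% or 100%, which matters with only 100 evaluations per task.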

RL by Design

As explained above, the choice of our training objective makes it easy to transition from BC to RL training by changing the value of 𝛼. On the CHEF:real domain, we first train with BC+Q (𝛼=1, β>0, i.e. BC with a Q-value head that is trained with a temporal difference loss even though it isn't used during pre-training) for 3M steps, after which performance is very low at 7.1%. If we then continue training for 3M steps with 𝛼=0, performance increases significantly to 61.9%. This demonstrates that we can safely transition from BC to RL at any point during the training process.
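To make the role of 𝛼 concrete, here is a small standard-library sketch of a per-sample loss of the form (1 - 𝛼) · RL term + 𝛼 · BC term + β · critic term. It assumes that 𝛼=1 recovers BC and 𝛼=0 recovers pure RL, as the schedule described above implies. The function names, the softmax weighting with temperature `eta`, and all numbers are assumptions for illustration rather than the paper's exact objective.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def interpolated_loss(logp_data_action, logp_sampled, q_sampled, critic_loss,
                      alpha, beta, eta=1.0):
    """(1 - alpha) * RL term + alpha * BC term + beta * critic term."""
    # RL term: log-likelihood of sampled actions, weighted by exponentiated Q-values.
    weights = softmax([q / eta for q in q_sampled])
    rl_term = -sum(w * lp for w, lp in zip(weights, logp_sampled))
    # BC term: negative log-likelihood of the action stored in the dataset.
    bc_term = -logp_data_action
    return (1.0 - alpha) * rl_term + alpha * bc_term + beta * critic_loss


# Made-up numbers for one transition with three sampled actions.
example = dict(
    logp_data_action=-1.2,
    logp_sampled=[-0.5, -2.0, -1.0],
    q_sampled=[1.0, 0.0, 0.5],
    critic_loss=0.3,
    beta=1.0,
)
bc_plus_q = interpolated_loss(alpha=1.0, **example)   # BC term + critic term only
pure_rl = interpolated_loss(alpha=0.0, **example)     # RL term + critic term only
halfway = interpolated_loss(alpha=0.5, **example)     # equal mix of both
```

Because the critic term is present for every 𝛼 (as long as β>0), the Q-value head keeps training during the BC phase, so switching 𝛼 later changes only the policy term and requires no change to the model.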

We also investigated the benefits of closing the loop by extending our dataset with data generated by evaluation of the pre-trained model itself. This can be seen as one batch of off-policy RL and we refer to this setup as RLFT. As shown in the table below, the real robot task benefits a lot from further and repeated training with self-generated data.
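The loop described here can be sketched as follows. This is a schematic outline only: `train` and `collect` are hypothetical callables standing in for offline training and policy evaluation, the toy stand-ins exist only to make the example runnable, and whether each round continues from the previous checkpoint or retrains is left to `train` and is not specified by this sketch.

```python
def iterate_offline_rl(train, collect, dataset, rounds):
    """Alternate offline training with extending the dataset by the policy's own rollouts."""
    policy = train(dataset)
    sizes = [len(dataset)]
    for _ in range(rounds):
        dataset = dataset + collect(policy)   # builds a new list; earlier data is kept
        policy = train(dataset)
        sizes.append(len(dataset))
    return policy, dataset, sizes


# Toy stand-ins (hypothetical): a "policy" only records how much data it saw.
def toy_train(dataset):
    return {"trained_on": len(dataset)}


def toy_collect(policy):
    return [("rollout", policy["trained_on"], i) for i in range(3)]


initial_data = [("demo", 0, 0)] * 4
policy, data, sizes = iterate_offline_rl(toy_train, toy_collect, initial_data, rounds=2)
```

Each round is one batch of off-policy learning: all data gathered so far, including earlier rollouts, stays in the training set.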

RL vs BC at Scale

To get more insight into the scaling behavior of PAC and BC models, we trained models of various sizes and performed a scaling analysis similar to the one of the Chinchilla model [9]. By performing curve fitting, we obtain functions whose graphs represent relations between various quantities of interest like the average return and the amount of computation in FLOPs. The figure below shows that the average returns obtained by PAC scale better with the total number of FLOPs than those of the BC models.
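As an illustration of this kind of curve fitting, the sketch below fits a power law y = a · x^b by linear regression in log-log space, a common choice in Chinchilla-style analyses. The power-law form, the function name and the synthetic data points are assumptions; they are not the fits or measurements from this work.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    b = cov / var                          # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)      # intercept in log-log space = log(a)
    return a, b


# Synthetic data generated from a known power law (not measurements).
true_a, true_b = 0.1, 0.5
flops = [1e18, 1e19, 1e20, 1e21]
params = [true_a * c ** true_b for c in flops]

a, b = fit_power_law(flops, params)
extrapolated = a * 1e22 ** b   # predicted value at a larger compute budget
```

Once such a relation has been fitted, it can be evaluated at budgets beyond the measured range, which is how trends like those in the scaling plots are extrapolated.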

Return profile comparison between BC+Q and PAC for various model sizes.

When we constrain the data budget to a single epoch of 2.45T tokens, the curve fits suggest training a 1.33B parameter model in the BC case, whereas for PAC a smaller model of only 954M parameters is suggested. Data-wise, BC and PAC scale nearly the same according to our analysis, but the RL objective seems to benefit more from additional parameters than BC as the compute budget increases. This suggests that the capacity needed for the Q-function is larger than for BC prediction.

For both the token and parameter scaling plots (left, right) for PAC, we indicate the scaling trend with a dashed red line. The green intersection represents the optimality point when training on a single epoch of our data while the teal intersection represents the optimal data and parameter trade-off for a FLOP budget of 1E+21.

Another way to visualize the scaling behavior is by plotting iso-return contours where the average return is constant for different numbers of parameters and FLOPs. As the figure below shows, the contours for PAC models are shifted to the top left and this difference becomes more pronounced for higher average return levels. This indicates that for fixed FLOP budgets, the PAC models achieve higher average return plateaus.

Iso-return contour plots comparing BC+Q and PAC.

Conclusion

To conclude, we apply offline RL to scalable transformer networks without additional costs, using an architecture that allows efficient action and value sampling via Perceiver-style cross-attention. This architecture is natively multi-modal and processes vision, proprioception and language inputs simultaneously. It is also efficient enough to run with 1B parameters at 20 Hz on a local GPU, enabling precise real robot control. Our setup allows seamless transition between BC and RL without model or objective changes at any stage of the training pipeline. We show empirically that our models are general and can learn to solve 78 continuous control tasks with 78% average success rate. When fine-tuned on rollouts generated by the model itself, one of our models also achieved 'task mastery' (>90% success rate on the target task). Finally, our scaling analysis suggests that the model performance of offline RL scales better with increasing compute than pure BC.

References

[1] Abdolmaleki, Abbas, et al. "Maximum a Posteriori Policy Optimisation." International Conference on Learning Representations. 2018.

[2] Jaegle, Andrew, et al. "Perceiver: General perception with iterative attention." International conference on machine learning. PMLR, 2021.

[3] Chebotar, Yevgen, et al. "Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions." Conference on Robot Learning. PMLR, 2023.

[4] Reed, Scott, et al. "A generalist agent." arXiv preprint arXiv:2205.06175 (2022).

[5] Lampe, Thomas, et al. "Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots." arXiv preprint arXiv:2312.11374 (2023).

[6] Bousmalis, Konstantinos, et al. "RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation." arXiv preprint arXiv:2306.11706 (2023).

[7] Lee, Alex X., et al. "Beyond pick-and-place: Tackling robotic stacking of diverse shapes." Conference on Robot Learning. PMLR, 2022.

[8] https://www.nist.gov/el/intelligent-systems-division-73500/iros-2017robotic-grasping-and-manipulation-competition

[9] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).

Citation

@misc{springenberg2024pac,
title={Offline Actor-Critic Reinforcement Learning Scales to Large Models},
author={Jost Tobias Springenberg and Abbas Abdolmaleki and Jingwei Zhang and Oliver Groth and Michael Bloesch and Thomas Lampe and Philemon Brakel and Sarah Maria Elisabeth Bechtle and Steven Kapturowski and Roland Hafner and Nicolas Heess and Martin Riedmiller},
year={2024},
eprint={2402.05546},
archivePrefix={arXiv},
}
