Bigger, Regularized, Optimistic:
scaling for compute and sample-efficient continuous control
NeurIPS'24 (spotlight)
1. Ideas NCBR, 2. Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, 3. Faculty of Electronics and Information Technology, Warsaw University of Technology, 4. Polish Academy of Sciences, 5. Nomagic
Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
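To make the notion of optimistic exploration concrete, the sketch below shows one standard realization: acting on an upper-confidence value estimate computed from a distributional (quantile) critic. The function name and the `beta` coefficient are illustrative assumptions; this is a minimal sketch of the general technique, not necessarily the exact mechanism used in BRO.

```python
# Minimal sketch of optimism over a quantile critic: act on mean + beta * std.
# `beta` and the interface are illustrative, not the paper's exact formulation.
import torch

def optimistic_value(quantiles: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """quantiles: (batch, n_quantiles) output of a distributional critic.

    Returns an optimistic value estimate that an exploration policy can
    maximize instead of the usual mean (or pessimistic) estimate."""
    mean = quantiles.mean(dim=-1)
    std = quantiles.std(dim=-1)
    return mean + beta * std  # beta > 0 rewards epistemic uncertainty
```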
Figure 3: BRO sets a new state of the art, outperforming model-free (MF) and model-based (MB) algorithms on 40 complex tasks across 3 benchmark suites. Y-axes report the interquartile mean computed over 10 seeds, with 1.0 representing the best possible performance. We use 1M environment steps.
Extensive empirical analysis - we conduct a large-scale empirical study of critic model scaling in continuous deep RL. By training over 15,000 agents, we explore the interplay between critic capacity, replay ratio, and a comprehensive set of design choices.
BRO algorithm - we introduce BRO, a novel model-free approach that combines the regularized BroNet critic architecture (see the sketch after this list) with domain-specific RL enhancements. BRO achieves state-of-the-art performance on 40 challenging tasks across diverse domains.
Scaling & regularization - we offer several insights, the most important being that regularized critic scaling outperforms replay ratio scaling in both performance and computational efficiency, and that the inductive biases introduced by domain-specific RL improvements can largely be replaced by critic scaling, leading to simpler algorithms.
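For a concrete picture of what regularized critic scaling looks like, below is a minimal PyTorch sketch of a BroNet-style critic: a dense stem followed by residual blocks that interleave linear layers with LayerNorm. The layer widths, depth, and scalar output head are illustrative placeholders rather than BRO's actual defaults; this is an architectural sketch, not the reference implementation.

```python
# Architectural sketch of a BroNet-style regularized critic (dense stem +
# LayerNorm residual blocks). Widths/depth are illustrative, not BRO's defaults.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.LayerNorm(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # residual path keeps deep/wide critics trainable

class BroNetStyleCritic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512, n_blocks: int = 2):
        super().__init__()
        self.stem = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden),
                                  nn.LayerNorm(hidden), nn.ReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, 1)  # scalar Q-value head for simplicity

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(self.stem(torch.cat([obs, act], dim=-1))))
```

In this sketch, scaling the critic amounts to increasing `hidden` and `n_blocks`, with the normalization and residual structure acting as the regularization that keeps larger networks trainable.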