The BRO algorithm is based on the well-established Soft Actor-Critic (SAC) [1] and is composed of the following key components:
Bigger - BRO uses a scaled critic network with a default of ~5M parameters, approximately 7 times larger than the typical SAC critic, as well as increased training density, with a replay ratio of 10 for BRO and 2 for the BRO (Fast) version.
Regularized - the BroNet architecture, intrinsic to the BRO approach, incorporates regularization and stability-enhancing strategies, including Layer Normalization after each dense layer, weight decay, and full-parameter resets [2] (see the sketch after this list).
Optimistic - BRO uses dual-policy optimistic exploration [3] and non-pessimistic quantile Q-value approximation [4] to balance exploration and exploitation.
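To make the architecture concrete, below is a minimal PyTorch sketch of a BroNet-style critic in which every dense layer is followed by Layer Normalization and the body is a stack of residual blocks. The class names, default width and depth, and the quantile head size are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of a BroNet-style critic: every dense layer is followed by
# LayerNorm, and the trunk is a stack of residual blocks. Class names, widths,
# depth, and the quantile head size are illustrative assumptions.
import torch
import torch.nn as nn


class BroNetBlock(nn.Module):
    """Residual block: Dense -> LayerNorm -> ReLU -> Dense -> LayerNorm, plus skip."""

    def __init__(self, width: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width),
            nn.LayerNorm(width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.LayerNorm(width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class BroNetCritic(nn.Module):
    """Quantile Q(s, a) network: input projection followed by residual blocks."""

    def __init__(self, obs_dim: int, act_dim: int,
                 width: int = 512, depth: int = 2, n_quantiles: int = 100):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Linear(obs_dim + act_dim, width),
            nn.LayerNorm(width),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[BroNetBlock(width) for _ in range(depth)])
        self.head = nn.Linear(width, n_quantiles)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = self.stem(torch.cat([obs, act], dim=-1))
        return self.head(self.blocks(x))
```

Width and depth are the knobs varied in the scaling experiments; as in SAC, two such critics are maintained, and the CDQ discussion below concerns how their estimates are combined.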
The key contribution of this paper is showing how to scale the critic network effectively. We recall that naively increasing critic capacity does not necessarily lead to performance improvements and that successful scaling depends on a carefully chosen suite of regularization techniques. BroNet resembles the FFN sub-layer used in modern LLM architectures, differing primarily in the placement of the Layer Norms. Crucially, we find that BroNet scales more effectively than other architectures. However, the right choice of architecture and scaling is not a silver bullet: as shown below, when these are plugged naively into the standard SAC algorithm, performance is weak. The important additional elements are regularization (weight decay and network resets) and optimistic exploration. Interestingly, we did not find benefits from scaling the actor network.
Figure 1: Scaling the critic parameter count for a vanilla dense network, a spectral-normalized ResNet, and our BroNet, for BRO (left) and SAC (right). We conclude that the best performance requires both the right architecture (BroNet) and the algorithmic enhancements encapsulated in BRO. We report interquartile mean performance after 1M steps, with error bars indicating 95% CIs across 10 seeds. The X-axis reports the approximate parameter count.
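The regularization elements mentioned above require little extra machinery. The sketch below, with assumed hyperparameter values and helper names, shows weight decay applied through AdamW and periodic full-parameter resets that re-initialize the network and optimizer state while keeping the replay buffer.

```python
# Sketch of the regularization pieces: weight decay via AdamW and periodic
# full-parameter resets (fresh parameters and optimizer state, replay buffer
# kept). The reset interval, learning rate, and helper names are assumptions.
import torch
import torch.nn as nn


def make_critic_and_optimizer(obs_dim: int, act_dim: int,
                              width: int = 512, weight_decay: float = 1e-4):
    # A BroNet-style critic would be used in practice; a small MLP keeps the
    # sketch self-contained.
    critic = nn.Sequential(
        nn.Linear(obs_dim + act_dim, width), nn.LayerNorm(width), nn.ReLU(),
        nn.Linear(width, 1),
    )
    optimizer = torch.optim.AdamW(critic.parameters(), lr=3e-4,
                                  weight_decay=weight_decay)
    return critic, optimizer


def maybe_full_reset(step: int, reset_interval: int,
                     obs_dim: int, act_dim: int, critic, optimizer):
    """Every `reset_interval` gradient steps, discard the parameters and the
    optimizer state and restart from a fresh initialization."""
    if step > 0 and step % reset_interval == 0:
        return make_critic_and_optimizer(obs_dim, act_dim)
    return critic, optimizer
```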
Increasing the replay ratio is another axis of scaling. We investigate the mutual interactions by measuring performance across different model scales (from 0.55M to 26M parameters) and replay-ratio settings (from RR=1 to RR=15). The figure below reveals that model scaling has a strong impact, plateauing at ~5M parameters. For example, a 26M model with RR=1 achieves better performance than a small model with RR=15, even though the 26M model requires three times less wall-clock time. Importantly, model scaling and increasing the replay ratio work well in tandem and are interchangeable to some degree. We additionally note that the replay ratio has a bigger impact on wall-clock time than the model size: scaling the replay ratio adds inherently sequential computation, whereas scaling the model size adds computation that can be parallelized. For these reasons, BRO (Fast), with RR=2 and a 5M-parameter network, offers an attractive trade-off, being very sample-efficient and fast at the same time.
Figure 2: To account for sample efficiency, we report the performance averaged over 250k, 500k, 750k, and 1M environment steps across 5 replay ratios and 5 critic model sizes. All agents were evaluated on 40 tasks with 10 random seeds. The left panel shows performance scaling with increasing replay ratios (shapes) and model sizes (colors). The right panel examines the trade-off between performance and computational cost when scaling replay ratios versus critic model sizes. Increasing model size leads to substantial performance improvements at lower compute cost compared to increasing the replay ratio.
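The role of the replay ratio in the training loop is easy to see in a sketch: RR gradient updates are executed one after another for every environment step, so they add wall-clock time that cannot be parallelized away, whereas a wider critic only grows the (parallelizable) work inside each update. The callables below are placeholders for the surrounding agent code.

```python
# Sketch of where the replay ratio enters the loop: RR sequential gradient
# updates per environment step. Function names are placeholders, not part of
# the BRO codebase.
from typing import Any, Callable


def train_loop(total_env_steps: int,
               replay_ratio: int,
               collect_transition: Callable[[], Any],
               add_to_buffer: Callable[[Any], None],
               sample_batch: Callable[[], Any],
               gradient_update: Callable[[Any], None]) -> int:
    """Run `total_env_steps` environment steps with `replay_ratio` updates each."""
    updates_done = 0
    for _ in range(total_env_steps):
        add_to_buffer(collect_transition())   # one environment interaction
        for _ in range(replay_ratio):         # RR updates, strictly sequential
            gradient_update(sample_batch())   # each uses a freshly sampled batch
            updates_done += 1
    return updates_done                       # = total_env_steps * replay_ratio
```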
BRO utilizes two mechanisms to increase optimism. We observe significant improvements stemming from these techniques in both the BRO and BRO (Fast) agents (Figure below). The first mechanism is deactivating Clipped Double Q-learning (CDQ), a technique commonly employed in reinforcement learning to mitigate Q-value overestimation [5]. This is perhaps surprising, as it goes against conventional wisdom; however, recent work has already suggested that regularization can effectively combat overestimation [6]. Our analysis indicates that using a risk-neutral Q-value approximation in the presence of network regularization unlocks significant performance improvements without value overestimation. The second mechanism is optimistic exploration. We implement the dual-actor setup [3], which employs separate policies for exploration and temporal-difference updates. The exploration policy follows an optimistic upper-bound Q-value approximation, which has been shown to improve the sample efficiency of SAC-based agents. In particular, we optimize the optimistic actor towards a KL-regularized Q-value upper bound [3], with the upper bound computed with respect to epistemic uncertainty estimated according to the methodology presented in [4].
Figure 3: Impact of removing various BRO components on its performance. We report the percentage of the final performance for BRO (left) and BRO (Fast) (right). The y-axis shows the ablated components: -Scale denotes using a standard-sized network, +CDQ denotes using pessimistic Clipped Double Q-learning (removed by default in BRO), +RR=1 denotes using the standard replay ratio of 1, -Dual Pi removes optimistic exploration, and -Quantile and -WD denote removing quantile Q-values and weight decay, respectively. We report the interquartile mean and 95% CIs with 10 random seeds for each task. The results indicate that Scale, CDQ, and RR=1 are the most impactful components for BRO. Since BRO (Fast) uses RR=2 by default, reducing it to 1 does not significantly affect its performance.
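The two optimism mechanisms described above can be summarized in a short sketch: without CDQ the TD target averages the two critics instead of taking their minimum, and the exploration actor maximizes an optimistic value formed as the mean plus a bonus proportional to the critics' disagreement. The disagreement-based uncertainty term and the coefficient beta are simplifying assumptions; the exact estimator follows [4] and the KL-regularized objective follows [3].

```python
# Sketch of the two optimism mechanisms: (i) CDQ disabled, so the TD target
# averages the two critics rather than taking their minimum; (ii) an optimistic
# value for the exploration actor, using critic disagreement as a crude proxy
# for epistemic uncertainty (the estimator of [4] may differ in detail).
import torch


def td_target_q(q1: torch.Tensor, q2: torch.Tensor,
                use_cdq: bool = False) -> torch.Tensor:
    """Combine two critics' quantile estimates, shape [batch, n_quantiles]."""
    if use_cdq:                      # pessimistic: Clipped Double Q-learning
        return torch.minimum(q1, q2)
    return 0.5 * (q1 + q2)           # BRO default: risk-neutral mean


def optimistic_exploration_value(q1: torch.Tensor, q2: torch.Tensor,
                                 beta: float = 1.0) -> torch.Tensor:
    """Upper-bound value maximized by the exploration policy."""
    mean = 0.5 * (q1 + q2)
    disagreement = 0.5 * (q1 - q2).abs()          # epistemic-uncertainty proxy
    return (mean + beta * disagreement).mean(-1)  # average over quantiles
```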