Is Bang-Bang Control All You Need?
Solving Continuous Control with Bernoulli Policies

Abstract

Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the standard Gaussian with a Bernoulli distribution that solely considers the extremes along each action dimension - a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks - in contrast to robotic hardware, where energy and maintenance costs affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional imitation learning experiments to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges and evaluate factors to mitigate the emergence of bang-bang solutions. Our findings emphasize challenges for benchmarking continuous control algorithms, particularly in light of potential real-world applications.
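As a minimal sketch of the idea rather than the exact architecture used in our experiments, the bang-bang policy can be written as a per-dimension Bernoulli head over the two action extremes; the network sizes and the PyTorch framing below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Independent

class BangBangPolicy(nn.Module):
    """Per-dimension Bernoulli policy that only emits a_min or a_max."""

    def __init__(self, obs_dim, act_dim, a_min=-1.0, a_max=1.0, hidden=256):
        super().__init__()
        self.a_min, self.a_max = a_min, a_max
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),  # one logit per action dimension
        )

    def dist(self, obs):
        logits = self.net(obs)
        # Factorized policy: independent Bernoulli per action dimension.
        return Independent(Bernoulli(logits=logits), 1)

    def act(self, obs):
        d = self.dist(obs)
        b = d.sample()                          # 0/1 per dimension
        action = self.a_min + b * (self.a_max - self.a_min)
        return action, d.log_prob(b)            # log-prob for policy updates
```

A Gaussian head would instead output a mean and standard deviation per dimension; here the only learnable choice per dimension is which extreme to take, while the surrounding actor-critic machinery stays unchanged.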

Performance of Bang-Bang and Gaussian policies

Distribution of action samples for converged MPO

Rollout videos: Bang-Bang MPO vs. Gaussian MPO

Imitation Learning

The Bang-Bang and Gaussian policies achieve similar performance on several continuous control tasks from the DeepMind Control Suite. Furthermore, analyzing the distribution of action samples along trajectories of Gaussian policies trained with MPO reveals strong bang-bang behavior. To further investigate the similarity of converged Gaussian and Bang-Bang policies, we consider imitation learning with a Gaussian teacher. We find that on the majority of tasks the Bang-Bang policy is capable of learning highly performant behaviors, further indicating that the Gaussian teacher itself exhibits bang-bang action selection.
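Purely as an illustrative sketch (the exact distillation objective in our experiments may differ), a behavioral-cloning style loss for a Bernoulli student can map each teacher action to its nearest action extreme; `student_logits` and `teacher_actions` are placeholder tensors.

```python
import torch.nn.functional as F

def bc_loss(student_logits, teacher_actions, a_min=-1.0, a_max=1.0):
    """Behavioral-cloning loss: push each Bernoulli dimension toward the
    extreme that is closest to the Gaussian teacher's continuous action."""
    midpoint = 0.5 * (a_min + a_max)
    # Target is 1 where the teacher's action lies in the upper half of the range.
    targets = (teacher_actions > midpoint).float()
    return F.binary_cross_entropy_with_logits(student_logits, targets)
```

For a teacher whose actions already saturate at the bounds, snapping to the nearest extreme discards little information, which is consistent with the Bang-Bang student matching the teacher's returns.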

Learning from a Gaussian teacher

Robustness to modified domains and transfer under disturbances

Robustness

We evaluate the robustness of Bang-Bang and Gaussian policies under task variations:

(1) learning in perturbed environments (left)
(2) transfer under sensor disturbances (right)

These formulations are based on the Real-World RL Challenge framework [1]. We find that the Bang-Bang policies are robust to these changes and perform on par with their Gaussian counterparts. The performance similarity is therefore not limited to idealized environment formulations but extends across a variety of learning setups.
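The disturbances themselves follow the Real-World RL Challenge framework [1]; purely for illustration (not the framework's own API), a dropped-sensor disturbance can be emulated with a generic observation wrapper that zeroes random entries, where the drop probability is an assumed placeholder.

```python
import numpy as np
import gym

class DroppedSensors(gym.ObservationWrapper):
    """Randomly zeroes observation entries to mimic dropped sensor readings.

    Illustrative only: the actual perturbations follow the Real-World RL
    Challenge framework and may differ in rate and mechanism.
    """

    def __init__(self, env, drop_prob=0.1, seed=None):
        super().__init__(env)
        self.drop_prob = drop_prob
        self.rng = np.random.default_rng(seed)

    def observation(self, obs):
        # Assumes a flat NumPy observation vector.
        mask = self.rng.random(obs.shape) >= self.drop_prob
        return obs * mask
```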

Rollout videos: Bang-Bang MPO (Large Pole, Long Thighs, Long Torso, Long Shins, Dropped Sensors, Low Friction) and Gaussian MPO (Dropped Sensors, Low Friction)

Introducing Action Costs

Action penalties can mitigate the emergence of bang-bang behavior. We compare the effect of (1) quadratic value, (2) quadratic change, and (3) combined quadratic value and change action penalties. On the Quadruped task, we observe that action penalties improve the smoothness of the resulting Gaussian policies. On the Walker task, the constrained solution furthermore marks an alternative optimum whose performance is close to that under the unpenalized reward function. On the Quadruped task, however, the Gaussian policy trades off task performance for action minimization. We also observe that action penalization slightly improves robustness to transfer under disturbances.
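As a minimal sketch of the penalty variants, assuming flat NumPy action vectors and illustrative weights `c_value` and `c_change` rather than our tuned coefficients:

```python
import numpy as np

def penalized_reward(reward, action, prev_action,
                     c_value=0.1, c_change=0.1,
                     use_value=True, use_change=True):
    """Apply quadratic action penalties to a task reward.

    (1) quadratic value penalty:  c_value  * ||a_t||^2
    (2) quadratic change penalty: c_change * ||a_t - a_{t-1}||^2
    (3) both penalties combined.
    """
    penalty = 0.0
    if use_value:
        penalty += c_value * np.sum(np.square(action))
    if use_change:
        penalty += c_change * np.sum(np.square(action - prev_action))
    return reward - penalty
```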

Impact of action penalties on action distribution and performance

Pendulum swing-up with different action costs and policy types

Optimal Parameterization

From an optimal control perspective, the underlying reward structure often correlates with the optimal policy parameterization. Under certain conditions,

(1) problems that penalize only the state, without action penalties, can induce bang-bang solutions
(2) absolute-value action penalties can induce bang-off-bang solutions
(3) quadratic action penalties yield continuous (unconstrained) optimal policies

We evaluate the performance of Bang-Bang, Bang-Off-Bang, and Gaussian policies on a pendulum swing-up task with three different reward structures aligned with (1)-(3). For each reward structure, the optimal parameterization is highlighted in green. Bang-Bang policies converge quickly due to their extremal action selection and thus strong passive exploration. Bang-Off-Bang policies can also select zero actions, which reduces exploration. The ideal parameterization might therefore not yield the best exploration dynamics.
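To make the correspondence in (1)-(3) concrete, the sketch below writes the three per-step reward variants for a pendulum swing-up; the state-cost terms and the weight `w` are illustrative assumptions, not the exact coefficients used in our experiments.

```python
import numpy as np

def swingup_reward(theta, theta_dot, a, cost="none", w=0.1):
    """Per-step reward for pendulum swing-up under three action-cost variants.

    theta: angle from the upright position, a: applied torque.
    cost="none":      state cost only      -> bang-bang optimum expected
    cost="absolute":  + w * |a|            -> bang-off-bang optimum expected
    cost="quadratic": + w * a**2           -> continuous optimum expected
    """
    state_cost = theta**2 + 0.1 * theta_dot**2   # illustrative state penalty
    action_cost = {"none": 0.0,
                   "absolute": w * np.abs(a),
                   "quadratic": w * a**2}[cost]
    return -(state_cost + action_cost)
```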

Action Cost & Exploration

While action penalties limit the maximum performance of Bang-Bang policies, they can also negatively affect exploration in Gaussian or Bang-Off-Bang policies whenever reward maximization is traded off against minimizing action cost.

We consider three variations of the same task that differ in their reward structure:
(1) dense rewards without action cost
(2) sparse rewards without action cost
(3) sparse rewards with quadratic action cost

We observe that all three policy types perform similarly on the dense task versions. On the sparse versions, the Bang-Bang controller converges faster due to its strong passive exploration. Action penalties can exacerbate this effect: the Gaussian and Bang-Off-Bang controllers instead aim to minimize action cost, and only the Bang-Bang controller solves the tasks. This example further highlights the intricate interplay between avoiding bang-bang behavior and enabling sufficient exploration.
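A quick back-of-the-envelope calculation illustrates this trade-off under a quadratic action cost, before any sparse task reward has been found; the standard deviation and switching probability below are illustrative assumptions.

```python
# Expected per-step quadratic action penalty E[a^2] for one action dimension
# with bounds [-1, 1], before any task reward has been discovered.

# Zero-mean Gaussian exploration with std sigma (ignoring clipping at the
# bounds): E[a^2] = sigma^2, which shrinks to 0 as sigma -> 0.
sigma = 0.3
gaussian_penalty = sigma**2                # 0.09

# Bang-Off-Bang with P(a = -1) = P(a = +1) = p and P(a = 0) = 1 - 2p:
# E[a^2] = 2p, which shrinks to 0 as p -> 0.
p = 0.1
bang_off_bang_penalty = 2 * p              # 0.2

# Bang-Bang always acts at an extreme: E[a^2] = 1, independent of parameters.
bang_bang_penalty = 1.0

print(gaussian_penalty, bang_off_bang_penalty, bang_bang_penalty)
```

The Gaussian and Bang-Off-Bang policies can drive their expected penalty toward zero by collapsing sigma or p, which also collapses exploration, whereas the Bang-Bang policy always pays the maximal penalty but keeps exploring at the extremes.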

Impact of reward sparsity and action cost on exploration capabilities

References

  1. G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881, 2020.