Strength Through Diversity: Robust Behavior Learning via Mixture Policies


Diversity

Efficiency in robot learning is closely linked to hyperparameter selection. Most approaches to hyperparameter optimization require either sequential or parallel repetition of experiments, which strongly increases environment interactions and computational cost. We propose a training method that relies on only a single experiment. This is achieved by training mixture policies with diverse components: providing the low-level policies with different sets of hyperparameters and distribution types reduces the impact of individual design choices. Our approach yields robust and data-efficient learning by letting the agent select its controller structure conditioned on the task, while exploiting synergies between the individual controller designs.
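A minimal sketch of this idea in PyTorch, assuming a two-component setup with a narrow Gaussian and a bang-bang Categorical head; the class name, layer sizes, and head choices are illustrative assumptions, not the exact architecture used here:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class MixturePolicy(nn.Module):
    """Illustrative mixture policy over diverse component heads (assumption:
    two components; the method's mixtures can combine more and other types)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.weight_head = nn.Linear(hidden, 2)          # logits over components
        self.mean_head = nn.Linear(hidden, act_dim)      # Gaussian component mean
        self.log_std = nn.Parameter(torch.full((act_dim,), -1.0))  # narrow init
        self.bang_head = nn.Linear(hidden, act_dim * 2)  # per-dim {-1, +1} logits

    def act(self, obs):
        h = self.torso(obs)
        # Select a component per step, then sample from the selected head.
        which = Categorical(logits=self.weight_head(h)).sample()
        gauss = Normal(self.mean_head(h), self.log_std.exp()).sample()
        logits = self.bang_head(h).reshape(*h.shape[:-1], -1, 2)
        bang = Categorical(logits=logits).sample().float() * 2.0 - 1.0
        return torch.where(which.unsqueeze(-1).bool(), bang, gauss)
```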

Continuous Control

Continuous control enables the representation of intricate transitions in state-action space, yielding highly optimized behaviors through local exploration or generating smooth references for a low-level tracking controller. Asymmetric policies can guard against undesirable state-action half-spaces while exploring the remaining interactions. Here, we consider Gaussian and Kumaraswamy policy heads. The probability density of the latter approximates that of the Beta distribution while being easier to reparameterize.
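As a concrete example, the Kumaraswamy distribution has a closed-form inverse CDF, so sampling is easily reparameterized. The sketch below uses torch.distributions.Kumaraswamy with illustrative concentration values and action bounds; in practice the concentrations would be predicted by the policy network:

```python
import torch
from torch.distributions import Kumaraswamy

a = torch.tensor([2.0])  # concentration1 (illustrative; network output in practice)
b = torch.tensor([5.0])  # concentration0 (illustrative; network output in practice)
low, high = -1.0, 1.0    # assumed action bounds of the environment

dist = Kumaraswamy(a, b)         # support on (0, 1)
u = dist.rsample()               # reparameterized: gradients flow through a, b
action = low + (high - low) * u  # affine rescale into the action interval
# Change of variables for the rescale: log p(action) = log p(u) - log(high - low)
log_prob = dist.log_prob(u) - torch.log(torch.tensor(high - low))
```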

Discrete Control

Discrete control can leverage reduced resolution for coarse exploration and easily encodes bang-bang responses to switching dynamics. We consider Categorical and Discrete Gaussian policy heads. The latter is implemented as a Categorical whose probabilities are parameterized by a Gaussian distribution, embedding relational structure between neighboring bins. In mixtures of continuous and discrete distributions, we introduce action tolerances (the bin width) to map samples that fall outside the discrete support back onto it. This improves the sharing of gradient information by enabling discrete distributions to train on samples generated by continuous distributions.
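One plausible way to realize such a head (an assumption on our part, not necessarily the exact parameterization used here) is to assign each bin the Gaussian probability mass within its tolerance, and to snap continuous samples to the nearest bin so discrete heads can score them:

```python
import torch
from torch.distributions import Categorical, Normal

def discrete_gaussian(mean, std, bins):
    # Categorical over equally spaced bin centers; each bin receives the
    # Gaussian probability mass falling within its width (the tolerance),
    # which embeds the ordinal structure of the action space.
    half = (bins[1] - bins[0]) / 2
    normal = Normal(mean.unsqueeze(-1), std.unsqueeze(-1))
    mass = normal.cdf(bins + half) - normal.cdf(bins - half)
    return Categorical(probs=mass / mass.sum(-1, keepdim=True))

def snap_to_support(actions, bins):
    # Action tolerance: map a continuous, out-of-support sample to the index
    # of its nearest bin so a discrete head can evaluate its log-probability.
    return (actions.unsqueeze(-1) - bins).abs().argmin(-1)

# Usage: score a continuous sample under the discrete head.
bins = torch.linspace(-1.0, 1.0, 11)
head = discrete_gaussian(torch.tensor(0.0), torch.tensor(0.3), bins)
log_prob = head.log_prob(snap_to_support(torch.tensor(0.23), bins))
```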

Specialization

Our approach allows for combining diverse distribution types and can learn to exploit their synergies. Combining a narrow Gaussian (N) with a bang-bang Categorical (C) lets the components specialize to specific phases of the task. On a Cartpole swing-up task, the agent leverages bang-bang control for the swing-up and continuous control for stabilization. On a Cheetah locomotion task, continuous control helps to coordinate the intricate contact phase, while bang-bang control quickly retracts the limbs during the flight phase. Temporal activation patterns and t-SNE dimensionality reduction along state trajectories confirm the consistency of this component specialization.
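The temporal activation patterns can be read directly from the mixture weights along a recorded trajectory. A brief sketch reusing the hypothetical MixturePolicy above; the observation size and the placeholder trajectory are assumptions for illustration only:

```python
import torch

policy = MixturePolicy(obs_dim=4, act_dim=1)  # assumed Cartpole-sized observations
states = torch.randn(100, 4)                  # placeholder (T, obs_dim) trajectory

with torch.no_grad():
    h = policy.torso(states)
    weights = torch.softmax(policy.weight_head(h), dim=-1)  # (T, 2) mixture weights
    dominant = weights.argmax(dim=-1)  # which component drives each timestep
```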

[Videos: Gaussian (N), Kumaraswamy (K), Categorical (C), Discrete Gaussian (D), Diverse Mixture (NKCD)]

Diverse Mixtures

Diverse mixtures further enable robust behavior learning by exploiting well-suited components while guarding against component failure. We compare a diverse mixture against its individual components on two challenging locomotion tasks. On the torque-controlled humanoid, the Kumaraswamy policy entirely fails to learn, while the Discrete Gaussian is unable to learn proper position targets for the ANYmal robot. On both tasks, the diverse mixture learns to coordinate its low-level components to generate stable locomotion patterns.
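To see how a mixture can keep training all of its heads from shared data, the following hedged sketch computes a heterogeneous mixture log-likelihood: continuous heads score the raw action, while discrete heads score the action snapped into their support via the tolerance described above. Function and argument names are our own, and all discrete heads are assumed to share one bin grid:

```python
import torch

def mixture_log_prob(action, logits, cont_heads, disc_heads, bins):
    # Mixture log-likelihood over heterogeneous components. Discrete heads
    # receive the nearest-bin index (action tolerance), so a sample drawn
    # from any one component provides gradient signal to all of them.
    log_w = torch.log_softmax(logits, dim=-1)
    idx = (action.unsqueeze(-1) - bins).abs().argmin(-1)  # snap into support
    terms = [d.log_prob(action) for d in cont_heads] + \
            [d.log_prob(idx) for d in disc_heads]
    return torch.logsumexp(log_w + torch.stack(terms, dim=-1), dim=-1)
```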

[Videos: Gaussian (N), Kumaraswamy (K), Categorical (C), Discrete Gaussian (D)]