Offline diversity maximization under state-only imitation constraints can turn demonstration data into a set of distinct policies, potentially improving robustness without additional environment interaction. Prior offline imitation-constrained approaches, however, typically rely on mutual-information objectives with a learned skill discriminator and can become brittle under the non-stationary rewards induced by alternating Lagrangian updates. We introduce Dual-Force, an offline algorithm that replaces the discriminator with a DICE-based, off-policy Van der Waals diversity signal derived from successor features, and mitigates reward non-stationarity by conditioning the value function and policy on a pre-trained Functional Reward Encoding (FRE). The FRE latent also serves as a compact key for recalling previously encountered skills, so the set of recoverable skills can grow over training rather than being fixed a priori. On two Solo12 simulation tasks (locomotion and obstacle navigation), Dual-Force recovers diverse, effective skills that respect the imitation constraints in practice and improve robustness under challenging obstacle perturbations.
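As a rough illustration of the discriminator-free diversity signal, below is a minimal sketch of a VDW-style repulsion reward computed from estimated successor features. The function name, the saturation length `l0`, and the exact reward shape are illustrative assumptions; the actual objective follows the DOMiNO/VDW formulation cited below.

```python
import numpy as np

def vdw_diversity_reward(sfs, i, l0=1.0):
    """Illustrative Van der Waals-style diversity signal for skill i.

    sfs: (n_skills, d) matrix of estimated successor features.
    Returns a reward that grows as skill i's successor features move
    away from their nearest neighbour, saturating at distance l0.
    No discriminator is needed: the signal is a pure geometric
    repulsion in successor-feature space.
    """
    others = np.delete(sfs, i, axis=0)
    d_min = np.min(np.linalg.norm(others - sfs[i], axis=1))
    return min(d_min / l0, 1.0)  # clipped repulsion
```

Because successor features can be estimated off-policy from the offline dataset, this signal avoids the online discriminator training that makes mutual-information objectives brittle in the offline setting.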
Prior Work Limitations (DOI) vs. Dual-Force (our method)

Diversity Objective
  Prior work (DOI): requires learning a skill discriminator, which is (i) hard to train offline; (ii) an InfoGain bonus helps, but it quickly vanishes.
  Dual-Force: Van der Waals (VDW) repulsion on successor features; no skill discriminator to learn, and it provides a strong diversity signal.

Non-Stationary Rewards
  Prior work (DOI): DICE assumes a stationary reward; violating this assumption makes value training unstable.
  Dual-Force: handles non-stationary rewards by conditioning the value function on the FRE embedding.

Dependence on num_skills
  Prior work (DOI): scales linearly with num_skills, so learning a large set of skills is prohibitive.
  Dual-Force: independent of num_skills; all skills observed during training are invocable.
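The FRE-conditioning and skill-recall rows above can be sketched as follows. The dimensions, the tiny MLP, and the `skill_bank` dictionary are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for this sketch.
STATE_DIM, FRE_DIM, HIDDEN = 8, 4, 32

# A tiny MLP value function V(s, z). Conditioning on the pre-trained
# FRE latent z makes the otherwise non-stationary reward identifiable
# to the network, so a single value function can be trained across
# the alternating Lagrangian updates.
W1 = rng.normal(scale=0.1, size=(STATE_DIM + FRE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))

def value(state, fre_latent):
    x = np.concatenate([state, fre_latent])
    h = np.tanh(x @ W1)
    return (h @ W2).item()

# The FRE latent doubles as a compact key: storing it lets any skill
# observed during training be re-invoked later, without fixing
# num_skills in advance.
skill_bank = {}
skill_bank["skill_0"] = rng.normal(size=FRE_DIM)

s = rng.normal(size=STATE_DIM)
v = value(s, skill_bank["skill_0"])
```

The design point is that the network's parameters never depend on the number of skills; only the dictionary of stored latents grows.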
[Figure: (a) Obstacle Navigation SFs, (b) UMAP Obstacle Navigation SFs, (c) Locomotion SFs, (d) UMAP Locomotion SFs]
(a,c) Pairwise ℓ2 distances between the successor features of the skills; the first row is the SMODICE-expert and all other rows are learned skills.
(b,d) UMAP projection of the SFs into 2D; the blue triangle is the SMODICE-expert and the colored dots are the learned skills.
While the multi-modal SMODICE-expert prefers passing over the box, the set of learned skills captures all modalities {left, right, top, mixed}.
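The analysis behind panels (a,c) can be sketched as a pairwise ℓ2 distance matrix over per-skill successor features, with row 0 playing the role of the SMODICE-expert; the helper name is hypothetical:

```python
import numpy as np

def pairwise_sf_distances(sfs):
    """Pairwise l2 distances between per-skill successor features.

    sfs: (n_skills, d) array; row 0 is the expert's successor features.
    Returns an (n_skills, n_skills) symmetric distance matrix, as
    visualized in panels (a,c).
    """
    diff = sfs[:, None, :] - sfs[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Panels (b,d) then project the same SFs to 2D, e.g. with umap-learn:
#   umap.UMAP(n_components=2).fit_transform(sfs)
```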
A performance benchmark of the skills learned in the obstacle navigation task, where the SMODICE-expert starts in front of a box of height 0.2 m and reaches a target position behind it. The learned skills exhibit diverse behaviors that cover the various modalities of the expert and offline datasets.
Straight / Left / Right
Video visualization of learned skills for the obstacle navigation task.
The learned skills cover all base-height modes (high, middle, low) and differ in angular velocity. The SMODICE-expert walks at middle base-height.
A selected subset of skills learned by the Dual-Force algorithm.
High / Middle / Low
Video visualization of the locomotion task.
Learned skills outperform the SMODICE-expert (left, right) or perform on par with it (both left and right).
A performance benchmark with additional fence obstacles of height 0.6 m, partially blocking the path from (a) the left side, (b) the right side, or (c) both the left and right sides. Among the diverse learned skills, several outperform the SMODICE-expert in (a,b) and perform on par with it in (c).
[DOI] M. Vlastelica, J. Cheng, G. Martius, P. Kolev: “Offline Diversity Maximization Under Imitation Constraints”, RLC’24
[FRE] K. Frans, S. Park, P. Abbeel, S. Levine: “Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings”, ICML’24
[VDW] T. Zahavy, Y. Schroecker, F. Behbahani, K. Baumli, S. Flennerhag, S. Hou, S. Singh: “Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality”, ICLR’23
[SMODICE] Y. Ma, A. Shen, D. Jayaraman, O. Bastani: “Versatile Offline Imitation from Observations and Examples via Regularized State-Occupancy Matching”, ICML’22
[DICE] O. Nachum, B. Dai: “Reinforcement Learning via Fenchel-Rockafellar Duality”, arXiv’20
[WGAN] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville: “Improved Training of Wasserstein GANs”, NeurIPS’17