Offline diversity maximization under state-only imitation constraints can turn demonstration data into a set of distinct policies, potentially improving robustness without additional environment interaction. Prior offline imitation-constrained approaches, however, typically rely on mutual-information objectives with a learned skill discriminator and can become brittle under the non-stationary rewards induced by alternating Lagrangian updates. We introduce Dual-Force, an offline algorithm that replaces the discriminator with a DICE-based, off-policy Van der Waals diversity signal derived from successor features, and mitigates reward non-stationarity by conditioning the value function and policy on a pre-trained Functional Reward Encoding (FRE). The FRE latent also serves as a compact key for recalling previously encountered skills, so the set of recoverable skills can grow over training rather than being fixed a priori. On two Solo12 simulation tasks (locomotion and obstacle navigation), Dual-Force recovers diverse, effective skills that respect the imitation constraints in practice and improve robustness under challenging obstacle perturbations.
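As a rough illustration of the discriminator-free diversity signal, below is a minimal sketch of a VDW-style repulsion reward computed from estimated successor features. The function name, the saturation length `l0`, and the exact reward shape are illustrative assumptions; the actual objective follows the DOMiNO/VDW formulation cited below.

```python
import numpy as np

def vdw_diversity_reward(sfs, i, l0=1.0):
    """Illustrative Van der Waals-style diversity signal for skill i.

    sfs: (n_skills, d) matrix of estimated successor features.
    Returns a reward that grows as skill i's successor features move
    away from their nearest neighbour, saturating at distance l0.
    No discriminator is needed: the signal is a pure geometric
    repulsion in successor-feature space.
    """
    others = np.delete(sfs, i, axis=0)
    d_min = np.min(np.linalg.norm(others - sfs[i], axis=1))
    return min(d_min / l0, 1.0)  # clipped repulsion
```

Because successor features can be estimated off-policy from the offline dataset, this signal avoids the online discriminator training that makes mutual-information objectives brittle in the offline setting.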
Prior Work Limitations (DOI) vs. Dual-Force (our method)

Diversity Objective
  Prior work (DOI): requires learning a skill discriminator, which is (i) hard to train offline; (ii) an InfoGain bonus helps, but it quickly vanishes.
  Dual-Force: Van der Waals (VDW) repulsion on successor features; no skill discriminator to learn, and it provides a strong diversity signal.

Non-Stationary Rewards
  Prior work (DOI): DICE assumes a stationary reward; violating this assumption makes value training unstable.
  Dual-Force: handles non-stationary rewards by conditioning the value function on the FRE embedding.

Dependence on num_skills
  Prior work (DOI): scales linearly with num_skills, so learning a large set of skills is prohibitive.
  Dual-Force: independent of num_skills; all skills observed during training are invocable.
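The FRE-conditioning and skill-recall rows above can be sketched as follows. The dimensions, the tiny MLP, and the `skill_bank` dictionary are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for this sketch.
STATE_DIM, FRE_DIM, HIDDEN = 8, 4, 32

# A tiny MLP value function V(s, z). Conditioning on the pre-trained
# FRE latent z makes the otherwise non-stationary reward identifiable
# to the network, so a single value function can be trained across
# the alternating Lagrangian updates.
W1 = rng.normal(scale=0.1, size=(STATE_DIM + FRE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))

def value(state, fre_latent):
    x = np.concatenate([state, fre_latent])
    h = np.tanh(x @ W1)
    return (h @ W2).item()

# The FRE latent doubles as a compact key: storing it lets any skill
# observed during training be re-invoked later, without fixing
# num_skills in advance.
skill_bank = {}
skill_bank["skill_0"] = rng.normal(size=FRE_DIM)

s = rng.normal(size=STATE_DIM)
v = value(s, skill_bank["skill_0"])
```

The design point is that the network's parameters never depend on the number of skills; only the dictionary of stored latents grows.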
[Figure: (a) Obstacle Navigation SFs, (b) UMAP Obstacle Navigation SFs, (c) Locomotion SFs, (d) UMAP Locomotion SFs]
(a,c) Pairwise ℓ2 distances between the successor features of the skills; the first row is the SMODICE-expert and all other rows are learned skills.
(b,d) UMAP projection of the SFs into 2D; the blue triangle is the SMODICE-expert and the colored dots are the learned skills.
While the multi-modal SMODICE-expert prefers passing over the box, the set of learned skills captures all modalities {left, right, top, mixed}.
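The analysis behind panels (a,c) can be sketched as a pairwise ℓ2 distance matrix over per-skill successor features, with row 0 playing the role of the SMODICE-expert; the helper name is hypothetical:

```python
import numpy as np

def pairwise_sf_distances(sfs):
    """Pairwise l2 distances between per-skill successor features.

    sfs: (n_skills, d) array; row 0 is the expert's successor features.
    Returns an (n_skills, n_skills) symmetric distance matrix, as
    visualized in panels (a,c).
    """
    diff = sfs[:, None, :] - sfs[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Panels (b,d) then project the same SFs to 2D, e.g. with umap-learn:
#   umap.UMAP(n_components=2).fit_transform(sfs)
```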
A performance benchmark of the skills learned in the obstacle navigation task, where the SMODICE-expert starts in front of a box of height 0.2 m and reaches a target position behind it. The learned skills exhibit diverse behaviors that cover the various modalities of the expert and offline datasets.
Straight / Left / Right
Video visualization of learned skills for the obstacle navigation task.
The learned skills cover all base-height modes (high, middle, low) and differ in angular velocity. The SMODICE-expert walks at middle base-height.
A selected subset of skills learned by the Dual-Force algorithm.
High / Middle / Low
Video visualization of the locomotion task.
Learned skills outperform the SMODICE-expert (left, right) or perform on par with it (both left and right).
A performance benchmark with additional fence obstacles of height 0.6 m, partially blocking the path from (a) the left side, (b) the right side, or (c) both the left and right sides. Among the diverse learned skills, several outperform the SMODICE-expert in (a,b) and perform on par with it in (c).
[DOI] M. Vlastelica, J. Cheng, G. Martius, P. Kolev: “Offline Diversity Maximization Under Imitation Constraints”, RLC’24
[FRE] K. Frans, S. Park, P. Abbeel, S. Levine: “Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings”, ICML’24
[VDW] T. Zahavy, Y. Schroecker, F. Behbahani, K. Baumli, S. Flennerhag, S. Hou, S. Singh: “Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality”, ICLR’23
[SMODICE] Y. Ma, A. Shen, D. Jayaraman, O. Bastani: “Versatile Offline Imitation from Observations and Examples via Regularized State-Occupancy Matching”, ICML’22
[DICE] O. Nachum, B. Dai: “Reinforcement Learning via Fenchel-Rockafellar Duality”, arXiv’20
[WGAN] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville: “Improved Training of Wasserstein GANs”, NeurIPS’17