Offline Diversity Maximization
Under Imitation Constraints

Abstract

There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data, and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot, for which the policies trained in simulation transfer well to the real robotic system.
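As a rough sketch of this constrained objective (the symbols below, including the number of skills K, the constraint level epsilon, and the occupancy notation, are our own illustrative choices rather than the paper's exact formulation), the problem can be read as

\[
\max_{\pi_1,\dots,\pi_K} \; I(S;Z)
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(d^{\pi_z} \,\|\, d^{E}\right) \le \epsilon
\quad \text{for all skills } z \in \{1,\dots,K\},
\]

where \(d^{\pi_z}\) denotes the state occupancy induced by skill \(z\) and \(d^{E}\) the expert's state occupancy estimated from the state-only demonstrations.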

Method

We first compute expert importance ratios by running SMODICE on the expert and offline datasets. These expert ratios are then fed into the alternating scheme sketched below to obtain the importance ratios for each skill. Specifically, the skill ratios are computed by a DICE-like offline policy evaluation algorithm whose input reward balances skill diversity and expert imitation.

Algorithm
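The following is only a minimal, hedged sketch of the alternating procedure described in Method, on toy tabular data. The helper `dice_evaluation`, the reward mixture weight `ALPHA`, and the fabricated expert ratios are assumptions for illustration; the actual method obtains expert ratios from SMODICE and solves a Fenchel-dual offline evaluation problem rather than the simple reweighting used here.

```python
"""Illustrative sketch of the alternating scheme (not the authors' implementation)."""
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_SKILLS = 20, 4
ALPHA = 0.5      # assumed trade-off between diversity and imitation rewards
N_ITERS = 50

# Step 1 (assumed given): expert importance ratios w_E(s) = d^E(s) / d^O(s).
# In the paper these come from running SMODICE on the expert and offline data;
# here we fabricate plausible values purely for illustration.
expert_ratio = rng.uniform(0.5, 2.0, size=N_STATES)

# Skill discriminator q(z | s), initialised uniformly over skills.
q_z_given_s = np.full((N_STATES, N_SKILLS), 1.0 / N_SKILLS)

# Per-skill importance ratios w_z(s) = d^{pi_z}(s) / d^O(s).
skill_ratio = np.ones((N_SKILLS, N_STATES))


def dice_evaluation(reward_per_state):
    """Crude stand-in for a DICE-like offline policy evaluation step.

    A real implementation solves a regularised Fenchel-dual problem over
    state occupancies; here we exponentiate the reward and renormalise so
    the sketch stays runnable.
    """
    w = np.exp(reward_per_state - reward_per_state.max())
    return w / w.mean()


for _ in range(N_ITERS):
    for z in range(N_SKILLS):
        # Reward balancing skill diversity (log q(z|s)) and expert imitation
        # (log expert ratio), as described in Method.
        reward = ALPHA * np.log(q_z_given_s[:, z] + 1e-8) \
                 + (1.0 - ALPHA) * np.log(expert_ratio + 1e-8)
        skill_ratio[z] = dice_evaluation(reward)

    # Refit the discriminator on ratio-weighted states: each state is assigned
    # to skills in proportion to its estimated occupancy under them.
    weights = skill_ratio.T                      # shape (N_STATES, N_SKILLS)
    q_z_given_s = weights / weights.sum(axis=1, keepdims=True)

print("per-skill ratios (first 5 states):\n", np.round(skill_ratio[:, :5], 3))
```

The alternation mirrors the description above: holding the discriminator fixed, each skill's occupancy ratio is re-estimated against the mixed diversity/imitation reward, and the discriminator is then refit on the ratio-weighted offline states.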

Discovered Skills on Solo 12

Trotting with a waving trunk motion and low base height

Trotting with medium base height

Trotting with high base height

Trot and stop