CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

TLDR: CIC is an exploration algorithm that discovers structured skills without extrinsic rewards. It resolves fundamental issues that have prevented prior mutual information maximization (infomax) exploration algorithms from performing well, and achieves SOTA results on the Unsupervised Reinforcement Learning Benchmark (URLB). CIC outperforms prior competence-based methods by 1.78x and all prior exploration methods by 1.19x on URLB in terms of interquartile mean performance.

1. What is competence-based exploration?

Most reward-free exploration algorithms known to date fall into one of three categories: knowledge-based, data-based, and competence-based. Knowledge-based methods maximize the prediction error or uncertainty of a predictive model (e.g. Curiosity, RND); data-based methods maximize the diversity of observed data (e.g. APT, count-based methods); and competence-based methods maximize the mutual information between states and a latent vector, usually sampled from noise, that is often referred to as the "skill" or "task" vector (e.g. DIAYN, DADS).

Unsupervised Reinforcement Learning

Competence-based Exploration
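
As a concrete illustration of the competence-based objective, here is a minimal sketch of a DIAYN-style discrete-skill intrinsic reward; the network shapes, skill count, and dimensions are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

STATE_DIM = 24   # illustrative observation size
NUM_SKILLS = 16  # illustrative number of discrete skills

# Discriminator q(z|s): tries to infer which skill produced a given state.
discriminator = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_SKILLS),  # logits over discrete skills
)

def intrinsic_reward(state, skill_idx):
    # DIAYN-style reward: log q(z|s) - log p(z). The reward is high when the
    # visited state makes the sampled skill easy to identify.
    log_q = torch.log_softmax(discriminator(state), dim=-1)[skill_idx]
    log_p = -torch.log(torch.tensor(float(NUM_SKILLS)))  # uniform prior over skills
    return log_q - log_p
```

Maximizing this reward while the discriminator is trained to classify skills from visited states maximizes a lower bound on the mutual information I(s;z).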

2. Why do prior competence-based methods perform poorly on the unsupervised RL benchmark?

Recently, the Unsupervised Reinforcement Learning Benchmark (URLB) was proposed as a way to benchmark unsupervised RL exploration algorithms. In URLB, agents are pre-trained for 2M steps in each domain; a single pre-trained agent is then fine-tuned for 100k steps to solve various downstream tasks within that domain. On this benchmark, competence-based exploration algorithms were observed to perform substantially worse than other exploration methods.

In this work, we ask: why is this the case?

Competence-based exploration underperforms on URLB
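
In pseudocode, the URLB protocol described above looks roughly like the following; the agent, env, and task interfaces are hypothetical stand-ins, not the benchmark's actual API.

```python
def urlb_protocol(agent, env, downstream_tasks,
                  pretrain_steps=2_000_000, finetune_steps=100_000):
    """Reward-free pre-training followed by per-task fine-tuning (URLB-style sketch)."""
    # Phase 1: pre-train with an intrinsic reward only; the extrinsic reward is ignored.
    obs = env.reset()
    for _ in range(pretrain_steps):
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)
        agent.update(obs, action, agent.intrinsic_reward(obs, next_obs), next_obs)
        obs = env.reset() if done else next_obs

    # Phase 2: fine-tune a copy of the single pre-trained agent on each downstream
    # task, now using the task's extrinsic reward.
    finetuned = []
    for task in downstream_tasks:
        ft_agent = agent.clone()
        obs = task.reset()
        for _ in range(finetune_steps):
            action = ft_agent.act(obs)
            next_obs, reward, done, _ = task.step(action)
            ft_agent.update(obs, action, reward, next_obs)
            obs = task.reset() if done else next_obs
        finetuned.append(ft_agent)
    return finetuned
```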

3. Why competence-based methods need to support large skill spaces

We argue that the main issue is weak discriminators in competence-based exploration. When the mutual information is decomposed as I(s;z) = H(z) - H(z|s) = H(s) - H(s|z), the estimator of the conditional entropy term is called the discriminator. In prior works, the discriminators are either classifiers over discrete skills or regressors over continuous skills. The problem is that such classification and regression tasks require an exponential number of diverse data samples to be accurate, yet in complex environments there can be a very large number of skills, so we need discriminators capable of supporting large skill spaces. This tension between the need to support large skill spaces and the limitations of current discriminators leads us to propose Contrastive Intrinsic Control (CIC).
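
To see why the discriminator's accuracy matters, recall the standard variational argument used by prior skill discovery methods (this is the usual lower bound, not something specific to CIC): since the cross-entropy under a learned model q(z|s) upper-bounds the true conditional entropy H(z|s), we have I(s;z) = H(z) - H(z|s) >= H(z) + E[log q(z|s)]. The bound is tight only when q(z|s) matches the true posterior, so a weak discriminator directly loosens the mutual information objective being optimized; the symmetric decomposition H(s) - H(s|z) is bounded analogously with a discriminator q(s|z).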

4. Contrastive Intrinsic Control - a new contrastive objective between states and skills for a powerful discriminator

Contrastive Intrinsic Control (CIC) introduces a new contrastive density estimator to approximate the conditional entropy (the discriminator). Unlike visual contrastive learning, this contrastive objective operates over state transitions and skill vectors. This allows us to bring the powerful representation learning machinery from vision to unsupervised skill discovery.

The CIC architecture
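
Below is a minimal sketch of this kind of noise-contrastive (InfoNCE-style) discriminator between transition embeddings and skill embeddings; the layer sizes, temperature, and cosine-similarity choice are illustrative assumptions rather than the exact published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDiscriminator(nn.Module):
    """Scores agreement between state transitions (s, s') and skill vectors z."""

    def __init__(self, obs_dim, skill_dim, hidden=256, temperature=0.5):
        super().__init__()
        self.temperature = temperature
        # Embeds the state transition (s, s') into a shared latent space.
        self.transition_net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Skill projection head: embeds z into the same latent space.
        self.skill_net = nn.Sequential(
            nn.Linear(skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, obs, next_obs, skills):
        tau = self.transition_net(torch.cat([obs, next_obs], dim=-1))
        z = self.skill_net(skills)
        tau = F.normalize(tau, dim=-1)
        z = F.normalize(z, dim=-1)
        # Cosine-similarity logits between every transition and every skill in the batch.
        logits = tau @ z.t() / self.temperature
        # InfoNCE: the matching (transition, skill) pair is the positive;
        # all other skills in the batch act as negatives.
        labels = torch.arange(obs.shape[0], device=obs.device)
        return F.cross_entropy(logits, labels)
```

Minimizing this loss pushes each transition embedding toward its own skill embedding and away from the other skills in the batch, which is what lets the discriminator scale to large continuous skill spaces.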

5. CIC resolves issues with prior methods and achieves SOTA results compared to prior exploration algorithms

With explicit exploration through the state-transition entropy term and a new contrastive discriminator, CIC adapts extremely efficiently to downstream tasks, outperforming prior competence-based approaches by 1.78x and all prior exploration methods by 1.19x on URLB.

CIC is the first competence-based method to achieve SOTA on URLB
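
The state-transition entropy term is estimated with a particle-based (k-nearest-neighbor) estimator over transition embeddings, in the spirit of APT; a minimal sketch, with an illustrative k and log scaling, is:

```python
import torch

def knn_entropy_reward(tau, k=12):
    """Particle-based entropy estimate over a batch of transition embeddings.

    tau: (batch, dim) embeddings of state transitions, with batch > k.
    Returns a per-sample reward proportional to the average distance to the
    k nearest neighbors within the batch (larger distance -> more novel).
    """
    dists = torch.cdist(tau, tau)                     # pairwise Euclidean distances
    knn_dists, _ = dists.topk(k + 1, largest=False)   # +1 to skip the distance to self
    return torch.log(1.0 + knn_dists[:, 1:].mean(dim=1))
```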

6. Behaviors produced with CIC

Due to both state-transition entropy maximization and state-skill representation learning, CIC produces diverse exploratory behaviors.

7. Behaviors produced with DIAYN

In the absence of early resets (see the paper for details), DIAYN skills map to static poses, limiting exploration.

8. Both entropy maximization and representation learning are needed for CIC to work

We run reward-free pre-training experiments on Quadruped to understand which factors contribute to CIC's performance. The first run (blue) is the full CIC algorithm; the second (orange) is CIC without state-skill representation learning; the third (green) is CIC without state-transition entropy maximization. We evaluate by reading out the zero-shot extrinsic reward on the Quadruped stand task to sanity-check CIC. While the policies in the two ablations collapse to static modes, the reward profile for full CIC is dynamic and non-zero, suggesting that both components are needed for the algorithm to work.

9. Skill-related architectural details that matter

We provide ablations to show which parts of the CIC architecture contribute most to its performance. First, we find that a skill projection head is an essential part of the architecture (similar to the projection head in SimCLR). Second, we find that the skill dimension matters: larger skill dimensions result in stronger performance, up to a point. Third, we find that spending the initial 4k steps of fine-tuning on skill selection leads to better performance (a rough sketch of this step is given below).

Ablations showing how architectural choices contribute to CIC performance
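
The skill-selection step from the third ablation can be viewed as a short sweep over candidate skills at the start of fine-tuning; the following is a rough sketch, where the number of candidates, the uniform skill prior, and the agent/task interfaces are assumptions for illustration.

```python
import numpy as np

NUM_CANDIDATES = 10
SELECTION_STEPS = 4_000  # initial fine-tuning budget spent on skill selection

def select_skill(agent, task, skill_dim):
    """Roll out a few candidate skills and keep the one with the best extrinsic return."""
    steps_per_skill = SELECTION_STEPS // NUM_CANDIDATES
    candidates = np.random.uniform(0, 1, size=(NUM_CANDIDATES, skill_dim))
    returns = []
    for z in candidates:
        obs, total_return = task.reset(), 0.0
        for _ in range(steps_per_skill):
            obs, reward, done, _ = task.step(agent.act(obs, skill=z))
            total_return += reward
            if done:
                obs = task.reset()
        returns.append(total_return)
    # Fix the best-performing skill for the remainder of fine-tuning.
    return candidates[int(np.argmax(returns))]
```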