The Differentiable Cross Entropy Method (DCEM)

Supplementary Material

How to interpret these videos

Each frame of these videos solves a model-based control optimization problem of the receding-horizon form

\[
u^\star_{1:H} \in \operatorname*{argmax}_{u_{1:H}} \; \sum_{t=1}^{H} r_\theta(x_t, u_t)
\quad \text{subject to} \quad x_{t+1} = f_\theta(x_t, u_t), \quad x_1 = x_{\text{init}}, \quad u_t \in [-1, 1]^m,
\]

where f_\theta is the learned transition model, r_\theta is the learned reward model, and H is the planning horizon. Only the first action of the solution is executed before re-planning at the next time step.

Each frame then shows the following information:

  • The rendered state of the environment
  • The action sequence produced by the controller
  • The model's predicted reward alongside the reward attained in the environment
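
To make the planning loop concrete, below is a minimal sketch of CEM over full action sequences. The dynamics and reward callables stand in for the learned models, and the horizon, population size, and elite count are illustrative placeholders rather than the settings used in the experiments.

    import torch

    def cem_plan(x_init, dynamics, reward, horizon=12, n_action=6,
                 n_samples=100, n_elite=10, n_iters=10):
        # Gaussian sampling distribution over full action sequences,
        # initialized at the zero control sequence.
        mu = torch.zeros(horizon, n_action)
        sigma = torch.ones(horizon, n_action)
        for _ in range(n_iters):
            u = mu + sigma * torch.randn(n_samples, horizon, n_action)
            u = u.clamp(-1., 1.)  # respect the control limits
            # Roll every candidate sequence through the learned model.
            returns = torch.zeros(n_samples)
            x = x_init.expand(n_samples, -1)
            for t in range(horizon):
                returns = returns + reward(x, u[:, t])
                x = dynamics(x, u[:, t])
            # Refit the distribution to the top-scoring (elite) sequences.
            elite = u[returns.topk(n_elite).indices]
            mu, sigma = elite.mean(dim=0), elite.std(dim=0)
        return mu  # planned sequence; only the first action is executed

Refitting to only the elites is what pulls the sampling distribution toward high-reward sequences; DCEM's contribution is making this top-k selection step differentiable.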

Cheetah Run

We start with the cheetah.run task from the DeepMind Control Suite with a frame skip of 4 and show videos of 10 random evaluation episodes.

CEM over the full action space

  • We do not normalize the observation space
  • The model often over-predicts the reward, indicating that the controller has found an action sequence with unrealistically high predicted returns; because the model is trained online, these over-predictions are usually not too harmful
  • The action sequences stay closer to the zero control sequence than the ones DCEM finds
cheetah-cem.mp4

DCEM over the latent action space

  • The controls sit much closer to the boundaries of the action space
  • Sometimes the reward is severely over-predicted, which causes the cheetah to fall over unrecoverably
cheetah-dcem.mp4
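
The latent-space variant replaces sampling in the full (horizon × action) space with sampling in a learned low-dimensional latent space that is decoded into full action sequences. Below is a minimal sketch of this idea: decode and rollout_return are hypothetical stand-ins for the learned decoder and the model rollout from the sketch above, and where DCEM uses an LML projection for a differentiable top-k, this sketch substitutes temperature-scaled softmax weights as a simpler approximation.

    import torch

    def dcem_plan(x_init, decode, rollout_return, n_latent=2,
                  n_samples=100, n_iters=10, temp=1.0):
        mu = torch.zeros(n_latent)
        sigma = torch.ones(n_latent)
        for _ in range(n_iters):
            z = mu + sigma * torch.randn(n_samples, n_latent)
            u = decode(z)                        # (n_samples, horizon, n_action)
            returns = rollout_return(x_init, u)  # (n_samples,)
            # Soft, differentiable reweighting instead of a hard top-k,
            # so gradients flow back into the decoder and the models.
            w = torch.softmax(returns / temp, dim=0)
            mu = (w.unsqueeze(1) * z).sum(dim=0)
            sigma = ((w.unsqueeze(1) * (z - mu) ** 2).sum(dim=0)).sqrt()
        return decode(mu.unsqueeze(0))[0]  # decoded plan at the mean

Optimizing over a few latent dimensions instead of the full horizon × action space is what allows the controller to use far fewer samples and iterations per solve.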

DCEM over the latent action space + PPO fine-tuning

  • The reward model is now fine-tuned with PPO, so the predicted reward may not match the true reward, but the fine-tuning helps the controller stay in safer parts of the state space and fall less often
cheetah-dcem-ppo.mp4
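
The fine-tuning stage updates the controller end-to-end through the differentiable planner using the standard PPO clipped surrogate objective. A minimal sketch of that loss is below; logp, logp_old, and advantage are assumed to come from rollouts of the DCEM controller, and the names here are placeholders rather than the exact implementation.

    import torch

    def ppo_clip_loss(logp, logp_old, advantage, clip_eps=0.2):
        # Probability ratio between the current and the data-collecting policy.
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1. - clip_eps, 1. + clip_eps)
        # Pessimistic (min) combination of clipped and unclipped objectives.
        return -torch.min(ratio * advantage, clipped * advantage).mean()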

Walker Walk

Next we look at the walker.walk task with a frame skip of 2 and show videos of 10 random evaluation episodes.

CEM over the full action space

walker-cem.mp4

DCEM over the latent action space

walker-dcem.mp4

DCEM over the latent action space + PPO fine-tuning

walker-dcem-ppo.mp4