The Differentiable Cross Entropy Method (DCEM)
Supplementary Material
How to interpret these videos
Each frame of these videos is solving a model-based control optimization problem of the form
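A standard way to write this finite-horizon problem is sketched below; the names f (learned dynamics), r (learned reward), H (planning horizon), and U (action bounds) are notation we introduce here for illustration.

```latex
% A sketch of the per-frame planning problem; f, r, H, and U are assumed
% names for the learned dynamics, learned reward, horizon, and action bounds.
\hat{u}_{1:H} = \operatorname*{arg\,max}_{u_{1:H}} \sum_{t=1}^{H} r(x_t, u_t)
\quad \text{subject to} \quad
x_{t+1} = f(x_t, u_t), \qquad x_1 = x_{\text{init}}, \qquad u_t \in \mathcal{U}
```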
Each frame then shows the following information:
Cheetah Run
We start with the cheetah.run task from the DeepMind Control Suite, using a frame skip of 4, and show videos of 10 random evaluation episodes.
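As a minimal sketch (not the authors' code) of how such an episode can be set up, the task can be loaded with dm_control and the frame skip implemented as an action repeat:

```python
# Load a DeepMind Control Suite task and implement the frame skip
# as an action repeat. This is an illustrative sketch only.
import numpy as np
from dm_control import suite

def step_with_repeat(env, action, action_repeat):
    """Apply one control for `action_repeat` simulator steps, summing rewards."""
    total_reward = 0.0
    for _ in range(action_repeat):
        time_step = env.step(action)
        total_reward += time_step.reward or 0.0  # reward is None on reset
        if time_step.last():
            break
    return time_step, total_reward

# cheetah.run with frame skip 4; walker.walk below uses the same
# construction with action_repeat=2.
env = suite.load(domain_name="cheetah", task_name="run")
spec = env.action_spec()
time_step = env.reset()
action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
time_step, reward = step_with_repeat(env, action, action_repeat=4)
```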
CEM over the full action space
- We currently do not normalize the observation space
- The model often over-predicts the reward, indicating that the controller has found an action sequence with an unrealistically high predicted return; because the model is trained online, these over-predictions are usually not too harmful
- The action sequences stay closer to the zero-control sequence than the ones DCEM learns (a minimal CEM sketch follows this list)
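For reference, here is a minimal sketch of vanilla CEM planning over the full action sequence. The callables `dynamics` and `reward` are assumed stand-ins for the learned models, and the hyperparameter values are illustrative, not the ones used in these videos.

```python
# Vanilla CEM over the full action sequence (illustrative sketch).
import numpy as np

def cem_plan(x0, dynamics, reward, horizon=12, act_dim=6, n_samples=100,
             n_elite=10, n_iters=10, act_low=-1.0, act_high=1.0):
    mu = np.zeros((horizon, act_dim))    # mean starts at the zero controls
    sigma = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample action sequences from the current Gaussian and clip to bounds.
        u = np.random.randn(n_samples, horizon, act_dim) * sigma + mu
        u = np.clip(u, act_low, act_high)
        # Score each sequence by rolling it out through the learned models.
        returns = np.empty(n_samples)
        for i in range(n_samples):
            x, ret = x0, 0.0
            for t in range(horizon):
                ret += reward(x, u[i, t])
                x = dynamics(x, u[i, t])
            returns[i] = ret
        # Refit the Gaussian to the top-k sequences by predicted return.
        elite = u[np.argsort(-returns)[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute only the first action (receding-horizon control)
```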

DCEM over the latent action space
- The controls are much closer to the action-space boundaries
- The reward is sometimes severely over-predicted, which causes the cheetah to fall over unrecoverably (a sketch of the latent-space update follows this list)
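For contrast, here is a sketch of CEM over a learned latent action space. DCEM makes the elite selection differentiable with an LML top-k projection; the sketch below substitutes a softmax weighting over samples, a simpler soft relaxation, so it illustrates the idea rather than the paper's exact operator. The names `decode` (latent to action sequence) and `rollout_return` (score a batch of sequences) are assumptions for illustration.

```python
# Soft CEM over a learned latent action space (illustrative sketch;
# the paper's differentiable top-k is replaced by a softmax weighting).
import torch

def dcem_latent(decode, rollout_return, z_dim=32, n_samples=100,
                n_iters=10, temp=1.0):
    mu = torch.zeros(z_dim)
    sigma = torch.ones(z_dim)
    for _ in range(n_iters):
        # Sample latent codes and decode them into full action sequences.
        z = mu + sigma * torch.randn(n_samples, z_dim)
        actions = decode(z)                       # (n_samples, H, act_dim)
        returns = rollout_return(actions)         # (n_samples,)
        # Soft "elite" weights; gradients flow through them, so the decoder
        # can be trained end-to-end through the sampler updates.
        w = torch.softmax(returns / temp, dim=0)
        mu = (w.unsqueeze(1) * z).sum(dim=0)
        sigma = ((w.unsqueeze(1) * (z - mu) ** 2).sum(dim=0)).sqrt() + 1e-6
    return decode(mu.unsqueeze(0))[0]             # final action sequence
```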

DCEM over the latent action space + PPO fine-tuning
- The reward model is now fine-tuned by PPO and may no longer match the true reward, but it helps the controller stay in safer parts of the state space and fall less often (the PPO objective is sketched below)
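The fine-tuning step optimizes the standard PPO clipped surrogate; a minimal sketch of that objective follows. How the latent controller is wrapped as a stochastic policy is elided here.

```python
# Standard PPO-Clip surrogate loss (illustrative sketch).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """All inputs are 1-D tensors with one entry per sampled transition."""
    ratio = torch.exp(logp_new - logp_old)        # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (min) objective, negated so it can be minimized.
    return -torch.min(unclipped, clipped).mean()
```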

Walker Walk
Next we look at the walker.walk task with a frame skip of 2 and show videos of 10 random evaluation episodes.
CEM over the full action space

DCEM over the latent action space

DCEM over the latent action space + PPO fine-tuning
