The Differentiable Cross Entropy Method (DCEM)
Supplementary Material
How to interpret these videos
Each frame of these videos is solving a model-based control optimization problem of the form
Each frame then shows the following information:
Cheetah Run
We start with the cheetah.run task from the DeepMind control suite with a frame skip of 4 and show the videos of 10 random evaluation episodes.
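As a rough illustration of what a frame skip of 4 means, the sketch below repeats one action for several simulator steps and accumulates the reward. The `CountingEnv` class, its `step` signature, and the choice to sum rewards over skipped frames are assumptions made for this example; they are not taken from the DeepMind control suite API or from the authors' code.

```python
class CountingEnv:
    """Hypothetical toy environment: reward 1.0 per step, ends after `limit` steps."""

    def __init__(self, limit=10):
        self.limit = limit
        self.t = 0

    def step(self, action):
        # Returns (observation, reward, done); the observation is just the step count.
        self.t += 1
        return self.t, 1.0, self.t >= self.limit


def step_with_frame_skip(env, action, frame_skip=4):
    """Apply the same action `frame_skip` times and sum the rewards (an assumption)."""
    total, obs, done = 0.0, None, False
    for _ in range(frame_skip):
        obs, reward, done = env.step(action)
        total += reward
        if done:
            # Stop early if the episode ends in the middle of the skipped frames.
            break
    return obs, total, done


env = CountingEnv(limit=10)
first = step_with_frame_skip(env, action=0.0)
second = step_with_frame_skip(env, action=0.0)
third = step_with_frame_skip(env, action=0.0)
```

With this wrapper the controller chooses one action per four simulator steps, so a fixed planning horizon covers four times as much simulated time.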
CEM over the full action space
- We are currently not normalizing the observation space
- The model often over-predicts the reward, which indicates that the controller has found an action sequence with unrealistically high predicted returns. Because of the online training process, however, the learned model has reached a point where these over-predictions are usually not too harmful
- The action sequences are closer to the zero control sequence than those DCEM learns
cheetah-cem.mp4
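For readers unfamiliar with the baseline, here is a minimal sketch of vanilla CEM planning over a full action sequence. The one-dimensional dynamics, the reward, the action bounds of [-1, 1], and all hyperparameters are hypothetical placeholders chosen for illustration; the experiments above use a learned model, not this toy.

```python
import numpy as np


def cem_plan(x0, dynamics, reward, horizon=10, action_dim=1,
             n_samples=64, n_elite=8, n_iters=5, seed=0):
    """Vanilla CEM over an action sequence of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate sequences and clip them to the (assumed) action bounds.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        u = np.clip(mu + sigma * noise, -1.0, 1.0)
        returns = np.empty(n_samples)
        for i in range(n_samples):
            # Roll each candidate through the model and sum the predicted rewards.
            x, total = x0, 0.0
            for t in range(horizon):
                x = dynamics(x, u[i, t])
                total += reward(x, u[i, t])
            returns[i] = total
        # Refit the sampling distribution to the highest-return candidates.
        elite = u[np.argsort(returns)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6
    return mu


# Toy problem (hypothetical): move a point from 0 toward a target at 1.
toy_dynamics = lambda x, u: x + 0.1 * u
toy_reward = lambda x, u: -float(np.sum((x - 1.0) ** 2))
plan = cem_plan(np.zeros(1), toy_dynamics, toy_reward)
```

In a receding-horizon controller, only the first action of `plan` would be executed before the problem is solved again from the next state.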
DCEM over the latent action space
- The controls are much closer to the boundaries
- Sometimes the reward is extremely over-predicted, which causes the cheetah to fall over unrecoverably
cheetah-dcem.mp4
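To make the idea of searching over a latent action space concrete, the sketch below runs plain CEM over a low-dimensional latent vector that a decoder maps to a full action sequence. It is not the DCEM implementation: the differentiable relaxation of the elite selection is omitted, and the linear-plus-tanh decoder, the fixed weight matrix `W`, and the toy objective are all assumptions made for this example.

```python
import numpy as np


def decode(z, W):
    # Hypothetical decoder: a linear map followed by tanh, so every action
    # lies in [-1, 1].
    return np.tanh(z @ W)


def latent_cem(objective, W, latent_dim, n_samples=64, n_elite=8,
               n_iters=5, seed=0):
    """CEM over a latent vector; candidates are scored after decoding."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(latent_dim)
    sigma = np.ones(latent_dim)
    for _ in range(n_iters):
        # Sample in the latent space rather than the full action space.
        z = mu + sigma * rng.standard_normal((n_samples, latent_dim))
        values = np.array([objective(decode(zi, W)) for zi in z])
        # Refit the latent distribution to the best-scoring latents.
        elite = z[np.argsort(values)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6
    return decode(mu, W)


# Toy setup (hypothetical): a 2-D latent decoded into 10 scalar actions,
# scored by the sum of the actions.
W = 0.5 * np.ones((2, 10))
actions = latent_cem(lambda u: float(u.sum()), W, latent_dim=2)
```

The search here runs over 2 numbers instead of 10, which is the appeal of a latent action space; in DCEM the decoder is learned rather than fixed as it is here.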
DCEM over the latent action space + PPO fine-tuning
- The reward is now fine-tuned by PPO and may not match the true reward, but it helps the controller stay in safer parts of the state space and fall less often
cheetah-dcem-ppo.mp4
Walker Walk
Next we look at the walker.walk task with a frame skip of 2 and show the videos of 10 random evaluation episodes.
CEM over the full action space
walker-cem.mp4
DCEM over the latent action space
walker-dcem.mp4
DCEM over the latent action space + PPO fine-tuning
walker-dcem-ppo.mp4