The Differentiable Cross Entropy Method (DCEM)
How to interpret these videos
Each frame of these videos corresponds to solving a model-based control optimization problem of the form
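As a rough guide, a generic problem of this kind can be written as follows. The notation is an assumption made for illustration and is not taken from this page: $x_t$ is the state, $u_t$ the control, $H$ the planning horizon, $f$ the learned dynamics model, and $r$ the learned reward model.

$$
u_{1:H}^\star = \operatorname*{arg\,max}_{u_{1:H}} \sum_{t=1}^{H} r(x_t, u_t)
\quad \text{subject to} \quad x_1 = x_{\text{init}}, \quad x_{t+1} = f(x_t, u_t)
$$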
Each frame then shows the following information:
We start with the cheetah.run task from the DeepMind control suite with a frame skip of 4 and show the videos of 10 random evaluation episodes.
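For readers unfamiliar with frame skip, the sketch below shows one common interpretation: the same action is repeated for several simulator steps and the rewards are summed. The `CountingEnv` class and the `step_with_frame_skip` helper are hypothetical stand-ins for illustration, not the DeepMind control suite API or the code used for these videos.

```python
class CountingEnv:
    """Hypothetical toy environment: reward 1.0 per step, ends after 10 steps."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10  # (observation, reward, done)


def step_with_frame_skip(env, action, frame_skip=4):
    """Repeat one action for `frame_skip` simulator steps and sum the rewards."""
    total_reward = 0.0
    for _ in range(frame_skip):
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:  # stop early if the episode ends mid-skip
            break
    return obs, total_reward, done


env = CountingEnv()
obs, reward, done = step_with_frame_skip(env, action=0.0, frame_skip=4)
```

With a frame skip of 4, the controller only has to choose one action for every four simulator steps, which shortens the effective planning horizon.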
CEM over the full action space
- We are currently not normalizing the observation space
- The model often over-predicts the reward, which indicates that the controller has found an action sequence with unrealistically high predicted returns. Because the model is trained online, it has reached a point where these over-predictions are usually not too harmful
- The action sequences are closer to the zero control sequence than those DCEM learns
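To make the baseline concrete, here is a minimal sketch of vanilla CEM over a full action sequence. It is not the implementation behind these videos: `cem_plan`, its hyperparameters, the action bounds of [-1, 1], and the `toy_return` objective (a stand-in for the return predicted by a learned model) are all assumptions chosen for illustration.

```python
import numpy as np


def cem_plan(return_fn, horizon, action_dim, n_samples=64, n_elite=8,
             n_iters=10, init_std=0.5, seed=0):
    """Vanilla CEM over an action sequence with controls clipped to [-1, 1]."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))  # start at the zero control sequence
    std = np.full((horizon, action_dim), init_std)
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = np.clip(mean + std * noise, -1.0, 1.0)
        # Score every candidate with the predicted return.
        returns = np.array([return_fn(u) for u in candidates])
        # Refit the Gaussian to the top-k (elite) candidates.
        elite = candidates[np.argsort(returns)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


def toy_return(u):
    """Hypothetical objective: best when every control equals 0.3."""
    return -float(np.sum((u - 0.3) ** 2))


plan = cem_plan(toy_return, horizon=5, action_dim=2)
```

In a receding-horizon controller, only the first action of `plan` would typically be executed before re-planning from the next state.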
DCEM over the latent action space
- The controls are much closer to the boundaries
- Sometimes the reward is extremely over-predicted, which causes the cheetah to fall over and not recover
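The idea of searching in a latent action space can be sketched as follows: CEM samples low-dimensional latent vectors, and a decoder maps each one to a full action sequence. Everything here is an illustrative assumption. The decoder is a fixed random linear map with a tanh squashing rather than a learned network, and the elite selection uses a hard top-k, so this sketch is not differentiable; DCEM replaces that step with a differentiable relaxation so the decoder can be trained end to end.

```python
import numpy as np


def make_decoder(latent_dim, horizon, action_dim, seed=0):
    """Hypothetical decoder: random linear map followed by tanh, not a trained model."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((latent_dim, horizon * action_dim))

    def decode(z):
        # tanh keeps every decoded control inside [-1, 1].
        return np.tanh(z @ weights).reshape(horizon, action_dim)

    return decode


def latent_cem(return_fn, decode, latent_dim, n_samples=64, n_elite=8,
               n_iters=10, seed=0):
    """CEM over latent vectors; candidates are scored after decoding to actions."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(latent_dim), np.ones(latent_dim)
    for _ in range(n_iters):
        z = mean + std * rng.standard_normal((n_samples, latent_dim))
        returns = np.array([return_fn(decode(zi)) for zi in z])
        elite = z[np.argsort(returns)[-n_elite:]]  # hard top-k, not differentiable
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return decode(mean)


decode = make_decoder(latent_dim=2, horizon=5, action_dim=2)
latent_plan = latent_cem(lambda u: float(np.sum(u)), decode, latent_dim=2)
```

The search here runs over 2 latent dimensions instead of the 10 dimensions of the full action sequence, which is the main appeal of the approach.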
DCEM over the latent action space + PPO fine-tuning
- The predicted reward is now fine-tuned by PPO and may not match the true reward, but it helps the controller stay in safer parts of the state space and fall less often
Next, we look at the walker.walk task with a frame skip of 2 and show the videos of 10 random evaluation episodes.