DCEM

The Differentiable Cross Entropy Method (DCEM)

Each frame of these videos is solving a model-based control optimization problem of the form
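One generic way to write such a problem, given as a sketch under assumed notation rather than as the authors' exact formulation, is shown below. The horizon $H$, per-step cost $C_t$ (for example, the negative predicted reward), learned transition model $f$, initial state $x_{\text{init}}$, and control bounds $\underline{u}, \overline{u}$ are all assumptions introduced for illustration.

```latex
\begin{aligned}
u_{1:H}^\star = \operatorname*{argmin}_{u_{1:H}} \; & \sum_{t=1}^{H} C_t(x_t, u_t) \\
\text{subject to} \; & x_1 = x_{\text{init}}, \\
& x_{t+1} = f(x_t, u_t), \quad t = 1, \dots, H-1, \\
& \underline{u} \le u_t \le \overline{u}, \quad t = 1, \dots, H.
\end{aligned}
```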

Each frame then shows the following information:

We start with the cheetah.run task from the DeepMind control suite with a frame skip of 4 and show the videos of 10 random evaluation episodes.
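A frame skip of 4 is commonly implemented by repeating each chosen action for 4 consecutive simulator steps and accumulating the rewards. The sketch below illustrates that convention with a hypothetical toy environment. `CounterEnv` and `FrameSkip` are illustrative names, not part of the DeepMind control suite API, and summing rewards over the skipped frames is an assumption about the setup used here.

```python
from typing import Tuple


class CounterEnv:
    """Hypothetical toy environment: the observation is a step counter
    and every simulator step yields a reward of 1.0."""

    def __init__(self, max_steps: int = 10) -> None:
        self.max_steps = max_steps
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: float) -> Tuple[int, float, bool]:
        self.t += 1
        done = self.t >= self.max_steps
        return self.t, 1.0, done


class FrameSkip:
    """Repeat each action for `skip` simulator steps and sum the rewards."""

    def __init__(self, env: CounterEnv, skip: int) -> None:
        assert skip >= 1
        self.env = env
        self.skip = skip

    def reset(self) -> int:
        return self.env.reset()

    def step(self, action: float) -> Tuple[int, float, bool]:
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop early if the episode ends mid-skip
                break
        return obs, total_reward, done


env = FrameSkip(CounterEnv(max_steps=10), skip=4)
env.reset()
first = env.step(0.0)   # advances the simulator by 4 steps
second = env.step(0.0)  # advances it by 4 more
third = env.step(0.0)   # only 2 steps remain before the episode ends
```

With this wrapper the controller makes one decision per 4 simulator steps, which shortens the effective planning horizon measured in decisions.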

We are currently not normalizing the observation space.
The model often over-predicts the reward, which indicates that the controller has found an action sequence that generates unrealistically high expectations. Because of the online training process, however, the learned model is at a point where the over-predictions are usually not too harmful.
The action sequences are closer to the zero control sequence than those DCEM learns.
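For reference, the kind of vanilla CEM planner these observations refer to can be sketched as below. This is a minimal illustration, not the authors' implementation and not DCEM itself: the function name `cem_plan`, the hyperparameters, the scalar action per time step, and the toy quadratic cost (which stands in for a rollout through a learned model) are all assumptions.

```python
import random
import statistics
from typing import Callable, List, Sequence


def cem_plan(
    cost_fn: Callable[[Sequence[float]], float],
    horizon: int,
    n_samples: int = 100,
    n_elite: int = 10,
    n_iters: int = 20,
    low: float = -1.0,
    high: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Vanilla CEM over a sequence of scalar actions (illustrative only)."""
    rng = random.Random(seed)
    mean = [0.0] * horizon  # start from the zero control sequence
    std = [1.0] * horizon
    for _ in range(n_iters):
        # Sample candidate action sequences and clip them to the bounds.
        samples = [
            [min(high, max(low, rng.gauss(mean[t], std[t]))) for t in range(horizon)]
            for _ in range(n_samples)
        ]
        # Keep the lowest-cost candidates as the elite set.
        elites = sorted(samples, key=cost_fn)[:n_elite]
        # Refit an independent Gaussian per time step to the elites.
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-6
    return mean


def toy_cost(actions: Sequence[float]) -> float:
    """Hypothetical cost whose minimizer is u_t = 0.5 at every step."""
    return sum((u - 0.5) ** 2 for u in actions)


plan = cem_plan(toy_cost, horizon=5)
```

In a model-based setting, `cost_fn` would instead roll the candidate sequence through the learned dynamics and return the negative predicted reward, which is where over-prediction by the model can mislead the planner.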

cheetah-cem.mp4

The controls are much closer to the boundaries.
Sometimes the reward is extremely over-predicted, which causes the cheetah to fall over and not recover.

cheetah-dcem.mp4

The reward is now fine-tuned by PPO and may not match the true reward, but it helps the controller stay in safer parts of the state space and fall less often.

cheetah-dcem-ppo.mp4

Next we look at the walker.walk task with a frame skip of 2 and show the videos of 10 random evaluation episodes.

walker-cem.mp4

walker-dcem.mp4

walker-dcem-ppo.mp4
