On the model-based stochastic value gradient for continuous RL
State and reward predictions for all tasks
We show predictions of the agent's future states (in blue) and rewards (in red). The ground-truth future states and rewards are shown in black. We argue that even though the dynamics of most MuJoCo locomotion tasks are complex across the agent's entire state space, they are smooth, stable, and easier to model around the optimal trajectories. We perform control on top of these accurate short-horizon rollouts and learn a neural network policy that amortizes the solution to the control optimization problem.
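The short-horizon objective described above can be sketched roughly as follows: roll the learned dynamics model forward a few steps from the current state under the policy, accumulating discounted predicted rewards. The `dynamics`, `reward`, and `policy` functions below are toy hypothetical stand-ins, not the paper's learned networks; this is a minimal sketch of the rollout structure, not the actual method.

```python
# Sketch of an H-step model-based rollout objective. All models below are
# toy placeholders (assumptions), standing in for learned neural networks.
import numpy as np

H, GAMMA = 5, 0.99            # short rollout horizon and discount factor

def dynamics(s, a):           # toy stand-in for the learned model f(s, a)
    return 0.9 * s + 0.1 * a

def reward(s, a):             # toy stand-in for the learned reward model
    return -float(s @ s + 0.01 * a @ a)

def policy(s):                # toy stand-in for the amortized policy pi(s)
    return -0.5 * s

def short_horizon_value(s0):
    """Discounted return of an H-step model rollout from state s0."""
    s, total = s0, 0.0
    for t in range(H):
        a = policy(s)
        total += GAMMA**t * reward(s, a)
        s = dynamics(s, a)
    return total

print(short_horizon_value(np.ones(3)))
```

A policy trained to maximize this model-predicted return amortizes the control optimization, so that at test time a single policy evaluation replaces an online planning loop.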
More detailed videos: How to interpret
Each video frame shows the agent's current state and, at the bottom, the control sequence to be executed on the system (the left-most control is the one that will actually be executed). The panels on the right show the state predictions made by the dynamics model (in blue) if the entire sequence were executed, compared to what would actually happen (in black). We show the reward predictions in red against the true rewards the agent would obtain in gray, and in the "Rewards" box we also show the predicted termination condition.