On this project page, we visualize examples of policies from different parts of the return landscape. For each policy, we report the mean and standard deviation of its post-update return distribution. The examples are selected to illustrate that distributions with similar means can exhibit significantly different variability, corresponding to qualitatively different behaviors learned by the agent.
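As a rough illustration of how such statistics can be estimated, the sketch below evaluates a policy over repeated rollouts and computes the empirical mean and standard deviation of the returns. It assumes a Gymnasium environment and a generic policy(obs) -> action callable, both hypothetical stand-ins; in the actual experiments, the distribution is over returns obtained after a single TD3/SAC update to the policy.

import gymnasium as gym
import numpy as np

def episode_return(env, policy, seed):
    # Roll out one episode and accumulate the undiscounted return.
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

def return_stats(env, policy, n_episodes=100):
    # Empirical mean and standard deviation of the return distribution.
    returns = np.array([episode_return(env, policy, seed=i)
                        for i in range(n_episodes)])
    return returns.mean(), returns.std()

env = gym.make("HalfCheetah-v4")
policy = lambda obs: env.action_space.sample()  # placeholder for a trained policy
mean, std = return_stats(env, policy, n_episodes=10)
print(f"Mean: {mean:.0f}, Standard Deviation: {std:.0f}")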
Halfcheetah (TD3)
Low-variance policy: Mean 5152, Standard Deviation 442
High-variance policy: Mean 5142, Standard Deviation 2042
In Halfcheetah, the primary risk to the agent is flipping onto its back and getting stuck. The policy with the small standard deviation adopts a gait that hugs the ground, preventing it from flipping over, while the policy with the large standard deviation behaves riskily, frequently tipping up into the air until it eventually flips and fails.
Hopper (SAC)
Low-variance policy: Mean 1713, Standard Deviation 117
High-variance policy: Mean 1721, Standard Deviation 820
In Hopper, the policy with the low standard deviation of the post-update return distribution performs a stable, well-balanced, upright gait. The policy with the high standard deviation adopts an unstable, curved gait; we see this policy eventually trip and fall.
Ant (SAC)
Low-variance policy: Mean 2548, Standard Deviation 536
High-variance policy: Mean 2609, Standard Deviation 1659
In Ant, the agent risks flipping over, which terminates the episode. The policy on the right moves too fast without adequately balancing against flips; the resulting occasional failures leave the mean of its post-update return distribution similar to that of the safer policy, but with a much larger standard deviation.
Walker2d (TD3)
Low-variance policy: Mean 2241, Standard Deviation 94
High-variance policy: Mean 2229, Standard Deviation 922
In Walker2d, the agent has to walk without falling. The agent on the left keeps a more upright gait, while the agent on the right leans backwards to take potentially longer strides, an unsafe posture that causes it to fall at the end of the trajectory.