On this project page, we visualize examples of policies from different parts of the return landscape. For each policy, we report the mean and standard deviation of its post-update return distribution. The examples are selected to illustrate that distributions with similar means can exhibit significantly different variability, corresponding to qualitatively different behaviors learned by the agent.
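As a rough illustration of how such statistics can be estimated, the sketch below evaluates a policy over repeated rollouts and computes the empirical mean and standard deviation of the returns. It assumes a Gymnasium environment and a generic policy(obs) -> action callable, both hypothetical stand-ins; in the actual experiments, the distribution is over returns obtained after a single TD3/SAC update to the policy.

import gymnasium as gym
import numpy as np

def episode_return(env, policy, seed):
    # Roll out one episode and accumulate the undiscounted return.
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

def return_stats(env, policy, n_episodes=100):
    # Empirical mean and standard deviation of the return distribution.
    returns = np.array([episode_return(env, policy, seed=i)
                        for i in range(n_episodes)])
    return returns.mean(), returns.std()

env = gym.make("HalfCheetah-v4")
policy = lambda obs: env.action_space.sample()  # placeholder for a trained policy
mean, std = return_stats(env, policy, n_episodes=10)
print(f"Mean: {mean:.0f}, Standard Deviation: {std:.0f}")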
Halfcheetah (TD3)
Low-variance policy: Mean 5152, Standard Deviation 442
High-variance policy: Mean 5142, Standard Deviation 2042
In Halfcheetah, the primary risk to the agent is flipping onto its back and getting stuck. The policy with the small standard deviation adopts a gait that hugs the ground, preventing it from flipping over, while the policy with the large standard deviation behaves riskily, frequently tipping up into the air until it eventually flips and fails.
Hopper (SAC)
Low-variance policy: Mean 1713, Standard Deviation 117
High-variance policy: Mean 1721, Standard Deviation 820
In Hopper, the policy with the low standard deviation of the post-update return distribution performs a stable, well-balanced, upright gait. The policy with the high standard deviation adopts an unstable, curved gait; we see this policy eventually trip and fall.
Ant (SAC)
Low-variance policy: Mean 2548, Standard Deviation 536
High-variance policy: Mean 2609, Standard Deviation 1659
In Ant, the agent risks flipping over, which terminates the episode. The policy on the right moves too fast without adequately balancing against flips; the resulting occasional failures leave the mean of its post-update return distribution similar to that of the safer policy, but with a much larger standard deviation.
Walker2d (TD3)
Low-variance policy: Mean 2241, Standard Deviation 94
High-variance policy: Mean 2229, Standard Deviation 922
In Walker2d, the agent has to walk without falling. The agent on the left keeps a more upright gait, while the agent on the right leans backwards to take potentially longer strides, an unsafe posture that causes it to fall at the end of the trajectory.