SARSA (Sutton and Barto, 2018) is an on-policy method that learns the following Q function
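A standard textbook form of this update is shown below; the step size $\alpha$, discount factor $\gamma$, and transition tuple $(s, a, r, s', a')$ are notational assumptions made here for concreteness:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]
$$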
where a' is the recorded action that was taken in the environment. Notice the difference compared to standard Max Q learning, which uses a max operator and a potentially out-of-distribution action a'.
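For comparison, in the same notation the Max Q target replaces the logged a' with a maximization over actions:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$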
While Max Q learns the Q function for an optimal policy, SARSA learns a Q function for the policy that generated a'. Given these considerations, SARSA is not expected to converge to an optimal policy the way Max Q is; only as π approaches the Max Q policy does Expected SARSA predict optimal Q values. Counter-intuitively, Sutton and Barto (2018) show how this suboptimal policy can be beneficial in the cliff walking scenario. There, Max Q learns an optimal but dangerous policy that walks near the edge of a cliff to reach the goal more quickly. SARSA, on the other hand, learns a safer policy that moves away from the cliff at the expense of a slightly reduced return. Even though Max Q should perform better in theory, in practice the Q function is imperfect and Max Q performs worse than SARSA (Sutton and Barto, 2018).
The figure on the right comes from Sutton and Barto (2018). It shows that SARSA learns a safer policy than Max Q in the cliff walking scenario. Max Q learns an optimal policy that stays close to the cliff edge and therefore reaches the goal more quickly, but a single mistake can send the agent over the edge. Here, SARSA samples uniform actions with probability $0.1$ and behaves greedily otherwise, so it is beneficial to keep a gap between the agent and the edge. The result is a safer but less optimal policy. Even though Max Q should produce more optimal policies in theory, SARSA can outperform Max Q in practice.
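To make the distinction concrete, here is a minimal tabular sketch of the two update rules under the same ε-greedy behavior described above. The grid size, step size, and discount are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Minimal tabular sketch of the SARSA and Max Q (Q-learning) updates under the
# same epsilon-greedy behavior policy. The grid size, step size, and discount
# below are illustrative assumptions, not values from the text.
n_states, n_actions = 48, 4          # e.g. a 4x12 cliff-walking grid
alpha, gamma, epsilon = 0.5, 1.0, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Sample a uniform action with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next, a_next):
    """Bootstrap from the action a_next actually taken by the behavior policy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def max_q_update(s, a, r, s_next):
    """Bootstrap from the greedy action, which may never appear in the data."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

In a SARSA training loop, a_next is sampled with epsilon_greedy before the update is applied, whereas max_q_update needs only the transition (s, a, r, s_next); this is exactly the gap between bootstrapping on the recorded action and bootstrapping on the max-operator action.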
While the SARSA update utilizes actions a' from the dataset, Expected SARSA uses the policy's action distribution instead.
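In the same notation as above, Expected SARSA bootstraps from the expectation of Q under π rather than from the single logged action:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \, Q(s', a') - Q(s, a) \right]
$$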