In value-based actor-critic reinforcement learning, the actor is trained to maximize the critic (Q-function) via gradient ascent. However, in complex tasks such as dexterous manipulation, the Q-function landscape has several locally optimal actions. This makes the actor susceptible to getting stuck at local optima, leading to sample-inefficient training and a suboptimal policy at convergence. In this paper, we aim to build actors that reliably find actions with better Q-values. To this end, we develop a novel algorithm that progressively eliminates local optima by repeatedly simplifying the landscape of the Q-function. On a diverse set of tasks, ranging from restricted locomotion and dexterous manipulation to recommender systems with large discrete action spaces, we demonstrate that our approach finds optimal actions more frequently than alternative actor architectures, thereby achieving higher overall task performance.
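The local-optimum problem described above can be illustrated with a minimal sketch. The one-dimensional Q-function, the step size, and the starting actions below are hypothetical choices made only for illustration; they are not taken from any task in this work.

```python
import math


def q_value(a):
    """Hypothetical 1-D Q-landscape: a lower peak near a = -1, a higher peak near a = +1."""
    return 0.5 * math.exp(-(a + 1.0) ** 2 / 0.2) + 1.0 * math.exp(-(a - 1.0) ** 2 / 0.2)


def grad(fn, a, h=1e-5):
    """Central finite-difference estimate of d fn / d a."""
    return (fn(a + h) - fn(a - h)) / (2.0 * h)


def gradient_ascent(fn, a0, lr=0.05, steps=500):
    """Plain gradient ascent on fn, starting from action a0."""
    a = a0
    for _ in range(steps):
        a += lr * grad(fn, a)
    return a


# Starting in the basin of the lower peak, ascent converges to the local optimum.
a_stuck = gradient_ascent(q_value, a0=-0.8)
# Starting in the basin of the higher peak, ascent reaches the global optimum.
a_good = gradient_ascent(q_value, a0=0.8)

print(f"start -0.8 -> a = {a_stuck:.3f}, Q = {q_value(a_stuck):.3f}")
print(f"start +0.8 -> a = {a_good:.3f}, Q = {q_value(a_good):.3f}")
```

In this toy setting, the outcome of gradient ascent depends entirely on which basin the starting action falls into, which is the failure mode a gradient-trained actor can inherit.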
SAVO outperforms the baselines across discrete and continuous action space environments.
SAVO learns successive surrogate Q-value landscapes that have fewer local optima and are thus easier to optimize.
The improvement from SAVO grows with longer chains of successive actors and surrogates, until it saturates.
SAVO improves the sample efficiency of two algorithms, TD3 and REDQ, on Adroit dexterous manipulation tasks.
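To make the idea of a surrogate landscape with fewer local optima concrete, the sketch below uses one possible simplification, assumed here purely for illustration: flooring the Q-function at the value of an action that has already been found. The toy Q-function, the helper names, and the grid-based count of local maxima are all hypothetical and should not be read as the exact construction used by SAVO.

```python
import math


def q_value(a):
    """Hypothetical 1-D Q-landscape: a lower peak near a = -1, a higher peak near a = +1."""
    return 0.5 * math.exp(-(a + 1.0) ** 2 / 0.2) + 1.0 * math.exp(-(a - 1.0) ** 2 / 0.2)


def make_surrogate(q_fn, anchor_action):
    """Assumed simplification: floor Q at the value of an already-found action."""
    floor = q_fn(anchor_action)
    return lambda a: max(q_fn(a), floor)


def count_local_maxima(fn, grid, tol=1e-9):
    """Count grid points whose value exceeds both neighbours by more than tol."""
    values = [fn(a) for a in grid]
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] + tol and values[i] > values[i + 1] + tol
    )


# Evenly spaced candidate actions in [-2, 2].
grid = [i / 100.0 for i in range(-200, 201)]

# Anchor the surrogate at the locally optimal action a = -1.
surrogate = make_surrogate(q_value, anchor_action=-1.0)

n_q = count_local_maxima(q_value, grid)
n_s = count_local_maxima(surrogate, grid)
print(f"local maxima in Q: {n_q}, in surrogate: {n_s}")
```

Under this assumption, the lower peak is flattened into a plateau, so the only remaining strict maximum is the higher peak. Note that a hard floor has zero gradient on the plateau, so a practical method would presumably need a smooth, learned approximation of such a surrogate rather than this literal construction.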