Improving Actor-Critic by Simplifying Q-Landscape

tl;dr: Actors often fail to find the action that maximizes the Q-value

In value-based actor-critic reinforcement learning, the actor is trained to maximize the critic (Q-function) via gradient ascent. However, in complex tasks like dexterous manipulation, the Q-function landscape has several locally optimal actions. This makes the actor susceptible to getting stuck at local optima, leading to sample-inefficient training and a suboptimal policy at convergence. In this paper, we aim to build actor agents that reliably find actions with better Q-values. To this end, we develop a novel algorithm that progressively eliminates local optima by repeatedly simplifying the landscape of the Q-function. In a diverse set of tasks ranging from restricted locomotion and dexterous manipulation to recommender systems with large discrete action spaces, we demonstrate that our approach finds optimal actions more frequently than alternative actor architectures, thereby achieving higher overall task performance.
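The failure mode described above can be reproduced on a toy problem. The sketch below is only an illustration: the 1-D function `q_value`, the action range, the step size, and the starting point are all hypothetical choices, not taken from the paper. Plain gradient ascent started near the lower peak stays there even though a much better action exists.

```python
import numpy as np

def q_value(a):
    """Hypothetical 1-D Q-landscape with two interior peaks on [-2, 2]."""
    return np.sin(3.0 * a) + 0.5 * a

def q_grad(a):
    """Analytic derivative of q_value with respect to the action."""
    return 3.0 * np.cos(3.0 * a) + 0.5

def gradient_ascent(a0, lr=0.01, steps=2000, low=-2.0, high=2.0):
    """Follow the Q-gradient from a0, keeping the action inside [low, high]."""
    a = a0
    for _ in range(steps):
        a = np.clip(a + lr * q_grad(a), low, high)
    return float(a)

# Reference answer from a dense grid search (only feasible in this toy setting).
grid = np.linspace(-2.0, 2.0, 4001)
best_a = float(grid[np.argmax(q_value(grid))])

# Gradient ascent started near the lower peak converges to that peak.
local_a = gradient_ascent(a0=-1.5)

print(f"gradient ascent: a={local_a:.3f}, Q={q_value(local_a):.3f}")
print(f"grid search:     a={best_a:.3f}, Q={q_value(best_a):.3f}")
```

In this toy landscape the gap between the two Q-values is large, which is the kind of suboptimality that a gradient-following actor can carry into the learned policy.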

Insight: Successively prune the value-function surface to make optimization by the actor easier.
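One plausible way to picture this pruning is to clip the Q-landscape from below at the value of an action that has already been found, so that every peak at or below that value is flattened. The text above does not spell out the construction, so the clipping rule, the toy function `q_value`, and the helper `strict_peaks` in this sketch are assumptions for illustration only.

```python
import numpy as np

def q_value(a):
    """Hypothetical 1-D Q-landscape with two interior peaks on [-2, 2]."""
    return np.sin(3.0 * a) + 0.5 * a

def strict_peaks(values):
    """Indices of interior grid points strictly higher than both neighbours."""
    mid = values[1:-1]
    return np.flatnonzero((mid > values[:-2]) & (mid > values[2:])) + 1

grid = np.linspace(-2.0, 2.0, 4001)
q = q_value(grid)
peaks = strict_peaks(q)

# Stand-in for the action a previous actor got stuck at: the lowest peak.
prev_idx = peaks[np.argmin(q[peaks])]

# Assumed pruning rule: raise everything below Q(previous action) to that value.
surrogate = np.maximum(q, q[prev_idx])

n_before = len(peaks)
n_after = len(strict_peaks(surrogate))
print(f"interior peaks before pruning: {n_before}, after pruning: {n_after}")
```

The pruned surface keeps the same maximum as the original one, but the lower peak becomes part of a flat region. Flat regions carry no gradient signal on their own, so a practical method would still need some way to produce useful gradients there; this sketch does not address that.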

Architecture: Successive actors maximize their surrogate landscapes and output better actions.
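A minimal sketch of how a chain of actors could yield better actions, assuming the final action is chosen by evaluating every actor's proposal under the true Q-function and keeping the best one. That selection rule, the toy `q_value`, and the hard-coded `proposals` are hypothetical and stand in for learned components.

```python
import numpy as np

def q_value(a):
    """Hypothetical 1-D Q-landscape used only for this illustration."""
    return np.sin(3.0 * a) + 0.5 * a

def select_action(candidates):
    """Return the candidate action with the highest Q-value."""
    candidates = np.asarray(candidates, dtype=float)
    return float(candidates[np.argmax(q_value(candidates))])

# Hypothetical proposals from a chain of three successive actors.
proposals = [-1.515, 2.0, 0.579]

# Best achievable Q-value when only the first k actors are used.
best_q_by_chain_length = [
    float(q_value(select_action(proposals[:k])))
    for k in range(1, len(proposals) + 1)
]
print(best_q_by_chain_length)
```

Under this selection rule, adding an actor to the chain can never lower the selected Q-value, and the value stops improving once some actor already proposes the best action.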

Quantitative Results

SAVO outperforms the baselines across discrete and continuous action space environments.

Qualitative Results

SAVO learns successive surrogate Q-value landscapes that have fewer local optima and are thus easier to optimize.

The improvement due to SAVO keeps increasing with longer chains of successive actors and surrogates until it saturates.

Adroit Results

SAVO improves sample efficiency on Adroit dexterous manipulation tasks over two algorithms: TD3 and REDQ.
