Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations

Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane, Daniel Brown, Ken Goldberg

Abstract

Providing densely shaped reward functions for RL algorithms is often exceedingly challenging, motivating the development of RL algorithms that can learn from easier-to-specify sparse reward functions. This sparsity poses new exploration challenges. One common way to address this problem is to use demonstrations to provide initial signal about regions of the state space with high rewards. However, prior RL-from-demonstrations algorithms introduce significant complexity and many hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor-Critic (MCAC), a parameter-free modification to standard actor-critic algorithms which initializes the replay buffer with demonstrations and computes a modified Q-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high Q-values in the corresponding regions of the state space. Experiments across 5 continuous control domains suggest that MCAC can significantly increase learning efficiency across 6 commonly used RL and RL-from-demonstrations algorithms.

Monte Carlo Augmented Actor-Critic (MCAC)

The idea behind MCAC is to encourage initial optimism in the neighborhood of successful trajectories, and to progressively reduce this optimism during learning so that the agent can continue to explore new behaviors. To operationalize this idea, MCAC introduces two modifications to existing actor-critic algorithms.

  1. Initialize the replay buffer with task demonstrations.

  2. Compute a modified target Q-value for critic updates by taking the maximum of the standard temporal difference (TD) target used in existing actor-critic algorithms and a Monte Carlo estimate of the reward-to-go.
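The second modification can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the NumPy representation, and the fixed discount factor are assumptions, but the core operation, taking the elementwise maximum of the TD target and a discounted Monte Carlo reward-to-go, is exactly the modification described above.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value for illustration)

def mc_reward_to_go(rewards, gamma=GAMMA):
    """Monte Carlo estimate of the discounted reward-to-go at every
    timestep of a completed trajectory, computed by a backward pass."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def mcac_target(td_target, mc_rtg):
    """MCAC critic target: elementwise max of the standard TD target
    and the Monte Carlo reward-to-go for the sampled transitions."""
    return np.maximum(td_target, mc_rtg)
```

For a sparse-reward trajectory with rewards `[0, 0, 1]` and `gamma=0.5`, `mc_reward_to_go` returns `[0.25, 0.5, 1.0]`, so early in training, when the critic's TD targets near the goal are still pessimistic, the Monte Carlo term dominates and propagates reward signal along demonstrated trajectories; once the critic's own estimates exceed the empirical returns, the TD target takes over and standard learning resumes.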

The scatter plots above show Bellman, GQE, and MCAC Q-estimates on the entire replay buffer, including offline demonstrations, for SAC learners with and without the MCAC modification after 50,000 timesteps of training. The agent trained without MCAC fails to learn a Q-function accurate enough to complete the task, while the MCAC estimate still propagates reward signal. The agent trained with MCAC learns a good Q-function and reliably completes the task.

Experiments

We evaluate MCAC on five continuous control domains: a pointmass navigation environment and four high-dimensional robotic control domains. All domains have relatively unshaped reward functions, which indicate only constraint violation, task completion, or completion of a subtask.

We combined MCAC with a variety of baseline RL algorithms and evaluated them on these domains. Overall, we find that MCAC is beneficial to training, especially for SAC and TD3. In particular, in the Pointmass Navigation and Block Lifting environments, TD3 and SAC make almost no progress without MCAC, but learn strong policies with it.

When MCAC is combined with state-of-the-art RL-from-demonstrations algorithms, we find that it also significantly accelerates online exploration across a number of different algorithms.