A Guide to Reward Shaping and Reward Design in Value-Based DRL


Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, Bolei Zhou

hs789@cam.ac.uk


[NeurIPS'22 Site]      [Code]      [Camera Ready]      [Open Review]      [BibTeX]

TL;DR

How should you design the reward function for a reinforcement learning task? Is there principled guidance for reward design? Can you improve performance on your RL task simply by changing the reward?

There is a simple and effective way to do reward shaping / reward engineering in value-based Deep RL!

The simplest form of reward shaping, a constant reward shift, is shown to be effective both for boosting exploration and for conservative exploitation, as sketched below.
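To make this concrete, here is a minimal sketch of a constant reward shift implemented as an environment wrapper. It assumes the Gymnasium API; the wrapper name, the `shift` hyperparameter, and the example environment are illustrative choices, not part of the paper's released code.

```python
import gymnasium as gym


class RewardShift(gym.RewardWrapper):
    """Add a constant c to every reward the agent observes: r'(s, a) = r(s, a) + c."""

    def __init__(self, env, shift):
        super().__init__(env)
        self.shift = shift

    def reward(self, reward):
        return reward + self.shift


# With a negative shift, a near-zero-initialized Q-network is optimistic
# relative to the (shifted) true values, which encourages exploration;
# with a positive shift it is pessimistic, which encourages conservative
# exploitation.
env = RewardShift(gym.make("CartPole-v1"), shift=-1.0)
```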


Abstract

In this work, we study the simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting in the form of a linear transformation is equivalent to changing the initialization of the Q-function in function approximation. Based on this equivalence, we obtain the key insight that a positive reward shift leads to conservative exploitation, while a negative reward shift leads to curiosity-driven exploration. Accordingly, conservative exploitation improves offline RL value estimation, and optimistic value estimation improves exploration for online RL. We validate this insight on a range of RL tasks and show improvements over baselines: (1) in offline RL, conservative exploitation leads to improved performance on top of off-the-shelf algorithms; (2) in online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) in discrete control tasks, a negative reward shift yields an improvement over curiosity-based exploration methods.
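To see why a constant shift is equivalent to changing the value initialization, consider an infinite-horizon discounted MDP with discount factor γ and shift every reward by a constant c. The following derivation is a standard sketch in our own notation, not copied from the paper:

```latex
% Shift every reward by a constant c: r'(s,a) = r(s,a) + c.
% Under any policy \pi, the shifted action-value differs by a constant:
\begin{aligned}
Q'^{\pi}(s,a)
  &= \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t}\bigl(r(s_t,a_t) + c\bigr) \,\middle|\, s_0 = s,\ a_0 = a\right] \\
  &= Q^{\pi}(s,a) + \frac{c}{1-\gamma}.
\end{aligned}
% Learning Q' with a (near-)zero-initialized approximator therefore behaves like
% learning the original Q with its initialization biased by -c/(1-\gamma):
% c < 0 acts as an optimistic initialization (exploration),
% c > 0 acts as a pessimistic initialization (conservative exploitation).
```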

Key Insight: 

A positive reward shift leads to conservative exploitation, and

a negative reward shift leads to curiosity-driven exploration.
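As a concrete illustration of where the shift enters a value-based update, below is a minimal DQN-style TD loss in PyTorch. The networks and batch tensors are hypothetical placeholders; this is a sketch of the idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def shifted_td_loss(q_net, target_net, batch, gamma=0.99, shift=0.0):
    """One-step TD loss with a constant reward shift applied before bootstrapping."""
    states, actions, rewards, next_states, dones = batch
    shifted_rewards = rewards + shift  # shift < 0: explore; shift > 0: be conservative
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = shifted_rewards + gamma * (1.0 - dones) * next_q
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, targets)
```

Since the shift offsets every converged Q-value by the same constant c/(1-γ), the greedy policy is unchanged at the optimum; the effect is on the learning dynamics through the implicit initialization bias.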

Presentation

To appear

Poster

Slides

RewardShifting.pptx