A Benchmark for Low-Switching-Cost Reinforcement Learning
A common requirement in practical reinforcement learning (RL) applications is that the deployed policy, the one that actually interacts with the environment, cannot change frequently. This setting is called low-switching-cost RL: the goal is to achieve the highest reward while minimizing the number of policy switches during training. A recent trend in theoretical RL research is to develop provably efficient RL algorithms with low switching cost. The core idea in these theoretical works is to measure the information gain and switch the policy when the information gain has doubled. Despite these theoretical advances, none of the existing approaches has been validated empirically. We conduct the first empirical evaluation of different policy switching criteria on popular RL testbeds, including a medical treatment environment, the Atari games, and robotic control tasks. Surprisingly, although information-gain-based methods do recover the optimal rewards, they often lead to a substantially higher switching cost. By contrast, we find that a feature-based criterion, which has been largely ignored in theoretical research, consistently produces the best performance across all domains. We hope our benchmark will bring insights to the community and inspire future research. Our code and complete results can be found at https://sites.google.com/view/low-switching-cost-rl.
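To make the information-gain idea concrete, the following is a minimal sketch of one common instantiation from the linear-feature setting: track a regularized feature covariance matrix and switch the deployed policy whenever its determinant has at least doubled since the last switch. The class name `DeterminantDoublingSwitch`, the regularizer `reg`, and the random features are assumptions made for illustration and are not taken from the benchmark code.

```python
import numpy as np


class DeterminantDoublingSwitch:
    """Hypothetical helper: signal a policy switch when the determinant of the
    feature covariance has at least doubled since the last switch."""

    def __init__(self, dim, reg=1.0):
        # Regularized covariance of the features observed so far (assumed form).
        self.cov = reg * np.eye(dim)
        # Log-determinant recorded at the most recent switch.
        self.logdet_at_switch = np.linalg.slogdet(self.cov)[1]
        self.num_switches = 0

    def observe(self, feature):
        """Add one feature vector; return True if the policy should be switched."""
        self.cov += np.outer(feature, feature)
        logdet = np.linalg.slogdet(self.cov)[1]
        # det doubled  <=>  log det increased by at least log 2.
        if logdet - self.logdet_at_switch >= np.log(2.0):
            self.logdet_at_switch = logdet
            self.num_switches += 1
            return True
        return False


# Illustrative run on synthetic features (not real environment data).
rng = np.random.default_rng(0)
switch = DeterminantDoublingSwitch(dim=4)
steps = 2000
for _ in range(steps):
    switch.observe(rng.normal(size=4))
print(switch.num_switches, "switches in", steps, "steps")
```

Because each switch requires the log-determinant to grow by at least log 2, the number of switches under this rule grows roughly logarithmically with the number of steps, which is what makes such criteria attractive in theory.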
Results on MuJoCo
Results on six MuJoCo control environments. For each environment, the top plot shows the training reward over steps, and the bottom plot shows the switching cost on a log scale.
Results on Atari games (Average results)
Average results on Atari games. We compare different switching criteria across 56 Atari games with 3 million training steps each. The left plot shows the human-normalized reward. The right plot shows the average switching cost, normalized by the switching cost of "none" and shown on a log scale.
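The two normalizations in this caption can be sketched as follows. The human-normalized reward uses the conventional Atari formula (agent score relative to random and human baselines), which is assumed here rather than confirmed by this page; the function names and all numbers are hypothetical and chosen only for illustration.

```python
import math


def human_normalized(agent_score, random_score, human_score):
    """Conventional Atari normalization (assumed): 0 = random play, 1 = human level."""
    return (agent_score - random_score) / (human_score - random_score)


def normalized_switching_cost(cost, cost_none):
    """Switching cost of a criterion relative to the "none" baseline."""
    return cost / cost_none


# Made-up numbers, not results from the benchmark.
reward = human_normalized(agent_score=50.0, random_score=10.0, human_score=110.0)
ratio = normalized_switching_cost(cost=3_000, cost_none=3_000_000)
print(reward, ratio, math.log10(ratio))
```

Plotting `math.log10(ratio)` rather than the raw ratio keeps criteria whose costs differ by orders of magnitude readable on one axis.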