Hyperparameters in Reinforcement Learning and How to Tune Them

Theresa Eimer  Marius Lindauer  Roberta Raileanu

[Paper] [Code]

TL;DR: Algorithm Configuration tools perform well on Reinforcement Learning, outperforming Grid Search with less than 10% of its budget. If not reported correctly, however, any hyperparameter tuning can heavily skew future comparisons. Adopting the tools and reporting standards from Algorithm Configuration can help make RL research more efficient, accessible and reproducible.

Why Tune in Reinforcement Learning?

Reinforcement Learning is sensitive to its hyperparameters, as anyone who has worked with it can attest. And yet the way hyperparameters are determined in research is usually either not mentioned in papers at all or handled via inefficient Grid Searches.
This is quite impractical, since we found that for SAC, DQN and PPO almost all hyperparameters matter on any given environment (see the large influence of the number of gradient steps for SAC in the first figure below), and their importance varies between environments. If we don't know a domain very well, we therefore probably want to tune as many hyperparameters as we can.
Thankfully, we also see that the interactions between hyperparameters are fairly benign (see GAE lambda versus the learning rate for PPO in the second figure below) and that many hyperparameters have broad ranges in which they perform well. So while we should tune many hyperparameters, tuning them is often not very hard.




Figure: The effect of different numbers of gradient steps for SAC on Brax Ant.

Figure: The relationship between different learning rates (x-axis) and GAE lambda values (y-axis) for PPO on Acrobot.
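
If we want to tune broadly, a natural first step is to write down generous ranges for every hyperparameter that might matter. Below is a minimal sketch of such a search space using the ConfigSpace library; the PPO hyperparameter names and ranges are illustrative assumptions, not the exact spaces used in the paper.

    from ConfigSpace import ConfigurationSpace
    from ConfigSpace.hyperparameters import (
        UniformFloatHyperparameter,
        UniformIntegerHyperparameter,
    )

    # Broad ranges, log-scaled where appropriate; values are illustrative only.
    cs = ConfigurationSpace(seed=0)
    cs.add_hyperparameters([
        UniformFloatHyperparameter("learning_rate", 1e-5, 1e-2, log=True),
        UniformFloatHyperparameter("gae_lambda", 0.8, 1.0),
        UniformFloatHyperparameter("clip_range", 0.1, 0.4),
        UniformFloatHyperparameter("ent_coef", 1e-5, 1e-1, log=True),
        UniformIntegerHyperparameter("n_epochs", 1, 20),
    ])
    print(cs.sample_configuration())

Tools from the Algorithm Configuration community, such as SMAC and DEHB, typically consume search spaces in exactly this format.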

Big Efficiency Gains!

We see improvements for almost all of the HPO tools we tested, even on a small budget of 16 runs for the 11 hyperparameters we tuned for PPO on Brax. With 64 runs, DEHB outperforms the baseline on every environment in terms of the mean final evaluation reward across three tuning runs; in Grid Search terms, a budget of 64 would only cover 3 hyperparameters with 4 values each.
On ProcGen, the Grid Search baseline used 810 runs, and DEHB still outperforms it or performs within its standard deviation on every environment with less than 10% of that budget.
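
To see how stark this difference is, here is a quick back-of-the-envelope comparison using only the numbers quoted above (a small Python sketch, nothing from the paper's code):

    # Grid Search cost grows exponentially with the number of tuned hyperparameters.
    n_values = 4
    print(n_values ** 3)   # 64 runs: DEHB's budget, viewed as a grid over only 3 hyperparameters
    print(n_values ** 11)  # 4,194,304 runs: a full grid over all 11 tuned PPO hyperparameters
    print(64 / 810)        # ~0.08: e.g. a 64-run budget is well under 10% of the 810-run ProcGen grid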

But: Big Discrepancies Between Tuning and Test Seeds...

Trying the incumbent hyperparameters on fresh test seeds shows just how big these discrepancies can be. While DEHB performs well overall, Random Search, for example, does not scale well and overfits: its performance drops to close to zero on the test seeds despite a good incumbent performance. The incumbent performance is completely misleading in this case, and future researchers should probably not compare against this configuration at all, since it clearly overperforms on the tuning seeds and underperforms on the test seeds.

But How Will I Know Which Seeds to Compare On?

That's a very good question, and currently it is almost impossible to answer! Even if the authors of a paper tested their hyperparameter configurations and did everything right, if they don't report their tuning seeds, comparisons can be heavily skewed simply by choosing the wrong seeds by chance.

To make sure you include all relevant details in your paper, use our checklist for reproducible RL research with HPO!
In the same repository, you'll find easy-to-use Hydra sweepers for some of the HPO methods we test here, including our best-performing one, DEHB. This should make it easy for you to get better hyperparameters faster.
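
As a rough sketch of how such a sweeper is used (the exact sweeper names and config keys come from the repository, so treat the details below as assumptions): you write a standard Hydra task function that returns the cost metric, and the sweeper optimizes over it when launched in multirun mode.

    import hydra
    from omegaconf import DictConfig

    @hydra.main(config_path="configs", config_name="ppo", version_base=None)
    def train(cfg: DictConfig) -> float:
        # Replace this placeholder with your actual training + evaluation loop;
        # the sweeper treats the returned value as the objective to optimize.
        eval_reward = 0.0  # placeholder: train with cfg's hyperparameters and evaluate here
        return eval_reward

    if __name__ == "__main__":
        train()

Launching with python train.py -m then runs the configured sweeper over the search space defined in your config; the config names above are placeholders, not the repository's actual files.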

Our Recipe for Efficient RL Research!

1. Define a training and test setting (including e.g. environment(s) and variations, seeds, initial state distributions,...);

2. Define a configuration space with all hyperparameters that likely contribute to training success;

3. Decide which HPO method to use;

4. Define the constraints for the HPO method, i.e. the tuning budget;

5. Settle on a cost metric – this will ideally be an evaluation reward across as many episodes as needed for a reliable performance estimate;

6. Run this HPO method on the training set across a number of tuning seeds;

7. Evaluate the resulting incumbent configurations on the test set across a number of separate test seeds and report the results.
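
To make the recipe concrete, here is a minimal end-to-end sketch in Python. It uses plain random search as a stand-in for a proper HPO tool like DEHB, and the hyperparameter ranges, seed values and the train_and_eval placeholder are illustrative assumptions, not our actual setup.

    import random
    from statistics import mean

    # Step 1: training and test setting with disjoint seed sets (values illustrative).
    TUNING_SEEDS = [0, 1, 2]
    TEST_SEEDS = [10, 11, 12, 13, 14]

    # Step 2: a broad configuration space for everything that might matter.
    def sample_config(rng: random.Random) -> dict:
        return {
            "learning_rate": 10 ** rng.uniform(-5, -2),
            "gae_lambda": rng.uniform(0.8, 1.0),
            "clip_range": rng.uniform(0.1, 0.4),
        }

    # Step 5: the cost metric is the mean evaluation reward over enough episodes.
    def train_and_eval(config: dict, seed: int, n_episodes: int = 10) -> float:
        # Placeholder so the sketch runs end to end; replace with real RL training
        # and evaluation of the resulting policy over n_episodes.
        rng = random.Random(seed)
        return -1e4 * abs(config["learning_rate"] - 3e-4) + rng.uniform(-1.0, 1.0)

    # Steps 3, 4 and 6: random search (a stand-in for DEHB) with a fixed budget,
    # where each candidate configuration is scored across all tuning seeds.
    def tune(budget: int = 16) -> dict:
        rng = random.Random(0)
        best_config, best_score = None, float("-inf")
        for _ in range(budget):
            config = sample_config(rng)
            score = mean(train_and_eval(config, seed=s) for s in TUNING_SEEDS)
            if score > best_score:
                best_config, best_score = config, score
        return best_config

    # Step 7: evaluate the incumbent on separate test seeds and report that number.
    incumbent = tune()
    test_reward = mean(train_and_eval(incumbent, seed=s) for s in TEST_SEEDS)
    print(f"Incumbent: {incumbent}\nTest reward: {test_reward:.2f}")

Reporting both the tuning seeds and the test seeds alongside this final number is what makes the comparison reproducible for others.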