Active offline policy selection

Abstract

We address the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains among others.

Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between OPE estimates and full online evaluation in the real environment. At the same time, a large amount of online interaction is often not feasible in practice.

To overcome this problem, we introduce active offline policy selection — a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm start the online evaluation. Then, in order to utilize the limited environment interactions wisely, it relies on a Bayesian optimization method, with a kernel function that represents policy similarity, to decide which policy to evaluate next.

We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and purely online policy evaluation. We also show that our approach is successful on a real robot.

Active offline policy selection

Offline policy selection

Active offline policy selection (A-ops) as Bayesian optimisation

In our approach we advance a solution based on Bayesian optimisation (BO). It entails learning a Gaussian process (GP) surrogate function that maps policies to their expected returns. We then use the GP statistics to construct an acquisition function that decides which policy to test next. To make BO successful in this problem setting, our approach has two key features. First, we incorporate existing OPE estimates as additional noisy observations. This allows us to warm start online policy evaluation and to overcome the difficulties of GP hyper-parameter optimisation at the start. Second, we model the correlation between policies through a kernel function based on the actions that the policies take in the same states of the environment. This makes our method data efficient, as information about the performance of one policy informs us about the performance of similarly behaving policies without costly execution in the environment. It is particularly valuable when the number of candidate policies is large (or even larger than the interaction budget).
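To make this concrete, below is a minimal sketch of the selection loop under several assumptions: a precomputed policy kernel matrix (see the next section), OPE estimates treated as one noisy observation per policy, a UCB acquisition rule, and a hypothetical execute_episode callable for online rollouts. The noise levels, the exploration coefficient beta and the final recommendation rule are illustrative choices, and GP hyper-parameter fitting is omitted.

    import numpy as np

    def a_ops_loop(ope_estimates, kernel_matrix, execute_episode, budget,
                   beta=2.0, ope_noise=1.0, return_noise=1.0):
        """Illustrative sketch of the A-ops loop; names and constants are assumptions."""
        n = len(ope_estimates)
        obs_idx = list(range(n))        # OPE provides one noisy observation per policy
        obs_val = list(ope_estimates)
        obs_var = [ope_noise ** 2] * n

        def posterior():
            # Standard GP regression with per-observation noise.
            K_oo = kernel_matrix[np.ix_(obs_idx, obs_idx)] + np.diag(obs_var)
            K_xo = kernel_matrix[:, obs_idx]
            mu = K_xo @ np.linalg.solve(K_oo, np.array(obs_val))
            cov = kernel_matrix - K_xo @ np.linalg.solve(K_oo, K_xo.T)
            return mu, np.sqrt(np.clip(np.diag(cov), 1e-12, None))

        for _ in range(budget):
            mu, sigma = posterior()
            i = int(np.argmax(mu + beta * sigma))   # UCB: most promising policy next
            obs_idx.append(i)
            obs_val.append(execute_episode(i))      # spend one online episode on policy i
            obs_var.append(return_noise ** 2)

        mu, _ = posterior()
        return int(np.argmax(mu))                   # recommend the highest posterior mean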

Policy kernel

A key component of the GP model is the kernel, which encodes our belief about policy correlation. To obtain a kernel we make a key assumption: policies that are similar in the actions that they take yield similar returns. Our insight is then to measure the distance between policies through the actions that each of them takes on a fixed set of states from the offline dataset.
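Under the assumption of continuous actions, an average Euclidean action distance and an RBF-style kernel (the exact distance and hyper-parameters may differ from the paper's), this idea could be instantiated roughly as follows:

    import numpy as np

    def policy_kernel(policies, states, scale=1.0):
        """Sketch of an action-based policy kernel; the distance and kernel form are assumptions.

        policies : list of callables, policies[i](states) -> actions, shape (n_states, action_dim)
        states   : fixed set of states sampled from the offline dataset, shape (n_states, state_dim)
        """
        actions = [np.asarray(pi(states)) for pi in policies]
        n = len(policies)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                # Distance between two policies = average distance between the actions
                # they take on the same offline states.
                d = np.linalg.norm(actions[i] - actions[j], axis=-1).mean()
                dist[i, j] = dist[j, i] = d
        # Similar actions -> small distance -> high covariance between policy values.
        return np.exp(-dist ** 2 / (2.0 * scale ** 2))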

Quantitative results

We conduct a set of experiments on several environments from dm-control (9 environments), MPG (4 environments) and Atari (3 environments). Between 60 and 250 policies are trained for each task. Policies are learnt either from state observations (dm-control) or from vision (MPG and Atari).

We show the results averaged over the 9 dm-control environments. In each environment we conduct 100 experiments with 50 sampled policies. Each component of our method is important: combining offline and online evaluation, using Bayesian optimisation to select the most promising policies, and modelling the correlation between policies with Gaussian processes. We report the average result and the standard deviation of the average.
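The regret reported throughout these plots can be read as simple regret, i.e. the gap between the best candidate policy and the policy that is finally recommended within the budget:

    \text{regret} \;=\; \max_{k \in \{1, \dots, K\}} \mu_{\pi_k} \;-\; \mu_{\pi_{\hat{k}}}

where \mu_{\pi_k} is the true expected return of candidate policy \pi_k, K is the number of candidates and \hat{k} is the index of the policy recommended by the selection procedure.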

The full ablation design studies each component of our A-ops method: the use of OPE estimates as a starting point for online evaluation (OPE), active sampling to decide which policy to evaluate next (UCB vs Uniform), and the use of a policy kernel to model correlation between policies (GP vs Ind).

The main comparison is with the strategies that are currently used in practice: fully offline evaluation (OPE) and fully online evaluation (Ind+Uniform). In all three domains the A-ops method outperforms them by a noticeable margin. In dm-control we use 50 policies, and in the MPG and Atari domains we sample 200 policies in each experiment.

In the detailed ablations, we again see that each component is important: 1) The use of OPE (with OPE in the first row and without in the second). 2) The choice of policy model: GP (purple and red) and independent (green and blue). 3) The policy selection strategy: active (dark, solid line) and uniform sampling (lighter, dashed line).

How important is it to incorporate OPE estimates?

Across all the experiments, incorporating OPE estimates (top row) always considerably improves the results compared to starting policy selection from scratch (bottom row).

How informative is our kernel?

The kernel is a key ingredient for improving data efficiency. Purple and red lines use GP as the policy model and green and blue use the Ind model. In the vast majority of settings (11 out of 12) the use of a kernel significantly improves the results, and in the remaining setting the two perform on par.

How important is the selection strategy?

The active selection of policies is generally beneficial for identifying a good policy. For this, compare the dashed lines (Uniform policy selection) with the solid lines (UCB). Using Ind+UCB yields a high regret in the initial exploration stage, but it improves substantially over Ind+Uniform later. Moreover, incorporating OPE estimates and a kernel (resulting in A-ops) significantly shortens the exploration stage.
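For reference, the UCB acquisition mentioned here scores each candidate policy by an optimistic estimate of its value; a standard form (the exploration coefficient schedule is an assumption) is

    u_k \;=\; \mu_k + \beta \, \sigma_k

where \mu_k and \sigma_k are the GP posterior mean and standard deviation of the return of policy \pi_k and \beta trades off exploration against exploitation; the policy with the highest u_k is executed next.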

Real robot results

We demonstrate that our approach can be applied to policy selection on a real robot without any modification.

OPE with Fitted Q-evaluation works very well for the real robot policies. It is able to rank both offline RL policies and sim2real policies.

A-ops helps to identify a good policy very quickly: after no more evaluations than are usually needed for the full online evaluation of just two policies.

Qualitative results

We show which policies are selected for evaluation by online policy selection and by A-ops.

The initial OPE scores are shown in light blue, the selected policy is highlighted with a magenta star, the observation of the currently executed policy is shown as an orange circle, past observations are shown in pink, and the prediction with its standard deviation is shown in purple. For visualisation purposes, we use only 20 policies and order them by their ground-truth value (GT, not observed by the algorithm).

Online policy selection

With simple online policy evaluation, many interactions with the environment are wasted: every policy is treated independently and all policies are sampled uniformly, including unpromising ones. We can see that the policy evaluations (pink dots) are scattered uniformly.

A-ops

With A-ops, interactions with the environment are used wisely and the algorithm is very data efficient. Firstly, OPE scores provide a sensible starting point. Secondly, thanks to the UCB criterion, A-ops focuses on the most promising policies. Finally, when a single policy is evaluated, its return is also informative about policies that take similar actions. We can see that the policy evaluations (pink dots) cluster around well-performing policies, so the limited interaction budget is spent wisely.

The A-ops method scales well as the number of policies grows.

We varied the number of policies in the dataset between 25 and 200 in 4 tasks of the MPG domain. While the performance of simple online policy selection degrades significantly as the number of policies increases, the performance of A-ops remains almost constant.

A-ops can work with various OPE methods.

Starting from OPE estimates of varying quality, A-ops improves on their regret after 50 policy executions, as shown in the figure. However, when the OPE quality is very low, it might be more beneficial to ignore the estimates, as A-ops without them (GP+UCB) may do better.

Results by task

We show the results of all ablation combinations on each of the environments separately, comparing our proposed method A-ops, completely offline policy selection with OPE, and completely online selection with Ind+Uniform.

Dm-control

A-ops does as well as or better than both offline policy selection (OPE) and online policy selection on 9 of 9 dm-control suite tasks. Depending on the quality of the initial OPE values and the variance of the policy returns, A-ops may take a different number of trajectories before it outperforms all the baselines, but usually it only takes a few steps. Eventually, A-ops reaches the best performance within the limited budget.

MPG

OPE performs exceedingly well in 3 of 4 tasks, getting regret close to zero in 2 of them. Nevertheless, A-ops performs about as well or better on all of the tasks: in 2 environments it only approaches the OPE baseline, but in the other 2 it quickly surpasses OPE. It makes the most improvement in the slide task. The most important observation, however, is that A-ops achieves a small regret in all environments.

Atari

Due to the high variance of policy returns in this domain, an online policy evaluation method needs a large number of environment interactions to provide accurate estimates, and in all environments offline evaluation is better. However, the A-ops method outperforms the other baselines after only a small number of environment interactions.

We show the contribution of each component of the method for each of the environments in the three domains.

Dm-control

Our method A-ops is preferable in all environments across a wide range of interaction budgets, except for cheetah_run with fewer than 50 trajectories. Again we observe that modelling correlated policies (GP) performs better than modelling independent policies (Ind), and that active policy selection (UCB) is better than uniform policy selection (Uniform). In the manipulator tasks, no method achieves a regret as low as in the other tasks. We believe the main reasons for this are 1) the low performance of the initial OPE estimates and 2) the very skewed distribution of episodic returns of all policies, where most returns are close to 0.


MPG

Compared to various ablations, our method A-ops is preferable in all environments across a wide range of interaction budgets.

Atari

A-ops is preferable in all environments across a wide range of interaction budgets.

Finally, we show the contribution of each component of the method when OPE is not used. The results are significantly worse than when using OPE (see above), which clearly indicates the benefit of the OPE component in A-ops. When OPE estimates are not available, the combination of modelling correlated policies (GP) and intelligent policy selection (UCB) gives the best results on average.

Dm-control

GP+UCB performs better than the next best method in 6 environments, slightly worse in 1 and approximately the same in 2. On average, the GP+UCB strategy is the best when OPE estimates are not available.

MPG

Notice the degraded performance of Ind+UCB in the first 200 iterations (mostly the exploration stage). This happens because each policy is treated independently, so until each of the 200 policies has been executed the regret remains quite high. Modelling the correlation between the policies, as in the GP methods, helps to alleviate this problem, and GP+UCB is the best method here.

Atari

Notice the degraded performance of Ind+UCB in the first 200 iterations (mostly the exploration stage). This happens because each policy is treated independently, so until each of the 200 policies has been executed the regret remains quite high. Modelling the correlation between the policies, as in the GP methods, helps to alleviate this problem, and the GP+UCB method outperforms all competitors, including the OPE method that relies on the offline dataset, within a small number of trajectories.

Despite the variability in different tasks, A-ops is a reliable method for policy selection.

The performance of A-ops in each task depends on the distributions of the OPE scores, the episodic returns and the true scores.

In each environment we show 1) the distribution of policy OPE metrics as a function of true scores at the top (in orange), and 2) the histogram of the episodic returns for a single randomly selected policy (in green) at the bottom. The randomly selected policy is highlighted with a red dot on the top and it is connected by a dashed line to the mean values in the histogram. The performance of A-ops depends on:

  • How good the OPE scores are. If the OPE estimates align almost perfectly with the true policy returns, there is not much room for improvement with A-ops. However, as it is impossible to know in advance how good the OPE scores are, A-ops is still a useful tool for achieving low regret with better guarantees.

  • How high the variance of the episodic returns is. If episodic returns have high variance, it might take A-ops longer to converge to the correct estimate. Still, A-ops scales better than a simple online estimate of policy performance.

  • The distribution of the true scores. If good and bad policies are far apart, it is easier to identify a good policy with A-ops even with high variance in the episodic returns, but if they are close, it might be almost impossible to distinguish between them given a limited interaction budget.

Dm-control

MPG

Atari

To understand the behaviour of the different policy selection methods on various tasks, we show the policies selected by several strategies in a randomly sampled experiment.

We show the results of simple online policy selection (Ind+Uniform), combined online and offline selection (Ind+Uniform+OPE), online+offline selection with Bayesian optimisation sampling (Ind+UCB+OPE) and our final method A-ops. For clarity of visualisation, we use only 50 policies in each experiment. Online+offline usually finds a good policy faster than online because it starts from a reasonable initial guess in the form of OPE scores. Adding Bayesian optimisation helps further because the most promising policies are evaluated first (those with high predicted returns or high variance). Modelling the correlation between policies is beneficial, especially when the number of policies grows.

Online

Online+offline

Online+offline+BO

A-ops

Limitations and future work

Potential directions for future work include combining A-ops with safety constraints, using the collected trajectories to improve the OPE estimates and the policies themselves, and improving the kernel by focusing on the most informative states.