QAs

Below, we answer some questions we received from the online audience. Thank you for many great questions! and sorry for not having time enough to address them during the tutorial. (We bunched together some similar questions.)

Estimators

How sensitive is the DRos (shrinkage) estimator to the lambda parameter?

A. You can see our simulation here, and you see that it is sensitive. However, as we discussed, we can use the data-driven hyperparameter tuning method to select a nice value.

How do the different OPE methods behave with large action spaces?

A. You can try it by yourself by customizing our simulations here, (specifically, the "number of actions" argument of "SyntheticBanditDataset"). We expect that all OPE estimators will have lower accuracy than they do with smaller action space. Scaling more effectively to larger action spaces is a really interesting research question.

Can we use OPE for online learning policies? and what do you recommend to estimate the delay in the counterfactual reward?

A. We can, but need some modifications. You can refer to the following papers.

Lihong Li, Wei Chu, John Langford, Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. WSDM2011.
Miroslav Dudik, Dumitru Erhan, John Langford, Lihong Li. Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits. UAI2012.

Great talk! What about a SNIPS for lists? Is that feasible? 🙂

A. Great suggestion. You can definitely do that. Indeed, there is a recent pull request implementing self-normalized versions of the IPS estimators for the combinatorial action setting. https://github.com/st-tech/zr-obp/pull/133

In DRos, a weight of denominator is squared. It is unintuitive to me, because it is not equal to the original weight if λ=0. Could you explain?

A. Great question. When, λ=0, DRos is identical to DM. Instead, it is close to the original DR estimator when λ→∞.

In DRos, a weight of denominator is squared. It is unintuitive to me, because it is basically inversely proportional to the original weight.

A. Great question. The following figure illustrates how shrinkage (and clipping) modifies the true importance weight (IW) with varying values of hyperparameters. X-axis is the true IW up to 100, and y-axis is the modified values. You can see that with appropriately large hyperparameter values (e.g., 10,000 or 100,000), shrinkage softly clips the true IW. However, as you suggest, when lambda is small, it is too aggressive. That is why we see DRos is not so good with small hyperparameter values here (but of course, small hyperparameter values may be effective in some challenging settings).

Assumptions

Regarding the IPS Estimator, how often is there full support in industrial large scale recommenders? What can be done when there is no full support?
The actions taken by the behavior (logged) policy are limited, hence the collected rewards r_i are limited too. Could you say a bit more on IPS in the case of sparse action overlap between behavior policy and target policy?

A. Good point. When the logging policy does not satisfy the full support assumption, we will have some remaining bias even if we have infinite sample data and know the true logging policy. You can refer to the following recent paper to see what is the problem there and how we can address the problem, especially in the policy learning side.

Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy Bandits with Deficient Support. KDD2020.

Note also that mixing data from multiple logging policies can help reduce support deficiency using the balanced estimator.

A. Agarwal, S. Basu, T. Schnabel, T. Joachims. Effective Evaluation using Logged Bandit Feedback from Multiple Loggers. KDD2017.

Experiments

How would practitioners evaluate which estimator is the best in their specific setting?
How would you exactly differentiate between "easy" and "challenging" setting? How much divergence between the two stochastic policies is "required" to favor DRos against SNIPS for example?
How do we evaluate which IPS method performs best on non-synthetic datasets?
How to estimate the true value of a policy if it is not a SyntheticBanditDataset

A. Great questions. As we stated in the last section on future work, this kind of "data-driven estimator selection" is still an unexplored (sub-)field. However, if we have multiple logged bandit datasets collected by different policies in A/B test, you can implement the evaluation of OPE experiment leveraging that datasets. You can refer to Section 5 of the following paper on how to implement that procedure.

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. arXiv2020.

For the ZoZo dataset, how would performance look like if RandomForest is used without IPW Learner [i.e. vanilla supervised ML model]? How much incremental impact does IPWLearner + RF bring over just RF ?

A. Great point. OpenBanditPipeline doesn't support the supervised ML, prediction based policy, so it is not so trivial to compare right now. We will implement that additional function later. We think the following paper is a good reference comparing the supervised ML-based (model based) off-policy learning and IPS (or IPW) based (model free) policy learning in the context of RecSys.

Olivier Jeunen, David Rohde, Flavian Vasile, Martin Bompaire. Joint Policy-Value Learning for Recommendation. KDD2020.

Propensities

My intuition is that the more random the log policy is the better the OPE estimation is. Are there specific properties of the log policy that can help us to predict how good will be the OPE estimation?

A. It depends on what evaluation policy we want to evaluate. If we just want to evaluate the performance of an evaluation policy that is similar to the logging policy, OPE will be relatively easy (as we showed in the synthetic simulation). As you suggest, a near uniform random policy is useful, when you plan to evaluate a wide variety of evaluation policies.

In a large action space, computing \pi(a|x) is expensive (e.g. if there is 1000 candidates, but only 1 was shown to the user so x is 1000x size of a), using OPE requires 1000x more logs/compute than just cross-entropy optimization when you learn a model \hat{p}(r=1|a). Is there any way around this?

A. Training a policy via policy gradient or methods like POEM do indeed need to repeatedly compute the policy \pi(a|x). It is an interesting research question whether there is room for approximations and caching. Some of the techniques developed for computing the partition function in multi-class classification may be of use here.

How are the probabilities under the stochastic evaluation policy calculated? How do we know what will be the probabilities of choosing particular actions if a policy is new? Does it require simulation?

A. It depends on what algorithm you use for a logging policy. When you use epsilon greedy or softmax, it is easy to calculate propensities. If you use probability matching such as Thompson sampling, you need to approximate propensities by, for example, Monte Carlo simulation.

If you don't have the propensities in your logged datasets, are there any logical heuristics to estimate it?

A. We can use logistic regression or some other classification methods to estimate the propensities. An interesting fact here is that it is more efficient to use estimated propensities than to use the true ones when we can estimate the propensities consistently. You can refer to the following paper about this somewhat counter-intuitive fact.

Yusuke Narita, Shota Yasui, Kohei Yata. Efficient Counterfactual Learning from Bandit Feedback. AAAI2019.

Others

Is there a rule of thumb for when OPE is feasible ... e.g. look at the max IW and decide if OPE is not going to work?

A. Great idea, but it is still challenging to set an appropriate threshold for max IW. It depends on many factors such as number of samples, number of actions, quality of the reward predictor, and the noise level on the reward variable, etc... So, we think a more sophisticated method is needed.

Thanks for a great tutorial! Is there a theory that accounts for the fact that a policy update can, in principle, change the distribution of the context vector, p(x)?

A. Thanks! In most cases the policy update doesn't change the environment (i.e., p(x) and p(r|x,a)), it just controls \pi(a|x). But there may be long-term effects on P(x) (e.g. customer abandonment or growth) that are not modeled.

How do these methods perform with long term rewards like user retention?

A. Good point. We think this is an independent research interest, i.e., how should we develop a surrogate immediate reward that well represents a long-term, ultimate reward. The following is a paper relevant to this topic, but there might be some other great references. We are not experts in this direction.

Weitao Duan, Shan Ba, Chunzhe Zhang. Online Experimentation with Surrogate Metrics: Guidelines and a Case Study. WSDM2021.

How can these methods be adapted to deal with multiple (possibly conflicting) rewards? E.g. Accuracy and fairness

A. This is kind of an independent research interest. There are many papers discussing how to incorporate fairness constraints to learning-to-rank objectives. There are also many works on how to incorporate multiple competing objectives in recommender systems. You can refer to the following paper or tutorial.

Ashudeep Singh and Thorsten Joachims. Policy Learning for Fairness in Ranking. NeurIPS2019.
Yong Zheng, David (Xuejun)Wang. Multi-Objective Recommendations: A Tutorial. KDD2021.

Page updated

Report abuse