In-Workshop Tutorial

Title: Approaches to Off-Policy Evaluation in Large Action Spaces


Abstract


Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems because it enables offline evaluation of new policies using only logged data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score (IPS) weighting – degrade severely and can suffer from extreme bias and variance. This hinders the use of OPE in many applications such as recommender systems and search engines. In this tutorial, I will introduce several recent approaches to OPE in large action spaces. First, I will focus on OPE of ranking policies and describe how imposing assumptions on user behavior enables ranking-specific IPS estimators with reduced variance. I will also cover an extension of existing work that incorporates possibly diverse user behavior into the design of an estimator. Next, I will turn to a more general approach to large action spaces that leverages action embeddings. I will show that a new estimator called Marginalized IPS, which defines the importance weight via action embeddings, provides substantial statistical benefits over conventional estimators. Finally, I will describe remaining challenges and open questions in this direction to encourage discussion at the workshop.
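
As a reference point for the first part of the tutorial, below is a minimal sketch of the vanilla IPS estimator, assuming logged bandit feedback (chosen actions, observed rewards, logging propensities) and the target policy's action-choice probabilities; the function and variable names are illustrative and not tied to any particular library.

    import numpy as np

    def ips_estimate(actions, rewards, logging_propensities, target_probs):
        # actions: (n,) indices of the logged actions
        # rewards: (n,) observed rewards
        # logging_propensities: (n,) probability that the logging policy chose each logged action
        # target_probs: (n, n_actions) action-choice probabilities of the target policy per context
        # Vanilla importance weight: pi_target(a|x) / pi_logging(a|x)
        weights = target_probs[np.arange(len(actions)), actions] / logging_propensities
        # Sample average of importance-weighted rewards: unbiased under full support,
        # but the weights (and hence the variance) blow up as the action space grows
        return np.mean(weights * rewards)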

Outline (75 min + Q&A)


  1. Quick Intro to Standard Off-Policy Evaluation (15min)

    • Formulation of the statistical estimation problem and some benchmark estimators

  2. How should we perform OPE of ranking policies? (30min)

    • Introducing assumptions on user behavior (aka click models)
      to optimize the bias-variance tradeoff of ranking-specific estimators (see the first sketch after this outline)

  3. How should we deal with more general large action spaces? (30min)

    • If we have some action features/embeddings, we can define the importance weight differently,
      enabling effective OPE even in large discrete action spaces (see the second sketch after this outline)
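
To make the ranking part concrete, here is a rough sketch of a position-wise IPS estimator under an independence-style click-model assumption, where each slot is weighted by its own marginal propensity rather than by the propensity of the entire ranking; the input format and names are assumptions made for illustration.

    import numpy as np

    def position_wise_ips(slot_rewards, logging_marginals, target_marginals):
        # slot_rewards: (n, L) per-position rewards (e.g., clicks) for each logged ranking
        # logging_marginals / target_marginals: (n, L) marginal probabilities that the
        #   logging / target policy places the logged item at each position
        weights = target_marginals / logging_marginals  # position-wise importance weights
        # Sum weighted rewards over positions, then average over logged rankings;
        # this avoids the combinatorially large weight over the full ranking
        return np.mean(np.sum(weights * slot_rewards, axis=1))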
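
Similarly, the core idea behind Marginalized IPS can be sketched as follows, assuming a finite set of embedding categories and a known, context-independent conditional distribution p(e|a); this is a simplification of the general formulation, intended only to show where the variance reduction comes from.

    import numpy as np

    def mips_estimate(embeddings, rewards, logging_policy, target_policy, p_e_given_a):
        # embeddings: (n,) index of the observed embedding category in each logged round
        # rewards: (n,) observed rewards
        # logging_policy / target_policy: (n, n_actions) action-choice probabilities per context
        # p_e_given_a: (n_actions, n_embeddings) conditional embedding distribution
        # Marginal embedding distributions: p(e | x, pi) = sum_a pi(a|x) p(e|a)
        p_e_logging = logging_policy @ p_e_given_a
        p_e_target = target_policy @ p_e_given_a
        idx = np.arange(len(embeddings))
        # Importance weight defined on embeddings rather than actions; it is typically
        # far smaller than the action-level weight, which reduces variance
        weights = p_e_target[idx, embeddings] / p_e_logging[idx, embeddings]
        return np.mean(weights * rewards)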

Presenter: Yuta Saito (Cornell University)

Yuta Saito is a Ph.D. student in the Department of Computer Science at Cornell University, advised by Prof. Thorsten Joachims. His current research focuses on off-policy evaluation of bandit algorithms and fairness in ranking. His recent work has been published at top-tier conferences, including ICML, NeurIPS, KDD, SIGIR, RecSys, and WSDM. He received the Best Paper Runner-Up Award at WSDM 2022 and co-lectured tutorials on counterfactual evaluation at RecSys’21 and KDD’22. He is the main developer of the Open Bandit Pipeline package and was named to the Forbes JAPAN 30 Under 30 2022 list.