Offline evaluation for recommender systems

A RecSys 2018 Workshop

Vancouver, Canada

Sunday, Oct 7, 2018

Update (Aug 31st): We are excited to announce 4 fantastic invited speakers, 8 accepted talks, and 12 posters. Stay tuned for the detailed schedule!

Asking the right question is half the answer: Revisiting the choice of offline metrics for recommender systems.

Recommender systems are notoriously hard to evaluate due to their interactive and dynamic nature. Practitioners often observe significant differences between the offline and online results of a new algorithm, and therefore tend to rely mostly on online methods such as A/B testing.

This is unfortunate because online evaluation is not always possible and is often expensive. Offline evaluation, on the other hand, provides a scalable way of comparing recommender systems and helps bridge the gap between academia and industry in the field of recommendation at large.

In the past, recommender systems have been evaluated using proxy offline metrics borrowed from supervised learning, such as regression metrics (mean squared error, log-likelihood), classification metrics (area under the precision-recall curve) or ranking metrics (precision@k, normalized discounted cumulative gain).
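
As an illustration of the ranking metrics named above, here is a minimal sketch of precision@k and normalized discounted cumulative gain (NDCG@k) for a single user. It assumes binary relevance (an item is either relevant or not) and a base-2 logarithmic discount; the function names and the toy item identifiers are chosen for this example only.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), etc.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical data): four recommended items, two of them relevant.
ranked_items = ["a", "b", "c", "d"]
relevant_items = {"a", "c"}
p_at_2 = precision_at_k(ranked_items, relevant_items, k=2)
ndcg_at_4 = ndcg_at_k(ranked_items, relevant_items, k=4)
```

Both metrics score a ranking against held-out interactions, which is why they are proxies: they say how well the ranking recovers logged behavior, not what effect showing the recommendation would have had.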

Recent research connects recommender systems with work on counterfactual inference, enabling new ways to evaluate the quality of recommendations offline. In this context, we believe it is timely to organize a workshop that revisits the problem of designing offline metrics for recommendation and makes sure the community is working on the right problem: finding, for each user, the most impactful recommendation.
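
To make the counterfactual framing concrete, the sketch below shows inverse propensity scoring (IPS), one common estimator from the counterfactual inference literature. The text above does not prescribe it; it is used here only as an example. The sketch assumes that each logged event records the probability (propensity) with which the logging policy showed the item, and that this probability is strictly positive. The names `LoggedEvent`, `ips_estimate` and `always_a` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# One logged interaction: (context, action shown, observed reward,
# probability with which the logging policy chose that action).
LoggedEvent = Tuple[str, str, float, float]


def ips_estimate(
    logs: Sequence[LoggedEvent],
    target_prob: Callable[[str, str], float],
) -> float:
    """Estimate the average reward of a target policy from logged data.

    Each observed reward is reweighted by the ratio between the target
    policy's probability of the logged action and the logging propensity.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, action, reward, propensity in logs:
        if propensity <= 0.0:
            raise ValueError("logging propensities must be strictly positive")
        total += reward * target_prob(context, action) / propensity
    return total / len(logs)


# Toy example (hypothetical data): the logging policy picked "a" or "b"
# uniformly at random; the target policy would always recommend "a".
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 0.0, 0.5),
    ("u4", "b", 1.0, 0.5),
]


def always_a(context: str, action: str) -> float:
    return 1.0 if action == "a" else 0.0


estimated_reward = ips_estimate(logs, always_a)
```

In this toy log, only the events where "a" was shown contribute to the estimate, and each counts double because "a" was shown half as often as the target policy would show it.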

The goal of this workshop is to foster creative discussions within the community, spanning academic and industrial backgrounds, to advance the field of offline evaluation of recommender systems.

Potential contributions include (but are not limited to):

  • Framing the problem: what exactly are we trying to solve? For example:
    • Recommendation as a counterfactual inference problem
    • Recommendation as a reinforcement learning problem
  • New metrics and methods for offline evaluation
  • Studies on offline-online metrics correlation for recommendations
  • Using simulation for recommender systems evaluation
  • Open datasets, baseline algorithms, and evaluation toolkits