Tips & Tricks

Turning offline logs into features

A user history in the RecoGym world is a variable-length sequence of product ids, representing the products the user browsed (usually without any intervention from the recommender system). Machine learning algorithms usually prefer to work on fixed-dimensional spaces, so a fixed-dimensional representation of the user is required. There are a number of simple possible approaches:

1) Just look at the most recent item in the user history. This technique is demonstrated in the organic count agent, the bandit count agent and bandit MF square. A limitation of this technique is that it discards the rest of the history; richer personalisation may be possible by using more than just the most recent item.

2) If there are P products then we could consider a vector of counts of how many times the user viewed each product (a minimal sketch is given after this list). This simple approach to feature engineering is described in detail here; for the Google Colab version go here. An agent implementing the idea is also in the pytorch_mlr class found here. This class is a recommended starting place for implementing your own agent as it is flexible and scales well to reasonably large numbers of products (although 10,000 products remains challenging!). Other agents also use this method but do not scale well to large numbers of products; these include the likelihood or value based methods: logistic regression, Bayesian logistic regression with MCMC (very slow) and Bayesian logistic regression with VB EM.

The approach of simply maintaining a vector of counts ignores any temporality; a simple extension could use time-weighted (discounted) counts instead, which is also covered in the sketch after this list. A discussion of this can be found in this notebook.

3) It can improve both computational and statistical behaviour to represent the user with K-dimensional product embeddings rather than the full P-dimensional count vector (see the second sketch after this list). For an example of this approach see this notebook. No agents are provided using this technique, but you should be able to transfer the notebook to an agent file if this approach interests you.

4) Many further extensions can be considered, such as variational auto-encoders or RNNs. These more advanced methods are currently not incorporated into any provided RecoGym agents, although variational auto-encoders are applied to RecoGym data here.
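
As a concrete illustration of approach 2) and its time-weighted extension, the minimal sketch below turns a variable-length history of product views into a fixed P-dimensional feature vector. It is self-contained NumPy code operating on a plain list of (time, product_id) pairs; the history format, variable names and the exponential half-life are illustrative assumptions, not part of the RecoGym API.

```python
import numpy as np

def count_features(history, num_products):
    """Fixed-length feature vector: how often each product was viewed."""
    counts = np.zeros(num_products)
    for _, product_id in history:
        counts[product_id] += 1.0
    return counts

def decayed_count_features(history, num_products, now, half_life=10.0):
    """Time-weighted counts: recent views contribute more than old ones."""
    decay_rate = np.log(2.0) / half_life        # illustrative decay choice
    features = np.zeros(num_products)
    for t, product_id in history:
        features[product_id] += np.exp(-decay_rate * (now - t))
    return features

# Example: a user who viewed products 3, 3 and 7 at times 0, 5 and 9.
history = [(0, 3), (5, 3), (9, 7)]
print(count_features(history, num_products=10))
print(decayed_count_features(history, num_products=10, now=10.0))
```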
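Approach 3) can then be sketched as projecting those P-dimensional counts onto K-dimensional product embeddings. Here the embeddings are simply taken from a truncated SVD of a users-by-products matrix of organic views; this construction and the toy data are assumptions for illustration and not the exact method of the linked notebook.

```python
import numpy as np

# Illustrative data: a users x products matrix of organic view counts.
rng = np.random.default_rng(0)
views = rng.poisson(0.3, size=(1000, 50))        # 1000 users, P = 50 products

# K-dimensional product embeddings from a truncated SVD of the views matrix.
K = 5
_, _, vt = np.linalg.svd(views, full_matrices=False)
product_embeddings = vt[:K].T                    # shape (P, K)

def embedded_features(count_vector, embeddings):
    """Project a P-dimensional count vector down to K dimensions."""
    return count_vector @ embeddings             # shape (K,)

user_counts = views[0]                           # count vector for one user
print(embedded_features(user_counts, product_embeddings))
```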


Full Reinforcement Learning is not required

People with a reinforcement learning background should note the following differences from the full RL set-up.

a) There is no long-term reward: you simply choose the best action you can given your current state of knowledge, and you get immediate feedback on whether it was successful (i.e. whether you obtained a click).

b) The users do not have a state that you can influence, i.e. you cannot make a user more interested in a product by recommending it to them constantly. They have a latent interest and you must discover it; you cannot change it.

c) Training is done on an offline log. Your agent must make the best use of the log of an existing policy, which is a simple session-based popularity policy with additional randomisation. This greatly reduces the need for explore-exploit trade-offs (all of our existing baselines train on an offline log and move directly to a full-exploitation policy). Winning the RecoGym challenge requires efficiently leveraging information to determine good actions; it does not (in version 1) require sophisticated explore-exploit strategies.


Value vs. Policy agents

There are two main approaches to machine learning: one based around likelihood/Bayesian or probabilistic methods, and the other based around empirical risk minimisation. In the reinforcement learning world roughly the same distinction is drawn between value based methods and policy based methods. This distinction is rather subtle and hidden for classic machine learning problems such as regression, but it becomes very visible when the purpose of the machine learning system is to take actions that perform well.

From the point of view of the RecoGym challenge the distinction is: do you explicitly model the response to every action (i.e. the value), or do you instead attempt to model the performance of a new policy and optimise that directly? This leads to drastically different methodologies. From the value perspective, standard supervised learning applies directly. From the empirical risk minimisation point of view there is the problem that it is not possible to "play back" actions that did not occur; inverse propensity score (IPS) methods are proposed to deal with this.

Both value-based and policy-based approaches are implemented here, including an attempt at blending the two. A value-based approach is implemented here (only suitable for low numbers of products) - notebook here. A policy-based approach is implemented here (only suitable for low numbers of products) - notebook here. Briefly, value/likelihood based approaches predict the reward given the context and the action (and ignore the IPS score); in contrast, policy based approaches return the best action given the context (the reward and the IPS score influence the weighting of records during training).
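The following minimal sketch contrasts the two training set-ups on a toy bandit log. It is not the code of the linked agents: the log format (context features, action, propensity, click), the simulated reward model and the use of scikit-learn are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, num_actions, num_features = 5000, 5, 10

# Toy bandit log: context X, action a chosen by the logging policy with
# propensity p, and reward r (did the recommendation get clicked?).
X = rng.normal(size=(n, num_features))
a = rng.integers(num_actions, size=n)
p = np.full(n, 1.0 / num_actions)                     # uniform logging policy
best = np.argmax(X[:, :num_actions], axis=1)          # each user's latent favourite
r = rng.binomial(1, np.where(a == best, 0.15, 0.05))  # higher CTR when matched

# Value-based: supervised model of P(click | context, action); IPS is not used.
XA = np.hstack([X, np.eye(num_actions)[a]])
value_model = LogisticRegression(max_iter=1000).fit(XA, r)

def value_act(x):
    # Score every action for this context and take the argmax.
    candidates = np.hstack([np.tile(x, (num_actions, 1)), np.eye(num_actions)])
    return int(np.argmax(value_model.predict_proba(candidates)[:, 1]))

# Policy-based: learn context -> action directly.  Clicked records supervise
# the action choice and are weighted by reward / propensity (IPS weighting).
clicked = r == 1
policy_model = LogisticRegression(max_iter=1000).fit(
    X[clicked], a[clicked], sample_weight=1.0 / p[clicked])

def policy_act(x):
    return int(policy_model.predict(x.reshape(1, -1))[0])

print(value_act(X[0]), policy_act(X[0]))
```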

A short survey of this debate can be found here. An example using the RecoGym environment that compares the simplest value based methods (likelihood) to policy based methods is given here. It can be easier to understand IPS first from an evaluation perspective, which is explained here; also see the notebook here.
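To make the evaluation view of IPS concrete, the sketch below estimates the click-through rate a new policy would have achieved, using only a log collected by a different policy, by reweighting each logged click with the ratio of the two policies' action probabilities. The toy log, the uniform logging propensities and the array names are illustrative assumptions.

```python
import numpy as np

def ips_ctr_estimate(logged_clicks, logging_propensities, new_policy_probs):
    """IPS estimate of the CTR a new policy would achieve, from an offline log.

    new_policy_probs[i] is the probability the *new* policy assigns to the
    action that was actually logged for record i.
    """
    weights = new_policy_probs / logging_propensities
    return np.mean(weights * logged_clicks)

# Toy log collected under a uniform logging policy over 5 products.
rng = np.random.default_rng(0)
n, num_products = 10000, 5
logged_actions = rng.integers(num_products, size=n)
logging_propensities = np.full(n, 1.0 / num_products)
logged_clicks = rng.binomial(1, np.where(logged_actions == 2, 0.20, 0.05))

# Evaluate a deterministic new policy that always shows product 2: it puts
# probability 1 on logged actions equal to 2 and probability 0 elsewhere.
new_policy_probs = (logged_actions == 2).astype(float)
print(ips_ctr_estimate(logged_clicks, logging_propensities, new_policy_probs))
# Should come out close to 0.20, the true CTR of always showing product 2.
```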


Organic Behaviour is similar to, but different from, Bandit Behaviour

RecoGym simulates both organic behaviour (the products that are typically viewed together in a session) and bandit behaviour (did the recommendations delivered actually work?). There is currently little academic literature on this topic, but we believe it is crucial for producing a well performing agent, or indeed a well performing recommender system. Also note that currently none of the baselines blend organic and bandit behaviour together (the SVD notebook does in a basic way) - so this is a real opportunity for entrants to beat the baselines.

The difference between organic behaviour and bandit behaviour means that an algorithm that performs well on an organic metric such as hitrate@k may give disappointing performance on an online bandit metric such as click-through rate. This point is demonstrated in the paper here, and in the notebook here (Google Colab version here).

The "flips" parameter in RecoGym controls the difference between organic and bandit behaviour. In the RecoGym challenge we set it to its maximum value (under which there remain many similarities between organic and bandit). This notebook demonstrates the effect of the flips parameter.



GOOD LUCK!