Online advertising has grown exponentially over the past few years due to the widespread usage of the internet across the world. Marketers track customer journeys as users are exposed to different online media channels before they finally convert. Companies allocate marketing budgets to promote their businesses through multiple online campaigns across different channels.
Let me give you an example of a marketing campaign. Say you are scrolling through Instagram and come across an ad from Nike promoting their new collection of running shoes. You look at the ad but ignore it. Later, while traveling, you come across a billboard featuring your favorite athlete endorsing the same shoes. Now, let's assume you liked the shoes and are curious about them. So, you visit the Nike website and find that the price is above your budget. You don't buy the shoes and close the browser. A few days later, you receive a promotional email from Nike offering a 15% discount on the same shoes if you buy them today. So, you grab your credit card and buy them because you think you might not get a better deal any time soon. This cycle is a classic example of a marketing campaign. The Instagram ad, the billboard, the website visit, and the email from Nike are marketing channels, and the engagements you had with these channels are called touchpoints. Since you ended up purchasing the product, this customer journey ended with a conversion. Now that your shoes have been delivered, how would you attribute the revenue generated by your purchase to the individual channels? Was it the 15% discount email, the billboard with your favorite athlete, or the initial Instagram ad? This is called the attribution problem.
This project aims to:
Predict whether a series of events in a customer's journey (touchpoints) leads to a conversion.
Analyze sequences of customer interactions and assess the attribution of each advertising channel towards the final conversion.
Traditionally, channel attribution has been tackled by a handful of simple but powerful approaches such as First Touch, Last Touch, Linear, and Time-Decay attribution.
Last touch: As the name suggests, last touch is the attribution approach where any revenue generated is attributed to the marketing channel a user last engaged with. While this approach has the advantage of simplicity, you run the risk of oversimplifying your attribution, as the last touch isn't necessarily the touchpoint that generates the purchase. In our previous example, the last-touch channel (the email from Nike) likely didn't create 100% of the intent to purchase; the awareness stems from the initial spark of seeing the Instagram ad.
First touch: The revenue generated by the purchase is attributed to the first channel the user engaged with on the journey towards the purchase. It mirrors last touch and carries the same risk of over-simplifying the attribution.
Linear: Gives equal credit to all touchpoints i.e. equal credit to all channels.
Time-decay: Gives more credit to the touchpoints that are closer in time to the conversion.
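The four rule-based schemes can be sketched in a few lines of Python for a single hypothetical journey (channel names, timings, and the half-life constant below are all illustrative):

```python
# Each touchpoint is (channel, seconds_before_conversion); values are illustrative.
journey = [("instagram_ad", 96 * 3600), ("billboard", 72 * 3600),
           ("website_visit", 48 * 3600), ("email", 1 * 3600)]

def last_touch(journey):
    # All credit goes to the final touchpoint before conversion.
    return {ch: (1.0 if i == len(journey) - 1 else 0.0)
            for i, (ch, _) in enumerate(journey)}

def first_touch(journey):
    # All credit goes to the first touchpoint of the journey.
    return {ch: (1.0 if i == 0 else 0.0) for i, (ch, _) in enumerate(journey)}

def linear(journey):
    # Equal credit to every touchpoint.
    return {ch: 1.0 / len(journey) for ch, _ in journey}

def time_decay(journey, half_life=7 * 24 * 3600):
    # Credit halves for every `half_life` seconds between touchpoint and conversion.
    raw = {ch: 0.5 ** (t / half_life) for ch, t in journey}
    total = sum(raw.values())
    return {ch: w / total for ch, w in raw.items()}
```

Under time decay, the Nike email (closest to the conversion) gets the largest share, while the Instagram ad gets the smallest.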
All marketing campaigns have a limited budget allocated to them. Especially in digital marketing, we want to know how much we must invest in each individual channel to receive the maximum Return on Investment (ROI). This is a primary Key Performance Indicator (KPI) for any firm that is employing digital marketing.
This leads us to the problem of marketing spend optimization, which requires estimating the true contribution of individual channels to the final outcome and optimally allocating budgets across these channels. In this article, we will explore how deep learning can be used to analyze sequences of customer interactions and how insights gained from such analyses can be used to optimize the marketing budget.
The data for this project has been obtained from the Criteo AI Lab website. It is open-sourced and readily available to download. This dataset represents a sample of 30 days of Criteo live traffic data. Each line corresponds to one impression (a banner) that was displayed to a user. For each banner we have detailed information about the context, if it was clicked, if it led to a conversion, and if it led to a conversion that was attributed to Criteo or not. Data has been sub-sampled and anonymized so as not to disclose proprietary elements [7].
Here is a detailed description of the fields (they are tab-separated in the file):
timestamp: timestamp of the impression (starting from 0 for the first impression). The dataset is sorted according to timestamp.
uid: a unique user identifier
campaign: a unique identifier for the campaign
conversion: 1 if there was a conversion in the 30 days after the impression (independently of whether this impression was last click or not)
conversion_timestamp: the timestamp of the conversion or -1 if no conversion was observed
conversion_id: a unique identifier for each conversion (so that timelines can be reconstructed if needed). -1 if there was no conversion
attribution: 1 if the conversion was attributed to Criteo, 0 otherwise
click: 1 if the impression was clicked, 0 otherwise
click_pos: the position of the click before a conversion (0 for first-click)
click_nb: the number of clicks. More than 1 if there were several clicks before a conversion
cost: the price paid by Criteo for this display (disclaimer: not the real price, only a transformed version of it)
cpo: the cost-per-order in case of attributed conversion (disclaimer: not the real price, only a transformed version of it)
time_since_last_click: the time since the last click (in s) for the given impression
cat[1-9]: contextual features associated with the display. Can be used to learn the click/conversion models. We do not disclose the meaning of these features but it is not relevant for this study. Each column is a categorical variable. In the experiments, they are mapped to a fixed dimensionality space using the Hashing Trick (see paper for reference).
Key figures
2.4 GB uncompressed
16.5M impressions
45K conversions
700 campaigns
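Assuming the file has been downloaded, loading it might look like the sketch below; the file name is an assumption on my part, and the inline sample merely mirrors the tab-separated layout described above:

```python
import io
import pandas as pd

# Column names follow the field description above; cat1-cat9 are the
# undisclosed contextual features.
COLUMNS = (["timestamp", "uid", "campaign", "conversion", "conversion_timestamp",
            "conversion_id", "attribution", "click", "click_pos", "click_nb",
            "cost", "cpo", "time_since_last_click"]
           + [f"cat{i}" for i in range(1, 10)])

# A one-row inline sample with the same 22-column tab-separated layout.
sample = "\t".join(["0", "u1", "c1", "1", "3600", "cv1", "1", "1", "0", "1",
                    "0.5", "1.2", "-1"] + ["7"] * 9) + "\n"

df = pd.read_csv(io.StringIO(sample), sep="\t", names=COLUMNS)

# For the real (gzipped) dataset, something along the lines of:
# df = pd.read_csv("criteo_attribution_dataset.tsv.gz", sep="\t",
#                  compression="gzip", names=COLUMNS)
```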
Since this dataset does not really have channels for us to calculate final attributions, we choose to optimize the budget across the campaigns instead. This is a more challenging task since the dataset contains ~700 advertising campaigns. So, our model will have many more budgeting parameters to learn than in a conventional cross-channel optimization, which usually has a smaller number of channels (e.g., Instagram, Facebook, email, paid search, etc.).
After the initial data collection, the following tasks will be implemented:
Data pre-processing and feature selection
Since we aim to analyze entire customer journeys, i.e., all events of a customer for a particular campaign, we will create a new column 'jid' representing each customer's journey by concatenating the corresponding user id and conversion id values.
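The journey id can be sketched with pandas; the toy frame below is illustrative, while the real frame carries all the Criteo columns:

```python
import pandas as pd

# Toy frame with the two id columns used to build the journey id.
df = pd.DataFrame({"uid": ["u1", "u1", "u2"],
                   "conversion_id": ["cv1", "cv1", "-1"]})

# 'jid' concatenates the user id and conversion id, so every row of the
# same journey shares one identifier.
df["jid"] = df["uid"].astype(str) + "_" + df["conversion_id"].astype(str)
```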
The dataset is large, and processing it in full can run into memory constraints. So, for ease of analysis, we randomly sample 200 campaigns and remove all journeys consisting of a single event (a journey with only one event cannot effectively portray a sequence for analysis).
The original dataset is imbalanced: there are ~15 million non-conversion events and only ~800,000 conversion events. After sampling, these numbers reduce further, which might affect the training of our model (the model might learn non-conversion events better than converted ones). So, we downsample the non-conversion events to balance the dataset in such a way that the number of converted events is on par with non-converted events.
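A minimal sketch of this downsampling with pandas (the toy journey-level frame and its class ratio are illustrative):

```python
import pandas as pd

# Toy journey-level frame: converted = 1 means the journey ended in a conversion.
journeys = pd.DataFrame({
    "jid": [f"j{i}" for i in range(10)],
    "converted": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

converted = journeys[journeys["converted"] == 1]
non_converted = journeys[journeys["converted"] == 0]

# Downsample the majority (non-converted) class to the minority size,
# then shuffle so the classes are interleaved.
balanced = pd.concat([
    converted,
    non_converted.sample(n=len(converted), random_state=42),
]).sample(frac=1, random_state=42)
```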
Later, we will use a normalizing technique to scale the timestamp and time_since_last_click columns so that all final features are on the same scale, which helps the model learn better and keeps it from assigning high importance to features simply because they take large values.
Lastly, all the categorical columns cat1-cat9 will be one-hot encoded into arrays to obtain the final features used to fit our model.
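Both steps, scaling and one-hot encoding, can be sketched with pandas; min-max scaling is one possible normalizing choice, and the toy frame below is illustrative:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column; the real pipeline
# applies the same idea to timestamp, time_since_last_click, and cat1-cat9.
df = pd.DataFrame({"timestamp": [0.0, 50.0, 100.0],
                   "cat1": ["a", "b", "a"]})

# Min-max scale the numeric column into [0, 1].
ts = df["timestamp"]
df["timestamp"] = (ts - ts.min()) / (ts.max() - ts.min())

# One-hot encode the categorical column into indicator columns.
df = pd.get_dummies(df, columns=["cat1"])
```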
Modeling and training
Develop the Last Touch Attribution (LTA) model and supervised machine learning models such as logistic regression and LSTM.
Train on sequences of customer journeys and predict whether each sequence ends with a conversion.
Calculating attribution weights of individual channels
Model evaluation and tuning
Calculating and visualizing the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) for all models
Visualizing other scoring metrics to find the best performing model.
Let us look at the distribution of journey lengths to confirm that it makes sense to use modeling methods for sequential data. Our dataset contains journeys with up to 100 events or more, but the number of journeys falls exponentially with length.
The above plot represents the distribution of journeys with up to 200 events (sampled) in a customer's journey. The number of journeys with several events is considerable; thus, it makes sense to try methods for sequential data.
Before we dive into the supervised machine learning models, let us take a look at the conventional marketing technique used by marketing analysts that has proven to be highly effective in terms of digital marketing analytics.
As defined earlier, a common technique used by marketers to address the attribution problem, i.e., attributing a customer's conversion to a particular channel (or, in our case, a campaign), is Last Touch Attribution, where all credit for a conversion is assigned to the last channel the customer interacted with before subscribing to or purchasing the product.
To achieve this result we will:
Count the number of impressions for each campaign, i.e., the number of times a particular campaign occurs in the data frame.
Count the number of times a campaign is the last touch before the conversion
Calculate the attribution weight, which is the ratio between the number of journeys in which a given campaign is the last event and the total number of events for that campaign. This can also be called the return per impression.
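The three steps above can be sketched with pandas on a toy event log; the last_touch flag is assumed to have been derived beforehand by checking whether an impression is the final event of a converted journey:

```python
import pandas as pd

# Toy event log: one row per impression; last_touch = 1 if this impression
# was the last event of a journey that converted.
events = pd.DataFrame({
    "campaign": ["A", "A", "A", "B", "B"],
    "last_touch": [1, 0, 1, 0, 1],
})

# Step 1: impressions per campaign.
impressions = events.groupby("campaign").size()

# Step 2: last-touch conversions per campaign.
last_touches = events.groupby("campaign")["last_touch"].sum()

# Step 3: attribution weight = last-touch conversions / impressions,
# i.e., the return per impression.
lta_weight = last_touches / impressions
```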
The following plot describes the results obtained from employing the LTA technique for 200 campaigns. The distribution shows us which marketing campaigns had the highest and least return per impression. These results can help a marketer understand which marketing campaigns are making the biggest impact on their business and accordingly, allocate budgets across these campaigns.
Now, let's look at how we can implement a Logistic regression model. Unlike position-based models, regression analysis aims to reveal a touchpoint's true contribution.
After performing some feature aggregation and splitting the data (check out my GitHub), we implement a Keras logistic regression model, which offers convenient methods like get_weights and get_layer to directly obtain the attribution weights of any given layer. Categorical columns cat1-cat9 are one-hot encoded into arrays and used as features for our model. Additionally, we aggregate other numerical columns like click and cost depending on the aggregation strategy.
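The article's model is built in Keras; as a lighter, self-contained sketch of the same idea, here is a scikit-learn logistic regression on synthetic journey-level features, where the learned coefficients play the role of the per-campaign attribution weights (the Keras version reads them off with get_weights instead). All data and dimensions below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic journey-level features: one column per hypothetical campaign,
# counting how often it appeared in the journey.
n, k = 500, 5
X = rng.integers(0, 3, size=(n, k)).astype(float)

# Ground-truth weights used only to generate synthetic labels.
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_w - 1.0)))
y = (rng.random(n) < p).astype(int)

model = LogisticRegression().fit(X, y)

# One coefficient per campaign; these serve as attribution weights.
attribution_weights = model.coef_.ravel()
```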
We are getting reasonably good validation accuracy (84.7%) for such a basic approach. To evaluate our model let us look at the confusion matrix. Our model did a decent job of predicting converted journeys (6096) and non-converted journeys (2653). The model misclassified 997 journeys as converted when in reality these journeys did not end up converting. Similarly, 514 journeys that actually ended up with conversion were wrongly classified as non-converted journeys.
A common way to compare models that predict probabilities for two-class problems is to use the ROC AUC (Area Under the Curve of the Receiver Operating Characteristic).
It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0.
Let us go ahead and calculate the AUC of ROC for both Logistic Regression and a no-skill model which only predicts 0 for all examples.
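As a sketch of that comparison with scikit-learn's roc_auc_score: the synthetic labels and scores below are illustrative, but the no-skill baseline (a constant prediction) lands at 0.5 by construction:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)

# A skilled model's probabilities (noisy versions of the true labels)
# versus a no-skill model that predicts 0 for every journey.
y_model = np.clip(y_true + rng.normal(0, 0.3, size=200), 0, 1)
y_noskill = np.zeros_like(y_true, dtype=float)

auc_model = roc_auc_score(y_true, y_model)
auc_noskill = roc_auc_score(y_true, y_noskill)  # constant scores give 0.5
```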
According to this book, page 160:
0.5 = This suggests no discrimination, so we might as well flip a coin.
0.5-0.7 = We consider this poor discrimination, not much better than a coin toss.
0.7-0.8 = Acceptable discrimination
0.8-0.9 = Excellent discrimination
0.9+ = Outstanding discrimination
Our model has a ROC AUC score of 0.915, which implies that the model is skilled and can confidently discriminate between converted and non-converted journeys. Let us take a look at the attribution weights produced by the model and compare them with the LTA model.
The attribution weights produced by the two models are correlated, although significant differences exist for some campaigns. We used a simple design that can be improved through more elaborate feature engineering.
Now let us build a more advanced model that explicitly accounts for dependencies between events in a journey, i.e., an LSTM model. This problem can be framed as conversion prediction based on an ordered sequence of events, and recurrent neural networks (RNNs) are a common solution for it. We choose a basic long short-term memory (LSTM) architecture with 64 hidden units (check my GitHub).
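A minimal Keras sketch of such an architecture might look like the following; MAX_LEN and N_FEATURES are illustrative placeholders, since the real feature dimension comes from the pre-processing step:

```python
import numpy as np
import tensorflow as tf

# Padded journeys of up to MAX_LEN events, each described by N_FEATURES
# one-hot/numeric features (illustrative shapes).
MAX_LEN, N_FEATURES = 20, 32

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, N_FEATURES)),
    tf.keras.layers.Masking(mask_value=0.0),        # ignore zero padding
    tf.keras.layers.LSTM(64),                       # 64 hidden units, as above
    tf.keras.layers.Dense(1, activation="sigmoid"), # P(journey converts)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, ...) would be called on the padded journey
# tensors; here we only run a forward pass on dummy input.
preds = model.predict(np.zeros((4, MAX_LEN, N_FEATURES)), verbose=0)
```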
We can observe that the LSTM approach significantly improved the accuracy (90.96%) compared to the logistic regression model (84.7%). We can see below that the LSTM model outperformed the logistic regression model in classifying each customer-journey sequence as converted or not converted. The false-positive and false-negative counts are quite low in proportion to the results obtained from the logistic regression model.
To further reaffirm our model's skill, let us look at the ROC AUC curve. The score for the LSTM (0.955) is a confident indicator that our model is skilled and can make good predictions of customer conversions given a sequence of their interactions.
We were able to accurately (90.96%) predict if a series of events in a customer's journey leads to a conversion using the LSTM model (Project Goal 1).
We compared the performance of all the models and visualized their score metrics to understand the limitations and pros of each model.
We were able to extract attribution weights of each marketing campaign that can be used by marketers to make informed decisions while optimizing the budget for these campaigns. Depending upon the return per impression, they can choose to pump or divert funds from a campaign (Project Goal 2).
We have discussed, implemented, and evaluated several attribution models that provide a solid foundation for measuring the efficiency of marketing activities and the optimization of budgeting parameters. We have seen that the state-of-the-art models that consume sequences of events provide superior accuracy and greatly simplify feature engineering.
We sampled our data to consider only 200 campaigns since we had limited memory for processing data at such a scale. An effective way to overcome this memory issue would be to implement the data pre-processing, which takes up about 70% of the total project time, in PySpark.
From the results, we observe that there is no plot of attribution weights for the LSTM model: an LSTM does not provide a simple way to extract attribution weights from the fitted model. Hence, it would be wise to implement an attention-based model from which we could extract the attention weights.
To accurately optimize the budget across each marketing campaign, we can simulate campaign execution with new budgeting constraints by replaying historical events.
We generally assumed the availability of historical data for modeling and optimization. However, the same techniques can be combined with reinforcement learning to evaluate and adjust budgeting parameters dynamically. This approach can be particularly useful for sponsored search bids optimization and other use cases in which a large number of budgeting parameters need to be tuned dynamically [7].
Click here to be redirected to this project's GitHub repository.
N. Li, S. K. Arava, C. Dong, Z. Yan, and A. Pani, “Deep Neural Net with Attention for Multi-channel Multi-touch Attribution,” arXiv.org, 06-Sep-2018. [Online]. Available: https://arxiv.org/abs/1809.02230. [Accessed: 14-May-2021].
E. Diemert, J. Meynet, P. Galland, and D. Lefortier, “Attribution Modeling Increases Efficiency of Bidding in Display Advertising,” arXiv.org, 21-Jul-2017. [Online]. Available: https://arxiv.org/abs/1707.06409. [Accessed: 14-May-2021].
K. Ren, Y. Fang, W. Zhang, S. Liu, J. Li, Y. Zhang, Y. Yu, and J. Wang, “Learning Multi-touch Conversion Attribution with Dual-attention Mechanisms for Online Advertising,” arXiv.org, 30-Aug-2018. [Online]. Available: https://arxiv.org/abs/1808.03737. [Accessed: 14-May-2021].
Morten Hegewald. 2019. "Marketing Channel Attribution with Markov Chains in Python-2". https://towardsdatascience.com/marketing-channel-attribution-with-markov-chains-in-python-part-2-the-complete-walkthrough-733c65b23323
Lucy Alexander. 2021. "The Who, What, Why, & How of Digital Marketing". https://blog.hubspot.com/marketing/what-is-digital-marketing
Shweta Verma. 2020. "Deep Neural Nets for Multi-Touch Attribution". expressanalytics.org. https://medium.com/@shweta_44668/deep-neural-nets-for-multi-touch-attribution-114c7e78f5a8#_nr2mj0nwahio
J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He, “Deep Reinforcement Learning for Sponsored Search Real-time Bidding,” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
“Criteo Attribution Modeling for Bidding Dataset,” Criteo AI Lab, 23-Nov-2020. [Online]. Available: https://ailab.criteo.com/criteo-attribution-modeling-bidding-dataset/. [Accessed: 14-May-2021].