Andrey Shestakov and Engelbert Mephu Nguifo (Higher School of Economics, Clermont University, Blaise Pascal University, LIMOS):
Predicting web-page popularity with Machine Learning and Heuristic Time-Series Prediction approaches.
We used two approaches to tackle the task. The first applies well-known machine learning techniques to predict a target feature of a web page. We compared several algorithms and chose Random Forest and Lasso Regression, as they demonstrated superior results. The second approach, which we refer to as the heuristic time-series approach, represents the data with three components: a multiplicative seasonal component, a "forgetting" component and a naive prediction component.
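The abstract does not spell out how the three components are combined, so the following is only one possible reading, written as a minimal Python sketch. It assumes that the seasonal factors are already known, that "forgetting" means exponentially down-weighting older observations, and that the naive step carries the resulting level forward unchanged. The function name, the `season_factors` argument and the `decay` parameter are all hypothetical.

```python
def heuristic_forecast(series, season_factors, decay=0.5):
    """Forecast the next value of `series` (oldest observation first).

    Assumptions of this sketch, not taken from the abstract:
    - series[t] was observed at seasonal position t % len(season_factors);
    - season_factors holds one multiplicative factor per position;
    - an observation that is k steps old gets weight decay ** k.
    """
    period = len(season_factors)
    # Seasonal component: divide out the multiplicative factor.
    deseasonalized = [x / season_factors[t % period] for t, x in enumerate(series)]
    # "Forgetting" component: exponentially down-weight older observations.
    n = len(deseasonalized)
    weights = [decay ** (n - 1 - t) for t in range(n)]
    level = sum(w * x for w, x in zip(weights, deseasonalized)) / sum(weights)
    # Naive component: carry the level forward, then re-apply the
    # seasonal factor of the step being predicted.
    return level * season_factors[n % period]
```

For example, `heuristic_forecast(hourly_visits, hour_of_day_factors)` would return a forecast for the next hour, where both argument names are placeholders for data the reader supplies.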
Marc Boullé (Orange Labs):
Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics.
Our submission exploits a selective naive Bayes regressor, together with automatic feature construction from the input time series. The data of the challenge is represented using a multi-table schema, with pages as the main statistical units and the time series records in a secondary table. Using a small set of construction rules, one thousand new features are created automatically to enrich the representation of pages. These features are then preprocessed to assess their relevance, and a small subset of them is selected using the selective naive Bayes regressor. Our submission, obtained almost automatically, was ranked 3rd on each task.
Gitte Vanwinckelen and Wannes Meert (KU Leuven):
Predicting the popularity of online articles with random forests.
We present an analysis of the time series data generated by the Chartbeat web analytics engine, which was made available for this competition, and the approach we used to predict page visits. Our model is based on random forest regression and is learned on a set of features derived from the given time series data to capture the expected number of visits, the rate of change and temporal effects. Our approach won second place for predicting the number of visitors and the number of Facebook likes, and first place for predicting the number of tweets.
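The three feature families named in this abstract could be derived along the following lines. This is a minimal Python sketch under stated assumptions, not the authors' feature set: the function name, the dictionary keys and the choice of a least-squares slope as the "rate of change" are all hypothetical.

```python
def series_features(visits, start_hour):
    """Derive simple features from a visit series (oldest first).

    visits: hypothetical list of visit counts, one per time step.
    start_hour: hour of day of the first observation.
    """
    n = len(visits)
    mean = sum(visits) / n
    # Rate of change: least-squares slope of visits against the step index.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = (
        sum((t - t_mean) * (v - mean) for t, v in enumerate(visits)) / denom
        if denom
        else 0.0
    )
    return {
        "mean_visits": mean,            # expected number of visits
        "last_visits": visits[-1],      # most recent level
        "slope": slope,                 # rate of change
        "start_hour": start_hour % 24,  # temporal effect
    }
```

One such dictionary per page could then be turned into a feature matrix and passed to a random forest regressor, for instance scikit-learn's `RandomForestRegressor`; the abstract does not say which implementation was used.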
Flavio Figueiredo, Jussara Almeida and Marcos Gonçalves (Universidade Federal de Minas Gerais):
Improving the Effectiveness of Content Popularity Prediction Methods using Time Series Trends.
We here present a simple, yet very effective, model to predict the popularity of web content. Our solution, which is the winner of two of the three tasks of the ECML/PKDD 2014 Predictive Analytics Challenge, aims at predicting user engagement metrics, such as number of visits and social network engagement, that a given web page will achieve 48 hours after its upload, using only information available in the first hour after upload. Our model is based on two steps. We first use time series clustering techniques to extract common temporal trends of content popularity. Next, we use linear regression models, exploiting as predictors both content features (e.g., numbers of visits and mentions on online social networks) and metrics that capture the distance between the popularity already observed and the popularity trends extracted in the first step. We discuss why this simple model is effective and show its gains over state-of-the-art solutions.
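The two-step idea in this abstract can be illustrated with a small Python sketch. The abstract does not name the clustering algorithm or the distance measure, so plain k-means with Euclidean distance on sum-normalized series is swapped in here as an assumption, all series are assumed to have the same length, and every function name is hypothetical.

```python
import math


def normalize(series):
    """Scale a series to sum to 1 so that only its shape matters."""
    total = sum(series)
    return [x / total for x in series] if total else list(series)


def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_trends(series_list, k, iterations=10):
    """Step 1: plain k-means on normalized series; returns k centroid trends."""
    shapes = [normalize(s) for s in series_list]
    centroids = [list(s) for s in shapes[:k]]  # deterministic init for the sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in shapes:
            nearest = min(range(k), key=lambda j: distance(s, centroids[j]))
            clusters[nearest].append(s)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def trend_distance_features(series, centroids):
    """Step 2 input: distance from an observed series to each trend."""
    shape = normalize(series)
    return [distance(shape, c) for c in centroids]
```

The distances returned by `trend_distance_features` would be appended to the other predictors (visit counts, social network mentions) before fitting the linear regression; the regression itself is omitted from this sketch.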
The workshop will take place on Friday, September 19th, from 10:15 to 11:30, in room 106 of the Centre Prouvé in Nancy, France.
Notes by Carlos Castillo taken during the workshop.
With respect to this workshop:
Aspects to consider in future competitions: