Papers and Workshop

Accepted Papers

Andrey Shestakov and Engelbert Mephu Nguifo (Higher School of Economics, Clermont University, Blaise Pascal University, LIMOS):

Predicting web-page popularity with Machine Learning and Heuristic Time-Series Prediction approaches.

We used two approaches to deal with the task. The first approach is a straightforward application of well-known machine learning techniques to predict a target feature of a web page. We compared several algorithms and chose Random Forest and Lasso Regression, as they demonstrated superior results. The idea of the second approach, referred to as the heuristic time-series approach, is to represent the data with three components: a multiplicative seasonal component, a "forgetting" component, and a naive prediction component.
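
As a rough illustration of the second approach, the sketch below blends a multiplicative seasonal profile, an exponentially decaying ("forgetting") level estimate, and a naive last-value prediction. The season length, decay rate, and 0.5/0.5 blend weights are assumptions for illustration (assuming the series spans at least a few full seasons), not the authors' actual implementation.

```python
import numpy as np

def heuristic_forecast(visits, season_length=24, horizon=47, decay=0.9):
    """Forecast future visit counts from an observed series (illustrative only)."""
    visits = np.asarray(visits, dtype=float)

    # Multiplicative seasonal component: average profile over full seasons,
    # normalized to have mean 1.
    n_full = len(visits) // season_length * season_length
    profile = visits[:n_full].reshape(-1, season_length).mean(axis=0)
    profile /= profile.mean() + 1e-9

    # "Forgetting" component: exponentially weighted level, so recent
    # observations count more than old ones.
    weights = decay ** np.arange(len(visits))[::-1]
    level = np.average(visits, weights=weights)

    # Naive component: simply repeat the last observed value.
    naive = visits[-1]

    preds = []
    for h in range(1, horizon + 1):
        season = profile[(len(visits) + h - 1) % season_length]
        # Equal blend of the seasonal forecast and the naive forecast;
        # the weights here are placeholders.
        preds.append(0.5 * level * season + 0.5 * naive)
    return np.array(preds)
```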

Marc Boullé (Orange Labs):

Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics.

Our submission exploits a selective naive Bayes regressor together with automatic feature construction from the input time series. The challenge data is represented using a multi-table schema, with pages as the main statistical units and the time-series records in a secondary table. Using a small set of construction rules, one thousand new features are created automatically to enrich the representation of pages. These features are then preprocessed to assess their relevance, and a small subset of them is selected by the selective naive Bayes regressor. Our submission, obtained almost automatically, was ranked 3rd on each task.
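
To give a concrete flavor of this style of automatic feature construction, here is a minimal sketch that aggregates a secondary time-series table into per-page features. The schema (`page_id`, `minute`, `visits`) and the aggregation rules are hypothetical examples, not the construction rules actually used in the submission.

```python
import pandas as pd

def construct_features(series_table: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a secondary table of time-series records into per-page features.

    Assumes columns ['page_id', 'minute', 'visits']; in a real system, many
    such rules would be applied to generate on the order of a thousand features.
    """
    ordered = series_table.sort_values(["page_id", "minute"])
    grouped = ordered.groupby("page_id")["visits"]

    features = grouped.agg(["mean", "max", "sum", "std", "last"])
    # Example derived rule: average change per minute over the observed window.
    features["slope"] = grouped.apply(
        lambda s: (s.iloc[-1] - s.iloc[0]) / max(len(s) - 1, 1)
    )
    return features
```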

Gitte Vanwinckelen and Wannes Meert (KU Leuven):

Predicting the popularity of online articles with random forests.

We present an analysis of the time-series data generated by the Chartbeat web analytics engine, which was made available for this competition, and the approach we used to predict page visits. Our model is based on random forest regression and is learned on a set of features derived from the given time-series data to capture the expected number of visits, the rate of change, and temporal effects. Our approach won second place for predicting the number of visitors and the number of Facebook likes, and first place for predicting the number of tweets.
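
A minimal sketch of this kind of pipeline, using scikit-learn's RandomForestRegressor on synthetic stand-ins for the three feature groups named above (visit level, rate of change, temporal effect). The features, target formula, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the derived features: current visit level,
# rate of change in the first hour, and hour-of-day of upload.
level = rng.lognormal(mean=4, sigma=1, size=500)
rate = rng.normal(0, 1, size=500)
hour = rng.integers(0, 24, size=500)
X = np.column_stack([level, rate, hour])

# Synthetic target: visits after 48 hours (a made-up generative formula).
y = level * (2 + 0.5 * rate) + rng.normal(0, 5, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))
```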

Flavio Figueiredo, Jussara Almeida and Marcos Gonçalves (Universidade Federal de Minas Gerais):

Improving the Effectiveness of Content Popularity Prediction Methods using Time Series Trends.

We present a simple, yet very effective, model to predict the popularity of web content. Our solution, which won two of the three tasks of the ECML/PKDD 2014 Predictive Analytics Challenge, aims at predicting user engagement metrics, such as the number of visits and social network engagement, that a given web page will achieve 48 hours after its upload, using only information available in the first hour after upload. Our model is based on two steps. We first use time-series clustering techniques to extract common temporal trends of content popularity. Next, we use linear regression models, exploiting as predictors both content features (e.g., numbers of visits and mentions on online social networks) and metrics that capture the distance between the popularity already observed and the popularity trends extracted in the first step. We discuss why this simple model is effective and show its gains over state-of-the-art solutions.
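
As a sketch of this two-step structure under stated assumptions, the snippet below clusters normalized early-popularity curves with KMeans to extract common trends, then feeds the distances to each trend centroid, alongside a content feature, into a linear regression. The data, cluster count, and feature choices are synthetic placeholders, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic early-popularity curves: 60 one-minute visit counts per page.
curves = rng.poisson(lam=5.0, size=(300, 60)).astype(float)
final_popularity = curves.sum(axis=1) * rng.uniform(1.5, 2.5, size=300)

# Step 1: extract common temporal trends by clustering normalized curves.
normalized = curves / (curves.sum(axis=1, keepdims=True) + 1e-9)
trends = KMeans(n_clusters=4, n_init=10, random_state=0).fit(normalized)

# Step 2: regression on a content feature plus distances to each trend centroid.
distances = trends.transform(normalized)        # distance to each extracted trend
content = curves.sum(axis=1, keepdims=True)     # e.g., total early visits
X = np.hstack([content, distances])
model = LinearRegression().fit(X, final_popularity)
print(model.score(X, final_popularity))
```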

Workshop Program

The workshop will take place on Friday, September 19th, from 10:15 to 11:30, in room 106 of the Centre Prouvé in Nancy, France.

    • Detailed 20-minute presentations will be given in the order in which the papers are listed above: Shestakov, Boullé, Vanwinckelen, and Figueiredo.
    • Additionally, there will be a round table discussion of future research directions and feedback from participants.

Notes from Workshop Discussion (conclusions and notes for future competitions)

Notes by Carlos Castillo taken during the workshop.

With respect to this workshop:

    • One model per website seems to be the way to go.
    • A fixed reference time (48 hours after publication) is an interesting setting, but there might be other settings, e.g., predicting the next 6 hours independently of whether 1, 12, or 24 hours have passed.
    • Evaluation could be done in terms of ranking instead of log(.), given that the scales of different pages are not comparable (see the sketch after this list).
    • Some participants were willing to share their code at the end of the competition; perhaps this could be encouraged.
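
Regarding the evaluation point above, the sketch below contrasts a log-scale error with a scale-free ranking metric (Spearman rank correlation) on synthetic heavy-tailed traffic; the data and metric choices are illustrative assumptions, not a prescription from the workshop.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Heavy-tailed "true" page visits, with multiplicative prediction noise.
true_visits = rng.lognormal(mean=5, sigma=2, size=100)
predicted = true_visits * rng.lognormal(mean=0, sigma=0.3, size=100)

# Log-scale error (as used in the challenge) versus a ranking-based metric.
log_rmse = np.sqrt(np.mean((np.log(predicted) - np.log(true_visits)) ** 2))
rank_corr, _ = spearmanr(predicted, true_visits)
print(f"log-RMSE: {log_rmse:.3f}, Spearman rank correlation: {rank_corr:.3f}")
```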

Aspects to consider in future competition:

    • Positive aspects
      • Participants observed that training and testing errors were very similar in all cases, which is good.
      • Participants valued this dataset and would like the competition data to remain available for some time for others to use.
    • Negative aspects
      • The final results should not be decided on the same dataset as the leaderboard. When the leaderboard set is identical to the final test set, participants can optimize for it by repeatedly sending submissions.
      • Dates should not be changed under any circumstances. Participants decide their schedule for participation in advance and "book" the time to work on the problem intensively during the competition dates. Changing the competition dates derails participants' agendas and is unfair to some of them.
    • Further data that could be made available
      • Traffic sources (e.g. percentage of visits from search, internal links, external links) could have been used but were not available.
      • Content-based features could have been exploited but they were not available. Perhaps there are some ways of exposing content-based features while keeping the pseudonymity of websites (e.g. projecting each web page into a set of topics and providing topic vectors, or providing a content similarity matrix).
      • Graph structure features could also have been incorporated.