Task description
We provide data from the real-time analytics engine Chartbeat. For a sample of URLs posted on the web, this data contains time series of the number of pageviews of those URLs and of the number of messages posted on Twitter and Facebook that include those URLs.
A fixed list of (pseudonymous) websites is used, from which the training and testing portions of the data are sampled. This reproduces a practical scenario in which a website is observed during a certain period of time, a model is built for that website, and then that model is used to predict the activity of future users.
The first task is to predict the number of pageviews a URL will receive during its first 48 hours, based on observations made during the first hour after publication. The second and third tasks are to predict the number of messages on Twitter and Facebook containing this URL that will be posted during its first 48 hours, based on the same observations.
The base dataset is obtained from a sample of 100 websites, whose identities are kept pseudonymous to prevent the use of external information. From each website, 600 URLs posted during 2013 and having at least 10 visits are sampled uniformly at random.
Each host's data is contained in one file, as a JSON record of the form:
{"host_id": HOST_ID, "pages": PAGE_LIST }
Where PAGE_LIST is an array of records of the form:
{"page_id": PAGE_ID,
"posted_weekday": POSTED_DAY_OF_WEEK,
"posted_hour": POSTED_HOUR_OF_DAY,
"sum_visits_48h": SUM_VISITS_48H,
"series_1h": SERIES,
"series_48h": SERIES
}
SERIES is an object of the form:
{"twitter": [TWEET_COUNT_SERIES],
"facebook": [FACEBOOK_LIKES_SERIES],
"time": [AVERAGE_ACTIVE_TIME_SERIES],
"visits": [VISITORS_COUNT_SERIES]
}
The meanings of the fields are:
- HOST_ID: pseudonymous identifier of the website.
- PAGE_ID: pseudonymous identifier of the URL within the website.
- POSTED_DAY_OF_WEEK, POSTED_HOUR_OF_DAY: day of the week and hour of the day at which the URL was posted.
- SUM_VISITS_48H: total number of visits received by the URL during its first 48 hours.
- series_1h: time series covering the first 60 minutes after posting (12 records, one per 5 minutes).
- series_48h: time series covering the first 48 hours after posting (576 records, one per 5 minutes).
- TWEET_COUNT_SERIES: number of tweets containing the URL in each 5-minute interval.
- FACEBOOK_LIKES_SERIES: number of Facebook likes of the URL in each 5-minute interval.
- AVERAGE_ACTIVE_TIME_SERIES: average active time of visitors in each 5-minute interval.
- VISITORS_COUNT_SERIES: number of visitors in each 5-minute interval.
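As an illustration, the following minimal Python sketch loads one host file and sums the first-hour counts of each page. The file name host_00.json is a hypothetical example; the actual naming of the host files is not specified in this description.

import json

# Load one host file; "host_00.json" is a hypothetical name.
with open("host_00.json") as f:
    host = json.load(f)

for page in host["pages"]:
    first_hour = page["series_1h"]  # 12 records, one per 5 minutes
    visits_1h = sum(first_hour["visits"])
    tweets_1h = sum(first_hour["twitter"])
    likes_1h = sum(first_hour["facebook"])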
The tasks consist of using the first 60 minutes of observations for a URL to predict the total number of visitors, tweets on Twitter, and likes on Facebook that the URL will receive during its first 48 hours (the sum of all the observations for that URL).
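For concreteness, a training example for the three tasks could be assembled as in the sketch below, under the field meanings listed above. The helper make_example is hypothetical and not part of the challenge code.

def make_example(page):
    x = page["series_1h"]   # observations from the first 60 minutes
    features = x["visits"] + x["twitter"] + x["facebook"] + x["time"]
    y = page["series_48h"]  # all 576 records spanning 48 hours
    targets = {
        "visits": sum(y["visits"]),    # also given as sum_visits_48h
        "twitter": sum(y["twitter"]),
        "facebook": sum(y["facebook"]),
    }
    return features, targets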
The data are split into two parts. From the URLs of each website, 300 are selected randomly for the public challenge data (30,000 URLs), and the remaining 300 are included in the secret evaluation data (30,000 URLs).
The public challenge data is given to participants and contains all 576 records for each of its 30,000 URLs.
The secret evaluation data, corresponding to 30,000 URLs, is split as follows. The first 12 records of each URL (corresponding to the first 60 minutes) are given to participants under the non-disclosure agreement. The remaining 564 records are kept for evaluation purposes.
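Since evaluation URLs expose only their first 12 records, a model must map the first-hour series to a 48-hour total. One hypothetical per-host baseline, shown below as a sketch and not an official part of the challenge, scales the first-hour count by the average 1h-to-48h growth ratio observed on that host's public pages.

def fit_growth_ratio(train_pages, signal):
    # Average 1h-to-48h growth ratio over a host's public pages;
    # signal is one of "visits", "twitter", "facebook".
    total_48h = sum(sum(p["series_48h"][signal]) for p in train_pages)
    total_1h = sum(sum(p["series_1h"][signal]) for p in train_pages)
    return total_48h / total_1h if total_1h else 0.0

def predict_total(page, ratio, signal):
    # Extrapolate the first hour of an evaluation page to 48 hours.
    return ratio * sum(page["series_1h"][signal])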
How to get access to the data? See Participating »