Data Description

Task description

We provide data from the real-time analytics engine Chartbeat. For a sample of URLs posted on the web, the data contain a time series of the number of pageviews each URL received and of the number of messages posted on Twitter and Facebook that include it.

A fixed list of (pseudonymous) websites is used, from which the training and testing portions of the data are sampled. This reproduces a practical scenario in which a website is observed during a certain period of time, a model is built for that website, and then that model is used to predict the activity of future users.

The first task is to predict the number of pageviews a URL will receive during its first 48 hours, based on observations made during the first hour after publication. The second and third tasks are to predict, from the same observations, the number of messages containing that URL that will be posted on Twitter and on Facebook during its first 48 hours.

Dataset

The base dataset is obtained from a sample of 100 websites, whose identities are kept pseudonymous to prevent the use of external information. From each website, 600 URLs posted during 2013 and having at least 10 visits are sampled uniformly at random.

Each host's data is contained in one file holding a JSON record of the form:

{"host_id": HOST_ID, "pages": PAGE_LIST }

Where PAGE_LIST is an array of records of the form:

{"page_id": PAGE_ID,
 "posted_weekday": POSTED_DAY_OF_WEEK,
 "posted_hour": POSTED_HOUR_OF_DAY,
 "sum_visits_48h": SUM_VISITS_48H,
 "series_1h": SERIES,
 "series_48h": SERIES}

SERIES is an object of the form:

{"twitter": [TWEET_COUNT_SERIES],
 "facebook": [FACEBOOK_LIKES_SERIES],
 "time": [AVERAGE_ACTIVE_TIME_SERIES],
 "visits": [VISITORS_COUNT_SERIES]}

The meanings and types of the fields are:

  • POSTED_DAY_OF_WEEK is an integer from 0-6 giving the day of the week, in UTC, on which the page was made available online (0=Monday, 6=Sunday).
  • POSTED_HOUR_OF_DAY is an integer from 0-23 giving the hour of the day, in UTC, at which the page was made available online.
  • SUM_VISITS_48H is the number of visits the article received in its first 48 hours.
  • SERIES_1H is the time series for the first hour (12 elements), and SERIES_48H is the time series for the first 48 hours (576 elements). Each value corresponds to one 5-minute window, and the windows are ordered chronologically. Each series includes the number of visitors, the number of tweets on Twitter, the number of Facebook likes, and the average time visitors were active on the page.
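As a concrete illustration, the record layout above can be exercised with a small synthetic example. All identifiers and values here are illustrative placeholders, not real data (real files contain 600 pages per host):

```python
import json

# Build a synthetic host record following the schema above,
# serialize it to JSON, and parse it back as a participant would.
host_json = json.dumps({
    "host_id": "host_0001",
    "pages": [{
        "page_id": "page_0001",
        "posted_weekday": 2,      # 0=Monday ... 6=Sunday (UTC)
        "posted_hour": 14,        # 0-23 (UTC)
        "sum_visits_48h": 1234,
        # Each SERIES object has four signals, one value per 5-minute window.
        "series_1h":  {k: [0] * 12  for k in ("twitter", "facebook", "time", "visits")},
        "series_48h": {k: [0] * 576 for k in ("twitter", "facebook", "time", "visits")},
    }],
})

host = json.loads(host_json)
page = host["pages"][0]
assert len(page["series_1h"]["visits"]) == 12    # 12 windows = 60 minutes
assert len(page["series_48h"]["visits"]) == 576  # 576 windows = 48 hours
```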

The tasks consist of using the first 60 minutes of observations for a URL to predict the total number of visitors, tweets on Twitter, and likes on Facebook the URL will receive during its first 48 hours (the sum of all observations for that URL).
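The three prediction targets can be derived from a parsed page record by summing the 48-hour series, while the model inputs come from the 1-hour series. A minimal sketch, assuming a page dict following the schema above (the example series are shortened for brevity; real series have 12 and 576 elements):

```python
# Sketch: extract the three 48-hour targets and the 1-hour input features
# from one page record (a dict parsed from the JSON schema above).
def targets_and_features(page):
    s48 = page["series_48h"]
    targets = {
        "visits_48h":   sum(s48["visits"]),    # total pageviews target
        "tweets_48h":   sum(s48["twitter"]),   # total tweets target
        "fb_likes_48h": sum(s48["facebook"]),  # total Facebook likes target
    }
    features = page["series_1h"]  # first 12 five-minute windows per signal
    return targets, features

# Illustrative page record with shortened series.
page = {
    "series_1h":  {"visits": [3, 5], "twitter": [1, 0],
                   "facebook": [0, 2], "time": [30.0, 42.5]},
    "series_48h": {"visits": [3, 5, 8], "twitter": [1, 0, 4],
                   "facebook": [0, 2, 1], "time": [30.0, 42.5, 25.0]},
}
targets, features = targets_and_features(page)
# targets == {"visits_48h": 16, "tweets_48h": 5, "fb_likes_48h": 3}
```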

Data split

The data are split into two parts. From each website's URLs, 300 are selected at random for the public challenge data (30,000 URLs in total), and the remaining 300 form the secret evaluation data (30,000 URLs in total).

The public challenge data is given to participants and contains all 576 records for each of its 30,000 URLs.

The secret evaluation data, corresponding to the other 30,000 URLs, is split as follows: the first 12 records of each URL (covering 60 minutes) are given to participants under a non-disclosure agreement, while the remaining 564 records are withheld for evaluation.
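Participants can emulate this evaluation split on the public data by truncating each 48-hour series to its first 12 windows. A sketch, assuming a page dict following the schema above:

```python
# Sketch: split one page's 48-hour series into the participant-visible
# part (first 12 windows = 60 minutes) and the held-out part
# (remaining 564 windows), mirroring the secret-evaluation split.
def split_for_evaluation(page):
    visible  = {k: v[:12] for k, v in page["series_48h"].items()}
    held_out = {k: v[12:] for k, v in page["series_48h"].items()}
    return visible, held_out

# Illustrative record with full-length (576-element) series.
series = {"visits": list(range(576)), "twitter": [0] * 576,
          "facebook": [0] * 576, "time": [0.0] * 576}
visible, held_out = split_for_evaluation({"series_48h": series})
assert len(visible["visits"]) == 12
assert len(held_out["visits"]) == 564
```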

How to get access to the data? See Participating »