ASSISTments Data Mining Competition 2017: Details

We are organizing the Workshop on Scientific Findings from the ASSISTments Longitudinal Data Competition at the 11th International Conference on Educational Data Mining in Buffalo, NY, July 15-18, 2018. For more information, please visit our workshop website.

Introduction Video by Professor Neil Heffernan

Overview: This competition is sponsored by the Big Data for Education Spoke of the Northeast Big Data Hub, an NSF initiative to help spur progress in educational research using big data. Individuals and teams competing in this competition will use educational data from ASSISTments, an intelligent tutoring system for middle-school mathematics, to make long-term predictions, and winners will be asked to publish their work in the Journal of Educational Data Mining.

Long-term outcome being modeled: The task in this competition is to develop a cross-validated prediction model that uses middle-school data to predict whether students (who have since finished college) pursued a career in a STEM field (1) or not (0).


By downloading our dataset, using our dataset, registering to the competition, and/or submitting predictions, you have agreed to our Terms of Use.

The dataset for this competition, in which you will predict which students entered STEM career fields (and which did not), includes the following files:

  • student_log_#.csv (9 files), which contain the logged activity data from students' interactions with ASSISTments.
  • training_label.csv, which contains some student-level data and the dependent measure for the training set: isSTEM.
  • validation_test_label.csv, which contains some student-level data but not the dependent measure for the validation and test sets.

More information can be found in our column label descriptions for these files.
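As a rough illustration, one common workflow is to aggregate each student's log rows into student-level features and then join them to the training labels. The sketch below uses tiny inline stand-ins for the real files; the student-ID column name and the log columns here are assumptions, not the actual schema (consult the column label descriptions for the real column names).

```python
import io
import pandas as pd

# Hypothetical miniature stand-ins for student_log_#.csv and training_label.csv.
# Column names other than isSTEM are illustrative assumptions.
log_csv = io.StringIO("student_id,problem_id,correct\n1,101,1\n1,102,0\n2,101,1\n")
label_csv = io.StringIO("student_id,isSTEM\n1,1\n2,0\n")

logs = pd.read_csv(log_csv)
labels = pd.read_csv(label_csv)

# Aggregate each student's log rows into simple features, then attach the label.
features = (logs.groupby("student_id")
                .agg(n_actions=("problem_id", "count"),
                     pct_correct=("correct", "mean"))
                .reset_index())
train = features.merge(labels, on="student_id", how="inner")
print(train)
```

With the real data, the same pattern applies: concatenate the nine log files, aggregate per student, and merge with training_label.csv before fitting a model.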


There is no fee to participate, but you must register here, where you will receive a key that allows you to submit your models for evaluation.

Prediction Model Submissions:

Models can be submitted (here) daily for evaluation, using the following format.

1. The submission must be a comma-separated string of 172 predictions (same number as the rows in validation_test_label.csv) without white spaces.

2. A prediction must be a number between 0 and 1, inclusive. Each number may be an integer (e.g., 1), a decimal (e.g., 0.69472), or a decimal without a leading zero (e.g., .69472).

3. The predictions must be in the same order as the students in validation_test_label.csv.

4. You also need to enter the 10-character key that was sent to you during registration.

5. Submissions are evaluated only once every day, at noon EST. We will only evaluate the latest submission from each registered participant.
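The formatting rules above can be checked programmatically before submitting. This is a minimal sketch (the placeholder predictions stand in for real model output):

```python
# Placeholder model output: one probability per student in
# validation_test_label.csv, in the same row order as that file.
predictions = [0.5] * 172

# Sanity checks mirroring the submission rules.
assert len(predictions) == 172           # one prediction per student
assert all(0.0 <= p <= 1.0 for p in predictions)  # each in [0, 1]

# Comma-separated, no white space.
submission = ",".join(f"{p:.5f}" for p in predictions)
assert " " not in submission
print(submission[:23])  # preview the first few entries
```

Pasting the resulting string into the submission form, along with your registration key, completes a submission.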


In keeping with the cross-validation practices that are an important part of the Educational Data Mining community, we have held out two randomly selected portions of this dataset for evaluating prediction models:

(1) a validation set, which is used to give participants formative feedback on their prediction models; they can resubmit as often as once daily between now and December 1st.

(2) a test set, which will be used for the final evaluation of prediction models on December 1st. The labels of the test set will not be available to participants by any means.

We will use the validation set to evaluate any newly submitted models each day at noon EST.

The Winner:

Submissions are evaluated using two measures: RMSE and AUC. The winner is the competitor whose submission achieves both low RMSE and high AUC on the test set. More specifically, we will use the linear aggregation of the two, (1 - RMSE) + AUC, to determine the winner; higher scores are better.
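A minimal sketch of this scoring, using scikit-learn on toy labels and predictions (the real evaluation runs on the hidden test set):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Toy ground-truth labels and predicted probabilities for illustration.
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # lower is better
auc = roc_auc_score(y_true, y_pred)                 # higher is better

# The competition's combined score: higher is better.
score = (1 - rmse) + auc
print(round(score, 4))
```

A perfect classifier would reach a score of 2.0 (RMSE of 0 and AUC of 1).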

In addition, participants must agree to submit the code that produces the submitted predictions in order to be eligible for the winner's awards.

The best submission of each competitor will be updated on our PUBLIC SCOREBOARD, which allows you to see how your model compares to those of other competitors.


Common submission errors:

There are a few things that can invalidate your submission.

  • Email address: you can check whether you entered the right address by checking the submission log file (Google Doc) that was sent to you when you registered. If you entered the wrong address, a new row for your submission will not appear.
  • Key: if you enter the right email but a wrong key, the submission will still show up in the log file, but marked isValidated = FALSE. In that case, it will not count as your latest submission; only the latest submission with isValidated = TRUE is evaluated.
  • File format: if your predictions do not match the format specified above, the form will show an error message and you will not be able to complete the submission.

If you have any questions, comments, or concerns, please contact us at [at] gmail [dot] com