Challenge Tracks

Two Tracks: Knowledge Tracing and Final Grade Prediction

The Data Challenge will consist of two related tracks, each with a separate modeling goal. Participants can submit to either or both of these tracks, which may benefit from the same modeling approaches. Both tracks involve using previous students' programming process data to model student learning and predict whether students will succeed in future tasks. This is a central challenge of student modeling, in particular knowledge tracing [Corbett & Anderson]. The two proposed tracks are:

  • Track 1: Knowledge Tracing: predicting a student's performance on a problem before they start it, based on their performance on prior problems.

  • Track 2: Early Grade Prediction: predicting students' final exam grades in the course near the halfway point of the course.


Why it matters: Predictive student models can be used to enable mastery learning [Corbett & Anderson], to provide adaptive feedback that targets struggling students [Murray & VanLehn], or to encourage students through an open learner model [Brusilovsky et al.]. They can also be used to identify students at risk of failure early in a course [Mao et al.]. While similar knowledge tracing tasks have been attempted in many domains (including programming), they usually rely on labels for each problem, identifying which Knowledge Components (KCs), or domain concepts, are required by that problem (e.g. in [Kasurinen & Nikula]).


The specific goal of this Data Challenge is to leverage CS-specific aspects of the data, namely the source code, to build a model without these KC labels, as was attempted in [Yudelson et al., Rivers et al.]. For example, one approach is to automatically extract concepts from code (e.g. "loop", "conditional") to use in a model of student knowledge, as in [Rivers et al., Berges & Hubwieser, Hosseini & Brusilovsky].
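As a rough illustration of such concept extraction, one could tag each submission with coarse concept indicators using simple pattern matching on the source text. The patterns, concept names, and example snippet below are our own illustrative assumptions, not part of the challenge data or an official KC labeling:

```python
import re

# Illustrative keyword patterns for coarse programming concepts. These are
# our own assumptions for demonstration, not an official KC labeling.
CONCEPT_PATTERNS = {
    "loop": r"\b(for|while)\b",
    "conditional": r"\bif\b|\belse\b",
    "indexing": r"\[\s*\w+\s*\]",
}

def extract_concepts(code: str) -> set:
    """Return the set of concepts whose pattern appears in the source code."""
    return {name for name, pattern in CONCEPT_PATTERNS.items()
            if re.search(pattern, code)}

# Example on a small Java-like snippet.
snippet = "for (int i = 0; i < n; i++) { if (a[i] > 0) { count++; } }"
print(extract_concepts(snippet))  # e.g. {'loop', 'conditional', 'indexing'}
```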


Track 1: Knowledge Tracing

Goal: The goal of this track is to predict a student's performance on a given programming problem before they start it. In this challenge, we define struggle based on the number of attempts that a student makes at solving the problem before getting it correct, if at all. This is explained in detail below.


Input: Your trained model should take as input the history of a student's attempts at prior programming problems, including the code of their submissions, and their performance on these problems. This will include data for the first 30 problems (3 assignments) of the semester. This data includes all problem attempts during that time period, the submitted code, and the "score" received for each attempt (0-1) indicating how many test cases were passed.
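A minimal sketch of summarizing that attempt history with pandas is shown below; the file name (submissions.csv) and column names are assumptions for illustration, so substitute the actual names used in the released dataset:

```python
import pandas as pd

# File and column names here are assumptions for illustration; use the names
# in the released dataset (e.g. the main submission/event table).
attempts = pd.read_csv("submissions.csv")  # one row per submission attempt

# Summarize each student-problem pair: number of attempts and whether any
# attempt passed all test cases (assuming Score == 1 means full credit).
summary = (attempts
           .groupby(["SubjectID", "ProblemID"])
           .agg(n_attempts=("Score", "size"),
                solved=("Score", lambda s: (s == 1).any()))
           .reset_index())
print(summary.head())
```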


Output: The model should predict whether the student will succeed (1) or struggle (0) on each of the final 20 problems (2 assignments) of the semester. These predictions can enable interventions from the instructor and learning environment.


Note: Typically, Knowledge Tracing uses a student's performance on problems 1...n to predict performance on problem (n+1). However, in order to facilitate the Data Challenge competition, we needed to split the data into early (first 30 problems) and late (last 20 problems) groups. For example, the model will not have any information about a student's performance on problem 40 when predicting problem 41, only data for problems 1-30.
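To make the split concrete, a naive baseline under these constraints might use each student's success rate on the early problems as the predicted probability of success on every late problem. The file and column names below are assumed for illustration:

```python
import pandas as pd

# Assumed inputs for illustration: per-(student, problem) labels for the
# early problems and the (student, problem) pairs to predict for the late ones.
early = pd.read_csv("early_labels.csv")  # columns: SubjectID, ProblemID, Label
late = pd.read_csv("late.csv")           # columns: SubjectID, ProblemID

# Naive baseline: use each student's mean success rate on problems 1-30 as the
# predicted probability of success on each of problems 31-50.
student_rate = early.groupby("SubjectID")["Label"].mean()
late["Label"] = late["SubjectID"].map(student_rate).fillna(student_rate.mean())
print(late.head())
```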


Track 1 Label: Struggle on a Problem

We define a student as struggling on a given problem (Label = 0) if either of the following is true:

  • They attempted the problem but never got the problem correct.

  • They got the problem correct, but they required more attempts to do so than 75% of other students who attempted the problem.


Therefore, the definition of "struggling" varies from problem to problem, with harder problems allowing more attempts without classifying a student as struggling.
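A sketch of how such a label could be derived from attempt counts is shown below, assuming a per-(student, problem) summary table with assumed file and column names; the official preprocessing notebook remains the canonical definition:

```python
import pandas as pd

# Assumed per-(student, problem) summary with attempt counts and eventual
# correctness; this approximates, not replaces, the official preprocessing.
summary = pd.read_csv("attempt_summary.csv")  # SubjectID, ProblemID, n_attempts, solved

# 75th percentile of attempts among students who attempted each problem.
threshold = summary.groupby("ProblemID")["n_attempts"].transform(
    lambda x: x.quantile(0.75))

# Label = 1 (success) only if the problem was eventually solved within the
# per-problem attempt threshold; otherwise Label = 0 (struggle).
summary["Label"] = (summary["solved"] &
                    (summary["n_attempts"] <= threshold)).astype(int)
```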


We chose this definition of struggle for a few reasons, outlined in our preprocessing notebook:

  • Traditionally in programming practice environments, students are given feedback from test cases after each attempt (as was the case in the CodeWorkout environment), so the number of attempts measures how much of this feedback they need to succeed.

  • The number of attempts students needed to complete a problem was generally exponentially distributed with a long tail (many students with few attempts, few with many attempts). The 75th percentile of the number of attempts generally separated the tail of this distribution from the main body of students.

  • We chose not to use success on the first attempt, since we hypothesized that students use the autograder in different ways, and not all students are trying to get the problem correct on their first attempt (e.g. they may just want to see the test cases), so failing the first attempt seemed like an overly broad definition of struggle.

  • We chose not to use overall problem correctness, since most students get most problems correct eventually (~93% of all student/problem pairs were eventually successful in Spring 2019).

  • We chose not to use students' test case scores on individual submissions (only whether they got a problem correct eventually), since it was not clear if they could be interpreted as a raw grade (e.g. is each test case equally important?).

  • We cannot use time on task, since we do not know when students started working on a given problem.


We make no claims that this label is objectively correct, but we do argue that it is a meaningful measure of students' success on a given problem, worth predicting, and that the same techniques we develop to predict this value should be useful for predicting other measures.

Track 1 Evaluation

The primary evaluation metric for Track 1 will be the ROC-AUC value across all student-problem pairs (predictions) for the last 20 problems of the test dataset. The scoring system will also calculate macro F1-score (across both classes) and accuracy for reference.


A few notes:

  • The code for the evaluation procedure can be found here.

  • We chose AUC because it is a standard competition measure (e.g. used in the Riiid competition) and because it captures performance at various intervention thresholds, reflecting the reality of many educational systems (e.g. a low-cost intervention might be deployed for all students with a 20% chance of struggle, while a high-cost intervention might be saved for those with an 80% chance).

  • This means that submissions should include predicted probabilities (continuous values from 0 to 1), not discrete predictions; see the sketch after these notes.
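For reference, the sketch below shows how these metrics could be computed with scikit-learn from true labels and submitted probabilities; the official evaluation code linked above is authoritative, and the 0.5 threshold used here for F1 and accuracy is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

y_true = np.array([1, 0, 1, 1, 0])             # ground-truth success labels
y_prob = np.array([0.9, 0.3, 0.6, 0.4, 0.2])   # submitted probabilities

auc = roc_auc_score(y_true, y_prob)            # primary metric
y_pred = (y_prob >= 0.5).astype(int)           # 0.5 threshold assumed for F1/accuracy
f1 = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
print(auc, f1, acc)
```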


Track 1 Submission

Submit Track 1 on CodaLab (via the link in the Quickstart). Your submission should be a .zip folder with a single predictions.csv file. The file should be based on the Test dataset's late.csv file, with an additional column, Label, including your prediction as a continuous value between 0 (struggle) and 1 (success). An example submission, including the first few expected rows, can be found in the Quickstart.
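A minimal sketch of packaging a Track 1 submission is shown below, assuming late.csv provides the student-problem pairs to predict and using a constant placeholder where your model's probabilities would go:

```python
import zipfile
import pandas as pd

# Assumed path to the Test dataset's late.csv with the student-problem pairs.
late = pd.read_csv("late.csv")

# Replace this constant with your model's predicted probability of success
# (a continuous value in [0, 1]) for each row.
late["Label"] = 0.5

late.to_csv("predictions.csv", index=False)
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions.csv")
```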

Track 2: Early Grade Prediction

Goal: The goal of this track is to predict students' final exam grades in the course (provided in the Student.csv link table) based only on their behavior in the programming environment during the first 30 problems (a little over halfway through the course).


Input (same as Track 1): Your trained model should take as input the history of a student's attempts at prior programming problems, including the code of their submissions, and their performance on these problems. This will include data for the first 30 problems (3 assignments) of the semester. This data includes all problem attempts during that time period, the submitted code, and the "score" received for each attempt (0-1) indicating how many test cases were passed.


Output: The model should output a numeric prediction of the student's final exam grade. This value is provided for students in the training dataset and must be predicted for students in the test dataset. Range (updated 10/13/21): For the Spring 2019 dataset, the target values (exam scores) are scaled to the range 0-1, but for the Fall 2019 dataset, the target values are in the range 0-100. Therefore, when doing cross-dataset prediction (training on Spring and predicting Fall), you should multiply your Fall 2019 predictions/outputs by 100. Failing to do so will result in very high MSE scores. Keep this in mind also for Phase II (within-semester prediction; Fall predicting Fall) if you choose to incorporate the Spring data as well.
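As a concrete illustration of the rescaling (with made-up prediction values), assuming a model trained on Spring 2019 targets in the 0-1 range and applied to Fall 2019 students:

```python
import numpy as np

# Hypothetical predictions from a model trained on Spring 2019 targets (0-1).
fall_predictions = np.array([0.62, 0.88, 0.45])

# Fall 2019 exam grades are on a 0-100 scale, so rescale before scoring
# or submission to avoid inflated MSE.
fall_predictions_rescaled = fall_predictions * 100
print(fall_predictions_rescaled)  # [62. 88. 45.]
```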

Track 2 Evaluation

The primary evaluation metric for Track 2 will be the Mean Squared Error (MSE) across all students in the test dataset. Your model should minimize this value, representing the mean squared difference between actual and predicted exam grades.
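For reference, a minimal example of the metric with made-up grades:

```python
from sklearn.metrics import mean_squared_error

# Made-up grades purely to illustrate the metric.
y_true = [78.0, 92.0, 55.0]   # actual final exam grades
y_pred = [74.0, 90.0, 60.0]   # predicted grades
print(mean_squared_error(y_true, y_pred))  # (16 + 4 + 25) / 3 = 15.0
```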

Track 2 Submission

Submit Track 2 on CodaLab (via the link in the Quickstart). Your submission should be a .zip folder with a single predictions.csv file. The file should be based on the Test dataset's Data/LinkTables/Student.csv file, with an additional column, X-Grade, including your prediction as a continuous value (between 0 and 1 for the Spring 2019 scale; see the range note above if predicting Fall 2019 grades, which are on a 0-100 scale). An example submission, including the first few expected rows, can be found in the Quickstart.
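A minimal sketch of packaging a Track 2 submission, assuming the Student.csv path above and using a constant placeholder where your model's grade predictions would go:

```python
import zipfile
import pandas as pd

# Assumed path within the Test dataset; adjust to the released folder layout.
students = pd.read_csv("Data/LinkTables/Student.csv")

# Replace this constant with your model's predicted exam grade for each
# student (see the range note above for the Spring vs. Fall scale).
students["X-Grade"] = 0.75

students.to_csv("predictions.csv", index=False)
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions.csv")
```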