Dataset: CodeWorkout (Spring and Fall 2019)


The CodeWorkout dataset was collected from a CS1 course taught in the Spring and Fall 2019 semesters at a public university in the U.S. It contains students' code submissions for 50 coding problems, each requiring roughly 10-26 lines of code. In total, 329 students completed the course in Spring and 490 in Fall. Each semester's dataset contains 65K+ code submissions, including the full submitted code. In addition to the submissions themselves, each submission's score (the percentage of unit tests passed) is available, as well as the compiler message when compilation fails. Students' final course grades are also provided.


To download the datasets, see the links under the Quickstart.

Training/Test Split

Each semester's data is split into a training set (75% of students) and a test set (25% of students). Since the Data Challenge is an early prediction task, both the training and test datasets contain data about the first 30 problems, which should be used to make predictions about the final 20 problems (Track 1) and students' final grades (Track 2). The labels for these late problems are provided in the training dataset, but not in the test dataset.
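
To make the split concrete, here is a minimal loading sketch (assuming pandas; the Train/ and Test/ folder names are placeholders for wherever the two downloads are unpacked):

    import pandas as pd

    # Labeled data for training students: both early and late problems.
    train_early = pd.read_csv("Train/early.csv")
    train_late = pd.read_csv("Train/late.csv")

    # Test students: early problems are labeled, but late-problem labels
    # (Track 1) and X-Grade (Track 2) must be predicted.
    test_early = pd.read_csv("Test/early.csv")
    test_late = pd.read_csv("Test/late.csv")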

Data Format: ProgSnap2

Each dataset contains a Data folder, which contains students' trace data stored in the ProgSnap2 format. Additional information on the format can be found in the ProgSnap2 spec and in [Price et al.].


Each training and testing dataset is organized as follows:


Root/
    early.csv
    late.csv
    Data/
        MainTable.csv
        DatasetMetadata.csv
        CodeStates/
            CodeState.csv
        LinkTables/
            Subject.csv

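The trace files under Data/ can be loaded directly. A minimal sketch, assuming pandas ("Root" is a placeholder for one dataset's folder; early.csv and late.csv were shown above):

    import pandas as pd

    root = "Root"  # placeholder path to one downloaded dataset

    main = pd.read_csv(f"{root}/Data/MainTable.csv")
    code_states = pd.read_csv(f"{root}/Data/CodeStates/CodeState.csv")
    subjects = pd.read_csv(f"{root}/Data/LinkTables/Subject.csv")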

Data/MainTable.csv: Contains all programming process event data. Each row of the table represents one event: e.g., a program submission, a compilation result, or a compilation error. Run.Program events are initiated by a student submitting a solution to a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:


Primary Attributes:

  • SubjectID: A unique ID for each participant in the study.

  • AssignmentID: An ID for the assignment a student is working on. One assignment may include multiple problems.

  • ProblemID: An ID for the problem being worked on by the student.

  • CodeStateID: An ID for the student's code at the time of the event. This corresponds to one row in the Data/CodeStates/CodeState.csv file, which contains the text of the code.

  • EventType:

    • Run.Program indicates that the student submitted the program and ran it against the problem's test cases for feedback. The percentage of test cases passed is given in the Score column.

    • Compile indicates that the program was compiled; whether compilation succeeded or failed is shown in the Compile.Result column.

    • Compile.Error indicates that compilation failed; the error messages are available in the CompileMessageType and CompileMessageData columns.

  • Score: The score the student received on a Run.Program event: the fraction of the problem's unit tests passed (1 = fully correct).

  • ServerTimestamp: The time of the event.

  • ServerTimezone: The timezone offset of ServerTimestamp relative to UTC. Here it is always 0 (i.e., UTC).


Additional Attributes -- these columns may also be useful:

  • Order: An integer indicating the chronological order of this event, relative to each other event in the table. This is necessary to order events that occurred at the same timestamp.

    • Note: For the CodeWorkout dataset, Order is only defined within a given SubjectID and ProblemID. In other words, you can compare events from the same SubjectID and ProblemID, but comparing Order across them is meaningless; use ServerTimestamp instead to compare the global order of events (see the sketch after this list).

  • ToolInstances: The tutoring system and programming language being used. In this case, it is always CodeWorkout [13] and Java 8.

  • CourseID: The ID of the course the student participated in; here all data come from the same CS1 course. One course is composed of multiple course sections.

  • CourseSectionID: An ID for the course section a student is enrolled in. Within one section, students are assigned multiple assignments.

  • Attempt: The chronological attempt order of the student on a problem.

  • IsEventOrderingConsistent: Set to TRUE when the event ordering is consistent.

  • EventID: The unique ID of the event.

  • ParentEventID: If this event is a sub-event (e.g., a compilation result caused by running a submission), this field gives the EventID of the parent event.

  • SourceLocation: Indicates the location in the source code associated with the event's result (e.g., the line that caused a compiler error).

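As an illustration of the ordering caveat on Order, here is a minimal sketch (assuming pandas) of reconstructing each student's submission sequence on each problem:

    import pandas as pd

    main = pd.read_csv("Root/Data/MainTable.csv")  # "Root" is a placeholder path

    # Score is attached to submission events.
    runs = main[main["EventType"] == "Run.Program"]

    # Order is only comparable within one (SubjectID, ProblemID) pair,
    # so sort by it inside each group; use ServerTimestamp for global order.
    runs = runs.sort_values(["SubjectID", "ProblemID", "Order"])

    # Example: each student's score trajectory on each problem.
    trajectories = runs.groupby(["SubjectID", "ProblemID"])["Score"].apply(list)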

Data/DatasetMetadata.csv: Contains metadata about this dataset, as defined in the ProgSnap2 spec.


Data/CodeStates/CodeState.csv: This file contains a mapping from each CodeStateID in the MainTable to the source code it represents.

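For example, a minimal sketch (assuming pandas) of attaching the code text to each event:

    import pandas as pd

    main = pd.read_csv("Root/Data/MainTable.csv")  # "Root" is a placeholder path
    code_states = pd.read_csv("Root/Data/CodeStates/CodeState.csv")

    # Each event's CodeStateID links it to one row of code text.
    events_with_code = main.merge(code_states, on="CodeStateID", how="left")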

Note: The following files are used for prediction in Tracks 1 and 2. Their descriptions will make more sense after reading about these tracks.


early.csv: This table contains one row for each combination of SubjectID and ProblemID for the first 30 problems (the first 3 assignments). Each row represents one student's combined attempts at one problem. These problems constitute the "early data" available to the model for the early prediction tasks in Tracks 1 and 2. Each row has the following attributes:

  • SubjectID: The unique ID of the student attempting the problem.

  • AssignmentID / ProblemID: The IDs of the assignment and problem being attempted.

  • Attempts: The number of attempts the student made on the problem before either getting it right for the first time, or giving up without getting it right.

  • CorrectEventually: This will be TRUE if the student eventually got the problem fully correct (Score = 1), and FALSE if they never submitted a correct solution.

    • Note: Attempts and CorrectEventually are provided for convenience and transparency, but these values can be calculated from MainTable.csv (see the sketch after this list).

  • Label: Whether the student was successful (TRUE) or struggled (FALSE) on this problem, as defined in Track 1.

    • Note: This value is given in both training and test datasets for the early problems, but it must be predicted for the late problems in the test dataset.

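As noted above, Attempts and CorrectEventually can be recomputed from the trace data. A minimal sketch of one such calculation (assuming pandas; the exact counting rule here, attempts up to and including the first fully correct submission, is an assumption):

    import pandas as pd

    main = pd.read_csv("Root/Data/MainTable.csv")  # "Root" is a placeholder path
    runs = main[main["EventType"] == "Run.Program"].sort_values(
        ["SubjectID", "ProblemID", "Order"]
    )

    def summarize(group):
        correct = group["Score"].eq(1.0).to_numpy()
        if correct.any():
            # Count attempts up to and including the first fully correct one.
            return pd.Series(
                {"Attempts": int(correct.argmax()) + 1, "CorrectEventually": True}
            )
        return pd.Series({"Attempts": len(group), "CorrectEventually": False})

    summary = (
        runs.groupby(["SubjectID", "ProblemID"])
        .apply(summarize)
        .reset_index()
    )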

late.csv: This table contains one row for each combination of SubjectID and ProblemID for the final 20 problems (the last 2 assignments). Each row represents one student's combined attempts at one problem. These problems represent the "later events," where we are trying to predict student outcomes (in Track 1). Each row has the same ID fields as early.csv, as well as a possible Label column.

  • Note: In training datasets, the Label value is given, but in test datasets, this value must be predicted (in Track 1).


Data/LinkTables/Subject.csv: This file contains a mapping from each SubjectID in the MainTable to the student's final grade in the class (X-Grade).

  • Note: In training datasets, the X-Grade value is given, but in test datasets, this value must be predicted (in Track 2).

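For Track 2, a minimal sketch (assuming pandas) of joining a per-student feature to the X-Grade labels; the SolveRate feature is hypothetical:

    import pandas as pd

    subjects = pd.read_csv("Root/Data/LinkTables/Subject.csv")  # placeholder path
    early = pd.read_csv("Root/early.csv")

    # Hypothetical feature: fraction of early problems solved eventually.
    features = (
        early.groupby("SubjectID")["CorrectEventually"]
        .mean()
        .rename("SolveRate")
        .reset_index()
    )

    # In training data, this attaches the X-Grade target to each student.
    train = features.merge(subjects, on="SubjectID", how="inner")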

Problem Prompts and Programming Concepts Used


The CodeWorkout dataset contains 50 problems grouped into 5 assignments. Students could attempt the problems within a given assignment in any order. The prompts for each problem can be found in this spreadsheet. The spreadsheet also contains estimates of which programming concepts each problem uses; however, these are just one researcher's judgment and should not be taken as ground truth.


Differences Between Semesters

The Spring and Fall 2019 semesters were similar; however, there were a few differences:

  • In F19 there was an additional assignment (between Assignments 4 and 5), which only ~70% of students completed, likely because it was optional extra practice. We will not use this assignment for prediction, since it is anomalous and absent from S19. Because it comes in between the two assignments we are using for prediction, we have simply ignored it.

  • In F19 some AssignmentIDs and ProblemIDs were renamed, so we have updated their IDs to match S19. This should therefore not affect prediction.


These differences are representative of real changes that happen from semester to semester, making cross-semester prediction an especially important task.