CSEDM Data Challenge

The 2nd CSEDM Data Challenge

Part of the Educational Data Mining in Computer Science Education (CSEDM) Workshop

The 2nd CSEDM Data Challenge has been launched at the 5th CSEDM Workshop. The goal of the challenge is to develop new modeling techniques to predict students' learning outcomes in CS/programming classrooms, based on their submissions to and performance on past programming problems. We hope to bring together researchers at the intersection of AI/ML, Educational Data Mining (EDM), Learning Analytics and Computing Education to develop advances that can directly impact CS classrooms. This year's challenge will feature a larger dataset and multiple submission tracks.

This challenge builds on the 1st Data Challenge, where participants competed to create the best student model to predict programming performance. We had four entries and one winner, which were presented at the 2nd CSEDM Workshop.

Update (6/1/22): Join us for the 6th CSEDM Workshop to learn more about the winning entries!

Update (5/24/22): Winners for Phase 2 have been announced! This concludes the 2nd CSEDM Data Challenge. Thanks to all the participants.

Update (3/9/22): Winners for Phase 1 have been announced, and Phase 2 has begun!

Update (8/23/21): The challenge now has an increased $1,500 prize pool, as explained here.

Update (8/19/21): The challenge now has a Piazza forum for asking questions (join code: csedmdc2021), finding teammates, and getting help.

The CSEDM Data Challenge is held in cooperation with:

Quick Start and Resources

If you want to skip the documentation and go straight to working with the data, you can check out these code samples. To run them, you will need to first download the datasets below and put them in a directory structure as indicated in the README.

Demonstration Video: At the 5th CSEDM Workshop, we presented an introduction to the CSEDM DC that walks through all of the following. You can watch the video here, and skip to time 57:30 (you have to skip manually).

Downloading the data:

All data is available from the PSLC Datashop. If you do not already have an account, create one here: https://pslcdatashop.web.cmu.edu/login?Submit=Log+in
Next go to this webpage: https://pslcdatashop.web.cmu.edu/Files?datasetId=3458
Then log in to the dataset, if you are not already logged in.
Finally, download the file(s) you are interested in, as described below.

Dataset Files:

Spring 2019 Training and Test Data (S19 Train + Test v1.0)
- Training data needed for the Practice and Cross-Semester Phases (optional for the Within-Semester Phase).
- Test data needed to extract features for prediction in the Practice Phase.
Fall 2019 Test Data (F19 Test Only v1.0)
- Test data needed to extract features for prediction in the Cross-Semester and Within-Semester Phases.
Fall 2019 Training Data (not available until the Within-Semester Phase begins)
- Training data needed for the Within-Semester Phase.
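The file requirements above can be summarized as a small lookup table. This is only a sketch: the labels "S19" and "F19" are shorthand for the Spring 2019 and Fall 2019 files listed above, not real file names, and the helper name required_downloads is hypothetical.

```python
# Which semester's data each phase trains and tests on, per the list above.
# "S19" / "F19" are shorthand labels, not actual file names on DataShop.
PHASE_DATA = {
    "practice": {"train": ["S19"], "optional_train": [], "test": "S19"},
    "cross-semester": {"train": ["S19"], "optional_train": [], "test": "F19"},
    "within-semester": {"train": ["F19"], "optional_train": ["S19"], "test": "F19"},
}


def required_downloads(phase):
    """Return the dataset labels that must be downloaded for a phase."""
    spec = PHASE_DATA[phase]
    return set(spec["train"]) | {spec["test"]}
```

For example, required_downloads("cross-semester") contains both "S19" and "F19", because that phase trains on Spring 2019 data and extracts test features from Fall 2019 data.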

Submit your results on CodaLab, an open platform for data science competitions:

Track 1 submission
- Sample Submission for Track 1 (see Format on the Dataset page).
Track 2 submission
- Sample Submission for Track 2 (see Format on the Dataset page).

Before you get started, you may want to read up on:

The way the train/test split was made, and the difference between early and late problems.
The two Tracks of competition (you can submit to either or both)
- Track 1: Try to predict late problem performance based on early problem performance.
- Track 2: Try to predict final exam grades based on early problem performance.
The two Phases of the competition:
- Cross-Semester: Predict Fall 2019 outcomes using Spring 2019 data.
- Within-Semester: Predict Fall 2019 outcomes using Fall 2019 data.
The general format of the data (ProgSnap2)
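To make the Track 1, Cross-Semester setup concrete, here is a minimal sketch of a naive baseline. It is not the organizers' method or the provided sample code. The record layout (student, problem, solved as 0 or 1), the toy data, and the function names are all assumptions for illustration; the real data is in ProgSnap2 format and would need to be converted first.

```python
from collections import defaultdict


def success_rates(rows, key_index):
    """Mean of the 0/1 outcome (field 2), grouped by the field at key_index."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += row[2]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def predict_late(train_late, test_early, late_problems):
    """Naive baseline: average each late problem's success rate in the
    training semester with the test student's own early success rate."""
    problem_rate = success_rates(train_late, 1)   # grouped by problem
    student_rate = success_rates(test_early, 0)   # grouped by student
    return {
        (student, problem): 0.5 * (student_rate[student] + problem_rate[problem])
        for student in student_rate
        for problem in late_problems
    }


# Toy records: (student, problem, solved). All values are made up.
spring_late = [("a", "late1", 1), ("b", "late1", 0),
               ("a", "late2", 0), ("b", "late2", 0)]
fall_early = [("x", "early1", 1), ("x", "early2", 1),
              ("y", "early1", 0), ("y", "early2", 1)]

predictions = predict_late(spring_late, fall_early, ["late1", "late2"])
```

The training semester contributes only problem-level difficulty and the test semester contributes only each student's early performance, which mirrors the Cross-Semester constraint that no Fall 2019 late-problem labels are available for training.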