My research stands at the intersection of multiple disciplines, spanning computer science education (CSEd), artificial intelligence (AI), and human-computer interaction (HCI). With the vision of leveraging data analytics to advance computing education, my research focuses on the question: What is the role of big data in computing education, and how can we develop AI-empowered technologies to build a more intelligent, effective, and accessible learning environment?

Code-Informed Learning Analytics

Modeling students’ progress in learning involves many perspectives in computer science education. In this project, we take advantage of collected student code to examine the status of student learning, building deep neural networks to model students’ knowledge and predict their performance.

Collaborators: NCSU, CMU.

Key Contributions:

  1. We designed Code-DKT, a model that leverages tree-structural information from student code for future performance prediction, improving deep knowledge tracing by 3.07%–4.00% in AUC score on the CodeWorkout dataset. Code-DKT also shows how code features affect the performance prediction task [EDM’22].

  2. While we can predict performance, it remains unclear which knowledge components (or skills) students have or have not actually learned. We proposed a data-driven model that automatically discovers interpretable candidate knowledge components that fit the power law of practice in programming exercises [Submitted to LAK’23].

  3. Detecting sub-tasks in a programming exercise can show students’ completion rates, and unfinished sub-tasks reveal their knowledge gaps. Our study shows that instead of having experts extensively author sub-task detection rules, expert-AI collaboration can achieve high detection accuracy while keeping human effort low [EDM’21].

  4. We also developed a data-driven model that combines LSTM and ASTNN models for early prediction of student performance, showing that code information can be used for early detection of potential failure or success [EDM’21].
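The power law of practice mentioned in contribution 2 states that a student’s error rate on a skill falls as a power function of the number of practice opportunities, E(t) = A * t^(-b). A minimal sketch of such a fit on synthetic data, using an ordinary log-log least-squares fit (an illustrative stand-in, not this project’s actual pipeline):

```python
import numpy as np

# Synthetic error rates following a power law E(t) = A * t^(-b), with noise.
rng = np.random.default_rng(0)
A_true, b_true = 0.8, 0.5
t = np.arange(1, 21)  # practice opportunities 1..20
errors = A_true * t ** (-b_true) * np.exp(rng.normal(0, 0.02, t.size))

# A power law is linear in log-log space: log E = log A - b * log t,
# so ordinary least squares on the logs recovers both parameters.
slope, intercept = np.polyfit(np.log(t), np.log(errors), 1)
b_hat, A_hat = -slope, np.exp(intercept)

print(f"A ~ {A_hat:.2f}, b ~ {b_hat:.2f}")  # close to A=0.8, b=0.5
```

Because a power law is linear in log-log space, the slope of the fitted line directly estimates the learning rate b; a candidate knowledge component whose error curve fits such a line well is consistent with the power law of practice.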

Future Directions:

There are many directions in this project to explore further; here are three major directions I will work on in the near future.

  1. Create interventions based on the discovered skills and concepts, and deliver them to students projected to have a low probability of mastery.

  2. Systematically evaluate how such models would help students’ actual learning in a classroom or a MOOC setting.

  3. Discovering knowledge components requires certain assumptions. While the discovered components fit the characteristics of learning and practice (they follow the power law of practice), it is still unclear whether, and to what extent, they are consistent across different problems. Further evaluation of discovered knowledge components is needed.

Student Bug Detection with Limited Labels

Detecting and pinpointing bugs in students’ code submissions can help them learn more efficiently. However, gathering data in which students’ misconceptions have been graded is labor-intensive. To address this challenge, we use structural information from students’ code to detect bugs with semi-supervised and unsupervised learning.
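One common way to use unlabeled data in this setting is self-training (pseudo-labeling): train on the few labeled submissions, confidently label some unlabeled ones, and retrain. A minimal sketch with a nearest-centroid classifier over synthetic vectors that stand in for code embeddings (the actual work uses Code2vec and ASTNN models, not this toy classifier):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for code embeddings: "correct" (label 0) vs "buggy" (label 1)
# submissions, of which only 10 out of 200 carry labels.
X0 = rng.normal(-1.0, 0.6, (100, 8))
X1 = rng.normal(+1.0, 0.6, (100, 8))
X = np.vstack([X0, X1])
y_true = np.array([0] * 100 + [1] * 100)

labeled = np.zeros(200, dtype=bool)
labeled[rng.choice(100, 5, replace=False)] = True        # 5 labeled class-0
labeled[100 + rng.choice(100, 5, replace=False)] = True  # 5 labeled class-1
y = np.where(labeled, y_true, -1)  # -1 marks unlabeled

def centroids(X, y):
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

# Self-training: repeatedly pseudo-label the unlabeled points that are
# confidently closer to one class centroid, then recompute the centroids
# from the enlarged label set.
for _ in range(10):
    c = centroids(X, y)
    d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)  # (200, 2)
    pred, margin = d.argmin(axis=1), np.abs(d[:, 0] - d[:, 1])
    confident = (y == -1) & (margin > 1.0)
    if not confident.any():
        break
    y[confident] = pred[confident]

acc = (np.where(y == -1, pred, y) == y_true).mean()
print(f"accuracy with 10 labels + self-training: {acc:.2f}")
```

The margin threshold controls how aggressively pseudo-labels are added; in practice it trades off noise in the expanded training set against coverage of the unlabeled pool.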

Collaborators: NCSU.

Key Contributions:

  1. We compared the performance of two recent code-analysis deep neural networks (Code2vec and ASTNN) for student bug detection with a limited number of available labels, using semi-supervised learning. Our results show that the models improve detection performance given large amounts of unlabeled data, and that they are more accurate than classic data-driven models [EDM’21].

  2. We can also discover possible bugs in student code without labeled data by fitting a deep learning model and clustering its middle-layer embeddings. Experts then interpret the clusters as possible student bugs, and the labels are propagated to batches of student submissions [LAK’21].
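The embed-cluster-interpret pipeline in contribution 2 can be sketched with plain k-means over embedding vectors; the synthetic vectors and the choice of k = 2 below are illustrative stand-ins for the model’s middle-layer embeddings and the unknown number of bug patterns:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for middle-layer embeddings of student submissions;
# the two latent groups might correspond to two recurring bugs.
X = np.vstack([rng.normal(-2, 0.5, (60, 4)), rng.normal(2, 0.5, (60, 4))])

# Minimal k-means (k=2): assign each point to its nearest centroid,
# recompute centroids, repeat until stable.
C = X[[0, -1]]  # deterministic initialization: one seed point per group
for _ in range(20):
    d = np.linalg.norm(X[:, None] - C[None, :], axis=2)  # (120, 2)
    assign = d.argmin(axis=1)
    C_new = np.array([X[assign == k].mean(axis=0) for k in range(2)])
    if np.allclose(C, C_new):
        break
    C = C_new

# Exemplars: the submission closest to each centroid; these are what an
# expert would read and interpret as a candidate bug pattern.
exemplars = [int(np.argmin(np.linalg.norm(X - C[k], axis=1))) for k in range(2)]
print("cluster sizes:", np.bincount(assign), "exemplar indices:", exemplars)
```

In practice, an expert inspects the exemplar submissions for each cluster and decides whether the cluster corresponds to a recurring bug; that interpretation can then be propagated to the remaining submissions in the cluster.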

Future Directions:

  1. More large-scale replication studies would help develop a spectrum of student bug libraries and test the robustness of the method.

  2. Build tools around the methods and create a platform that interfaces with student and teacher users.

  3. Bugs and misconceptions are different: while our current methods can detect bugs, how to efficiently detect misconceptions remains to be explored.

Data Science for Data Science Education (DS for DSE)

Students’ code in data science courses differs from code in other programming courses. On one hand, students are still likely to have issues with their code; on the other hand, they may also hold data science misconceptions. In this project, we use data science methods to detect student misconceptions in assignment code and to track students’ learning status for better student modeling.

Key Contributions:

  1. We collected students’ programming data from an open-ended data science project and manually coded common errors in their code. Our findings show that students have multiple kinds of issues in their code: not only programming errors, but also data science errors [SIGCSE’22].

Collaborators: NCSU, NTU/NIE (Singapore).