CS 329M: Machine Programming (2024)
Instructor: Justin Gottschlich (goju@stanford.edu, goju@merly.ai)
Course Assistant: Ethan Hellman (hellman1@stanford.edu)
Stanford Course: CS 329M (3-4 credits)
Dates: September 23 - December 11 (2024)
When: Monday / Wednesday, 11:30am-1:20pm
Building: History Corner (Building 200), room 200-002
Office Hours: Monday / Wednesday 1:30pm-2:20pm (Building 240-202)
Stanford CS 329M Lectures (September - December 2024):
[Week 01] Sept. 23, 25: Lectures 1 & 2
[Week 02] Sept. 30, Oct. 2: Lectures 3 & 4
[Week 03] Oct. 7, 9: Lectures 5 & 6
[Week 04] Oct. 14, 16: Lectures 7 & 8
[Week 05] Oct. 21, 23: Lectures 9 & 10
[Week 06] Oct. 28, 30: Lectures 11 & 12
[Week 07] Nov. 4, 6: Lectures 13 & 14
[Week 08] Nov. 11, 13: Lectures 15 & 16
[Week 09] Nov. 18, 20: Lectures 17 & 18
[Week 10] Nov. 25, 27: Autumn Break (No classes)
[Week 11] Dec. 2, 4: Invited Guest Lectures, Student Presentations
Course Description
The field of machine programming (MP) is concerned with the automation of software development. Given the recent advances in software algorithms, hardware efficiency and capacity, and the ever-increasing availability of code data, it is now possible to train machines to help develop software. In this course, we teach students how to build real-world MP systems. We begin with a high-level overview of the field, including an abbreviated analysis of the state of the art (e.g., Merly Mentor). Next, we discuss the foundations of MP and the key areas for innovation, some of which are unique to MP. We close with a discussion of current limitations and future directions of MP. This course includes a nine-week hands-on project, where students (as individuals or in small groups) will create their own MP system and demonstrate it to the class.
While teaching machines to perform programming-specific tasks overlaps with traditional techniques for training machines on non-programming tasks (e.g., natural language processing, computer vision), it is unique in at least two dimensions. First, certain techniques are more (or less) effective for MP, such as using self-supervision to learn from the large corpora of unlabeled open-source code. Second, software reasoning is fundamentally multi-dimensional; that is, there exist multiple unique ways to learn from software (e.g., static analysis, dynamic analysis, input/output specifications, program state reinforced-convergence, hardware telemetric data, etc.). In this course, we discuss each of these techniques (and others) and how they can be effectively applied to MP systems.
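To make the multi-dimensional point concrete, below is a minimal Python sketch (an illustration only, not course material) that extracts two such views from one toy function: a static view (a bag of AST node types) and a dynamic view (observed input/output pairs). The function absdiff and the feature choices are assumptions made for this example; a real MP system would learn from far richer versions of these views.

import ast
from collections import Counter

SOURCE = """
def absdiff(a, b):
    return a - b if a > b else b - a
"""

def static_features(source):
    # Static view: count AST node types, a crude structural representation.
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def dynamic_features(source, inputs):
    # Dynamic view: input/output pairs observed by actually running the code.
    namespace = {}
    exec(source, namespace)  # illustration only; never exec untrusted code
    fn = namespace["absdiff"]
    return [((a, b), fn(a, b)) for a, b in inputs]

print(static_features(SOURCE))
print(dynamic_features(SOURCE, [(3, 5), (7, 2)]))

Feeding both views to a learner is one simple way to combine dimensions; systems such as MISIM and Neural Code Comprehension (see the reading list below) build considerably more sophisticated representations.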
This course is primarily intended for Stanford MS and PhD graduate students. However, advanced (senior-level) and committed undergraduates have successfully completed this course. Hard-working and disciplined students have historically done well in CS 329M even if they are relatively new to ML, PL, SE, and systems.
Prerequisites
The following are the prerequisites for this course.
Required:
Machine learning (ML): some ML background, such as a basic understanding of neural networks, advanced statistics, Bayesian networks, decision tree learning, k-means clustering, isolation forests, etc. CS 329M is ML agnostic (i.e., students can use whatever form of ML is most effective for their projects).
Programming languages (PLs): some PL background; lecture material usually uses C/C++ examples, but projects and assignments can be done in almost any language (e.g., Python, JavaScript, etc.). CS 329M is PL agnostic (i.e., students can use whatever PL is most interesting to them for their assignments and projects).
Software engineering (SE): some coursework or experience in software engineering; this differs from PL knowledge and focuses more on building real software and its intricacies (e.g., CI/CD pipelines, source control systems, software testing, debugging, etc.).
Systems: some basic understanding of computing, data, and communication systems (e.g., CPU, GPU, internet communication, local and cloud storage concepts, etc.).
Recommended:
Python programming experience; experience with TensorFlow or PyTorch can also be helpful.
Prompt engineering experience, including how to control the stochasticity of language models (small, mid, and large), and a basic understanding of transformer-based neural networks (see the sketch after this list).
Basic understanding of PL abstractions, static and dynamic program analysis, compilers, and general machine learning techniques is helpful, but not required.
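As a concrete illustration of the prompt-engineering point above, the following minimal Python sketch (an assumed example, not course material) shows temperature scaling, one standard knob for controlling the stochasticity of a language model's next-token sampling. The logits are made-up toy values over a 4-token vocabulary.

import math, random

def sample_with_temperature(logits, temperature=1.0):
    # Temperature near 0 approaches greedy (deterministic) decoding;
    # temperature above 1 flattens the distribution (more random output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not from a real model
print(sample_with_temperature(logits, temperature=0.2))  # near-greedy
print(sample_with_temperature(logits, temperature=1.5))  # more diverse

Lower temperatures concentrate probability mass on the highest-scoring token; higher temperatures spread it out, trading determinism for diversity.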
Students do not need an extensive background in machine learning, data systems, programming languages, software engineering, or compilers. The necessary aspects of these fields will be covered in the course as needed. However, students with a background in these topics will likely find it easier to understand the intuition behind some of the more advanced MP topics in the course (e.g., building semantics reasoners, program synthesis for MP data generation).
Tentative Course Syllabus
Tentative Grading
40% exams (20% mid-term, 20% final)
10% assignments (3 written assignments; 1 oral in-class assignment)
50% project (5% proposal, 5% project checkpoint, 30% project report, 10% final presentation)
+5% attendance (+0.28% per attended lecture; 18 lectures × ~0.28% ≈ 5%)
Examples of Outstanding Student Assignment Reports (2023, Autumn Quarter):
Assignment #1 from Jamil Dhanani (strong scientific merit, excellent exposition, formatting, style)
Assignment #1 from Martin Juan Jose Bucher (strong scientific merit, tier-1 conference-level formatting and style)
Reading List
Week 1
Three Pillars of Machine Programming (Gottschlich et al.)
The Case for Learned Index Structures (Kraska et al.)
AI Programmer: Autonomously Creating Software Programs Using Genetic Algorithms (Becker and Gottschlich, GECCO 2021)
Week 2
Aroma: Code Recommendation via Structured Code Search (Luan et al.)
Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines (Ragan-Kelley et al.)
code2vec: Learning Distributed Representations of Code (Alon et al.)
Week 3
MP-CodeCheck: Evolving Logical Expression Code Anomaly Learning with Iterative Self-Supervision (Muff et al., 2022)
MISIM: A Neural Code Semantics Similarity System Using the Context-Aware Semantics Structure (Ye et al.)
Neural Code Comprehension: A Learnable Representation of Code Semantics (Ben-Nun et al., NeurIPS 2018)
Week 4
Automating String Processing in Spreadsheets Using Input-Output Examples (Gulwani)
Learning to Represent Programs with Graphs (Allamanis et al., ICLR 2018)
Learning to Represent Programs with Property Signatures (Odena and Sutton, ICLR 2020)
Week 5
A Zero-Positive Learning Approach For Diagnosing Software Performance Regressions (Alam et al., NeurIPS 2018, video)
Neo: A Learned Query Optimizer (Marcus et al.)
Self-supervised Bug Detection and Repair (Allamanis et al., NeurIPS 2021, video)
Week 6
Evaluating Large Language Models Trained on Code (Chen et al.)
Bao: Making Learned Query Optimization Practical (Marcus et al.)
Verified Lifting for Stencil Computations (Kamil et al.)
Week 7
Program Synthesis for Scientific Computing (Finkel et al., US Department of Energy 2021)
Learned Garbage Collection (Cen et al., MAPL 2020)
Week 8
Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs (Dinella et al.)
(Dexter) Automatically Translating Image Processing Libraries to Halide (Ahmad et al., SIGGRAPH '19)
Week 9
Software Language Comprehension using a Program-Derived Semantics Graph (Iyer et al., NeurIPS CAP, 2020)
Learning Fitness Functions for Machine Programming (Mandal et al., MLSys '21)
Week 10
A Survey on Semantic Parsing for Machine Programming (Lee et al., KDD PLL 2021)
2023 Lectures:
Stanford CS 329M Lectures (September - December 2023):
[Week 05] October 22-28: Lecture 9, <Student Presentations>
[Week 06] October 29-Nov 4: Lecture 10, Lecture 11
[Week 07] November 5-11: <Stanford Democracy Day; no class>, Lecture 12
[Week 08] November 12-18: Lecture 13, Lecture 14
[Week 09] November 19-25: <Autumn break; no class>
[Week 10] November 26-December 2: Lecture 15, Lecture 16
[Week 11] December 3-9: Lecture 17, <Student Presentations>
[Week 12] December 10-16: Finals Week (Take-home Final Exam, due midnight on Dec. 14)
2022 Lectures:
Part 1: Lecture 1. Lecture 2. Lecture 3. Lecture 4. Lecture 5. Lecture 6. Lecture 7. Lecture 8. Lecture 9.
Part 2: Lecture 10. Lecture 11. Lecture 12. Lecture 13. Lecture 14 (guest lecture). Lecture 15.
Part 3: Lecture 16 (student presentations). Lecture 17 (student presentations).