"Introduction to Machine Programming"
Instructor: Justin Gottschlich
Stanford Course: CS 329M (3 credits)
Dates: Autumn 2022 (September 26 - December 16, 2022)
When & where: 1:30-2:50PM (PT), Tuesdays & Thursdays, building/room 540-108
Course Description
The field of machine programming (MP) is concerned with the automation of software development. Given recent advances in algorithms, hardware efficiency and capacity, and the ever-increasing availability of code data, it is now possible to train machines to help develop software. In this course, we teach students how to build real-world MP systems. We begin with a high-level overview of the field, including an abbreviated analysis of state-of-the-art MP systems (e.g., DeepMind's AlphaCode, GitHub's Copilot, Merly's Mentor). Next, we discuss the foundations of MP and the key areas for innovation, some of which are unique to MP. We close with a discussion of current limitations and future possibilities of MP. This course includes a nine-week hands-on project, in which students (individually or in small groups) create their own MP system and demonstrate it to the class.
While teaching machines to perform programming-specific tasks overlaps with traditional techniques for training machines on non-programming tasks (e.g., natural language processing, computer vision), it is unique in at least two dimensions. First, certain techniques are more (or less) effective for MP, such as using self-supervision to learn from the large corpora of unlabeled open-source code. Second, software reasoning is fundamentally multi-dimensional; that is, there exist multiple distinct ways to learn from software (e.g., static analysis, dynamic analysis, input/output specifications, program state reinforced-convergence, hardware telemetry data, etc.). In this course, we discuss each of these techniques (and others) and how they can be effectively applied to MP systems.
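To make the "multi-dimensional" point concrete, here is a minimal Python sketch (not part of the course materials) contrasting two of the reasoning dimensions named above: static analysis, which inspects a program's syntax without running it, and dynamic analysis, which executes the program and observes its input/output behavior. The function `absdiff` is a hypothetical example chosen for illustration.

```python
import ast

# A tiny program we want to reason about, stored as source text.
src = """
def absdiff(a, b):
    if a > b:
        return a - b
    return b - a
"""

# Static analysis: inspect the syntax tree without executing the code.
tree = ast.parse(src)
num_branches = sum(isinstance(node, ast.If) for node in ast.walk(tree))
print("branches found statically:", num_branches)  # 1

# Dynamic analysis: execute the code and observe input/output behavior.
namespace = {}
exec(src, namespace)
print("absdiff(3, 7) =", namespace["absdiff"](3, 7))  # 4
```

An MP system can learn from either view (or both): the syntax tree gives structural features of the code, while execution gives behavioral evidence that purely static inspection cannot.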
Prerequisites
This course is designed for advanced undergraduates, master's, and PhD graduate students. The following are the prerequisites for this course.
Required:
(i) Deep learning or (ii) linear algebra and mathematical maturity.
Experience with the C and C++ programming languages (PLs).
Software engineering coursework or experience.
Recommended:
Python programming experience, including experience with TensorFlow or PyTorch.
Basic understanding of PL abstractions, static and dynamic program analysis, compilers, and general machine learning techniques.
Students do not need an extensive background in machine learning, data systems, programming languages, software engineering, or compilers; the relevant aspects of these fields will be covered in the course as needed. However, students with a background in these topics will likely find it easier to grasp the intuition behind some of the more advanced MP topics in the course (e.g., building semantic reasoners, program synthesis for MP data generation).
Tentative Course Syllabus
The course lectures cover the following major segments:
Overview: two lectures: lecture 1 focuses on the foundations of MP and some background necessary to understand the field; lecture 2 briefly covers current state-of-the-art MP systems.
Learning: two lectures: lecture 1 presents the fundamentals of machine learning and classical learning techniques (e.g., supervised learning with neural networks); lecture 2 covers emerging techniques necessary for MP-specific learning (e.g., self-supervised static analysis, dynamic program analysis and data generation, etc.).
Three Pillars of MP: three lectures spread evenly across the three pillars of MP: intention, invention, and adaptation; each lecture also includes some background material to bring students up to speed on domain-specific details in the areas of programming languages, code abstractions and representations, software engineering principles, and compilers, amongst others.
Deep Data: two lectures about the various ways to reason about, generate, and utilize data for MP systems: lecture 1 discusses classical ML-based data utilization (e.g., training, validation, and testing); lecture 2 covers emerging ways to automatically synthesize, harness, and label data for MP systems (e.g., program synthesis for data generation, dynamic execution information for analysis, automated semantics labeling, etc.).
Semantic Reasoning: two lectures on the foundations, construction, and utilization of semantic reasoning systems in MP: lecture 1 covers some of the basics found in other CS courses that are necessary for advanced reasoning about syntax and semantics; lecture 2 covers how to construct semantic representations and use them for downstream MP tasks.
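The "program synthesis for data generation" idea mentioned under Deep Data can be sketched in a few lines of Python. This is an illustrative toy (not course code): we synthesize tiny random arithmetic programs and label each one by executing it, yielding (source, output) training pairs with no human annotation. The helper names are hypothetical.

```python
import random

random.seed(0)  # deterministic for reproducibility

OPS = ["+", "-", "*"]

def synthesize_program():
    """Generate a tiny straight-line arithmetic program as source text."""
    a, b = random.randint(0, 9), random.randint(0, 9)
    op = random.choice(OPS)
    return f"result = {a} {op} {b}"

def label_by_execution(src):
    """Produce a ground-truth label by actually running the program."""
    namespace = {}
    exec(src, namespace)
    return namespace["result"]

# Each (source, output) pair is a labeled training example obtained
# automatically, without any human annotation.
dataset = [(src, label_by_execution(src))
           for src in (synthesize_program() for _ in range(5))]

for src, out in dataset:
    print(f"{src!r} -> {out}")
```

Real MP systems scale this idea up to richer program spaces, but the principle is the same: because programs can be executed, their behavior supplies labels for free.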
Tentative Course Grading
25% exams (10% mid-term, 15% final)
10% homework (2 homeworks, 5% each)
65% project (10% proposal, 5% project checkpoint, 35% project report, 15% final presentation)
+4.25% attendance (+0.25% per attended lecture)
+2.5% extra credit on exams (1.0% on mid-term, 1.5% on final exam)
+2.5% extra credit assignment (hard!)
Lectures:
Part 1: Lecture 1. Lecture 2. Lecture 3. Lecture 4. Lecture 5. Lecture 6. Lecture 7. Lecture 8. Lecture 9.
Part 2: Lecture 10. Lecture 11. Lecture 12. Lecture 13. Lecture 14 (guest lecture). Lecture 15.
Part 3: Lecture 16 (student presentations). Lecture 17 (student presentations).