Optimization has become the engine for training modern artificial intelligence systems, including large language models, reinforcement learning agents, and generative models. This course provides an in-depth treatment of classical and modern optimization methods relevant to statistics, machine learning, and data science. We start with the role of optimization in the process of learning from data and then cover the basic theory of convex optimization. Building on this foundation, we present and analyze (stochastic) gradient-based methods, modern nonconvex optimization, and the optimizers used in deep learning, reinforcement learning, and language models, as well as applications to statistical and data science problems. The emphasis is on modeling intuition, algorithm design principles, theoretical performance bounds, and the mathematical tools for analysis. The objective is to give students a deep understanding of how and why optimization algorithms work, along with their limitations. This knowledge will help students model problems from an optimization perspective, choose appropriate algorithms for specific applications, and obtain preliminary guidance on practical implementation.
Lecturer: Prof. Yifan Hu
Course Number: 16:960:690:01 and 16:954:694:02
Teaching assistant: to be assigned.
Classes: Thu 12:10 pm - 3:10 pm
Location: PH-111
Prerequisites: A solid background in analysis and linear algebra; some background in theoretical computer science (computational complexity, analysis of algorithms); the ability to understand mathematical proofs.
Tentative Schedule
Week 1 Jan 22nd, Introduction to Optimization
What is optimization
Motivating examples from statistics, data science, and machine learning
Constrained and unconstrained optimization
Week 2 Jan 29th, Theory of Convexity
Convex sets and functions
Strong convexity and smoothness
Geometric interpretations
Week 3 Feb 5th, Stochastic Optimization
Sample average approximation/empirical risk minimization
Stability and generalization
Week 4 Feb 12th, Stochastic Gradient Descent
Convergence analysis and sample complexity
Algorithmic stability of SGD
Implementation tricks
Week 5 Feb 19th, (Stochastic) Coordinate Descent and Frank-Wolfe Algorithms
Week 6 Feb 26th, Stochastic Mirror Descent and Geometry-Based Algorithms
Nonsmoothness
Bregman divergence
Hidden convexity
Week 7 Mar 5th, Optimizers in Deep Learning and Variance-Reduced Gradient Methods
From SGD to Adam-mini
Control variate techniques
SARAH, SPIDER, STORM, PAGE algorithms
Week 8 Mar 12th, Matrix Optimizers
Muon and its variants
Week 9 Mar 19th, Spring break
Week 10 Mar 26th, Modern Nonconvex Optimization
Landscape: gradient dominance conditions, invexity, hidden convexity
Convergence to stationary points
Week 11 Apr 2nd, Stochastic Minimax, Bilevel, and Contextual Optimization
Generative adversarial networks, hyperparameter tuning, neural architecture search
Biased gradient methods
Model misspecification
Week 12 Apr 9th, Distributionally Robust Optimization
Equivalence between the square-root Lasso, regularized logistic regression, and distributionally robust optimization
Dual form of distributionally robust optimization
Computation and generalization
Week 13 Apr 16th, Applications I: Causal Learning as Optimization
Causal invariance learning
Causal inference: CATE, uplift modeling, and instrumental variable regression
DAG learning
Week 14 Apr 23rd, Applications II: Policy Optimization in Reinforcement Learning
Policy gradient theorem
Function approximation
Policy-based RL algorithms
Week 15 Apr 30th, Applications III: Variational Inference and LLM as Optimization
Particle gradient methods
Reinforcement learning from human feedback
Fine-tuning
Evaluation and Grading
There will be one quiz and a course project. The quiz accounts for 20% of the final grade and the project for 80%. There will also be 5 bonus take-home questions, each of which can contribute up to 5% of the final grade.
Course Project
The course project can be conducted individually or in a group of 2 or 3 students. The project can focus on designing new optimization algorithms, theoretical analysis of optimization algorithms, or a comprehensive implementation of algorithms in applications. A list of projects will be provided to registered students. Students can also design their own projects (please discuss with the TA and the lecturer).
Tentative Projects List
To be released later.
Reading Material
Optimization for Data Science, by Zebang Shen, Niao He, Bernd Gärtner, and Martin Jaggi.
Convex Optimization, by Stephen Boyd and Lieven Vandenberghe.
Convex Optimization: Algorithms and Complexity, by Sébastien Bubeck.
High-Dimensional Statistics: A Non-Asymptotic Viewpoint, by Martin J. Wainwright.
Optimization Methods for Large-Scale Machine Learning, by Léon Bottou, Frank E. Curtis, and Jorge Nocedal.