Semester: Fall 2024
Prerequisites: Linear Algebra (MATH 257 or 415), Grad Students: Programming Knowledge
Instructor: Prof. R. Srikant, rsrikant@illinois.edu
TAs: Saptarshi Mandal smandal4@illinois.edu; Yashaswini Murthy ymurthy2@illinois.edu
Office Hours: Yashaswini: 4-5 Wed; Saptarshi: 5-6 Wed; both in ECEB 3036
Lectures: 11-12:20 TuTh in Room 3081 ECEB
Thanksgiving Break: Nov. 23-Dec. 1
Last day of instruction: Dec. 11
Grading:
Quizzes: Homework will be posted with solutions, and there will be a quiz each Thursday based on the most recently posted homework. There will not be a quiz on the first Thursday (8/29), but there will be one every Thursday after that unless explicitly cancelled.
Undergrads: 100% of the grade will be based on weekly Thursday quizzes (you can miss up to two)
Grad Students: 80% of the grade will be based on weekly Thursday quizzes (you can miss up to two) and the remaining 20% will be based on a final project
Final project (20%): Due Dec. 13 at midnight. Read and write a report on two algorithms: Nesterov's accelerated gradient method and mirror descent. The TAs and the instructor will not provide help on the project; you have to find appropriate sources on the Internet and write a report providing precise proofs of convergence, and you must cite the sources you use. You may use ChatGPT to find resources, to learn about the material, or to do coding, but you are responsible for any errors produced by ChatGPT. The report must be typewritten by you in your own words (not verbatim from any source), must be no more than 15 pages long (it can be shorter if you are able to convey everything that is needed in fewer pages), and must be in 11 pt font or bigger. The report should be submitted as a PDF file; details on how to submit the final report via Gradescope will be provided later. Your project report should include the following:
The report should contain an introduction section, a section on accelerated gradient descent, a section on mirror descent, a section on numerical examples and a conclusion section.
The introduction should state what the project is about and what the reader can learn by reading the report. The conclusion should contain a concise summary of the findings in the report.
The sections on accelerated gradient descent and mirror descent should contain a clear description of the corresponding algorithm, a proof of convergence, and a comparison of the theoretical rate of convergence of the algorithm with the theoretical rate of convergence of standard gradient descent or standard projected gradient descent as appropriate. The section on mirror descent should also clearly show how standard projected gradient descent is a special case of mirror descent.
The section on numerical examples should contain two plots. One plot should compare the numerical performance of accelerated gradient descent with that of standard gradient descent, showing that accelerated gradient descent is clearly better. The other plot should compare the numerical performance of mirror descent with that of standard projected gradient descent, showing that mirror descent is clearly better. In each plot, the x-axis should be the number of iterations and the y-axis should be the value of the objective function being minimized. Each plot can be for a different optimization problem; you have to make up your own optimization problems guided by the theory, since the theory will tell you the type of optimization problem for which each algorithm is better than standard gradient descent. You have to include the code used to generate the numerical examples; you can use any programming language and any built-in libraries for this purpose, but you are responsible for the accuracy of the code. The code counts toward your 15 pages.
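To illustrate the kind of comparison the first plot asks for, here is a minimal sketch of standard gradient descent versus Nesterov's accelerated method on an ill-conditioned quadratic. Every choice below (the diagonal quadratic objective, the dimension, the step size 1/L, the fixed momentum parameter for the strongly convex variant) is an illustrative assumption, not part of the assignment; you must construct and justify your own problems.

```python
import numpy as np

# Illustrative problem: f(x) = 0.5 * x^T A x with an ill-conditioned diagonal A.
n = 50
diag = np.linspace(1.0, 100.0, n)   # eigenvalues from 1 to 100 -> condition number 100
L, mu = diag.max(), diag.min()      # smoothness and strong-convexity constants

def f(x):
    return 0.5 * np.sum(diag * x * x)

def grad(x):
    return diag * x

x0 = np.ones(n)
T = 200

# Standard gradient descent with fixed step size 1/L.
x = x0.copy()
gd_vals = []
for _ in range(T):
    gd_vals.append(f(x))
    x = x - (1.0 / L) * grad(x)

# Nesterov's accelerated gradient (strongly convex variant, fixed momentum).
x = x0.copy()
y = x0.copy()
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
agd_vals = []
for _ in range(T):
    agd_vals.append(f(x))
    x_next = y - (1.0 / L) * grad(y)   # gradient step at the extrapolated point
    y = x_next + beta * (x_next - x)   # momentum (extrapolation) step
    x = x_next

print(gd_vals[-1], agd_vals[-1])
```

Plotting `gd_vals` and `agd_vals` against the iteration count gives a plot of the required form; on this problem the accelerated iterates drive the objective down much faster, consistent with the improved theoretical rate.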
Topics (as time permits):
· Applications: least squares, least squares with regularization, SVM, neural networks, power allocation in wireless networks, Internet congestion control, the Netflix recommendation problem
· Inf vs min, sup vs max, global vs local optimality, existence of optimal solutions and the Weierstrass extreme value theorem, convex sets, convex and concave functions, global and local minima for convex functions, uniqueness for strictly convex functions, tests for convex/concave functions and strongly convex/concave functions using Hessian matrix, gradient-based characterization of convex and concave functions
· First-order necessary and second-order sufficient conditions for local minimum
· Gradient descent, descent lemma, convergence analysis for a fixed step size for functions with Lipschitz-continuous derivatives, for convex functions, and for strongly convex functions, the role of the condition number and a quadratic objective example (least squares is a special case); variable step sizes, Armijo’s rule and its convergence; applications to neural networks and the backpropagation algorithm
· Newton’s method and improvement in convergence rate locally; problems with implementing Newton’s method and some fixes
· Constrained optimization, projected gradient descent, and its convergence for a fixed step size
· Constrained optimization with equality and inequality constraints; Lagrange multipliers, the KKT necessary conditions and the KKT sufficient conditions; application to power control in wireless networks and the waterfilling solution; sensitivity and Lagrange multipliers
· Penalty and barrier function methods, and their convergence
· Duality: the primal and dual problems; strong and weak duality and the Slater conditions, minimax characterization, linear programming and the economic interpretation of the dual; applications of strong duality to SVM
· Subgradients, subgradient descent and its convergence; application of subgradient descent to the dual problem; decomposition and its application to Internet congestion control
· Conjugate gradient method to avoid inverting the Hessian
· Proximal gradient descent and application to sparse regression (leads to an algorithm called ISTA for the l1-regularized least squares problem)
· Augmented Lagrangian methods (aka method of multipliers)
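As a small taste of the gradient-descent topics above, here is a minimal sketch of a fixed step size 1/L versus Armijo's backtracking rule on a least-squares objective. The data, the backtracking parameters, and the iteration counts are illustrative assumptions only.

```python
import numpy as np

# Illustrative least-squares problem: f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def f(x):
    r = A @ x - b
    return 0.5 * (r @ r)

def grad(x):
    return A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient

# Fixed step size 1/L: guaranteed decrease by the descent lemma.
x = np.zeros(10)
for _ in range(500):
    x = x - (1.0 / L) * grad(x)

# Armijo's rule: backtrack until a sufficient-decrease condition holds.
def armijo_step(x, s=1.0, beta=0.5, sigma=1e-4):
    g = grad(x)
    t = s
    while f(x - t * g) > f(x) - sigma * t * (g @ g):
        t *= beta
    return x - t * g

y = np.zeros(10)
for _ in range(500):
    y = armijo_step(y)

# Compare both iterates against the exact least-squares solution.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star), f(y) - f(x_star))
```

Both optimality gaps shrink to essentially zero; Armijo's rule avoids needing L in advance at the cost of extra function evaluations per iteration.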