Instructor: Pranay Sharma
TA: Ankit Kumar
Time: Tuesdays and Fridays, 2.00-3.30 pm
Room: LT 006
Office Hours: Fridays 3.30-4.30 pm
Course Description
Modern machine learning applications often rely on data generated and/or collected at edge devices (phones, smart watches, sensors, etc.). However, bringing this data to a central location (often called a server) can be infeasible (high communication cost) or undesirable (privacy concerns). Distributed learning is essential in such applications to efficiently learn large-scale ML models.
The course will begin with the basics of optimization, followed by an extensive discussion of federated learning, a general framework for learning under the worker-server network architecture. The last few lectures will focus on distributed learning with agents connected in a more general network topology.
Lecture Notes
I: Basics
Introduction to Distributed Learning (handwritten)
Optimization basics - convex sets; convex functions; strict/strong convexity; smoothness; gradient descent (GD) (handwritten/scribed)
SGD - stochastic gradient method; convergence for smooth functions; comparison with GD (handwritten/scribed)
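As a companion to the two optimization lectures above, here is a minimal sketch (not taken from the lecture notes) contrasting full-batch gradient descent with stochastic gradient descent on a least-squares objective. The problem sizes, step sizes, and iteration counts are arbitrary choices for illustration.

```python
# Minimal GD vs. SGD sketch on a least-squares objective f(x) = (1/2n) ||A x - b||^2.
# All problem sizes and step sizes below are illustrative choices, not values from the lectures.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.1 * rng.standard_normal(n)

def full_gradient(x):
    # Gradient of the average squared loss over all n samples.
    return A.T @ (A @ x - b) / n

def gd(steps=500, lr=0.1):
    x = np.zeros(d)
    for _ in range(steps):
        x -= lr * full_gradient(x)
    return x

def sgd(steps=5000, lr=0.01):
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)              # sample one data point uniformly at random
        g = A[i] * (A[i] @ x - b[i])     # unbiased stochastic gradient
        x -= lr * g
    return x

for name, x_hat in [("GD", gd()), ("SGD", sgd())]:
    print(name, "loss:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```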
II: Introduction to Distributed Learning
Distributed SGD - minibatch SGD; distributed SGD; simplistic communication model; Local SGD; convergence for smooth losses; generalization error (handwritten/scribed)
Introduction to Federated Learning (FL) - local SGD vs FL; cross-device vs cross-silo; FedAvg; various challenges in FL (handwritten/scribed)
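To make the FedAvg discussion concrete, the following is a minimal single-machine simulation of FedAvg on synthetic quadratic client objectives: each client runs a few local gradient steps and the server averages the resulting models. The number of clients, local steps, and learning rates are assumptions for illustration, not the setup used in class.

```python
# Minimal single-machine simulation of FedAvg on synthetic least-squares client objectives.
# Number of clients, local steps, and learning rates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
num_clients, d = 10, 5
# Each client i holds data (A_i, b_i) defining a local least-squares loss.
client_data = [(rng.standard_normal((50, d)), rng.standard_normal(50)) for _ in range(num_clients)]

def local_update(x_global, A, b, local_steps=10, lr=0.05):
    """Run a few local (full-batch) gradient steps starting from the current global model."""
    x = x_global.copy()
    for _ in range(local_steps):
        x -= lr * A.T @ (A @ x - b) / len(b)
    return x

def fedavg(rounds=20):
    x = np.zeros(d)
    for _ in range(rounds):
        # Full client participation here; partial participation is treated in a later lecture.
        local_models = [local_update(x, A, b) for A, b in client_data]
        x = np.mean(local_models, axis=0)   # server aggregates by simple averaging
    return x

print("global model after FedAvg:", fedavg())
```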
III: Challenges in Federated Learning (FL)
Data Heterogeneity in FL - effects of non-iid data across clients; different proposed solutions; FedProx; variance-reduction based stochastic methods (SAG/SAGA); SCAFFOLD (handwritten/scribed)
Communication Efficiency in FL (Part I) - compression; quantization; Q-SGD and convergence analysis (handwritten/scribed; a quantization sketch follows this module's list)
Communication Efficiency in FL (Part II) - sparsification (Rand-k and Top-k); error feedback; Mem-SGD; composition of compression operators; Q-sparse-local-SGD (handwritten/scribed; a sparsification sketch follows this module's list)
Partial client availability in FL (Part I) - Generalized FedAvg (two-sided learning rates); convergence under full/partial client availability; under iid/non-iid data (handwritten/scribed)
Partial client availability in FL (Part II) - study of client drift; partial participation error; FedVARP; arbitrary client participation (handwritten)
Computational heterogeneity in FL - convergence to mismatched loss; FedNova (handwritten)
Personalization in FL - meta-learning-based approaches; user clustering; fine-tuning with interpolated data; model interpolation (handwritten/scribed)
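As referenced in the communication-efficiency lecture (Part I), the first sketch below illustrates Q-SGD-style unbiased stochastic quantization. The number of levels and the unbiasedness check are illustrative assumptions; the full scheme also encodes the quantized vector efficiently, which is omitted here.

```python
# Minimal sketch of Q-SGD-style unbiased stochastic quantization with s levels:
# Q_s(v)_i = ||v||_2 * sign(v_i) * xi_i, where xi_i rounds |v_i|/||v||_2 randomly
# to a grid point in {0, 1/s, ..., 1} so that E[Q_s(v)] = v.
# Vector size, number of levels, and the check below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)

def qsgd_quantize(v, s):
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s        # position of each coordinate on the s-level grid
    lower = np.floor(scaled)
    prob_up = scaled - lower             # round up with this probability (keeps the estimate unbiased)
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s

v = rng.standard_normal(1000)
# Averaging many independent quantizations should approximately recover v (unbiasedness check).
avg = np.mean([qsgd_quantize(v, s=4) for _ in range(2000)], axis=0)
print("max deviation from v:", np.max(np.abs(avg - v)))
```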
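The second sketch, referenced in the communication-efficiency lecture (Part II), illustrates Top-k sparsification with an error-feedback memory, in the spirit of Mem-SGD. The toy least-squares objective, the choice of k, and the step size are assumptions for illustration.

```python
# Minimal sketch of SGD with Top-k gradient sparsification and error feedback (memory),
# in the spirit of the Mem-SGD / error-feedback schemes discussed in the lecture.
# The quadratic objective, k, and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, k = 100, 10
A = rng.standard_normal((500, d))
b = rng.standard_normal(500)

def top_k(v, k):
    """Keep only the k largest-magnitude coordinates of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

x = np.zeros(d)
memory = np.zeros(d)                      # accumulates the compression error (error feedback)
lr = 0.05
for _ in range(2000):
    i = rng.integers(len(b))
    g = A[i] * (A[i] @ x - b[i])          # stochastic gradient
    p = memory + lr * g                   # add back previously dropped mass
    compressed = top_k(p, k)              # transmit only k coordinates
    memory = p - compressed               # remember what was dropped
    x -= compressed                       # apply the compressed update
print("final loss:", 0.5 * np.mean((A @ x - b) ** 2))
```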
IV: (Differential) Privacy in ML
Introduction to Privacy - privacy violation examples; randomized response; Differential Privacy (DP) - definition; implications; Post-processing; Laplace mechanism; Basic composition (handwritten/scribed; a Laplace-mechanism sketch follows this module's list)
Differential Privacy (contd.) - counting queries; Group privacy; Approximate DP; Gaussian mechanism; properties of approx. DP; advanced composition (handwritten)
DP in ML and DP-SGD - Threat models; Where to apply DP?; DP-SGD; convergence; Privacy amplification by Sampling (PABS); User-level privacy; DP-FedAvg (handwritten/scribed; a DP-SGD sketch follows this module's list)
Federated Multi-objective Optimization (slides/scribed)
Note: this lecture belongs to the previous module on "Challenges in FL," but was taught later in the course.
Background: Concentration of Probability Measure - moment-generating function; Hoeffding's lemma and inequality (handwritten)
Advanced DP Composition (Part I) - privacy loss distribution; composition via privacy loss dist.; Concentrated DP; Gaussian mechanism is concentrated-DP (handwritten/scribed)
Advanced DP Composition (Part II) - Advanced composition of concentrated-DP; PABS (handwritten/scribed)
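As a companion to the introductory privacy lecture, here is a minimal sketch of the Laplace mechanism applied to a counting query: such a query has sensitivity 1, so adding Laplace noise with scale 1/epsilon gives epsilon-DP. The dataset, predicate, and value of epsilon are illustrative assumptions.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# A counting query has sensitivity 1 (changing one record changes the count by at most 1),
# so adding Laplace(1/epsilon) noise to the exact count yields epsilon-DP.
# The dataset, predicate, and epsilon below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
ages = rng.integers(18, 90, size=1000)            # toy dataset of ages

def private_count(data, predicate, epsilon):
    exact = np.sum(predicate(data))               # exact answer to the counting query
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return exact + noise

print("exact count           :", np.sum(ages >= 65))
print("private count (eps=.5):", private_count(ages, lambda a: a >= 65, epsilon=0.5))
```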
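To accompany the DP-SGD lecture, the following is a minimal sketch of the per-example gradient clipping and Gaussian noise addition at the heart of DP-SGD. The clipping norm, noise multiplier, and toy logistic-regression setup are illustrative assumptions, and no privacy accounting is performed.

```python
# Minimal sketch of the DP-SGD update: per-example gradient clipping followed by Gaussian noise.
# Clipping norm C, noise multiplier sigma, and the toy logistic-regression data are assumptions;
# a real implementation would also track the privacy budget (e.g., via advanced composition).
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (X @ w_true + 0.5 * rng.standard_normal(n) > 0).astype(float)

def per_example_grads(w, Xb, yb):
    # Logistic-loss gradients, one row per example in the minibatch.
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return (p - yb)[:, None] * Xb

w = np.zeros(d)
lr, C, sigma, batch = 0.1, 1.0, 1.0, 64
for _ in range(500):
    idx = rng.choice(n, size=batch, replace=False)   # Poisson sampling is used in the analysis;
    grads = per_example_grads(w, X[idx], y[idx])     # uniform subsampling here for simplicity
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / C)     # clip each example's gradient to norm <= C
    noise = sigma * C * rng.standard_normal(d)       # Gaussian noise scaled to the sensitivity
    w -= lr * (clipped.sum(axis=0) + noise) / batch
print("trained weight norm:", np.linalg.norm(w))
```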
V: Decentralized Learning
Introduction to Decentralized Learning (handwritten/scribed)
Graph Theory - connectivity, irreducibility, aperiodicity; stochastic matrices; Perron-Frobenius Theory; Consensus complexity (handwritten Part 1 and Part 2/scribed)
Decentralized Gradient Descent (Part I) - convergence for smooth functions; difference from centralized GD (handwritten; a DGD sketch follows this module's list)
Decentralized Gradient Descent (Part II) (handwritten)
Gradient Tracking and Stochastic Optimization over Networks (handwritten)
Future of Decentralized Learning (handwritten/scribed)
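To complement the decentralized gradient descent lectures above, here is a minimal sketch of DGD on a ring of agents: each agent mixes its iterate with its neighbors' using a doubly stochastic matrix and then takes a local gradient step. The ring topology, mixing weights, and quadratic local losses are assumptions for illustration.

```python
# Minimal sketch of decentralized gradient descent (DGD) over a ring of agents:
# x_i^{t+1} = sum_j W_ij x_j^t - alpha * grad f_i(x_i^t), with W doubly stochastic.
# The ring topology, mixing weights, and quadratic local losses are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
m, d = 8, 5                     # m agents, model dimension d
# Each agent i holds a local least-squares loss f_i(x) = (1/2 n_i) ||A_i x - b_i||^2.
local_data = [(rng.standard_normal((30, d)), rng.standard_normal(30)) for _ in range(m)]

# Doubly stochastic mixing matrix for a ring: each agent averages itself and its two neighbors.
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1.0 / 3.0

def local_grad(i, x):
    A, b = local_data[i]
    return A.T @ (A @ x - b) / len(b)

X = np.zeros((m, d))            # row i is agent i's current iterate
alpha = 0.02
for _ in range(300):
    mixed = W @ X                                       # consensus/mixing step with neighbors
    grads = np.stack([local_grad(i, X[i]) for i in range(m)])
    X = mixed - alpha * grads                           # local gradient step
print("disagreement across agents:", np.linalg.norm(X - X.mean(axis=0)))
```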
Note: since it was a small class, there were not enough students to scribe all the lectures, and I was too lazy to do it myself.