Description
This self-contained introduction to machine learning, designed from the start with engineers in mind, will equip students with everything they need to start applying machine learning principles and algorithms to real-world engineering problems. With a consistent emphasis on the connections between estimation, detection, information theory, and optimization, it includes: an accessible overview of the relationships between machine learning and signal processing, providing a solid foundation for further study; clear explanations of the differences between state-of-the-art techniques and more classical methods, equipping students with all the understanding they need to make informed technique choices; demonstration of the links between information-theoretical concepts and their practical engineering relevance; reproducible examples using Matlab, enabling hands-on student experimentation. Assuming only a basic understanding of probability and linear algebra, and accompanied by lecture slides and solutions for instructors, this is the ideal introduction to machine learning for engineering students of all disciplines.
Table of Content
1 When and how to use machine learning
Overview
What is machine learning for?
Why study machine learning now?
What is machine learning?
Domain knowledge-based model-driven design
Inductive bias-based data-driven design: machine learning
Taxonomy of machine learning methods
Supervised learning
Unsupervised learning
Reinforcement learning
When to use machine learning?
Pros and cons
Checklist
Summary
Recommended resources
References
2 Background
Overview
Random variables
Discrete random variables
Continuous random variables
Expectation
Variance
Vectors
Definitions
Inner product
Squared ℓ2 norm
Cosine similarity
Linearly independent vectors
Matrices
Definitions
Products
Symmetric and positive (semi-)definite matrices
Invertible matrices
Trace and determinant
Random vectors
Discrete random vectors
Continuous random vectors
Covariance and covariance coefficient
Marginal distributions
Conditional distributions
Independence of random variables and chain rule of probability
Bayes’ theorem
Law of iterated expectations
Summary
Recommended resources
Problems
References
3 Inference, or model-driven prediction
Overview
Defining inference
Input, output, and population distribution
Detection and estimation
Hard and soft predictors
Optimal hard prediction
Loss function
Population loss
Optimal hard prediction
Optimal hard prediction when x and t are independent
Optimal hard prediction under the quadratic loss
Optimal hard prediction under the detection-error loss
Optimal prediction for jointly Gaussian random vectors
The case of jointly Gaussian random variables
The case of jointly Gaussian random vectors
KL divergence and cross entropy
Log-distribution ratio
KL divergence
Cross entropy
JS divergence
Optimal soft prediction
Log-loss, or cross-entropy loss
Information-theoretic interpretations of the log-loss
Population log-loss and optimal soft prediction
Optimal soft prediction when x and t are independent
Entropy
Optimal soft prediction
Conditional entropy
On soft and hard predictors
Optimal soft predictor as minimizer of the KL divergence
Mutual information
Definition
Some properties of the mutual information
Mutual information and KL divergence
Mutual information and explained uncertainty
Data processing inequality
Variational bounds on the mutual information
Log-loss as a “universal” loss function
Free energy
Minimizing the reverse KL divergence
Introducing the free energy
Interpreting the free energy
Alternative views on the free energy
General definition of free energy involving an unnormalized distribution
Summary
Recommended resources
Problems
References
Overview
Defining supervised learning
Regression and classification
Generalization
Training and test data
Population distribution
Recurring examples
Training hard predictors
Model class
Training loss
Empirical risk minimization
Solving a least squares problem
Inductive bias selection and validation
Underfitting
Overfitting
Validation
K-fold cross-validation
Bias and estimation error
Population loss = minimum population loss + bias + estimation error
Epistemic and aleatoric uncertainty
Beyond the bias vs. estimation error trade-off
Regularization
Ridge regression
Regularized ERM
Testing
Training soft predictors
Model class
Population log-loss
Training log-loss and maximum likelihood learning
Interpretations of ML learning
From soft to hard predictors
MAP learning and regularization
MAP learning
MAP learning as minimum description length inference
Kernel methods
Introducing kernel methods
Kernel functions
“Kernelizing” ridge regression
Reduced-complexity kernel methods
Summary
Recommended resources
Problems
References
Overview
Optimization for training
Solving an optimization problem
Global minima
Local minima
Stationary points
Stochastic optimality guarantees
First-order necessary optimality condition for single-variable functions
Local linear approximation
First-order necessary optimality condition
Derivative as direction of local descent
Second-order optimality conditions for single-variable functions
Second derivative and curvature
Second-order sufficient and necessary optimality conditions
Summary of first- and second-order optimality conditions
Optimality conditions for convex single-variable functions
Definition of convex single-variable functions
Stationarity as a sufficient global optimality condition for convex functions
Strictly convex functions
Optimality conditions for multiple-variable functions
Gradient and first-order necessary optimality condition
First-order necessary optimality condition
Directional derivative
Hessian matrix
Second-order optimality conditions
Optimality conditions for convex functions
Gradient descent
Local search-based optimization and gradient descent
Local strictly convex quadratic approximant
Gradient descent as successive convex approximation minimization
Negative gradient as a “force” field
Convergence of gradient descent
Why can GD converge to a stationary point?
Smoothness and global upper bound property
Convergence of gradient descent
Iteration complexity
Choosing the learning rate and the initialization
Second-order methods
Stochastic gradient descent
Stochastic gradient updates
Choosing the mini-batch
Iterations vs. gradient computations
Convergence of stochastic gradient descent
Properties of the stochastic gradient
Convergence properties of SGD
Choosing the learning rate for SGD
Minimizing population loss vs. minimizing training loss
Symbolic and numerical differentiation
Symbolic differentiation
Numerical differentiation
Automatic differentiation
Computational graph
Introducing backprop
General form of backprop
Backprop computes all the partial derivatives of the output g(θ)g(\theta)g(θ)
Summary
Recommended resources
Problems
References
Overview
Discriminative linear models for binary classification: hard predictors
Model class
Classification margin and correct classification condition
Surrogate loss function
Training binary classifiers via SGD
Perceptron algorithm
Discriminative linear models for binary classification: soft predictors and logistic regression
Model class
Training probabilistic models via SGD
Logistic regression: model class
Logistic regression: training via SGD
Kernelized binary classifiers
Generalized linear models
Discriminative non-linear models for binary classification: multi-layer neural networks
Multi-layer feedforward neural network model class
Neurons, synapses, and computational graph
Training neural networks via backprop: problem definition and computational graph
Training neural networks via backprop: forward pass and backward pass
Training neural networks via backprop: combining forward and backward passes
Training neural networks via backprop: a summary
Approximations of backprop for locality and reduced computational complexity
Beyond feedforward neural networks
Generative models for binary classification
Model class
Ancestral data sampling with a trained generative model
Inference with a trained generative model
Training generative models
Quadratic discriminant analysis
Naive Bayes classification
Discriminative linear models for multi-class classification: softmax regression
Using binary classifiers for multi-class classification
Softmax function
Softmax regression: model class
Training softmax regression
Discriminative non-linear models for multi-class classification: multi-layer neural networks
Model class
Training neural networks
Generative models for multi-class classification
Inference with a trained generative model
Training generative models
Mixture models
Beyond feedforward multi-layer neural networks
Invariance
Equivariance
K-nearest neighbors classification
Applications to regression
Summary
Recommended resources
Problems
References
Overview
Unsupervised learning tasks
Density estimation
Histogram-based density estimation
Kernel density estimation
Contrastive density learning
Other density estimation methods
Contrastive density ratio learning
Latent-variable models
Autoencoders
Training autoencoders
Principal component analysis
Sparse dictionary learning
Neural autoencoders
Discriminative models
Contrastive representation learning
Undirected generative models
Restricted Boltzmann machines: model class
Restricted Boltzmann machines: training
Energy-based models: model class
Energy-based models: training
Drected generative models
Training directed generative models: K-means clustering
Model class
Supervised and unsupervised losses
ERM problem
K-means clustering
Interpreting K-means
K-means as alternate optimization and convergence
Connection with directed generative models
Training directed generative models: expectation maximization
Supervised vs. unsupervised training loss
Free energy and optimal soft prediction
Free energy as a bound on the log-loss
Introducing EM as the successive optimization of local approximants
EM algorithm
K-means and EM, revisited
EM for mixture of Gaussians
Probabilistic PCA
Convergence of EM
Scaling EM
Beyond EM
EM as behavioral principle for biological systems
Directed generative models for density estimation
Summary
Recommended resources
Problems
References
Overview
Benchmarks and decomposition of the optimality error
Bias vs. estimation error decomposition of the optimality error
Training algorithms and ERM
Probably approximately correct learning
PAC condition
PAC training algorithm
Sample complexity
Is ERM PAC?
PAC learning for finite model classes
Capacity of a finite model class
Sample complexity of ERM for finite model classes
Proof of the ERM sample complexity and generalization error
PAC learning for general model classes: VC dimension
Quantizing a model class
Capacity of a model class, revisited
Definition of VC dimension
VC dimension and sample complexity
PAC Bayes and information-theoretic bounds
To recap: generalization error in PAC learning
Stochastic training algorithms
PAC Bayes generalization bounds
Information-theoretic generalization bounds
Summary
Recommended resources
Problems
References
Overview
Definitions and examples
Definition of the exponential family
Examples
Natural and mean parameters
Exponential-family distributions as maximum-entropy models
Gradient of the log-loss
Gradient and mean error
Proof of the relationship between gradient and mean error
From natural parameters to mean parameters and possibly back
ML learning
Gradient of the training log-loss
Mean matching condition
Gradient descent-based ML learning
Information-theoretic metrics
Entropy
KL divergence
Cross entropy
Fisher information matrix (FIM)
Definition of the Fisher information matrix
FIM and quadratic approximation of the KL divergence
FIM as covariance of the gradient of the log-loss
Natural gradient descent
FIM for the exponential family
Generalized linear models
Definitions
Logistic regression as a generalized linear model
Softmax regression as a generalized linear model
Probit regression
Generalized linear models for unsupervised learning
Summary
Recommended resources
Problems
References
Overview
Variational inference, amortized variational inference, and variational EM
Variational inference
Amortized VI
Variational Expectation Maximization
Exact Bayesian inference
Conjugate exponential family
Beta-Bernoulli model
Table of conjugate exponential family distributions
Conjugate models outside the exponential family
Laplace approximation
Introducing variational inference
Mean-field variational inference
Mean-field factorization of the variational posterior
Coordinate descent optimization
Mean-field VI for the Ising model
Generalizing mean-field VI
Introducing parametric variational inference
Black-box variational inference
Stochastic optimization
REINFORCE gradient
Some properties of the REINFORCE gradient
REINFORCE gradient algorithm
Selecting the baseline
Black-box variational inference
Reparametrization-based variational inference
Reparametrizability of a parametric variational posterior
Reparametrization trick
Gaussian reparametrization trick
Reparametrization-based gradient algorithm
Reparametrization-based variational inference
Gradient of the KL term
Reparametrization-based VI with Gaussian prior and variational posterior
Broadening the applicability of reparametrization-based VI
Combining factorization and parametrization for variational inference
Particle-based variational inference and Stein variational gradient descent
Particle-based variational inference
Stein variational gradient descent
Amortized variational inference
Black-box amortized variational inference
Reparametrization-based amortized variational inference
Variational expectation maximization
A brief recap of the EM algorithm
Variational EM
Black-box VEM
Variational autoencoders
Summary
Recommended resources
Problems
References
Overview
I-projection and M-projection
Defining I-projection and M-projection
Mode-seeking vs. inclusive approximations
Bridging the gap between I-projection and M-projection
Generalized variational inference and generalized variational expectation maximization
Generalized variational inference
Generalized variational expectation maximization
Solving the GVI and GVEM problems: generalized posterior distributions
Approximate solution of GVI problem
Maximum entropy learning
Defining maximum entropy learning
Addressing maximum entropy learning
InfoMax
Defining InfoMax
Addressing InfoMax
Information bottleneck
Rate-distortion encoding
Two-sample estimation of information-theoretic metrics
Two-sample estimation of the KL divergence
Discriminator-based two-sample estimators of the KL divergence
Classifier-based two-sample estimators of the KL divergence
Properties of two-sample estimators of the KL divergence
Estimation of the mutual information
Beyond the KL divergence: f-divergences and two-sample estimators
f-divergences
Two-sample estimation of f-divergences
Generative adversarial networks (GANs)
GAN model
GAN learning problem
GAN as a min-max optimization problem
GAN training as a game between generator and discriminator
Integral probability metrics-based GANs
Distributionally robust learning
Summary
Recommended resources
Problems
References
Overview
Frequentist learning and calibration
Frequentist learning via ML
Calibration
Basics of Bayesian learning
Bayesian learning as inference
Bayesian vs. frequentist learning
Gibbs sampling and Bayesian ensembling
Marginal likelihood and marginal training log-loss
Why Bayesian learning?
Bayesian learning quantifies epistemic uncertainty and can detect “out-of-distribution” data
Bayesian learning is robust to overfitting and enables model selection without validation
Bayesian learning is optimal if the model is well specified
Bayesian learning minimizes a bound on the population loss (even for misspecified models)
Bayesian learning facilitates online learning
Bayesian learning facilitates active learning
Exact Bayesian learning
When x and t are independent
Gaussian-Gaussian GLM
Laplace approximation
Laplace approximation fits a Gaussian pdf at the MAP solution
Simplification of Laplace approximation via the empirical FIM
Simplification of Laplace approximation via the generalized Gauss-Newton method
MC sampling-based Bayesian learning
Goal of MC sampling
Rejection sampling
Markov chain Monte Carlo: Metropolis-Hastings algorithm
Stochastic gradient Markov chain Monte Carlo
Monte Carlo methods for likelihood-free models
Importance sampling
Variational Bayesian learning
Reparametrization-based variational Bayesian learning
Adversarial variational Bayesian learning
Particle-based variational Bayesian learning
Model selection without validation: Empirical Bayes
Empirical Bayes, aka type-II ML
Gaussian-Gaussian GLM
Laplace approximation for empirical Bayes
Bayesian non-parametric learning
Gaussian-Gaussian GLM revisited
Gaussian processes
Dirichlet process mixture models
Bayesian learning with local latent variables
Generalized Bayesian learning
Summary
Recommended resources
Problems
References
Overview
Transfer learning
Defining transfer learning
Covariate shift
General shift
Likelihood ratio-based transfer learning
Multi-task learning
Inductive bias assumed by multi-task learning
Two-level ERM formulation
Non-parametric multi-task learning
Continual learning
Defining continual learning
Stability vs. plasticity
Coreset-based continual learning
Regularization-based continual learning
Extensions
Meta-learning
Meta-training and meta-testing
Reviewing conventional learning
Joint learning
Introducing meta-learning
An improved meta-learning criterion: splitting the meta-training set
MAML: meta-learning the initialization
Meta-inductive bias
Bayesian perspective on transfer, multi-task, continual, and meta-learning
Bayesian multi-task learning
Bayesian transfer learning
Bayesian continual learning: exact solutions
Bayesian continual learning: approximate solutions
Bayesian meta-learning
Summary
Recommended resources
References
Overview
Frequentist federated learning
Parameter server setting
Single-agent learning
Global learning
One-shot, or “embarrassingly parallel”, FL
FedSGD
Local SGD-based FL
Aggregation function at the server
Selecting the set of active agents
Reducing the communication overhead
Testing the performance of FL
Convergence properties of FedAvg
Beyond FedAvg
Private federated learning
Differential privacy
Gaussian mechanism
Gaussian mechanism for federated learning
Bayesian federated learning
Defining (generalized) Bayesian federated learning
Federated variational inference: model and definitions
Federated variational inference with unconstrained approximate likelihoods
Federated variational inference with exponential-family models
Summary
Recommended resources
References
15 Beyond this book
Overview
Probabilistic graphical models
Bayesian networks
Markov random fields
Adversarial attacks
Causality
Quantum machine learning
Machine unlearning
General AI?
Recommended resources
References
Cite as: O. Simeone, "Machine Learning for Engineers," Cambridge University Press, 2022 . [O. Simeone, Machine learning for Engineers, Cambridge University Press, 2022. ]
For feedback, please contact me directly.