Machine Learning for Engineers

Description

This self-contained introduction to machine learning, designed from the start with engineers in mind, will equip students with everything they need to start applying machine learning principles and algorithms to real-world engineering problems. With a consistent emphasis on the connections between estimation, detection, information theory, and optimization, it includes: an accessible overview of the relationships between machine learning and signal processing, providing a solid foundation for further study; clear explanations of the differences between state-of-the-art techniques and more classical methods, equipping students with all the understanding they need to make informed technique choices; demonstration of the links between information-theoretical concepts and their practical engineering relevance; reproducible examples using Matlab, enabling hands-on student experimentation. Assuming only a basic understanding of probability and linear algebra, and accompanied by lecture slides and solutions for instructors, this is the ideal introduction to machine learning for engineering students of all disciplines.

Table of Content

Part I Introduction and background

1 When and how to use machine learning

Overview
What is machine learning for?
Why study machine learning now?
What is machine learning?
- Domain knowledge-based model-driven design
- Inductive bias-based data-driven design: machine learning
Taxonomy of machine learning methods
- Supervised learning
- Unsupervised learning
- Reinforcement learning
When to use machine learning?
- Pros and cons
- Checklist
Summary
Recommended resources
References

2 Background

Overview
Random variables
- Discrete random variables
- Continuous random variables
Expectation
Variance
Vectors
- Definitions
- Inner product
- Squared ℓ2 norm
- Cosine similarity
- Linearly independent vectors
Matrices
- Definitions
- Products
- Symmetric and positive (semi-)definite matrices
- Invertible matrices
- Trace and determinant
Random vectors
- Discrete random vectors
- Continuous random vectors
- Covariance and covariance coefficient
Marginal distributions
Conditional distributions
Independence of random variables and chain rule of probability
Bayes’ theorem
Law of iterated expectations
Summary
Recommended resources
Problems
References

Part II Fundamental concepts and algorithms

3 Inference, or model-driven prediction

Overview
Defining inference
- Input, output, and population distribution
- Detection and estimation
- Hard and soft predictors
Optimal hard prediction
- Loss function
- Population loss
- Optimal hard prediction
- Optimal hard prediction when x and t are independent
- Optimal hard prediction under the quadratic loss
- Optimal hard prediction under the detection-error loss
Optimal prediction for jointly Gaussian random vectors
- The case of jointly Gaussian random variables
- The case of jointly Gaussian random vectors
KL divergence and cross entropy
- Log-distribution ratio
- KL divergence
- Cross entropy
- JS divergence
Optimal soft prediction
- Log-loss, or cross-entropy loss
- Information-theoretic interpretations of the log-loss
- Population log-loss and optimal soft prediction
- Optimal soft prediction when x and t are independent
- Entropy
- Optimal soft prediction
- Conditional entropy
- On soft and hard predictors
- Optimal soft predictor as minimizer of the KL divergence
Mutual information
- Definition
- Some properties of the mutual information
- Mutual information and KL divergence
- Mutual information and explained uncertainty
- Data processing inequality
- Variational bounds on the mutual information
Log-loss as a “universal” loss function
Free energy
- Minimizing the reverse KL divergence
- Introducing the free energy
- Interpreting the free energy
- Alternative views on the free energy
- General definition of free energy involving an unnormalized distribution
Summary
Recommended resources
Problems
References

4 Supervised learning: getting started

Overview
Defining supervised learning
- Regression and classification
- Generalization
- Training and test data
- Population distribution
- Recurring examples
Training hard predictors
- Model class
- Training loss
- Empirical risk minimization
- Solving a least squares problem
Inductive bias selection and validation
- Underfitting
- Overfitting
- Validation
- K-fold cross-validation
Bias and estimation error
- Population loss = minimum population loss + bias + estimation error
- Epistemic and aleatoric uncertainty
Beyond the bias vs. estimation error trade-off
Regularization
- Ridge regression
- Regularized ERM
Testing
Training soft predictors
- Model class
- Population log-loss
- Training log-loss and maximum likelihood learning
- Interpretations of ML learning
- From soft to hard predictors
MAP learning and regularization
- MAP learning
- MAP learning as minimum description length inference
Kernel methods
- Introducing kernel methods
- Kernel functions
- “Kernelizing” ridge regression
- Reduced-complexity kernel methods
Summary
Recommended resources
Problems
References

5 Optimization for machine learning

Overview
Optimization for training
Solving an optimization problem
- Global minima
- Local minima
- Stationary points
- Stochastic optimality guarantees
First-order necessary optimality condition for single-variable functions
- Local linear approximation
- First-order necessary optimality condition
- Derivative as direction of local descent
Second-order optimality conditions for single-variable functions
- Second derivative and curvature
- Second-order sufficient and necessary optimality conditions
- Summary of first- and second-order optimality conditions
Optimality conditions for convex single-variable functions
- Definition of convex single-variable functions
- Stationarity as a sufficient global optimality condition for convex functions
- Strictly convex functions
Optimality conditions for multiple-variable functions
- Gradient and first-order necessary optimality condition
- First-order necessary optimality condition
- Directional derivative
- Hessian matrix
- Second-order optimality conditions
- Optimality conditions for convex functions
Gradient descent
- Local search-based optimization and gradient descent
- Local strictly convex quadratic approximant
- Gradient descent as successive convex approximation minimization
- Negative gradient as a “force” field
Convergence of gradient descent
- Why can GD converge to a stationary point?
- Smoothness and global upper bound property
- Convergence of gradient descent
- Iteration complexity
- Choosing the learning rate and the initialization
Second-order methods
Stochastic gradient descent
- Stochastic gradient updates
- Choosing the mini-batch
- Iterations vs. gradient computations
Convergence of stochastic gradient descent
- Properties of the stochastic gradient
- Convergence properties of SGD
- Choosing the learning rate for SGD
Minimizing population loss vs. minimizing training loss
Symbolic and numerical differentiation
- Symbolic differentiation
- Numerical differentiation
Automatic differentiation
- Computational graph
- Introducing backprop
- General form of backprop
- Backprop computes all the partial derivatives of the output g(θ)g(\theta)g(θ)
Summary
Recommended resources
Problems
References

6 Supervised learning: beyond least squares

Overview
Discriminative linear models for binary classification: hard predictors
- Model class
- Classification margin and correct classification condition
- Surrogate loss function
- Training binary classifiers via SGD
- Perceptron algorithm
Discriminative linear models for binary classification: soft predictors and logistic regression
- Model class
- Training probabilistic models via SGD
- Logistic regression: model class
- Logistic regression: training via SGD
- Kernelized binary classifiers
- Generalized linear models
Discriminative non-linear models for binary classification: multi-layer neural networks
- Multi-layer feedforward neural network model class
- Neurons, synapses, and computational graph
- Training neural networks via backprop: problem definition and computational graph
- Training neural networks via backprop: forward pass and backward pass
- Training neural networks via backprop: combining forward and backward passes
- Training neural networks via backprop: a summary
- Approximations of backprop for locality and reduced computational complexity
- Beyond feedforward neural networks
Generative models for binary classification
- Model class
- Ancestral data sampling with a trained generative model
- Inference with a trained generative model
- Training generative models
- Quadratic discriminant analysis
- Naive Bayes classification
Discriminative linear models for multi-class classification: softmax regression
- Using binary classifiers for multi-class classification
- Softmax function
- Softmax regression: model class
- Training softmax regression
Discriminative non-linear models for multi-class classification: multi-layer neural networks
- Model class
- Training neural networks
Generative models for multi-class classification
- Inference with a trained generative model
- Training generative models
Mixture models
Beyond feedforward multi-layer neural networks
- Invariance
- Equivariance
K-nearest neighbors classification
Applications to regression
Summary
Recommended resources
Problems
References

7 Unsupervised learning

Overview
Unsupervised learning tasks
Density estimation
- Histogram-based density estimation
- Kernel density estimation
- Contrastive density learning
- Other density estimation methods
- Contrastive density ratio learning
Latent-variable models
Autoencoders
- Training autoencoders
- Principal component analysis
- Sparse dictionary learning
- Neural autoencoders
Discriminative models
- Contrastive representation learning
Undirected generative models
- Restricted Boltzmann machines: model class
- Restricted Boltzmann machines: training
- Energy-based models: model class
- Energy-based models: training
Drected generative models
Training directed generative models: K-means clustering
- Model class
- Supervised and unsupervised losses
- ERM problem
- K-means clustering
- Interpreting K-means
- K-means as alternate optimization and convergence
- Connection with directed generative models
Training directed generative models: expectation maximization
- Supervised vs. unsupervised training loss
- Free energy and optimal soft prediction
- Free energy as a bound on the log-loss
- Introducing EM as the successive optimization of local approximants
- EM algorithm
- K-means and EM, revisited
- EM for mixture of Gaussians
- Probabilistic PCA
- Convergence of EM
- Scaling EM
- Beyond EM
- EM as behavioral principle for biological systems
- Directed generative models for density estimation
Summary
Recommended resources
Problems
References

8 Statistical learning theory

Overview
Benchmarks and decomposition of the optimality error
- Bias vs. estimation error decomposition of the optimality error
- Training algorithms and ERM
Probably approximately correct learning
- PAC condition
- PAC training algorithm
- Sample complexity
- Is ERM PAC?
PAC learning for finite model classes
- Capacity of a finite model class
- Sample complexity of ERM for finite model classes
- Proof of the ERM sample complexity and generalization error
PAC learning for general model classes: VC dimension
- Quantizing a model class
- Capacity of a model class, revisited
- Definition of VC dimension
- VC dimension and sample complexity
PAC Bayes and information-theoretic bounds
- To recap: generalization error in PAC learning
- Stochastic training algorithms
- PAC Bayes generalization bounds
- Information-theoretic generalization bounds
Summary
Recommended resources
Problems
References

9 Exponential family of distributions

Overview
Definitions and examples
- Definition of the exponential family
- Examples
- Natural and mean parameters
Exponential-family distributions as maximum-entropy models
Gradient of the log-loss
- Gradient and mean error
- Proof of the relationship between gradient and mean error
- From natural parameters to mean parameters and possibly back
ML learning
- Gradient of the training log-loss
- Mean matching condition
- Gradient descent-based ML learning
Information-theoretic metrics
- Entropy
- KL divergence
- Cross entropy
Fisher information matrix (FIM)
- Definition of the Fisher information matrix
- FIM and quadratic approximation of the KL divergence
- FIM as covariance of the gradient of the log-loss
- Natural gradient descent
- FIM for the exponential family
Generalized linear models
- Definitions
- Logistic regression as a generalized linear model
- Softmax regression as a generalized linear model
- Probit regression
- Generalized linear models for unsupervised learning
Summary
Recommended resources
Problems
References

10 Variational inference and variational expectation maximization

Overview
Variational inference, amortized variational inference, and variational EM
- Variational inference
- Amortized VI
- Variational Expectation Maximization
Exact Bayesian inference
- Conjugate exponential family
- Beta-Bernoulli model
- Table of conjugate exponential family distributions
- Conjugate models outside the exponential family
Laplace approximation
Introducing variational inference
Mean-field variational inference
- Mean-field factorization of the variational posterior
- Coordinate descent optimization
- Mean-field VI for the Ising model
- Generalizing mean-field VI
Introducing parametric variational inference
Black-box variational inference
- Stochastic optimization
- REINFORCE gradient
- Some properties of the REINFORCE gradient
- REINFORCE gradient algorithm
- Selecting the baseline
- Black-box variational inference
Reparametrization-based variational inference
- Reparametrizability of a parametric variational posterior
- Reparametrization trick
- Gaussian reparametrization trick
- Reparametrization-based gradient algorithm
- Reparametrization-based variational inference
- Gradient of the KL term
- Reparametrization-based VI with Gaussian prior and variational posterior
- Broadening the applicability of reparametrization-based VI
Combining factorization and parametrization for variational inference
Particle-based variational inference and Stein variational gradient descent
- Particle-based variational inference
- Stein variational gradient descent
Amortized variational inference
- Black-box amortized variational inference
- Reparametrization-based amortized variational inference
Variational expectation maximization
- A brief recap of the EM algorithm
- Variational EM
- Black-box VEM
- Variational autoencoders
Summary
Recommended resources
Problems
References

11 Information-theoretic inference and learning

Overview
I-projection and M-projection
- Defining I-projection and M-projection
- Mode-seeking vs. inclusive approximations
- Bridging the gap between I-projection and M-projection
Generalized variational inference and generalized variational expectation maximization
- Generalized variational inference
- Generalized variational expectation maximization
- Solving the GVI and GVEM problems: generalized posterior distributions
- Approximate solution of GVI problem
Maximum entropy learning
- Defining maximum entropy learning
- Addressing maximum entropy learning
InfoMax
- Defining InfoMax
- Addressing InfoMax
Information bottleneck
Rate-distortion encoding
Two-sample estimation of information-theoretic metrics
- Two-sample estimation of the KL divergence
- Discriminator-based two-sample estimators of the KL divergence
- Classifier-based two-sample estimators of the KL divergence
- Properties of two-sample estimators of the KL divergence
- Estimation of the mutual information
Beyond the KL divergence: f-divergences and two-sample estimators
- f-divergences
- Two-sample estimation of f-divergences
Generative adversarial networks (GANs)
- GAN model
- GAN learning problem
- GAN as a min-max optimization problem
- GAN training as a game between generator and discriminator
- Integral probability metrics-based GANs
Distributionally robust learning
Summary
Recommended resources
Problems
References

12 Bayesian learning

Overview
Frequentist learning and calibration
- Frequentist learning via ML
- Calibration
Basics of Bayesian learning
- Bayesian learning as inference
- Bayesian vs. frequentist learning
- Gibbs sampling and Bayesian ensembling
- Marginal likelihood and marginal training log-loss
Why Bayesian learning?
- Bayesian learning quantifies epistemic uncertainty and can detect “out-of-distribution” data
- Bayesian learning is robust to overfitting and enables model selection without validation
- Bayesian learning is optimal if the model is well specified
- Bayesian learning minimizes a bound on the population loss (even for misspecified models)
- Bayesian learning facilitates online learning
- Bayesian learning facilitates active learning
Exact Bayesian learning
- When x and t are independent
- Gaussian-Gaussian GLM
Laplace approximation
- Laplace approximation fits a Gaussian pdf at the MAP solution
- Simplification of Laplace approximation via the empirical FIM
- Simplification of Laplace approximation via the generalized Gauss-Newton method
MC sampling-based Bayesian learning
- Goal of MC sampling
- Rejection sampling
- Markov chain Monte Carlo: Metropolis-Hastings algorithm
- Stochastic gradient Markov chain Monte Carlo
- Monte Carlo methods for likelihood-free models
- Importance sampling
Variational Bayesian learning
- Reparametrization-based variational Bayesian learning
- Adversarial variational Bayesian learning
- Particle-based variational Bayesian learning
Model selection without validation: Empirical Bayes
- Empirical Bayes, aka type-II ML
- Gaussian-Gaussian GLM
- Laplace approximation for empirical Bayes
Bayesian non-parametric learning
- Gaussian-Gaussian GLM revisited
- Gaussian processes
- Dirichlet process mixture models
Bayesian learning with local latent variables
Generalized Bayesian learning
Summary
Recommended resources
Problems
References

13 Transfer learning, multi-task learning, continual learning, and meta-learning

Overview
Transfer learning
- Defining transfer learning
- Covariate shift
- General shift
- Likelihood ratio-based transfer learning
Multi-task learning
- Inductive bias assumed by multi-task learning
- Two-level ERM formulation
- Non-parametric multi-task learning
Continual learning
- Defining continual learning
- Stability vs. plasticity
- Coreset-based continual learning
- Regularization-based continual learning
- Extensions
Meta-learning
- Meta-training and meta-testing
- Reviewing conventional learning
- Joint learning
- Introducing meta-learning
- An improved meta-learning criterion: splitting the meta-training set
- MAML: meta-learning the initialization
- Meta-inductive bias
Bayesian perspective on transfer, multi-task, continual, and meta-learning
- Bayesian multi-task learning
- Bayesian transfer learning
- Bayesian continual learning: exact solutions
- Bayesian continual learning: approximate solutions
- Bayesian meta-learning
Summary
Recommended resources
References

14 Federated learning

Overview
Frequentist federated learning
- Parameter server setting
- Single-agent learning
- Global learning
- One-shot, or “embarrassingly parallel”, FL
- FedSGD
- Local SGD-based FL
- Aggregation function at the server
- Selecting the set of active agents
- Reducing the communication overhead
- Testing the performance of FL
- Convergence properties of FedAvg
- Beyond FedAvg
Private federated learning
- Differential privacy
- Gaussian mechanism
- Gaussian mechanism for federated learning
Bayesian federated learning
- Defining (generalized) Bayesian federated learning
- Federated variational inference: model and definitions
- Federated variational inference with unconstrained approximate likelihoods
- Federated variational inference with exponential-family models
Summary
Recommended resources
References

Part V Epilogue

15 Beyond this book

Overview
Probabilistic graphical models
- Bayesian networks
- Markov random fields
Adversarial attacks
Causality
Quantum machine learning
Machine unlearning
General AI?
Recommended resources
References

Cite as: O. Simeone, "Machine Learning for Engineers," Cambridge University Press, 2022 . [O. Simeone, Machine learning for Engineers, Cambridge University Press, 2022. ]

For feedback, please contact me directly.

Page updated

Google Sites

Report abuse