Why deep networks?
How should you initialize your network so that it trains well?
Random matrix theory perspectives on deep neural networks
Sparsity in deep networks
Lecture 1: [Here are the slides for lecture 1] [Here is the video for lecture 1]
This lecture includes two distinct parts: the first hour covers approximation rates, followed by half an hour on how to initialize a deep network.
(1 hour) The reason for depth: exponential approximation rates. The 2015 article “Representation benefits of deep feedforward networks” by Matus Telgarsky constructs a simple deep network which, if instead approximated by a one-layer network, would require a width exponential in the depth of the deep network. “Error bounds for approximations with deep ReLU networks” by Yarotsky extends Telgarsky's construction, showing that any locally smooth function can be approximated to a prescribed accuracy \epsilon with depth proportional to \log(1/\epsilon). A numerical sketch of this depth/width separation follows the reading list below. Associated reading:
https://arxiv.org/abs/1509.08101 (full article)
https://arxiv.org/pdf/1610.01145.pdf (Pages 1-9, further time permitting).
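A minimal numerical sketch of the separation (my own illustration, not code from the papers): composing a two-unit ReLU "hat" layer k times produces a sawtooth with roughly 2^k linear pieces, whereas a single hidden layer of width w can produce at most w+1 pieces, so matching the deep construction with one layer requires width exponential in k.

```python
# Sketch of Telgarsky-style depth/width separation with ReLU networks.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    # One hidden layer of width 2: hat(x) = 2*relu(x) - 4*relu(x - 0.5) on [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, depth):
    # Compose the hat map `depth` times; each composition doubles the
    # number of linear pieces, so depth k gives roughly 2^k pieces.
    for _ in range(depth):
        x = hat(x)
    return x

x = np.linspace(0.0, 1.0, 10001)
for k in [1, 3, 5]:
    y = sawtooth(x, k)
    # Count slope sign changes as a proxy for the number of linear pieces.
    slopes = np.sign(np.diff(y))
    pieces = 1 + np.count_nonzero(np.diff(slopes))
    print(f"depth {k}: roughly {pieces} linear pieces")
```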
(30 minutes) How to randomly initialize a feedforward DNN, part 1: The 2010 article “Understanding the difficulty of training deep feedforward neural networks” by Xavier Glorot and Yoshua Bengio shows a connection between how the network weights are initialized and the pre-activation hidden-layer values as well as the gradient values. The famous Xavier initialization (the default in some popular deep learning packages) is derived and shown to stabilize the histograms of pre-activation and gradient values with respect to depth; a small sketch follows the reading list below. Associated reading:
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf (full article)
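A minimal sketch (assuming tanh activations and Gaussian Xavier weights with variance 2/(n_in + n_out); not the paper's code) comparing how the pre-activation scale behaves with depth under a naive unit-variance initialization versus Xavier initialization:

```python
# Compare pre-activation standard deviations across layers for two initializations.
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(width, depth, init, n_samples=512):
    h = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(depth):
        if init == "xavier":
            # Xavier/Glorot variance: 2 / (n_in + n_out).
            W = rng.normal(0.0, np.sqrt(2.0 / (width + width)), size=(width, width))
        else:
            # Naive: unit-variance entries.
            W = rng.normal(0.0, 1.0, size=(width, width))
        z = h @ W                      # pre-activations
        stds.append(z.std())
        h = np.tanh(z)                 # post-activations fed to the next layer
    return stds

for init in ["naive", "xavier"]:
    stds = forward_stats(width=256, depth=10, init=init)
    print(init, [f"{s:.2f}" for s in stds])
```

With the naive initialization the pre-activations blow up and saturate the tanh, whereas Xavier keeps them on an order-one scale through the layers.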
Lecture 2: [Here are the slides for lecture 2] [Here is the video for lecture 2]
How to model pre-activation values of a random feedforward DNN and how correlations between inputs evolve with depth: The 2016 article “Exponential expressivity in deep neural networks through transient chaos” by Poole et al. shows how to model the pre-activation values, and the correlation between inputs, as a function of the network weight and bias variances and the nonlinear activation (\sigma_w, \sigma_b, \phi()). This analysis lets us model the inner workings of a network and derive distributions for the pre-activation values within a deep net. Importantly, we can understand how the correlation between two inputs evolves through the layers of a network. This leads to: a more sophisticated analysis of how to initialize a deep network, the properties of a nonlinear activation which aid or hinder the initial map, and a description of how data manifolds evolve through a network. A small sketch of the resulting length and correlation maps follows the reading list below. Associated reading:
https://arxiv.org/pdf/1606.05340.pdf (primarily Sections 1-3)
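A minimal sketch (assuming a tanh activation and Monte-Carlo estimates of the Gaussian averages) of the length map and correlation map from Poole et al., iterated to show the variance fixed point q* and the depth-wise evolution of the correlation between two inputs:

```python
# Mean-field length map q^l and correlation map c^l for a wide random tanh network.
import numpy as np

rng = np.random.default_rng(1)
z1, z2 = rng.standard_normal((2, 200_000))   # shared Gaussian samples

def length_map(q, sigma_w, sigma_b, phi=np.tanh):
    # q^l = sigma_w^2 * E[phi(sqrt(q^{l-1}) z)^2] + sigma_b^2
    return sigma_w**2 * np.mean(phi(np.sqrt(q) * z1) ** 2) + sigma_b**2

def corr_map(c, q_star, sigma_w, sigma_b, phi=np.tanh):
    # Correlated Gaussian pre-activations with variance q_star and correlation c.
    u1 = np.sqrt(q_star) * z1
    u2 = np.sqrt(q_star) * (c * z1 + np.sqrt(1.0 - c**2) * z2)
    return (sigma_w**2 * np.mean(phi(u1) * phi(u2)) + sigma_b**2) / q_star

sigma_w, sigma_b = 1.5, 0.1                   # an example in the chaotic phase
q = 1.0
for _ in range(30):                           # iterate the length map to its fixed point q*
    q = length_map(q, sigma_w, sigma_b)
print("q* ≈", round(q, 3))

c = 0.99
for l in range(10):                           # correlations between nearby inputs decay with depth
    c = corr_map(c, q, sigma_w, sigma_b)
    print(f"layer {l+1}: c ≈ {c:.3f}")
```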
Lecture 3: [Here are the slides for lecture 3] [Here is the video for lecture 3]
The spectrum of the input-output map and of the back-propagation gradient can further inform us how well a random deep network can be trained. “The emergence of spectral universality in deep networks” by Pennington et al. extends the above work by Poole et al. by deriving the complete spectrum rather than just its first moments. The conditions derived by Poole et al. ensure that the mean of the spectrum remains 1 independent of depth; Pennington et al. show how the concentration of the spectrum about this mean also depends on the network parameters (\sigma_w, \sigma_b, \phi()). These results are obtained using the Stieltjes transform to compute the moments of the spectrum. Murray et al. balance these competing considerations in “Activation function design for deep neural networks: linearity and efficient initialization”, which gives principles for the selection of activation functions. The "excess optional material" listed below will also be discussed, but not covered in tutorials. A small numerical sketch of the input-output Jacobian spectrum follows the reading list below. Associated reading:
https://arxiv.org/pdf/1802.09979.pdf (full article);
https://arxiv.org/abs/2105.07741 (full article).
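A minimal numerical sketch (my own experiment, not the paper's derivation) of the input-output Jacobian J = \prod_l D^l W^l, comparing how spread out its singular value spectrum is for Gaussian versus orthogonal weights in a deep tanh network with \sigma_w = 1, \sigma_b = 0:

```python
# Singular values of the input-output Jacobian of a random deep tanh network.
import numpy as np

rng = np.random.default_rng(2)

def jacobian_singular_values(width, depth, init, sigma_w=1.0):
    h = 0.1 * rng.standard_normal(width)     # small input so pre-activations stay near the fixed point
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            Q, _ = np.linalg.qr(rng.standard_normal((width, width)))
            W = sigma_w * Q
        else:
            # i.i.d. Gaussian entries with variance sigma_w^2 / width
            W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        z = W @ h                             # pre-activations
        D = np.diag(1.0 - np.tanh(z) ** 2)    # diagonal of phi'(z) for phi = tanh
        J = D @ W @ J
        h = np.tanh(z)
    return np.linalg.svd(J, compute_uv=False)

for init in ["gaussian", "orthogonal"]:
    s = jacobian_singular_values(width=200, depth=50, init=init)
    print(f"{init:10s}: max sv {s.max():.2f}, min sv {s.min():.4f}, mean sq sv {np.mean(s**2):.2f}")
```

Orthogonal weights keep the singular values concentrated, while Gaussian weights spread them widely at large depth, which is the kind of spectral behaviour analysed in the paper.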
Lecture 4: [Here is the video for lecture 4]
This lecture is in two parts: the first hour will be covered by Prof. Tanner and the slides are here; the second hour is a guest lecture by Ilan Price and the slides can be viewed here.
Deep networks are generally severely over-parameterized, with many millions of trainable parameters. It is now well understood that, retaining the same network depth and width, many of the parameters can be set to zero and the network can still be trained to achieve nearly the same (and at times better) test accuracy. Typically around 95% of the parameters can be set to zero with little impact on the training accuracy, and at times even substantially more parameters can be removed. We review the simplest of these methods, known as pruning at initialization, and the lottery ticket hypothesis; a small sketch of magnitude pruning follows the reading list below. Associated reading:
https://arxiv.org/abs/1803.03635
https://arxiv.org/abs/1911.11134
https://arxiv.org/abs/2102.07655
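A minimal sketch (a generic magnitude-pruning mask, not the specific criteria of the cited papers) of pruning at initialization: roughly 95% of a layer's weights are set to zero by a binary mask which is then kept fixed during training:

```python
# Magnitude pruning of a freshly initialized layer to 5% density.
import numpy as np

rng = np.random.default_rng(3)

def magnitude_mask(W, density=0.05):
    # Keep the `density` fraction of entries with the largest magnitude.
    k = int(np.ceil(density * W.size))
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]
    return (np.abs(W) >= threshold).astype(W.dtype)

W = rng.normal(0.0, 0.1, size=(512, 512))     # a freshly initialized layer
mask = magnitude_mask(W, density=0.05)
W_sparse = W * mask                           # ~95% of the parameters set to zero

print("fraction of nonzero weights:", mask.mean())
# During training the mask is reapplied after every gradient update,
# so the pruned weights remain zero throughout.
```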
Excess optional material (1 hour):
While the overall loss landscape for training a DNN is typically highly non-convex, when one is sufficiently close to a local minimizer the loss landscape is generally convex. “Geometry of neural network loss surfaces via random matrix theory” by Pennington and Bahri uses the same techniques as in the prior material to determine the local shape of the loss function for random networks by measuring the fraction of positive and negative eigenvalues of the loss-landscape Hessian. In particular, they show a dependence on the degree of over-parameterization of the network (the number of parameters compared to the amount of training data), which determines how close to a local minimizer one needs to be for the loss landscape to be convex. We also briefly introduce batch normalization, introduced by Ioffe and Szegedy, which is an important method for improving the trainability of networks through more global model parameters; a small sketch of batch normalization follows below.
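A minimal sketch of batch normalization in the standard form of Ioffe and Szegedy: each pre-activation is normalized over the mini-batch and then rescaled and shifted by the learnable parameters gamma and beta:

```python
# Batch normalization for a fully connected layer's pre-activations.
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    # z: (batch_size, width) pre-activations for one layer.
    mu = z.mean(axis=0)                    # per-feature mini-batch mean
    var = z.var(axis=0)                    # per-feature mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)  # normalized pre-activations
    return gamma * z_hat + beta            # learnable scale and shift

rng = np.random.default_rng(4)
z = 5.0 + 3.0 * rng.standard_normal((128, 64))   # poorly scaled pre-activations
out = batch_norm(z, gamma=np.ones(64), beta=np.zeros(64))
print("before:", z.mean().round(2), z.std().round(2))
print("after: ", out.mean().round(2), out.std().round(2))
```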
Practical 1. A link to the associated Google CoLab worksheet is in the pdf; click where it says "Google CoLab" or "notebook". See here for a recording of this practical.
Practical 2. The Google CoLab worksheet for this practical is available here. A recording of this practical is also available; follow this link. Also here is the recording of assignment 2.
Practical 3. The CoLab notebook is available here. Also find the recording of the first part of the practical, followed by the second part.