Neural Networks and Deep Learning
December 2018
Module 1: Introduction to deep learning
What is a neural network
ReLU - Rectified Linear Unit.
Hidden nodes work out for themselves what to represent, based on the inputs.
Supervised learning with Neural Networks
Standard NN
CNN - For photo tagging
RNN - For time series
Why is Deep Learning taking off?
m - Size of training data
Gradient descent - switching the activation function from sigmoid to ReLU speeds up training.
Module 2: Neural Network Basics
Binary classification
Logistic regression
n_x - Number of values (features) in vector x
Apply the sigmoid function to a linear function of x: yhat = sigmoid(w^T x + b)
sigmoid(z) = 1 / (1 + e^-z)
Logistic regression cost function
Loss (error) function: L(yhat, y) = -(y log(yhat) + (1 - y) log(1 - yhat)). This is for one example.
Cost function J(w,b) = average of the loss function over the m training examples.
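A minimal NumPy sketch of the sigmoid, the per-example loss and the cost. The data values and the layout (examples stored as columns of X) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(yhat, y):
    """Cross-entropy loss, one value per example."""
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def cost(w, b, X, Y):
    """Average loss over the m examples stored as columns of X."""
    yhat = sigmoid(w.T @ X + b)          # shape (1, m)
    return float(np.mean(loss(yhat, Y)))

# Made-up data: n_x = 2 features, m = 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0
J = cost(w, b, X, Y)   # with zero parameters every yhat is 0.5, so J = log(2)
```

With all-zero parameters the model predicts 0.5 for every example, which is a handy sanity check before training.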
Gradient descent
To learn w and b parameters.
Start with 0 values, then each step moves down the steepest gradient towards the minimum (the global optimum, as J is convex).
w := w - alpha * dJ(w)/dw. alpha is the learning rate.
Derivatives, More Derivative Examples
Computation Graph
e.g. J(a,b,c) = 3(a + bc)
u = bc (b and c input)
v = a + u (a and u input)
J = 3v (v input)
Derivatives with a Computation Graph
dJ/dv = 3
dJ/da = d(3(a + u))/da = 3 = dJ/dv * dv/da (chain rule) = 3 * 1
dJ/du = d(3(a + u))/du = 3 = dJ/dv * dv/du = 3 * 1
dJ/db = d(3(a + bc))/db = 3c = dJ/du * du/db. With c = 2: 3 * 2 = 6
dJ/dc = d(3(a + bc))/dc = 3b = dJ/du * du/dc. With b = 3: 3 * 3 = 9
In code, d(final output variable)/d(var) is named dvar, e.g. dJ/da -> da.
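The same graph can be traced in plain Python as a sanity check. b = 3 and c = 2 come from the notes; a = 5 is an assumed value, since the notes do not give one.

```python
# Forward pass (left to right through the graph).
a, b, c = 5.0, 3.0, 2.0   # a = 5 is an assumption for illustration
u = b * c                 # u = bc
v = a + u                 # v = a + u
J = 3 * v                 # J = 3v

# Backward pass (right to left), using the "dvar" naming convention.
dv = 3.0                  # dJ/dv
da = dv * 1.0             # dv/da = 1
du = dv * 1.0             # dv/du = 1
db = du * c               # du/db = c
dc = du * b               # du/dc = b
```

Each backward line reuses a value computed one step earlier, which is the reason for working right to left.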
Logistic Regression Gradient Descent
z = w^Tx + b
yhat = a = sigma(z)
L(a, y) = -(y log(a) + (1 - y) log(1 - a))
alpha is learning rate
w1 := w1 - alpha * dw1
b := b - alpha * db
Gradient Descent on m Examples
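A rough sketch of vectorized gradient descent over m examples, assuming examples are stored as columns of X. The function name, the toy data and the default alpha and iteration count are all made up for illustration; dZ = A - Y is the standard gradient of the logistic loss with respect to z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, num_iterations=1000):
    """X has shape (n_x, m), Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(w.T @ X + b)      # forward pass, shape (1, m)
        dZ = A - Y                    # dL/dz for every example
        dw = (X @ dZ.T) / m           # average gradient, shape (n_x, 1)
        db = float(np.sum(dZ)) / m
        w = w - alpha * dw            # w := w - alpha * dw
        b = b - alpha * db            # b := b - alpha * db
    return w, b

# Toy data: the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train_logistic_regression(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
```

The only explicit loop left is the one over iterations; the loops over examples and over features are replaced by matrix operations.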
Vectorization and Broadcasting
(m,n) + (1,n) -> (m, n)
(m, n) + (m, 1) -> (m, n)
(m, 1) + R -> (m, 1)
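The three broadcasting rules above, tried on small arrays. The array values are made up; m = 2 and n = 3.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)        # shape (m, n) = (2, 3)
row = np.array([[10.0, 20.0, 30.0]])    # shape (1, n)
col = np.array([[100.0], [200.0]])      # shape (m, 1)

plus_row = A + row        # row is copied down the m rows
plus_col = A + col        # col is copied across the n columns
plus_scalar = col + 1.0   # the scalar is applied to every element
```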
Logistic regression cost function
yhat = sigma(w^T x + b)
sigma(z) = 1 / (1 + e^-z)
if y = 1 p(y|x) = yhat
if y = 0 p(y|x) = 1 - yhat
p(y|x) = yhat^y * (1 - yhat)^(1 - y)
log p(y|x) = log(yhat^y * (1 - yhat)^(1 - y)) = y log(yhat) + (1 - y) log(1 - yhat) = -L(yhat, y) (minus as we want to minimize the cost function).
minimize cost = maximize likelihood
L1 loss = np.sum(np.abs(y - yhat))
L2 loss = np.dot(y - yhat, y - yhat)
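The two one-liners above, wrapped as functions. This sketch assumes y and yhat are 1-D vectors, so that np.dot returns a scalar; the sample values are made up.

```python
import numpy as np

def l1_loss(yhat, y):
    """Sum of absolute differences."""
    return np.sum(np.abs(y - yhat))

def l2_loss(yhat, y):
    """Sum of squared differences, written as a dot product."""
    diff = y - yhat
    return np.dot(diff, diff)

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
l1 = l1_loss(yhat, y)
l2 = l2_loss(yhat, y)
```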
Module 3: Shallow Neural Network
Neural Network Overview
What is a neural network?
Logistic regression: z = w^T x + b -> a = sigmoid(z) = yhat -> L(a, y)
A neural network repeats the z-like and a-like calculations in each layer, e.g. first layer z^[1] = W^[1]x + b^[1], a^[1] = sigmoid(z^[1]); z^[2] and a^[2] are the second layer. Square brackets index layers; not to be confused with x^(1), which is a training example.
Neural Network Representation
Input layer (x1, x2, x3): a^[0] = x (activations).
Hidden layer (values not seen in the training set): a^[1] = (a^[1]_1, ..., a^[1]_4), with parameters W^[1] and b^[1], where W^[1] is (4,3) and b^[1] is (4,1): 4 nodes in the hidden layer and 3 input features.
Output layer: yhat = a^[2], with parameters W^[2] and b^[2], where W^[2] is (1,4) and b^[2] is (1,1). This is a 2 layer NN (the input layer is not counted).
Computing a Neural Network's Output
a^[l]_i: l is the layer and i is the node in that layer.
Vectorize the hidden layer by stacking each node's weights as a row of a matrix, e.g. W^[1] is (4,3).
z^[1] = W^[1] * x + b^[1]; a^[1] = sigmoid(z^[1]). x = a^[0].
Vectorizing across multiple examples
For each example i (column of X): x^(i) -> a^[2](i) = yhat^(i).
Z^[1] = W^[1]X + b^[1]
A^[1] = sigmoid(Z^[1])
Z^[2] = W^[2]A^[1] + b^[2]
A^[2] = sigmoid(Z^[2])
In Z and A: horizontal (columns) = training examples
vertical (rows) = hidden units.
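A small sketch of the vectorized forward pass for a 3-4-1 network, with a check that column 0 of A^[1] matches the single-example computation. The layer sizes follow the representation section; m = 5 and the random values are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                 # 3 features, 4 hidden units, 5 examples
X = rng.standard_normal((n_x, m))     # one training example per column
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1                      # (4, 5): rows = hidden units, columns = examples
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                     # (1, 5)
A2 = sigmoid(Z2)                      # yhat for every example

# The same computation for the first example only.
a1_first = sigmoid(W1 @ X[:, [0]] + b1)
```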
Activation functions
Change sigma to g to represent other activation functions
e.g. tanh(z^[1]) ranges from -1 to 1, giving activations with a mean near 0 (centering the data) for the hidden layer. a = tanh(z) = (e^z - e^-z) / (e^z + e^-z).
Sigmoid function for the output layer, as it gives a value between 0 and 1 (binary classification). a = sigma(z) = 1 / (1 + e^-z). However, when z is very large or very small the gradient is small, which slows down gradient descent.
Rectified linear unit (ReLU) a = max(0, z). Gradient is 1 or 0. A good default choice.
Leaky ReLU. Slight slope when z < 0. a = max(0.01z, z).
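The four activation functions written out in NumPy, as a sketch. tanh is spelled out from its definition here (NumPy also provides np.tanh), and the leaky slope of 0.01 is an assumed default.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # slope is the small gradient used when z < 0 (assumed default)
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
```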
Why do you need non-linear activation functions?
g(z) = z - Linear activation function
a^[2] = (W^[2]W^[1])x + (W^[2]b^[1] + b^[2]), which is still linear in x.
If the hidden layer uses a linear activation function then the network is no more expressive than logistic regression. A linear activation function can be used in the output layer for a regression problem, e.g. house prices.
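A quick numerical check of the collapse: two layers with g(z) = z give the same output as one linear layer with combined weights. The shapes and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

# Two layers with linear activation g(z) = z.
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# One equivalent linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
a2_single = W_combined @ x + b_combined
```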
Derivatives of activation functions
Sigmoid: g(z) = 1 / (1 + e^-z). Since d/dz of e^-z = -e^-z, g'(z) = e^-z / (1 + e^-z)^2 = g(z) * (1 - g(z)).
Tanh: g(z) = tanh(z) = (e^z - e^-z) / (e^z + e^-z). g'(z) = 1 - (tanh(z))^2. https://www.wolframalpha.com/input/?i=tanh(z)
ReLU: g(z) = max(0, z). g'(z) = 0 if z < 0, 1 if z > 0. Technically undefined at z = 0, but in practice can use 1 if z >= 0.
Leaky ReLU: g(z) = max(0.01z, z). g'(z) = 0.01 if z < 0 and 1 if z > 0.
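The derivatives in NumPy, with a finite-difference helper for checking them numerically. numerical_derivative is a hypothetical helper name, and treating the ReLU gradient at z = 0 as 1 is a convention rather than a mathematical fact.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu_prime(z):
    return (z >= 0).astype(float)        # convention: gradient 1 at z = 0

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)

def numerical_derivative(g, z, eps=1e-6):
    """Central difference approximation of g'(z)."""
    return (g(z + eps) - g(z - eps)) / (2 * eps)

z = np.array([-1.5, 0.3, 2.0])
kinks = np.array([-1.0, 0.0, 2.0])
```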
Gradient descent for neural networks
With one hidden layer.
Parameters: W^[1], b^[1], W^[2], b^[2].
n_x = n^[0] (input features), n^[1] (hidden units), n^[2] = 1 (output unit)
Cost function: J(W^[1], b^[1], W^[2], b^[2]) = 1/m sum(L(yhat, y)), with yhat = a^[2], where m is the number of training examples.
dW^[1] = dJ/dW^[1].
Forward prop: Z^[1] = W^[1]X + b^[1]
A^[1] = g^[1](Z^[1])
Z^[2] = W^[2]A^[1] + b^[2]
A^[2] = g^[2](Z^[2]) = sigma(Z^[2])
Back prop: dZ^[2] = A^[2] - Y; Y = [y^(1), ..., y^(m)]
dW^[2] = 1/m dZ^[2] A^[1]^T
db^[2] = 1/m np.sum(dZ^[2], axis=1, keepdims=True)
dZ^[1] = W^[2]^T dZ^[2] * g^[1]'(Z^[1]) (the * is element-wise)
dW^[1] = 1/m dZ^[1] X^T
db^[1] = 1/m np.sum(dZ^[1], axis=1, keepdims=True).
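A sketch of the forward and backward passes for one hidden layer. It assumes g^[1] = tanh (the notes leave g^[1] open) and a sigmoid output; the function names, layer sizes and random data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                      # g^[1] = tanh (assumed choice)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

def cost(params, X, Y):
    A2 = forward(params, X)[3]
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    Z1, A1, Z2, A2 = forward(params, X)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(Z1) = 1 - A1^2, element-wise
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))           # n_x = 3, m = 5
Y = rng.integers(0, 2, size=(1, 5))
params = [rng.standard_normal((4, 3)) * 0.5, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))]
dW1, db1, dW2, db2 = backward(params, X, Y)
```

Comparing one entry of dW1 against a finite-difference estimate of the cost is a cheap way to catch sign or transpose mistakes.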
Backpropagation intuition
d/da log(a) = 1/a
g(Z) = sigma(Z)
Random Initialization
Zero weights don't work with hidden units. They do work with logistic regression though.
Zero weights cause symmetry between the units in a hidden layer: every unit computes the same function and receives the same update.
W^[1] = np.random.randn(2, 2) * 0.01
b^[1] = np.zeros((2, 1))
Use small weights, otherwise you end up on the flat parts of the activation function (e.g. tanh or sigmoid), where gradients are small and learning is slow.
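A sketch of small random initialization, plus a look at the symmetry problem with zero weights. initialize_parameters and its arguments are hypothetical names; the Generator API is used here in place of np.random.randn, which is a swapped-in but equivalent way to draw standard normal values.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=0):
    """Small random weights, zero biases, for one hidden layer."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(n_x=2, n_h=2, n_y=1)

# Made-up inputs: 2 features, 3 examples.
X = np.array([[1.0, -2.0, 0.5],
              [0.3, 0.8, -1.0]])

# With zero weights both hidden units produce identical activations.
A1_zero = np.tanh(np.zeros((2, 2)) @ X + np.zeros((2, 1)))
# With random weights the two units differ.
A1_rand = np.tanh(params["W1"] @ X + params["b1"])
```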
Module 4: Deep Neural Network
Deep L-layer Neural network
L = Number of layers
n^[l] = number of units in layer l
n^[0] = n_x = number of units in the input layer.
a^[l] = activations in layer l = g^[l](Z^[l])
W^[l] = Weights for computing Z^[l].
Forward propagation in a deep network
Loop over layers.
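A minimal sketch of that loop. It assumes ReLU for the hidden layers and sigmoid for the output layer; the function name, the list-based parameter storage and the layer sizes are made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, weights, biases):
    """weights[l - 1] and biases[l - 1] hold W^[l] and b^[l]."""
    A = X                                   # A^[0] = X
    L = len(weights)
    for l in range(1, L + 1):
        Z = weights[l - 1] @ A + biases[l - 1]
        A = sigmoid(Z) if l == L else relu(Z)
    return A

rng = np.random.default_rng(0)
layer_dims = [2, 3, 3, 1]                   # n^[0] .. n^[3], assumed sizes
weights = [rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
           for l in range(1, len(layer_dims))]
biases = [np.zeros((layer_dims[l], 1)) for l in range(1, len(layer_dims))]
X = rng.standard_normal((2, 4))             # m = 4 examples
AL = forward_propagation(X, weights, biases)
```

The loop over layers cannot be vectorized away, unlike the loop over training examples.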
Getting your matrix dimensions right
The first layer has three units, n^[1] = 3, therefore z^[1] has shape (3, 1) = (n^[1], 1).
x has shape (n^[0], 1) = (2, 1) as there are 2 input features.
You can therefore work out the shape of W^[1]:
(3, 1) = (r, c) * (2, 1). Therefore r = 3, c = 2. W^[1] = (n^[1], n^[0])
W^[2] = (n^[2], n^[1])
W^[l] = (n^[l], n^[l-1])
b^[1] = (n^[1], 1); b^[l] = (n^[l], 1)
dW^[l] = (n^[l], n^[l-1])
db^[l] = (n^[l], 1)
a^[l] = g^[l](z^[l]), so a^[l] and z^[l] have the same shape (n^[l], 1)
In vectorized version
Z^[1] = (n^[1], m)
X = (n^[0], m)
b^[1] = (n^[1], 1), broadcast to (n^[1], m) when added
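A short check of these shape rules in NumPy. The layer sizes, m = 7 and the use of tanh are assumptions; only the shapes matter here.

```python
import numpy as np

layer_dims = [2, 3, 5, 1]     # n^[0], n^[1], n^[2], n^[3] (assumed sizes)
m = 7                         # number of training examples
rng = np.random.default_rng(0)

A = rng.standard_normal((layer_dims[0], m))    # X has shape (n^[0], m)
shapes = []
for l in range(1, len(layer_dims)):
    W = rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
    b = np.zeros((layer_dims[l], 1))           # stored as (n^[l], 1)
    Z = W @ A + b                              # b is broadcast across the m columns
    assert Z.shape == (layer_dims[l], m)
    shapes.append((W.shape, b.shape, Z.shape))
    A = np.tanh(Z)                             # same shape as Z
```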
Why deep representations?
Early layers provide insight on edges (i.e. feature detection). The next layer groups those edges and identifies eyes, nose, etc. The third layer can recognize faces. Simple to complex representation.
Circuit theory relates to logic gates, i.e. some functions can be computed by a deep network with relatively few units, where a shallow network needs many more (exponentially more) units.
e.g. the XOR of many inputs: https://en.wikipedia.org/wiki/XOR_gate
Building blocks of deep neural networks
For layer l: W^[l], b^[l]
Forward: input a^[l-1], output a^[l]
z^[l] = W^[l]a^[l-1] + b^[l]; cache z^[l] for the backward step
a^[l] = g^[l](z^[l])
Backward: input da^[l] (plus the cached z^[l]); output da^[l-1], dW^[l], db^[l]
Forward and Backward Propagation
Forward:
Z^[l] = W^[l] A^[l-1] + b^[l]
A^[l] = g^[l](Z^[l])
Backward:
dZ^[l] = dA^[l] * g^[l]'(Z^[l]) (element-wise)
dW^[l] = (1 / m) * dZ^[l] A^[l-1]^T
db^[l] = (1 / m) * np.sum(dZ^[l], axis=1, keepdims=True)
dA^[l-1] = W^[l]^T dZ^[l]
For logistic regression (sigmoid output) the backward pass starts from da^[L] = -y / a + (1 - y) / (1 - a)
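A sketch of one layer's forward and backward step with a cache, assuming a ReLU activation. layer_forward and layer_backward are hypothetical names, and the all-ones dA is only a stand-in for the gradient that would arrive from the next layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer_forward(A_prev, W, b):
    """Return the activations and a cache for the backward step."""
    Z = W @ A_prev + b
    A = relu(Z)
    cache = (A_prev, W, Z)
    return A, cache

def layer_backward(dA, cache):
    """Return dA_prev, dW, db for one layer."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                       # ReLU gradient, element-wise
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))        # n^[l-1] = 3, m = 4 (assumed)
W = rng.standard_normal((2, 3))             # n^[l] = 2
b = np.zeros((2, 1))
A, cache = layer_forward(A_prev, W, b)
dA = np.ones_like(A)                        # stand-in upstream gradient
dA_prev, dW, db = layer_backward(dA, cache)
```

Each gradient has the same shape as the quantity it belongs to, which makes shape assertions a useful first test.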
Parameters vs Hyperparameters
Parameters: W, b
Hyperparameters: learning rate, number of iterations, number of hidden layers, number of hidden units, choice of activation function; later also momentum, mini-batch size and regularization parameters.