Neural Networks and Deep Learning

December 2018

Module 1: Introduction to deep learning

What is a neural network?

ReLU - Rectified Linear Unit.

Nodes work themselves out based on the inputs.

Supervised learning with Neural Networks

Standard NN

CNN - For photo tagging

RNN - For time series

Why is Deep Learning taking off?

m - Size of training data

Gradient descent - moving from the sigmoid to the ReLU activation function speeds up training.

Module 2: Neural Network Basics

Binary classification

Logistic regression

n_x - number of features in the input vector x

Logistic regression applies the sigmoid function to a linear function of the input: yhat = sigmoid(w^T x + b)

sigmoid(z) = 1 / (1 + e^-z)

Logistic regression cost function

Loss (error) function: L(yhat, y) = -(y log(yhat) + (1 - y) log(1 - yhat)). This is for one example.

Cost function J(w, b) = (1/m) sum over i of L(yhat^(i), y^(i)) - the average of the loss function over the training set.
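
As a rough numpy sketch of the loss and cost above (the vectorized form and variable names are my own):

import numpy as np

def loss(yhat, y):
    # Cross-entropy loss, computed element-wise per example
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def cost(yhat, y):
    # J(w, b): average of the per-example losses over the m examples
    m = y.shape[0]
    return np.sum(loss(yhat, y)) / m

print(cost(np.array([0.9, 0.2]), np.array([1, 0])))  # small cost: predictions close to labels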

Gradient descent

To learn w and b parameters.

Start with initial values (e.g. zeros), then each step moves down the steepest gradient to find the minimum (for this convex cost function, the global optimum).

w := w - alpha dJ(w)/dw. alpha is the learning rate.
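
A minimal sketch of this update rule on a toy one-dimensional cost J(w) = (w - 3)^2 (the toy function, starting point, and learning rate are my own choices):

w = 0.0                  # start at 0
alpha = 0.1              # learning rate
for i in range(100):
    dw = 2 * (w - 3)     # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * dw   # w := w - alpha * dJ(w)/dw
print(w)                 # approaches the global optimum at w = 3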

Derivatives, More Derivative Examples

Computation Graph

e.g. J(a,b,c) = 3(a + bc)

u = bc (b and c input)

v = a + u (a and u input)

J = 3v (v input)

Derivatives with a Computation Graph

dJ/dv = 3

dJ/da = d(3(a + u))/da = 3 = dJ/dv * dv/da (chain rule) = 3 * 1

dJ/du = d(3(a + u))/du = 3 = dJ/dv * dv/du

dJ/db = d(3(a + b*c))/db = 3c = 6 (c is 2) = dJ/du * du/db = 3 * 2

dJ/dc = d(3(a + b*c))/dc = 3b = 9 (b is 3) = dJ/du * du/dc = 3 * 3

In code, d(FinalOutputVar)/d(var) is written as dvar, e.g. dJ/da -> da.
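
A small sketch of the graph above using that dvar convention (b = 3 and c = 2 are from the example; a = 5 is an assumed value):

# Forward pass through the computation graph J = 3(a + b*c)
a, b, c = 5, 3, 2
u = b * c                  # u = 6
v = a + u                  # v = 11
J = 3 * v                  # J = 33

# Backward pass (chain rule), naming each dJ/dvar as dvar
dv = 3                     # dJ/dv
da = dv * 1                # dJ/da = dJ/dv * dv/da = 3
du = dv * 1                # dJ/du = dJ/dv * dv/du = 3
db = du * c                # dJ/db = dJ/du * du/db = 3 * 2 = 6
dc = du * b                # dJ/dc = dJ/du * du/dc = 3 * 3 = 9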

Logistic Regression Gradient Descent

z = w^Tx + b

y = a = sigma(z)

L(a, y) = -(y log(a) + (1 - y) log(1 - a))

https://www.coursera.org/learn/neural-networks-deep-learning/discussions/all/threads/TreP-qH9EeeHLQ45ByZZBg

alpha is learning rate

w1 := w1 - alpha * dw1

b := b - alpha * db

Gradient Descent on m Examples
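
A hedged numpy sketch of one vectorized gradient descent step over m examples (shapes follow the course convention X: (n_x, m), Y: (1, m); the function name is my own):

import numpy as np

def gradient_step(w, b, X, Y, alpha):
    # X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # (1, m)
    A = 1 / (1 + np.exp(-Z))      # sigmoid activations, (1, m)
    dZ = A - Y                    # (1, m)
    dw = np.dot(X, dZ.T) / m      # (n_x, 1)
    db = np.sum(dZ) / m           # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return w, b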

Vectorization and Broadcasting

(m,n) + (1,n) -> (m, n)

(m, n) + (m, 1) -> (m, n)

(m, 1) + R -> (m, 1)
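
A quick numpy check of the three broadcasting rules above (the concrete shapes are arbitrary examples):

import numpy as np

A = np.ones((3, 4))                   # (m, n) with m = 3, n = 4
print((A + np.ones((1, 4))).shape)    # (1, n) is broadcast down the rows   -> (3, 4)
print((A + np.ones((3, 1))).shape)    # (m, 1) is broadcast across columns  -> (3, 4)
print((np.ones((3, 1)) + 5).shape)    # (m, 1) + scalar (real number)       -> (3, 1)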

Logistic regression cost function

yhat = sigma(w^T x + b)

sigma(z) = 1 / (1 + e^-z)

if y = 1 p(y|x) = yhat

if y = 0 p(y|x) = 1 - yhat

p(y|x) = yhat^y * (1 - yhat)^(1 - y)

log p(y|x) = log(yhat^y * (1 - yhat)^(1 - y)) = y log(yhat) + (1 - y) log(1 - yhat) = -L(yhat, y) (the minus sign appears because maximizing the likelihood corresponds to minimizing the loss).

minimize cost = maximize likelihood


L1 loss = np.sum(np.abs(y - yhat))

L2 loss = np.dot(y - yhat, y - yhat)

Module 3: Shallow Neural Network

Neural Network Overview

What is a neural network?

Logistic regression: z = w^T x + b -> a = sigmoid(z) = yhat -> L(a, y)

A neural network repeats multiple z-like and a-like calculations, once per node, layer by layer, e.g. for the first layer z^[1] = W^[1]x + b^[1]. Not to be confused with x^(1), which is a training example. z^[2] belongs to the second layer.

Neural Network Representation

Input layer (x1, x2, x3), a^[0] = X (activation);

hidden layer (values not seen in the training set): a^[1] = (a^[1]_1, ..., a^[1]_4), with parameters w^[1], b^[1], where w^[1] is (4, 3) and b^[1] is (4, 1) - 4 nodes in the hidden layer and 3 features in the input;

output layer -> yhat = a^[2], with parameters w^[2], b^[2], where w^[2] is (1, 4) and b^[2] is (1, 1). This is a 2-layer NN (don't count the input layer).

Computing a Neural Network's Output

a^[l]_i, where l is the layer and i is the node within the layer.

Vectorize the hidden layer's weights into a matrix, e.g. W^[1] is (4, 3).

z^[1] = W^[1] * x + b^[1]; a^[1] = sigmoid(z^[1]). x = a^[0].

Vectorizing across multiple examples

Stack the training examples x^(1), ..., x^(m) as the columns of X; forward prop on x^(i) gives a^[2](i) = yhat^(i).

Z^[1] = W^[1]X + b^[1]

A^[1] = sigmoid(Z^[1])

Z^[2] = W^[2]A^[1] + b^[2]

A^[2] = sigmoid(Z^[2])

horizontal = training examples

vertical = hidden units.
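
A numpy sketch of this vectorized forward pass for the 2-layer network (shapes as above: the columns of X are the m examples; the function and parameter names are my own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Horizontal index = training examples, vertical index = hidden units
    Z1 = np.dot(W1, X) + b1     # (n1, m)
    A1 = sigmoid(Z1)            # (n1, m)
    Z2 = np.dot(W2, A1) + b2    # (1, m)
    A2 = sigmoid(Z2)            # (1, m), yhat for every example
    return A2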

Activation functions

Change sigma to g to represent other activation functions

e.g. tanh for the hidden layer: z^[1] is squashed into (-1, 1), which gives roughly zero-mean activations (centring the data). a = tanh(z) = (e^z - e^-z) / (e^z + e^-z).

Sigmoid function for the output layer, as it gives values between 0 and 1 (binary classification). a = sigma(z) = 1 / (1 + e^-z). However, when z is very large or very small the gradient is small, which slows down gradient descent.

Rectified linear unit (ReLU) a = max(0, z). Gradient is 1 or 0. Good one to go for.

Leaky ReLU. Slight slope when z < 0: a = max(0.01z, z).

Why do you need non-linear activation functions?

g(z) = z - Linear activation function

a^[2] = (W^[2]W^[1])x + (W^[2]b^[1] + b^[2]).

If the hidden layer uses a linear activation function, the network is just the same as logistic regression. A linear activation function can be used in the output layer when it is a regression problem, e.g. predicting house prices.

Derivatives of activation functions

Sigmoid: g(z) = 1 / (1 + e^-z); its derivative simplifies to g'(z) = g(z) * (1 - g(z)).

Tanh(z) = (e^z - e^-z) / (e^z + e^-z). d/dz = 1 - (tanh(z))^2. https://www.wolframalpha.com/input/?i=tanh(z)

ReLU g(z) = max(0, z). g'(z) = 0 if z < 0, 1 if z > 0. Technically undefined at z = 0, but in practice take g'(z) = 1 for z >= 0.

Leaky ReLU g(z) = max(0.01z, z). g'(z) = 0.01 if z < 0 and 1 if z > 0.
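
A numpy sketch of these activation functions and their derivatives (the 0.01 leaky slope is the one used above; function names are my own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)               # g'(z) = g(z)(1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2       # d/dz tanh(z)

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z >= 0).astype(float)    # 0 if z < 0, 1 if z >= 0 (choosing 1 at z = 0)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z >= 0, 1.0, 0.01)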

Gradient descent for neural networks

With one hidden layer.

Parameters: W^[1], b^[1], W^[2], b^[2].

Layer sizes: n^[0] = n_x, n^[1] hidden units, n^[2] = 1.

Cost function: J(W^[1], b^[1], W^[2], b^[2]) = 1/m sum(L(yhat = a^[2], y)), where m is the number of training examples.

dW^[1] = dJ/dW^[1].


Forward prop: Z^[1] = W^[1]X + b^[1]

A^[1] = g^[1](z^[1])

Z^[2] = W^[2]A^[1] + b^[2]

A^[2] = g^[2](Z^[2]) = sigma(Z^[2])


Back prop: dZ^[2] = A^[2] - Y; Y = [y^(1), ..., y^(m)]

dW^[2] = 1/m dZ^[2] A^[1]^T

db^[2] = 1/m np.sum(dZ^[2], axis=1, keepdims=True)

dZ^[1] = W^[2]^T dZ^[2] * g^[1]'(Z^[1]) (element-wise product)

dW^[1] = 1/m dZ^[1] X^T

db^[1] = 1/m np.sum(dZ^[1], axis=1, keepdims=True).
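
A numpy sketch stringing these forward- and back-prop equations together for one hidden layer (tanh in the hidden layer, sigmoid output; function and variable names are my own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    # Forward prop
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                            # g^[1] = tanh
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)                            # g^[2] = sigmoid
    # Back prop
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)     # g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2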

Backpropagation intuition

d/da log(a) = 1/a

g(Z) = sigma(Z)

Random Initialization

Zero weights don't work with hidden units, though they do work for logistic regression.

Zero weights cause the units in the hidden layer to be symmetric - they all compute the same function and receive the same updates.

W^[1] = np.random.randn(2, 2) * 0.01

b^[1] = np.zeros((2, 1))

Use small weights, otherwise activations end up on the flat parts of activation functions such as tanh, where gradients are tiny.
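
A sketch generalizing this initialization to an arbitrary list of layer sizes (the layer_dims argument and dictionary layout are my own):

import numpy as np

def initialize_parameters(layer_dims):
    # layer_dims = [n^[0], n^[1], ..., n^[L]]
    params = {}
    for l in range(1, len(layer_dims)):
        # Small random weights break symmetry; zeros are fine for the biases
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params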


Module 4: Deep Neural Network

Deep L-layer Neural network

L = Number of layers

n^[l] = # units in layer l

n^[0] = size of the input layer.

a^[l] = activations in layer l = g^[l](Z^[l])

W^[l] = Weights for computing Z^[l].

Forward propagation in a deep network

Loop over layers.
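
A sketch of that loop, assuming a params dict like W1, b1, ..., WL, bL and ReLU hidden layers with a sigmoid output (these choices are assumptions):

import numpy as np

def model_forward(X, params, L):
    # L = number of layers; params holds W1..WL and b1..bL
    A = X                                   # A^[0] = X
    for l in range(1, L + 1):
        Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
        if l < L:
            A = np.maximum(0, Z)            # ReLU in the hidden layers
        else:
            A = 1 / (1 + np.exp(-Z))        # sigmoid in the output layer
    return A                                # A^[L] = yhat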

Getting your matrix dimensions right

The first layer has three units, n^[1] = 3, therefore Z^[1] has shape (3, 1) = (n^[1], 1).

X (n^[0], 1) = (2, 1) as there are 2 input features.

You can therefore work out the shape of W^[1]:

(3, 1) = (rows, cols) * (2, 1). Therefore rows = 3, cols = 2, so W^[1] = (n^[1], n^[0]).

W^[2] = (n^[2], n^[1])

W^[L] = (n^[L], n^[L-1])

b^[1] = (n^[1], 1); b^[L] = (n^[L], 1)

dW^[L] = (n^[L], n^[L-1])

db^[L] = (n^[L], 1)

a^[L] = g^[L](Z^[L])


In vectorized version

Z^[1] = (n^[1], m)

X = (n^[0], m)

b^[1] = (n^[1], 1), broadcast across the m columns to (n^[1], m)
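
A quick way to check these dimensions in code (using the shapes from this example: 2 input features, 3 hidden units, m examples):

import numpy as np

m = 10                        # number of examples (arbitrary)
X = np.random.randn(2, m)     # (n^[0], m)
W1 = np.random.randn(3, 2)    # (n^[1], n^[0])
b1 = np.zeros((3, 1))         # (n^[1], 1), broadcast over the m columns
Z1 = np.dot(W1, X) + b1
assert Z1.shape == (3, m)     # (n^[1], m)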

Why deep representations?

Early layers provide insight on edges (i.e. feature detection). The next layer groups those edges and identifies eyes, nose, etc. The third layer can recognize faces. Simple to complex representation.

Circuit theory (logic gates): some functions can be computed with a small number of layers, whereas a shallow network with few layers would need exponentially many units.

ie. https://en.wikipedia.org/wiki/XOR_gate

Building blocks of deep neural networks

For layer L: W^[L], b^[L]

Forward: input a^[L - 1], output a^[L]

Z^[L] = W^[L]a^[L - 1] + b^[L]; cache this

a^[L] = g^[L](Z^[L])


Backwards: input: da^[L], output da^[L - 1], dW^[L]; db^[L]

Forward and Backward Propagation

Forward:

Z^[L] = W^[L] * A^[L - 1] + b^[L]

A^[L] = g^[L](Z^[L])

Backward:

dZ^[L] = dA^[L] * g^[L]'(Z^[L]) (element-wise)

dW^[L] = (1 / m) * dZ^[L] * A^[L - 1]^T

db^[L] = (1 / m) * np.sum(dZ^[L], axis=1, keepdims=True)

dA^[L - 1] = W^[L].T * dZ^[L]


Logistic regression da^[L] = - y / a + (1 - y) / (1 - a)
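
A hedged sketch of one layer's backward step built from the formulas above (g_prime is that layer's activation derivative; the cache layout is an assumption):

import numpy as np

def layer_backward(dA, cache, g_prime):
    # cache = (A_prev, W, Z) stored during this layer's forward pass
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                          # element-wise
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)                     # fed into layer L - 1
    return dA_prev, dW, db

# At the output layer, start backprop with dA^[L] = -(Y / A) + (1 - Y) / (1 - A)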

Parameters vs Hyperparameters

Parameters: W, b

Hyperparameters: learning rate, number of iterations, number of hidden layers, number of hidden units, choice of activation function (and later: momentum, mini-batch size, regularization parameters).

What does this have to do with the brain?