December 2018
ReLU - Rectified Linear Unit.
Nodes work themselves out based on the inputs.
Standard NN
CNN - For photo tagging
RNN - For time series
m - Size of training data
Gradient descent - switching from a sigmoid to a ReLU activation function speeds up training.
Logistic regression
n_x - Number of values (features) in the input vector x
Apply the sigmoid function to linear regression: yhat = sigmoid(w^T x + b)
sigmoid(z) = 1 / (1 + e^-z)
Loss (error) function: L(yhat, y) = -(y log(yhat) + (1-y)log(1-yhat)). This is for one example
Cost function J(w,b) = Average of loss function
To learn w and b parameters.
Start with initial values (e.g. 0), then each step moves down the steepest gradient to find the minimum (global optimum).
w := w - alpha dJ(w)/dw. alpha is the learning rate.
e.g. J(a,b,c) = 3(a + bc)
u = bc (b and c input)
v = a + u (a and u input)
J = 3v (v input)
dJ/dv = 3
dJ/da = d/da [3(a + u)] = 3 = dJ/dv * dv/da (chain rule) = 3 * 1
dJ/du = d/du [3(a + u)] = 3 = dJ/dv * dv/du
dJ/db = d/db [3(a + b*c)] = 3c = 6 (c is 2) = dJ/du * du/db = 3 * 2
dJ/dc = d/dc [3(a + b*c)] = 3b = 9 (b is 3) = dJ/du * du/dc = 3 * 3
In code, d(FinalOutputVar)/d(var) is written dvar, e.g. dJ/da -> da
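A minimal numeric check of this computation graph (the value of a isn't given above; a = 5 is assumed here, giving u = 6, v = 11, J = 33):

a, b, c = 5.0, 3.0, 2.0

# Forward pass through the graph
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# Backward pass (chain rule), using the dvar convention: dvar = dJ/dvar
dv = 3.0           # dJ/dv
da = dv * 1.0      # dJ/dv * dv/da = 3
du = dv * 1.0      # dJ/dv * dv/du = 3
db = du * c        # dJ/du * du/db = 3 * 2 = 6
dc = du * b        # dJ/du * du/dc = 3 * 3 = 9
print(J, da, db, dc)   # 33.0 3.0 6.0 9.0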
z = w^Tx + b
y = a = sigma(z)
L(a, y) = -(y log(a) + (1 - y) log(1 - a))
alpha is learning rate
w1 := w1 - alpha * dw1
b := b - alpha * db
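A rough vectorized sketch of one gradient descent step for logistic regression over m examples (the variable names are mine, not the course code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, Y, alpha):
    # X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)      # (1, m) predictions yhat
    dZ = A - Y                           # dL/dz for each example
    dw = np.dot(X, dZ.T) / m             # (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw                   # w := w - alpha * dw
    b = b - alpha * db                   # b := b - alpha * db
    return w, b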
(m,n) + (1,n) -> (m, n)
(m, n) + (m, 1) -> (m, n)
(m, 1) + R -> (m, 1)
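A quick numpy check of these broadcasting rules (shapes picked arbitrarily, m = 3 and n = 4):

import numpy as np

A = np.zeros((3, 4))
print((A + np.ones((1, 4))).shape)   # (m, n) + (1, n) -> (3, 4)
print((A + np.ones((3, 1))).shape)   # (m, n) + (m, 1) -> (3, 4)
print((np.zeros((3, 1)) + 5).shape)  # (m, 1) + scalar  -> (3, 1)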
yhat = sigma(w^T x + b)
sigma(z) = 1 / (1 + e^-z)
if y = 1 p(y|x) = yhat
if y = 0 p(y|x) = 1 - yhat
p(y|x) = yhat^y * (1-yhat)^(1-y)
log p(y|x) = log[yhat^y * (1-yhat)^(1-y)] = y log(yhat) + (1-y)log(1 - yhat) = -L(yhat, y) (minus because we minimize the cost function, which maximizes the log likelihood).
minimize cost = maximize likelihood
L1 loss = np.sum(np.abs(y - yhat))
L2 loss = np.dot(y - yhat, y - yhat)
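For example, with made-up y and yhat vectors:

import numpy as np

y = np.array([1, 0, 0, 1, 1])
yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
L1 = np.sum(np.abs(y - yhat))       # sum of absolute differences = 1.1
L2 = np.dot(y - yhat, y - yhat)     # sum of squared differences = 0.43
print(L1, L2)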
What is a neural network?
Logistic regression: z = w^T x + b -> a = sigmoid(z) = yhat -> L(a, y)
Each node in the first layer does a z-like calculation followed by an a-like calculation; the same happens in the second layer. E.g. first layer: z^[1] = W^[1]x + b^[1]. Not to be confused with x^(1), which is a training example. z^[2] is the second layer.
Input layer (x1, x2, x3), a^[0] = X (activation);
hidden layer (values not seen in the training set): a^[1] = (a^[1]_1, ..., a^[1]_4), with W^[1], b^[1] where W^[1] is (4, 3) and b^[1] is (4, 1) - 4 nodes in the hidden layer and 3 features in the input;
output layer -> yhat = a^[2], with W^[2], b^[2] where W^[2] is (1, 4) and b^[2] is (1, 1). This is a 2-layer NN (don't count the input layer).
a^[l]_i: l is the layer and i is the node within the layer.
Vectorize the hidden layer into a matrix, e.g. W^[1] is (4, 3).
z^[1] = W^[1] * x + b^[1]; a^[1] = sigmoid(z^[1]). x = a^[0].
For each training example: x^(1) -> a^[2](1) = yhat^(1), etc.
Z^[1] = W^[1]X + b^[1]
A^[1] = sigmoid(Z^[1])
Z^[2] = W^[2]A^[1] + b^[2]
A^[2] = sigmoid(Z^[2])
horizontal = training examples
vertical = hidden units.
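A minimal sketch of this vectorized forward pass for the 2-layer network (the layer sizes here are just an example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_1, m = 3, 4, 10                          # 3 features, 4 hidden units, 10 examples
X = np.random.randn(n_x, m)                     # columns = training examples
W1, b1 = np.random.randn(n_1, n_x) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(1, n_1) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                         # (4, 10): rows = hidden units, cols = examples
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2                        # (1, 10)
A2 = sigmoid(Z2)                                # yhat for all m examples at once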
Change sigma to g to represent other activation functions
e.g. tanh(z^[1]), which ranges from -1 to 1, gives roughly zero-mean activations (centering the data) for the hidden layer. a = tanh(z) = (e^z - e^-z) / (e^z + e^-z).
Sigmoid function for the output layer as it gives a value between 0 and 1 (binary classification). a = sigma(z) = 1 / (1 + e^-z). However, when z is very large or very small the gradient is small, which slows down gradient descent.
Rectified linear unit (ReLU): a = max(0, z). Gradient is 1 or 0. A good default choice.
Leaky ReLU: slight slope when z < 0. a = max(0.01z, z).
g(z) = z - Linear activation function
a^[2] = (W^[2]W^[1])x + (W^[2]b^[1] + b^[2]).
If the hidden layer uses a linear activation function, the network is equivalent to logistic regression. A linear activation function can be used in the output layer if it is a regression problem, e.g. house prices.
Sigmoid: g(z) = 1 / (1 + e^-z). g'(z) = e^-z / (1 + e^-z)^2 = g(z) * (1 - g(z)).
Tanh(z) = (e^z - e^-z) / (e^z + e^-z). d/dz = 1 - (tanh(z))^2. https://www.wolframalpha.com/input/?i=tanh(z)
ReLU g(z) = max(0, z). g'(z) = 0 if z < 0, 1 if z > 0. Technically undefined at z = 0, but in practice you can set g'(0) to 0 or 1.
Leaky ReLU g(z) = max(0.01z, z). g'(z) = 0.01 if z < 0 and 1 if z > 0.
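A sketch of these activations and their derivatives as numpy functions (the g'(0) choices follow the conventions noted above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)             # g(z) * (1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2     # 1 - tanh(z)^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)   # 0 for z <= 0, 1 for z > 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z > 0, 1.0, 0.01)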
With one hidden layer.
Parameters: W^[1], b^[1], W^[2], b^[2].
Layer sizes: n_x = n^[0] (input features), n^[1] (hidden units), n^[2] = 1 (output unit).
Cost function: J(W^[1], b^[1], W^[2], b^[2]) = (1/m) sum over i of L(yhat^(i) = a^[2](i), y^(i)), where m is the number of training examples.
dW^[1] = dJ/dW^[1].
Forward prop: Z^[1] = W^[1]X + b^[1]
A^[1] = g^[1](Z^[1])
Z^[2] = W^[2]A^[1] + b^[2]
A^[2] = g^[2](Z^[2]) = sigma(Z^[2])
Back prop: dZ^[2] = A^[2] - Y; Y = [y^(1), ..., y^(m)]
dW^[2] = 1/m dZ^[2] A^[1]^T
db^[2] = 1/m np.sum(dZ^[2], axis=1, keepdims=True)
dZ^[1] = W^[2]^T dZ^[2] * g^[1]'(Z^[1]) (elementwise product)
dW^[1] = 1/m dZ^[1] X^T
db^[1] = 1/m np.sum(dZ^[1], axis=1, keepdims=True).
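Putting forward and back prop together for the one-hidden-layer case (a sketch; g^[1] is assumed to be tanh here, so g^[1]'(Z^[1]) = 1 - A^[1]^2):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    # Forward prop
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                            # g^[1] = tanh (assumed)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)                            # g^[2] = sigmoid
    # Back prop
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)     # elementwise * g^[1]'(Z1)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2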
d/da log(a) = 1/a
g(Z) = sigma(Z)
Initializing the weights to 0 doesn't work when there are hidden units. It does work for logistic regression, though.
Zero weights cause symmetry: every unit in the hidden layer computes the same function and gets the same update.
W^[1] = np.random.randn(2, 2) * 0.01
b^[1] = np.zeros((2, 1))
Use small weights, otherwise z ends up on the flat parts of the activation function (e.g. tanh), where gradients are tiny.
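A small check of the symmetry problem, plus the random initialization fix (array sizes arbitrary):

import numpy as np

X = np.random.randn(3, 5)               # 3 features, 5 examples
W1_zero = np.zeros((4, 3))              # zero-initialized weights
A1 = np.tanh(np.dot(W1_zero, X))        # every hidden unit outputs the same value (here 0)
print(np.allclose(A1[0], A1[1]))        # True -> the units are symmetric and stay symmetric

W1 = np.random.randn(4, 3) * 0.01       # small random init breaks the symmetry
b1 = np.zeros((4, 1))                   # zeros are fine for the biases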
L = Number of layers
n^[l] = # units in layer l
n^[0] = size of the input layer.
a^[l] = activations in layer l = g^[l](Z^[l])
W^[l] = Weights for computing Z^[l].
Loop over layers.
The first layer has three units, n^[1] = 3, so Z^[1] has shape (3, 1) = (n^[1], 1).
X is (n^[0], 1) = (2, 1) as there are 2 input features.
You can therefore work out the shape of W^[1]:
(3, 1) = (rows, cols) * (2, 1), therefore rows = 3 and cols = 2, i.e. W^[1] = (n^[1], n^[0])
W^[2] = (n^[2], n^[1])
W^[L] = (n^[L], n^[L-1])
b^[1] = (n^[1], 1); b^[L] = (n^[L], 1)
dW^[L] = (n^[L], n^[L-1])
db^[L] = (n^[L], 1)
a^[L] = g^[L](Z^[L]), so a^[L] and Z^[L] have the same shape.
In vectorized version
Z^[1] = (n^[1], m)
X = (n^[0], m)
b^[1] = (n^[1], 1), broadcast to (n^[1], m)
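A quick sanity check of these shape rules (example sizes n^[0] = 2, n^[1] = 3, m = 10):

import numpy as np

n0, n1, m = 2, 3, 10
X = np.random.randn(n0, m)               # (n^[0], m)
W1 = np.random.randn(n1, n0) * 0.01      # (n^[1], n^[0])
b1 = np.zeros((n1, 1))                   # (n^[1], 1), broadcast across the m columns
Z1 = np.dot(W1, X) + b1
assert Z1.shape == (n1, m)               # (n^[1], m)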
Early layers provide insight on edges (i.e. feature detection). The next layer groups those edges and identifies eyes, nose, etc. The third layer can recognize faces. Simple to complex representation.
Circuit theory relates to logic gates: there are functions you can compute with a small number of layers (a deep network) that a shallow network would need exponentially many units to compute.
ie. https://en.wikipedia.org/wiki/XOR_gate
For layer L: W^[L]. b^[L]
Forward: input a^[L - 1], output a^[L]
Z^[L] = W^[L]a^[L - 1] + b^[L]; cache this
a^[L] = g^[L](z^[L])
Backwards: input: da^[L], output da^[L - 1], dW^[L]; db^[L]
Forward:
Z^[L] = W^[L] * A^[L - 1] + b^[L]
A^[L] = g^[L](Z^[L])
Backward:
dZ^[L] = dA^[L] * g^[L]'(Z^[L]) (elementwise)
dW^[L] = (1 / m) * dZ^[L] * A^[L - 1]^T
db^[L] = (1 / m) * np.sum(dZ^[L], axis=1, keepdims=True)
dA^[L - 1] = W^[L].T * dZ^[L]
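A sketch of these per-layer building blocks (the function names are mine; g and g_prime are passed in so any of the activations above can be used):

import numpy as np

def linear_activation_forward(A_prev, W, b, g):
    # Input a^[l-1], output a^[l]; cache Z and the inputs for the backward pass
    Z = np.dot(W, A_prev) + b
    A = g(Z)
    cache = (A_prev, W, Z)
    return A, cache

def linear_activation_backward(dA, cache, g_prime):
    # Input dA^[l], output dA^[l-1], dW^[l], db^[l]
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                          # elementwise product
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db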
Logistic regression da^[L] = - y / a + (1 - y) / (1 - a)
Parameters: W, b
Hyperparameters: learning rate, number of iterations, number of hidden layers, number of hidden units, choice of activation function (later also: momentum, mini-batch size, regularization parameters).