May 2019
Image classification, object detection, Neural Style Transfer (re-render the content of one image in the style of a painting).
Detect vertical edges -> detect horizontal edges.
6x6 image. Construct a 3x3 filter (~kernel) as [[1,0,-1],[1,0,-1],[1,0,-1]]. Convolve the 6x6 with the 3x3 -> 4x4 (~sliding window): at each position take the element-wise product of the 3x3 window with the 3x3 filter and sum the 9 values.
python: conv-forward. tf: tf.nn.conv2d. keras: Conv2D
If image is 10's all on left and 0's all on right, convoluted with (1's on left, 0's in middle and -1 on right). -> 4x4 with 30's in the middle
Vertical edge filter is [[1,0,-1],[1,0,-1],[1,0,-1]]; horizontal edge filter is [[1,1,1],[0,0,0],[-1,-1,-1]].
Sobel filter [[1,0,-1],[2,0,-2],[1,0,-1]]; Scharr filter [[3,0,-3],[10,0,-10],[3,0,-3]] (vertical-edge versions; transpose for horizontal edges)
You can learn these filter values as weights using backprop. Can get edges at 45°, 73°, etc.
6x6 * 3x3 = 4x4; nxn * fxf = n-f+1 x n-f+1
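A minimal NumPy sketch of the example above (10's on the left half, 0's on the right, convolved with the vertical edge filter; conv2d_valid is my own helper name):
import numpy as np

def conv2d_valid(image, kernel):
    # "valid" convolution as used in deep learning (no filter flip): slide the f x f kernel
    # over the n x n image, take the element-wise product and sum the values
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)   # bright left half, dark right half
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)
print(conv2d_valid(image, vertical_edge))   # 4x4 output with 30's in the two middle columns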
Two downsides: the image shrinks each time, so you can only convolve a few times; and edge/corner pixels appear in few windows, so information near the border is thrown away.
Pad image with one pixel. 6x6 -> 8x8 * 3x3= 6x6. p=1 (padding)
n+2p-f+1 x n+2p-f+1
Valid convolution: no padding (n-f+1 x n-f+1)
Same convolution: pad to give same size as image. n+2p-f+1=n => p = (f-1)/2
f is usually odd.
7x7 * 3x3 with stride =2 = 3x3. Step the filter over two pixels.
nxn * fxf with padding p, stride s -> floor(((n+2p-f) / s) + 1) x floor(((n+2p-f) / s) + 1)
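A quick helper (plain Python; the function name is my own) for the output-size formula:
from math import floor

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1, applied to each spatial dimension
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))   # -> 3  (the 7x7 * 3x3, stride 2 example)
print(conv_output_size(6, 3, p=1, s=1))   # -> 6  ('same' convolution: p = (f-1)/2)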
Cross-correlation vs. convolution: a true mathematical convolution flips the filter first, which gives associativity (A*B)*C = A*(B*C); deep learning skips the flip but still calls it convolution.
On RGB image (6x6x3). Convolve with 3x3x3 = 4x4.
Multiply the numbers in all channels and sum them (27 products for a 3x3x3 filter -> one output number).
If you want to detect edges in the R channel could have numbers in the R filter and 0's in G, B part of the channel.
Multiple filters: Could end up with a 4x4x2 volume (two different filters).
n x n x n_c (channels; depth) * f x f x n_c -> (n-f+1) x (n-f+1) x n_c', where n_c' = number of filters.
6 x 6 x 3 * 3 x 3 x 3 (filter 1) -> relu(4 x 4 + b1) -> 4 x 4
6 x 6 x 3 * 3 x 3 x 3 (filter 2) -> relu(4 x 4 + b2) -> 4 x 4; stack the two -> 4 x 4 x 2
Similar to
Z^[1] = W^[1]a^[0] + b^[1]
a^[1] = g(Z^[1])
10 filters that are 3 x 3 x 3: each filter has 27 weights + 1 bias = 28 parameters, so 280 parameters in total.
f^[l] = filter size. p^[l] = padding. s^[l] = stride
Input: n_h^[l-1] x n_w^[l-1] x n_c^[l-1]
Output: n_h^[l] x n_w^[l] x n_c^[l]
n_[h/w]^[l] = floor((n_[h/w]^[l-1] + 2p^[l] - f^[l]) / s^[l] + 1)
each filter is: f^[l] x f^[l] x n_c^[l-1]
activations: a^[l] -> n_h^[l] x n_w^[l] x n_c^[l]
Batch gradient descent A^[l] -> m x n_h^[l] x n_w^[l] x n_c^[l]
Weights: f^[l] x f^[l] x n_c^[l-1] x n_c^[l]
Bias: n_c^[l], shape (1, 1, 1, n_c^[l])
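An illustrative bookkeeping helper for the layer-l quantities above (function and argument names are my own):
from math import floor

def conv_layer_shapes(n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c):
    n_h = floor((n_h_prev + 2 * p - f) / s) + 1
    n_w = floor((n_w_prev + 2 * p - f) / s) + 1
    weights = (f, f, n_c_prev, n_c)             # one f x f x n_c_prev filter per output channel
    bias = (1, 1, 1, n_c)
    n_params = f * f * n_c_prev * n_c + n_c     # weights plus one bias per filter
    return (n_h, n_w, n_c), weights, bias, n_params

print(conv_layer_shapes(39, 39, 3, f=3, p=0, s=1, n_c=10))
# -> ((37, 37, 10), (3, 3, 3, 10), (1, 1, 1, 10), 280)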
39 x 39 x 3; n_H^[0] = n_W^[0] = 39; n_c^[0] = 3
f^[l] = 3; s^[l] = 1; p^[l] = 0. 10 filters
Next layer: 37x37x10 (n+2p-f / s) + 1
f^[2] = 5; s^[2] = 2; p^[2] = 0
Next layer: 17x17x20
f^[3] = 5; s^[3] = 2; 40 filters
Next layer: 7x7x40
Flatten this to a vector and feed to a logistic regression / softmax. -> y^
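A rough Keras sketch of this 39x39x3 example network (the 10-way softmax output is an assumption; the notes only say logistic regression / softmax):
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(10, 3, strides=1, activation='relu', input_shape=(39, 39, 3)),  # -> 37x37x10
    layers.Conv2D(20, 5, strides=2, activation='relu'),                           # -> 17x17x20
    layers.Conv2D(40, 5, strides=2, activation='relu'),                           # -> 7x7x40
    layers.Flatten(),                                                             # -> 1,960 values
    layers.Dense(10, activation='softmax'),                                       # y^
])
model.summary()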
Types of layers: Convolutions (CONV), Pooling (POOL), Fully connected (FC)
Pooling: speeds up computation and makes detected features more robust.
Max pooling:
4x4 grid. Split into 2x2 grids then keep the max value in each grid -> 2x2
Hyper-parameters: f=2, s=2. Intuition: a large max means the feature (e.g. a cat whisker / an edge) was detected somewhere in that region. No parameters to learn.
5x5 with f=3, s=1 -> 3x3. Take the maximum value within each filter window.
Max pooling is done independently on each channel.
Average pooling:
Average the values in the filter.
e.g. 7x7x1000 -> 1x1x1000 (with a 7x7 filter).
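A quick shape check of the pooling behaviour above in Keras (no learned parameters involved):
import tensorflow as tf

x = tf.random.normal((1, 4, 4, 1))
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x).shape)   # (1, 2, 2, 1)

y = tf.random.normal((1, 7, 7, 1000))
print(tf.keras.layers.AveragePooling2D(pool_size=7)(y).shape)          # (1, 1, 1, 1000)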
LeNet-5-like example
32x32x3. f=5, s=1 ->
28x28x6 (CONV1). max pooling f=2, s=2 ->
14x14x6 (POOL1). Both of these can be layer 1. f=5, s=1 ->
10x10x16 (CONV2). max pooling f=2, s=2 ->
5x5x16 (POOL2). Both of these can be layer 2. Flatten this to 400 x 1 ->
120x1 (FC3). W^[3] (120,400), b^[3] (120) ->
84x1 (FC4) ->
Softmax (10 outputs).
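A Keras sketch of this LeNet-5-style network (not the original LeNet-5; ReLU and softmax are modern substitutions):
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(6, 5, strides=1, activation='relu', input_shape=(32, 32, 3)),  # CONV1 -> 28x28x6
    layers.MaxPooling2D(pool_size=2, strides=2),                                 # POOL1 -> 14x14x6
    layers.Conv2D(16, 5, strides=1, activation='relu'),                          # CONV2 -> 10x10x16
    layers.MaxPooling2D(pool_size=2, strides=2),                                 # POOL2 -> 5x5x16
    layers.Flatten(),                                                            # -> 400
    layers.Dense(120, activation='relu'),                                        # FC3
    layers.Dense(84, activation='relu'),                                         # FC4
    layers.Dense(10, activation='softmax'),                                      # 10 outputs
])
model.summary()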
Choose others hyper-parameters in the literature.
Throughout the network n_H, n_W decrease and n_C increase
Layer (hyper-params)          Activation shape   Activation size   # Parameters
Input                         (32, 32, 3)        3,072 (a^[0])     0
CONV1 (f=5, s=1, 8 filters)   (28, 28, 8)        6,272             (5*5*3 + 1) * 8  = 608
POOL1 (f=2, s=2)              (14, 14, 8)        1,568             0
CONV2 (f=5, s=1, 16 filters)  (10, 10, 16)       1,600             (5*5*8 + 1) * 16 = 3,216
POOL2 (f=2, s=2)              (5, 5, 16)         400               0
FC3                           (120, 1)           120               400*120 + 120 = 48,120
FC4                           (84, 1)            84                120*84 + 84   = 10,164
Softmax                       (10, 1)            10                84*10 + 10    = 850
32x32x3 (3,072 values), f=5, 6 filters -> 28x28x6 (4,704 values). A fully connected weight matrix between these would be huge (3,072 x 4,704 ≈ 14 million weights).
Number of parameters = (5 * 5 + 1) * 6 = 156.
Parameter sharing: A feature detector that is useful in one part of the image is useful in another part of the image.
Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
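The arithmetic behind this comparison, written out in plain Python:
fc_weights = (32 * 32 * 3) * (28 * 28 * 6)   # fully connected: 3,072 x 4,704 ≈ 14.5 million weights
conv_params = (5 * 5 + 1) * 6                # 5x5 filters + bias, 6 filters (as counted in the lecture) = 156
print(fc_weights, conv_params)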
Cat detector
(x^(1), y^(1))...(x^(m),y^(m))
Cost: J = (1/m) Σ_i L(ŷ^(i), y^(i))
Use gradient descent to optimize parameters to reduce J
Classic networks:
LeNet-5 (32x32x1) -> 5x5, s=1
28x28x6 -> avg pool, f=2, s=2
14x14x6 -> 5x5, s=1
10x10x16 -> avg pool f=2, s=2
5x5x16 (400) -> FC
120 neurons -> FC
84 neurons ->
ŷ (10 classes; a modern version would use a 10-way softmax output)
AlexNet (227x227x3) -> 11x11, s=4
55x55x96 -> max pool, s=2
27x27x96 -> 5x5, same
27x27x256 -> max pool, s=2
13x13x256 -> 3x3, same
13x13x384 -> 3x3, same
13x13x384 -> 3x3, same
13x13x256 -> max pool, s=2
6x6x256 -> FC
4096 neurons -> FC
4096 neurons ->
Softmax (1,000 classes)
Used local response normalization (LRN) to normalize values across channels - doesn't have much effect and is rarely used now.
VGG-16 (224x224x3) -> conv 64 x 2
224x224x64 -> pool
112x112x64 -> conv 128 x 2
112x112x128 -> pool
56x56x128 -> conv 256 x 3
56x56x256 -> pool
28x28x256 -> conv 512 x 3
28x28x512 -> pool
14x14x512 -> conv 512 x 3
14x14x512 -> pool
7x7x512 -> FC
4096 -> FC
4096 ->
1 (softmax 1000)
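The pre-trained VGG-16 ships with Keras, so the architecture above can be inspected directly (downloads the ImageNet weights on first use):
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights='imagenet', include_top=True)
vgg.summary()   # conv blocks of 64/128/256/512 filters with pools, then FC 4096 -> 4096 -> softmax 1000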
Deep NNs struggle with vanishing and exploding gradients.
Residual block (reference)
a^[l] -> a^[l+1] -> a^[l+2]
a^[l] -> linear -> RelU -> a^[l + 1] -> Linear -> RelU -> a^[l + 2]
z^[l + 1] = W^[l + 1] a^[l] + b^[l + 1]; a^[l + 1] = g(z^[l + 1])
z^[l + 2] = W^[l + 2] a^[l + 1] + b^[l + 2]; a^[l + 2] = g(z^[l + 2])
Add a^[l] after the second linear step, just before the second ReLU (shortcut / skip connection):
a^[l+2] = g(z^[l+2] + a^[l])
10 layers -> 5 Res blocks
In reality, for a plain network the training error starts increasing again past a certain depth (the optimization gets harder), even though in theory more layers should never hurt.
With a ResNet, training error keeps decreasing as layers are added.
x -> Big NN -> a^[l]
x -> Big NN -> a^[l] -> residual block -> a^[l+2] (activations >= 0 because of ReLU)
a^[l+2] = g(z^[l+2] + a^[l]) = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l]). L2 regularization will shrink W.
If W^[l+2] = 0 and b^[l+2] = 0 => a^[l+2] = g(a^[l]) = a^[l]
Identity function is easy for residual block to learn
Adding a residual block doesn't hurt performance, and it can help if the block learns something useful.
Use "same" convolutions so that a^[l] has the same dimensions as z^[l+2] and the two can be added.
If the dimensions differ (e.g. when a pooling layer changes them), add a matrix Ws:
a^[l+2] = g(z^[l+2] + Ws a^[l]); where Ws is 256 x 128 if a^[l] has 128 dimensions and a^[l+2] has 256.
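A sketch of a residual block with the Keras functional API; the 1x1 convolution plays the role of Ws when the number of channels changes (filter sizes are illustrative):
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, filters, change_dims=False):
    shortcut = a_l
    if change_dims:
        shortcut = layers.Conv2D(filters, 1, padding='same')(a_l)   # Ws a^[l]
    z = layers.Conv2D(filters, 3, padding='same')(a_l)              # linear
    a = layers.Activation('relu')(z)                                # ReLU -> a^[l+1]
    z2 = layers.Conv2D(filters, 3, padding='same')(a)               # linear -> z^[l+2]
    z2 = layers.Add()([z2, shortcut])                               # skip connection: z^[l+2] + a^[l]
    return layers.Activation('relu')(z2)                            # a^[l+2] = g(z^[l+2] + a^[l])

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
tf.keras.Model(inputs, outputs).summary()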
With 1 channel, a 1x1 convolution simply multiplies the image by a number.
With 32 channels, a 1x1x32 filter acts like a single neuron with 32 inputs applied at every position; several such filters act like a small fully connected layer per position.
Element wise product of 32 channels and 32 channels in the filter.
6x6x32 * 1x1x32 = 6x6x#filters
28x28x192 * 1x1x32 = 28x28x32. Can shrink the number of channels.
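The channel-shrinking example above as a one-liner in Keras:
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 192))
y = tf.keras.layers.Conv2D(32, kernel_size=1, activation='relu')(x)
print(y.shape)   # (1, 28, 28, 32)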
reference (GoogLeNet)
What size filter? Pooling?
28x28x192 * 1x1 = 28x28x64
* 3x3 = 28x28x128 (Stack this volume next to the first volume)
* 5x5 = 28x28x32
* max pool (with padding) = 28x28x32
Computationally expensive
28x28x192 * CONV 5x5,same,32 = 28x28x32
32 filters of 5x5x192
Calculations: 28x28x32 * 5x5x192 = 120m
28x28x192 * CONV 1x1,16 = 28x28x16 (bottleneck layer) * CONV 5x5,same,32 = 28x28x32
Calculations: 28x28x16 * 192 = 2.4m; 28x28x32 * 5x5x16 = 10m; 2.4m + 10m = 12.4m (roughly 10x fewer than 120m)
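The multiply counts above, written out in plain Python:
naive      = (28 * 28 * 32) * (5 * 5 * 192)                           # ≈ 120 million multiplies
bottleneck = (28 * 28 * 16) * 192 + (28 * 28 * 32) * (5 * 5 * 16)     # ≈ 2.4m + 10m ≈ 12.4 million
print(f"{naive:,} vs {bottleneck:,}")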
Inception module: the previous activation (28x28x192) feeds four parallel branches:
-> 1x1 CONV -> 28x28x64
-> 1x1 CONV -> 3x3 CONV -> 28x28x128
-> 1x1 CONV -> 5x5 CONV -> 28x28x32
-> 3x3 MAXPOOL (same, s=1) -> 1x1 CONV -> 28x28x32
Channel concat -> 28x28x256
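A Keras sketch of one inception module with the branch sizes above (the 96 and 16 bottleneck widths are illustrative; GoogLeNet's real modules vary):
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x):
    b1 = layers.Conv2D(64, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(96, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(128, 3, padding='same', activation='relu')(b2)
    b3 = layers.Conv2D(16, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(32, 5, padding='same', activation='relu')(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    b4 = layers.Conv2D(32, 1, padding='same', activation='relu')(b4)
    return layers.Concatenate()([b1, b2, b3, b4])   # channel concat: 64 + 128 + 32 + 32 = 256

inputs = tf.keras.Input(shape=(28, 28, 192))
print(inception_module(inputs).shape)   # (None, 28, 28, 256)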
Side branches make softmax predictions
git clone https://github.com/KaimingHe/deep-residual-networks.git
cd deep-residual-networks
cd prototxt
more ResNet-101-...
# Uses Caffe
ImageNet,... datasets you can use.
You can download pre-trained weights.
If you are only classifying 3 classes, drop the original softmax layer and add your own softmax layer. Only train the parameters of that new softmax layer.
Freeze the other layers' parameters (frameworks usually expose a freeze flag or a trainable property per layer). Because the frozen layers never change, you can pre-compute their output activations for the whole training set, save them to disk, and train just the small softmax on top.
With more data you could freeze only the first few layers and train the later ones, or replace them with your own (possibly smaller) layers.
With lots of data, keep the downloaded weights just as an initialization and train the whole network.
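One way to express the freezing described above in Keras (the VGG-16 base and the 3-class softmax are assumptions for illustration):
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False                        # freeze all pre-trained layers (set True later to fine-tune)

model = tf.keras.Sequential([
    base,
    layers.Dense(3, activation='softmax'),    # your own softmax for 3 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()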
Mirroring (flip image)
Random cropping
Rotation
Shearing
Local warping
Color shifting (R+20, G-20, B+20)
PCA color augmentation used in AlexNet
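A sketch of some of these augmentations with Keras preprocessing layers (assumes TF 2.9+; PCA color augmentation has no built-in layer, and RandomContrast is only a crude stand-in for color shifting):
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip('horizontal'),        # mirroring
    layers.RandomCrop(224, 224),            # random cropping
    layers.RandomRotation(0.05),            # small rotations
    layers.RandomContrast(0.2),             # rough colour/contrast jitter
])

images = tf.random.uniform((8, 256, 256, 3))
print(augment(images, training=True).shape)   # (8, 224, 224, 3)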
Training data stored on hard disk -> CPU threads load images and apply distortions, forming mini-batches -> training (often on GPU), running in parallel.
Data spectrum: little data <-> lots of data. Less data means more hand-engineering ("hacks"); more data allows simpler algorithms and less hand-engineering. Roughly in order of increasing data available: object detection (bounding boxes), image recognition, speech recognition.
Two sources of knowledge: labeled data, and hand-engineered features / network architecture / other components.
Tips for doing well on benchmarks: ensembling (train several networks independently and average their outputs); multi-crop at test time (e.g. 10-crop) and average the predictions.
Keras documentation: https://keras.io/models/model/
With Keras, if you run fit() again, the model will continue to train with the parameters it has already learnt instead of reinitializing them.
Drawing a bounding box (localization).
Classification with localization - one object in image.
Detection - Multiple objects (e.g. pedestrian, car, motorcycle, background)
NN outputs four more numbers (b_x, b_y, b_h, b_w) as well as a class label. Upper left of image is (0,0) and lower right is (1,1). b_x, b_y is mid point of object, b_h is height, b_w is width.
y=[pc - is there an object?, b_x, b_y, b_h and b_w, C_1, C_2, C_3 (classes)]; 8 components
L(ŷ, y) = squared error (ŷ_1 - y_1)^2 + (ŷ_2 - y_2)^2 + ... + (ŷ_8 - y_8)^2 if y_1 = 1
If pc = y_1 = 0 then we don't care about the other components; only (ŷ_1 - y_1)^2 counts.
Landmark detection: if you are interested in points, output l_1x, l_1y, l_2x, l_2y, ..., l_nx, l_ny, e.g. landmarks on a face.
Person pose estimation works the same way: landmarks on the body, e.g. shoulder, head, foot.
Have a data-set of closely cropped images labelled car (1) or not car (0); train a ConvNet to predict the label.
Sliding window detection -> ConvNet. Run through each section of image.
Repeat with progressively larger windows. However, running a ConvNet on every window position is computationally expensive.
FC layers -> Conv layers.
The first FC layer becomes 400 filters of 5x5 -> 1x1x400
-> 1x1 CONV (400 filters) -> 1x1x400
-> 1x1 CONV (4 filters) -> 1x1x4
If the model was built for 14x14x3 input but the test image is 16x16x3, run the convolutional version over the whole 16x16x3 image; the output corresponds to the 14x14 window evaluated at every position with a stride of 2.
The sliding windows share a lot of information.
With suitable padding of the image, the sliding windows share computation and you only run the ConvNet once over the whole image.
YOLO reference and reference2
Split the image into a 3x3 grid of cells; each cell gets an 8-dimensional label (pc, bounding box, 3 classes).
b_h and b_w can be greater than 1 (the object can be bigger than its cell); b_x and b_y lie between 0 and 1.
Intersection over Union (IoU): intersection is the area shared by the two bounding boxes; union is the total area covered by either box. A detection counts as correct if IoU >= 0.5.
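A straightforward IoU implementation (boxes given as (x1, y1, x2, y2) corners; that convention is mine, not the course's):
def iou(box_a, box_b):
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7 ≈ 0.14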
Detect each object only once.
The midpoint of each object falls in only one grid cell, but in practice several nearby cells may fire.
This gives multiple detections per object.
Non-max suppression: pc is the probability of a detection; take the box with the largest pc and highlight it.
Boxes with high IoU will get suppressed.
Discard all boxes with pc <= 0.6. Then repeat: pick the remaining box with the highest pc as a prediction, and discard any remaining box with IoU >= 0.5 against it.
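The same procedure in code, reusing the iou() helper from the sketch above (thresholds as in the notes; run it per class):
def non_max_suppression(boxes, pc_threshold=0.6, iou_threshold=0.5):
    # boxes: list of (pc, x1, y1, x2, y2); returns the boxes kept as predictions
    boxes = [b for b in boxes if b[0] > pc_threshold]           # discard low-confidence boxes
    boxes.sort(key=lambda b: b[0], reverse=True)                # highest pc first
    kept = []
    while boxes:
        best = boxes.pop(0)                                     # pick the box with the highest pc
        kept.append(best)
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]   # suppress overlaps
    return kept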
What if a grid cell wants to detect multiple objects (e.g. a pedestrian standing in front of a car)?
Pre-define two different shapes (anchor boxes).
8 outputs with anchor box 1. Then 8 outputs with anchor box 2. Output is (3x3x16) or (3x3x2x8).
Each object is assigned to the pair (grid cell containing its midpoint, anchor box with the highest IoU with the object's shape).
y is 3 x 3 x 2 (anchors) x 8 (5 + # of classes) = 3 x 3 x 16
For each grid cell get 2 predicted bounding boxes
For each class use non-max suppression to generate final predictions.
R-CNN reference
Run a segmentation algorithm to propose regions that could be objects (~2,000 blobs), then run a classifier on each blob. Output a label + bounding box.
Fast R-CNN reference
Propose regions, then use a convolutional implementation of sliding windows to classify all the proposed regions at once.
Faster R-CNN reference
Use CNN to propose regions
https://www.drive.ai/ - car dataset
Deep CNN with a factor-of-32 reduction: (608 x 608) -> (19 x 19). 80 classes -> 85 outputs per box (5 + 80). 5 anchor boxes: 5 x 85 = 425 channels. The model predicts 19x19x5 = 1,805 boxes.
Facial recognition plus liveness detection (i.e. don't be fooled by a photo of the person).
Face verification vs. face recognition: verification is 1:1 (input image + claimed identity); recognition is 1:K against a database of K persons, so the underlying verification needs much higher accuracy (e.g. 99.9%+).
One-shot learning: recognize a person from only one picture.
Naive approach: image of person -> CNN -> softmax over (K people + none). This doesn't work well with one example per person, and you'd have to retrain whenever someone is added.
Learn a similarity function
d(img1, img2) = Degree of difference between images
If d(img1, img2) <= tau: same person; if > tau: different person.
reference (DeepFace)
x^(1) image -> CNN -> FC f(x^(1)) as 128 numbers (encoding of x^(1))
x^(2) -> 128 numbers (encoding of x^(2))
d(x^(1), x^(2)) = ||f(x^(1)) - f(x^(2))||^2
Run the two CNN in parallel
Learn parameters so that if x^(1) and x^(2) are the same person, ||f(x^(1)) - f(x^(2))||^2 is small; if they are different people, it is large.
reference (FaceNet)
Anchor & Positive (same person) d(A, P) = 0.5
Anchor & Negative (different person) d(A, N) = 0.7
Want ||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha (margin) <= 0. The margin prevents the trivial solution where every encoding is identical (e.g. f(x) = 0).
Given 3 images A, P, N
L(A, P, N) = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)
J = sum(L(A, P, N) )
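A minimal triplet-loss sketch in TensorFlow (batches of 128-d encodings assumed; alpha = 0.2 is illustrative):
import tensorflow as tf

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos = tf.reduce_sum(tf.square(f_a - f_p), axis=-1)   # ||f(A) - f(P)||^2
    neg = tf.reduce_sum(tf.square(f_a - f_n), axis=-1)   # ||f(A) - f(N)||^2
    return tf.reduce_sum(tf.maximum(pos - neg + alpha, 0.0))

f = lambda: tf.random.normal((4, 128))                   # stand-in encodings
print(triplet_loss(f(), f(), f()).numpy())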
Training set: 10k pictures of 1k persons. Need ~10 pictures of each person.
After training apply to one-shot learning.
If A, P, N are chosen randomly then d(A, P) + alpha <= d(A, N) is easily satisfied.
||f(A) - f(P)||^2 + alpha <= ||f(A) - f(N)||^2
Choose triplets that are "hard" to train on, where d(A, P) is close to d(A, N).
These barely satisfy (or violate) d(A, P) + alpha <= d(A, N), so gradient descent learns the most from them and training is more computationally efficient.
Alternative: feed the two encodings into a logistic regression unit that outputs 1 for the same person and 0 for different people.
ŷ = sigma( Σ_k w_k |f(x^(i))_k - f(x^(j))_k| + b ), where k runs over the 128 components of the encoding.
A chi-squared similarity, (f(x^(i))_k - f(x^(j))_k)^2 / (f(x^(i))_k + f(x^(j))_k), can be used instead (DeepFace paper).
You can pre-compute the encodings of the database images, so only the new image needs a forward pass.
Ways to improve facial recognition.
Take an image (C) and render it in the style (S) of, for example, Van Gogh.
Content (C) + Style (S) => Generated Image (G)
Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that
AlexNet
Pick a unit in layer 1. Find the nine image patches that maximize the unit's activation. Image patch will detect edges.
Next hidden unit is looking for edge in different direction or color grouping.
In deeper layers sees larger image patches.
Cost function: J(G) = alpha * Jcontent(C, G) + beta * Jstyle(S, G), where Jcontent measures how similar the content of G is to C, and Jstyle measures how similar the style of G is to S.
Initialize G randomly (e.g. 100 x 100 x 3).
Use gradient descent to minimize J(G).
G := G - d/dG J(G)
Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!
Jcontent(C, G)
Use hidden layer l to compute content cost (in middle of network)
Use a pre-trained ConvNet (e.g. VGG network).
Let a^[l](C) and a^[l](G) be the activation of layer l on the images
If a^[l](C) and a^[l](G) are similar, both images have similar content
Jcontent(C, G) = ||a^[l](C) - a^[l](G)||^2
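Content cost in TensorFlow (a_C and a_G are the layer-l activations; implementations often multiply by a normalization constant, which only rescales the gradient):
import tensorflow as tf

def content_cost(a_C, a_G):
    return tf.reduce_sum(tf.square(a_C - a_G))   # ||a^[l](C) - a^[l](G)||^2

a_C = tf.random.normal((1, 14, 14, 256))
a_G = tf.random.normal((1, 14, 14, 256))
print(content_cost(a_C, a_G).numpy())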
Use layer l's activations to measure the style.
Style is defined as correlation between activations across channels.
e.g. 5 channels: look at pairs of channel activations across all (n_h, n_w) positions. Correlated means e.g. vertical-line texture tends to occur together with an orange tint; uncorrelated means vertical lines occur without the orange tint.
Style matrix
a^[l]_i,j,k = activation at (i,j,k). G^[l] is nc^[l] x nc^[l].
G^[l]_{k,k'} = Σ_i Σ_j a^[l]_{i,j,k} * a^[l]_{i,j,k'}. An unnormalized cross-covariance: the "gram matrix".
Style of image S and style of image G.
J^[l]_style(S, G) = (1 / normalization constant) * ||G^[l](S) - G^[l](G)||^2_F (squared Frobenius norm); the constant (e.g. 1/(2 n_H^[l] n_W^[l] n_C^[l])^2) doesn't matter much because beta can absorb it.
Jstyle(S, G) = Σ_l lambda^[l] * J^[l]_style(S, G)
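Gram matrix and per-layer style cost in TensorFlow (the normalization constant is a convention, since beta and lambda^[l] can absorb it):
import tensorflow as tf

def gram_matrix(a):                           # a: (n_H, n_W, n_C) activations of one image
    a = tf.reshape(a, (-1, a.shape[-1]))      # (n_H*n_W, n_C)
    return tf.matmul(a, a, transpose_a=True)  # (n_C, n_C): G_kk' = sum_ij a_ijk * a_ijk'

def layer_style_cost(a_S, a_G):
    n_H, n_W, n_C = a_S.shape
    norm = (2.0 * n_H * n_W * n_C) ** 2
    return tf.reduce_sum(tf.square(gram_matrix(a_S) - gram_matrix(a_G))) / norm

a_S = tf.random.normal((14, 14, 256))
a_G = tf.random.normal((14, 14, 256))
print(layer_style_cost(a_S, a_G).numpy())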
Can apply to 1D data and 3D data (not just 2D).
1D data e.g. a time series of an EKG. Convolve with a 1D filter: 14 * 5 -> 10 (times the number of filters).
3D data e.g. MRI scan (another example is movies) (nw, nh, nd). Apply 3d filter. 14 x 14 x 14 * 5x5x5 -> 10x10x10. Detect features.
Neural Style Transfer references