Convolutional Neural Networks
May 2019
Module 1: Foundations of convolutional neural networks
Computer vision
Image classification, object detection, neural style transfer (repaint an image in the style of a painting).
Edge detection example
Detect vertical edges -> detect horizontal edges.
6x6 image. Construct a 3x3 filter (~kernel) as [[1,0,-1],[1,0,-1],[1,0,-1]]. Convolve 6x6 with 3x3 -> 4x4 (~sliding window). Each output value is the sum of the element-wise products of the filter and the 3x3 patch under it: (1,1 * 1,1) + (2,1 * 2,1) + ...
python: conv_forward. tf: tf.nn.conv2d. keras: Conv2D
If the image is all 10's on the left half and all 0's on the right half, convolving with the filter (1's on left, 0's in middle, -1's on right) gives a 4x4 with 30's in the two middle columns and 0's elsewhere.
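A minimal NumPy sketch of the vertical edge example above. The helper name conv2d_valid is made up for illustration; it is a plain loop, not the course's conv_forward or any library routine.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1), summing element-wise products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 6x6 image: 10s in the left half, 0s in the right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)
# Vertical edge filter: 1s on the left, 0s in the middle, -1s on the right.
vertical = np.array([[1, 0, -1]] * 3)

edges = conv2d_valid(image, vertical)
print(edges)  # every row is 0, 30, 30, 0: the edge shows up in the middle columns
```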
More edge detection
Vertical edge filter is [[1,0,-1],[1,0,-1],[1,0,-1]] and horizontal edge filter is [[1,1,1],[0,0,0],[-1,-1,-1]].
Sobel filter [[1,2,1],[0,0,0],[-1,-2,-1]]; Scharr filter [[3,10,3],[0,0,0],[-3,-10,-3]] (both written here in the horizontal-edge orientation; transpose for vertical edges).
You can learn these 9 values as weights using back prop. Can then get edges at 45°, 73°, etc.
Padding
6x6 * 3x3 = 4x4; nxn * fxf = (n-f+1) x (n-f+1)
Can only do this a few times before the picture shrinks to nothing.
Corner pixels are only used in one output value, centre pixels in many. Information near the edges is thrown away.
Pad image with one pixel on every side. 6x6 -> 8x8 * 3x3 = 6x6. p=1 (padding)
(n+2p-f+1) x (n+2p-f+1)
Valid convolution: no padding (n-f+1 x n-f+1)
Same convolution: pad to give same size as image. n+2p-f+1=n => p = (f-1)/2
f is usually odd.
Strided convolutions
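7x7 * 3x3 with stride = 2 -> 3x3. Step the filter over two pixels at a time.
nxn * fxf with padding p and stride s -> floor((n+2p-f)/s + 1) x floor((n+2p-f)/s + 1)
Cross-correlation vs convolution: mathematical convolution flips the filter first; deep learning skips the flip and still calls it convolution. The flip gives (A*B)*C = A*(B*C) (associativity), which neural networks don't need.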
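The size formulas from the padding and stride notes above can be checked with a small sketch. The function names are hypothetical helpers, and square inputs and filters are assumed.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1  # integer division implements the floor

def same_padding(f):
    """Padding that keeps output size equal to input size (stride 1, odd f)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))        # valid convolution: 4
print(conv_output_size(6, 3, p=1))   # same convolution: 6
print(conv_output_size(7, 3, s=2))   # strided convolution: 3
```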
Convolutions over volume
On RGB image (6x6x3). Convolve with a 3x3x3 filter -> 4x4 (one channel).
Multiply the 27 filter numbers with the corresponding numbers in all channels and sum them to get one output value.
If you want to detect edges in the R channel only, put the edge filter in the R slice of the filter and 0's in the G and B slices.
Multiple filters: Could end up with a 4x4x2 volume (two different filters).
n x n x n_c (channels; depth) * f x f x n_c -> (n-f+1) x (n-f+1) x n_c', where n_c' is the number of filters.
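A rough sketch of convolution over a volume with multiple filters, assuming stride 1 and no padding. The function name conv_volume and the random inputs are illustrative only.

```python
import numpy as np

def conv_volume(image, filters):
    """Valid convolution of an (n, n, n_c) image with (f, f, n_c, n_filters) filters."""
    n, _, n_c = image.shape
    f, _, f_c, n_filters = filters.shape
    assert n_c == f_c, "filter depth must match the number of input channels"
    size = n - f + 1
    out = np.zeros((size, size, n_filters))
    for i in range(size):
        for j in range(size):
            patch = image[i:i + f, j:j + f, :]  # (f, f, n_c) slice under the filter
            for k in range(n_filters):
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))
filters = rng.standard_normal((3, 3, 3, 2))
out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2)

# A filter that only looks at the R channel: edge filter in R, zeros in G and B.
r_only = np.zeros((3, 3, 3, 1))
r_only[:, :, 0, 0] = [[1, 0, -1]] * 3
r_edges = conv_volume(image, r_only)
```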
One layer of a Convolutional Network
6 x 6 x 3 * 3 x 3 x 3 -> relu(4 x 4 + b1) -> 4 x 4 ->
6 x 6 x 3 * 3 x 3 x 3 -> relu(4 x 4 + b2) -> 4 x 4 -> 4 x4 x 2
Similar to
Z^[1] = W^[1]a^[0] + b^[1]
a^[1] = g(Z^[1])
10 filters that are 3 x 3 x 3. 27 parameters + bias (28 parameters). 280 parameters.
f^[l] = filter size. p^[l] = padding. s^[l] = stride
Input: n_h^[l-1] x n_w^[l-1] x n_c^[l-1]
Output: n_h^[l] x n_w^[l] x n_c^[l]
n_[h/w]^[l] = floor((n_[h/w]^[l-1] + 2p^[l] - f^[l]) / s^[l] + 1)
each filter is: f^[l] x f^[l] x n_c^[l-1]
activations: a^[l] -> n_h^[l] x n_w^[l] x n_c^[l]
Batch gradient descent A^[l] -> m x n_h^[l] x n_w^[l] x n_c^[l]
Weights: f^[l] x f^[l] x n_c^[l-1] x n_c^[l]
Bias: n_c^[l] values, stored with shape (1,1,1,n_c^[l])
Simple Convolutional Network Example
39 x 39 x 3; n_H^[0] = n_W^[0] = 39; n_c^[0] = 3
f^[1] = 3; s^[1] = 1; p^[1] = 0. 10 filters
Next layer: 37x37x10, from ((n+2p-f) / s) + 1
f^[2] = 5; s^[2] = 2; p^[2] = 0. 20 filters
Next layer: 17x17x20
f^[3] = 5; s^[3] = 2; p^[3] = 0. 40 filters
Next layer: 7x7x40 (1,960 values)
Flatten this to a vector and feed to a logistic regression / softmax. -> y^
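The layer sizes in this example can be traced with a short sketch. The helper trace_shapes is hypothetical; it just applies the ((n+2p-f)/s)+1 formula layer by layer.

```python
def trace_shapes(n, n_c, layers):
    """layers: list of (f, s, p, n_filters). Returns the (n, n, n_c) shape after each layer."""
    shapes = [(n, n, n_c)]
    for f, s, p, n_filters in layers:
        n = (n + 2 * p - f) // s + 1
        shapes.append((n, n, n_filters))
    return shapes

shapes = trace_shapes(39, 3, [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)])
print(shapes)  # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]

n, _, n_c = shapes[-1]
flattened = n * n * n_c
print(flattened)  # 1960 values fed to the logistic regression / softmax
```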
Types of layers: Convolutions (CONV), Pooling (POOL), Fully connected (FC)
Pooling layers
Speed up computation and make features more robust.
Max pooling:
4x4 grid. Split into 2x2 grids then keep the max value in each grid -> 2x2
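Hyper-parameters: f=2, s=2. A large value means a feature (e.g. a cat whisker) was detected somewhere in that region, and max keeps it. No parameters to learn.
5x5 with f=3, s=1 -> 3x3. Take the maximum value under the filter at each position.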
Max pooling is done independently on each channel.
Average pooling:
Average the values in the filter.
e.g. 7x7x1000 -> 1x1x1000 (with a 7x7 filter).
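A minimal sketch of max and average pooling applied independently per channel. pool2d is an illustrative helper written with plain loops, not a library function.

```python
import numpy as np

def pool2d(x, f, s, mode="max"):
    """Pool an (n_h, n_w, n_c) volume with an f x f window and stride s, per channel."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reducer = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = reducer(window, axis=(0, 1))  # reduce over height and width only
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)  # rows 0..3, 4..7, 8..11, 12..15
max_pooled = pool2d(x, f=2, s=2)                  # 2x2 blocks -> 5, 7, 13, 15
avg_pooled = pool2d(x, f=2, s=2, mode="average")  # 2x2 blocks -> 2.5, 4.5, 10.5, 12.5
```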
CNN Example
LeNet-5 style network
32x32x3. f=5, s=1 ->
28x28x6 (CONV1). max pooling f=2, s=2 ->
14x14x6 (POOL1). Both of these together can be counted as layer 1. f=5, s=1 ->
10x10x16 (CONV2). max pooling f=2, s=2 ->
5x5x16 (POOL2). Both of these together can be counted as layer 2. Flatten this to 400 x 1 ->
120x1 (FC3). W^[3] (120,400), b^[3] (120) ->
84x1 (FC4) ->
Softmax (10 outputs).
Choose hyper-parameters by looking at what has worked for others in the literature.
Throughout the network n_H, n_W decrease and n_C increases.
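Layer              Activation shape   Activation size   # Parameters
Input              (32, 32, 3)        3,072 (a^[0])     0
CONV1 (f=5, s=1)   (28, 28, 8)        6,272             608
POOL1              (14, 14, 8)        1,568             0
CONV2 (f=5, s=1)   (10, 10, 16)       1,600             3,216
POOL2              (5, 5, 16)         400               0
FC3                (120, 1)           120               48,120
FC4                (84, 1)            84                10,164
Softmax            (10, 1)            10                850
(This table uses 8 filters in CONV1. Parameter counts include biases: (f*f*n_c_prev + 1) * n_filters for CONV, n_in*n_out + n_out for FC.)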
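A sketch of how the per-layer parameter counts can be computed, assuming each conv filter spans all input channels and every filter or unit has one bias. The helper names are made up for illustration. Under these assumptions CONV1 has 608 parameters and CONV2 has 3,216.

```python
def conv_params(f, n_c_prev, n_filters):
    """Weights of n_filters filters of size f x f x n_c_prev, plus one bias each."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """Weight matrix (n_out, n_in) plus n_out biases."""
    return n_in * n_out + n_out

counts = {
    "CONV1": conv_params(5, 3, 8),    # 32x32x3 input, 8 filters
    "CONV2": conv_params(5, 8, 16),   # 14x14x8 input, 16 filters
    "FC3": fc_params(400, 120),
    "FC4": fc_params(120, 84),
    "Softmax": fc_params(84, 10),
}
total = sum(counts.values())
print(counts, total)
```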
Why Convolutions?
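32x32x3 (3,072) f=5, 6 filters -> 28x28x6 (4,704). A fully connected weight matrix for this would be huge (3,072 x 4,704 = ~14 m)
Number of parameters in the conv layer = (5 * 5 * 3 + 1) * 6 = 456 (or (5 * 5 + 1) * 6 = 156 if the 3 input channels are ignored).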
Parameter sharing: A feature detector that is useful in one part of the image is useful in another part of the image.
Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
Cat detector
(x^(1), y^(1))...(x^(m),y^(m))
Cost, J = (1/m) * sum_{i=1..m} L(y^^(i), y^(i))
Use gradient descent to optimize parameters to reduce J
Module 2: Deep convolutional models: case studies
Why Look at case studies?
Classic networks: LeNet-5, AlexNet, VGG. Then ResNet and Inception.
Classic Networks
LeNet-5 (32x32x1) -> 5x5, s=1
28x28x6 -> avg pool, f=2, s=2
14x14x6 -> 5x5, s=1
10x10x16 -> avg pool f=2, s=2
5x5x16 (400) -> FC
120 neurons -> FC
84 neurons ->
y^ (10 classes, one per digit; a modern version would use a softmax)
AlexNet (227x227x3) -> 11x11, s=4
55x55x96 -> max pool, 3x3, s=2
27x27x96 -> 5x5, same
27x27x256 -> max pool, 3x3, s=2
13x13x256 -> 3x3, same
13x13x384 -> 3x3, same
13x13x384 -> 3x3, same
13x13x256 -> max pool, 3x3, s=2
6x6x256 (9,216) -> FC
4096 neurons -> FC
4096 neurons ->
softmax, 1000 outputs
Used local response normalization (LRN), which normalizes values across channels at each position. It doesn't have much effect.
VGG-16 (224x224x3) -> conv 64 x 2
224x224x64 -> pool
112x112x64 -> conv 128 x 2
112x112x128 -> pool
56x56x128 -> conv 256 x 3
56x56x256 -> pool
28x28x256 -> conv 512 x 3
28x28x512 -> pool
14x14x512 -> conv 512 x 3
14x14x512 -> pool
7x7x512 -> FC
4096 -> FC
4096 ->
softmax, 1000 outputs
Residual Networks (ResNet)
Deep NNs struggle with vanishing and exploding gradients.
Residual block (reference)
a^[l] -> a^[l + 1] -> a^[l + 2]
a^[l] -> Linear -> ReLU -> a^[l + 1] -> Linear -> ReLU -> a^[l + 2]
z^[l + 1] = W^[l + 1] a^[l] + b^[l + 1]; a^[l + 1] = g(z^[l + 1])
z^[l + 2] = W^[l + 2] a^[l + 1] + b^[l + 2]; a^[l + 2] = g(z^[l + 2])
Take the output of the second Linear and add a^[l] before the ReLU (shortcut/skip connection)
a^[l + 2] = g(z^[l + 2] + a^[l])
10 layers -> 5 Res blocks
In theory, training error should keep decreasing as layers are added. In reality, for a plain network training error increases after a while with the number of layers (deeper networks are harder to optimize).
With ResNets, training error keeps decreasing as layers are added.
Why ResNets Work
x -> Big NN -> a^[l]
x -> Big NN -> a^[l] -> ResNet -> a^[l+2], a >= 0
a^[l+2] = g(z^[l+2] + a^[l]) = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l]). L2 regularization will shrink W^[l+2].
If W^[l+2] = 0, b^[l+2] = 0 => a^[l+2] = g(a^[l]) = a^[l] (ReLU, and a^[l] >= 0)
Identity function is easy for a residual block to learn
Adding a residual block at the end doesn't hurt performance.
Use same convolutions so that z^[l+2] and a^[l] have the same dimensions and can be added.
If the dimensions differ, can add a matrix Ws:
a^[l+2] = g(z^[l+2] + Ws a^[l]); where Ws is 256 x 128, a^[l] is 128 and a^[l+2] is 256. This is needed when e.g. a pooling layer changes the dimension.
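A small sketch of a fully connected residual block, to illustrate why the identity is easy to learn. It assumes ReLU activations and matching dimensions (no Ws); the names and random values are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """a^[l+2] = g(W2 a^[l+1] + b2 + a^[l]), with a^[l+1] = g(W1 a^[l] + b1)."""
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)  # skip connection added before the second ReLU

rng = np.random.default_rng(0)
a_l = relu(rng.standard_normal(4))  # a^[l] >= 0 because it came out of a ReLU
W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)

# With W^[l+2] = 0 and b^[l+2] = 0 the block reduces to the identity.
out = residual_block(a_l, W1, b1, np.zeros((4, 4)), np.zeros(4))
print(np.allclose(out, a_l))  # True
```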
Networks in Networks and 1x1 Convolutions
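If the input has 1 channel, a 1x1 filter simply scales every value by one number.
If the input has 32 channels, it is like a fully connected unit with 32 inputs applied at every position.
Element-wise product of the 32 channel values at a position with the 32 numbers in the filter, summed, then ReLU.
6x6x32 * (#filters of 1x1x32) = 6x6x#filters
28x28x192 * (32 filters of 1x1x192) = 28x28x32. Can shrink the number of channels.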
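A sketch of a 1x1 convolution written as a matrix product over the channel axis, assuming a ReLU afterwards. conv1x1 is an illustrative helper; the weight layout (n_c_in, n_filters) is a choice made here, not a library convention.

```python
import numpy as np

def conv1x1(x, W, b):
    """x: (n_h, n_w, n_c_in); W: (n_c_in, n_filters); b: (n_filters,)."""
    # Each position's channel vector goes through the same small dense layer.
    return np.maximum(x @ W + b, 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))
W = rng.standard_normal((192, 32))
b = rng.standard_normal(32)

out = conv1x1(x, W, b)
print(out.shape)  # (28, 28, 32): same height and width, fewer channels
```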
Inception Network Motivation
reference (GoogLeNet)
What size filter? Pooling?
28x28x192 * 1x1 = 28x28x64
* 3x3 = 28x28x128 (Stack this volume next to the first volume)
* 5x5 = 28x28x32
* max pool (with padding) = 28x28x32
Computationally expensive
28x28x192 * CONV 5x5,same,32 = 28x28x32
32 filters of 5x5x192
Calculations: (28x28x32 output values) * (5x5x192 multiplications each) = ~120m
28x28x192 * CONV 1x1, 16 filters (each 1x1x192) = 28x28x16 (bottleneck layer) * CONV 5x5, same, 32 filters (each 5x5x16) = 28x28x32
Calculations: 28x28x16 * 192 = ~2.4m; 28x28x32 * 5x5x16 = ~10m; 2.4m + 10m = ~12.4m (vs 120m)
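The multiplication counts can be reproduced with a short sketch. conv_multiplications is a hypothetical helper that counts one multiplication per filter weight per output value.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Output values times the f x f x n_c_in multiplications needed for each."""
    return out_h * out_w * n_filters * f * f * n_c_in

# Direct 5x5 same convolution: 28x28x192 -> 28x28x32
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32 channels
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct)      # 120422400, about 120m
print(bottleneck)  # 12443648, about 12.4m
```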
Inception Network
Previous activation (28x28x192) -> 1x1 CONV -> 3x3 CONV ->
-> 1x1 CONV -> 5x5 CONV ->
-> 1x1 CONV ->
-> 3x3 MAXPOOL (same padding, s=1) -> 1x1 CONV (28x28x32) -> Channel Concat (28x28x256)
Side branches off hidden layers also make softmax predictions.
Using Open Source Implementations
git clone https://github.com/KaimingHe/deep-residual-networks.git
cd deep-residual-networks
cd prototxt
more ResNet-101-...
# Uses Caffe
Transfer Learning
ImageNet,... datasets you can use.
You can download pre-trained weights.
If you are only classifying 3 classes, remove the final softmax layer and add your own softmax layer. Only train the parameters for the new softmax layer.
Freeze the parameters of the other layers. Some frameworks expose this per layer as a trainable parameter or a freeze flag. You can pre-compute the activations of the last frozen layer for every training image and save them to disk, then train the softmax on those saved activations.
With a larger dataset you could freeze only the first few layers and train the later layers (or replace them with your own, possibly smaller, layers).
With a lot of data you could keep the downloaded weights as the initialization and train the whole network.
Data Augmentation
Mirroring (flip image)
Random cropping
Rotation
Shearing
Local warping
Color shifting (R+20, G-20, B+20)
PCA color augmentation used in AlexNet
Training data stored on hard disk -> CPU thread(s) load the images and apply the distortions to form a mini-batch -> training, running in parallel with the loading.
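A rough NumPy sketch of three of the augmentations listed above (mirroring, random cropping, color shifting), assuming uint8 images of shape (height, width, 3). The function names and the random test image are illustrative only.

```python
import numpy as np

def mirror(img):
    """Flip the image left to right."""
    return img[:, ::-1, :]

def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def color_shift(img, shift):
    """Add a per-channel offset such as (R+20, G-20, B+20), clipped to [0, 255]."""
    shifted = img.astype(np.int16) + np.array(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = color_shift(random_crop(mirror(img), 28, rng), (20, -20, 20))
print(augmented.shape)  # (28, 28, 3)
```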
State of Computer Vision
Spectrum from little data to lots of data. In order of increasing data available: object detection (bounding boxes), image recognition, speech recognition.
Little data: more hand-engineering. Lots of data: simpler algorithms, less hand-engineering.
Two sources of knowledge: labeled data; hand-engineered features / network architecture / other components.
Tips for doing well on benchmarks:
- Ensembling: train several networks independently (3-15) and average their outputs (y hat)
- Multi-crop at test time: run the classifier on multiple versions of each test image and average the results, e.g. 4 crops of an image
- Use open source; architectures of networks published; pretrained models
Keras documentation: https://keras.io/models/model/
With Keras, if you run fit() again, the model will continue to train with the parameters it has already learnt instead of reinitializing them.
Module 3: Detection algorithms
Object Localization
Drawing a bounding box (localization).
Classification with localization - one object in image.
Detection - Multiple objects (e.g. pedestrian, car, motorcycle, background)
NN outputs four more numbers (b_x, b_y, b_h, b_w) as well as a class label. Upper left of image is (0,0) and lower right is (1,1). b_x, b_y is mid point of object, b_h is height, b_w is width.
y=[pc (is there an object?), b_x, b_y, b_h, b_w, C_1, C_2, C_3 (classes)]; 8 components
L(y^, y) = squared error (y^_1 - y_1)^2 + (y^_2 - y_2)^2 + ... + (y^_8 - y_8)^2 if y_1 = 1; (y^_1 - y_1)^2 if y_1 = 0
If pc = 0 then don't care about the other components.
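A minimal sketch of the squared-error loss for classification with localization, assuming the 8-component label layout [pc, b_x, b_y, b_h, b_w, C_1, C_2, C_3]. The function name is illustrative.

```python
def localization_loss(y_hat, y):
    """Squared error over all 8 components if an object is present, else over pc only."""
    if y[0] == 1:
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    return (y_hat[0] - y[0]) ** 2

# Object present: every component counts.
y = [1, 0.5, 0.5, 0.25, 0.25, 0, 1, 0]
y_hat = [0.5, 0.5, 0.5, 0.25, 0.25, 0, 1, 0]
print(localization_loss(y_hat, y))  # 0.25

# No object: only pc counts, the other predictions are ignored.
y_empty = [0, 0, 0, 0, 0, 0, 0, 0]
y_hat_empty = [0.5, 9, 9, 9, 9, 1, 1, 1]
print(localization_loss(y_hat_empty, y_empty))  # 0.25
```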
Landmark Detection
If you are interested in specific points, the network can output their coordinates l_1x, l_1y, l_2x, l_2y, ..., l_nx, l_ny, e.g. landmarks on a face.
Pose detection: landmarks on the body, e.g. shoulder, head, foot.
Object Detection
Have a dataset of closely cropped images labeled car (1) or no car (0). Train a ConvNet to classify the crops.
Sliding window detection: slide a window across the image and run the ConvNet on each crop.
Repeat using a larger window (and again with a larger one still). However, this is computationally expensive.
Convolutional Implementation of Sliding Windows
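FC layers -> Conv layers.
5x5x16 -> FC (400 units) becomes 5x5x16 * 400 filters of 5x5 -> 1x1x400
-> FC (400 units) becomes 400 filters of 1x1 (each 1x1x400) -> 1x1x400
-> softmax (4 outputs) becomes 4 filters of 1x1 -> 1x1x4
Model has input 14x14x3 but test image is 16x16x3. Sliding a 14x14 window with a stride of 2 gives 4 windows. Running the convolutional network on the whole 16x16x3 image gives a 2x2x4 output, one 1x1x4 prediction per window.
The sliding windows share a lot of computation.
So run the CNN once on the whole image instead of once per window.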
Bounding Box Predictions
YOLO reference and reference2
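Split image into a 3x3 grid (9 cells) and create a label for each cell (8 values: pc, bounding box b_x, b_y, b_h, b_w, and 3 classes). Output is 3x3x8.
b_x, b_y are relative to the grid cell (between 0 and 1). b_h, b_w could be greater than 1 (object larger than the cell).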
Intersection Over Union
(IoU) = size of intersection / size of union. Intersection is the area shared by the two bounding boxes. Union is the area covered by either box. Correct if IoU >= 0.5.
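A minimal IoU sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2) rather than the midpoint/height/width form used in the labels. The function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```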
Non-max Suppression
Detect each object only once.
Mid-point should only be in one grid cell.
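Several cells may each report a detection for the same object.
Pc - probability of detection. Take the box with the largest Pc and highlight it.
Boxes with high IoU with the highlighted box get suppressed.
Discard all boxes with Pc <= 0.6. While boxes remain: pick the box with the highest Pc and output it as a prediction; discard any remaining box with IoU >= 0.5 with that box.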
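A sketch of the non-max suppression steps above for a single class, assuming corner-format boxes (x1, y1, x2, y2). The thresholds 0.6 and 0.5 follow the notes; the function names and example boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest Pc first."""
    # Step 1: discard all boxes with Pc <= score_threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)  # Step 2: pick the box with the highest Pc.
        kept.append(best)
        # Step 3: discard remaining boxes with IoU >= iou_threshold with it.
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```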
Anchor Boxes
What if a grid cell wants to detect multiple objects (e.g. pedestrian in front of a car)?
Pre-define two different shapes (anchor boxes).
8 outputs with anchor box 1. Then 8 outputs with anchor box 2. Output is (3x3x16) or (3x3x2x8).
(grid cell, anchor box) for each object.
YOLO Algorithm
y is 3 x 3 x 2 (anchors) x 8 (5 + # of classes) = 3 x 3 x 16
For each grid cell get 2 predicted bounding boxes
For each class use non-max suppression to generate final predictions.
Region Proposals
R-CNN reference
Segmentation algorithm proposes what could be objects: ~2,000 blobs. Run the classifier on each blob. Output label + bounding box.
Fast R-CNN reference
Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions.
Faster R-CNN reference
Use CNN to propose regions
Object detection
https://www.drive.ai/ - car dataset
Deep CNN. Reduction factor of 32: (608 x 608) -> (19 x 19). 80 classes -> 85 outputs per box (5 + 80). 5 anchor boxes: 5 x 85 = 425 outputs per cell. The model predicts 19 x 19 x 5 = 1805 boxes.
Module 4: Special applications: Face recognition & Neural style transfer
What is facial recognition?
Face recognition and liveness detection (i.e. don't accept a picture of a face)
Face verification vs. face recognition. Verification is 1:1 (is this the claimed person?); want 99.9% accuracy. Recognition is 1:K: have a database of K persons, get an input image, output the person's ID or "not recognized".
One Shot Learning
Recognize a person with only one picture.
Image of person -> CNN -> Softmax (n people + none). Doesn't really work with so few examples per person, and you would have to retrain whenever a new person is added.
Learn a similarity function
d(img1, img2) = Degree of difference between images
If d(img1, img2) <= tau: same person
If d(img1, img2) > tau: different person
Siamese network
reference (DeepFace)
x^(1) image -> CNN -> FC f(x^(1)) as 128 numbers (encoding of x^(1))
x^(2) -> 128 numbers (encoding of x^(2))
d(x^(1), x^(2)) = ||f(x^(1)) - f(x^(2))||^2.
Run the same CNN (same parameters) on both images.
Learn parameters so that if x^(i) and x^(j) are the same person, ||f(x^(i)) - f(x^(j))||^2 is small, and if they are different persons it is large.
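A sketch of the distance and threshold test on two encodings. The encoder f (the CNN) is not implemented here; the 128-number encodings and the value of tau are placeholders chosen for illustration.

```python
import numpy as np

def distance(f1, f2):
    """d = ||f1 - f2||^2, the squared Euclidean distance between two encodings."""
    return float(np.sum((f1 - f2) ** 2))

def same_person(f1, f2, tau):
    """Verification rule: same person if the distance is at most tau."""
    return distance(f1, f2) <= tau

# Stand-ins for f(x^(1)) and f(x^(2)); a real system would get these from the CNN.
f1 = np.zeros(128)
f2 = np.zeros(128)
f2[0] = 1.0

print(distance(f1, f2))              # 1.0
print(same_person(f1, f2, tau=0.5))  # False
print(same_person(f1, f2, tau=1.0))  # True
```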
Triplet loss
reference (FaceNet)
Anchor & Positive (same person) d(A, P) = 0.5
Anchor & Negative (different person) d(A, N) = 0.7
Want ||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha (margin) <= 0. The margin stops the network from satisfying the constraint trivially, e.g. by outputting all zeros.
Given 3 images A, P, N
L(A, P, N) = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)
J = sum_{i=1..m} L(A^(i), P^(i), N^(i))
Training set: 10k pictures of 1k persons (~10 pictures of each person). Need multiple pictures of the same person to form (A, P) pairs.
After training apply to one-shot learning.
If A, P, N are chosen randomly then d(A, P) + alpha <= d(A, N) is easily satisfied.
||f(A) - f(P)||^2 + alpha <= ||f(A) - f(N)||^2
Choose triplets that are "hard" to train on
i.e. triplets where d(A, P) is close to d(A, N). Makes training computationally efficient.
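A small sketch of the triplet loss on hand-picked 2-number encodings, to show the difference between an easy and a hard negative. The encodings and alpha = 0.5 are made-up values, not from the lecture.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)"""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return float(max(d_ap - d_an + alpha, 0.0))

f_a = np.array([0.0, 0.0])
f_p = np.array([1.0, 0.0])      # d(A, P) = 1
easy_n = np.array([0.0, 2.0])   # d(A, N) = 4, constraint already satisfied
hard_n = np.array([1.0, 0.5])   # d(A, N) = 1.25, close to d(A, P)

print(triplet_loss(f_a, f_p, easy_n, alpha=0.5))  # 0.0, nothing to learn from
print(triplet_loss(f_a, f_p, hard_n, alpha=0.5))  # 0.25, gives a training signal
```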
Face Verification and Binary Classification
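The two encodings can be fed into a logistic regression unit that outputs 1 if same person or 0 if different.
y^ = sigma(sum_{k=1..128} w_k |f(x^(i))_k - f(x^(j))_k| + b) where k indexes the 128 components of the encoding.
Chi-squared alternative from the DeepFace paper: replace |f(x^(i))_k - f(x^(j))_k| with (f(x^(i))_k - f(x^(j))_k)^2 / (f(x^(i))_k + f(x^(j))_k).
You can pre-compute the encodings of the database images.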
Ways to improve facial recognition.
- Put more images of each person (under different lighting conditions, taken on different days, etc.) into the database. Then given a new image, compare the new face to multiple pictures of the person. This would increase accuracy.
- Crop the images to just contain the face, and less of the "border" region around the face. This preprocessing removes some of the irrelevant pixels around the face, and also makes the algorithm more robust.
What is neural style transfer
Take image (C) and put it in the style (S) of Van Gogh, for example
Content (C) + Style (S) => Generated Image (G)
Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that
What are deep ConvNets learning?
AlexNet
Pick a unit in layer 1. Find the nine image patches that maximize the unit's activation. In layer 1 these patches show simple features such as edges.
Next hidden unit is looking for an edge in a different direction or a color grouping.
In deeper layers a unit sees larger image patches and detects more complex patterns.
Neural Style Transfer Cost Function
J(G) = alpha * Jcontent(C, G) + beta * Jstyle(S, G). Jcontent: how similar is the content of C to G. Jstyle: how similar is the style of S to G.
Initialize G randomly (e.g. 100 x 100 x 3).
Use gradient descent to minimize J(G).
G := G - d/dG J(G)
Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!
Content Cost Function
Jcontent(C, G)
Use hidden layer l to compute content cost (in middle of network)
Use a pre-trained ConvNet (e.g. VGG network).
Let a^[l](C) and a^[l](G) be the activation of layer l on the images
If a^[l](C) and a^[l](G) are similar, both images have similar content
Jcontent(C, G) = ||a^[l](C) - a^[l](G)||^2
Style Cost Function
Use layer l to measure the style.
Style is defined as correlation between activations across channels.
e.g. 5 channels. Take two channels and look at the pairs of activations at each (n_h, n_w) position. Correlated: e.g. wherever there are vertical lines there is also an orange tint. Uncorrelated: vertical lines occur without the orange tint.
Style matrix
a^[l]_i,j,k = activation at (i,j,k). G^[l] is n_c^[l] x n_c^[l].
G^[l]_k,k' = sum_i sum_j a^[l]_i,j,k * a^[l]_i,j,k'. Unnormalized cross-covariance ("Gram matrix").
Compute the style matrix for both image S and image G.
J^[l]_style(S, G) = (1 / (2 * n_H^[l] * n_W^[l] * n_C^[l])^2) * ||G^[l](S) - G^[l](G)||^2_F (Frobenius norm; the leading term is just a normalization constant).
Jstyle(S, G) = sum_l lambda^[l] * J^[l]_style(S, G).
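A sketch of the content cost, Gram matrix and per-layer style cost on raw activation arrays. It assumes activations of shape (n_h, n_w, n_c) already extracted from a pre-trained network, and uses 1 / (2 n_H n_W n_C)^2 as the normalization constant; other constants work equally well. All names and the alpha/beta values are illustrative.

```python
import numpy as np

def content_cost(a_C, a_G):
    """Jcontent = ||a(C) - a(G)||^2 on one layer's activations."""
    return float(np.sum((a_C - a_G) ** 2))

def gram_matrix(a):
    """G[k, k'] = sum over positions (i, j) of a[i, j, k] * a[i, j, k']."""
    n_h, n_w, n_c = a.shape
    unrolled = a.reshape(n_h * n_w, n_c)  # one row per position
    return unrolled.T @ unrolled          # (n_c, n_c)

def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between the two Gram matrices, normalized."""
    n_h, n_w, n_c = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2) / (2 * n_h * n_w * n_c) ** 2)

def total_cost(j_content, j_style, alpha, beta):
    """J(G) = alpha * Jcontent + beta * Jstyle."""
    return alpha * j_content + beta * j_style

a = np.ones((2, 2, 3))
print(gram_matrix(a))                       # 3x3 matrix filled with 4.0
print(layer_style_cost(a, a))               # 0.0 for identical activations
print(content_cost(a, np.zeros((2, 2, 3)))) # 12.0
```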
1D and 3D Generalizations
Can apply to 1D data and 3D data (not just 2D).
1D data e.g. time series of ECG. Apply a 1D filter (e.g. a Gaussian-shaped curve): 14 * 5 -> 10.
3D data e.g. MRI scan (another example is movies) (nw, nh, nd). Apply 3d filter. 14 x 14 x 14 * 5x5x5 -> 10x10x10. Detect features.
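A quick check of the 1D case with NumPy, assuming no padding and stride 1. The signal and the all-ones filter are placeholders; np.correlate is used because, as with 2D, the filter is not flipped. The same n - f + 1 rule applies per axis in 3D.

```python
import numpy as np

signal = np.arange(14, dtype=float)  # stand-in for 14 ECG samples
kernel = np.ones(5)                  # stand-in for a learned length-5 filter

out = np.correlate(signal, kernel, mode="valid")
print(out.shape)  # (10,)

# 3D: each axis shrinks the same way, 14 - 5 + 1 = 10.
shape_3d = tuple(n - 5 + 1 for n in (14, 14, 14))
print(shape_3d)  # (10, 10, 10)
```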
Neural Style Transfer references
- Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, (2015). A Neural Algorithm of Artistic Style (https://arxiv.org/abs/1508.06576)
- Harish Narayanan, Convolutional neural networks for artistic style transfer. https://harishnarayanan.org/writing/artistic-style-transfer/
- Log0, TensorFlow Implementation of "A Neural Algorithm of Artistic Style". http://www.chioka.in/tensorflow-implementation-neural-algorithm-of-artistic-style
- Karen Simonyan and Andrew Zisserman (2015). Very deep convolutional networks for large-scale image recognition (https://arxiv.org/pdf/1409.1556.pdf)
- MatConvNet. http://www.vlfeat.org/matconvnet/pretrained/