Convolutional Neural Networks

May 2019

Module 1: Foundations of convolutional neural networks

Computer vision

Image classification, object detection, Neural Style Transfer (re-rendering the content of one image in the style of another, e.g. a painting).

Edge detection example

Detect vertical edges -> detect horizontal edges.

6x6 image. Construct a 3x3 filter (~kernel) as [[1,0,-1],[1,0,-1],[1,0,-1]]. Convolve the 6x6 image with the 3x3 filter -> 4x4 output (~sliding window): at each position, take the element-wise product of the 3x3 image patch with the 3x3 filter and sum the 9 values.

python: conv_forward. tf: tf.nn.conv2d. keras: Conv2D

If the image has 10's on the left half and 0's on the right half, convolving it with the vertical edge filter (1's in the left column, 0's in the middle, -1's in the right column) gives a 4x4 output with 30's in the middle two columns: the detected edge.
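A minimal NumPy sketch (not course code) of this example: a "valid" convolution loop plus the 10s/0s image, which prints 30's in the middle two columns.

import numpy as np

def conv2d_valid(image, kernel):
    # "valid" convolution as used in deep learning (no filter flip, i.e. cross-correlation)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])  # 10's left, 0's right
vertical = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(conv2d_valid(image, vertical))  # 4x4 with 30's in the middle two columns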

More edge detection

Vertical edge filter is [[1,0,-1],[1,0,-1],[1,0,-1]] and horizontal edge filter is [[1,1,1],[0,0,0],[-1,-1,-1]].

Sobel filter [[1,2,1],[0,0,0],[-1,-2,-1]]; Scharr filter [[3,10,3],[0,0,0],[-3,-10,-3]] (horizontal-edge versions; transpose for vertical).

You can learn these filter values as weights using backprop; the network can then learn edges at 45°, 73°, etc.

Padding

6x6 * 3x3 = 4x4; n x n * f x f = (n-f+1) x (n-f+1)

You can only convolve a few times before the image shrinks too much.

Corner pixels are used only once, so edge information is thrown away.

Pad the image with a one-pixel border: 6x6 -> 8x8, and 8x8 * 3x3 = 6x6. p = 1 (padding).

(n + 2p - f + 1) x (n + 2p - f + 1)

Valid convolution: no padding (n-f+1 x n-f+1)

Same convolution: pad to give same size as image. n+2p-f+1=n => p = (f-1)/2

f is usually odd (the filter then has a central pixel, and "same" padding is symmetric).

Strided convolutions

7x7 * 3x3 with stride s=2 -> 3x3. Step the filter over two pixels at a time.

n x n * f x f -> floor((n + 2p - f)/s + 1) x floor((n + 2p - f)/s + 1)
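The output-size formula as a small Python helper (a sketch; the floor matches the formula above):

from math import floor

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return floor((n + 2 * p - f) / s) + 1

assert conv_output_size(6, 3) == 4        # valid convolution
assert conv_output_size(6, 3, p=1) == 6   # same convolution, p = (f-1)/2
assert conv_output_size(7, 3, s=2) == 3   # the strided example above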

Cross-correlation vs. convolution: a textbook convolution flips the filter first, which gives associativity: (A*B)*C = A*(B*C). Deep learning's "convolution" skips the flip, so it is technically cross-correlation.

Convolutions over volume

On an RGB image (6x6x3): convolve with a 3x3x3 filter -> 4x4.

Multiply and sum the numbers across all channels (27 products for a 3x3x3 filter).

If you want to detect edges only in the R channel, put the edge-filter numbers in the R slice and 0's in the G and B slices of the filter.

Multiple filters: stack the outputs to get e.g. a 4x4x2 volume (two different filters).

n x n x n_c (channels; depth) * f x f x n_c -> (n-f+1) x (n-f+1) x n_c', where n_c' is the number of filters.
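A NumPy sketch (illustrative, not course code) of convolving a volume with a stack of filters; each filter spans all input channels and produces one output channel:

import numpy as np

def conv_volume(image, filters):
    # image: (n, n, n_c); filters: (n_f, f, f, n_c) -> output: (n-f+1, n-f+1, n_f)
    n, _, n_c = image.shape
    n_f, f = filters.shape[0], filters.shape[1]
    out = np.zeros((n - f + 1, n - f + 1, n_f))
    for k in range(n_f):
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                # sum over all f*f*n_c products (27 for a 3x3x3 filter)
                out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
    return out

rgb = np.random.rand(6, 6, 3)
print(conv_volume(rgb, np.random.rand(2, 3, 3, 3)).shape)  # (4, 4, 2)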

One layer of a Convolutional Network

6 x 6 x 3 * 3 x 3 x 3 (filter 1) -> relu(4 x 4 + b1) -> 4 x 4

6 x 6 x 3 * 3 x 3 x 3 (filter 2) -> relu(4 x 4 + b2) -> 4 x 4; stack the two -> 4 x 4 x 2

Similar to

Z^[1] = W^[1]a^[0] + b^[1]

a^[1] = g(Z^[1])

10 filters that are 3x3x3: 27 weights + 1 bias = 28 parameters per filter, so 280 parameters total (independent of the input image size).

f^[l] = filter size. p^[l] = padding. s^[l] = stride

Input: n_h^[l-1] x n_w^[l-1] x n_c^[l-1]

Output: n_h^[l] x n_w^[l] x n_c^[l]

n_h^[l] (and likewise n_w^[l]) = floor((n_h^[l-1] + 2p^[l] - f^[l]) / s^[l] + 1)

each filter is: f^[l] x f^[l] x n_c^[l-1]

activations: a^[l] -> n_h^[l] x n_w^[l] x n_c^[l]

For batch gradient descent over m examples: A^[l] is m x n_h^[l] x n_w^[l] x n_c^[l]

Weights: f^[l] x f^[l] x n_c^[l-1] x n_c^[l]

Bias: n_c^[l] biases, shaped (1, 1, 1, n_c^[l])

Simple Convolutional Network Example

39 x 39 x 3; n_H^[0] = n_W^[0] = 39; n_c^[0] = 3

f^[1] = 3; s^[1] = 1; p^[1] = 0; 10 filters

Next layer: 37x37x10, since (n + 2p - f)/s + 1 = (39 - 3)/1 + 1 = 37

f^[2] = 5; s^[2] = 2; p^[2] = 0

Next layer: 17x17x20

f^[3] = 5; s^[3] = 2; 40 filters

Next layer: 7x7x40

Flatten this (7x7x40 = 1,960) to a vector and feed it to a logistic regression / softmax unit -> y^
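A Keras sketch of this network (the 10-class softmax and other details beyond the notes are assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(39, 39, 3)),
    tf.keras.layers.Conv2D(10, 3, strides=1, activation="relu"),  # -> 37x37x10
    tf.keras.layers.Conv2D(20, 5, strides=2, activation="relu"),  # -> 17x17x20
    tf.keras.layers.Conv2D(40, 5, strides=2, activation="relu"),  # -> 7x7x40
    tf.keras.layers.Flatten(),                                    # -> 1,960
    tf.keras.layers.Dense(10, activation="softmax"),              # assumed 10 classes
])
model.summary()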

Types of layers: Convolutions (CONV), Pooling (POOL), Fully connected (FC)

Pooling layers

Pooling speeds up computation and makes detected features more robust.

Max pooling:

4x4 grid. Split into 2x2 grids then keep the max value in each grid -> 2x2

Hyper-parameters: f=2, s=2 (halves height and width). Intuition: a large value means a feature (e.g. a cat whisker) was detected, and max pooling preserves it. No parameters to learn.

Another example: 5x5 with f=3, s=1 -> 3x3. Take the maximum value at each filter position.

Max pooling is done independently on each channel.

Average pooling:

Average the values in the filter.

e.g. 7x7x1000 -> 1x1x1000 (with a 7x7 filter).
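A NumPy sketch of max pooling on one channel (apply it to each channel independently):

import numpy as np

def max_pool(x, f=2, s=2):
    h, w = x.shape
    out = np.zeros(((h - f) // s + 1, (w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()  # keep the max of each window
    return out

print(max_pool(np.arange(16).reshape(4, 4)))  # 4x4 -> 2x2, max of each 2x2 block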

CNN Example

A LeNet-5-like network:

32x32x3. f=5, s=1 ->

28x28x6 (CONV1, 6 filters). max pooling f=2, s=2 ->

14x14x6 (POOL1). Both of these can be layer 1. f=5, s=1 ->

10x10x16 (CONV2, 16 filters). max pooling f=2, s=2 ->

5x5x16 (POOL2). Both of these can be layer 2. Flatten this to 400 x 1 ->

120x1 (FC3). W^[3] is (120, 400), b^[3] is (120, 1) ->

84x1 (FC4) ->

Softmax (10 outputs).

Choose hyper-parameters by looking at what has worked in the literature.

Throughout the network n_H, n_W decrease and n_C increase

Layer               Activation shape   Activation size   # parameters
Input               (32, 32, 3)        3,072 (a^[0])     0
CONV1 (f=5, s=1)    (28, 28, 8)        6,272             608 = (5*5*3 + 1)*8
POOL1               (14, 14, 8)        1,568             0
CONV2 (f=5, s=1)    (10, 10, 16)       1,600             3,216 = (5*5*8 + 1)*16
POOL2               (5, 5, 16)         400               0
FC3                 (120, 1)           120               48,120 = 400*120 + 120
FC4                 (84, 1)            84                10,164 = 120*84 + 84
Softmax             (10, 1)            10                850 = 84*10 + 10

Why Convolutions?

32x32x3 (3,072 values), f=5, 6 filters -> 28x28x6 (4,704 values). A fully connected layer between these would need a huge weight matrix (3,072 x 4,704 ≈ 14M weights).

Number of parameters = (5 * 5 * 3 + 1) * 6 = 456 (76 per filter).
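A quick check of both counts in plain Python (using parameters = (f*f*n_c_prev + 1) * #filters, as in the notation section above):

conv_params = (5 * 5 * 3 + 1) * 6    # 6 filters, 76 parameters each
print(conv_params)                    # 456
print(32 * 32 * 3 * 28 * 28 * 6)      # 14,450,688 weights for an equivalent FC layer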

Parameter sharing: A feature detector that is useful in one part of the image is useful in another part of the image.

Sparsity of connections: In each layer, each output value depends only on a small number of inputs.

Cat detector

(x^(1), y^(1))...(x^(m),y^(m))

Cost: J = (1/m) sum_{i=1}^{m} L(yhat^(i), y^(i))

Use gradient descent to optimize parameters to reduce J

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Module 2: Deep convolutional models: case studies

Why Look at case studies?

Classic networks: LeNet-5, AlexNet, VGG. Then: ResNet and Inception.

Classic Networks

LeNet-5 (32x32x1) -> 5x5, s=1

28x28x6 -> avg pool, f=2, s=2

14x14x6 -> 5x5, s=1

10x10x16 -> avg pool f=2, s=2

5x5x16 (400) -> FC

120 neurons -> FC

84 neurons ->

1 neuron (yhat)


AlexNet (227x227x3) -> 11x11, s=4

55x55x96 -> max pool, 3x3, s=2

27x27x96 -> 5x5, same

27x27x256 -> max pool, 3x3, s=2

13x13x256 -> 3x3, same (384 filters)

13x13x384 -> 3x3, same (384 filters)

13x13x384 -> 3x3, same (256 filters)

13x13x256 -> max pool, 3x3, s=2

6x6x256 -> FC

4096 neurons -> FC

4096 neurons ->

Softmax, 1000 outputs

Used local response normalization (LRN) to normalize values across channels; it turned out not to have much effect.


VGG-16 (224x224x3) -> conv 64 x 2

224x224x64 -> pool

112x112x64 -> conv 128 x 2

112x112x128 -> pool

56x56x128 -> conv 256 x 3

56x56x256 -> pool

28x28x256 -> conv 512 x 3

28x28x512 -> pool

14x14x512 -> conv 512 x 3

14x14x512 -> pool

7x7x512 -> FC

4096 -> FC

4096 ->

Softmax (1000 classes)

Residual Networks (ResNet)

Very deep plain networks struggle with vanishing and exploding gradients.

Residual block (reference)

a^[l] -> a^[l+1] -> a^[l+2]

a^[l] -> linear -> RelU -> a^[l + 1] -> Linear -> RelU -> a^[l + 2]

z^[l + 1] = W^[l + 1] a^[l] + b^[l + 1]; a^[l + 1] = g(z^[l + 1])

z^[l + 2] = W^[l + 2] a^[l + 1] + b^[l + 2]; a^[l + 2] = g(z^[l + 2])


Take a^[l] and add it in just before the second ReLU (shortcut / skip connection):

a^[l+2] = g(z^[l+2] + a^[l])


e.g. a 10-layer plain network -> 5 stacked residual blocks.

In reality, a plain network's training error starts increasing once it gets very deep - an optimization problem (it's training error, so not over-fitting).

With ResNets, training error keeps decreasing as layers are added.

Why ResNets Work

x -> Big NN -> a^[l]

x -> Big NN -> a^[l] -> [residual block] -> a^[l+2]; assume ReLU activations, so a >= 0.

a^[l+2] = g(z^[l+2] + a^[l]) = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l]). L2 regularization will shrink W^[l+2].

If W^[l+2] = 0 and b^[l+2] = 0 => a^[l+2] = g(a^[l]) = a^[l] (since a^[l] >= 0).

Identity function is easy for residual block to learn

Adding a residual block doesn't hurt performance, and it helps if the block learns something useful.

Use "same" convolutions so that a^[l] has the same dimensions as z^[l+2] and can be added to it.

If the dimensions differ (e.g. a pooling layer changed them), multiply by a matrix W_s first:

a^[l+2] = g(z^[l+2] + W_s a^[l]); e.g. W_s is 256 x 128 when a^[l] has 128 units and a^[l+2] has 256.
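A Keras functional-API sketch of a residual block (the downsampling option and the 1x1 projection used for W_s are assumptions about one common implementation, not course code):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, filters, downsample=False):
    strides = 2 if downsample else 1
    z = layers.Conv2D(filters, 3, strides=strides, padding="same")(a_l)  # "same" convs
    a = layers.ReLU()(z)
    z = layers.Conv2D(filters, 3, padding="same")(a)
    shortcut = a_l
    if downsample or a_l.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=strides)(a_l)  # W_s projection
    return layers.ReLU()(z + shortcut)  # a^[l+2] = g(z^[l+2] + a^[l])

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 128, downsample=True)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 28, 28, 128)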

Networks in Networks and 1x1 Convolutions

reference

If there is 1 channel, a 1x1 convolution simply scales the input.

If there are 32 channels it acts like a small fully connected layer applied at every position: take the element-wise product of the 32 channels with the 32 numbers in the filter, sum them, and apply ReLU.

6x6x32 * 1x1x32 -> 6x6x(#filters)

28x28x192 * 1x1x192 (32 filters) = 28x28x32. 1x1 convolutions can shrink the number of channels.
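The channel-shrinking example in Keras (a sketch):

import tensorflow as tf

x = tf.keras.Input(shape=(28, 28, 192))
y = tf.keras.layers.Conv2D(32, 1, activation="relu")(x)  # 32 filters of 1x1x192
print(tf.keras.Model(x, y).output_shape)  # (None, 28, 28, 32)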

Inception Network Motivation

reference (GoogLeNet)

What size filter? Pooling?

28x28x192 * 1x1 = 28x28x64

* 3x3 = 28x28x128 (Stack this volume next to the first volume)

* 5x5 = 28x28x32

* max pool (with padding) = 28x28x32

Computationally expensive

28x28x192 * CONV 5x5,same,32 = 28x28x32

32 filters of 5x5x192

Multiplications: 28x28x32 output values x 5x5x192 each ≈ 120M.

28x28x192 -> CONV 1x1 (16 filters of 1x1x192) -> 28x28x16 (bottleneck layer) -> CONV 5x5, same (32 filters of 5x5x16) -> 28x28x32

Multiplications: 28x28x16 x 192 ≈ 2.4M, plus 28x28x32 x 5x5x16 ≈ 10M; total ≈ 12.4M, about a tenth of 120M.
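Checking the multiplication counts in plain Python:

direct = 28 * 28 * 32 * 5 * 5 * 192   # 5x5 conv applied directly
print(f"{direct:,}")                   # 120,422,400 (~120M)
bottleneck = 28 * 28 * 16 * 192        # 1x1 conv down to 16 channels
second = 28 * 28 * 32 * 5 * 5 * 16     # 5x5 conv on the bottleneck
print(f"{bottleneck + second:,}")      # 12,443,648 (~12.4M)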

Inception Network

Previous activation (28x28x192) -> 1x1 CONV -> 3x3 CONV -> 28x28x128

-> 1x1 CONV -> 5x5 CONV -> 28x28x32

-> 1x1 CONV -> 28x28x64

-> 3x3 MAXPOOL (same, s=1) -> 1x1 CONV -> 28x28x32

-> Channel concat -> 28x28x256 (64 + 128 + 32 + 32)

Side branches take intermediate activations and make softmax predictions (a regularizing effect).
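A Keras sketch of one inception module; the filter counts follow the example above, while the 96-filter 3x3 bottleneck is an assumption (the notes don't give it):

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x):
    b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)   # assumed bottleneck
    b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)   # bottleneck
    b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])  # 64 + 128 + 32 + 32 = 256 channels

inputs = tf.keras.Input(shape=(28, 28, 192))
print(tf.keras.Model(inputs, inception_module(inputs)).output_shape)  # (None, 28, 28, 256)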

Using Open Source Implementations

git clone https://github.com/KaimingHe/deep-residual-networks.git
cd deep-residual-networks
cd prototxt
more ResNet-101-...
# Uses Caffe

Transfer Learning

ImageNet,... datasets you can use.

You can download pre-trained weights.

If you are only classifying 3 classes, remove the softmax layer and add your own 3-way softmax layer; train only its parameters.

Freeze the other layers' parameters (frameworks usually expose a per-layer freeze or trainable flag). Since frozen layers never change, you can pre-compute their output activations for every training image, save them to disk, and train only the small layer on top.

With more data, freeze fewer layers (e.g. only the first few) and train the remaining layers, or replace them with your own smaller layers.

With lots of data, use the downloaded weights just as initialization and train the whole network.
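A Keras sketch of the freeze-and-replace-softmax recipe (ResNet50 and the 3-class head are illustrative choices, not from the notes):

import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze all pre-trained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # your own softmax layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# With more data: set base.trainable = True (or unfreeze only later layers)
# and fine-tune with a small learning rate.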

Data Augmentation

Mirroring (flip image)

Random cropping

Rotation

Shearing

Local warping

Color shifting (R+20, G-20, B+20)

PCA color augmentation used in AlexNet

Training data stored on hard disk -> CPU threads load images and apply distortions, forming mini-batches -> training (often on the GPU, in parallel with loading).
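A sketch of a few of these augmentations with tf.image (the deltas and crop size are illustrative choices):

import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)           # mirroring
    image = tf.image.random_crop(image, size=(224, 224, 3))  # random cropping
    image = tf.image.random_hue(image, max_delta=0.1)        # color shifting
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image

# Typically applied on CPU in the input pipeline, e.g.:
# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)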

State of Computer Vision

Little data <------------------------------------------------> Lots of data
More hand-engineering ("hacks")                                 Simpler algorithms, less hand-engineering

On this spectrum, roughly: object detection (bounding boxes, relatively little labeled data) < image recognition < speech recognition (lots of data).

Two sources of knowledge: labeled data, and hand-engineered features / network architecture / other components.

Tips for doing well on benchmarks:

  • Ensembling: train several networks independently (3-15) and average their outputs (y^)
  • Multi-crop at test time (run classifier on multiple versions of test images and average results) e.g. 4 crops of an image
  • Use open source; architectures of networks published; pretrained models


Keras documentation: https://keras.io/models/model/

With keras if you run fit() again, the model will continue to train with the parameters it has already learnt instead of reinitializing them

Module 3: Detection algorithms

Object Localization

Drawing a bounding box (localization).

Classification with localization - one object in image.

Detection - Multiple objects (e.g. pedestrian, car, motorcycle, background)

The NN outputs four more numbers (b_x, b_y, b_h, b_w) as well as a class label. The upper left of the image is (0,0), the lower right (1,1). (b_x, b_y) is the midpoint of the object, b_h its height, b_w its width.

y = [p_c (is there an object?), b_x, b_y, b_h, b_w, c_1, c_2, c_3 (classes)]; 8 components.

L(y^, y) = squared error: (y^_1 - y_1)^2 + (y^_2 - y_2)^2 + ... + (y^_8 - y_8)^2, if y_1 = 1.

If p_c = 0 then only (y^_1 - y_1)^2 matters; don't care about the other components.

Landmark Detection

If you are interested in specific points you can add landmark outputs l_1x, l_1y, l_2x, l_2y, ..., l_nx, l_ny, e.g. landmarks on a face.

Human pose estimation: landmarks on the body, e.g. shoulder, head, foot.

Object Detection

Have a dataset of closely cropped images that either contain a car or not (1 or 0); train a ConvNet on it.

Sliding window detection: run each windowed region of the image through the ConvNet.

Repeat with progressively larger windows. However, this is computationally expensive.

Convolutional Implementation of Sliding Windows

Reference

FC layers -> Conv layers.

The 5x5 FC layer becomes 400 filters of 5x5x16 -> 1x1x400

-> 1x1 CONV (400 filters of 1x1x400) -> 1x1x400

-> 1x1 CONV (4 filters) -> 1x1x4 (softmax over 4 classes)

The model takes 14x14x3 inputs but the test image is 16x16x3. Instead of running four 14x14 crops (stride 2) through the ConvNet separately, run the full 16x16x3 image through it once: the output is 2x2x4, one prediction per window position.

The sliding windows share a lot of information.

Add padding to the image as needed so you only have to run the CNN once.

Bounding Box Predictions

YOLO reference and reference2

Split the image into a grid (e.g. 3x3 = 9 cells) and create a label per cell (8 numbers: p_c, the bounding box, and 3 classes) -> 3x3x8 output volume.

b_h and b_w can be greater than 1 (an object can be larger than its grid cell); b_x and b_y lie between 0 and 1.

Intersection Over Union

(IoU). Union is the area covered by either box; intersection is the area shared by both. IoU = intersection / union. A prediction counts as correct if IoU >= 0.5.
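A sketch of IoU with boxes given as (x1, y1, x2, y2) corners (the course's midpoint format would be converted to corners first):

def iou(box1, box2):
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~ 0.143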

Non-max Suppression

Detect each object only once.

An object's midpoint should fall in only one grid cell, but in practice nearby cells fire too, giving multiple detections per object.

p_c is the probability of a detection. Take the box with the largest p_c and highlight it.

Boxes with high IoU with the chosen box get suppressed.

Discard all boxes with p_c <= 0.6. Then, while boxes remain: pick the box with the highest p_c, output it, and discard any remaining box with IoU >= 0.5 against it.
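A sketch of that loop, reusing the iou() helper above; boxes are (p_c, x1, y1, x2, y2) tuples (format assumed):

def non_max_suppression(boxes, pc_threshold=0.6, iou_threshold=0.5):
    boxes = [b for b in boxes if b[0] > pc_threshold]  # discard low-confidence boxes
    boxes.sort(key=lambda b: b[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)  # highest remaining p_c
        kept.append(best)
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]
    return kept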

Anchor Boxes

What if a grid cell wants to detect multiple objects (e.g. a pedestrian in front of a car)?

Pre-define two different shapes (anchor boxes).

8 outputs for anchor box 1, then 8 outputs for anchor box 2. Output is 3x3x16, i.e. 3x3x2x8.

Each object is assigned to a (grid cell, anchor box) pair.

YOLO Algorithm

y is 3 x 3 x 2 (anchors) x 8 (5 + # of classes) = 3 x 3 x 16.

For each grid cell get 2 predicted bounding boxes

For each class use non-max suppression to generate final predictions.

Region Proposals

R-CNN reference

A segmentation algorithm proposes regions that could be objects (~2,000 blobs); run a classifier on each blob. Output: label + bounding box.

Fast R-CNN reference

Propose regions, then use a convolutional implementation of sliding windows to classify all the proposed regions.

Faster R-CNN reference

Use CNN to propose regions


Object detection

https://www.drive.ai/ - car dataset

Deep CNN with a reduction factor of 32: (608 x 608) -> (19 x 19). 80 classes -> 85 outputs per box (5 + 80). 5 anchor boxes: 5 x 85 = 425 outputs per grid cell. The model predicts 19x19x5 = 1,805 boxes.

Module 4: Special applications: Face recognition & Neural style transfer

What is facial recognition?

Face recognition plus liveness detection (i.e. don't accept a photo of the person).

Face verification vs. face recognition. Verification is 1:1 ("is this person who they claim to be?") and needs ~99.9% accuracy to be usable for recognition. Recognition: given a database of K persons and an input image, output which of the K persons it is (or "not recognized").

One Shot Learning

Recognize a person with only one picture.

Image of person -> CNN -> softmax (n people + none) doesn't really work: one example per person is too little training data, and adding a new person means retraining.

Learn a similarity function

d(img1, img2) = Degree of difference between images

If d(img1, img2) <= tau, predict "same person"; if > tau, "different".

Siamese network

reference (DeepFace)

x^(1) image -> CNN -> FC f(x^(1)) as 128 numbers (encoding of x^(1))

x^(2) -> 128 numbers (encoding of x^(2))

d(x^(1), x^(2)) = ||f(x^(1)) - f(x^(2))||^2

Run the two CNNs (sharing the same weights) in parallel.

Learn parameters so that if x^(1) and x^(2) are the same person, ||f(x^(1)) - f(x^(2))||^2 is small (and large for different people).

Triplet loss

reference (FaceNet)

Anchor & Positive (same person): want d(A, P) small, e.g. 0.5

Anchor & Negative (different person): want d(A, N) larger, e.g. 0.7

Want ||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2

||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha (margin) <= 0. The margin rules out the trivial solution where all encodings are equal (e.g. all zeros).

Given 3 images A, P, N

L(A, P, N) = max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)

J = sum(L(A, P, N) )
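A NumPy sketch of the triplet loss for one (A, P, N) triplet of 128-dimensional encodings:

import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos = np.sum((f_a - f_p) ** 2)  # ||f(A) - f(P)||^2
    neg = np.sum((f_a - f_n) ** 2)  # ||f(A) - f(N)||^2
    return max(pos - neg + alpha, 0.0)

# J = sum of triplet_loss over all training triplets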

Training set: 10k pictures of 1k persons. Need ~10 pictures of each person.

After training apply to one-shot learning.

If A, P, N are chosen randomly then d(A, P) + alpha <= d(A, N) is easily satisfied.

||f(A) - f(P)||^2 + alpha <= ||f(A) - f(N)||^2

Choose triplets that are "hard" to train on

Choose triplets where d(A, P) + alpha <= d(A, N) is hard to satisfy; this forces gradient descent to do real work and makes training computationally efficient.

Face Verification and Binary Classification

Alternative to triplet loss: feed the pair of encodings into a logistic regression unit that outputs 1 if the two images are the same person and 0 if different.

y^ = sigma( sum_k w_k * |f(x^(i))_k - f(x^(j))_k| + b ), where k runs over the 128 components of the encoding.

The chi-squared similarity from the DeepFace paper can be used instead: (f(x^(i))_k - f(x^(j))_k)^2 / (f(x^(i))_k + f(x^(j))_k).

You can pre-compute the encodings of the images in the database.

Ways to improve facial recognition.

  • Put more images of each person (under different lighting conditions, taken on different days, etc.) into the database. Then given a new image, compare the new face to multiple pictures of the person. This would increase accuracy.
  • Crop the images to just contain the face, and less of the "border" region around the face. This preprocessing removes some of the irrelevant pixels around the face, and also makes the algorithm more robust.

What is neural style transfer

Take a content image (C) and re-render it in the style (S) of, for example, a Van Gogh painting.

Content (C) + Style (S) => Generated Image (G)

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that

What are deep ConvNets learning?

reference

AlexNet

Pick a unit in layer 1. Find the nine image patches that maximize that unit's activation. Layer-1 patches typically show simple features such as edges.

Other hidden units look for edges in different orientations, or for particular color groupings.

Deeper layers see larger image patches and respond to increasingly complex features.

Neural Style Transfer Cost Function

reference

Cost function: J(G) = alpha * Jcontent(C, G) + beta * Jstyle(S, G), where Jcontent measures how similar the content of G is to C, and Jstyle how similar the style of G is to S.

Initialize G randomly (e.g. 100 x 100 x 3).

Use gradient descent to minimize J(G).

G := G - (learning rate) * dJ(G)/dG

Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!
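A TensorFlow sketch of optimizing the pixels of G; compute_cost is a hypothetical placeholder standing in for alpha*Jcontent + beta*Jstyle:

import tensorflow as tf

def compute_cost(G):
    # hypothetical stand-in for alpha * Jcontent(C, G) + beta * Jstyle(S, G)
    return tf.reduce_sum(G ** 2)

G = tf.Variable(tf.random.uniform((1, 100, 100, 3)))  # initialize G randomly
optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
for step in range(1000):
    with tf.GradientTape() as tape:
        J = compute_cost(G)
    grads = tape.gradient(J, G)
    optimizer.apply_gradients([(grads, G)])     # G := G - lr * dJ/dG
    G.assign(tf.clip_by_value(G, 0.0, 1.0))     # keep pixel values valid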

Content Cost Function

Jcontent(C, G)

Use a hidden layer l to compute the content cost - usually one in the middle of the network.

Use a pre-trained ConvNet (e.g. VGG network).

Let a^[l](C) and a^[l](G) be the activation of layer l on the images

If a^[l](C) and a^[l](G) are similar, both images have similar content

Jcontent(C, G) = ||a^[l](C) - a^[l](G)||^2

Style Cost Function

Use layer l to measure the style.

Style is defined as correlation between activations across channels.

e.g. 5 channels: look at pairs of channel activations over all (n_h, n_w) positions. Correlated: e.g. vertical-line texture tends to occur together with orange color. Uncorrelated: vertical lines occur without the orange color.

Style matrix

a^[l]_{i,j,k} = activation at (i, j, k). G^[l] is n_c^[l] x n_c^[l].

G^[l]_{k,k'} = sum_i sum_j a^[l]_{i,j,k} * a^[l]_{i,j,k'}. Unnormalized cross-covariance: the "Gram matrix".

Compute the style matrix of image S and of image G.

J^[l]_style(S, G) = (1 / (2 n_H^[l] n_W^[l] n_C^[l])^2) * ||G^[l](S) - G^[l](G)||_F^2 (squared Frobenius norm: sum of squared element-wise differences).

Jstyle(S, G) = sum_l lambda^[l] * J^[l]_style(S, G) (sum over layers).
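A NumPy sketch of the Gram matrix and the per-layer style cost:

import numpy as np

def gram_matrix(a):
    # a: activations (n_H, n_W, n_C) -> (n_C, n_C) unnormalized cross-covariance
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)  # one row per (i, j) position
    return flat.T @ flat              # G_kk' = sum_ij a_ijk * a_ijk'

def layer_style_cost(a_S, a_G):
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return np.sum(diff ** 2) / (2 * n_H * n_W * n_C) ** 2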

1D and 3D Generalizations

Can apply to 1D data and 3D data (not just 2D).

1D data, e.g. an ECG time series. Convolve with a 1D filter (e.g. a Gaussian-like bump): 14 * 5 -> 10 (x number of filters).

3D data, e.g. an MRI scan, or movies (n_h x n_w x n_d). Apply a 3D filter: 14x14x14 * 5x5x5 -> 10x10x10 (x number of filters). Detects features across all three dimensions.

