Convolutional Network Design
LeNet-5
This is an early convolutional neural network design by LeCun et al.
Conv - pool - Conv - pool - FC - FC
Filters are 5x5 at stride 1; pooling is 2x2 at stride 2
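The conv-pool-conv-pool-FC-FC pattern above can be sketched in PyTorch (a hypothetical minimal version: channel counts 6 and 16 and the FC sizes follow the original paper, and a tanh non-linearity is assumed):

```python
import torch
import torch.nn as nn

# Sketch of the LeNet-5 pattern: conv - pool - conv - pool - FC - FC.
# Input is a 1x32x32 grayscale image.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32 -> 28x28
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),      # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5, stride=1),  # 14x14 -> 10x10
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),      # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                          # 10 class scores
)

out = lenet5(torch.randn(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 10])
```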
AlexNet (ILSVRC 12 winner)
A more complex convolutional network than LeNet.
Bigger filters in the early layers.
Conv1(11x11) | MaxPool| Conv2(5x5) |MaxPool| Conv3, 4 and 5(3x3) |MaxPool | FC6... FC8
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
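The spatial sizes in the table above can be checked with the standard output-size formula (a quick sketch; nothing here is AlexNet-specific beyond the filter/stride/pad numbers):

```python
def out_size(w, f, s, p):
    """Spatial output size of a conv/pool layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

w = 227                      # input 227x227x3
w = out_size(w, 11, 4, 0); print("CONV1  :", w)  # 55
w = out_size(w, 3, 2, 0);  print("POOL1  :", w)  # 27
w = out_size(w, 5, 1, 2);  print("CONV2  :", w)  # 27
w = out_size(w, 3, 2, 0);  print("POOL2  :", w)  # 13
w = out_size(w, 3, 1, 1);  print("CONV3-5:", w)  # 13 (3x3, stride 1, pad 1 keeps size)
w = out_size(w, 3, 2, 0);  print("POOL3  :", w)  # 6
```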
First use of ReLU
used normalization layers (local response normalization, not batch norm)
dropout 0.5
batch size 128
Momentum 0.9
learning rate 1e-2, reduced by 10x when validation accuracy plateaus
L2 weight decay 5e-4
...
ZFNet (ILSVRC 13 winner)
A fine-tuning of AlexNet.
Changed the 11x11 (stride 4) Conv1 filter to 7x7 (stride 2)
Increased the number of filters in Conv3, 4 and 5
VGGNet (ILSVRC 14 runner-up)
Key idea is: smaller filters but a deeper network
Only 3x3 filters. Two stacked 3x3 convs have the same effective receptive field as one 5x5 conv, and three stacked 3x3 convs match one 7x7 conv
Simply stack conv, pool and FC layers.
conv1,2 (3x3) |pool| conv3,4 (3x3) |pool| conv5,6(3x3) |pool| conv7,8,9(3x3) |pool| conv10,11,12(3x3) |pool| FC 13,14,15
So effectively it is similar to
conv1 (5x5) |pool| conv2(5x5) |pool| conv3(5x5) |pool| conv4(7x7) |pool| conv5(7x7) |pool| FC6,7,8
but uses fewer parameters and has more non-linearities
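The parameter saving can be checked with simple arithmetic (C is an illustrative channel count, not a VGG-specific number; biases ignored):

```python
# Weights needed to map C input channels to C output channels.
C = 256  # illustrative channel count
two_3x3   = 2 * (3 * 3 * C * C)  # two stacked 3x3 convs: 5x5 effective receptive field
one_5x5   = 5 * 5 * C * C
three_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 convs: 7x7 effective receptive field
one_7x7   = 7 * 7 * C * C
print(two_3x3, "vs", one_5x5)    # 18*C^2 vs 25*C^2
print(three_3x3, "vs", one_7x7)  # 27*C^2 vs 49*C^2
```

The stacked version is cheaper in both cases, and each extra conv layer adds another non-linearity.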
GoogLeNet (ILSVRC 14 winner)
Key idea is: Deeper network, Inception module
22 layers, no FC layers (so far fewer parameters)
Inception module
An inception module is a good local network topology (a "network within a network"); these modules are stacked on top of each other.
The output from the previous layer is fed, in parallel, into:
1. 3x3 max pooling (to carry forward the pooled activations)
2. 1x1 conv (preserving the receptive field of the previous layers)
3. 3x3 conv
4. 5x5 conv
The stride and padding are chosen so all 4 branches yield the same spatial size (with different channel counts).
Then a depth concatenation combines the 4 outputs into 1, i.e. the channels are stacked and the spatial size is unchanged.
                 previous layer
     ________________|_______________________
     |            |           |             |
 3x3 maxpool   1x1 conv    3x3 conv     5x5 conv
 stride 1      stride 1    stride 1     stride 1
 padding 1     padding 0   padding 1    padding 2
     |____________|___________|_____________|
                      |
            concatenate channels
                 & output
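The depth concatenation at the bottom of the diagram can be sketched with NumPy (the branch channel counts here are illustrative, not GoogLeNet's actual numbers):

```python
import numpy as np

# Four branch outputs share the same spatial size (here 28x28)
# but have different channel counts.
h, w = 28, 28
branches = [
    np.zeros((h, w, 256)),  # 3x3 maxpool keeps the input's 256 channels
    np.zeros((h, w, 128)),  # 1x1 conv, 128 filters (illustrative)
    np.zeros((h, w, 192)),  # 3x3 conv, 192 filters (illustrative)
    np.zeros((h, w, 96)),   # 5x5 conv,  96 filters (illustrative)
]
out = np.concatenate(branches, axis=-1)  # stack along the channel axis
print(out.shape)  # (28, 28, 672): spatial size unchanged, channels added up
```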
But this naive module increases the computational cost (and parameter count) considerably.
So 1x1 convs are added before the 3x3 and 5x5 convs, and after the max pool, to reduce the number of channels first.
A 1x1 conv preserves the spatial dimensions but reduces the depth/channels.
Therefore, it becomes:
                 previous layer
     ________________|_______________________
     |            |           |             |
 3x3 maxpool   1x1 conv    1x1 conv     1x1 conv
 stride 1      stride 1       |             |
 padding 1     padding 0   3x3 conv     5x5 conv
     |                     stride 1     stride 1
 1x1 conv                  padding 1    padding 2
     |____________|___________|_____________|
                      |
            concatenate channels
                 & output
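The reduced module above can be sketched as a PyTorch module (channel counts are illustrative, not GoogLeNet's exact configuration):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Inception module with 1x1 reductions, as in the diagram above."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)            # 1x1 conv
        self.branch2 = nn.Sequential(                     # 1x1 reduce -> 3x3
            nn.Conv2d(in_ch, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 128, 3, padding=1))
        self.branch3 = nn.Sequential(                     # 1x1 reduce -> 5x5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch4 = nn.Sequential(                     # maxpool -> 1x1 reduce
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Every branch preserves HxW; concatenate along the channel dim.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

m = Inception(192)
y = m(torch.randn(1, 192, 28, 28))
print(y.shape)  # (1, 64 + 128 + 32 + 32 = 256, 28, 28)
```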
A full demonstration of GoogLeNet
ResNet (ILSVRC 15 winner)
A very deep network: 152 layers.
A traditional neural network tries to learn the underlying mapping H(x) of the input x directly.
ResNet instead makes the layers learn the residual F(x) = H(x) - x, on the assumption that the residual is easier to optimize than the original mapping.
A traditional design
x
|
--layer1--
|
--layer2--
|
H(x)
The two layers directly learn the underlying mapping H(x)
The ResNet design simply adds x back to the output F(x) two (or more) layers later. This connection, feeding x past the layers, is called a shortcut.
x _________
| |
--layer1-- |
| |
--layer2-- |
| |
| <-------|
= F(x) + x
As F(x) + x = H(x), the actual underlying mapping,
F(x) = H(x) - x is the residual.
So ResNet learns the residual.
The paper does not explain in depth why learning the residual is easier than learning the mapping itself.
But one extreme example is when the underlying mapping is the identity, y = x.
Then F(x) + x = x means F(x) = 0,
and it seems easier to drive F(x) to 0 than to learn y = x from scratch.
Perhaps in real cases many of the mappings (or some of them along the depth of the network) are close to the identity, so ResNet benefits from this.
The basic unit is as above, with x fed back in a few layers later. Skipping only a single layer makes little sense in this design: for a linear layer it is equivalent to y = Wx + x = (W + 1)x.
Also, the dimensions of x must match those of F(x). If they differ, x can be adjusted with a linear projection or by zero-padding.
Stacking these basic units (residual blocks) up to 100+ layers turns out to work really well, beating GoogLeNet and the other competitors.
A standard residual block in ResNet contains two 3x3 conv layers and no pooling.
- Periodically (every few residual blocks), downsample (halve) the spatial size with a stride-2 convolution and double the number of filters (channels).
- An additional conv layer (not inside a residual block) at the beginning, usually with a big filter (e.g. 7x7, stride 2) followed by pooling.
- A global average pooling after all residual blocks, then a fully connected layer that produces the output.
- Batch Normalization after every conv layer
- Xavier/2 initialization from He et al
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-4
- No dropout used
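The standard two-conv residual block (with batch norm after each conv, as listed above) might be sketched like this; a minimal illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # layer 1
        out = self.bn2(self.conv2(out))           # layer 2: F(x)
        return self.relu(out + x)                 # F(x) + x via the shortcut

block = ResidualBlock(64)
x = torch.randn(2, 64, 28, 28)
y = block(x)
print(y.shape)  # same shape as the input, so blocks can be stacked freely
```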
When ResNet gets deeper (50+ layers), a "bottleneck" design is used to improve efficiency, similar to GoogLeNet's 1x1 convs:
input 28x28x256 --> 1x1 conv, 64 --> 3x3 conv, 64 --> 1x1 conv, 256 --> output 28x28x256
Here the first 1x1 conv reduces the depth to 64, the 3x3 conv then operates in this lower-dimensional space, and the last 1x1 conv restores the depth to 256.
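The efficiency gain can be checked by counting multiplies (a rough sketch comparing the bottleneck against two plain 3x3 convs at the full 256 channels, as in a basic residual block; biases ignored):

```python
# Multiply counts on a 28x28x256 input.
h = w = 28
bottleneck = (h * w * (1 * 1 * 256 * 64)    # 1x1 reduce: 256 -> 64
              + h * w * (3 * 3 * 64 * 64)   # 3x3 conv in the 64-channel space
              + h * w * (1 * 1 * 64 * 256)) # 1x1 restore: 64 -> 256
plain = 2 * h * w * (3 * 3 * 256 * 256)     # two plain 3x3 convs at 256 channels
print(bottleneck, "vs", plain)              # the bottleneck is far cheaper
```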
A full 34 layer ResNet:
Inception V4
The Inception module used by GoogLeNet and the residual block used by ResNet are both so powerful
that people have combined the two: ResNet + Inception = Inception V4 (more precisely, the Inception-ResNet variants introduced alongside Inception V4).