Convolutional network design

LeNet-5

This is an early convolutional neural network design by LeCun.

  Conv - pool - Conv - pool - FC - FC

Conv filters are 5x5 at stride 1; pooling is 2x2 at stride 2.
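A minimal sketch of this layout, assuming PyTorch; the filter counts (6 and 16), the tanh non-linearity, average pooling, and the 32x32 grayscale input follow the original LeNet-5 paper and are not stated above.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.Tanh(),   # 32x32 -> 28x28
    nn.AvgPool2d(kernel_size=2, stride=2),                 # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(),  # 14x14 -> 10x10
    nn.AvgPool2d(kernel_size=2, stride=2),                 # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),                 # FC
    nn.Linear(120, 10),                                    # FC (class scores)
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])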

AlexNet (ILSVRC 12 winner)

A more complex convolutional network compared with LeNet.

Bigger filters at early layers.

Conv1 (11x11) | MaxPool | Conv2 (5x5) | MaxPool | Conv3, 4 and 5 (3x3) | MaxPool | FC6 ... FC8

[227x227x3] INPUT

[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0

[27x27x96] MAX POOL1: 3x3 filters at stride 2

[27x27x96] NORM1: Normalization layer

[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2

[13x13x256] MAX POOL2: 3x3 filters at stride 2

[13x13x256] NORM2: Normalization layer

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1

[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1

[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1

[6x6x256] MAX POOL3: 3x3 filters at stride 2

[4096] FC6: 4096 neurons

[4096] FC7: 4096 neurons

[1000] FC8: 1000 neurons (class scores)
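The spatial sizes listed above can be verified with the standard conv/pool output-size formula, out = (in - kernel + 2*pad) / stride + 1; a quick sanity check in Python:

def out_size(in_size, kernel, stride, pad=0):
    # Standard conv/pool output-size formula.
    return (in_size - kernel + 2 * pad) // stride + 1

s = 227                                      # INPUT     227x227x3
s = out_size(s, kernel=11, stride=4)         # CONV1     -> 55
s = out_size(s, kernel=3, stride=2)          # POOL1     -> 27
s = out_size(s, kernel=5, stride=1, pad=2)   # CONV2     -> 27
s = out_size(s, kernel=3, stride=2)          # POOL2     -> 13
s = out_size(s, kernel=3, stride=1, pad=1)   # CONV3/4/5 -> 13 (size unchanged)
s = out_size(s, kernel=3, stride=2)          # POOL3     -> 6
print(s)  # 6, matching the 6x6x256 volume flattened into FC6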

First use of ReLU

used local response normalization layers (not batch norm)

dropout 0.5

batch size 128

Momentum 0.9

learning rate 1e-2, reduced by a factor of 10 when validation accuracy plateaus

L2 weight decay 5e-4
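A minimal sketch of this training setup, assuming PyTorch; the `model` here is just a placeholder, and the scheduler's `patience` value is illustrative, not from the source:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for an AlexNet-style model

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # initial learning rate
    momentum=0.9,
    weight_decay=5e-4,  # L2 regularization
)
# Divide the learning rate by 10 when the monitored accuracy plateaus;
# patience=10 is an illustrative choice, not from the source.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10
)
# each epoch: scheduler.step(val_accuracy)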

...

ZFNet (ILSVRC 13 winner)

Fine tuning of AlexNet.

Changed the 11x11 stride-4 filters in Conv1 to 7x7 at stride 2

Wider Conv3, 4 and 5 (more filters: 512, 1024, 512 instead of 384, 384, 256)

VGGNET (ILSVRC 14 runner up)

Key idea: smaller filters, but a deeper network

Only 3x3 conv filters. A stack of two 3x3 conv layers has the same effective receptive field as one 5x5 layer, and three 3x3 layers match one 7x7 layer.

Simply stack conv, pool and FC layers.

conv1,2 (3x3) |pool| conv3,4 (3x3) |pool| conv5,6,7 (3x3) |pool| conv8,9,10 (3x3) |pool| conv11,12,13 (3x3) |pool| FC14,15,16   (the VGG-16 configuration)

So effectively it is similar to

conv1 (5x5) |pool| conv2 (5x5) |pool| conv3 (7x7) |pool| conv4 (7x7) |pool| conv5 (7x7) |pool| FC6,7,8

but uses fewer parameters and has more non-linearities
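A quick back-of-the-envelope comparison (ignoring biases), assuming C input and C output channels per conv layer; the channel count 256 is illustrative:

C = 256  # illustrative number of input/output channels

one_5x5 = 5 * 5 * C * C          # 25 * C^2 weights
two_3x3 = 2 * (3 * 3 * C * C)    # 18 * C^2 weights
one_7x7 = 7 * 7 * C * C          # 49 * C^2 weights
three_3x3 = 3 * (3 * 3 * C * C)  # 27 * C^2 weights

print(two_3x3 / one_5x5)    # 0.72  -> ~28% fewer weights, plus one extra ReLU
print(three_3x3 / one_7x7)  # ~0.55 -> ~45% fewer weights, plus two extra ReLUs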

GoogLeNet (ILSVRC 14 winner)

Key idea: a deeper network built from Inception modules

22 layers, no FC layers (so far fewer parameters)

Inception module

An Inception module is a good local network topology (a network within a network); GoogLeNet stacks these modules on top of each other.

The output of the previous layer is fed into four parallel operations:

   1. 3x3 max pooling (to carry forward the original activations in pooled form)

   2. 1x1 conv             (preserving the receptive field of the previous layers)

   3. 3x3 conv

   4. 5x5 conv

The stride and padding are adjusted so that all four branches above yield outputs of the same spatial size (but with different numbers of channels).

Then a depth concatenation is applied to combine the 4 outputs into 1 output, i.e. the channels are stacked while the spatial size stays the same.

                 previous layer
      __________________|____________________
     |            |            |            |
3x3 maxpool   1x1 conv     3x3 conv     5x5 conv
     |        stride 1     stride 1     stride 1
     |        padding 0    padding 1    padding 2
     |____________|____________|____________|
                        |
             concatenate channels
                    & output
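A minimal sketch of this naive module, assuming PyTorch. For the 3x3 max pool to preserve the spatial size it must also use stride 1 and padding 1 (assumed here); the input shape and branch channel counts are illustrative, not from the source:

import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)  # output of a previous layer (example shape)

pool = nn.MaxPool2d(3, stride=1, padding=1)(x)                   # 1x192x28x28
b1 = nn.Conv2d(192, 64, kernel_size=1, stride=1, padding=0)(x)   # 1x64x28x28
b3 = nn.Conv2d(192, 128, kernel_size=3, stride=1, padding=1)(x)  # 1x128x28x28
b5 = nn.Conv2d(192, 32, kernel_size=5, stride=1, padding=2)(x)   # 1x32x28x28

out = torch.cat([pool, b1, b3, b5], dim=1)  # stack along the channel dimension
print(out.shape)  # torch.Size([1, 416, 28, 28]) -- same 28x28, more channels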

 

Because the computational cost (and the number of parameters) of this naive module is high, a 1x1 conv is applied before the 3x3 and 5x5 convolutions and after the max pooling to reduce the number of channels.

A 1x1 conv preserves the spatial dimensions but reduces the depth/channels.

Therefore, it becomes: 

                 previous layer
      __________________|____________________
     |            |            |            |
3x3 maxpool   1x1 conv     1x1 conv     1x1 conv
     |        stride 1        |            |
     |        padding 0    3x3 conv     5x5 conv
     |                     stride 1     stride 1
 1x1 conv                  padding 1    padding 2
     |____________|____________|____________|
                        |
             concatenate channels
                    & output
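A sketch of this dimension-reduced module, assuming PyTorch; the per-branch channel counts follow the first Inception block of GoogLeNet (inception 3a) and should be treated as illustrative:

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # 1x1 conv branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU(inplace=True))
        # 1x1 reduction, then 3x3 conv
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # 1x1 reduction, then 5x5 conv
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        # 3x3 max pool, then 1x1 projection
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # All branches preserve the spatial size; concatenate along channels.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

m = InceptionModule(192, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])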

A full diagram of GoogLeNet:

ResNet (ILSVRC 15 winner)

A very deep network: 152 layers...

A traditional neural network learns an underlying mapping H(x) of the input x directly.

ResNet instead learns the residual F(x) = H(x) - x, assuming that it is easier to optimize the residual than to optimize the original mapping.

A traditional design

      x
      |
 --layer1--
      |
 --layer2--
      |
     H(x)

The two layers directly fit the mapping H(x).

The ResNet design instead adds x back to the layers' output F(x) two (or more) layers later. The connection that feeds x forward past these layers is called a shortcut (skip connection).

      x _________
      |         |
 --layer1--     |
      |         |
 --layer2--     |
      |         |
      | <-------|
      |
    = F(x) + x

As F(x) + x = H(x), the actual underlying mapping,

then F(x) = H(x) - x is the residual.

So the ResNet layers learn the residual.

There is not much explanation of why learning the residual is easier than learning the actual mapping.

But one extreme example is when the actual mapping is the identity function, i.e. H(x) = x.

Then F(x) + x = x means F(x) = 0.

It seems easier to set F(x) = 0 than to learn the actual y = x.

Maybe in real cases many of the mappings to be learned (or some of them along the learning track) are near the identity function, so ResNet would benefit from this.

The basic unit is as above, with x fed forward past a couple of layers. Skipping over just a single linear layer makes little sense in this design, as it is equivalent to y = Wx + x = (W + I)x, still a linear function.

Also, the dimensions of x must match those of F(x). If they are not the same size, apply a linear projection (e.g. a 1x1 conv) to adjust x's dimensions, or pad with zeros.

Stacking these basic units (residual blocks) up to 100+ layers appears to work really well, beating GoogLeNet and the other competitors.

A standard residual block in ResNet contains two 3x3 conv layers, without pooling (see the sketch after the list below).

- Periodically, after every few residual blocks, downsample (halve) the spatial size once with a stride-2 convolution and double the number of filters (channels).

- An additional conv layer (not in the form of a residual block) at the beginning, usually with a big filter (e.g. 7x7 at stride 2) followed by pooling.

- A global average pooling after all residual blocks, followed by a fully connected layer that serves as the output.

- Batch Normalization after every conv layer

- Xavier/2 initialization from He et al

- SGD + Momentum (0.9)

- Learning rate: 0.1, divided by 10 when validation error plateaus

- Mini-batch size 256

- Weight decay of 1e-4

- No dropout used
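A minimal sketch of the basic residual block described above, assuming PyTorch: two 3x3 convs with batch norm after each conv, an identity shortcut, and a 1x1 projection shortcut when the block downsamples or changes the number of channels:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection shortcut when the shape changes, identity otherwise.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first 3x3 conv
        out = self.bn2(self.conv2(out))           # second 3x3 conv: F(x)
        return self.relu(out + self.shortcut(x))  # F(x) + x

block = ResidualBlock(64, 64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])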

When ResNet gets deeper (e.g. 50+ layers), a "bottleneck" design is used to improve efficiency, similar to GoogLeNet's 1x1 conv reductions.

   input 28x28x256 --> conv 1x1x64 --> conv 3x3x64 --> conv 1x1x256 --> output 28x28x256

In this example the first 1x1 conv reduces the depth to 64, the 3x3 conv then operates in this lower-dimensional space, and the last 1x1 conv brings the depth back up to 256.
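A sketch of this bottleneck, assuming PyTorch and using the shapes from the example above:

import torch
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # 28x28x256 -> 28x28x64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 28x28x64  -> 28x28x64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),            # 28x28x64  -> 28x28x256
)
x = torch.randn(1, 256, 28, 28)
print(bottleneck(x).shape)  # torch.Size([1, 256, 28, 28])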

A full 34-layer ResNet:

Inception V4

The Inception module used by GoogLeNet and the residual block used by ResNet have both proven very powerful.

So people now combine the two: ResNet + Inception, as in Inception-v4 / Inception-ResNet.