Convolution notes

Number of filters & Number of parameters

There is a separate filter for every input channel. The convolution results from all input channels are summed and sent to an output channel.

When there is a second output channel, it doesn't reuse the previous filters; otherwise it would just duplicate the same information in the second channel.

Each output channel needs its own set of filters, one per input channel.

The total number of filters =  #input channels * #output channels 

The total filter weight parameters = #filters * filter size = #input channels * #output channels * filter width * filter height

A bias is applied at every output channel, i.e. (sum of conv results from all input channels) + bias -> output channel,

so the number of biases = # of output channels.

The total conv parameters = #weights + #biases =   #input channels * #output channels * filter width * filter height + #output channels 
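
A quick sanity check of the formula, sketched with PyTorch's nn.Conv2d (the channel counts and filter size below are just made-up example values):

    import torch.nn as nn

    in_ch, out_ch, kh, kw = 3, 16, 5, 5
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=(kh, kw))   # bias=True by default

    # weights = 3 * 16 * 5 * 5 = 1200, biases = 16
    n_params = sum(p.numel() for p in conv.parameters())
    print(n_params)                                          # 1216 = 1200 + 16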

1x1 convolution filter

A 1x1 filter essentially multiplies the whole input matrix by a single weight and yields an output matrix of the same shape.

For a multi-channel input, e.g. 28x28 images with 3 channels (R, G, B), a 1x1 filter is actually a 1x1x3 filter: technically three 1x1 filters, one applied to each channel separately. The output matrices from the three filters are added up into a single 28x28 output matrix.

A 1x1 conv (with the desired number of 1x1 filters) can be used to reduce or increase the channel dimensionality

e.g. an input with 28x28x128 channels --> 1x1 conv (64 filters)  --> 28x28x64 channels

     an input with 28x28x128 channels --> 1x1 conv (256 filters) --> 28x28x256 channels
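
For example, in PyTorch (a small sketch; only the channel count changes, the 28x28 spatial size is preserved):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 128, 28, 28)               # a 28x28 input with 128 channels
    reduce = nn.Conv2d(128, 64, kernel_size=1)    # 1x1 conv with 64 filters
    expand = nn.Conv2d(128, 256, kernel_size=1)   # 1x1 conv with 256 filters

    print(reduce(x).shape)    # torch.Size([1, 64, 28, 28])
    print(expand(x).shape)    # torch.Size([1, 256, 28, 28])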

 

Reduce the number of parameters

e.g. 28x28x64 -> (5x5 conv) -> 24x24x128 has 64 x 5 x 5 x 128 = 204800 weight parameters (ignoring biases)

If we use a 1x1 conv to reduce the channel dimension first:

e.g. 28x28x64 -> (1x1 conv) -> 28x28x32 -> (5x5 conv) -> 24x24x128 has 64 x 1 x 1 x 32 + 32 x 5 x 5 x 128 = 104448 weight parameters,

which nearly halves the # of parameters.
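
The same comparison sketched in PyTorch (counting weight parameters only, to match the arithmetic above):

    import torch.nn as nn

    def n_weights(m):
        # count weight parameters only, ignoring biases
        return sum(p.numel() for name, p in m.named_parameters() if "weight" in name)

    direct = nn.Conv2d(64, 128, kernel_size=5)                      # 28x28x64 -> 24x24x128
    bottleneck = nn.Sequential(nn.Conv2d(64, 32, kernel_size=1),    # 28x28x64 -> 28x28x32
                               nn.Conv2d(32, 128, kernel_size=5))   # 28x28x32 -> 24x24x128

    print(n_weights(direct))      # 204800
    print(n_weights(bottleneck))  # 104448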

Concatenated ReLU

For earlier convolutional layers, the filters learnt tend to be "paired" such that one filter is the opposite phase of another filter. 

i.e. filter1 = -filter2

Visualizing the filters, we can see them in pairs: the same pattern but with opposite values, so negating one filter roughly gives its paired filter.

The phenomenon is linked to the ReLU activation, which discards the negative phase (ReLU = max(0, x)): since negative responses are zeroed out, the network apparently compensates by learning a negated copy of each filter to capture them.

Anyway, instead of leaving the model to learn the redundant half of the filters, the idea is to construct a new activation function:

    Concatenated ReLU, or CReLU(x) = [ max(0, x), max(0, -x) ]

This takes the negative phase into account, and the negative phase is derived directly from the input by negation (-x).

So hopefully it achieves the same outcome without actually learning the negative phase.

When the negative phase of the data is already present in the activation output, the network doesn't need to learn the paired filters, so the pairing phenomenon disappears.

That also means you can reduce the # of filters by half without losing any learning capability.

Therefore, with CReLU, the activation output's size is doubled, but the # of filters (=# of channels) can be halved.

The network's parameters are thus halved as well!! 

This is important: fewer parameters mean less complexity and less overfitting, and training is faster too.

 

Experiments show that a network using CReLU with half the # of channels beats a network using ReLU with the full # of channels.

Note: when the activation output's size is doubled, some parameters increase as well, e.g. Batch Norm has parameters for every channel, so its parameter count scales with the size of the activation output.
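
A minimal sketch of CReLU in PyTorch (concatenating along the channel dimension of an N x C x H x W tensor):

    import torch
    import torch.nn.functional as F

    def crelu(x):
        # keep both phases, positive responses and negated-negative responses,
        # stacked along the channel dimension, so the channel count doubles
        return torch.cat([F.relu(x), F.relu(-x)], dim=1)

    x = torch.randn(1, 32, 28, 28)
    print(crelu(x).shape)          # torch.Size([1, 64, 28, 28])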

Leaky ReLU

ReLU can lose half of the information, since ReLU = max(0, x).

Leaky ReLU tries to keep a bit of information when x < 0: LeakyReLU(x) = x if x > 0, otherwise 0.01x.
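
A small comparison in PyTorch (negative_slope=0.01 matches the 0.01x above):

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
    print(F.relu(x))                              # tensor([0., 0., 0., 1.])
    print(F.leaky_relu(x, negative_slope=0.01))   # tensor([-0.0200, -0.0050,  0.0000,  1.0000])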

Receptive field

Using a 3x3 conv filter maps 3x3 neurons in layer1 to a neuron N in layer2, so the receptive field of N is 3x3.

If we use another 3x3 conv filter on layer2 to get layer3, a neuron M in layer3 corresponds to 3x3 neurons in layer2, which in turn correspond to 5x5 neurons in layer1.

So we say the receptive field of M in layer1 is 5x5.

The idea is that we can use 2 stacked 3x3 conv filters (without pooling in between) to simulate a 5x5 conv filter, or 3 stacked 3x3 filters to simulate a 7x7 filter.

The benefits are (1) a deeper network, hence more non-linearity, and (2) fewer parameters.

Assuming layer1 = 28x28x1, using one 5x5 filter requires 1x5x5 = 25 weights. Using two 3x3 filters requires 1x3x3 + 1x3x3 = 18 weights.
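
The same count sketched in PyTorch (single-channel convs, counting weights only):

    import torch.nn as nn

    one_5x5 = nn.Conv2d(1, 1, kernel_size=5)
    first_3x3 = nn.Conv2d(1, 1, kernel_size=3)
    second_3x3 = nn.Conv2d(1, 1, kernel_size=3)

    print(one_5x5.weight.numel())                                 # 25
    print(first_3x3.weight.numel() + second_3x3.weight.numel())   # 18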

Inception module

An inception module is a good local network topology (a network within a network); these modules are stacked on top of each other to build the full network.

The output from the previous layer is fed through 4 parallel branches:

   1. 3x3 max pooling (to carry on original output values?)

   2. 1x1 conv             (preserving the receptive field of the previous layers)

   3. 3x3 conv

   4. 5x5 conv

The stride and padding are adjusted so that all of the above 4 branches yield the same spatial size (with different numbers of channels).

Then a depth concatenation is applied to combine the 4 outputs into 1 output, i.e. the channels are stacked while the spatial size stays the same:

                     previous layer
       _____________________|_____________________
       |             |             |             |
  3x3 maxpool    1x1 conv      3x3 conv      5x5 conv
       |         stride 1      stride 1      stride 1
       |         padding 0     padding 1     padding 2
       |_____________|_____________|_____________|
                            |
                  concatenate channels
                         & output

Usually an extra 1x1 conv is applied before the 3x3 and 5x5 convs, and after the 3x3 maxpool to reduce dimensions.
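
A minimal sketch of such a module in PyTorch (the per-branch channel counts below are made up purely for illustration):

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(3, stride=1, padding=1),        # keeps the spatial size
                nn.Conv2d(in_ch, 32, kernel_size=1))         # 1x1 conv after the maxpool
            self.branch_1x1 = nn.Conv2d(in_ch, 64, kernel_size=1)
            self.branch_3x3 = nn.Sequential(
                nn.Conv2d(in_ch, 48, kernel_size=1),         # 1x1 reduces channels first
                nn.Conv2d(48, 64, kernel_size=3, padding=1))
            self.branch_5x5 = nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2))

        def forward(self, x):
            # every branch preserves HxW, so the outputs can be stacked along channels
            outs = [self.branch_pool(x), self.branch_1x1(x),
                    self.branch_3x3(x), self.branch_5x5(x)]
            return torch.cat(outs, dim=1)                    # depth concatenation

    x = torch.randn(1, 128, 28, 28)
    print(InceptionBlock(128)(x).shape)                      # torch.Size([1, 192, 28, 28])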

Residual Block

The ResNet design simply adds x to F(x) two (or more) layers later. The link that feeds x forward to those later layers is called a shortcut.

      x __________
      |          |
 --layer1--      |
      |          |
 --layer2--      |
      |          |
      + <--------|
      |
    = F(x) + x

Writing the underlying mapping (the actual function to learn) as H(x): since F(x) + x = H(x),

then F(x) = H(x) - x is the residual.

So ResNet learns the residual.
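
A minimal sketch of a two-layer residual block in PyTorch (batch norm omitted; the identity shortcut assumes the channel count and spatial size don't change across the block):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.layer1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
            self.layer2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.layer1(x))
            out = self.layer2(out)         # F(x)
            return self.relu(out + x)      # F(x) + x, the shortcut adds x back

    x = torch.randn(1, 64, 28, 28)
    print(ResidualBlock(64)(x).shape)      # torch.Size([1, 64, 28, 28])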