As mentioned above, one factor behind the current success of DL is a set of new algorithmic advances that have alleviated problems which previously prevented NN applications from working properly. These advances include:
◦ Rectified linear units (ReLU) as activation functions, which improve gradient backward flow (in contrast with sigmoid and hyperbolic-tangent functions, whose gradients saturate).
◦ Shortcut connections that link distant parts of the network through identity mappings.
◦ Batch normalization layers to alleviate the internal covariate shift problem [Ioffe 2015]. These methods have enabled training networks as deep as 1000 layers [He 2016b].
◦ Weight decay (e.g. L1, L2), which penalizes layer weights that grow too large.
◦ Dropout layers that block a random subset of units in a layer (usually around 50%) in each training cycle [Srivastava 2014]. The random blocking encourages kernels to learn more robust filters. At inference, all connections are used, with activations scaled by a corrective constant equal to the fraction of retained connections.
◦ Network pruning, a way to combat overfitting by discarding unimportant connections [Han 2016]. An additional advantage of this method is that the number of parameters is significantly reduced, leading to smaller memory and energy requirements.
◦ Deep compression, which significantly reduces the number of network parameters with the aim of shrinking memory requirements so that the whole deep network model can fit into on-chip memory. The process starts with network pruning, in which the importance of each connection is learnt; the network is then quantized with weight sharing, and finally Huffman coding is applied [Han 2015].
◦ Sparse computation, which imposes the use of sparse representations throughout the network, yielding memory and computation benefits.
◦ Low-precision data types [Konsor 2012], smaller than 32 bits (e.g. half-precision or integer), with experiments even in 1-bit computation [Courbariaux 2016]. This speeds up linear-algebra computation and greatly decreases memory consumption, at the cost of a slightly less accurate model. In recent years, most DNN frameworks have started to support 16-bit and 8-bit computation.
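The gradient-flow advantage of ReLU over sigmoid mentioned above can be illustrated with a minimal NumPy sketch; the function names are ours, and the gradients are computed analytically rather than by a framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative s * (1 - s): saturates toward 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # derivative is exactly 1 for every positive input, so gradients
    # flowing backward are not shrunk layer after layer
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(x))  # near-zero at +/-10: vanishing gradient
print(relu_grad(x))     # [0., 0., 1., 1.]
```

Stacking many sigmoid layers multiplies these small derivatives together, which is why deep pre-ReLU networks were hard to train.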
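The shortcut-connection and batch-normalization ideas can be combined in a small residual-block sketch. This is an illustrative assumption in plain NumPy (forward pass only, no learnable scale/shift parameters), not the implementation of any specific architecture:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature over the batch to zero mean, unit variance
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, w):
    # y = F(x) + x: the identity shortcut lets signal (and, in training,
    # gradients) bypass the transformed path entirely
    out = np.maximum(0.0, batch_norm(x @ w))  # linear -> BN -> ReLU
    return out + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w)  # same shape as x, thanks to the identity mapping
```

Because the shortcut is an identity, even a block whose transform contributes nothing still passes its input through unchanged, which is what makes very deep stacks of such blocks trainable.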
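The pruning and low-precision ideas can be sketched together: magnitude-based pruning zeroes the smallest weights, and a simple linear quantizer maps the survivors to 8-bit integers. This is a minimal sketch under our own assumptions (a global magnitude threshold and symmetric int8 quantization), not the full pipeline of [Han 2015]:

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.9):
    # discard the smallest-magnitude fraction of connections
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_int8(w):
    # symmetric linear quantization: map [-max, max] onto [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale  # dequantize with q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
pruned = prune_by_magnitude(w)          # ~90% of entries become zero
q, scale = quantize_int8(pruned)        # 1 byte per weight instead of 4
```

The int8 tensor needs a quarter of the memory of the float32 original, and the zeroed entries can additionally be stored in a sparse format; the price is a bounded rounding error of at most half a quantization step per weight.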