As mentioned above, one factor behind the current success of DL is a set of new algorithmic advances that have alleviated problems which previously prevented NN applications from working properly. These advances include:
◦ Rectified linear units (ReLU) as activation functions, which improve gradient backward flow (in contrast with sigmoid and hyperbolic-tangent functions, whose gradients saturate).
◦ Shortcut connections that link distant parts of the network through identity mappings.
◦ Batch normalization layers to alleviate the internal covariate shift problem [Ioffe 2015]. These methods have enabled training networks as deep as 1000 layers [He 2016b].
◦ Weight decay (e.g. L1, L2), which penalizes layer weights that grow too large.
◦ Dropout layers that block a random subset of units in a layer (usually around 50%) in each training cycle [Srivastava 2014]. The random blocking encourages kernels to learn more robust filters. At inference, all connections are used, with activations scaled by a corrective constant equal to the fraction of retained connections.
◦ Network pruning, a way to combat overfitting by discarding unimportant connections [Han 2016]. An additional advantage of this method is that the number of parameters is significantly reduced, leading to smaller memory and energy requirements.
◦ Deep compression, which significantly reduces the number of network parameters with the aim of shrinking memory requirements so that the whole deep network model can fit into on-chip memory. The process starts with network pruning, in which the importance of each connection is learnt; the network is then quantized with weight sharing, and finally Huffman coding is applied [Han 2015].
◦ Sparse computation, which imposes the use of sparse representations throughout the network, yielding memory and computation benefits.
◦ Low-precision data types [Konsor 2012], smaller than 32 bits (e.g. half-precision or integer), with experiments even in 1-bit computation [Courbariaux 2016]. This speeds up linear-algebra computation and greatly decreases memory consumption, at the cost of a slightly less accurate model. In recent years, most DNN frameworks have started to support 16-bit and 8-bit computation.
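The gradient-flow advantage of ReLU over sigmoid mentioned above can be illustrated with a minimal NumPy sketch; the function names are ours, and the gradients are computed analytically rather than by a framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative s * (1 - s): saturates toward 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # derivative is exactly 1 for every positive input, so gradients
    # flowing backward are not shrunk layer after layer
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(x))  # near-zero at +/-10: vanishing gradient
print(relu_grad(x))     # [0., 0., 1., 1.]
```

Stacking many sigmoid layers multiplies these small derivatives together, which is why deep pre-ReLU networks were hard to train.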
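The shortcut-connection and batch-normalization ideas can be combined in a small residual-block sketch. This is an illustrative assumption in plain NumPy (forward pass only, no learnable scale/shift parameters), not the implementation of any specific architecture:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature over the batch to zero mean, unit variance
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, w):
    # y = F(x) + x: the identity shortcut lets signal (and, in training,
    # gradients) bypass the transformed path entirely
    out = np.maximum(0.0, batch_norm(x @ w))  # linear -> BN -> ReLU
    return out + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w)  # same shape as x, thanks to the identity mapping
```

Because the shortcut is an identity, even a block whose transform contributes nothing still passes its input through unchanged, which is what makes very deep stacks of such blocks trainable.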
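The pruning and low-precision ideas can be sketched together: magnitude-based pruning zeroes the smallest weights, and a simple linear quantizer maps the survivors to 8-bit integers. This is a minimal sketch under our own assumptions (a global magnitude threshold and symmetric int8 quantization), not the full pipeline of [Han 2015]:

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.9):
    # discard the smallest-magnitude fraction of connections
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_int8(w):
    # symmetric linear quantization: map [-max, max] onto [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale  # dequantize with q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
pruned = prune_by_magnitude(w)          # ~90% of entries become zero
q, scale = quantize_int8(pruned)        # 1 byte per weight instead of 4
```

The int8 tensor needs a quarter of the memory of the float32 original, and the zeroed entries can additionally be stored in a sparse format; the price is a bounded rounding error of at most half a quantization step per weight.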