- Always use a nonlinear activation function. Without one, stacked linear layers collapse into a single linear layer, so the network is not effectively deep.
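
  A minimal numpy sketch of the collapse (shapes chosen just for illustration): two linear layers with no activation in between compute exactly the same function as one merged layer.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  # Two stacked linear layers with no activation in between.
  W1 = rng.standard_normal((4, 3))
  W2 = rng.standard_normal((2, 4))
  x = rng.standard_normal(3)

  deep = W2 @ (W1 @ x)          # "two-layer" network, no nonlinearity
  shallow = (W2 @ W1) @ x       # one merged linear layer

  assert np.allclose(deep, shallow)  # identical outputs: the depth added nothing
  ```
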
- ReLU causes a bias shift: its outputs are non-negative, so the mean activation is greater than zero. It also has zero gradient for every negative input, which is roughly half the input space for zero-centered pre-activations.
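
  A quick numerical check of both effects (a sketch assuming zero-mean, standard-normal pre-activations; the mean output works out to 1/sqrt(2*pi) ≈ 0.40):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_normal(100_000)   # zero-mean inputs

  out = np.maximum(x, 0.0)           # ReLU

  print(out.mean())                  # ~0.40: mean output is shifted above zero
  print((x <= 0).mean())             # ~0.50: fraction of inputs with zero ReLU gradient
  ```
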
- The best options right now are PReLU and ELU.
- Good hyperparameters are (see the PyTorch sketch after this list):
  - Leaky ReLU: slope 0.01 for negative inputs.
  - PReLU: initial slope 0.25, learned during training.
  - RReLU: slope drawn uniformly at random from [1/8, 1/3] at train time; fixed at the mean, (1/8 + 1/3)/2, at test time.
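
  For reference, these settings map onto PyTorch's built-in activation modules as follows (a sketch assuming PyTorch; the keyword names are PyTorch's, not from the notes above):

  ```python
  import torch.nn as nn

  leaky = nn.LeakyReLU(negative_slope=0.01)  # fixed 0.01 slope for negative inputs
  prelu = nn.PReLU(init=0.25)                # slope initialized to 0.25, then learned
  rrelu = nn.RReLU(lower=1/8, upper=1/3)     # slope ~ U(1/8, 1/3) in train mode;
                                             # eval mode uses the fixed mean (1/8 + 1/3) / 2
  elu = nn.ELU(alpha=1.0)                    # ELU with its standard alpha = 1.0
  ```
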