Activation functions introduce non-linearity into neural networks. Non-linearity increases complexity, so why do neural networks need such complexity? This document addresses that question.
Non-linearity in a neural network simply means that the output of a unit cannot be reproduced by a linear function of its input.
Without non-linearity, a multi-layer neural network is mathematically equivalent to a single-layer one. No matter how many layers it had, it would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function (as shown in the calculation below).
y = h2 * W3 + b3
  = (h1 * W2 + b2) * W3 + b3
  = h1 * W2 * W3 + b2 * W3 + b3
  = (x * W1 + b1) * W2 * W3 + b2 * W3 + b3
  = x * W1 * W2 * W3 + b1 * W2 * W3 + b2 * W3 + b3
  = x * W' + b',  where W' = W1 * W2 * W3 and b' = b1 * W2 * W3 + b2 * W3 + b3
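To check this numerically, here is a minimal NumPy sketch (all shapes and weights are arbitrary, made up for illustration) that verifies three stacked linear layers collapse into a single linear layer with W' = W1 * W2 * W3 and b' = b1 * W2 * W3 + b2 * W3 + b3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary shapes: 4 inputs -> 5 -> 3 -> 2 outputs (made up for illustration)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(3, 2)), rng.normal(size=2)

x = rng.normal(size=(10, 4))      # a batch of 10 inputs

# Three "layers" with a linear (identity) activation
h1 = x @ W1 + b1
h2 = h1 @ W2 + b2
y  = h2 @ W3 + b3

# The same mapping collapsed into one linear layer
W_prime = W1 @ W2 @ W3
b_prime = b1 @ W2 @ W3 + b2 @ W3 + b3
y_single = x @ W_prime + b_prime

print(np.allclose(y, y_single))   # True: three linear layers == one linear layer
```

Whatever random values are drawn, the printed result is True, confirming the collapse.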
In general, activation functions cannot be linear, because a neural network with a linear activation function is effectively only one layer deep, regardless of how complex its architecture is. The transformation each layer applies to its input (input * weight + bias) is linear, but real-world problems are non-linear. To make the mapping non-linear, we apply a non-linear function called the activation function. An activation function can be viewed as a decision-making function that determines the presence of a particular feature: classically (as with the sigmoid) its output lies between 0 and 1, where values near zero mean the feature is absent and values near one mean it is present.
If the expected output reflects a linear relationship, as in linear regression, then a linear activation function can be used.
A linear activation function is simply the identity, f(x) = x, i.e., a straight line.
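As a small illustration (NumPy, with made-up numbers), the sketch below applies a linear (identity) activation and a sigmoid to the same pre-activation z = x · w + b; the linear activation passes z through unchanged, while the sigmoid squashes it into (0, 1), which can be read as the presence or absence of a feature as described above:

```python
import numpy as np

def linear(z):
    """Identity activation: f(z) = z."""
    return z

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A single unit: pre-activation z = x . w + b (made-up numbers)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8,  0.4, 1.1])
b = -0.5
z = x @ w + b

print(linear(z))     # unchanged, can be any real number
print(sigmoid(z))    # in (0, 1): near 1 means "feature present", near 0 means "absent"
```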
ReLU (Rectified Linear Unit)
It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
It also mitigates the vanishing gradient problem, although ReLU does not eliminate it entirely. During training, if a neuron's weights are updated such that the weighted sum of its inputs becomes negative for every input, it will output 0 (and have a zero gradient), so it effectively stops learning; this is known as the dying ReLU problem.
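A minimal sketch of ReLU and the dying-neuron behaviour described above (NumPy, illustrative values only):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z). Gradient is 1 for z > 0 and 0 for z < 0."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))                   # [0.  0.  0.  0.5 3. ]

# Gradient of ReLU w.r.t. its input (subgradient 0 taken at z == 0)
print((z > 0).astype(float))     # [0. 0. 0. 1. 1.]

# If a neuron's pre-activation is negative for every input, its output and
# gradient are both 0, so its weights stop being updated ("dying ReLU").
```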
GeLU (Gaussian Error Linear Unit)
It was introduced more recently and is used in transformer architectures such as BERT and BART, where it performs well. Like ReLU, it mitigates the vanishing gradient problem.
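A minimal sketch of GeLU using the common tanh approximation (the exact definition is x * Φ(x), where Φ is the standard normal CDF); the input values are illustrative only:

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(gelu(x))   # smooth: slightly negative for small negative x, close to x for large positive x
```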
Sigmoid
It was widely used in the earlier days of neural networks. It suffers from the vanishing gradient problem: its derivative is close to zero for large positive or negative inputs, so gradients shrink as they are propagated back through many layers.
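A minimal sketch of why the sigmoid leads to vanishing gradients: its derivative σ(x)(1 − σ(x)) peaks at 0.25 and is nearly zero for large |x|, so repeated multiplication through many layers shrinks the gradient (NumPy, illustrative values only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, at x = 0

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))                 # [~0, 0.119, 0.5, 0.881, ~1]
print(sigmoid_grad(x))            # [~0, 0.105, 0.25, 0.105, ~0]

# Backpropagating through 10 sigmoid units multiplies together 10 factors <= 0.25,
# so the signal reaching early layers is at most 0.25**10 ~ 1e-6 (vanishing gradient).
```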
References
https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net
https://www.quora.com/What-do-you-mean-by-introducing-non-linearity-in-a-neural-network
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
https://adventuresinmachinelearning.com/vanishing-gradient-problem-tensorflow/
https://mlfromscratch.com/activation-functions-explained/#/
https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291