Activation functions introduce non-linearity into neural networks. Non-linearity increases complexity, so why do neural networks need such complexity? This document addresses that question.
Non-linearity in a neural network simply means that the output of a unit cannot be reproduced by a linear function of its input.
Without non-linearity, a multi-layer neural network is mathematically equivalent to a single-layer one. No matter how many layers it had, it would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function (as shown in the calculation below).
y = h2 * W3 + b3
  = (h1 * W2 + b2) * W3 + b3
  = h1 * W2 * W3 + b2 * W3 + b3
  = (x * W1 + b1) * W2 * W3 + b2 * W3 + b3
  = x * W1 * W2 * W3 + b1 * W2 * W3 + b2 * W3 + b3
  = x * W' + b',  where W' = W1 * W2 * W3 and b' = b1 * W2 * W3 + b2 * W3 + b3
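To check this numerically, here is a minimal NumPy sketch (all shapes and weights are arbitrary, made up for illustration) that verifies three stacked linear layers collapse into a single linear layer with W' = W1 * W2 * W3 and b' = b1 * W2 * W3 + b2 * W3 + b3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary shapes: 4 inputs -> 5 -> 3 -> 2 outputs (made up for illustration)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(3, 2)), rng.normal(size=2)

x = rng.normal(size=(10, 4))      # a batch of 10 inputs

# Three "layers" with a linear (identity) activation
h1 = x @ W1 + b1
h2 = h1 @ W2 + b2
y  = h2 @ W3 + b3

# The same mapping collapsed into one linear layer
W_prime = W1 @ W2 @ W3
b_prime = b1 @ W2 @ W3 + b2 @ W3 + b3
y_single = x @ W_prime + b_prime

print(np.allclose(y, y_single))   # True: three linear layers == one linear layer
```

Whatever random values are drawn, the printed result is True, confirming the collapse.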
In general, activation functions cannot be linear, because a neural network with a linear activation function is effectively only one layer deep, regardless of how complex its architecture is. The transformation each layer applies to its input (input * weight + bias) is linear, but real-world problems are non-linear. To make the mapping non-linear, we apply a non-linear function called the activation function. An activation function can be viewed as a decision-making function that determines the presence of a particular feature: classically (as with the sigmoid) its output lies between 0 and 1, where values near zero mean the feature is absent and values near one mean it is present.
If the expected output reflects a linear relationship, as in linear regression, then a linear activation function can be used.
A linear activation function is simply the identity, f(x) = x, i.e., a straight line.
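As a small illustration (NumPy, with made-up numbers), the sketch below applies a linear (identity) activation and a sigmoid to the same pre-activation z = x · w + b; the linear activation passes z through unchanged, while the sigmoid squashes it into (0, 1), which can be read as the presence or absence of a feature as described above:

```python
import numpy as np

def linear(z):
    """Identity activation: f(z) = z."""
    return z

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A single unit: pre-activation z = x . w + b (made-up numbers)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8,  0.4, 1.1])
b = -0.5
z = x @ w + b

print(linear(z))     # unchanged, can be any real number
print(sigmoid(z))    # in (0, 1): near 1 means "feature present", near 0 means "absent"
```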
ReLU (Rectified Linear Unit)
It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
It also mitigates the vanishing gradient problem, although ReLU does not eliminate it entirely. During training, if a neuron's weights are updated such that the weighted sum of its inputs becomes negative for every input, it will output 0 (and have a zero gradient), so it effectively stops learning; this is known as the dying ReLU problem.
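A minimal sketch of ReLU and the dying-neuron behaviour described above (NumPy, illustrative values only):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z). Gradient is 1 for z > 0 and 0 for z < 0."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))                   # [0.  0.  0.  0.5 3. ]

# Gradient of ReLU w.r.t. its input (subgradient 0 taken at z == 0)
print((z > 0).astype(float))     # [0. 0. 0. 1. 1.]

# If a neuron's pre-activation is negative for every input, its output and
# gradient are both 0, so its weights stop being updated ("dying ReLU").
```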
GeLU (Gaussian Error Linear Unit)
It was introduced more recently and is used in transformer architectures such as BERT and BART, where it performs well. Like ReLU, it mitigates the vanishing gradient problem.
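A minimal sketch of GeLU using the common tanh approximation (the exact definition is x * Φ(x), where Φ is the standard normal CDF); the input values are illustrative only:

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(gelu(x))   # smooth: slightly negative for small negative x, close to x for large positive x
```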
Sigmoid
It was widely used in the earlier days of neural networks. It suffers from the vanishing gradient problem: its derivative is close to zero for large positive or negative inputs, so gradients shrink as they are propagated back through many layers.
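A minimal sketch of why the sigmoid leads to vanishing gradients: its derivative σ(x)(1 − σ(x)) peaks at 0.25 and is nearly zero for large |x|, so repeated multiplication through many layers shrinks the gradient (NumPy, illustrative values only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, at x = 0

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))                 # [~0, 0.119, 0.5, 0.881, ~1]
print(sigmoid_grad(x))            # [~0, 0.105, 0.25, 0.105, ~0]

# Backpropagating through 10 sigmoid units multiplies together 10 factors <= 0.25,
# so the signal reaching early layers is at most 0.25**10 ~ 1e-6 (vanishing gradient).
```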
References
https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net
https://www.quora.com/What-do-you-mean-by-introducing-non-linearity-in-a-neural-network
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
https://adventuresinmachinelearning.com/vanishing-gradient-problem-tensorflow/
https://mlfromscratch.com/activation-functions-explained/#/
https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291