Artificial Neural Network

Artificial Neural Network (ANN)

An ANN is a network based on statistical learning models that implements machine learning techniques (Fig. 0), where algorithms can learn from and make predictions on data. It is used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. The network connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.

Fig. 0: Types of Machine Learning Algorithms in one picture [3]

An ANN is a network of the smallest functional units, called neurons (Fig. 1). Let's implement NAND logic using an artificial neuron called a perceptron [1].

Fig. 1: NAND logic implementation using a single perceptron [1]

The binary inputs to the perceptron, x = [x1 x2], are associated with weights w = [w1 w2], where w1 = w2 = -2, and the bias is b = 3. Only when both inputs are 1 is the weighted sum w1x1 + w2x2 + b smaller than 0, and the output y is 0.

If we treat the bias b as a weight w0 and fix the corresponding input x0 at the constant value 1, the perceptron output can be written in terms of weights only:

y = 1 if w0x0 + w1x1 + w2x2 > 0, and y = 0 otherwise (where x0 = 1 and w0 = b)

For the input (1,1), the weighted sum is (-2)(1) + (-2)(1) + 3 = -1 < 0, so the output from the perceptron is 0. For the other input combinations, the weighted sum is greater than zero, which produces output 1. Since NAND is a universal gate, a network of such perceptrons can therefore implement any logical function.
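As a quick check, this NAND perceptron can be evaluated directly in MATLAB. The short sketch below is only an illustration (it is not one of the scripts discussed later); it uses the weights and bias quoted above, w1 = w2 = -2 and b = 3, with a hard threshold at zero.

% Evaluate the NAND perceptron on all four binary input combinations
w = [-2 -2];          % weights w1 and w2
b = 3;                % bias
X = [0 0 1 1;         % x1 values, one input pair per column
     0 1 0 1];        % x2 values
z = w*X + b;          % weighted sum plus bias (the "partial output")
y = double(z > 0);    % step activation: 1 if the sum is positive, otherwise 0
disp([X; z; y])       % rows: x1, x2, w*x + b, and the NAND output 1 1 1 0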

While a problem like NAND gate logic can be implemented with a single neuron, it is not possible to implement XOR with a single neuron. Consider the input data space for the NAND gate in Fig. 2. The data points (0,1), (0,0), and (1,0), which result in 1, are separable from the point (1,1), which outputs 0. A single neuron can categorize its input into only two groups irrespective of the number of inputs, and hence can draw only a single separator.


Fig. 2: Input Data Space for NAND gate

The XOR gate requires two separators, each implemented by a neuron, to categorize its input data space, as shown in Fig. 3. The inputs (0,1) and (1,0), in blue, result in output 1, whereas the inputs (0,0) and (1,1), in orange, result in output 0.


Fig. 3: Input Data Space for XOR gate

The outputs from the two neurons need to be combined by another neuron to reach a single decision, so the two neurons form a hidden layer. Data sets for linearly and non-linearly separable classes are also shown in Fig. 4.

The addition of hidden layers of neurons, as shown in Fig. 5, makes it possible to tackle non-linearly separable cases by breaking the problem down into smaller problems; such a network is called a multi-layer perceptron (MLP).

Fig. 5: Multiple Layer Neural Networks showing input layer, output layer, and multiple hidden layers [1].
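To make the hidden-layer idea concrete before turning to learning, the sketch below hand-wires a small 2-2-1 MLP for XOR. The two hidden neurons implement the two separators of Fig. 3, chosen here as an OR and a NAND perceptron, and the output neuron ANDs them; the weights are illustrative hand-picked values, not trained ones.

% Hand-wired 2-2-1 multi-layer perceptron for XOR (weights chosen by hand, not learned)
step = @(z) double(z > 0);       % hard-threshold activation
X = [0 0 1 1;
     0 1 0 1];                   % the four XOR inputs, one per column

Wh = [ 2  2;                     % hidden neuron 1: OR   (2*x1 + 2*x2 - 1 > 0)
      -2 -2];                    % hidden neuron 2: NAND (-2*x1 - 2*x2 + 3 > 0)
bh = [-1; 3];
H  = step(Wh*X + bh*ones(1,4));  % hidden-layer outputs (2 x 4)

Wo = [2 2]; bo = -3;             % output neuron: AND of the two hidden units
Y  = step(Wo*H + bo*ones(1,4));  % XOR output: 0 1 1 0
disp([X; Y])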

Neural Network Learning

By replacing the slope m with a weight w and the intercept b with a bias w0, the cost function (or loss function) for the linear regression in Basic Statistics for Deep Learning becomes:

C(w, w0) = 1/(2n) Σi [yi − (w xi + w0)]^2
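As a small numerical illustration, this cost can be evaluated directly in MATLAB. The data values below are made up for the example, and the 1/(2n) factor is one common convention; the constant scaling does not change where the minimum lies.

% Quadratic cost C(w, w0) for a toy 1-D linear fit (made-up data)
x = [1 2 3 4 5];                  % inputs
y = [2.1 3.9 6.2 8.1 9.8];        % observed outputs, roughly y = 2x
C = @(w, w0) sum((y - (w*x + w0)).^2) / (2*numel(x));
C(2, 0)      % cost near the good slope w = 2 is small
C(0.5, 0)    % cost for a poor slope is much larger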

For a neural network, the observed data yi is the known output from the training data. Also, a neural network has multiple input features, in contrast to the one-dimensional linear regression problem, and hence the cost is minimized iteratively by adjusting the weights, which is called learning. Small changes in the weights of perceptrons can completely flip the output, producing very erratic results. So, to ensure that small changes in the weights propagate as small changes in the output of the network, sigmoid neurons are used, represented by the function below, which gives the S-shaped plot shown in Fig. 6.

y = 1 / (1 + e^(-x))

Fig. 6: Sigmoid Function

The quadratic cost function is smooth, which makes it easier to figure out how to make small changes in the weights so as to improve the cost using algorithms like the Gradient Descent (GD) algorithm. A widely used transfer function, or activation function, is the hyperbolic tangent sigmoid (tansig) function shown in Fig. 7.

Fig. 7: Tansig function
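Both S-shaped activation functions can be plotted directly with the Neural Network Toolbox functions logsig and tansig (the equivalent base-MATLAB formulas are noted in the comments). This is just a quick visual check, not part of the XOR scripts below.

% Plot the sigmoid (Fig. 6) and tansig (Fig. 7) activation functions
x = -5:0.1:5;
y_sig = logsig(x);    % logistic sigmoid: 1./(1 + exp(-x)), output range (0, 1)
y_tan = tansig(x);    % hyperbolic tangent sigmoid: 2./(1 + exp(-2*x)) - 1, range (-1, 1)
figure;
plot(x, y_sig, 'b', x, y_tan, 'r');
legend('logsig (sigmoid)', 'tansig', 'Location', 'southeast');
xlabel('x'); ylabel('output'); grid on;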

Computing the gradient of every quadratic term for a large training data set is expensive, and learning would take a long time. The Stochastic Gradient Descent (SGD) algorithm instead learns from a small number "m" of inputs picked at random out of the large "n" inputs in the training data set. The neural network is trained with such mini-batches until the training inputs are exhausted, which completes one epoch of training. SGD is a commonly used and powerful technique and a basis for learning in neural networks.
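A minimal sketch of the mini-batch idea is given below, reusing a toy one-dimensional linear fit; the batch size m, learning rate, and number of epochs are arbitrary illustration values, not recommendations.

% Mini-batch SGD for a toy 1-D linear fit: each update uses a random
% mini-batch of m examples instead of the whole training set
x = [1 2 3 4 5 6 7 8];             % made-up training inputs
y = 2*x + 0.1*randn(1, 8);         % made-up targets, roughly y = 2x
w = 0; w0 = 0;                     % initial weight and bias
alpha = 0.01;                      % learning rate
m = 2;                             % mini-batch size
for epoch = 1:200                  % one pass over all mini-batches = one epoch
    idx = randperm(numel(x));      % shuffle the training set
    for k = 1:m:numel(x)
        s = idx(k:k+m-1);          % indices of the current mini-batch
        err = (w*x(s) + w0) - y(s);            % residuals on the mini-batch
        w  = w  - alpha * (err * x(s)') / m;   % gradient step for the weight
        w0 = w0 - alpha * sum(err) / m;        % gradient step for the bias
    end
end
[w w0]                             % should end up close to [2 0]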

In real-world problems, the error surfaces are usually complex and may resemble the one in Fig. 8(a), with numerous valleys, and the iteration can get stuck at a local minimum, as shown by the trapped ball in the figure.

Fig. 8(a): The iteration is stuck at the local minimum represented by the trapped ball.

Progress is possible only by climbing higher before descending to the global minimum. The randomness, or noise, introduced by SGD may help the iteration bounce out of local minima as long as they are not too deep.

A momentum term is added to the SGD iteration to allow a faster learning rate, which speeds up convergence by damping the oscillations and, at the same time, helps avoid local minima, as shown in Fig. 8(b).

Fig. 8(b): Bouncing off the local minimum

The idea is to stabilize the weight change through non-radical revisions, combining the gradient descent term with a fraction of the previous weight change, so the iteration in the Gradient Descent (GD) algorithm becomes:

Δxn = μ Δxn-1 − α C'(xn-1)

xn = xn-1 + Δxn

Here, μ is the momentum term, Δxn is the weight change at step n, and α is the learning rate.
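The sketch below compares plain gradient descent with the momentum update on a simple one-dimensional quadratic cost with shallow curvature; the curvature, learning rate, and momentum values are made-up illustration numbers. It demonstrates the faster convergence along a shallow direction; the escape from local minima depends on the shape of the error surface and is not shown here.

% Gradient descent with and without momentum on C(x) = 0.5*k*x^2
k     = 0.1;                        % shallow curvature of the cost
dC    = @(x) k*x;                   % gradient of C
alpha = 0.1;                        % learning rate
for mu = [0 0.9]                    % plain GD (mu = 0) versus momentum (mu = 0.9)
    x = 10; dx = 0;                 % starting point and previous weight change
    for n = 1:200
        dx = mu*dx - alpha*dC(x);   % fraction of previous change plus the GD term
        x  = x + dx;                % apply the combined update
    end
    fprintf('mu = %.1f : x after 200 iterations = %.6f\n', mu, x);
end
% With mu = 0.9 the iterate ends up much closer to the minimum at x = 0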

The back-propagation, or backward error propagation, algorithm, used especially in supervised learning, implements the GD algorithm in each layer to keep track of small perturbations (errors) to the weights as they propagate through the layers, so as to lower the overall cost function. Unlike supervised learning, where the network is trained by feeding known input and target vectors, in unsupervised learning there is little or no idea about the output, except in an auto-encoder, where the output is used to reconstruct the input. Reinforcement learning, on the other hand, is driven by the notion of cumulative reward without presenting input/output pairs. For example, ants ultimately find the shortest path (reinforced with pheromone deposits) to the food source, i.e. the agents generate the inputs by interacting with the environment.
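To show what back-propagation actually computes, here is a from-scratch sketch that trains a small 2-3-1 sigmoid network on XOR with the quadratic cost and plain gradient descent. The layer sizes, learning rate, and iteration count are illustrative choices; the toolbox-based implementation in the section below handles all of this internally.

% Back-propagation from scratch: a 2-3-1 sigmoid network learning XOR
X = [0 0 1 1;
     0 1 0 1];                         % inputs, one example per column
T = [0 1 1 0];                         % XOR targets
sig = @(z) 1 ./ (1 + exp(-z));         % sigmoid activation
W1 = randn(3, 2); b1 = randn(3, 1);    % hidden layer: 3 neurons, 2 inputs
W2 = randn(1, 3); b2 = randn(1, 1);    % output layer: 1 neuron
alpha = 0.5;                           % learning rate
n = size(X, 2);                        % number of training examples
for it = 1:50000
    % Forward pass
    A1 = sig(W1*X + b1*ones(1, n));    % hidden activations (3 x 4)
    A2 = sig(W2*A1 + b2*ones(1, n));   % network outputs (1 x 4)
    % Backward pass: propagate the output error back through the layers
    D2 = (A2 - T) .* A2 .* (1 - A2);   % output-layer error (delta)
    D1 = (W2' * D2) .* A1 .* (1 - A1); % hidden-layer error
    % Gradient descent step on all weights and biases
    W2 = W2 - alpha * (D2 * A1') / n;
    b2 = b2 - alpha * sum(D2, 2) / n;
    W1 = W1 - alpha * (D1 * X') / n;
    b1 = b1 - alpha * sum(D1, 2) / n;
end
% Should typically reproduce the XOR targets 0 1 1 0; if it settles in a poor
% local minimum, re-run with a different random initialization
round(sig(W2*sig(W1*X + b1*ones(1, n)) + b2*ones(1, n)))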

The XOR logic implementation through supervised learning using MATLAB is given in the "Using Software" section below. As explained earlier, unlike NAND logic, it requires at least one hidden layer in a feed-forward network. The number of hidden layers depends on the complexity of the problem, but in general you can keep adding layers until the network over-fits the training data. Also, the result is not that sensitive to the number of neurons in each layer. Usually, the first hidden layer has a number of neurons between the number of inputs and the number of outputs, and the count decreases gradually toward the output layer.

In a feed-forward network, information moves only in the forward direction. There is no directed cycle as in a recurrent network, where connections between neuron units form an internal memory.

Initialization is the process of setting the weights and biases of the inputs and layers of the network before training. Proper initialization breaks symmetry, i.e. all neurons will not learn the same thing. It is also desirable to keep the weights such that the activation functions operate in their linear zone, so that the gradients are well-behaved. The Nguyen-Widrow initialization algorithm (e.g. initnw in Matlab), as an example, chooses random weight values such that they are approximately evenly distributed in the active region of each neuron whose transfer function (e.g. tansig) has a finite input range.

Using Software

XOR Implementation using MATLAB (from scratch): The Matlab code below helps you build the feed-forward network from scratch. See the comments for information. The Matlab script file "xorNetwork.m" is also located at /usr/local/doc/DEEPLEARNING/neural-network.

Copy the directory and cd to it

cp -r /usr/local/doc/DEEPLEARNING/neural-network .

cd neural-network

Request a compute node:

srun --x11 --pty bash

Load the Matlab module

module load matlab

Open Matlab Terminal:

matlab &

Run the matlab script "xorNetwork.m" from the prompt

>>xorNetwork

xorNetwork.m

% XOR implementation from scratch
% Create a network with one input and 2 layers - hidden and output
net = network(1,2);
% Change the transfer function of the hidden layer to tansig; the output layer
% keeps the default purelin (uncomment the next line to use logsig instead)
net.layers{1}.transferFcn = 'tansig';
% net.layers{2}.transferFcn = 'logsig';
% Assign the number of neurons: 3 in the hidden layer and 1 in the output layer
net.layers{1}.size = 3;
net.layers{2}.size = 1;
% Change the initialization function for both layers, as follows:
net.layers{1}.initFcn = 'initnw';
net.layers{2}.initFcn = 'initnw';
% Toggle the value to 1 for all layers to connect the bias
net.biasConnect = [1;1];
% Connect the input to the first layer
net.inputConnect(1,1) = 1; % input weight connection from input 1 to the 1st layer
% Connect the output (2nd) layer to the hidden layer
net.layerConnect = [0 0; 1 0];
% Connect the network output to the output layer
net.outputConnect = [0 1];
% Input size (x & y) i.e. 2 (displayed at the bottom of the layer in view(net))
net.inputs{1}.size = 2;
% Layers' names (displayed on the top of each layer)
net.layers{1}.name = 'Hidden Layer';
net.layers{2}.name = 'Output Layer';
% XOR needs at least one hidden layer
% Set the network initialization function to initialize according to each
% layer's own initialization function
net.initFcn = 'initlay';
% Set the learn function 'learngdm' for the weights going to the 2nd layer from the 1st layer
net.layerWeights{2,1}.learnFcn = 'learngdm';
% Learning occurs according to the learning parameters - learning rate (lr)
% and momentum constant (mc):
net.layerWeights{2,1}.learnParam.lr = 0.01; % default is 0.01
net.layerWeights{2,1}.learnParam.mc = 0.9;  % default is 0.9
% Set the performance function to mse (mean squared error) and the training
% function to trainlm (Levenberg-Marquardt backpropagation) to meet the final
% requirements of the custom network
net.performFcn = 'mse';
net.trainFcn = 'trainlm';
% Set the divide function to dividerand (divide the training data randomly)
net.divideFcn = 'dividerand';
% Initialize the network with the initial weights and biases
net = init(net);
% Input data
X = [ 0 0 1 1 0 0 1 1 ;
      0 1 0 1 0 1 0 1 ]
%{
X =
     0     0     1     1     0     0     1     1
     0     1     0     1     0     1     0     1
%}
% The target output is:
T = [0 1 1 0 0 1 1 0];
% Before training, the output Y is:
Y = sim(net,X)
% Train the network
net = train(net,X,T);
% After training, the output Y is:
Y = sim(net,X)

Check out the XOR implementation "xor.m" using the MATLAB feedforwardnet() function at /usr/local/doc/DEEPLEARNING/neural-network.
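The exact contents of xor.m are in that directory; a rough sketch of how the same network could be built with feedforwardnet() is shown below (the hidden-layer size and the data-division setting are assumptions for illustration, not necessarily what xor.m uses).

% XOR with the high-level feedforwardnet() interface (a sketch; see xor.m for the actual script)
X = [0 0 1 1 0 0 1 1;
     0 1 0 1 0 1 0 1];            % same training inputs as xorNetwork.m
T = [0 1 1 0 0 1 1 0];            % XOR targets
net = feedforwardnet(3);          % one hidden layer with 3 neurons (tansig/purelin by default)
net.divideFcn = 'dividetrain';    % use all 8 samples for training, since the data set is tiny
net = train(net, X, T);           % trains with Levenberg-Marquardt backpropagation by default
Y = sim(net, X)                   % outputs should be close to the targets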

You can view the network (see Fig. 9) by issuing the view command:

view(net)

output:

Fig. 9: Neural network implementing XOR logic. The number of inputs, outputs, and neurons are represented by the integers at the bottom, and the names of the layers are on the top. So, there is one hidden layer with 3 neurons. Both layers have a weight (w) and a bias (b) attached. The hidden and output layers use the "tansig" and "purelin" transfer (activation) functions, respectively.

Matlab Initialization:

Check the active input range for the tansig transfer function:

>> tansig('active')

ans =

    -2     2

After training (running Matlab scripts xor.m and xorNetwork.m):

Input Weights:

>> net.IW{1,1}

ans =

    1.1562   -3.2860
    1.8765   -1.8521
   -1.1633   -2.2281

Layer weights:

>> net.LW{2,1}

ans =

   -1.2306    0.2908   -0.4481

Pre-training Weights:

Reset the network

>> net = init(net);

Input Weights after initialization

>> net.IW{1,1}

ans =

   -1.3075   -2.0422
   -1.4774   -1.9229
    0.7758    2.2974

Layer weights after initialization

>> net.LW{2,1}

ans =

    0.0203    0.8127    0.2578

Try running the script xorNetwork.m without initialization, i.e. comment out the following lines:

 net.layers{1}.initFcn = 'initnw';

 net.layers{2}.initFcn = 'initnw';

Run and check the layer weights

>> net.LW{2,1}

ans =

    0.9004    1.0131   -0.9263

References:

[1] Neural Networks and Deep Learning - http://neuralnetworksanddeeplearning.com/chap1.html

[2] Matlab Neural Network & Deep Learning Functions - https://www.mathworks.com/help/nnet/functionlist.html

[3] Data Science Central