Deep Learning
A major bottleneck in using neural networks is that they require abundant training data
Artificial Neural Networks are a collection of a large number of simple devices called artificial neurons
Perceptron
takes some signals as inputs and performs a set of simple calculations to arrive at a decision
a perceptron takes a weighted sum of multiple inputs (along with a bias) as the cumulative input and applies a step function to the cumulative input to produce the output
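As an illustration, here is a minimal perceptron sketch in numpy (the AND-gate weights and bias below are assumptions chosen for the example, not part of the notes):
import numpy as np

def perceptron(x, w, b):
    cumulative_input = np.dot(w, x) + b       # weighted sum of inputs + bias
    return 1 if cumulative_input >= 0 else 0  # step function on the cumulative input

# illustrative example: a perceptron behaving like a logical AND gate
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0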
Structure of Neural Networks
Neurons in a neural network are arranged in layers
The first and the last layers are called the input and output layers respectively.
The input layer has as many neurons as the number of attributes in the data set.
The output layer has as many neurons as the number of classes of the target variable (for a classification problem).
For a regression problem, the number of neurons in the output layer would be 1
Components
Network Topology
Input Layer
Output Layer
Weights
Activation functions
Biases
softmax output
a multiclass logistic function commonly used to compute the 'probability' of an input belonging to one of multiple classes
the softmax function reduces to a sigmoid function in the special case of binary classification
in the case of a sigmoid output, there is only one neuron in the output layer
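A small numpy sketch of the softmax output and its binary special case (the example scores are arbitrary):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))               # probabilities over 3 classes, summing to 1

# binary special case: softmax over [z, 0] equals the sigmoid of z
z0 = 1.3
print(softmax(np.array([z0, 0.0]))[0])   # softmax probability of class 1
print(1 / (1 + np.exp(-z0)))             # sigmoid(z0) - same value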
simplifying assumptions:
Neurons are arranged in layers and the layers are arranged sequentially.
Neurons within the same layer do not interact with each other.
All the inputs enter the network through the input layer and all the outputs go out of the network through the output layer.
Neurons in consecutive layers are densely connected, i.e. all neurons in layer l are connected to all neurons in layer l+1.
Every interconnection in the neural network has a weight associated with it, and every neuron has a bias associated with it.
All neurons in a particular layer use the same activation function.
Parameters (which the network learns during training)
weights
biases
Hyperparameters of Neural Networks (which need to be specified beforehand)
number of layers
activation functions
number of neurons in the input, hidden and output layers
notations
W stands for the weight matrix
b stands for the bias
x stands for the input
y stands for the ground truth label
p stands for the probability vector of the predicted output
h is the output of the hidden layers
superscript stands for layer number
subscript stands for the index of the individual neuron
Activation Functions
should be smooth
should make the output a non-linear function of the input
non-linearity helps in making neural networks more compact
Types
Logistic (sigmoid) function
output = 1 / (1 + e^(−x))
Hyperbolic tangent function - similar to the sigmoid function
output = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
ReLU - Rectified Linear Unit
output = x for x >= 0, and 0 otherwise
Leaky ReLU
output = x for x >= 0, output = αx otherwise
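Quick numpy sketches of the activation functions listed above (the α = 0.01 default for Leaky ReLU is an assumption for illustration):
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))            # 1 / (1 + e^(-x))

def tanh(x):
    return np.tanh(x)                      # (e^x - e^(-x)) / (e^x + e^(-x))

def relu(x):
    return np.maximum(0, x)                # x for x >= 0, 0 otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)  # x for x >= 0, αx otherwise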
Number of interconnections = number of neurons in layer 'l' x number of neurons in layer 'l-1'
feedforward neural networks
no loops in the network
output from one layer is used as input to the next layer
p_ij = e^(w_j · h^L) / Σ(t=1 to c) e^(w_t · h^L) is the softmax output, interpreted as the probability that the i-th data point belongs to class j
feedforward algorithm:
h^0 = x_i (or the batch matrix B)
for l in [1, 2, ..., L]:
    h^l = σ(W^l · h^(l−1) + b^l)
p_i = e^(W^o · h^L)
p_i = normalize(p_i)
dimension of B: d × m
d: dimension of a single input x (e.g. rows × columns of an image)
m: number of data points of X stacked side by side as columns
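A vectorised numpy sketch of this feedforward pass (layer sizes, the sigmoid choice for σ and the random values are illustrative assumptions; the output bias is omitted to match the algorithm above):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(B, weights, biases, Wo):
    h = B                                               # h^0 = B (a d x m batch of inputs)
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)                          # h^l = σ(W^l · h^(l-1) + b^l)
    scores = np.exp(Wo @ h)                             # p = e^(W^o · h^L)
    return scores / scores.sum(axis=0, keepdims=True)   # p = normalize(p)

# illustrative shapes: d = 4 inputs, m = 3 data points, one hidden layer of 5 neurons, 2 classes
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 3))
weights, biases = [rng.normal(size=(5, 4))], [rng.normal(size=(5, 1))]
Wo = rng.normal(size=(2, 5))
print(feedforward(B, weights, biases, Wo).sum(axis=0))  # each column sums to 1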
gradient descent
the parameters being optimised are updated iteratively in the direction of decreasing cost
i.e. move in the direction of reducing loss by changing the weights (and biases)
backpropagation algorithm:
Feedforward the ith data point
Compute the loss of the ith data point
Aggregate (compute the average of) m losses
Compute the gradient of loss with respect to weights and biases
Update the weights and biases
backpropagation gradients
dz3 = p − y
dW3 = dz3 · (h2)^T
dh2 = (W3)^T · dz3
dz2 = dh2 ⊙ σ′(z2)
dW2 = dz2 · (h1)^T
dh1 = (W2)^T · dz2
dz1 = dh1 ⊙ σ′(z1)
dW1 = dz1 · (x)^T
(⊙ denotes element-wise multiplication)
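A numpy sketch of these gradient expressions for a three-layer network with sigmoid hidden activations and a softmax output (shapes and random values are illustrative assumptions; bias gradients are omitted to match the list above):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop_gradients(x, y, W1, W2, W3):
    # forward pass (keep the pre-activations z^l and activations h^l)
    z1 = W1 @ x;  h1 = sigmoid(z1)
    z2 = W2 @ h1; h2 = sigmoid(z2)
    z3 = W3 @ h2
    e = np.exp(z3 - z3.max()); p = e / e.sum()       # softmax output

    dz3 = p - y                                      # dz3 = p - y
    dW3 = dz3 @ h2.T                                 # dW3 = dz3 · (h2)^T
    dh2 = W3.T @ dz3                                 # dh2 = (W3)^T · dz3
    dz2 = dh2 * sigmoid(z2) * (1 - sigmoid(z2))      # dz2 = dh2 ⊙ σ′(z2)
    dW2 = dz2 @ h1.T
    dh1 = W2.T @ dz2
    dz1 = dh1 * sigmoid(z1) * (1 - sigmoid(z1))
    dW1 = dz1 @ x.T
    return dW1, dW2, dW3

# illustrative shapes: gradients have the same shapes as the weight matrices
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1)); y = np.array([[1.0], [0.0], [0.0]])
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(6, 5)), rng.normal(size=(3, 6))
dW1, dW2, dW3 = backprop_gradients(x, y, W1, W2, W3)
print(dW1.shape, dW2.shape, dW3.shape)   # (5, 4) (6, 5) (3, 6)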
SGD (stochastic gradient descent) training procedure is as follows:
it is computationally faster than batch gradient descent
the noisy updates can help escape local minima (reaching the global minimum is not guaranteed)
Steps
You specify the number of epochs (typical values are 10, 20, 50, 100 etc.) - more epochs require more computational power
You specify the batch size m (typical values are 32, 64, 128, etc.)
At the start of each epoch, the data set is reshuffled and divided into batches of size m.
The average gradient of each batch is then used to make a weight update.
The training is complete at the end of all the epochs
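A plain Python/numpy sketch of this minibatch SGD loop (the compute_gradients helper is a hypothetical placeholder for the backpropagation step, not a real API):
import numpy as np

def sgd_train(X, Y, params, compute_gradients, learning_rate=0.01,
              epochs=10, batch_size=32):
    n = X.shape[0]
    for epoch in range(epochs):
        idx = np.random.permutation(n)                # reshuffle the data set each epoch
        for start in range(0, n, batch_size):         # divide into batches of size m
            batch = idx[start:start + batch_size]
            grads = compute_gradients(X[batch], Y[batch], params)  # average gradient of the batch
            for key in params:
                params[key] -= learning_rate * grads[key]          # weight/bias update
    return params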
Regularization
leads to better generalizability of the model.
What happens when we try to reduce the value of the loss function
Bias decreases
What happens when we increase the value of the regularization term
Variance decreases
Bias increases
two types of regularization techniques followed in neural networks:
1. L1 norm (lasso-style regularisation): λf(θ) with f(θ) = ||θ||1, the sum of the absolute values of all the model parameters
2. L2 norm (ridge-style regularisation): λf(θ) with f(θ) = ||θ||2^2, the sum of squares of all the model parameters
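In Keras, such penalties can be attached to a layer through the kernel_regularizer argument (the 0.01 strengths below are arbitrary illustrations):
from keras import regularizers
from keras.layers import Dense

# L1 (lasso-style) penalty on the layer's weights
Dense(35, activation='relu', kernel_regularizer=regularizers.l1(0.01))
# L2 (ridge-style) penalty on the layer's weights
Dense(35, activation='relu', kernel_regularizer=regularizers.l2(0.01))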
Dropouts
help in symmetry breaking
reducing the complexity of the model
important points to note
Dropouts can be applied only to some layers of the network
The mask α is generated independently for each layer during feedforward, and the same mask is used in backpropagation
The mask changes with each minibatch/iteration; a new mask is randomly generated in each iteration
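A numpy sketch of applying such a dropout mask during feedforward (the inverted-dropout scaling by 1/(1 − drop_prob) is an assumption here; Keras's Dropout layer handles this internally):
import numpy as np

def dropout_forward(h, drop_prob=0.3):
    # mask: 0 with probability drop_prob, 1 otherwise, drawn afresh each iteration
    mask = (np.random.rand(*h.shape) >= drop_prob).astype(h.dtype)
    return h * mask / (1 - drop_prob), mask   # the same mask is reused in backpropagation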
Batch Normalization
a layer's output is a composite function of all the weights and biases of the previous layers
Batch normalisation is usually done for all the layer outputs except the output layer.
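In Keras, this is available as the BatchNormalization layer, typically added after a hidden layer (a sketch with illustrative layer sizes):
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(35, input_dim=784, activation='relu'))
model.add(BatchNormalization())              # batch-normalise this layer's outputs
model.add(Dense(10, activation='softmax'))   # no batch normalisation on the output layer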
There are six main steps in building a model using Keras:
Load the data
Define the model
Compile the model
Fit the model
Evaluate the model
Make predictions
Define the model
nn_model = Sequential()
The input vectors are flattened 28 x 28 images, i.e. 784-dimensional.
We add a 35-neuron hidden layer with the 'relu' activation function. Notice that we have used 'Dense', which specifies that the layers are fully connected, i.e. every neuron in one layer is connected to every neuron in the adjacent layers.
nn_model.add(Dense(35, input_dim=784, activation='relu'))
In case we want to use dropout in the particular layer, we add a Dropout as follows:
nn_model.add(Dropout(0.3))
The argument '0.3' is the fraction of neurons that are dropped out in a particular iteration.
We add another dense hidden layer with 21 neurons and the 'relu' activation function.
nn_model.add(Dense(21, activation='relu'))
The last layer for our problem is a softmax layer with 10 classes defined as follows:
nn_model.add(Dense(10, activation='softmax'))
Let's consolidate the above code (with the required imports):
from keras.models import Sequential
from keras.layers import Dense, Dropout

nn_model = Sequential()
nn_model.add(Dense(35, input_dim=784, activation='relu'))
nn_model.add(Dropout(0.3))
nn_model.add(Dense(21, activation='relu'))
nn_model.add(Dense(10, activation='softmax'))
Compile Model
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fit Model
nn_model.fit(train_set_x, train_set_y, epochs=10, batch_size=10)
Evaluate Model
In this step, we check the accuracy score achieved on the training data using the following command:
scores_train = nn_model.evaluate(train_set_x, train_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_train[1]*100))
To get the score on the test data, we can write the following:
scores_test = nn_model.evaluate(test_set_x, test_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_test[1]*100))
Please note that we only changed the dataset from train to test.
Predict
The predictions can be performed using '.predict()' in the following way:
predictions = nn_model.predict(test_set_x)
Convolutional Neural Networks
Applications of CNNs
Object localization
Semantic segmentation
Optical Character Recognition
the layers in the network should do something like this:
The first layer extracts raw features, like vertical and horizontal edges
The second layer extracts more abstract features such as textures (using the features extracted by the first layer)
The subsequent layers may identify certain parts of the image such as skin, hair, nose, mouth etc. based on the textures.
Layers further up may identify faces, limbs etc.
Finally, the last layer may classify the image as 'human', 'cat' etc.
VGGNet
Convolution, and why it 'shrinks' the size of the input image
convolution operation is the summation of the element-wise product of two matrices
an (n, n) image will produce an (n-2, n-2) output on convolving with a (3, 3) filter.
The output size is ((n + 2p − k) / s) + 1, where
p = padding
k = convolution filter size
s = stride
With p = 1, k = 3 and s = 1, the output will always be (n, n)
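A small helper to evaluate this output-size formula (assumes the division is floored, as most frameworks do):
def conv_output_size(n, k, p=0, s=1):
    """Output height/width of a convolution: ((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, 3))            # (28, 28) image, (3, 3) filter -> 26
print(conv_output_size(28, 3, p=1))       # padding 1 keeps the size at 28
print(conv_output_size(28, 3, p=1, s=2))  # stride 2 roughly halves the size -> 14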
stride
you can move the filter by an arbitrary number of pixels
padding
helps maintain the size of the output arrays and avoid information loss
Padding of 'x' means that 'x units' of rows/columns are added all around the image.
Pooling layers
Pooling has the advantage of making the representation more compact by reducing the spatial size (height and width) of the feature maps.
Max pooling
Average pooling
points
It makes the network invariant to small local transformations (e.g. the image being tilted a little, or an object being located in a slightly different region).
makes the representation of the feature map more compact, thereby reducing the number of parameters in the network.
reduces only the width and the height, not the depth (the number of feature maps)
helps control overfitting
Feature maps
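A hedged Keras sketch showing how convolution and pooling layers produce and shrink feature maps (the filter count and input shape are illustrative assumptions):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

cnn = Sequential()
cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
               input_shape=(28, 28, 1)))   # 32 feature maps of size 26 x 26
cnn.add(MaxPooling2D(pool_size=(2, 2)))    # pooling halves height/width: 13 x 13 x 32
cnn.summary()                              # prints the feature-map shapes per layer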