Deep Learning
- bottleneck in using neural networks is the availability of abundant training data
- Artificial Neural Networks are a collection of a large number of simple devices called artificial neurons
- Perceptron
- takes some signals as inputs and performs a set of simple calculations to arrive at a decision
- perceptron takes a weighted sum of multiple inputs (along with a bias) as the cumulative input and applies a step function on the cumulative input
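A minimal sketch of a perceptron in NumPy (the input, weight and bias values are illustrative, not from the source):
import numpy as np

def perceptron(x, w, b):
    # cumulative input: weighted sum of the inputs plus the bias
    z = np.dot(w, x) + b
    # step function on the cumulative input: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

# illustrative values
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, -0.1, 0.6])
b = -0.2
print(perceptron(x, w, b))  # 1, since 0.4 - 0.05 - 0.12 - 0.2 = 0.03 >= 0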
- Structure of Neural Networks
- Neurons in a neural network are arranged in layers
- The first and the last layer are called the input and output layers
- The input layer has as many neurons as the number of attributes in the data set
- The output layer has as many neurons as the number of classes of the target variable (for a classification problem).
- For a regression problem, the number of neurons in the output layer would be 1
- Components
- Network Topology
- Input Layer
- Output Layer
- Weights
- Activation functions
- Biases
- softmax output
- a multiclass logistic function commonly used to compute the 'probability' of an input belonging to one of multiple classes
- The softmax function reduces to the sigmoid function in the special case of binary classification
- in the case of a sigmoid output, there is only one neuron in the output layer
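A small NumPy sketch of the softmax output, showing that for two classes it carries the same information as a sigmoid (the scores are illustrative):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()          # 'probabilities' that sum to 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([2.0, -1.0])       # scores for two classes
print(softmax(z))               # [0.9526, 0.0474]
print(sigmoid(z[0] - z[1]))     # 0.9526, softmax over 2 classes reduces to a sigmoid of the score difference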
- simplifying assumptions:
- Neurons are arranged in layers and the layers are arranged sequentially.
- Neurons within the same layer do not interact with each other.
- All the inputs enter the network through the input layer and all the outputs go out of the network through the output layer.
- Neurons in consecutive layers are densely connected, i.e. all neurons in layer l are connected to all neurons in layer l+1.
- Every interconnection in the neural network has a weight associated with it, and every neuron has a bias associated with it.
- All neurons in a particular layer use the same activation function.
- Parameters (which the network learns during training)
- weights
- biases
- Hyperparameters of Neural Networks (which need to be specified beforehand)
- number of layers,
- activation functions
- number of neurons in the
- input layer
- hidden layers
- output layer
- notations
- W stands for the weight matrix
- b stands for the bias
- x stands for the input
- y is the ground truth label
- p is the probability vector of the predicted output
- h is the output of the hidden layers
- superscript stands for layer number
- subscript stands for the index of the individual neuron
- Activation Functions
- should be smooth
- make the relationship between inputs and outputs non-linear
- non-linearity helps in making neural networks more compact
- Types
- Logistic function
- output = 1 / (1 + e^(-x))
- Hyperbolic tangent function - similar to the sigmoid function.
- output = tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
- ReLU - Rectified Linear Unit
- output = x for x >= 0 and 0 otherwise
- Leaky ReLU
- output = x for x >= 0, output = αx otherwise
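A minimal NumPy sketch of these activation functions (α = 0.01 for Leaky ReLU is an illustrative default):
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))        # 1 / (1 + e^(-x))

def tanh(x):
    return np.tanh(x)                       # (e^x - e^(-x)) / (e^x + e^(-x))

def relu(x):
    return np.maximum(0, x)                 # x for x >= 0, 0 otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)   # x for x >= 0, alpha*x otherwise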
- Number of interconnections = number of neurons in layer 'l' x number of neurons in layer 'l-1'
- feedforward neural networks
- no loops in the network
- output from one layer is used as input to the next layer
- p_ij = e^(w_j · h^L) / ∑_{t=1}^{c} e^(w_t · h^L), the softmax probability that the ith data point belongs to class j (c is the number of classes)
- feedforward algorithm:
- h^0 = x_i (or B, for a batch of inputs)
- for l in [1, 2, ..., L]:
- h^l = σ(W^l · h^(l−1) + b^l)
- p_i = e^(W_o · h^L)
- p_i = normalize(p_i)
- dimension of B: d × m
- d: dimension of a single input x (rows × columns)
- m: number of data points of X stacked side by side (the batch size)
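A minimal NumPy sketch of this feedforward algorithm for a batch B (the layer sizes and the sigmoid hidden activation are illustrative assumptions):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(B, weights, biases, W_o):
    # B: (d, m) matrix -- m data points of dimension d stacked side by side
    h = B                                    # h^0 = B
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)               # h^l = sigma(W^l . h^(l-1) + b^l)
    p = np.exp(W_o @ h)                      # unnormalised class scores
    return p / p.sum(axis=0, keepdims=True)  # normalise -> softmax probabilities per data point

# illustrative shapes: d=4 inputs, one hidden layer of 5 neurons, c=3 classes, batch of m=2
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 2))
weights, biases = [rng.normal(size=(5, 4))], [np.zeros((5, 1))]
W_o = rng.normal(size=(3, 5))
print(feedforward(B, weights, biases, W_o).sum(axis=0))  # each column sums to 1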
- gradient descent
- the parameters being optimised are updated iteratively in the direction of reducing cost
- Move in the direction of reducing loss by changing the weights
- backpropagation algorithm:
- Feedforward the ith data point
- Compute the loss of the ith data point
- Aggregate (compute the average of) m losses
- Compute the gradient of loss with respect to weights and biases
- Update the weights and biases
backpropagation gradients
- dz3 = p − y
- dW3 = dz3 · (h2)^T
- dh2 = (W3)^T · dz3
- dz2 = dh2 ⊗ σ′(z2)
- dW2 = dz2 · (h1)^T
- dh1 = (W2)^T · dz2
- dz1 = dh1 ⊗ σ′(z1)
- dW1 = dz1 · (x)^T
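A NumPy sketch of these gradient equations for one data point of a 3-layer network (it assumes sigmoid hidden activations and that x, y, p, h1, h2, z1, z2 were cached as column vectors during the forward pass):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop_gradients(x, y, p, h1, h2, z1, z2, W2, W3):
    dz3 = p - y                      # dz3 = p - y
    dW3 = dz3 @ h2.T                 # dW3 = dz3 . (h2)^T
    dh2 = W3.T @ dz3                 # dh2 = (W3)^T . dz3
    dz2 = dh2 * sigmoid_prime(z2)    # dz2 = dh2 (element-wise) sigma'(z2)
    dW2 = dz2 @ h1.T                 # dW2 = dz2 . (h1)^T
    dh1 = W2.T @ dz2                 # dh1 = (W2)^T . dz2
    dz1 = dh1 * sigmoid_prime(z1)    # dz1 = dh1 (element-wise) sigma'(z1)
    dW1 = dz1 @ x.T                  # dW1 = dz1 . (x)^T
    # the bias gradients are simply db_l = dz_l
    return dW1, dW2, dW3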
The SGD (stochastic gradient descent) training procedure is as follows:
- computationally faster
- the noise in the updates helps escape local minima (though reaching the global minimum is not guaranteed)
- Steps
- You specify the number of epochs (typical values are 10, 20, 50, 100 etc.) - more epochs require more computational power
- You specify the batch size m (typical values are 32, 64, 128, etc.)
- At the start of each epoch, the data set is reshuffled and divided into batches of size m.
- The average gradient of each batch is then used to make a weight update.
- The training is complete at the end of all the epochs
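A skeleton of this SGD loop (compute_gradients and the learning rate lr are illustrative placeholders, not a specific library API):
import numpy as np

def sgd(X, Y, params, compute_gradients, lr=0.01, epochs=10, batch_size=32):
    n = X.shape[0]
    for epoch in range(epochs):
        idx = np.random.permutation(n)          # reshuffle the data set at the start of each epoch
        for start in range(0, n, batch_size):   # divide it into batches of size m
            batch = idx[start:start + batch_size]
            grads = compute_gradients(X[batch], Y[batch], params)  # average gradient over the batch
            for key in params:                  # one weight/bias update per batch
                params[key] -= lr * grads[key]
    return params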
Regularization
- leads to better generalizability of the model.
- What happens when we try to reduce the value of the loss function
- Bias decreases
- What happens when we increase the value of the regularization term
- Variance decreases
- Bias increases
- two types of regularization techniques commonly used in neural networks:
1. L1 norm (lasso): the regularization term is λf(θ), where f(θ) = ||θ||₁ is the sum of the absolute values of all the model parameters
2. L2 norm (ridge): the regularization term is λf(θ), where f(θ) = ||θ||₂² is the sum of the squares of all the model parameters
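In Keras, these penalties can be attached to a layer via kernel regularizers; a minimal sketch (the layer size and λ = 0.01 are illustrative):
from keras import regularizers
from keras.layers import Dense

# a Dense layer with an L2 (ridge) penalty of strength lambda = 0.01 on its weights
regularized_layer = Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))
# an L1 (lasso) penalty would use regularizers.l1(0.01) instead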
- Dropouts
- help in symmetry breaking
- reducing the complexity of the model
- important points to note
- Dropouts can be applied only to some layers of the network
- The mask α is generated independently for each layer during feedforward, and the same mask is used in backpropagation
- The mask changes with each minibatch/iteration, i.e. it is randomly regenerated in every iteration
- Batch Normalization
- a layer's output is a composite function of all the weights and biases of the previous layers, so its distribution keeps shifting as those parameters are updated; batch normalisation normalises the layer output over each batch to stabilise training
- Batch normalisation is usually done for all the layer outputs except the output layer.
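A minimal sketch of adding batch normalisation after a hidden layer in Keras (assuming nn_model is the Sequential model defined in the Keras section below; the layer size is illustrative):
from keras.layers import Dense, BatchNormalization

nn_model.add(Dense(35, activation='relu'))
nn_model.add(BatchNormalization())   # normalise this layer's outputs over each batch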
There are six main steps in building a model using Keras:
- Load the data
- Define the model
- Compile the model
- Fit the model
- Evaluate the model
- Make predictions
Define the model
nn_model = Sequential()
- The input vectors are of size 28x28 = 784
- We add a hidden layer of 35 neurons with the 'relu' activation function
- 'Dense' specifies that the layers are fully connected, i.e. every neuron in one layer is connected to every neuron in the adjacent layers
nn_model.add(Dense(35, input_dim=784, activation='relu'))
In case we want to use dropout in a particular layer, we add a Dropout layer as follows:
nn_model.add(Dropout(0.3))
The argument '0.3' is the fraction of neurons that are dropped out in a particular iteration.
We add another dense hidden layer with 21 neurons and the 'relu' activation function.
nn_model.add(Dense(21, activation='relu'))
The last layer for our problem is a softmax layer with 10 classes defined as follows:
nn_model.add(Dense(10, activation='softmax'))
Let's consolidate the above code:
from keras.models import Sequential
from keras.layers import Dense, Dropout

nn_model = Sequential()
nn_model.add(Dense(35, input_dim=784, activation='relu'))
nn_model.add(Dropout(0.3))
nn_model.add(Dense(21, activation='relu'))
nn_model.add(Dense(10, activation='softmax'))
Compile Model
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fit Model
nn_model.fit(train_set_x, train_set_y, epochs=10, batch_size=10)
Evaluate Model
In this step, we check the accuracy achieved on the training data using the following command:
scores_train = nn_model.evaluate(train_set_x, train_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_train[1]*100))
To get the score on the test data, we can write the following:
scores_test = nn_model.evaluate(test_set_x, test_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_test[1]*100))
Please note that we only changed the dataset from train to test.
Predict
The predictions can be performed using '.predict()' in the following way:
predictions = nn_model.predict(test_set_x)
Convolutional Neural Networks
Applications of CNNs
- Object localization
- Semantic segmentation
- Optical Character Recognition
the layers in the network should do something like this:
- The first layer extracts raw features, like vertical and horizontal edges
- The second layer extracts more abstract features such as textures (using the features extracted by the first layer)
- The subsequent layers may identify certain parts of the image such as skin, hair, nose, mouth etc. based on the textures.
- Layers further up may identify faces, limbs etc.
- Finally, the last layer may classify the image as 'human', 'cat' etc.
VGGNet
- Convolution, and why it 'shrinks' the size of the input image
- the convolution operation is the sum of the element-wise product of two matrices
- an (n, n) image produces an (n-2, n-2) output on convolving with a (3, 3) filter (with stride 1 and no padding)
- The output size is ((n + 2p − k) / s) + 1, with
- p = padding
- k = convolution filter size
- s = stride
- With p = 1, k = 3, s = 1, the output will always be (n, n)
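A quick check of the formula in Python (the 28 x 28 image size is illustrative):
def conv_output_size(n, k, p, s):
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, k=3, p=0, s=1))  # 26 -> the (n-2, n-2) case
print(conv_output_size(28, k=3, p=1, s=1))  # 28 -> padding of 1 keeps the size
print(conv_output_size(28, k=3, p=0, s=2))  # 13 -> a stride of 2 roughly halves the size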
- stride
- you can move the filter by an arbitrary number of pixels
- padding
- helps maintain the size of the output arrays and avoid information loss
- Padding of 'x' means that 'x units' of rows/columns are added all around the image.
- Pooling layers
- Pooling has the advantage of making the representation more compact by reducing the spatial size (height and width) of the feature maps
- Max pooling
- Average pooling
- points
- It makes the network invariant to local transformations (e.g. the image being tilted a little, or an object being located in a slightly different region)
- makes the representation of the feature map more compact, thereby reducing the number of parameters in the network.
- reduces only the width and the height (the depth, i.e. the number of feature maps, stays the same)
- helps control overfitting
- Feature maps
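A minimal Keras sketch of a convolution + max-pooling block producing feature maps (the filter count and input shape are illustrative):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

cnn = Sequential()
# 32 filters of size 3x3 with 'same' padding (p=1), so each feature map stays 28x28
cnn.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))
# 2x2 max pooling halves the width and height of each feature map (28x28 -> 14x14)
cnn.add(MaxPooling2D(pool_size=(2, 2)))
print(cnn.output_shape)  # (None, 14, 14, 32)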