Deep Learning
A major bottleneck in using neural networks is that they require abundant training data
Artificial Neural Networks are a collection of a large number of simple devices called artificial neurons
Perceptron
takes some signals as inputs and performs a set of simple calculations to arrive at a decision
a perceptron takes a weighted sum of multiple inputs (along with a bias) as the cumulative input and applies a step function to the cumulative input to produce the output
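As an illustration, here is a minimal perceptron sketch in numpy (the AND-gate weights and bias below are assumptions chosen for the example, not part of the notes):
import numpy as np

def perceptron(x, w, b):
    cumulative_input = np.dot(w, x) + b       # weighted sum of inputs + bias
    return 1 if cumulative_input >= 0 else 0  # step function on the cumulative input

# illustrative example: a perceptron behaving like a logical AND gate
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0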
Structure of Neural Networks
Neurons in a neural network are arranged in layers
The first and the last layers are called the input and output layers respectively.
The input layer has as many neurons as the number of attributes in the data set.
The output layer has as many neurons as the number of classes of the target variable (for a classification problem).
For a regression problem, the number of neurons in the output layer would be 1
Components
Network Topology
Input Layer
Output Layer
Weights
Activation functions
Biases
softmax output
a multiclass logistic function commonly used to compute the 'probability' of an input belonging to one of multiple classes
the softmax function reduces to a sigmoid function in the special case of binary classification
in the case of a sigmoid output, there is only one neuron in the output layer
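A small numpy sketch of the softmax output and its binary special case (the example scores are arbitrary):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))               # probabilities over 3 classes, summing to 1

# binary special case: softmax over [z, 0] equals the sigmoid of z
z0 = 1.3
print(softmax(np.array([z0, 0.0]))[0])   # softmax probability of class 1
print(1 / (1 + np.exp(-z0)))             # sigmoid(z0) - same value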
simplifying assumptions:
Neurons are arranged in layers and the layers are arranged sequentially.
Neurons within the same layer do not interact with each other.
All the inputs enter the network through the input layer and all the outputs go out of the network through the output layer.
Neurons in consecutive layers are densely connected, i.e. all neurons in layer l are connected to all neurons in layer l+1.
Every interconnection in the neural network has a weight associated with it, and every neuron has a bias associated with it.
All neurons in a particular layer use the same activation function.
Parameters (which the network learns during training)
weights
biases
Hyperparameters of Neural Networks (which need to be specified beforehand)
number of layers
activation functions
number of neurons in the input, hidden and output layers
notations
W stands for the weight matrix
b stands for the bias
x stands for the input
y stands for the ground truth label
p stands for the probability vector of the predicted output
h is the output of the hidden layers
superscript stands for layer number
subscript stands for the index of the individual neuron
Activation Functions
should be smooth
should make the output a non-linear function of the input
non-linearity helps in making neural networks more compact
Types
Logistic (sigmoid) function
output = 1 / (1 + e^(−x))
Hyperbolic tangent function - similar to the sigmoid function
output = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
ReLU - Rectified Linear Unit
output = x for x >= 0, and 0 otherwise
Leaky ReLU
output = x for x >= 0, output = αx otherwise
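Quick numpy sketches of the activation functions listed above (the α = 0.01 default for Leaky ReLU is an assumption for illustration):
import numpy as np

def logistic(x):
    return 1 / (1 + np.exp(-x))            # 1 / (1 + e^(-x))

def tanh(x):
    return np.tanh(x)                      # (e^x - e^(-x)) / (e^x + e^(-x))

def relu(x):
    return np.maximum(0, x)                # x for x >= 0, 0 otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)  # x for x >= 0, αx otherwise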
Number of interconnections = number of neurons in layer 'l' x number of neurons in layer 'l-1'
feedforward neural networks
no loops in the network
output from one layer is used as input to the next layer
p_ij = e^(w_j · h^L) / Σ(t=1 to c) e^(w_t · h^L) is the softmax output, interpreted as the probability that the i-th data point belongs to class j
feedforward algorithm:
h^0 = x_i (or the batch matrix B)
for l in [1, 2, ..., L]:
    h^l = σ(W^l · h^(l−1) + b^l)
p_i = e^(W^o · h^L)
p_i = normalize(p_i)
dimension of B: d × m
d: dimension of a single input x (e.g. rows × columns of an image)
m: number of data points of X stacked side by side as columns
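A vectorised numpy sketch of this feedforward pass (layer sizes, the sigmoid choice for σ and the random values are illustrative assumptions; the output bias is omitted to match the algorithm above):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(B, weights, biases, Wo):
    h = B                                               # h^0 = B (a d x m batch of inputs)
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)                          # h^l = σ(W^l · h^(l-1) + b^l)
    scores = np.exp(Wo @ h)                             # p = e^(W^o · h^L)
    return scores / scores.sum(axis=0, keepdims=True)   # p = normalize(p)

# illustrative shapes: d = 4 inputs, m = 3 data points, one hidden layer of 5 neurons, 2 classes
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 3))
weights, biases = [rng.normal(size=(5, 4))], [rng.normal(size=(5, 1))]
Wo = rng.normal(size=(2, 5))
print(feedforward(B, weights, biases, Wo).sum(axis=0))  # each column sums to 1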
gradient descent
the parameters being optimised are updated iteratively in the direction of decreasing cost
i.e. move in the direction of reducing loss by changing the weights (and biases)
backpropagation algorithm:
Feedforward the ith data point
Compute the loss of the ith data point
Aggregate (compute the average of) m losses
Compute the gradient of loss with respect to weights and biases
Update the weights and biases
backpropagation gradients
dz3 = p − y
dW3 = dz3 · (h2)^T
dh2 = (W3)^T · dz3
dz2 = dh2 ⊙ σ′(z2)
dW2 = dz2 · (h1)^T
dh1 = (W2)^T · dz2
dz1 = dh1 ⊙ σ′(z1)
dW1 = dz1 · (x)^T
(⊙ denotes element-wise multiplication)
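A numpy sketch of these gradient expressions for a three-layer network with sigmoid hidden activations and a softmax output (shapes and random values are illustrative assumptions; bias gradients are omitted to match the list above):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop_gradients(x, y, W1, W2, W3):
    # forward pass (keep the pre-activations z^l and activations h^l)
    z1 = W1 @ x;  h1 = sigmoid(z1)
    z2 = W2 @ h1; h2 = sigmoid(z2)
    z3 = W3 @ h2
    e = np.exp(z3 - z3.max()); p = e / e.sum()       # softmax output

    dz3 = p - y                                      # dz3 = p - y
    dW3 = dz3 @ h2.T                                 # dW3 = dz3 · (h2)^T
    dh2 = W3.T @ dz3                                 # dh2 = (W3)^T · dz3
    dz2 = dh2 * sigmoid(z2) * (1 - sigmoid(z2))      # dz2 = dh2 ⊙ σ′(z2)
    dW2 = dz2 @ h1.T
    dh1 = W2.T @ dz2
    dz1 = dh1 * sigmoid(z1) * (1 - sigmoid(z1))
    dW1 = dz1 @ x.T
    return dW1, dW2, dW3

# illustrative shapes: gradients have the same shapes as the weight matrices
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1)); y = np.array([[1.0], [0.0], [0.0]])
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(6, 5)), rng.normal(size=(3, 6))
dW1, dW2, dW3 = backprop_gradients(x, y, W1, W2, W3)
print(dW1.shape, dW2.shape, dW3.shape)   # (5, 4) (6, 5) (3, 6)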
SGD (stochastic gradient descent) training procedure is as follows:
it is computationally faster than batch gradient descent
the noisy updates can help escape local minima (reaching the global minimum is not guaranteed)
Steps
You specify the number of epochs (typical values are 10, 20, 50, 100 etc.) - more epochs require more computational power
You specify the batch size m (typical values are 32, 64, 128, etc.)
At the start of each epoch, the data set is reshuffled and divided into batches of size m.
The average gradient of each batch is then used to make a weight update.
The training is complete at the end of all the epochs
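A plain Python/numpy sketch of this minibatch SGD loop (the compute_gradients helper is a hypothetical placeholder for the backpropagation step, not a real API):
import numpy as np

def sgd_train(X, Y, params, compute_gradients, learning_rate=0.01,
              epochs=10, batch_size=32):
    n = X.shape[0]
    for epoch in range(epochs):
        idx = np.random.permutation(n)                # reshuffle the data set each epoch
        for start in range(0, n, batch_size):         # divide into batches of size m
            batch = idx[start:start + batch_size]
            grads = compute_gradients(X[batch], Y[batch], params)  # average gradient of the batch
            for key in params:
                params[key] -= learning_rate * grads[key]          # weight/bias update
    return params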
Regularization
leads to better generalizability of the model.
What happens when we try to reduce the value of the loss function
Bias decreases
What happens when we increase the value of the regularization term
Variance decreases
Bias increases
two types of regularization techniques followed in neural networks:
1. L1 norm (lasso-style regularisation): λf(θ) with f(θ) = ||θ||1, the sum of the absolute values of all the model parameters
2. L2 norm (ridge-style regularisation): λf(θ) with f(θ) = ||θ||2^2, the sum of squares of all the model parameters
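In Keras, such penalties can be attached to a layer through the kernel_regularizer argument (the 0.01 strengths below are arbitrary illustrations):
from keras import regularizers
from keras.layers import Dense

# L1 (lasso-style) penalty on the layer's weights
Dense(35, activation='relu', kernel_regularizer=regularizers.l1(0.01))
# L2 (ridge-style) penalty on the layer's weights
Dense(35, activation='relu', kernel_regularizer=regularizers.l2(0.01))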
Dropouts
help in symmetry breaking
reducing the complexity of the model
important points to note
Dropouts can be applied only to some layers of the network
The mask α is generated independently for each layer during feedforward, and the same mask is used in backpropagation
The mask changes with each minibatch/iteration; a new mask is randomly generated in each iteration
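A numpy sketch of applying such a dropout mask during feedforward (the inverted-dropout scaling by 1/(1 − drop_prob) is an assumption here; Keras's Dropout layer handles this internally):
import numpy as np

def dropout_forward(h, drop_prob=0.3):
    # mask: 0 with probability drop_prob, 1 otherwise, drawn afresh each iteration
    mask = (np.random.rand(*h.shape) >= drop_prob).astype(h.dtype)
    return h * mask / (1 - drop_prob), mask   # the same mask is reused in backpropagation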
Batch Normalization
a layer's output is a composite function of all the weights and biases of the previous layers
Batch normalisation is usually done for all the layer outputs except the output layer.
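In Keras, this is available as the BatchNormalization layer, typically added after a hidden layer (a sketch with illustrative layer sizes):
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(35, input_dim=784, activation='relu'))
model.add(BatchNormalization())              # batch-normalise this layer's outputs
model.add(Dense(10, activation='softmax'))   # no batch normalisation on the output layer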
There are six main steps in building a model using Keras:
Load the data
Define the model
Compile the model
Fit the model
Evaluate the model
Make predictions
Define the model
nn_model = Sequential()
The input vectors are flattened 28 x 28 images, i.e. 784-dimensional.
We add a 35-neuron hidden layer with the 'relu' activation function. Notice that we have used 'Dense', which specifies that the layers are fully connected, i.e. every neuron in one layer is connected to every neuron in the adjacent layers.
nn_model.add(Dense(35, input_dim=784, activation='relu'))
In case we want to use dropout in the particular layer, we add a Dropout as follows:
nn_model.add(Dropout(0.3))
The argument '0.3' is the fraction of neurons that are dropped out in a particular iteration.
We add another dense hidden layer with 21 neurons and the 'relu' activation function.
nn_model.add(Dense(21, activation='relu'))
The last layer for our problem is a softmax layer with 10 classes defined as follows:
nn_model.add(Dense(10, activation='softmax'))
Let's consolidate the above code (with the required imports):
from keras.models import Sequential
from keras.layers import Dense, Dropout

nn_model = Sequential()
nn_model.add(Dense(35, input_dim=784, activation='relu'))
nn_model.add(Dropout(0.3))
nn_model.add(Dense(21, activation='relu'))
nn_model.add(Dense(10, activation='softmax'))
Compile Model
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fit Model
nn_model.fit(train_set_x, train_set_y, epochs=10, batch_size=10)
Evaluate Model
In this step, we check the accuracy score achieved on the training data using the following command:
scores_train = nn_model.evaluate(train_set_x, train_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_train[1]*100))
To get the score on the test data, we can write the following:
scores_test = nn_model.evaluate(test_set_x, test_set_y)
print("\n%s: %.2f%%" % (nn_model.metrics_names[1], scores_test[1]*100))
Please note that we only changed the dataset from train to test.
Predict
The predictions can be performed using '.predict()' in the following way:
predictions = nn_model.predict(test_set_x)
Convolutional Neural Networks
Applications of CNNs
Object localization
Semantic segmentation
Optical Character Recognition
the layers in the network should do something like this:
The first layer extracts raw features, like vertical and horizontal edges
The second layer extracts more abstract features such as textures (using the features extracted by the first layer)
The subsequent layers may identify certain parts of the image such as skin, hair, nose, mouth etc. based on the textures.
Layers further up may identify faces, limbs etc.
Finally, the last layer may classify the image as 'human', 'cat' etc.
VGGNet
Convolution, and why it 'shrinks' the size of the input image
convolution operation is the summation of the element-wise product of two matrices
an (n, n) image will produce an (n-2, n-2) output on convolving with a (3, 3) filter.
The output size is ((n + 2p − k) / s) + 1, where
p = padding
k = convolution filter size
s = stride
With p = 1, k = 3 and s = 1, the output will always be (n, n)
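A small helper to evaluate this output-size formula (assumes the division is floored, as most frameworks do):
def conv_output_size(n, k, p=0, s=1):
    """Output height/width of a convolution: ((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(28, 3))            # (28, 28) image, (3, 3) filter -> 26
print(conv_output_size(28, 3, p=1))       # padding 1 keeps the size at 28
print(conv_output_size(28, 3, p=1, s=2))  # stride 2 roughly halves the size -> 14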
stride
you can move the filter by an arbitrary number of pixels
padding
helps maintain the size of the output arrays and avoid information loss
Padding of 'x' means that 'x units' of rows/columns are added all around the image.
Pooling layers
Pooling has the advantage of making the representation more compact by reducing the spatial size (height and width) of the feature maps.
Max pooling
Average pooling
points
It makes the network invariant to small local transformations (e.g. the image being tilted a little, or an object being located in a slightly different region).
makes the representation of the feature map more compact, thereby reducing the number of parameters in the network.
reduces only the width and the height, not the depth (the number of feature maps)
helps control overfitting
Feature maps
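A hedged Keras sketch showing how convolution and pooling layers produce and shrink feature maps (the filter count and input shape are illustrative assumptions):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

cnn = Sequential()
cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
               input_shape=(28, 28, 1)))   # 32 feature maps of size 26 x 26
cnn.add(MaxPooling2D(pool_size=(2, 2)))    # pooling halves height/width: 13 x 13 x 32
cnn.summary()                              # prints the feature-map shapes per layer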