After completing this learning module, students will be able to:
Describe CAPTCHAs and their real-world impact.
Explain the CNN algorithm and how it can identify letters and numbers in CAPTCHAs.
Apply a CNN to analyze CAPTCHAs.
CAPTCHA:
CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". The purpose of a CAPTCHA is to distinguish real human users from automated machine users. CAPTCHAs present letters, numbers, and combinations of the two that humans can read but machines struggle to.
The most common form of CAPTCHA is an image of distorted letters and numbers. The idea is that the distortion makes the content difficult for machines to interpret, while humans can still determine what it says. The bending and added noise in the letters make it hard for bots to decipher the content.
CAPTCHAs have many uses, because online security can be breached at any time. For example, they can keep bots from spamming message boards, review sites, or contact forms on web pages and blogs. Ticketing websites also use CAPTCHAs to prevent scalpers from buying up tickets to large events. Spammers likewise use bots to create spam email accounts, and CAPTCHAs help prevent this kind of abuse.
Convolutional Neural Network:
In deep learning, a convolutional neural network (CNN) is a class of deep neural networks most commonly applied to analyzing visual imagery. The role of a CNN is to reduce images into a form that is easier to process, without losing the features that are critical for making a good prediction.
The architecture of a convolutional neural network is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlaps to cover the entire visual area.
Among neural networks, convolutional neural networks are one of the main categories used for image recognition and image classification. Object detection, face recognition, and similar tasks are some of the areas where CNNs are widely used.
First, let us understand how images are represented. An RGB image is nothing but a matrix of pixel values with three planes, whereas a grayscale image is the same but with a single plane. In CNNs these planes are also referred to as channels.
The above image can serve as an input image. CNN image classification takes an input image, processes it, and classifies it under certain categories. The computer sees the input image as an array of pixels whose size depends on the image resolution: h x w x d (h = height, w = width, d = depth, i.e. number of channels). For example, a 4 x 4 x 3 array represents an RGB image (the 3 refers to the R, G, and B values), while a 4 x 4 x 1 array represents a grayscale image.
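To make the h x w x d idea concrete, here is a minimal sketch using NumPy (the pixel values below are random placeholders, and NumPy itself is an assumption; the text does not prescribe a library):

```python
import numpy as np

# A 4 x 4 x 3 RGB image: height x width x 3 channels (R, G, B)
rgb_image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# A 4 x 4 x 1 grayscale image: height x width x 1 channel
gray_image = np.random.randint(0, 256, size=(4, 4, 1), dtype=np.uint8)

print(rgb_image.shape)   # (4, 4, 3) -> h x w x d
print(gray_image.shape)  # (4, 4, 1) -> h x w x d
```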
For simplicity, a grayscale image is used below to illustrate how CNNs work.
The above image shows what a convolution is. A filter/kernel (a 3×3 matrix) is applied to the input image to produce the convolved feature. This convolved feature is then passed on to the next layer. The animation below illustrates the process further.
In the above demonstration, the green section represents a 5x5x1 input image. The element that carries out the convolution operation in the first part of a convolutional layer is called the kernel/filter, K, shown in yellow. Here K is a 3x3x1 matrix.
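The sliding-window computation described here can be sketched in plain NumPy (a simplified single-channel version with stride 1 and no padding; the example kernel values are arbitrary and not taken from the demonstration):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)        # a 5x5x1 input like the one in the demo
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])            # an example 3x3 filter
print(convolve2d(image, kernel))           # 3x3 convolved feature
```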
In the case of an image with multiple channels such as RGB, the kernel has the same depth as the input image; take a look at the corresponding animation to see how this works.
Stride: The stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, the filter moves 1 pixel at a time; when the stride is 2, the filter moves 2 pixels at a time, and so on.
Padding: The convolution operation can produce two kinds of results: one in which the convolved feature is reduced in dimensionality compared to the input, and one in which the dimensionality either increases or stays the same. The former is achieved by applying Valid Padding, the latter by applying Same Padding.
According to the above animation, when the 5x5x1 image is padded with a one-pixel border of zeros (giving 7x7x1) and the 3x3x1 kernel is applied over it, the convolved matrix turns out to have dimensions of 5x5x1, the same as the input; hence the name Same Padding. On the other hand, if the same operation is performed without padding, the result is a matrix with the dimensions of the kernel itself (3x3x1), which is Valid Padding.
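The output size for a given stride and padding follows the standard formula output = floor((W - F + 2P) / S) + 1, where W is the input width, F the kernel size, P the padding on each side, and S the stride. A small sketch reproducing the cases above:

```python
def conv_output_size(w, f, p=0, s=1):
    """Output width of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(5, 3, p=0, s=1))  # 3 -> Valid Padding (no padding)
print(conv_output_size(5, 3, p=1, s=1))  # 5 -> Same Padding (output keeps input size)
print(conv_output_size(5, 3, p=0, s=2))  # 2 -> stride 2 moves the filter 2 pixels at a time
```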
Non-Linearity (ReLU): ReLU stands for Rectified Linear Unit, a non-linear operation whose output is f(x) = max(0, x). ReLU's purpose is to introduce non-linearity into the CNN: convolution itself is a linear operation, while the real-world data the CNN needs to learn is non-linear.
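As a minimal sketch, ReLU is just an element-wise maximum with zero:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.3, -0.7]])
print(relu(feature_map))  # negatives become 0: [[0.  1.5] [0.3 0. ]]
```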
Pooling: Pooling layers reduce the number of parameters when the images are too large. Spatial pooling, also called subsampling or down-sampling, reduces the dimensionality of each feature map while retaining the important information. Spatial pooling can be of different types:
Max Pooling
Average Pooling
Max Pooling returns the maximum value from the portion of the image covered by the kernel, whereas Average Pooling returns the average of all the values from the portion of the image covered by the kernel.
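A minimal NumPy sketch of both pooling types over a made-up 4x4 feature map, using a 2x2 window and stride 2:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [1, 2, 8, 7],
               [3, 4, 6, 5]])
print(pool2d(fm, mode="max"))      # [[6. 4.] [4. 8.]]
print(pool2d(fm, mode="average"))  # [[3.75 2.25] [2.5  6.5 ]]
```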
Fully Connected Layer:
In the above diagram, the feature map matrix is flattened into a column vector (x1, x2, …). Through the fully connected layers, these combined features are used to build the model. Finally, for classification, an activation function such as softmax or sigmoid is applied to classify the outputs.
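Putting the layers together, a small CNN of this shape could look as follows. This is only an illustrative sketch assuming Keras/TensorFlow (the text does not prescribe a framework), and the layer sizes, the 28x28x1 input, and the 10-class softmax output are arbitrary choices:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolution + ReLU: extract low-level features such as edges
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Max pooling: down-sample while keeping the strongest activations
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps into a column vector (x1, x2, ...)
    layers.Flatten(),
    # Fully connected layers combine the extracted features
    layers.Dense(64, activation="relu"),
    # Softmax classifies the output into one of the categories
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```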
With this understanding of how a CNN works, we can now see how a CNN can break, or detect, a captcha the way a real user would read it. Breaking a captcha requires the whole procedure described above.
In this captcha image, 5 characters need to be detected individually using a convolutional neural network. To detect the first of the 5 characters, the character '2' must be recognized; a demo of the procedure is shown below. In a CNN, the first layer captures low-level features such as edges, color, and gradient orientation; with added layers, the architecture adapts to higher-level features. After that, Max Pooling comes into the picture: it takes the maximum pixel value from the portion of the image covered by the kernel. Max Pooling also acts as a noise suppressant, which is needed here because it discards noisy activations and performs de-noising along with dimensionality reduction on the captcha images. By running the CNN to classify all 5 characters, the network builds an overall understanding of the images in the dataset, similar to how we read them.
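As a rough sketch of how this could be applied to the 5-character captcha, assume the captcha has already been segmented into single-character images and that a labelled training set of such characters exists. The 36-character set (digits plus uppercase letters), the 40x40 input size, Keras/TensorFlow, and all names such as X_train and read_captcha are illustrative assumptions, not part of the original material:

```python
import numpy as np
from tensorflow.keras import layers, models

CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # assumed 36 possible characters

def build_char_classifier(input_shape=(40, 40, 1)):
    """CNN that classifies a single segmented captcha character."""
    return models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),   # down-sampling also suppresses noise
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(len(CHARSET), activation="softmax"),
    ])

model = build_char_classifier()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Training would use a labelled set of segmented characters (hypothetical names):
# model.fit(X_train, y_train, epochs=10, validation_split=0.1)

def read_captcha(character_crops):
    """Classify each of the 5 character crops and join the predictions."""
    preds = model.predict(np.stack(character_crops))
    return "".join(CHARSET[np.argmax(p)] for p in preds)
```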