In this project, machine learning is used to decode the transmitted messages, even in the presence of degradation and distortion. But in order to understand how this is possible, we will first give brief introductions to artificial intelligence, machine learning, and deep learning.
What is Artificial Intelligence and Machine Learning?
Artificial Intelligence (AI) is the practice of developing computer programs that are able to “think”. The goal is to remove human involvement from the computing process, which leads to many practical applications. The first AI systems were built around programmed rules that had little direct application to the real world, such as an AI system designed to play chess or non-player-controlled characters in a video game. These systems are relatively simple because there is a concrete set of rules to follow, known and programmed in by humans ahead of time. The more difficult application of AI is to solve problems that are intuitive for humans but not necessarily for computers, such as facial recognition [1]. This kind of problem can be addressed with classic machine learning. Machine learning is the ability of an AI system to acquire knowledge by extracting patterns from raw data. However, with classic machine learning, a system's ability to do this depends on human-designed representations of the data. A simple example of this dependence on representation is the way we perform quick mathematical operations with Arabic numerals but struggle to do so at the same speed with Roman numerals [1]. The same is true for AI systems, and given the right representation, classic machine learning models can be useful for many applications.
Representations can be broken down into pieces of data, called features, where a set of features is a representation. When designing features, the goal is to separate the factors of variation, which are the high-level concepts or abstractions that help the machine make sense of the data. Factors of variation for a speech recognition program could include the speaker's sex, age, accent, and words, among other things [1]. However, not every problem has a clear representation that can be coded in by engineers. This problem is addressed by representation learning, which uses machine learning to teach the AI system not only how to get from representation to output, but also the representation of the input itself. Representation learning algorithms can vastly reduce the time it takes to identify representations for more complex problems, and they allow the system to adapt more quickly to changes in the input. However, factors of variation can be highly complex, such as those in image classification. For this research, they might include the number of rings and petals in the image, the intensity, the size of the vortex, etc. Sorting through these factors of variation with representation learning algorithms can be even more of a challenge than the original classification problem. This is the shortcoming of representation learning [1]. It is solved through the use of deep learning. Deep learning is a layered approach that understands the data in a hierarchy of representations, where more complicated representations are built out of simpler ones. This is useful for image classification because the image can be broken down into simple features that are then passed to the next layer of the network. In order to understand how deep learning benefits image classification algorithms, it is important to first understand how machine learning networks work.
Neural Networks
A network, or neural network, is the structure underlying the machine learning algorithms described here; it is made up of layers of computational nodes, called neurons [2]. A classic machine learning network uses fully connected layers of neurons to perform the function of the algorithm. Fully connected layers are so named because every neuron in the previous layer is connected to every neuron in the next layer. For classic machine learning, the information passed from neuron to neuron is a single number, so each layer of neurons takes in a vector of values and outputs a vector of values. For a single neuron, this process can be seen in the figure, where the input values are x1 and x2 and the output is y.
Inside the neuron, the output value is the result of a computational process whose parameters are called the weights and bias. For two inputs into a node, this is y = w1*x1 + w2*x2 + b, where y is the output, x represents an input, w represents a weight, and b is the bias [2].
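As a concrete illustration, the computation a single neuron performs on two inputs can be sketched in a few lines of Python; the weight, bias, and input values below are arbitrary example numbers, not values from the project.

```python
import numpy as np

# Example weights, bias, and inputs for a single neuron (arbitrary values)
w = np.array([0.8, -0.3])   # one weight per input (w1, w2)
b = 0.5                     # bias
x = np.array([1.2, 2.0])    # inputs (x1, x2)

# y = w1*x1 + w2*x2 + b
y = np.dot(w, x) + b
print(y)  # 0.8*1.2 + (-0.3)*2.0 + 0.5 = 0.86
```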
It is easy to imagine how a system could be expanded to have hundreds, or even thousands, of neurons in a single layer to perform increasingly complex computations. A neural network with multiple layers (ranging from tens to thousands) is known as a deep neural network.
Figure: Computational process for a single neural node
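The same idea extends from a single neuron to whole layers and to stacks of layers. The minimal NumPy sketch below treats each fully connected layer as a matrix-vector product plus a bias vector; the layer sizes and the rectified-linear nonlinearity between layers are illustrative assumptions, not details of the project's network.

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, W, b):
    # Each layer maps an input vector to an output vector: y = W @ x + b
    return W @ x + b

# Illustrative layer sizes: 4 inputs -> 8 hidden -> 8 hidden -> 3 outputs
sizes = [4, 8, 8, 3]
params = [(rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(4)      # example input vector
for W, b in params:
    # Rectified-linear nonlinearity between layers (assumed for illustration;
    # a real network would usually omit it after the final layer)
    x = np.maximum(fully_connected(x, W, b), 0)
print(x.shape)                  # (3,) -- one value per output neuron
```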
Deep Convolutional Neural Networks
One such type of deep neural network is the Convolutional Neural Network (CNN). CNNs are specifically designed for image classification, where the traditional machine learning model of numbers as independent inputs and outputs cannot be used, because the spatial structure of the pixels contains information necessary for understanding the image as a whole. In a CNN, each layer takes in a set of images as the input and produces a set of images as the output, preserving the proximity relationships of the pixels. CNNs are typically composed of many sets of three layers: a 2-D convolutional layer, a rectified linear unit layer, and a max pooling layer. The outputs from these layers are then used to classify the image in the output layers.
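A network with this structure can be sketched using PyTorch's built-in layer classes; the number of blocks, filter counts, input size (one 64 x 64 grayscale channel), and ten output classes below are illustrative assumptions rather than the architecture used in this project.

```python
import torch
import torch.nn as nn

# Illustrative CNN: repeated conv -> ReLU -> max-pool blocks, then output layers.
net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 2-D convolutional layer
    nn.ReLU(),                                    # rectified linear unit layer
    nn.MaxPool2d(2),                              # max pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second conv/ReLU/pool block
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),                  # fully connected layer -> class scores
    nn.Softmax(dim=1),                            # softmax layer -> class probabilities
)

x = torch.randn(1, 1, 64, 64)   # one example 64x64 grayscale image
probs = net(x)                  # shape (1, 10): one probability per class
```

The final classification step then simply selects the class with the largest softmax score, as described below.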
The purpose of the convolutional, rectified linear unit, and max pooling layers is to extract features from a given input image, in the form of activations. An activation is the image that results after a filter has been applied to the original image, representing where a specific feature is prevalent in the image. This filtering is done in the convolutional layer. The rectified linear unit layer then performs a threshold operation on each element of the input so that everything less than zero (negative) is set to zero. The max pooling layer is responsible for passing on only the strongest activations within a region of the input, reducing the complexity of the network by discarding irrelevant information. The rectified linear unit and max pooling layers are important in preparing the output activation of the convolutional layer to be an input for the next convolutional layer. The convolutional layer can be thought of as an array of nodes, where each node filters for a specific feature in the image. The filters are the parameters (weights) for that node and are learned during the training process. The mathematical process is called a convolution, where the 2-D filter, typically small (for example, 3 x 3 pixels), is applied across the entire image like a sliding window. The filter activates white for values that match the feature, black for values that are its opposite, and gray for neither. This process produces a grayscale image highlighting a specific feature or pattern in the input image [2]. An illustration of this can be seen in the figure below, which shows a filter that activates on vertical edges, or light pixels to the left of dark pixels. When this is true, the activation shows bright spots. Dark spots occur where the opposite is true, that is, where dark pixels are to the left of light ones. Gray occurs in areas that contain no edges.
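The sliding-window convolution, threshold, and pooling operations described above can be reproduced directly in NumPy. The toy image and vertical-edge filter below are made-up examples, chosen so the bright activation falls exactly on the edge.

```python
import numpy as np

# Toy 6x6 grayscale image: bright (1) on the left, dark (0) on the right,
# so it contains a single vertical edge down the middle.
image = np.zeros((6, 6))
image[:, :3] = 1.0

# 3x3 filter that activates on "light pixels to the left of dark pixels"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Slide the filter across the image (valid positions only) to get the activation map
h, w = image.shape
activation = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        activation[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

relu = np.maximum(activation, 0)                 # threshold: negatives -> 0

# 2x2 max pooling: keep only the strongest activation in each region
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(activation, relu, pooled, sep="\n\n")      # bright values sit on the edge
```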
Simply using this edge filter to classify the input image is not enough, as this is only one part of the complex features that make up a cat. Determining the complex features is difficult because of the many factors of variation that are present in images. This is the purpose of deep CNNs, which break complicated features into simple ones that are built upon in the following layers. A deep CNN also uses representation learning to teach itself the features necessary to classify the images in a given set. This is extremely useful because it removes the human element from feature determination. More often than not, the ML algorithm finds features that are different from those humans use to classify images. The figure to the right provides a simple illustration of this concept, where the feature extraction is broken into many layers. Each successive convolutional layer identifies more complicated features than the last, building on the knowledge of the previous layers. Then, everything is tied together in the final three layers: the fully connected, softmax, and classification layers.
At the fully connected layer, the activations are mapped to the output classes, the number of which is specified by the network designer. The fully connected layer is usually the third-to-last layer used for image classification. At the softmax layer, the output values from the fully connected layer are converted to normalized scores using a normalized exponential function. These scores can be thought of as the probability that a given input image belongs to each class. The values are then passed to the last layer, the classification layer, which returns the name of the most likely class [2]. In this way, deep CNNs become extremely effective at image classification problems, as they bypass the need for human intervention in feature determination.
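As a small worked example of these last steps, the softmax (normalized exponential) turns made-up fully connected scores for three hypothetical classes into probabilities, and the classification step simply picks the largest one.

```python
import numpy as np

# Made-up fully connected output scores for three classes
scores = np.array([2.0, 1.0, -1.0])

# Softmax: normalized exponential, so the outputs are positive and sum to 1
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs)             # ~[0.705, 0.259, 0.035]
print(np.argmax(probs))  # 0 -> the classification layer returns the most likely class
```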
Figure: Typical network architecture
References
[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. https://www.deeplearningbook.org/
[2] MathWorks, MATLAB Academy online course. https://matlabacademy.mathworks.com/R2020a/portal.html?course=mldl