Introduction to Artificial Neural Networks
Need for Neural Approach
A modern computer can perform millions of calculations in a fraction of a second and can execute its programmed tasks far more efficiently than a human being. However, the general operating domain of a computer is severely constrained by its lack of intelligence. Moreover, the conventional computing paradigm has not lent itself to effective parallel processing.
Because of this inherent inflexibility, many tasks that seem trivial to human beings have turned out to be overwhelmingly complex for computers. Conversely, many seemingly complex tasks, such as mastering chess, have been computerized successfully. Today, a computer running the most modern chess program can beat the best human chess player.
Traditionally, computers have been programmed by humans. Even today, whatever trace of intelligence current computer software exhibits is solely the result of the programmers' efforts. The future operation of a piece of software is determined during its laborious design phase: a certain set of operations is defined for each input that the program is expected to receive. This heuristic approach can be very effective within the problem area to which it is applied. However, updating the heuristic rules as the problem changes is usually a difficult and time-consuming process, and a completely distinct set of heuristic rules is needed for every problem.
If, by mistake, the program receives an input it has not been programmed to handle, the consequences are most likely disastrous, or at least unexpected. That is why software designers try to cover all the states and inputs the program can run into. Indeed, one major purpose of a graphical user interface is to limit the number of inputs that the user can give to the program.
In contrast to the heuristic approach, algorithms have proven to be a more efficient means of solving computational problems. Although any single algorithm typically models only a small part of a functioning whole, sometimes one algorithm can solve an entire range of problems. An example is the minimax algorithm, which drives practically all types of board game programs. The common denominator in most board games is that the player strives to maximize its own score while minimizing the opponent's score, and this is exactly the problem the minimax algorithm tackles.
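To make the idea concrete, the following minimal Python sketch shows the recursive structure of minimax. It is not any particular game engine; the helpers children() and evaluate() are hypothetical and would have to be supplied by a concrete game such as chess or tic-tac-toe.

def minimax(state, depth, maximizing, evaluate, children):
    # Hypothetical helpers: children(state) lists the successor positions,
    # evaluate(state) scores a position from the maximizing player's view.
    moves = children(state)
    if depth == 0 or not moves:
        return evaluate(state)
    if maximizing:
        # the player to move picks the successor with the highest score
        return max(minimax(s, depth - 1, False, evaluate, children) for s in moves)
    # the opponent picks the successor with the lowest score
    return min(minimax(s, depth - 1, True, evaluate, children) for s in moves)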
It seems clear that nobody can build fixed, universally applicable heuristic rules; that would mean programming the computer to take into account every possible situation it could face, for problems of any kind. Nor does any algorithm exist that could be presented with an arbitrary problem and be expected to come up with an answer. This leads inevitably to the concept of learning: is it possible to make the computer, to some extent, learn from the data it is presented with?
Biological Neural Network
The biological neural network serves as a natural engineering example of a working, intelligent information processor. As a result of millions of years of evolution, the brain has evolved into a compact, optimized package of computing power capable of dealing with the myriad situations that it can run into. The power of the biological brain lies both in its unsurpassed flexibility and in its massive parallelism.
A neuron is the fundamental unit of a biological neural network. The structure of a neuron is represented in Figure 5. A neuron is made up of a nucleus, a cell body (soma), dendrites and an axon. The cell body and the enclosed nucleus do not play a significant role in the processing of incoming and outgoing data. Rather, the cell body houses the mechanisms that provide the cell with energy and cause its activation. Dendrites receive signals from other neurons, and the axon transmits the neuron's signal onwards. At the junction of the signal-sending axon and the signal-receiving dendrite lies a small gap called a synapse.
As the neurons learn to react to certain signals, the synaptic connections between neurons either get stronger or weaker. The strength of the synaptic connection determines how strong the receiving neuron finds the signal. The signals from different neurons are thus weighted differently based on the strength of the synaptic connections. If the total effect of all the received signals is adequate, the neuron is activated and it will begin to send a signal to the other neurons via its axon.
Neurons that are connected with each other with synaptic connections constitute a biological neural network. The human brain consists of about 100 billion (10^11) interconnected neurons. A single neuron has 1000 to 10 000 connections with other neurons.
Figure 5: Biological neuron.
Artificial Neural Network
An artificial neural network (ANN) is either a hardware implementation or a computer program which strives to simulate the information processing capabilities of its biological exemplar. ANNs are typically composed of a great number of interconnected artificial neurons. The artificial neurons are simplified models of their biological counterparts.
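As an illustration of how simple such a model can be, the following Python sketch computes the output of a single artificial neuron as a weighted sum of its inputs passed through a logistic activation. The function name and numbers are illustrative only and do not come from any particular library.

import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the incoming signals, the weights playing the role of
    # synaptic strengths, followed by a logistic activation in (0, 1).
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# e.g. a neuron with three inputs
print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, 0.4], bias=-0.3))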
The typical characteristics of ANNs differ very much from what is normally expected of a computer. These new properties include adaptive learning, self-organization, error tolerance, real-time operation and parallel information processing.
Learning in the context of ANNs means that the network can adopt different behaviour on the basis of the data given to it. Rather than being told how to react to each data vector separately, as in conventional programming, the network itself is able to discover properties of the presented data. Learning can be continued as new data becomes available; learning is thus adaptive.
As data is given to the ANN, it organizes its structure to reflect the properties of the given data. In most ANN models, the term self-organization refers to the determination of the connection strengths between neurons. The way the internal structure of an ANN is altered is determined by the used learning algorithm. Several distinct neural network models can be distinguished both from their internal architecture and from the learning algorithms that they use.
Error tolerance is an important aspect of an ANN. It refers to the network's ability to model the essential features of the given data. In other words, an ANN is capable of finding a generalization for the data. This powerful characteristic makes it possible to process new, imperfect and distorted data with neural networks.
Due to the parallel nature of the information processing in ANNs, real-time operation becomes possible. Basically, three entities characterize an ANN:
The network topology, or interconnection of neural 'units'
The characteristics of individual units or artificial neurons
The strategy for pattern learning or training
History of the Artificial Neural Networks
The history of the ANNs stems from the 1940s, the decade of the first electronic computer. However, the first significant step took place in 1957 when Rosenblatt introduced the first concrete neural model, the perceptron. Rosenblatt also took part in constructing the first successful neurocomputer, the Mark I Perceptron. After this initial impulse, the development of ANNs has proceeded as described in Figure 1.
Rosenblatt's original perceptron model contained only one layer. From this, a multi-layered model was derived in 1960. At first, the use of the multi-layer perceptron (MLP) was hampered by the lack of a suitable learning algorithm. In 1974, Werbos came to the rescue by introducing the so-called backpropagation algorithm for the three-layered perceptron network. The application area of MLP networks remained rather limited until the breakthrough in 1986, when a general backpropagation algorithm for multi-layered perceptrons was introduced by Rumelhart and McClelland.
Hopfield brought out his idea of a neural network in 1982. Unlike the MLP, the Hopfield network consists of only one layer, whose neurons are fully connected with each other. Since then, new versions of the Hopfield network have been developed. The Boltzmann machine has been influenced by both the Hopfield network and the MLP.
Adaptive Resonance Theory (ART) was first introduced by Carpenter and Grossberg in 1983. The development of ART has continued and resulted in the more advanced ART II and ART III network models.
Radial Basis Function (RBF) networks were first introduced by Broomhead & Lowe in 1988. Although the basic idea of RBF was developed 30 years ago under the name method of potential function, the work by Broomhead & Lowe opened a new frontier in the neural network community.
A totally unique kind of network model is the Self-Organizing Map (SOM), introduced by Kohonen in 1982. The SOM is a certain kind of topological map which organizes itself based on the input patterns it is trained with. The SOM originated from the LVQ (Learning Vector Quantization) network, the underlying idea of which was also Kohonen's, dating from 1972.
Figure 1: The evolution of the most popular artificial neural networks.
Structure of the ANN
Artificial neural networks can be classified according to the structure that they exhibit. Figure 2 represents four commonly used neural network structures.
Figure 2 a) represents the structure of a multi-layered feedforward network. The neurons in this ANN model are grouped in layers, which are connected in the direction of the passing signal (from left to right in this case). There are no lateral connections within a layer and no feedback connections within the network. The best-known ANN of this type is the perceptron network.
Figure 2 b) depicts a single-layered fully connected network model where each neuron is laterally connected to all neighbouring neurons in the layer. In this ANN model, all neurons are both input and output neurons. The best-known ANN of this type is the Hopfield network.
Figure 2 c) demonstrates the connections in a two-layered feedforward/feedbackward network. The layers in this ANN model are connected in both directions. As a pattern is presented to the network, it 'resonates' a certain number of times between the layers before a response is received from the output layer. The best-known ANN of this type is the Adaptive Resonance Theory (ART) network.
Figure 2 d) illustrates the idea of a topologically organized feature map. In this model, each neuron in the network contains a so-called feature vector. When a pattern from the training data is given to the network, the neuron whose feature vector is closest to the input vector is activated. The activated neuron is called the best matching unit (BMU), and it is updated to reflect the input vector that caused the activation. In the process of updating the BMU, the neighbouring neurons are also updated, either towards the input vector or away from it (according to the learning algorithm in use). The network type exhibiting this kind of behaviour is the Self-Organizing Map of Kohonen.
Figure 2: Different ANN structures. a) Multi-layered feedforward network, b) single-layered fully connected network, c) two-layered feedforward/feedbackward network and d) topographically organized vector map.
Supervised vs. Unsupervised Learning
An important aspect of an ANN model is whether it needs guidance in learning or not. Based on the way they learn, all artificial neural networks can be divided into two learning categories - supervised and unsupervised.
In supervised learning, a desired output result for each input vector is required when the network is trained. An ANN of the supervised learning type, such as the multi-layer perceptron, uses the target result to guide the formation of the neural parameters. It is thus possible to make the neural network learn the behaviour of the process under study.
In unsupervised learning, the training of the network is entirely data-driven and no target results for the input data vectors are provided. An ANN of the unsupervised learning type, such as the self-organizing map, can be used for clustering the input data and for finding features inherent to the problem.
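The difference can be sketched in a few lines of Python. The two update rules below are illustrative toy examples (a delta-rule step and a SOM-like step), not complete training procedures, and the numbers are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one input vector

# Supervised: a known target value guides the weight update (delta rule).
w = rng.normal(size=3)
target = 1.0
error = target - w @ x                 # deviation from the desired output
w += 0.1 * error * x

# Unsupervised: no target is given; the reference vector closest to the
# sample simply moves towards it (the SOM works along these lines).
units = rng.normal(size=(5, 3))
best = np.argmin(np.linalg.norm(units - x, axis=1))
units[best] += 0.1 * (x - units[best])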
Application Areas of ANNs
Although a certain type of ANN has been engineered to address certain kinds of problems, there exist no definite rules as to what the exact application domains of ANNs are. The general application areas of ANNs are: robust pattern recognition, filtering, data segmentation, data compression, adaptive control, optimization, modelling complex functions and associative pattern recognition.
Figure 3 illustrates the use of well-known neural networks.
Figure 4 lists the application areas grouped according to the ANN structure.
Figure 3: Application areas of different neural networks.
Figure 4: Application areas of different ANNs grouped by network structure.
Introduction to SOM
Self-adaptive topological maps were initially inspired by models of the perception systems found in the mammalian brain. A perception system involves the reception of external signals and their processing inside the nervous system. Complex mammalian skills, such as seeing and hearing, seemed to bear similarity to each other in the way they work: the primary characteristic of these systems is that neighbouring neurons encode input signals which are similar to each other.
General idea of the SOM model
The Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982. The SOM (also known as the Kohonen feature map) algorithm is one of the best known artificial neural network algorithms. In contrast to many other neural networks using supervised learning, the SOM is based on unsupervised learning.
The SOM is quite a unique kind of neural network in the sense that it constructs a topology preserving mapping from the high-dimensional space onto map units in such a way that relative distances between data points are preserved. The map units, or neurons, usually form a two-dimensional regular lattice where the location of a map unit carries semantic information. The SOM can thus serve as a clustering tool of high-dimensional data. Because of its typical two-dimensional shape, it is also easy to visualize.
Another important feature of the SOM is its capability to generalize. In other words, it can interpolate between previously encountered inputs.
The SOM algorithm
The basic idea of the SOM is simple yet effective. The SOM defines a mapping from the high-dimensional input data space onto a regular two-dimensional array of neurons. Every neuron $i$ of the map is associated with an $n$-dimensional reference vector

$\mathbf{m}_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T$,

where $n$ denotes the dimension of the input vectors. The reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which dictates the topology, or the structure, of the map. The most common topologies in use are rectangular and hexagonal.
Adjacent neurons belong to the neighbourhood Ni of the neuron i. In the basic SOM algorithm, the topology and the number of neurons remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which has an effect on the accuracy and generalization of the SOM.
During the training phase, the SOM forms an elastic net that folds onto the "cloud" formed by the input data. The algorithm controls the net so that it strives to approximate the density of the data: the reference vectors in the codebook drift towards the areas where the density of the input data is high, and eventually only a few codebook vectors lie in areas where the input data is sparse.
The learning process of the SOM goes as follows:
1. One sample vector $\mathbf{x}$ is randomly drawn from the input data set and its similarity (distance) to the codebook vectors is computed using, e.g., the common Euclidean distance measure. The best matching unit (BMU), denoted here by $c$, is the unit whose reference vector is closest to $\mathbf{x}$:

$\|\mathbf{x} - \mathbf{m}_c\| = \min_i \{\|\mathbf{x} - \mathbf{m}_i\|\}$
2. After the BMU has been found, the codebook vectors are updated. The BMU itself as well as its topological neighbours are moved closer to the input vector in the input space, i.e. the input vector attracts them. The magnitude of the attraction is governed by the learning rate. As the learning proceeds and new input vectors are given to the map, the learning rate gradually decreases to zero according to the specified learning rate function type. Along with the learning rate, the neighbourhood radius decreases as well.
The update rule for the reference vector of unit $i$ is the following:

$\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{ci}(t)\,[\mathbf{x}(t) - \mathbf{m}_i(t)]$,

where $t$ denotes time, $\mathbf{x}(t)$ is the input vector drawn at time $t$, and $h_{ci}(t)$ is the neighbourhood function: a decreasing function of time and of the grid distance between unit $i$ and the BMU $c$, into which the learning rate is incorporated.
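Putting steps 1 and 2 together, the following Python sketch trains a SOM on a rectangular grid. It assumes a Gaussian neighbourhood function and a linearly decaying learning rate and radius; all names and default values are illustrative, not prescribed by the algorithm.

import numpy as np

def train_som(data, rows, cols, n_steps, lr0=0.5, radius0=None, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    codebook = rng.random((rows * cols, dim))            # reference vectors
    # grid coordinates of every map unit, used for the neighbourhood relation
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    radius0 = radius0 or max(rows, cols) / 2.0

    for t in range(n_steps):
        frac = t / n_steps
        lr = lr0 * (1.0 - frac)                          # decaying learning rate
        radius = radius0 * (1.0 - frac) + 1e-3           # shrinking neighbourhood

        x = data[rng.integers(len(data))]                # step 1: draw a sample
        bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))   # and find its BMU

        # step 2: pull the BMU and its map neighbours towards the sample
        d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)     # squared grid distances
        h = lr * np.exp(-d2 / (2.0 * radius ** 2))       # Gaussian neighbourhood
        codebook += h[:, None] * (x - codebook)

    return codebook, grid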
Steps 1 and 2 together constitute a single training step, and they are repeated until the training ends. The number of training steps must be fixed prior to training the SOM, because the rate of convergence of the neighbourhood function and the learning rate is calculated accordingly.
After the training is over, the map should be topologically ordered. This means that n topologically close (using some distance measure e.g. Euclidean) input data vectors map to n adjacent map neurons or even to the same single neuron.
Map quality measures
After the SOM has been trained, it is important to know whether it has properly adapted itself to the training data. Because it is obvious that one optimal map for the given input data must exist, several map quality measures have been proposed. Usually, the quality of the SOM is evaluated based on the mapping precision and the topology preservation.
Mapping precision
The mapping precision measure describes how accurately the neurons respond to the given data set. For example, if the reference vector of the BMU calculated for a given testing vector $\mathbf{x}_i$ is exactly the same as $\mathbf{x}_i$, the precision error is 0. Normally, the number of data vectors exceeds the number of neurons, so the precision error is always greater than 0.
A common measure that calculates the precision of the mapping is the average quantization error over the entire data set:

$E_q = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{m}_{c(\mathbf{x}_i)}\|$,

where $N$ is the number of data vectors and $\mathbf{m}_{c(\mathbf{x}_i)}$ is the reference vector of the BMU of $\mathbf{x}_i$.
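Assuming the data and the codebook are stored as NumPy arrays, as in the training sketch above, the quantization error can be computed as follows.

import numpy as np

def quantization_error(data, codebook):
    # Distance from every sample to every reference vector, then the
    # distance to the BMU (row-wise minimum), averaged over the data set.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()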
Topology preservation
The topology preservation measure describes how well the SOM preserves the topology of the studied data set. Unlike the mapping precision measure, it considers the structure of the map. For a strangely twisted map, the topographic error is large even if the mapping precision error is small.
A simple method for calculating the topographic error is

$E_t = \frac{1}{N}\sum_{k=1}^{N} u(\mathbf{x}_k)$,

where $u(\mathbf{x}_k)$ is 1 if the first and second BMUs of $\mathbf{x}_k$ are not adjacent map units, and 0 otherwise.
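A corresponding sketch is given below. It assumes a rectangular grid with unit spacing, so that only the four nearest grid units count as neighbours; the exact adjacency test depends on the chosen topology.

import numpy as np

def topographic_error(data, codebook, grid):
    # grid holds the (row, col) coordinates of every map unit, as in the
    # training sketch above. A sample counts as an error if its two best
    # matching units are not immediate neighbours on the grid.
    errors = 0
    for x in data:
        order = np.argsort(np.linalg.norm(codebook - x, axis=1))
        first, second = order[0], order[1]
        if np.linalg.norm(grid[first] - grid[second]) > 1.0:
            errors += 1
    return errors / len(data)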
Visualizing the SOM
The SOM is easy to visualize, and over the years several visualization techniques have been devised. Due to the inherently intricate nature of the SOM, however, none of the visualization methods devised so far has proven superior to the others. At times, several different visualizations of the same SOM are needed to fully see the state of the map. From this it can be concluded that every existing visualization method has its merits and demerits. The rest of this section provides an overview of three of them: two visualize the reference vectors, while the remaining one visualizes data histograms.
Visualizing the reference vectors
The Sammon mapping represents the SOM in much the same way as Figure 6 does. The idea is that the originally high-dimensional reference vector space is compressed into two dimensions, making visualization of the data possible. The Sammon mapping tries to find an optimal non-linear projection of the high-dimensional data, such that the vectors projected onto a two-dimensional surface retain the same relative mutual Euclidean distances that they had in the higher-dimensional space.
The unified distance matrix, or u-matrix, is perhaps the most popular method of displaying SOMs. It represents the map as a regular grid of neurons, as illustrated in Figure 9. The size and topology of the map can readily be observed from the picture, where each element represents a neuron. When generating a u-matrix, a distance matrix between the reference vectors of adjacent neurons of the two-dimensional map is first formed. Then some representation for the matrix is selected, e.g. a grey-level image. The colours in the figure have been selected so that the lighter the colour between two neurons, the smaller the relative distance between them.
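A minimal sketch of the computation is shown below, assuming a rectangular topology in which each unit is compared with its four immediate grid neighbours (a hexagonal map would use six).

import numpy as np

def u_matrix(codebook, rows, cols):
    # Average distance from each unit's reference vector to those of its
    # immediate grid neighbours; small values mean tightly packed units.
    cb = codebook.reshape(rows, cols, -1)
    um = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(cb[r, c] - cb[rr, cc]))
            um[r, c] = np.mean(dists)
    return um    # e.g. shown as a grey-level image with matplotlib's imshow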
Figure 6: The u-matrix visualization of the SOM. The map size is 12 x 8 neurons and the topology is hexagonal. The map has also been labelled.
Visualizing the data histograms
The purpose of a data histogram is to show how input data vectors are clustered by the SOM. The data histogram visualization shows how many vectors belong to a cluster defined by each neuron. The histogram can be generated by using a trained SOM and a data set. The BMU is searched for each data vector, after which a hit counter associated with the BMU is increased by one. There are several possible ways of visualizing the resulting histograms, one of which is shown in Figure 7.
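The hit counting itself is straightforward. A sketch, assuming the same array layout as in the earlier examples:

import numpy as np

def hit_histogram(data, codebook):
    # For every data vector, find its BMU and increase that unit's counter.
    hits = np.zeros(len(codebook), dtype=int)
    for x in data:
        bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
        hits[bmu] += 1
    return hits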
Figure 7: A 3D data histogram visualization. The map size is 12 x 8 neurons and the topology is hexagonal.
Hierarchical SOMs
The input of one SOM can be taken from the output of another. The input can also be formed of several output vectors from many SOMs. This kind of structure is referred to as a hierarchical SOM. The topmost SOM in the structure (see Figure 8) clusters the outputs and provides a means of monitoring the operations of the underlying SOMs. As the SOM hierarchy is traversed upwards, the information becomes more and more abstract.
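One possible way to realize this, purely as an illustration (the text does not prescribe how the lower-level outputs are encoded), is to use the BMU grid coordinates of each lower-level SOM as the input vector of the topmost SOM:

import numpy as np

def hierarchical_input(record_parts, lower_maps):
    # record_parts: one sub-vector per lower-level SOM.
    # lower_maps: (codebook, grid) pairs, as returned by the training sketch.
    coords = []
    for part, (codebook, grid) in zip(record_parts, lower_maps):
        bmu = np.argmin(np.linalg.norm(codebook - part, axis=1))
        coords.extend(grid[bmu])       # (row, col) of the winning unit
    return np.array(coords)            # becomes the input of the topmost SOM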
Figure 8: Hierarchical SOM with the BMUs indicated. The topmost SOM takes input from the outputs of the three lower level SOMs.
Applications of SOM
The most important practical applications of SOMs are in exploratory data analysis, pattern recognition, speech analysis, robotics, industrial and medical diagnostics, instrumentation and control. The SOM can also be applied to hundreds of other tasks where large amounts of unclassified data are available.