Artificial neural networks (ANNs) have been applied to many areas and problems, such as classification, computer vision, and speech recognition. How well an ANN performs depends on its hyperparameter configuration, the dataset used, and the network topology. The network topology, also called the architecture or structure, is a representation of how the different layers and neurons of an ANN are connected (Fiesler & Beale, 1996). The design of an ANN's network topology greatly affects its success (Augasta & Kathirvalavakumar, 2013).
The main components of the network topology of an ANN are the neurons, connections, and layers. The neurons, also referred to as nodes or cells, are perceptrons where the calculations happen. The connections, or weights, represent the synapses connecting neurons to one another. An ANN is divided into three parts or layers: the input, hidden, and output layers. Layers are connected in sequence, with the outputs of one layer used as inputs for the next. The number of neurons in the input layer equals the size of the input data, and the number of neurons in the output layer equals the number of outputs in the problem. The number of hidden layers, and of hidden neurons per layer, is chosen by the user depending on the purpose of the ANN or study.
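The components above can be made concrete with a minimal sketch. The layer sizes below (4 inputs, one hidden layer of 8 neurons, 3 outputs) and the tanh activation are illustrative assumptions, not values from the text; the point is only that each layer's output feeds the next layer as input.

```python
import numpy as np

# Illustrative topology: 4 input neurons, 8 hidden neurons, 3 output neurons.
layer_sizes = [4, 8, 3]

rng = np.random.default_rng(0)
# One weight matrix (the "connections") per pair of consecutive layers.
weights = [rng.standard_normal((m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Propagate an input through the layers: each layer's output
    becomes the next layer's input."""
    for w in weights:
        x = np.tanh(x @ w)  # tanh as an illustrative activation
    return x

out = forward(rng.standard_normal(4), weights)
print(out.shape)  # one value per output neuron
```

Note that the input and output sizes are fixed by the problem, while the hidden size (8 here) is the user's choice, which is exactly the design question discussed next.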
Configuring the number of hidden layers and the number of neurons per layer is the main issue in designing the topology of an ANN. In terms of deciding how many layers to use, Lippmann (1987) showed that two hidden layers were enough to obtain the desired shape of a classification region. Another study states that one hidden layer can approximate any function one needs (Goodfellow, Bengio, & Courville, 2016). According to Reed and Marks II (1999), even if most problems can be solved using a single large hidden layer, it can be more efficient to use more layers. However, these studies do not state the number of neurons needed per layer.
Regarding the optimal number of hidden neurons, there is no standard, definitive way to determine it. If the network has too many hidden neurons, it may have a high generalization error from overfitting to the training data (Augasta & Kathirvalavakumar, 2013). It can also cause the network to have redundant neurons, leading to an ANN with high computation cost and memory wastage (Denil, Shakibi, Dinh, De Freitas, et al., 2013; Han, Pool, Tran, & Dally, 2015). This is especially true for deep neural networks (DNNs), which are ANNs with more than one hidden layer.
A network with too few neurons, on the other hand, can end up underfitting and failing to find an optimal solution. The training time needed to reach an optimal solution, or at least an acceptable state, is also increased because the network does not have enough processing power (Reed, 1993).
Different approaches have been theorized and made to aid in designing ANNs. First, some books and articles have proposed "rules of thumb" for making the topology. An example is that the total number of hidden neurons should fall between the sizes of the input and output layers (Blum, 1992). These rules, however, cannot be used in most circumstances because the training dataset size, the complexity of the data to be learned, and the amount of noise in the targets are not considered. Second, a simple approach would be trial and error, but it yields sub-optimal designs and takes time (Stathakis, 2009). However, with experimentation, an understanding of the problem, and existing models and literature, a reasonable number of hidden nodes and hidden layers to start the trial and error with can be identified intuitively. Lastly, more dynamic approaches are pruning and constructive algorithms.
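As a toy illustration of the rule of thumb attributed to Blum (1992), the layer sizes below are assumed for the example, and taking the midpoint of the range is only one of many choices consistent with the rule:

```python
# Assumed, illustrative problem sizes (not from the text).
n_inputs, n_outputs = 64, 10

# Blum's rule of thumb: the total hidden-neuron count should fall
# somewhere between the output and input layer sizes.
lo, hi = sorted((n_outputs, n_inputs))
n_hidden = (lo + hi) // 2  # e.g. the midpoint of the admissible range
print(lo <= n_hidden <= hi)  # True
```

As the text notes, a rule like this ignores the dataset size, data complexity, and target noise, so it serves only as a starting point.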
Pruning and constructive algorithms adjust the network structure by removing or adding connections, neurons, or layers to optimize the network (Stathakis, 2009). The objective of pruning is to reduce the size of the network without losing the network's capabilities, or even to make it better (LeCun, Denker, & Solla, 1990). The standard approach is to first train an over-parameterized network, that is, a network with more hidden neurons than necessary, for a certain number of epochs. One can also use a pre-trained model made by others instead of training a new one. Next, the trained network is pruned based on the chosen pruning criterion or method. The resulting network is then fine-tuned by retraining it to adapt to and improve with its new structure (Liu, Sun, Zhou, Huang, & Darrell, 2018).
However, pruning still faces the issue of designing the initial network and configuring the hyperparameters to be used for training. There is also the problem of choosing the pruning method, and of deciding whether the fine-tuning phase will use the same configurations as the training phase. This may cause finding a reasonable network to take much longer than creating a basic ANN. This is especially true if iterative pruning is done, in which the network alternates between the pruning and fine-tuning phases until a stopping condition is met. The aim is to find and create the smallest possible network whose capabilities are preserved or even improved.
One of the first studies on pruning ANNs, by LeCun, Denker, and Solla (1990), created a technique called Optimal Brain Damage (OBD). The study states that parameters have a level of saliency, which is how much they affect the training error. To prune the network, parameters with small saliency are removed; when deleted, these parameters have little to no effect on the resulting training error. To compute the saliencies, the second derivatives of the objective function with respect to the weights are used. Iterative pruning is also utilized.
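A minimal sketch of the OBD saliency is possible for a model where the second derivatives are available in closed form. The least-squares setting, the trained weight values, and the pruning of a single parameter below are all illustrative assumptions; OBD's saliency for parameter $k$ is $s_k = h_{kk} w_k^2 / 2$, using only the diagonal of the Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Illustrative "trained" weights; two are near zero.
w = np.array([3.0, 0.01, -2.0, 0.5, 0.02])

# For L(w) = ||Xw - y||^2 / (2n), the Hessian is X.T @ X / n;
# OBD keeps only its diagonal entries h_kk.
h_diag = np.einsum('ij,ij->j', X, X) / len(X)

# OBD saliency: the estimated increase in training error from
# deleting parameter k, to second order.
saliency = h_diag * w**2 / 2

# Delete the parameter whose removal matters least.
k = int(np.argmin(saliency))
print(k)
```

Note how the saliency weighs a parameter's magnitude by its second derivative, so a small weight sitting in a high-curvature direction can still be considered important.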
The problem with OBD is that finding the saliencies of the weights is computationally expensive, especially on large networks. Since iterative pruning is also used, this computation is done multiple times. There is also the issue of choosing which parameters to delete: how many low-saliency parameters to remove, or whether a threshold will be used. Other studies have built on this idea of finding the significance of the weights with respect to the network by speeding up or improving the calculations needed to determine the parameters' importance.
Instead of requiring extra calculations to find the importance of the parameters, the most basic method is magnitude-based pruning (MBP). MBP assumes that weights with small values are irrelevant to the output of the network (Hagiwara, 1994). The only hyperparameter to set is the threshold: any weight with an absolute value less than the threshold is marked irrelevant and pruned. This makes MBP simple to implement and quick to run.
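MBP can be sketched in a few lines. The weight values and the threshold of 0.1 below are illustrative assumptions; the criterion itself (keep a weight only if its absolute value meets the threshold) is as described above.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out (prune) every weight whose absolute value is below
    the user-chosen threshold; return the pruned weights and the mask."""
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask

# Illustrative weight vector: two entries are small in magnitude.
w = np.array([0.8, -0.03, 1.2, 0.05, -0.9])
pruned, mask = magnitude_prune(w, threshold=0.1)
print(int(mask.sum()))  # number of weights kept
```

Unlike OBD, no second derivatives are needed, which is what makes MBP cheap, at the cost of ignoring how sensitive the error actually is to each weight.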