Uncertainty Estimation in Deep Learning Networks

TEAM MEMBERS:

Varun Asthana
Saumil Shah
Sneha Nayak
Revati Naik

Inspired by the paper by Loquercio et al. [1].

HOW DO WE DEFINE UNCERTAINTY?

Uncertainty in deep neural networks usually stems from inherent noise in sensor systems or from the approximations made by the model. These uncertainties are generally divided into data uncertainty and model uncertainty. Estimating them correctly matters because discrepancies in the data or in the model approximation can lead to large failures and can even be dangerous: a network that ignores them can give wrong predictions with a very high confidence value. Hence, there is a need to better estimate these uncertainties.

We use this uncertainty estimate to evaluate how reliably our model predicts the class label of a given image. The benefit in our case is that a class whose prediction has high uncertainty receives a lower confidence score.

This can be achieved more robustly by combining a Bayesian belief network with Monte Carlo sampling [1], which has been shown to predict uncertainties better than other traditional approaches. We use assumed density filtering (ADF) to propagate activation uncertainties through the network in a single pass.

We define the posterior probability distribution as p(y|x), where y are the output predictions and x the input samples. Following [1], we define the total prediction uncertainty as the variance of y under this distribution,

$$\sigma_{tot} = \mathrm{Var}_{p(y|x)}(y)$$

which accounts for both model and data uncertainty and can be estimated using an approximation of p(y|x).

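To account for the uncertainty in the data (which can arise due to noise in sensor observations), the neural network processes an input z, which is just a noisy version of the actual input x. With the sensor noise described by the noise characteristic v, the mean and the noise uncertainty of the activations at layer l can be written, following [1], as

$$\mu^{(l)} = \mathbb{E}_{q(z^{(l-1)})}\left[f^{(l)}(z^{(l-1)})\right], \qquad v^{(l)} = \mathbb{V}_{q(z^{(l-1)})}\left[f^{(l)}(z^{(l-1)})\right]$$

where f^(l) is the l-th layer, q(z^(l-1)) is the Gaussian approximation of the activations entering that layer, and E and V are the first and second moment of the distribution.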

This data uncertainty can be computed by forward propagating the sensor noise through the ADF layers, which generate both the output predictions µ(l) and their respective uncertainties v(l).

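Model uncertainty refers to the confidence a model has about its prediction [1], and hence depends on the distribution of the dataset D = {X, Y}, where X and Y are the training samples and labels respectively. The corresponding weight distribution is p(ω|X, Y), where ω represents the weights of the neural network. To approximate this distribution, we use a Monte Carlo approach with dropout enabled at test time. Under this assumption, the model uncertainty is the variance of T Monte Carlo samples, given by [1],

$$\sigma_{model} = \frac{1}{T}\sum_{t=1}^{T}\left(y_t - \bar{y}\right)^2, \qquad \bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$$

where y_t is the network output for the t-th sampled set of weights.

Then the total variance can be given by [1],

$$\sigma_{tot} = \frac{1}{T}\sum_{t=1}^{T} v_t^{(L)} + \frac{1}{T}\sum_{t=1}^{T}\left(\mu_t^{(L)} - \bar{\mu}\right)^2$$

where µ_t^(L) and v_t^(L) are the output mean (playing the role of y_t) and the output variance of the ADF network for the t-th sample, and µ̄ is the average of the µ_t^(L).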

Fig 1. Given an input sample x, associated with noise v(0), and a trained neural network, our framework computes the confidence associated to the network output. In order to do so, it first transforms the given network into a Bayesian belief network. Then, it uses an ensemble of T such networks, created by enabling dropout at test time, to generate the final prediction µ and uncertainty σtot [1]
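
The way the T stochastic passes are combined can be sketched in a few lines. This is a minimal illustration in plain Python, assuming each pass has already produced an output mean and an ADF variance; the function name `total_uncertainty` and the sample numbers are hypothetical and not taken from the project code.

```python
def total_uncertainty(means, variances):
    """Combine T stochastic forward passes into a prediction and its variance.

    means[t] and variances[t] are the output mean and the ADF output variance
    of pass t (scalars here; per-class vectors work the same way elementwise).
    """
    T = len(means)
    mu = sum(means) / T                                # final prediction
    data_var = sum(variances) / T                      # average ADF variance
    model_var = sum((m - mu) ** 2 for m in means) / T  # spread across passes
    return mu, data_var + model_var


# Three hypothetical dropout passes for a single output unit.
mu, var_tot = total_uncertainty([0.70, 0.80, 0.90], [0.01, 0.02, 0.03])
print(mu, var_tot)
```

The first term grows with the propagated sensor noise, the second with the disagreement between the dropout passes, so either source alone is enough to lower the confidence.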

DEEP LEARNING MODEL DESIGN AND TRAINING

For this implementation of ADF, the input to the network is a tuple of (data, variance), since the convolution, batch normalization, ReLU activation and linear (fully connected) operations must be performed on both the data and its variance simultaneously. The two interact with each other in the ReLU, MaxPool2d, LeakyReLU and Softmax layers, and each layer returns the modified tuple.

The layers that were custom defined to meet the requirements of ADF include:

AvgPool2d
MaxPool2d
ReLU
LeakyReLU
Dropout
Conv2d
ConvTranspose2d
Linear
BatchNorm2d
Softmax

All of the above layers accept a tuple of data which will be processed together.

Below are the modified equations used for the custom designed layers of the network. Any layer can be converted into an uncertainty propagation layer by using these equations.

Each layer maps the mean and variance of its input to the mean and variance of its output.

For Linear Layers:
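
A minimal sketch of uncertainty propagation through a linear layer, assuming the inputs are independent Gaussians (diagonal covariance), which is the usual ADF simplification. The mean goes through the ordinary affine map, while the variance is multiplied by the squared weights and is unaffected by the bias. The function name `adf_linear` and the plain-list representation are illustrative choices, not the project's tensor implementation.

```python
def adf_linear(mean, var, weight, bias):
    """Propagate a (mean, variance) pair through y = W x + b."""
    out_mean = [
        sum(w * m for w, m in zip(row, mean)) + b
        for row, b in zip(weight, bias)
    ]
    # Var(sum_i w_i x_i) = sum_i w_i^2 Var(x_i) for independent x_i.
    out_var = [sum(w * w * v for w, v in zip(row, var)) for row in weight]
    return out_mean, out_var


weight = [[1.0, 2.0], [0.0, -1.0]]
bias = [0.5, 0.0]
lin_mean, lin_var = adf_linear([1.0, 2.0], [0.1, 0.2], weight, bias)
print(lin_mean, lin_var)
```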

Average Pooling:
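
A minimal sketch of average pooling on a (mean, variance) pair, again assuming independent inputs. The mean of a window is the average of the input means. The variance of an average of n independent values is the sum of their variances divided by n squared, that is, the averaged variance divided by n. The function `adf_avg_pool2d` below uses non-overlapping k x k windows on nested lists and is only an illustration.

```python
def adf_avg_pool2d(mean, var, k):
    """Average-pool a 2D mean map and its variance map with k x k windows."""
    n = k * k
    out_mean, out_var = [], []
    for i in range(0, len(mean) - k + 1, k):
        mean_row, var_row = [], []
        for j in range(0, len(mean[0]) - k + 1, k):
            cells = [(i + a, j + b) for a in range(k) for b in range(k)]
            mean_row.append(sum(mean[r][c] for r, c in cells) / n)
            # Sum of variances scaled by 1/n^2 (variance of an average).
            var_row.append(sum(var[r][c] for r, c in cells) / (n * n))
        out_mean.append(mean_row)
        out_var.append(var_row)
    return out_mean, out_var


pool_mean, pool_var = adf_avg_pool2d(
    [[1.0, 2.0], [3.0, 4.0]], [[0.4, 0.4], [0.4, 0.4]], k=2
)
print(pool_mean, pool_var)
```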

ReLU Activation Layer:
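
A minimal sketch of the moments of a rectified Gaussian, which is how ReLU can be applied to a (mean, variance) pair: this is the layer where the 'data' and the 'variance' interact. With a = mean / std, the output mean is mean * Phi(a) + std * phi(a), where phi and Phi are the standard normal density and cumulative distribution. The helper names and the `eps` floor on the variance are assumptions made for this illustration.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def adf_relu(mean, var, eps=1e-6):
    """Mean and variance of max(0, x) for x ~ N(mean, var)."""
    var = max(var, eps)  # keep the division below well defined
    std = math.sqrt(var)
    a = mean / std
    pdf, cdf = normal_pdf(a), normal_cdf(a)
    out_mean = mean * cdf + std * pdf
    second_moment = (mean * mean + var) * cdf + mean * std * pdf
    out_var = max(second_moment - out_mean * out_mean, 0.0)
    return out_mean, out_var


relu_mean, relu_var = adf_relu(0.0, 1.0)
print(relu_mean, relu_var)
```

For an input far on the positive side, such as `adf_relu(10.0, 1.0)`, the pair passes through almost unchanged, as expected from an activation that is the identity there.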

The weights obtained from the pre-trained model are used to start training the ADF model from an intermediate epoch.

WHY USE A PRE-TRAINED MODEL?

ADF is not very stable numerically, so when we train a network with ADF layers from the first epoch we encounter NaN values. This is due to the heteroscedastic loss. To address this problem, we initialize the network weights from the best pre-trained model, which helps us avoid the NaN values in the ADF model. By a pre-trained model we simply mean a model that already classifies our data with a decent level of accuracy (more than 70%). This gives the ADF layers a base from which to increase the classification accuracy.
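
To see why a heteroscedastic loss is fragile early in training, consider a minimal sketch that assumes the common Gaussian negative log-likelihood form, 0.5 * log(var) + 0.5 * (target - mean)^2 / var. The exact loss used in the project is not reproduced here, so the function `gaussian_nll` and its `eps` floor are illustrative assumptions only.

```python
import math


def gaussian_nll(target, mean, var, eps=1e-6):
    """Per-output heteroscedastic loss (constant term omitted)."""
    var = max(var, eps)  # without this floor, var = 0 is undefined
    return 0.5 * math.log(var) + 0.5 * (target - mean) ** 2 / var


# A good prediction with a sensible variance gives a small loss ...
print(gaussian_nll(1.0, 1.0, 1.0))
# ... but a wrong prediction with a tiny variance is penalized enormously.
print(gaussian_nll(0.0, 1.0, 1e-4))
```

An untrained network makes large errors while its predicted variances are still arbitrary, so the second case is common at the first epoch. Starting from weights that already classify reasonably well keeps the squared error term small.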

EXPERIMENTS AND RESULTS

The CIFAR-10 dataset was used to train and evaluate our framework. The framework uses the ResNet18 architecture, and we first train this model without the ADF layers to obtain weights that reach about 60-75% accuracy. Starting from these pre-trained weights, we then train the ResNet18 model with the ADF layers. This model uses the custom designed layers written from the above equations, which take in both the mean (data) and the variance. The network outputs the class of the object along with a confidence score that gives the system an idea of how likely the classification is to be correct.

Training with ADF layers:

We use the weights from a pre-trained ResNet18 model and resume training the ResNet18 model with ADF layers from epoch 321. At this point the network accuracy (weight accuracy) is 93.67%.

Evaluation with ADF layers

We evaluate the ResNet18 ADF model over the testing data. The parameters for evaluation are as shown in the image.

Sample Testing

The image below shows a good example of the use of this model. This image of a frog was among the top 5% of errors. Without the proposed model, a normal ResNet18 predicts it as a dog with high confidence, but ResNet18 with ADF layers assigns every class a probability below 0.4, which indicates that the model is quite uncertain about what the image is.

Model Output on Test Data

The final evaluation reports an accuracy of 94.67%.

Brier Score: The Brier score is the sum of squared errors of the class-wise probability estimates with respect to the true labels, averaged over the test samples. It reflects both how accurate the model is and how "confidently" accurate it is, and lower values are better. In this case, we get a Brier Score of 0.00835.
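
A minimal sketch of the Brier score for a multi-class classifier, assuming predicted probabilities and integer labels. The function name `brier_score` and the toy predictions are hypothetical. Some implementations additionally divide by the number of classes, which this sketch does not do, so absolute values are only comparable under the same convention.

```python
def brier_score(probs, labels):
    """Mean over samples of the squared error against the one-hot label."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum(
            (p_c - (1.0 if c == y else 0.0)) ** 2 for c, p_c in enumerate(p)
        )
    return total / len(probs)


# Two hypothetical 3-class predictions with true labels 0 and 1.
score = brier_score([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]], [0, 1])
print(score)
```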

Negative log-likelihood (NLL): The NLL is low when the network assigns a high probability to the correct class and grows as that probability shrinks. In this case, the log-likelihood is high (equivalently, the NLL is low), which shows that a high confidence score has been assigned to the correct class.
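
A minimal sketch of how the NLL can be computed from predicted class probabilities, assuming integer labels. The function name `nll` and the `eps` guard against log(0) are illustrative choices.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    return -sum(
        math.log(max(p[y], eps)) for p, y in zip(probs, labels)
    ) / len(labels)


confident_and_right = nll([[0.9, 0.1]], [0])
confident_and_wrong = nll([[0.1, 0.9]], [0])
print(confident_and_right, confident_and_wrong)
```

The second value is much larger than the first: putting most of the probability mass on a wrong class is what the metric punishes.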

OBSERVATIONS

Conv2d: As per the methodology defined above, once the convolution is performed on the data with the learned weights (for the particular iteration in the epoch), the convolution on the variance has to be done with the square of those same weights. Since the initial value used for the variance is a small positive number, the variances always remain positive in the convolution operation, as they are computed by multiplication and addition with squared weights. The layer was initialized with random weights (which were to be learned over time).

By virtue of this operation, the variance keeps growing if the weights are sufficiently large in any one of the convolution layers. And since the 'data' and the 'variance' interact with each other later in the ReLU layers, the large values propagate through the entire network. We observed that after a few conv layers the values grow so large that they overflow the data type of the 'data' and the 'variance'.
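
The growth is easy to reproduce with a minimal single-channel sketch. It assumes a plain "valid" sliding-window convolution on nested lists, with hypothetical helper names `conv2d_valid` and `adf_conv2d`; the real layers work on multi-channel tensors. The weights of 2.0 are chosen only to make the effect visible.

```python
def conv2d_valid(x, kernel):
    """Sliding-window sum of products, no padding, stride 1."""
    k = len(kernel)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [
        [
            sum(kernel[a][b] * x[i + a][j + b]
                for a in range(k) for b in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def adf_conv2d(mean, var, kernel):
    """Convolve the mean with the weights, the variance with their squares."""
    squared = [[w * w for w in row] for row in kernel]
    return conv2d_valid(mean, kernel), conv2d_valid(var, squared)


mean = [[1.0] * 5 for _ in range(5)]
var = [[0.001] * 5 for _ in range(5)]
kernel = [[2.0] * 3 for _ in range(3)]

mean1, var1 = adf_conv2d(mean, var, kernel)    # variance becomes 0.036
mean2, var2 = adf_conv2d(mean1, var1, kernel)  # variance becomes 1.296
print(var1[0][0], var2[0][0])
```

Each layer multiplies the variance by the sum of squared weights in the window (36 here), so the growth is geometric in the number of layers.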

To counter this, we first experimented with very small initial weights using the Xavier initializer. This reduced the value range of the learned weights to some extent, but did not impose any explicit bound on the weight values, and thus the 'data' and the 'variance' were still becoming NaN. To constrain the upper and lower bounds of the kernel weights, we implemented a kernel constraint of max_norm(1) for each convolution layer. With this, all the learned parameters stayed in the range [-1, 1]. When such weights were squared and used as filters for the 'variance', this kept a check on the maximum value of the variance.
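
A minimal sketch of what a max-norm constraint does to a kernel after a weight update: if the L2 norm of the kernel exceeds the limit, the kernel is rescaled onto the limit, otherwise it is left essentially unchanged. For simplicity the norm is taken over the whole kernel, whereas framework implementations typically apply it along a chosen axis; the function name `max_norm` here is a stand-in for illustration, not a framework call.

```python
import math


def max_norm(kernel, max_value=1.0, eps=1e-7):
    """Rescale a 2D kernel so that its L2 norm does not exceed max_value."""
    norm = math.sqrt(sum(w * w for row in kernel for w in row))
    scale = min(norm, max_value) / (norm + eps)
    return [[w * scale for w in row] for row in kernel]


clipped = max_norm([[3.0, 4.0]])  # norm 5 is rescaled to norm 1
print(clipped)
```

Since the norm is at most 1, every individual weight lies in [-1, 1], and the sum of squared weights used for the variance is also bounded by 1.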

MaxPool2d: In the MaxPool2d layer, we used a formulation that computes the square root of the variance and uses this value as the denominator of a division. While debugging, we found that the results became NaN whenever the 'variance' value was zero, because the division became undefined. To resolve this issue, a very small value of 0.001 was added to the newly computed 'variance' in each and every layer. With a few other small fixes, the ADF layers were ready to be used for model validation. We tested our layers on CIFAR-10 data for classification into 10 classes.
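
The division in question can be illustrated with the standard moment-matching formula for the maximum of two independent Gaussians, which a max-pooling window can apply pairwise. This is a sketch under that assumption, with a hypothetical function name `adf_max`; where exactly the project adds its 0.001 is not shown in this report, so adding it to both input variances is a choice made here for consistency of the moments.

```python
import math


def adf_max(mean_a, var_a, mean_b, var_b, eps=1e-3):
    """Mean and variance of max(a, b) for independent Gaussians a and b."""
    var_a, var_b = var_a + eps, var_b + eps  # avoids a zero denominator
    theta = math.sqrt(var_a + var_b)
    alpha = (mean_a - mean_b) / theta        # undefined if theta were 0
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    out_mean = mean_a * cdf + mean_b * (1.0 - cdf) + theta * pdf
    second_moment = (
        (mean_a ** 2 + var_a) * cdf
        + (mean_b ** 2 + var_b) * (1.0 - cdf)
        + (mean_a + mean_b) * theta * pdf
    )
    return out_mean, max(second_moment - out_mean ** 2, 0.0)


# Two identical inputs with zero variance: without eps this divides 0 by 0.
tie_mean, tie_var = adf_max(1.0, 0.0, 1.0, 0.0)
print(tie_mean, tie_var)
```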

REFERENCES

[1] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza, "A General Framework for Uncertainty Estimation in Deep Learning", IEEE preprint version, January 2020.

[2] Jochen Gast and Stefan Roth, "Lightweight Probabilistic Deep Networks", May 2018.

[3] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?", 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.

[4] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video", in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6612–6619.

Please refer to the GitHub repository for the software implementation: github.com/revati-naik/dqn_uav.git