Detecting and identifying traffic signs plays an important role in driver-assistance systems and autonomous vehicles. These systems can perform certain tasks based on the detection of a particular traffic sign: one task could be simply to alert the driver, while another could be to actuate a response so that unfortunate consequences are avoided. The problem therefore involves localizing the road signs and finding their size, followed by detection and categorization of each sign into a specific sub-category. Road signs have distinctive shapes and colors that are easily visible to the human eye; for image processing, however, several other factors such as motion blur, segmentation, illumination changes, and color deterioration come into play and need to be addressed. We intend to use a deep learning architecture, Convolutional Neural Networks, to identify these traffic signs. Our implementation uses the TensorFlow library, the Keras API, the pickle module, and the GTSRB (German Traffic Sign Recognition Benchmark) dataset. We chose the GTSRB dataset because it is highly reliable and provides unique physical traffic sign instances. TensorFlow is used for numerical computation via dataflow graphs. Keras is a model-level library that provides high-level building blocks for developing deep learning models; since Keras does not itself handle low-level operations such as tensor products and convolutions, it relies on a specialized, well-optimized tensor manipulation library to do so, which serves as its "backend engine". We use the Keras and TensorFlow deep learning libraries with the CUDA and cuDNN libraries for GPU-accelerated training to implement this network model.
Extensive research has been done in the field of traffic sign detection, including the evaluation of datasets from different regions for classification and tracking.
Various models have been used for feature extraction, among them SIFT, GLOH, and HOG. SIFT (Scale-Invariant Feature Transform) detects feature points and builds descriptors using a Difference-of-Gaussians (DoG) function. GLOH operates much like SIFT but replaces SIFT's Cartesian location grid with a log-polar grid, and generally performs better than SIFT. HOG, on the other hand, performs detection using the gradient of the color image together with normalized, weighted histograms of gradient orientations.
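As an illustration, HOG features can be computed with scikit-image; this is a minimal sketch, and the library choice and parameter values are our assumptions, not those of the surveyed works:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

# Hypothetical 32x32 RGB traffic-sign image; any image array works.
image = np.random.rand(32, 32, 3)

# HOG bins image gradients by orientation into weighted histograms over
# small cells, then normalizes them over blocks of cells.
features = hog(
    rgb2gray(image),
    orientations=9,           # gradient-orientation bins per histogram
    pixels_per_cell=(8, 8),   # cell size for each histogram
    cells_per_block=(2, 2),   # cells per normalization block
    block_norm='L2-Hys',
)
print(features.shape)  # flat feature vector for a downstream classifier
```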
Before the Convolutional Neural Network (CNN) technique came into the picture and was widely adopted, several machine learning algorithms were used for traffic sign classification, among them SVMs, LDA, and sparse representations. An SVM (Support Vector Machine) is a classifier defined by a separating hyperplane, with the objective of separating a set of data into different classes. For nonlinear classification, SVMs offer two tuning parameters, the kernel and the regularization strength, which improve accuracy. Polynomial and exponential kernels compute the separating boundary in a higher-dimensional space, a technique known as the kernel trick. LDA, on the other hand, estimates class probabilities for new inputs using Bayes' theorem; it assumes the data is multivariate Gaussian with the same covariance across classes.
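A minimal scikit-learn sketch of such a classical pipeline, assuming flattened descriptors (e.g. HOG vectors) as input; the data here is random placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder feature matrix: one row of descriptors per sign image,
# with integer class labels drawn from 43 classes.
X = np.random.rand(500, 324)
y = np.random.randint(0, 43, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The RBF kernel realizes the "kernel trick": the separating hyperplane is
# effectively computed in a higher-dimensional space; C is the
# regularization parameter.
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```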
These machine learning approaches lagged in certain aspects, so CNNs were adopted once they surpassed the performance of the classical algorithms.
A CNN is a multi-stage neural network architecture. An image is passed through a series of operations involving convolutional, nonlinear (activation), pooling (downsampling), and fully connected layers; for the nonlinear step, ReLU is a better choice than tanh.
After the data has passed through all the layers, the output of the network is fed to a classifier and the classification accuracy is optimized.
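The nonlinearity and pooling steps are simple enough to show directly; the toy NumPy snippet below (purely illustrative, not the model code) applies ReLU and 2×2 max pooling to a small activation map:

```python
import numpy as np

# Toy 4x4 activation map produced by a convolution.
a = np.array([[ 1., -2.,  3., -4.],
              [-1.,  2., -3.,  4.],
              [ 5., -6.,  7., -8.],
              [-5.,  6., -7.,  8.]])

# ReLU: negative responses are clipped to zero (unlike tanh, which squashes).
relu = np.maximum(a, 0)

# 2x2 max pooling with stride 2: keep the largest value in each
# non-overlapping window, halving each spatial dimension.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # 2x2 down-sampled map: [[2. 4.] [6. 8.]]
```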
A Capsule Neural Network (CapsNet) is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships; the approach is, broadly, an attempt to mimic biological neural organization.
The idea is to add structures called capsules to a convolutional neural network (CNN) and to reuse the output of some of those capsules to form representations for higher-order capsules that are more stable with respect to various perturbations.
The output of a capsule is a vector consisting of the probability of an observation and a pose for that observation, similar in spirit to classification with localization in CNNs.
Among other benefits, CapsNets address the "Picasso problem" in image recognition: images that have all the right parts but not in the correct spatial relationship (e.g., a "face" in which the positions of the mouth and one eye are switched). For image recognition, CapsNets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level.
Several datasets are available for evaluation, such as GTSRB, Bosch, and LISA. In our project we used the GTSRB dataset, which was collected by capturing roughly 10 hours of video on different roads in Germany. About 133,000 images were collected during this process and later reduced to approximately 51,840 after removal of repetitive images and other filtering. These images are split into a training set (34,799 images), a validation set (4,410 images), and a testing set (12,630 images), and each image is 32×32×3. The data covers 43 classes (e.g., Speed Limit 20 km/h, No Entry, Bumpy Road).
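Since the implementation uses the pickle module, loading the three splits might look like the following sketch; the file names train.p/valid.p/test.p and the 'features'/'labels' dictionary keys are assumptions about how the data is packaged:

```python
import pickle

def load_split(path):
    # Each hypothetical pickle file holds a dict with 'features'
    # (N x 32 x 32 x 3 uint8 images) and 'labels' (N class ids, 0-42).
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data['features'], data['labels']

X_train, y_train = load_split('train.p')  # 34,799 images
X_valid, y_valid = load_split('valid.p')  #  4,410 images
X_test,  y_test  = load_split('test.p')   # 12,630 images
print(X_train.shape)  # (34799, 32, 32, 3)
```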
A sample of images from the dataset is shown in the figure, with labels displayed above the corresponding row of images. Some of them are quite dark, so we will look to improve contrast a bit later.
There is also a significant imbalance across classes in the training set, as shown in the histogram below. Some classes have fewer than 200 images, while others have over 2,000. This means our model could be biased towards over-represented classes, especially when it is unsure of its predictions. We will see later how we can mitigate this discrepancy using data augmentation.
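The histogram itself can be produced with a few lines of matplotlib; y_train below is a random stand-in for the labels loaded earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in labels; in practice these are the 34,799 training labels.
y_train = np.random.randint(0, 43, size=34799)

# Count how many training images fall into each of the 43 classes.
counts = np.bincount(y_train, minlength=43)
plt.bar(np.arange(43), counts)
plt.xlabel('class id')
plt.ylabel('number of training images')
plt.title('GTSRB training-set class distribution')
plt.show()
```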
Pre-processing of the dataset involves applying different techniques to the images to be trained on, including brightness variation, normalization, and conversion to grayscale. As seen in the dataset images above, some of them were quite dark and poor in contrast, so the brightness and contrast of the data have been enhanced.
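One possible implementation of these steps, assuming OpenCV: grayscale conversion, histogram equalization to lift the dark low-contrast images, and normalization to [0, 1]:

```python
import cv2
import numpy as np

def preprocess(image):
    # Convert RGB to grayscale; color adds little for sign shapes.
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    # Histogram equalization spreads intensity values, improving the
    # brightness and contrast of dark images.
    eq = cv2.equalizeHist(gray)
    # Scale to [0, 1] and restore a channel axis for the network.
    return (eq / 255.0).astype(np.float32)[..., np.newaxis]

# Stand-in batch of uint8 RGB images; in practice this is X_train.
X_train = np.random.randint(0, 256, (100, 32, 32, 3), dtype=np.uint8)
X_train_p = np.array([preprocess(img) for img in X_train])
print(X_train_p.shape)  # (100, 32, 32, 1)
```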
New Dataset
We observed earlier that the data presents a glaring imbalance across the 43 classes. Some classes have many more images than others, so the network can become biased, which leads to overfitting. To address this problem we can augment the data using techniques such as flips, rotation, shear, and affine transforms. We also noticed that some images in the test set were distorted. We therefore use data augmentation both to balance the class distribution and to make the model more robust to such distortions; a sketch follows.
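A sketch of such augmentation with Keras' ImageDataGenerator; the parameter values are assumptions, and horizontal flips are deliberately left out here because mirroring changes the meaning of many signs:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-ins for the preprocessed images and labels from earlier steps.
X_train_p = np.random.rand(100, 32, 32, 1).astype('float32')
y_train = np.random.randint(0, 43, size=100)

# Random rotations, shears, shifts and zooms (affine transforms) generate
# extra training examples for under-represented classes.
datagen = ImageDataGenerator(
    rotation_range=15,       # degrees
    shear_range=10,          # shear angle in degrees
    width_shift_range=0.1,   # fraction of image width
    height_shift_range=0.1,  # fraction of image height
    zoom_range=0.1,
)
batches = datagen.flow(X_train_p, y_train, batch_size=64)
X_batch, y_batch = next(batches)  # one freshly augmented batch
```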
Our baseline CNN consists of the following layers.
1. The first layer is a convolutional layer. A filter (feature identifier), also referred to as a kernel, is used in this layer. We mainly tried 5×5 and 3×3 filter sizes and started with a depth of 32 for our first convolutional layer. The filter slides over the image, performing element-by-element multiplication, and the products are summed up. With a 5×5 filter and no padding, these operations over a 32×32 input produce a 28×28 output called an activation map.
2. After the convolutional layer, the nonlinear ReLU (Rectified Linear Unit) layer is applied. Incorporating ReLU rectified certain problems previously experienced with tanh and other such nonlinear functions, and improved results were observed.
3. The data from the ReLU layer passes to a pooling layer, which is mainly advantageous in reducing the input volume. Also referred to as downsampling, pooling applies a filter over (usually) non-overlapping subregions of the representation; max pooling keeps the maximum value within each filter window. This helps reduce computation cost and overfitting.
4. This is followed by the final, fully connected layer. It takes the output of the previous pooling layer and produces an N-dimensional vector corresponding to the N classes the program chooses from; with the GTSRB dataset, the final layer has 43 outputs, one for each class of the German traffic sign dataset. A sketch of this layer stack is given below.
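A minimal Keras sketch of this stack, assuming the single-channel 32×32 preprocessed input from earlier (exact layer counts and optimizer are illustrative):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 1. Convolution: 32 filters of size 5x5; 'valid' padding gives
    #    (32 - 5 + 1) = 28, i.e. a 28x28x32 activation map.
    layers.Conv2D(32, (5, 5), padding='valid', input_shape=(32, 32, 1)),
    # 2. ReLU nonlinearity.
    layers.Activation('relu'),
    # 3. 2x2 max pooling halves the spatial dimensions to 14x14.
    layers.MaxPooling2D((2, 2)),
    # 4. Fully connected output: one unit per GTSRB class.
    layers.Flatten(),
    layers.Dense(43, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```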
The accuracy achieved by this model was 93%. We suspect this may be due to overfitting to some specific classes (we observed an unexpected rise in validation loss after some epochs), which leads to wrongly identified data. Due to the observed losses, we switched to a Capsule Neural Network.
As described earlier, capsule networks consist of capsules rather than individual neurons. A capsule is a group of artificial neurons that performs complicated internal computations on its inputs and encapsulates the results in a small vector. Each capsule captures the relative position of an object, and any change in the object's pose causes a change in the orientation of the output vector, making capsules equivariant.
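Concretely, capsule output vectors are usually normalized with the "squash" nonlinearity from the original CapsNet paper (Sabour et al., 2017), which we assume is used here as well; a minimal TensorFlow sketch:

```python
import tensorflow as tf

def squash(s, axis=-1, eps=1e-7):
    # Shrinks short vectors toward zero and long vectors toward unit
    # length, so a capsule vector's length can be read as the probability
    # that its entity is present while its orientation encodes the pose.
    squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    norm = tf.sqrt(squared_norm + eps)  # eps avoids division by zero
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)

# Example: a batch of 8 samples, 10 capsules of dimension 16 each.
v = squash(tf.random.normal([8, 10, 16]))
print(tf.norm(v, axis=-1))  # all lengths now lie in [0, 1)
```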
We implemented this capsule network model using the Keras and TensorFlow deep learning libraries, with the CUDA and cuDNN libraries for GPU-accelerated training. The configuration of the system used was as follows:
The model is evaluated on the testing set of 12,630 images. Accuracy is computed as the ratio of correctly identified traffic signs to the total number of traffic signs; with a batch size of 50, we obtained an accuracy of 97.76 percent and a final loss of 0.032343034 on the testing dataset. The performance evaluation is based on the correct classification rate (CCR) with a binary (0/1) loss, i.e., by counting the number of misclassifications.
We obtained the following results using TensorBoard, whose main advantage is that it lets us visualize and understand the inner workings of the code as training progresses. We plotted four graphs that visualize four main aspects of the training model: accuracy, margin loss, reconstruction loss, and total loss.
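As a sketch of how these curves are produced, TensorFlow's tf.summary API (TF 2.x style) writes scalars that TensorBoard renders; the per-step values below are placeholders standing in for the real training loop:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer('logs/capsnet')

# Placeholder (accuracy, margin loss, reconstruction loss) per step.
history = [(0.90, 0.20, 8.0), (0.95, 0.10, 5.0), (0.97, 0.05, 3.0)]

with writer.as_default():
    for step, (acc, margin, recon) in enumerate(history):
        tf.summary.scalar('accuracy', acc, step=step)
        tf.summary.scalar('margin_loss', margin, step=step)
        tf.summary.scalar('reconstruction_loss', recon, step=step)
        tf.summary.scalar('total_loss', margin + 0.0005 * recon, step=step)
# View with: tensorboard --logdir logs
```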
1. Accuracy: Accuracy measures how many correct classifications the training model has made. For this purpose, we use the testing dataset along with the training dataset to compute the accuracy. The accuracy graph helps us determine whether the accuracy is improving with each iteration; as soon as the accuracy starts dropping, we need to stop training to avoid overfitting. Accuracy is determined as:
Accuracy = (number of correctly identified traffic signs) / (total number of traffic signs)
It can be seen in the graph that the accuracy keeps improving for the whole time the data is being trained; we stop the training there to avoid over-fitting.
2. Margin loss: Some classes of the test set are so close that it becomes difficult to decide whether certain data points belong to a particular class or not. The margin loss penalizes such ambiguity, and the main aim is to reduce it as much as we can so that the model can detect even hard-to-distinguish data points; its standard form is sketched below.
It can be seen in the graph that the margin loss decreases steadily while the data is being trained.
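For reference, the per-class margin loss from the original CapsNet paper (Sabour et al., 2017), which we assume is the loss optimized here; the 0.5 down-weighting of absent classes is the paper's value and is distinct from the λ used for the reconstruction loss below:

```python
import tensorflow as tf

def margin_loss(y_true_onehot, v_norm, m_plus=0.9, m_minus=0.1, down_weight=0.5):
    # The capsule length ||v_k|| for the true class should exceed m_plus,
    # while the lengths for absent classes should fall below m_minus.
    present = y_true_onehot * tf.square(tf.maximum(0.0, m_plus - v_norm))
    absent = (1.0 - y_true_onehot) * tf.square(tf.maximum(0.0, v_norm - m_minus))
    return tf.reduce_mean(tf.reduce_sum(present + down_weight * absent, axis=-1))

# Example: two samples with true classes 3 and 17 out of 43.
y_true = tf.one_hot([3, 17], depth=43)
v_norm = tf.random.uniform([2, 43])  # capsule lengths in [0, 1)
print(margin_loss(y_true, v_norm))
```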
3. Reconstruction loss: This is the sum of the squared differences between the input image and the reconstructed image:
R = Σ (Input image − Reconstructed image)²; where R = Reconstruction Loss
It is used to determine if the reconstructed image is close to the input image.
It can be seen in the graph that the reconstruction loss is constantly decreasing, indicating that the reconstructed image is getting closer to the input image.
4. Total loss: The total loss is the sum of the margin loss and the reconstruction loss scaled by a factor λ, which should be much less than one:
F = (Margin Loss) + λ(Reconstruction Loss); where F = Total Loss, λ = 0.0005
It can be seen in the graph that the total loss is constantly decreasing, indicating that the training model keeps improving.
The margin loss should always dominate the reconstruction loss. If the reconstruction loss instead dominates the total loss, the model tries to match the output image exactly to the input image of the training dataset, which leads to overfitting to the training data.
We also tested the model against 8 random traffic sign images from the internet; it correctly predicted 6 of them.
Traffic sign detection is a challenging task, and capsule networks, with their inherent ability to capture pose and spatial variances, perform better than CNNs. Capsule networks increase reliability and accuracy by correctly performing image classification and recognition even on blurred, rotated, and distorted images.
1. Yihui Wu, Yulong Liu, Jianmin Li, Huaping Liu, and Xiaolin Hu. Traffic sign detection based on convolutional neural networks. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–7. IEEE, 2013.
2. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German Traffic Sign Recognition Benchmark: a multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1453–1460. IEEE, 2011.
3. Amara Dinesh Kumar, R. Karthika, and Latha Parameswaran. Novel deep learning model for traffic sign detection using capsule networks. arXiv preprint arXiv:1805.04424, 2018.
4. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893 vol. 1, June 2005.