Semseg
Baladhurgesh Balagurusamy Paramasivan | Madhan Suresh Babu | Vamsi Bhargav Sukamanchi
Instructor : Prof. Kyumin Lee, Computer Science department, Worcester Polytechnic Institute
Being interested in autonomous driving, we wanted to understand what makes a good semantic segmentation network for road-scene understanding. Semantic segmentation is pixel-wise classification of an image, and in the autonomous driving domain it enables a vehicle to fully understand its environment. We therefore analyzed how specific layers of a popular semantic segmentation architecture, SegNet, influence the segmentation result.
SegNet is an encoder-decoder architecture consisting of 13 convolutional layers in the encoder and 13 convolutional layers in the decoder network. The novelty of SegNet lies in its decoder: rather than learning to upsample as in fully convolutional networks (FCNs), each decoder layer upsamples its input using the max-pooling indices recorded by the corresponding encoder layer, producing sparse feature maps. Convolutions with trainable filters in both the encoder and decoder networks then produce dense feature maps.
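The index-based upsampling described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual code; the channel count and spatial size are arbitrary:

```python
import torch
import torch.nn as nn

# Encoder pooling records the location of each max value;
# the decoder reuses those indices to upsample (SegNet's core idea).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)           # an encoder feature map (illustrative shape)
down, indices = pool(x)                  # downsample, remembering max locations
up = unpool(down, indices)               # sparse map: values only at the max locations
# A trainable convolution then densifies the sparse feature map.
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)(up)
```

Because unpooling only scatters the pooled values back to their recorded positions, `up` is mostly zeros; the following convolution is what produces a dense feature map.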
The advantage of SegNet is its shorter training time, since it has fewer trainable parameters than comparable networks that retain fully connected layers for classification. A multi-class soft-max classifier in the final layer produces pixel-wise class probabilities.
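The pixel-wise soft-max step amounts to normalizing over the class channel. A minimal sketch (the 19-class count matches Cityscapes' evaluation classes; the spatial size is illustrative):

```python
import torch

# N x C x H x W logits from the final layer; soft-max over the class channel C
logits = torch.randn(1, 19, 32, 32)
probs = torch.softmax(logits, dim=1)   # per-pixel class probabilities, sum to 1
pred = probs.argmax(dim=1)             # per-pixel predicted class label map
```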
SegNet (Base architecture)
OUR RESULTS :
[Figure: input image alongside its segmented output]
The convolutional layers in the base architecture perform stride-1 convolution with padding (1,1) and a 3x3 filter. To understand the influence of the max-pooling layers and to possibly improve segmentation performance, we changed each convolutional layer immediately preceding a max-pooling layer to stride-2 convolution (padding (1,1), filter size 3x3) and removed the max-pooling layers from the architecture. In the new architecture, these modified convolutional layers perform the downsampling previously done by the max-pooling layers.
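The substitution above can be sketched as follows; both variants halve the spatial resolution, but the strided convolution learns its own downsampling instead of taking the max (illustrative channel count, not the project's code):

```python
import torch
import torch.nn as nn

# Base architecture: stride-1 convolution followed by 2x2 max pooling
base = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
# Modified architecture: the convolution itself downsamples with stride 2,
# and the max-pooling layer is removed
modified = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 32, 32)
y_base, y_mod = base(x), modified(x)   # both are 16x16 spatially
```

One consequence worth noting: without pooling layers there are no max-pooling indices to pass to the decoder, so the decoder's upsampling must be learned instead.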
OUR RESULTS :
[Figure: input image alongside its segmented output]
The modified SegNet architecture was motivated by the idea that stride-2 convolutional downsampling would reduce training time and retain more of the features from earlier layers, rather than only the maximum activations kept by max pooling.
However, the strided convolutions made gradient propagation harder, and performance degraded. To improve the performance of the modified architecture, we introduced skip connections so that the simple features represented in the earlier layers remain available to the later convolutional layers in the decoder.
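A skip connection of this kind can be sketched as a decoder step that adds the matching encoder feature map after upsampling. This is a hypothetical minimal block under assumed shapes, not the project's implementation:

```python
import torch
import torch.nn as nn

class SkipDecoderBlock(nn.Module):
    """One decoder step with an additive skip from the matching encoder feature."""
    def __init__(self, ch):
        super().__init__()
        # Learned upsampling to undo the encoder's stride-2 downsampling
        self.up = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x, encoder_feat):
        x = self.up(x)
        x = x + encoder_feat          # skip connection: early features reach the decoder
        return torch.relu(self.conv(x))

blk = SkipDecoderBlock(64)
out = blk(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32))
```

The additive skip also gives gradients a short path back to the early layers, which is what recovers the trainability lost to the strided convolutions.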
OUR RESULTS :
[Figure: input image alongside its segmented output]
Training Loss Graph
Global Class Accuracy :
SegNet : 82.27%
Modified SegNet : 63.55%
Modified SegNet with Skip connections : 86.1%
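Assuming the reported metric is global (pixel-level) accuracy, it can be computed as the fraction of correctly labelled pixels over the whole evaluation set; the `ignore_label` value below is an assumption matching Cityscapes' convention for unlabelled pixels:

```python
import numpy as np

def global_accuracy(pred, gt, ignore_label=255):
    """Fraction of correctly classified pixels, skipping ignored ones."""
    valid = gt != ignore_label
    return (pred[valid] == gt[valid]).mean()

# Tiny worked example: 3 of 4 pixels match, so accuracy is 0.75
pred = np.array([[0, 1], [2, 2]])
gt   = np.array([[0, 1], [2, 0]])
acc = global_accuracy(pred, gt)  # 0.75
```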
The study was performed on the Cityscapes dataset, a large-scale dataset of street-scene images recorded in 50 different cities, with high-quality pixel-wise annotations for 5000 images. The dataset is a benchmark suite for research and evaluation of vision algorithms for urban scene understanding, such as instance-level and pixel-wise segmentation.
Training and validation set size: 3475 images; test set size: 1525 images. Link to the dataset
(1) Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Baladhurgesh Balagurusamy Paramasivan | Madhan Suresh Babu | Vamsi Bhargav Sukamanchi
bbalagurusamypar@wpi.edu | msureshbabu@wpi.edu | vsukamanchi@wpi.edu