Submission by Vishnu Dutt Sharma and Kulbir Singh Ahluwalia as Mini-Project 1 for CMSC818B offered at UMD, College Park
Semantic segmentation is a computer vision method for scene understanding. As the name suggests, it involves segmenting an image by its semantics, as shown in Fig 1. This is often achieved by assigning a semantic label to each pixel (or a similar basic unit) of the input image. It is generally an image-to-image technique, i.e. it takes an image as input and produces an image as output. Thus the output of a semantic segmentation module is another image of the same size (height and width), in which each pixel bears the semantic label of the pixel at the same location in the input image. However, other variants such as point-cloud semantic segmentation also exist. The labels can often be grouped together (e.g. the blobs for 'person' in Fig 2), and thus the output may also be viewed as regions and structures.
Fig 1: Semantic segmentation input and output. The output is a single-channel image of the same size, in which each pixel value indicates the semantic label to which that pixel belongs [F1]
Semantic segmentation is quite similar to instance segmentation, but differs slightly in output representation. In semantic segmentation, all objects of the same class are assigned the same label. In contrast, instance segmentation assigns a distinct label to each object instance. A closely related technique is object detection, where the goal is to identify the objects in the image. However, it is not a pixel-to-pixel mapping. Instead, it predicts which objects are present (classification) and sometimes draws a bounding box around each object in the input image.
Fig 2: Difference between object detection, semantic segmentation and instance segmentation. Semantic segmentation assigns the same label to entities of the same type (Person), whereas instance segmentation assigns different labels to different entities. [F2]
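The difference between the two output representations can be illustrated on a toy label map. The sketch below is only an illustration: the 4x6 array and the helper name split_into_instances are made up for this example, and splitting a class into 4-connected blobs is a simplification of real instance segmentation (which can also separate touching objects).

```python
import numpy as np
from collections import deque

# Toy 4x6 semantic label map: 0 = background, 1 = person.
# Both persons share label 1, as in semantic segmentation.
semantic = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

def split_into_instances(label_map, class_id):
    """Give each 4-connected blob of `class_id` its own positive id."""
    instances = np.zeros_like(label_map)
    next_id = 0
    h, w = label_map.shape
    for r in range(h):
        for c in range(w):
            if label_map[r, c] != class_id or instances[r, c] != 0:
                continue
            # Found an unvisited pixel of the class: flood-fill its blob.
            next_id += 1
            instances[r, c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and label_map[ny, nx] == class_id
                            and instances[ny, nx] == 0):
                        instances[ny, nx] = next_id
                        queue.append((ny, nx))
    return instances

instance = split_into_instances(semantic, class_id=1)
print(semantic)   # one label for all persons
print(instance)   # a different id per person
```

In the semantic map both people carry the value 1, while in the instance map the left person is 1 and the right person is 2.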
Applications of semantic segmentation
Semantic segmentation (sem-seg) identifies objects in an image in a way that aligns with human understanding. It is therefore used across a variety of applications. Some of them are:
Medical imaging: Detecting and locating tumors, estimating volumes of tissues, and navigating tools during surgery.
Geographical mapping: It can be used to identify land use in a country or region and to measure the area of land covered by different uses such as industry, agriculture, housing, and wildlife sanctuaries. This supports map making, traffic management, and environmental protection.
Facial segmentation: It is used to estimate age, gender, and expressions. Its results can be affected by lighting conditions, the orientation of the person and their face, and the level of detail present in the image.
Use in e-commerce online marketplaces: It is used to sort objects into categories, especially for identifying different types of clothes.
Use in agricultural robotics: It can be used to segment out patches of fields with low growth or a high amount of weeds. Farmers can then apply their resources to those locations for maximum output from the field.
Robotics (discussed below in the next section)
Applications of semantic segmentation in Robotics
In robotics, semantic segmentation is used for scene parsing to support decision making. For example, an object manipulation task requires identifying the objects present in the scene, although object detection may also be used here. As many robotic applications require real-time prediction, some architectures focus on lighter and faster models at the cost of prediction accuracy. Prominent applications of semantic segmentation in robotics are:
1. Autonomous driving: Semantic segmentation supports multiple types of decisions by detecting entities such as lanes and free road for navigation, and pedestrians and other vehicles for safe driving. A 3D semantic map, generated by combining the semantic segmentation of the scene with a depth map, is used for exploring the scene. Some variants use 3D information (RGB-D) as input and perform semantic segmentation of point clouds for autonomous driving.
Fig 3: Semantic segmentation for a typical automotive scene [F3]
2. Semantic segmentation on aerial images: It is used for surveying large scenes and for precision agriculture. Aerial-ground robot coordination systems are used for exploration in hazardous areas, where the semantic segmentation from the aerial robot, acting as the scout, is used to plan the path of the ground robot.
Fig 4: Semantic segmentation for an aerial scene [F4]
3. Indoor navigation of robots: While the applications mentioned above are outdoor scenarios, indoor robot navigation also leverages semantic segmentation for moving around objects and obstacles.
Fig 5: Semantic segmentation for an indoor scene [F5]
Listed below are some of the most popular datasets used for semantic segmentation tasks. In each case, the ground truth is a single-channel image of the same size as the input, containing one semantic label at each pixel.
CamVid: Stands for Cambridge-driving Labeled Video Database. It contains images taken by a vehicle driving around Cambridge, UK. It is used for autonomous driving applications.
Input: RGB images of size 960x720
Semantic Labels: 32
PASCAL VOC: Stands for PASCAL Visual Object Classes, the dataset used for various editions of the PASCAL VOC Challenge. It contains indoor and outdoor scenes and is mainly used for object classification.
Input: RGB images of size 500x334
Semantic Labels: 20
COCO: Developed by Microsoft, COCO stands for Common Objects in Context. This dataset is also popular for object detection and panoptic segmentation tasks.
Input: RGB images of multiple sizes
Semantic Labels: 80
ADE20K: This dataset was built by CSAIL, MIT. It contains indoor and outdoor scenes and is mainly used for scene parsing.
Input: RGB images of various sizes
Semantic Labels: 150
Cityscapes: This dataset contains scenes from 50 cities across Germany in different environmental conditions (time of day, weather). It is used for autonomous driving.
Input: RGB images of size 1024 x 2048
Semantic Labels: 30
NYUDv2: NYU Depth Dataset version 2 contains depth as well as semantic labels for indoor scenes. It is used for scene understanding. A version 1 of this dataset also exists.
Input: RGB(D) images of size 480x640
Semantic Labels: 13
SUN RGB-D: This dataset is a combination of NYU Depth v2, Berkeley B3DO, and SUN3D, and contains depth as well as semantic labels for indoor scenes.
Input: RGB(D) image
Semantic Labels: 19
The following metrics are often used to evaluate the performance of a semantic segmentation approach. We use Pred to refer to the predicted semantic segmentation and GT to refer to the ground truth semantic segmentation.
- Intersection-Over-Union (IoU)
IoU is also known as the Jaccard Index and is the most commonly used metric for semantic segmentation. IoU is defined as the ratio of the area of overlap between Pred and GT (Pred ∩ GT) to the area of the union of Pred and GT (Pred ∪ GT).
In the presence of multiple labels, both class-IoU and mean-IoU/mIoU (the average of all class-IoUs) are reported.
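As a minimal sketch of this definition, class-IoU and mIoU can be computed from two label maps with NumPy. The function name class_iou and the 3x4 example arrays are made up for illustration. Classes that appear in neither Pred nor GT are skipped here, which is one common convention rather than the only one.

```python
import numpy as np

def class_iou(pred, gt, num_classes):
    """Return an array with the IoU of every class (NaN if the class is absent)."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c                       # pixels predicted as class c
        gt_c = gt == c                           # pixels labelled as class c
        union = np.logical_or(pred_c, gt_c).sum()
        if union > 0:
            intersection = np.logical_and(pred_c, gt_c).sum()
            ious[c] = intersection / union
    return ious

# Hypothetical ground truth and prediction with 3 classes.
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 0, 2]])

ious = class_iou(pred, gt, num_classes=3)
miou = np.nanmean(ious)       # mean over the classes that are present
print(ious)                   # class IoUs: 0.6, 0.8, 0.75
print(miou)
```

For class 0, for example, Pred and GT share 3 pixels and together cover 5 pixels, giving an IoU of 3/5.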
- Pixel Accuracy
Pixel accuracy is the percentage of pixels in the image that are classified correctly. It is thus defined as
Pixel Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
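The formula can be sketched for a single class as follows. This assumes the prediction and ground truth are given as boolean masks for that class; the function name and the 2x4 masks are hypothetical.

```python
import numpy as np

def pixel_accuracy_binary(pred, gt):
    """Pixel accuracy for one class, from boolean masks of equal shape."""
    tp = np.sum(pred & gt)        # predicted positive, actually positive
    tn = np.sum(~pred & ~gt)      # predicted negative, actually negative
    fp = np.sum(pred & ~gt)       # predicted positive, actually negative
    fn = np.sum(~pred & gt)       # predicted negative, actually positive
    return (tp + tn) / (tp + tn + fp + fn)

gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0, 0],
                 [1, 1, 1, 0]], dtype=bool)

accuracy = pixel_accuracy_binary(pred, gt)
print(accuracy)                   # TP=3, TN=3, FP=1, FN=1
```

Since TP + TN counts exactly the pixels where Pred and GT agree, the same value is obtained from np.mean(pred == gt).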
In addition to these, inference time is also reported for approaches which target faster prediction.
As semantic segmentation requires semantic understanding to be incorporated in the solution, the most popular approaches are learning-based. As with other computer vision applications, most of them use deep learning. In this section we focus on these deep approaches: we first introduce some key terms and then list some of the most popular deep learning architectures and variants for semantic segmentation.
As semantic segmentation requires processing images, Convolutional Neural Networks (CNNs) are generally used for this task. One constraint here is that the output must have the same size (height and width) as the input. While a stack of convolutional layers that all keep the input resolution can work for this problem, such an architecture imposes a huge computational and memory cost due to the number of parameters and full-resolution feature maps that these layers must learn and store.
To mitigate the problem of the large number of parameters and variables, encoder-decoder architectures are generally employed. In this setting, the architecture consists of two parts: (a) Encoder: the size of the feature maps is reduced in successive layers. In CNNs this is achieved by pooling and strided convolutions. (b) Decoder: the size of the feature maps is increased in successive layers to restore the output size. In CNNs this is done by upsampling.
While such architectures help with memory and computational constraints, reducing the size of the feature maps loses information in subsequent layers. This results in poor boundary delineation, i.e. the model is unable to produce fine boundaries between objects.
Fig 6: Boundary delineation problem due to loss of fine-grained features [F6]
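The size reduction in the encoder, the size increase in the decoder, and the resulting loss of fine detail can be sketched with plain NumPy. This is a toy illustration without any learned weights: 2x2 max pooling stands in for the encoder and nearest-neighbour upsampling for the decoder, and both helper names are made up.

```python
import numpy as np

def max_pool_2x2(x):
    """Halve height and width by taking the max of each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """Double height and width by repeating every pixel."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(64, dtype=float).reshape(8, 8)                  # 8x8 "image"

encoded = max_pool_2x2(max_pool_2x2(x))                       # 8x8 -> 4x4 -> 2x2
decoded = upsample_nearest_2x(upsample_nearest_2x(encoded))   # 2x2 -> 4x4 -> 8x8

print(encoded.shape, decoded.shape)
print(np.unique(decoded).size)    # only 4 distinct values survive
```

The decoded output has the same size as the input, but it consists of four constant 4x4 blocks: the detail inside each block was discarded by pooling and cannot be recovered by upsampling alone.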
The boundary delineation problem arises from the loss of fine-grained spatial information due to pooling. To recover this information, skip-connections are used, which pass the boundary information from the encoder layers to the decoder layers. In this scheme, each encoder layer is paired with a decoder layer such that the shape of the input to the encoder layer is the same as the shape of the output of the decoder layer.
Fig 7: Skip-Connection technique used in U-Net for boundary delineation [F7]
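A skip-connection of this kind can be sketched as a channel-wise concatenation of encoder and decoder feature maps of matching height and width, which is how U-Net merges them. The sketch below uses random arrays in place of learned features, and the helper names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on a (channels, height, width) array."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_nearest_2x(x):
    """Nearest-neighbour upsampling on a (channels, height, width) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
encoder_features = rng.random((4, 8, 8))             # full-resolution encoder output

bottleneck = max_pool_2x2(encoder_features)          # (4, 4, 4): detail is lost here
decoder_features = upsample_nearest_2x(bottleneck)   # (4, 8, 8): blocky, no fine detail

# Skip-connection: stack the paired encoder features onto the decoder
# features along the channel axis so later layers can see the fine detail.
merged = np.concatenate([encoder_features, decoder_features], axis=0)
print(merged.shape)                                  # (8, 8, 8)
```

Other designs merge the paired features differently, for example by element-wise addition, but the requirement that the paired shapes match is the same.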
Dilated convolution is an alternative form of convolution used for increasing the receptive field of a CNN. In dilated convolution, the kernel taps are spaced apart so that intermediate pixels are skipped, similar to the striding operation in CNNs. While the usual approach uses convolution layers followed by deconvolution layers, dilated convolution keeps the output resolution high and so does not need upsampling. It also requires fewer parameters to be learned for a given receptive field.
(a) Standard Convolution
(b) Dilated Convolution
Fig 8: Difference between the standard convolution and dilated convolution [F8]
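The effect of dilation on the receptive field can be sketched in one dimension. The function below is an illustrative, unoptimized implementation (the name dilated_conv1d and the example signal are made up); with dilation=1 it reduces to a standard convolution without padding.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """1D convolution (no padding, no kernel flip) with taps `dilation` apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output value
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = signal[i : i + span : dilation]   # skip intermediate samples
        out[i] = np.dot(taps, kernel)
    return out

signal = np.arange(10, dtype=float)        # 0, 1, ..., 9
kernel = np.array([1.0, 1.0, 1.0])         # 3 weights in both cases

standard = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)    # each output sees a span of 5

print(standard[0])   # 0 + 1 + 2
print(dilated[0])    # 0 + 2 + 4
```

Both calls use the same three weights, but with dilation 2 each output value covers a window of five input samples instead of three.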
Following are some of the popular deep learning architectures used for semantic segmentation:
PSPNet: Pyramid Scene Parsing Network. It is called so as it uses a pyramid pooling module to generate global context information. When proposed, this architecture provided the best mIoU on the PASCAL VOC 2012 and Cityscapes datasets.
ICNet: Image Cascade Network. Designed for real-time inference, this architecture first extracts features from multiple resolutions of the input image and then fuses these features to make the prediction.
FRRN: Full-Resolution Residual Network. Based on ResNet, it has two branches, one for precise boundaries and the other for good labelling, which are combined to make the prediction.
FCN: Fully Convolutional Network. It uses only convolutional layers, with no fully connected layers, for making predictions.
U-Net: It is named after its U-shaped architecture. It was built for biomedical image segmentation.
Link-Net: It is named after the links between its encoders and decoders. It was designed for fast inference with good accuracy.
SegNet: Segmentation Network. It targets real-time inference and shares pooling indices between the encoders and decoders. It also has a variant called Bayesian SegNet, which additionally provides the uncertainty of the prediction.
E-Net: Efficient Neural Network. It was developed as a lighter, faster option for semantic segmentation. However, it has a relatively low mIoU.
RefineNet: Refinement Network. It uses chained residual pooling for sharing fine-grained information between layers.
The models listed above are only some of the deep learning architectures for semantic segmentation. Cityscapes provides a comparison of 239 models across various dimensions, including IoU (intersection over union) and inference runtime. It also provides links to their implementations. This list is available at https://www.cityscapes-dataset.com/benchmarks/
Some of the results are listed below:
Random Forest with Learned Representations for Semantic Segmentation: This approach uses a random forest over unconstrained representations for real-time semantic segmentation. While the accuracy of this approach is quite low, its inference time is much lower than that of the existing methods.
PCAMs: Weakly Supervised Semantic Segmentation Using Point Supervision: Weak supervision is a variation of machine learning that uses noisy or imperfect labels for training a model in a supervised setting. This method is used when a sufficient amount of labeled data is not available.
3D Graph Neural Networks for RGBD Semantic Segmentation: While most semantic segmentation architectures focus on 2D data, this model uses depth data along with color information (RGBD) and thus performs semantic segmentation on 3D data.
SalsaNext: Fast, Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving: This network also performs semantic segmentation on 3D data, and additionally incorporates uncertainty in its predictions.
Open Research Questions
Following are some of the key open problems for semantic segmentation:
Faster prediction with good accuracy: The key problem in semantic segmentation is striking a balance between prediction accuracy and inference time. This need arises from the real-time prediction required in applications like autonomous driving. Recent architectures have used the encoder-decoder approach to obtain a light model, but the loss of information in the encoder (due to feature reduction) affects the performance of the models, mainly through poor boundary delineation.
Better boundary delineation: Finding how to share information between the encoder and decoder to tackle boundary delineation is an active area of research. The deep learning architectures listed earlier differ in how they enable this. Approaches to this problem often employ novel architectures.
Generalization: As with any deep learning model, generalization is one of the open research problems. Here it is not limited to data-level generalization (would a model trained on the CamVid dataset work well on the Cityscapes dataset?), but extends to architecture-level generalization (does SegNet work as well for indoor datasets as it does for outdoor datasets?).
Apart from these, researchers also look into other areas which are applicable to other deep learning models as well e.g. fairness in prediction, meta-learning, etc.
We now look at a code example of a semantic segmentation model for images.
For this Python code to run, we need TensorFlow and PixelLib installed:
Import the PixelLib library:
Import the class called "semantic_segmentation":
Create an instance of the class called "semantic_segmentation_images":
Load the deeplab v3 xception model trained on the PASCAL VOC dataset available at: https://github.com/ayoolaolafenwa/PixelLib/releases/download/1.1/deeplabv3_xception_tf_dim_ordering_tf_kernels.h5
Call the function that performs semantic segmentation. Here, <path_to_image> is the input image for which we want to perform semantic segmentation and <path_to_output_image> is the segmented output image. Setting overlay = True superimposes the segmented image on the original image.
Running this Python script produces the segmented image.
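The steps above can be put together roughly as follows. This is a sketch reconstructed from the description rather than the original listing: the wrapper name segment_image_file is made up, and the method names load_pascalvoc_model and segmentAsPascalvoc are taken from the PixelLib documentation as we understand it, so they should be checked against the installed version. TensorFlow and PixelLib are assumed to have been installed beforehand (for example with pip), and the model file is assumed to have been downloaded from the link given above.

```python
def segment_image_file(image_path, output_path, model_path):
    """Segment one image with the DeepLab v3 Xception model trained on PASCAL VOC."""
    # Imported inside the function so this file can still be loaded on a
    # machine where TensorFlow and PixelLib are not installed yet.
    from pixellib.semantic import semantic_segmentation

    # Create an instance of the segmentation class.
    segmenter = semantic_segmentation()

    # Load the pre-trained weights (the .h5 file downloaded beforehand).
    segmenter.load_pascalvoc_model(model_path)

    # Run segmentation and write the result; overlay=True blends the
    # colour-coded segmentation with the original image.
    segmenter.segmentAsPascalvoc(
        image_path,
        output_image_name=output_path,
        overlay=True,
    )

# Example call (paths are placeholders to be filled in):
# segment_image_file("<path_to_image>", "<path_to_output_image>",
#                    "deeplabv3_xception_tf_dim_ordering_tf_kernels.h5")
```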
Brostow, Gabriel J., et al. "Segmentation and recognition using structure from motion point clouds." European conference on computer vision. Springer, Berlin, Heidelberg, 2008.
Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. "Semantic object classes in video: A high-definition ground truth database." Pattern Recognition Letters 30.2 (2009): 88-97.
Everingham, Mark, and John Winn. "The pascal visual object classes challenge 2012 (voc2012) development kit." Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep 8 (2011).
Zhou, Bolei, et al. "Scene parsing through ade20k dataset." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Zhou, Bolei, et al. "Semantic understanding of scenes through the ade20k dataset." International Journal of Computer Vision 127.3 (2019): 302-321.
Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Silberman, Nathan, et al. "Indoor segmentation and support inference from rgbd images." European conference on computer vision. Springer, Berlin, Heidelberg, 2012.
Janoch, Allison, et al. "A category-level 3d object dataset: Putting the kinect to work." Consumer depth cameras for computer vision. Springer, London, 2013. 141-165.
Xiao, Jianxiong, Andrew Owens, and Antonio Torralba. "Sun3d: A database of big spaces reconstructed using sfm and object labels." Proceedings of the IEEE international conference on computer vision. 2013.
Song, Shuran, Samuel P. Lichtenberg, and Jianxiong Xiao. "Sun rgb-d: A rgb-d scene understanding benchmark suite." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Zhao, Hengshuang, et al. "Icnet for real-time semantic segmentation on high-resolution images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Pohlen, Tobias, et al. "Full-resolution residual networks for semantic segmentation in street scenes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
Chaurasia, Abhishek, and Eugenio Culurciello. "Linknet: Exploiting encoder representations for efficient semantic segmentation." 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017.
Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "Segnet: A deep convolutional encoder-decoder architecture for image segmentation." IEEE transactions on pattern analysis and machine intelligence 39.12 (2017): 2481-2495.
Paszke, Adam, et al. "Enet: A deep neural network architecture for real-time semantic segmentation." arXiv preprint arXiv:1606.02147 (2016).
Lin, Guosheng, et al. "Refinenet: Multi-path refinement networks for high-resolution semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." European conference on computer vision. Springer, Cham, 2014.
Kang, Byeongkeun, and Truong Q. Nguyen. "Random Forest with Learned Representations for Semantic Segmentation." IEEE Transactions on Image Processing 28.7 (2019): 3542-3555.
McEver, R. Austin, and B. S. Manjunath. "PCAMs: Weakly Supervised Semantic Segmentation Using Point Supervision." arXiv preprint arXiv:2007.05615 (2020).
Qi, Xiaojuan, et al. "3d graph neural networks for rgbd semantic segmentation." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Kendall, Alex, Vijay Badrinarayanan, and Roberto Cipolla. "Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding." arXiv preprint arXiv:1511.02680 (2015).
Cortinhal, T., Tzelepis, G., Aksoy, E.E. "SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds for autonomous driving." arXiv preprint arXiv:2007.12668 (2020).
References for images
Header Image: SegNet Project Page
[F3] Siam, Mennatullah, et al. "Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges." 2017 IEEE 20th international conference on intelligent transportation systems (ITSC). IEEE, 2017.
[F5] Wei, Yujie, and Burcu Akinci. "A vision and learning-based indoor localization and semantic mapping framework for facility operations and management." Automation in Construction 107 (2019): 102915.
Ulku, Irem, and Erdem Akagunduz. "A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images." arXiv preprint arXiv:1912.10230 (2019).
Real-Time semantic segmentation in the browser using TensorFlow.js. Post on Towards Data Science by Hugo Zanini