Semantic Segmentation

with applications to robotics

Zhiyuan Hua

What is semantic segmentation in the first place?

Semantic segmentation is how we teach a robot to understand images. The robot runs a set of algorithms that process visual information to recognize and understand what is in an image at the pixel level. Here are some examples of semantically segmented images:

What does the robot see (understand)?

"A cow is on the grass in front of some trees and some buildings, under the sky"

"two horses are on the grass with fences around them, and there are some trees and buildings in the back"

"two men riding a bicycle on the road with building, tree, fence and car behind them"

"a table with books, cup, keyboard, tvmonitor, mouse, bottle, etc..."

Briefly, the robot takes a set of images or video frames, analyzes them collectively, and uses algorithms to separate the scene into segmented regions and structures. The robot then attaches a semantic label to each segmented region and structure.

  1. Input: Images

  2. Output: Segmented regions, structures

  3. What does it need to work?

    1. filters

    2. color information

    3. gradients information

    4. deep learning

    5. etc...

Formal Definition

Semantic segmentation is the task of partitioning an image into multiple segments with semantic labels. It consists of classifying each pixel of an image into an instance, where each instance corresponds to a class. The task is essential in computer vision and robotics because it underlies scene perception and understanding, which are required to explain the global context of the environment. Some of the most critical areas of computer vision and robotics rely heavily on semantic segmentation, such as medical imaging, autonomous driving, and human-computer interaction.
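The definition above boils down to one output: a label map that assigns a class index to every pixel. A minimal sketch, with a hypothetical toy class list and a hand-written 4x4 map:

```python
import numpy as np

# Hypothetical class list for illustration only.
CLASSES = ["background", "cow", "grass", "sky"]

# Toy 4x4 label map: each pixel holds one class index.
label_map = np.array([
    [3, 3, 3, 3],   # sky
    [0, 1, 1, 0],   # cow against background
    [2, 1, 1, 2],
    [2, 2, 2, 2],   # grass
])

# The segmentation output is exactly one class per pixel;
# listing the distinct indices tells us what the robot "sees".
present = sorted({CLASSES[i] for i in np.unique(label_map)})
print(present)
```

A real network would predict `label_map` from an input image; here it is written by hand purely to show the data structure.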

Overview of the Key Results

Classical Methods

Before the era of deep learning, a number of robust image processing techniques were designed for semantic segmentation, i.e., segmenting images into semantic areas of interest. We name a few that are still widely used today:

  • Gray Level Segmentation
    The simplest method of semantic segmentation assigns hard-coded rules or properties that a region must satisfy for a particular label to be applied to it. The rules can be framed in terms of pixel properties, such as gray-level intensity.

see here: Split and Merge Algorithm
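A hard-coded gray-level rule can be sketched in a few lines. This is a toy illustration, not any particular library's API; the threshold value is a hypothetical, application-tuned constant:

```python
import numpy as np

# Toy grayscale image (uint8 intensities).
image = np.array([
    [ 10,  20, 200, 210],
    [ 15,  25, 220, 205],
    [ 12, 180, 190,  30],
], dtype=np.uint8)

THRESHOLD = 100  # hypothetical, tuned per application

# The "rule": pixels brighter than the threshold get the foreground label.
mask = image > THRESHOLD
n_foreground = int(mask.sum())   # how many pixels received the label
```

In practice the threshold is often chosen automatically (e.g. Otsu's method) rather than hard-coded.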

  • CRFs (Conditional Random Fields)
    CRFs are a family of statistical modeling methods used for structured prediction. Unlike discrete classifiers, CRFs take the "neighboring context", i.e., the relationships between pixels, into account before making predictions. This makes them excellent candidates for semantic segmentation. Each pixel in the image is associated with a finite set of possible states. The sum of the unary and pairwise costs over all pixels is referred to as the CRF's energy (or cost/loss), and minimizing this value yields a good segmentation output.

see here: CRFs Algorithms
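The CRF energy described above (unary cost plus pairwise cost) can be computed directly on a tiny grid. This sketch uses a Potts pairwise term, a common but not the only choice; the unary costs are made-up numbers standing in for a classifier's per-pixel label costs:

```python
import numpy as np

# Unary cost: shape (H, W, n_labels); cost of assigning each label to each pixel.
unary = np.array([
    [[0.1, 2.0], [0.2, 1.5]],
    [[1.8, 0.3], [1.9, 0.2]],
])

def crf_energy(labels, unary, pairwise_weight=1.0):
    """Unary cost + Potts pairwise cost over 4-connected neighbors."""
    h, w = labels.shape
    e = sum(unary[i, j, labels[i, j]] for i in range(h) for j in range(w))
    # Potts term: pay a penalty whenever neighboring pixels disagree.
    for i in range(h):
        for j in range(w):
            if i + 1 < h and labels[i, j] != labels[i + 1, j]:
                e += pairwise_weight
            if j + 1 < w and labels[i, j] != labels[i, j + 1]:
                e += pairwise_weight
    return e

smooth = np.array([[0, 0], [1, 1]])   # coherent labeling
noisy  = np.array([[0, 1], [1, 0]])   # checkerboard labeling
```

Minimizing this energy over all labelings favors `smooth` over `noisy`: the checkerboard pays both higher unary costs and a disagreement penalty on every edge.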

  • Watershed
    Watershed segmentation is a region-based approach that uses image morphology. It requires at least one marker ("seed" point) selected inside each object of the image, including the background as a separate object. Markers are chosen by an operator or by an automated procedure that takes application-specific knowledge of the objects into account. Once the objects are marked, they can be grown using a morphological watershed transformation.

see here: Watershed Algorithm
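The marker-growing idea can be sketched as a priority flood: labeled seeds expand outward, always claiming the lowest-intensity unlabeled neighbor first. This is a toy illustration of the transform's behavior, not a production implementation (libraries such as OpenCV and scikit-image provide proper watershed routines):

```python
import heapq
import numpy as np

def marker_watershed(gray, markers):
    """Toy watershed: grow labeled seed regions outward, flooding lowest
    intensities first via a priority queue."""
    labels = markers.copy()
    h, w = gray.shape
    heap = [(gray[i, j], i, j) for i in range(h) for j in range(w)
            if labels[i, j] > 0]
    heapq.heapify(heap)
    while heap:
        _, i, j = heapq.heappop(heap)
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and labels[ni, nj] == 0:
                labels[ni, nj] = labels[i, j]   # neighbor joins this basin
                heapq.heappush(heap, (gray[ni, nj], ni, nj))
    return labels

# Two dark basins separated by a bright ridge; one seed in each.
gray = np.array([
    [1, 1, 9, 2, 2],
    [1, 1, 9, 2, 2],
    [1, 1, 9, 2, 2],
])
markers = np.zeros_like(gray)
markers[1, 0] = 1   # seed for the left basin
markers[1, 4] = 2   # seed for the right basin
labels = marker_watershed(gray, markers)
```

Every pixel ends up assigned to one of the two seeded regions, with the bright ridge going to whichever basin reaches it first.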

  • GrabCut
    The GrabCut method addresses the challenge of separating objects from the background in a color image, given certain constraints. The user draws a single rectangle around the object, identifying the outside of the rectangle as definite background and the inside as an unknown mixture of object (foreground) and some background.

see here: GrabCut Algorithm

Deep Learning Methods

There are a number of research papers covering current state-of-the-art approaches to semantic segmentation, including:

  1. Fully Convolutional Networks for Semantic Segmentation

  2. U-Net: Convolutional Networks for Biomedical Image Segmentation

  3. Multi-Scale Context Aggregation by Dilated Convolutions

  4. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  5. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

  6. Improving Semantic Segmentation via Video Propagation and Label Relaxation

  7. Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

Source: Hao, S., Zhou, Y., & Guo, Y. (2020). A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing. doi:10.1016/j.neucom.2019.11.118 

How is It Related to Decision Making for Robots?

This section presents a semantic segmentation network for developing an indoor and outdoor navigation system for a robot. Semantic segmentation can be applied by adopting various techniques, such as a fully convolutional neural network (FCN) or classical methods. For example, using transfer learning on ResNet-18, a residual neural network can be trained to differentiate between the floor, which is the free space for navigation, and the walls, which are barriers.

After training, the semantic segmentation floor mask is used for indoor navigation and motion calculations for the autonomous mobile robot. These motion calculations are based on how much the estimated path deviates from the vertical center line of the image. The highest point of the free space is used to steer the motors toward that direction. In this way the robot can move through a real scenario while avoiding obstacles.
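The steering rule described above can be sketched directly from a floor mask. The mask below is a hypothetical network output (1 = free floor, 0 = obstacle), and the "highest point" heuristic is implemented as stated: find the topmost row of free space and compare its center to the image's vertical center line:

```python
import numpy as np

# Hypothetical floor mask from a segmentation network: 1 = free floor, 0 = obstacle.
mask = np.array([
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
])

# "Highest point" of the free space: the topmost row containing floor pixels.
rows, cols = np.nonzero(mask)
top_row = rows.min()
target_col = cols[rows == top_row].mean()   # center of free space at that row

# Steering error: deviation of the target from the image's vertical center line.
center = (mask.shape[1] - 1) / 2.0
steering = target_col - center   # > 0: steer right, < 0: steer left
```

A real controller would low-pass filter this error and map it to wheel velocities; the sketch only shows where the error signal comes from.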

In recent years, three-dimensional reconstruction and semantic understanding have drawn extensive attention. Current reconstruction and segmentation techniques, such as those for indoor environments or self-driving cars, primarily target large-scale scenes. There are few studies on small-scale, high-precision scene reconstruction for manipulator operation, which plays an essential role in decision-making and intelligent control systems.

Brief Description of Variants


ReNet and ResNet

ReNet provides an alternative way of constructing a network architecture by replacing convolutional layers with multi-directional recurrent neural networks (RNNs). The representative ReNet-based semantic segmentation technique is ReSeg. The deep residual network (ResNet) makes much deeper networks possible and achieves better performance on various vision tasks. Its key contribution is modeling the residual representation into the structure of the CNN, which overcomes the difficulty of training very deep networks. This is revolutionary in the sense that it gives robotics another architecture besides the FCN.


DenseNet

DenseNet links every layer to every other layer, unlike conventional approaches that make a network deeper or wider. Its benefits lie in the following aspects: 1) fewer parameters, 2) more feature reuse, and 3) a better training mechanism that relieves the vanishing-gradient and model-degeneration problems. The representative DenseNet-based semantic segmentation methods include DenseASPP, FC-DenseNet, and SDNet. This may be applied in remote, high-accuracy, high-latency robotic systems.


MobileNet

To balance accuracy against computing cost, networks must be designed carefully, and various lightweight networks have been developed with this in mind. MobileNetV1 implements depthwise separable convolutions, which yield a large efficiency gain: with 4.2M parameters, it achieves 70.6 percent accuracy on the ImageNet classification task. MobileNetV2 is built on an inverted residual structure, addressing the shortcomings of MobileNetV1. MobileNetV3 achieves better performance with even fewer parameters by incorporating an attention mechanism. For real-time applications such as robotics, MobileNetV1 and MobileNetV2 are very helpful.
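The efficiency gain of depthwise separable convolution is simple arithmetic: a standard k x k convolution mixes all channels at once, while the separable version splits that into a per-channel spatial filter plus a 1x1 channel mixer. A worked count for one layer:

```python
# Parameter-count arithmetic behind MobileNet's depthwise separable convolution.
def standard_conv_params(c_in, c_out, k):
    # One k x k x c_in filter per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixes the channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3   # example layer sizes
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std   # roughly 1/c_out + 1/k^2 of the standard cost
```

For this layer the separable version uses well under an eighth of the parameters, which is where MobileNet's small footprint comes from.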


ResNeXt

ResNeXt features a homogeneous, multi-branch architecture with only a few hyperparameters to set, boosting network efficiency while retaining network complexity. Semantic segmentation methods such as DShortcut use ResNeXt as their backbone.

Overview of the Important Applications


Robot-Assisted Surgery

Semantic segmentation of robotic instruments is an important problem in robot-assisted surgery. One key challenge is correctly locating an instrument within the surgical scene for monitoring and pose estimation. Addressing this challenge requires accurate pixel-wise instrument segmentation.

Autonomous Driving

In recent years, the deep learning approach has gained a lot of interest in the field of machine learning, and experiments have been carried out on semantic image segmentation to assist the autonomous driving of vehicles.

Drone Navigation

Semantic segmentation is a crucial task for robotic drones' navigation and safety. Classical approaches are often used because of the time sensitivity of aerial applications. However, deep learning approaches to semantic segmentation are becoming viable as computing power and accuracy improve consistently.


Real-time semantic segmentation

A large number of methods targeting real-time semantic segmentation have been proposed. While excellent results have been achieved in terms of both accuracy and efficiency, there is still wide room for improvement. Taking the state-of-the-art DFANet as an example, its MIoU on Cityscapes is still about 10 percent lower than PSPNet's. Jointly improving precision and speed is therefore required in this direction of study.
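Since MIoU is the metric these comparisons rest on, it is worth seeing how it is computed: per-class intersection over union between the predicted and ground-truth label maps, averaged over classes. A minimal sketch on a toy pair of maps:

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean Intersection-over-Union between predicted and ground-truth label maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred   = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1]])    # one mislabeled pixel
```

Here class 0 scores 3/4 and class 1 scores 4/5, so the single wrong pixel costs the prediction a mean IoU of 0.775 instead of 1.0.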

Occluded objects segmentation

Human vision can easily identify and recover occluded sections of an object, but existing computer vision technologies do not replicate this ability well. One significant explanation is that most existing segmentation algorithms target a hard partition without the ability to transfer information. Researchers have suggested realizing this task on the basis of depth knowledge and have developed datasets such as PASCAL 3D+ for it.

Weakly/Unsupervised segmentation

Classical methods such as the watershed and GrabCut algorithms fall under the definition of weakly supervised or unsupervised segmentation. In the modern machine learning era, however, most successful semantic segmentation is based on fully supervised learning over large annotated datasets. Researchers suggest that weakly supervised or unsupervised segmentation may be achieved by giving robots a basis of knowledge and learning from classical methods such as clustering, subspace learning, and other refinements.

Semantic segmentation in Videos

Most of what we have presented about semantic segmentation concerns 2D images, and the field's main focus remains heavily on the single-image level. However, real-world robotics applications require visual recognition and understanding of video footage, where each frame is highly correlated with the next. Current methods use oriented gradients and inter-frame correlation to improve efficiency and coverage. Compared to image semantic segmentation, however, segmentation in videos still has a lot of potential to make significant practical impacts for robotics.
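The inter-frame correlation idea can be sketched with a simple label-propagation rule: wherever consecutive frames barely differ, reuse the previous frame's labels and only re-segment the changed pixels. This is a toy illustration, not a specific published method; the threshold and the -1 "needs re-segmentation" marker are assumptions:

```python
import numpy as np

def propagate_labels(prev_frame, curr_frame, prev_labels, diff_threshold=10):
    """Reuse the previous frame's labels wherever pixels barely changed;
    mark only the changed pixels (-1) for fresh segmentation."""
    changed = np.abs(curr_frame.astype(int) - prev_frame.astype(int)) > diff_threshold
    labels = prev_labels.copy()
    labels[changed] = -1   # these pixels must be re-segmented
    return labels

prev_frame  = np.array([[10, 10, 200], [10, 10, 200]], dtype=np.uint8)
curr_frame  = np.array([[10, 10, 200], [10, 90, 200]], dtype=np.uint8)  # one pixel moved
prev_labels = np.array([[0, 0, 1], [0, 0, 1]])
labels = propagate_labels(prev_frame, curr_frame, prev_labels)
```

Only the single changed pixel needs to go back through the segmenter, which is where the efficiency gain over per-frame segmentation comes from.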


References

Amodei, D., et al.: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Proceedings of the International Conference on Machine Learning (ICML), New York, USA, 173–182 (2016)

Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv preprint arXiv:1511.00561 (2015)

B. J. Meyer and T. Drummond, "Improved semantic segmentation for robotic applications with hierarchical conditional random fields," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 5258-5265, doi: 10.1109/ICRA.2017.7989617.

Bouget, D., Benenson, R., Omran, M., Riffaud, L., Schiele, B., Jannin, P.: Detecting surgical tools by modelling local appearance and global shape. IEEE transactions on medical imaging 34(12), 2603–2617 (2015)

Cordts, M., et al.: the Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 3213–3223 (2016)

Colwell, R.N.: History and Place of Photographic Interpretation. Manual of Photographic Interpretation 2, 33–48 (1997)

Deep Learning Architectures, (accessed May 1, 2017)

Doignon, C., Nageotte, F., De Mathelin, M.: Segmentation and guidance of multiple rigid objects for intra-operative endoscopic vision. In: Dynamical Vision. pp. 314– 327. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)

Fourure, Damien, et al. "Residual conv-deconv grid network for semantic segmentation." arXiv preprint arXiv:1707.07958 (2017).

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision Meets Robotics: the KITTI Dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013)

LeCun, Y.: Backpropagation Applied to Handwritten ZIP Code Recognition. Neural Computation 1(4), 541–551 (1989)

Münzer, B., Schoeffmann, K., Böszörmenyi, L.: Content-based processing and analysis of endoscopic images and videos: A survey. Multimedia Tools and Applications 77(1), 1323–1362 (2018)

Mitrokhin, A., Hua, Z., Fermüller, C., Aloimonos, Y.: Learning Visual Motion Segmentation Using Event Surfaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14414–14423 (2020)

Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

Szegedy, C., et al.: Going Deeper with Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, 1–9 (2015)

Yosinski, J., et al.: How Transferable are Features in Deep Neural Networks?. Advances in Neural Information Processing Systems, 3320–3328 (2014)