Complementary Random Masking for RGB-T Semantic Segmentation
Abstract
RGB-thermal semantic segmentation is a promising approach to reliable semantic scene understanding under adverse weather and lighting conditions. However, previous studies have mostly focused on designing multi-modal fusion modules without considering the nature of multi-modal inputs. As a result, the networks easily become over-reliant on a single modality and struggle to learn complementary, meaningful representations for each modality. This paper proposes 1) a complementary random masking strategy for RGB-T images and 2) a self-distillation loss between clean and masked input modalities. The proposed masking strategy prevents over-reliance on a single modality and improves the accuracy and robustness of the network by forcing it to segment and classify objects even when one modality is only partially available. The proposed self-distillation loss further encourages the network to extract complementary and meaningful representations from a single modality or from complementarily masked modalities. With the proposed method, we achieve state-of-the-art performance on three RGB-T semantic segmentation benchmarks.
Problem Statement
Problem: Over-reliance on single-modality in RGB-T semantic segmentation
Without considering the nature of multi-modal inputs, networks easily fall into a sub-optimal solution that relies on a single modality.
This makes the network susceptible to a wide range of failure cases, such as sensor disconnection, lens occlusion, and other forms of input quality degradation.
It also forfeits the chance to learn modality-specific or complementary representations from multi-modal inputs.
Methods Overview
Complementary Random Masking and Self-distillation for RGB-Thermal Semantic Segmentation
Features
Complementary Random Masking: Augments input RGB-T images with complementary random masks to prevent the network from over-relying on one modality for the RGB-T semantic segmentation task.
Self-distillation Loss: Enforces prediction consistency between augmented and original images, encouraging the network to extract meaningful representations even from partially occluded modalities or a single modality.
Performance: Achieves state-of-the-art results on three RGB-T benchmark datasets (i.e., MFNet, PST900, and KAIST pedestrian datasets)
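The core of the complementary masking idea can be sketched as follows. This is an illustrative example, not the paper's exact implementation: the patch size, masking ratio, and the use of a per-patch binary grid are assumptions. The key property is that the thermal mask is the exact complement of the RGB mask, so every patch of the scene remains visible to exactly one modality.

```python
import random

def complementary_patch_masks(h, w, patch=16, mask_ratio=0.5, seed=None):
    """Build complementary binary keep-masks over a patch grid.

    Returns two flat lists of 0/1 flags (one per patch): a patch kept
    in the RGB branch is dropped in the thermal branch, and vice versa,
    so the union of visible patches always covers the whole image.
    Hypothetical parameters: `patch` and `mask_ratio` are illustrative.
    """
    rng = random.Random(seed)
    grid_h, grid_w = h // patch, w // patch
    n_patches = grid_h * grid_w
    # Randomly decide which patches the RGB branch keeps (1 = visible).
    rgb_keep = [1 if rng.random() > mask_ratio else 0 for _ in range(n_patches)]
    # The thermal mask is the exact complement of the RGB mask.
    thermal_keep = [1 - k for k in rgb_keep]
    return rgb_keep, thermal_keep
```

In training, each mask would be broadcast to pixel resolution and multiplied into the corresponding modality before the forward pass, so the network must rely on whichever modality still sees each region.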
Semantic segmentation results of RGB-thermal images on the MFNet, PST900, and KAIST pedestrian datasets.
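The self-distillation loss can likewise be sketched as a consistency term between the prediction on the clean input (teacher, which would be detached from the gradient in practice) and the prediction on the masked input (student). The paper's exact loss form is not reproduced here; a per-pixel KL divergence is one common choice and is used below purely for illustration.

```python
import math

def self_distillation_loss(p_clean, p_masked, eps=1e-8):
    """Illustrative consistency loss between clean and masked predictions.

    p_clean, p_masked: lists of per-pixel class-probability lists
    (each inner list sums to 1). Returns the mean per-pixel KL
    divergence KL(p_clean || p_masked); `eps` guards against log(0).
    """
    total = 0.0
    for pc, pm in zip(p_clean, p_masked):
        # KL(pc || pm) for one pixel's class distribution.
        total += sum(c * math.log((c + eps) / (m + eps))
                     for c, m in zip(pc, pm))
    return total / len(p_clean)
```

Minimizing this term pushes the masked-input (or single-modality) prediction toward the clean-input prediction, which is what encourages each modality to carry segmentation-relevant information on its own.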
Publication
"Complementary Random Masking for RGB-T Semantic Segmentation" [PDF]
Ukcheol Shin, Kyunghyun Lee, In So Kweon, and Jean Oh
IEEE International Conference on Robotics and Automation (ICRA), 2024
BibTeX
TBA