Weakly Supervised Semantic Segmentation

Railroad is not a Train: Saliency as Pseudo-pixel Supervision for Weakly Supervised Semantic Segmentation

People (*: equal contribution)

Seungho Lee1*, Minhyun Lee1*, Jongwuk Lee2, Hyunjung Shim1

1 Yonsei University 2 Sungkyunkwan University

Links

[paper] [code]

Abstract

Existing studies in weakly supervised semantic segmentation (WSSS) using image-level weak supervision have several limitations: sparse object coverage, inaccurate object boundaries, and co-occurring pixels from non-target objects. To overcome these challenges, we propose a novel framework, namely Explicit Pseudo-pixel Supervision (EPS), which learns from pixel-level feedback by combining two weak supervision signals: the image-level label provides the object identity via the localization map, and the saliency map from an off-the-shelf saliency detection model offers rich object boundaries. We devise a joint training strategy to fully exploit the complementary relationship between the two sources of information. Our method obtains accurate object boundaries and discards co-occurring pixels, thereby significantly improving the quality of pseudo-masks. Experimental results show that the proposed method remarkably outperforms existing methods by resolving key challenges of WSSS and achieves new state-of-the-art performance on both the PASCAL VOC 2012 and MS COCO 2014 datasets.

Motivation

Motivating example of utilizing both the saliency map and the localization map for WSSS. (a) Ground truth, (b) saliency map via PFAN, (c) localization map via CAM, and (d) our EPS, which utilizes both the saliency map and the localization map for training a classifier. Note that the saliency map fails to capture the person and the car, while our result correctly recovers them, and the localization map expands beyond the two objects.

Architecture of EPS

The overall framework of our EPS. C + 1 localization maps (one per class plus a background map) are generated from a backbone network. The actual saliency map is produced by the off-the-shelf saliency detection model. The localization maps corresponding to the target labels are selectively combined to form an estimated saliency map. The whole framework is jointly trained with the saliency loss and the classification loss.
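The joint training described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the mixing weight `lam`, the use of `max` to merge the present-class foreground maps, the L1 form of the saliency loss, and the binary cross-entropy classification loss are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimated_saliency(loc_maps, labels, lam=0.5):
    """Combine the foreground localization maps of the target labels with the
    inverted background map (the (C+1)-th map) into one estimated saliency map.
    `lam` is an assumed mixing weight, not a value from the paper."""
    fg = sigmoid(loc_maps[:-1])                    # (C, H, W) foreground maps
    bg = sigmoid(loc_maps[-1])                     # (H, W)    background map
    # keep only the maps whose class appears in the image-level label
    fg_present = (fg * labels[:, None, None]).max(axis=0)
    return lam * fg_present + (1.0 - lam) * (1.0 - bg)

def saliency_loss(est_sal, sal):
    """Pixel-level feedback: L1 distance between estimated and actual saliency."""
    return np.abs(est_sal - sal).mean()

def classification_loss(loc_maps, labels):
    """Multi-label classification loss on globally average-pooled class maps."""
    logits = loc_maps[:-1].mean(axis=(1, 2))       # GAP over each class map
    probs = sigmoid(logits)
    eps = 1e-7
    return -(labels * np.log(probs + eps)
             + (1 - labels) * np.log(1 - probs + eps)).mean()

# toy example: 3 classes + background, 8x8 localization maps
rng = np.random.default_rng(0)
loc_maps = rng.normal(size=(4, 8, 8))              # backbone outputs (C+1 maps)
labels = np.array([1.0, 0.0, 1.0])                 # classes 0 and 2 are present
sal = rng.uniform(size=(8, 8))                     # off-the-shelf saliency map

est = estimated_saliency(loc_maps, labels)
total = saliency_loss(est, sal) + classification_loss(loc_maps, labels)
print(est.shape, total >= 0.0)
```

In an actual training loop, `total` would be backpropagated through the backbone so that the classification loss shapes object identity while the saliency loss supplies pixel-level boundary feedback.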

Results

Segmentation results (mIoU) on PASCAL VOC 2012. All results are based on VGG16.

Segmentation results (mIoU) on PASCAL VOC 2012. All results are based on ResNet101.

Qualitative examples of segmentation results on PASCAL VOC 2012. (a) Input images, (b) groundtruth and (c) our EPS.