Semi-Supervised Salient Object Detection via Synthetic Data

[Github] [Paper]

Figure 1: Fully-supervised SOD methods require pixel-level human-annotated data to achieve promising performance, which is labor-intensive (A). Existing GANs (B) can generate high-fidelity and diverse images, but (C) how to generate the corresponding pixel-wise masks remains an open problem. The proposed SODGAN (D) can synthesize unlimited high-quality image-mask pairs, and these synthesized data can be used to train SOTA SOD networks.

Abstract

Recently, deep learning-based approaches have achieved remarkable progress in salient object detection (SOD). However, deep networks are extremely data-hungry, typically requiring training on large-scale pixel-level annotations to deliver such promising results. In this paper, we propose a simple yet effective method for semi-supervised SOD, coined SODGAN, which can generate an unlimited number of high-quality image-annotation pairs from only a small set of labeled data! These generated data can then be used to train any off-the-shelf SOD model, just like a real dataset. Concretely, we discover that the interpretable direction corresponding to the foreground object can be disentangled from the background in the GAN feature space. Moreover, our approach is efficient and applicable to most popular GAN models (e.g., StyleGAN and StyleGAN2). Without any bells and whistles, our approach achieves new state-of-the-art performance in the semi- and weakly-supervised settings, and even outperforms state-of-the-art fully supervised methods on three public benchmarks: ECSSD, HKU-IS, and PASCAL-S. We believe this novel SODGAN underlies a new class of representation learning that can generalize to other tasks (e.g., semantic segmentation).
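To make the generate-then-train idea concrete, below is a minimal PyTorch sketch: sample latent codes, synthesize image/mask pairs, and train an off-the-shelf SOD network on them as if they were real data. The generator modules are simple stand-ins for the pre-trained GAN and the proposed mask-generator branch, and all names, shapes, and hyperparameters are illustrative assumptions rather than the actual SODGAN implementation.

```python
# Hedged sketch of the SODGAN workflow: synthesize image/mask pairs, then train
# an off-the-shelf SOD model on them. All modules below are illustrative stubs.
import torch
import torch.nn as nn


class StubImageGenerator(nn.Module):
    """Placeholder for the frozen, pre-trained GAN generator Gimage(z, c)."""
    def forward(self, z, c):
        # Returns fake RGB images in [0, 1]; a real run would call BigGAN/StyleGAN here.
        return torch.rand(z.shape[0], 3, 256, 256)


class StubMaskGenerator(nn.Module):
    """Placeholder for the proposed mask-generator branch."""
    def forward(self, z, c):
        # Returns fake saliency masks in [0, 1].
        return torch.rand(z.shape[0], 1, 256, 256)


class TinySODNet(nn.Module):
    """Stand-in for any off-the-shelf SOD model trained on the synthetic pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # logits, paired with BCEWithLogitsLoss below


g_image, g_mask, sod = StubImageGenerator(), StubMaskGenerator(), TinySODNet()
optimizer = torch.optim.Adam(sod.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

for step in range(3):                     # tiny demo loop
    z = torch.randn(4, 128)               # latent codes z ~ N(0, 1)
    c = torch.randint(0, 1000, (4,))      # class labels
    with torch.no_grad():                 # the generators stay frozen
        images = g_image(z, c)
        masks = g_mask(z, c)
    loss = criterion(sod(images), masks)  # supervise the SOD net with synthetic masks
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```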

Network Architecture

Figure 2: Overview of the proposed SODGAN. The model consists of a pre-trained BigGAN generator Gimage(·), which synthesizes high-quality images, and our proposed mask generator, which produces the synthetic labels. Given a latent code z ∼ N(0, 1) and a class label c, we collect multiple feature maps f0, f1, ..., f12 from Gimage(·). We then upsample these feature maps to 256 × 256 resolution and concatenate them, constructing pixel-wise feature maps for all pixels of the synthesized image. Finally, these pixel-wise features are fed into the proposed mask generator branch to produce the saliency mask.
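The sketch below illustrates the feature-collection step from Figure 2, assuming access to a generator's intermediate feature maps: each map is upsampled to 256 × 256, all maps are concatenated channel-wise, and a small convolutional head maps the per-pixel features to a saliency mask. The stand-in features, channel counts, and the MaskGeneratorBranch module are illustrative assumptions, not the exact architecture used in the paper.

```python
# Minimal sketch of upsample -> concatenate -> per-pixel mask prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskGeneratorBranch(nn.Module):
    """Maps per-pixel concatenated GAN features to a 1-channel saliency mask."""

    def __init__(self, in_channels: int, hidden: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, pixel_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(pixel_features))


def collect_pixel_features(feature_maps, size=256):
    # Upsample every intermediate feature map to size x size and concatenate
    # along the channel dimension, giving one feature vector per pixel.
    upsampled = [
        F.interpolate(f, size=(size, size), mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    return torch.cat(upsampled, dim=1)


if __name__ == "__main__":
    # Stand-in for the multi-resolution features f0, ..., f12 collected from the
    # pre-trained generator Gimage(z, c); real channel counts and depths differ.
    batch = 2
    fake_features = [torch.randn(batch, ch, res, res)
                     for ch, res in [(512, 8), (256, 16), (128, 32), (64, 64)]]
    pixel_feats = collect_pixel_features(fake_features, size=256)
    mask_head = MaskGeneratorBranch(in_channels=pixel_feats.shape[1])
    saliency = mask_head(pixel_feats)  # shape: (batch, 1, 256, 256)
    print(saliency.shape)
```

Because every pixel of the synthesized image carries a concatenated multi-scale feature vector, a lightweight 1 × 1 convolutional head is enough to turn these features into a dense saliency mask.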

Experimental Results

1. StyleGAN segmentation

Figure 3: Synthetic image and pixel-wise label pairs from StyleGAN (left) and StyleGAN2 (right).

2. BigGAN segmentation

Figure 4: Synthetic image and pixel-wise label pairs from BigGAN, which is trained on ImageNet.

3. Comparison with state-of-the-art

Table 1: Quantitative comparisons with different methods on 5 datasets in terms of MAE (smaller is better) and max/avg F-measure, S-measure, and AUC (larger is better). The best results are highlighted in bold.
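For reference, here is a brief sketch of two of the metrics reported in Table 1, MAE and max F-measure. The β² = 0.3 weighting follows common SOD practice, and the simple threshold sweep is an illustrative simplification rather than the exact evaluation protocol used for Table 1.

```python
# Hedged sketch of MAE and max F-measure for a saliency map and a binary mask.
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean absolute error between the saliency map and the ground truth (smaller is better).
    return float(np.mean(np.abs(pred - gt)))


def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    # Sweep binarization thresholds and report the best F-measure (larger is better).
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        binary = pred >= t
        tp = np.logical_and(binary, gt > 0.5).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / ((gt > 0.5).sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best


pred = np.random.rand(256, 256)                              # example saliency prediction in [0, 1]
gt = (np.random.rand(256, 256) > 0.5).astype(np.float32)     # example binary ground-truth mask
print(mae(pred, gt), max_f_measure(pred, gt))
```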

Figure 5: Visual comparison of the proposed model and existing state-of-the-art methods in some challenging cases.