Weakly Supervised Object Localization

Attention-based Dropout Layer for Weakly Supervised Object Localization

People

Junsuk Choe and Hyunjung Shim

Abstract

Weakly Supervised Object Localization (WSOL) techniques learn the object location using only image-level labels, without location annotations. A common limitation of these techniques is that they cover only the most discriminative part of the object, not the entire object. To address this problem, we propose an Attention-based Dropout Layer (ADL), which utilizes the self-attention mechanism to process the feature maps of the model. The proposed method is composed of two key components: 1) hiding the most discriminative part from the model to capture the integral extent of the object, and 2) highlighting the informative region to improve the recognition power of the model. Through extensive experiments, we demonstrate that the proposed method is effective in improving WSOL accuracy, achieving a new state-of-the-art localization accuracy on the CUB-200-2011 dataset. We also show that the proposed method is much more efficient than existing techniques in terms of both parameter and computation overhead.

Overview and Contributions

The self-attention map is generated by channel-wise average pooling of the input feature map. Based on the self-attention map, we produce a drop mask by thresholding and an importance map by applying a sigmoid activation. The drop mask and the importance map are selected stochastically at each iteration and applied to the input feature map. Please note that this figure illustrates the case in which the importance map is selected.
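
A minimal PyTorch sketch of this forward pass, assuming illustrative hyperparameter names and default values (drop_rate for how often the drop mask is chosen, drop_threshold as a fraction of the maximum attention value); these names and numbers are our own choices for the sketch, not necessarily those of the released code:

```python
import torch
import torch.nn as nn

class ADL(nn.Module):
    def __init__(self, drop_rate=0.75, drop_threshold=0.9):
        super().__init__()
        self.drop_rate = drop_rate            # probability of choosing the drop mask (assumed)
        self.drop_threshold = drop_threshold  # fraction of the max attention value (assumed)

    def forward(self, x):
        if not self.training:
            return x  # ADL modifies features only during training
        # Self-attention map: channel-wise average pooling, shape (N, 1, H, W).
        attention = x.mean(dim=1, keepdim=True)
        # Importance map: sigmoid over the self-attention map.
        importance = torch.sigmoid(attention)
        # Drop mask: zero out positions whose attention exceeds a threshold
        # relative to the per-sample maximum (the most discriminative part).
        max_val = attention.amax(dim=(2, 3), keepdim=True)
        drop_mask = (attention < self.drop_threshold * max_val).float()
        # Stochastically select one of the two maps at each iteration.
        if torch.rand(1).item() < self.drop_rate:
            return x * drop_mask   # hide the most discriminative part
        return x * importance      # highlight the informative region
```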

Drop mask and self-attention map at each layer of the VGG network. At lower-level layers, the self-attention maps capture general features, while at higher-level layers they capture class-specific features. The drop masks also erase the most discriminative part more effectively at higher-level layers. Please note that the drop mask is overlaid on the input image for better visualization. Because the importance map has a distribution very similar to that of the self-attention map, we do not visualize it.
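
One way to inspect such per-layer self-attention maps is a forward hook that averages a feature map over channels and upsamples it to image size for overlay. The sketch below uses torchvision's stock VGG16 as a stand-in backbone and an illustrative layer index; both are assumptions for the example:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

model = vgg16(weights="IMAGENET1K_V1").eval()
captured = {}

def hook(module, inputs, output):
    # Self-attention map: average over channels, shape (N, 1, H, W).
    captured["attention"] = output.mean(dim=1, keepdim=True)

# Attach the hook to one convolutional layer (index 28 ~ conv5_3, a higher-level layer).
handle = model.features[28].register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed input image
with torch.no_grad():
    model(image)
handle.remove()

# Upsample to the input resolution so the map can be overlaid on the image.
attention_map = F.interpolate(captured["attention"], size=(224, 224),
                              mode="bilinear", align_corners=False)
```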

Results


Qualitative evaluation results of VGG-GAP on CUB-200-2011 and ImageNet-1k. The left image in each figure is the input image. The red bounding box is the ground truth, while the green bounding box is the estimate. The middle image is the heatmap, and the right image shows the overlap between the input image and the heatmap. We also compare our method with the vanilla model side by side.
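
Estimated boxes like the green ones above are typically extracted from the heatmap by thresholding it and taking the tight box around the surviving region, as is standard for CAM-style methods. A sketch of that step, with an illustrative threshold and function name:

```python
import numpy as np

def heatmap_to_bbox(heatmap, threshold=0.2):
    """heatmap: 2-D array in [0, 1] at input-image resolution; threshold is illustrative."""
    # Keep positions whose activation is at least a fraction of the maximum.
    mask = heatmap >= threshold * heatmap.max()
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None
    # Tight (x_min, y_min, x_max, y_max) box around the surviving region.
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```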


Quantitative evaluation results on CUB-200-2011 and ImageNet-1k. Bold text indicates the best localization accuracy for each backbone network. We also underline the best score in each dataset. Overheads are computed relative to each backbone network. Scores marked with an asterisk (*) are taken from the corresponding original papers. We leave some Top-1 Cls scores blank because they are not reported in the original paper [60]. To reproduce the baseline methods, we use the hyperparameters suggested in their original papers. We also train and test HaS and ADL under the same setting for a fair comparison.
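
Top-1 localization accuracy in WSOL is conventionally scored as follows: a prediction counts as correct when the Top-1 class matches the label and the estimated box overlaps the ground-truth box with IoU of at least 0.5. A self-contained sketch of that criterion (function names are illustrative):

```python
def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def top1_loc_correct(pred_class, true_class, pred_box, true_box):
    # Correct class prediction AND sufficient box overlap.
    return pred_class == true_class and iou(pred_box, true_box) >= 0.5
```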

Publication

Attention-based Dropout Layer for Weakly Supervised Object Localization

Junsuk Choe and Hyunjung Shim, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. (Oral talk)

Links

[pdf][code]