Attention Branch Network:

Learning of Attention Mechanism for Visual Explanation

Paper

Slide

Code (image classification)

Code (Facial attribute)

Abstract

Visual explanation enables humans to understand the decision making of a deep convolutional neural network (CNN), but it is insufficient on its own to improve CNN performance. In this paper, we focus on the attention map for visual explanation, which represents high response values at the attention locations in image recognition. Such an attention region can significantly improve CNN performance when used by an attention mechanism that focuses on a specific region in an image. In this work, we propose the Attention Branch Network (ABN), which extends a response-based visual explanation model by introducing a branch structure with an attention mechanism. ABN is applicable to several image recognition tasks by introducing a branch for the attention mechanism and is trainable for visual explanation and image recognition in an end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attribute recognition. Experimental results indicate that ABN outperforms the baseline models on these image recognition tasks while generating an attention map for visual explanation.

Architecture

ABN consists of three modules: a feature extractor, an attention branch, and a perception branch. The feature extractor contains multiple convolution layers and extracts feature maps from an input image. The attention branch converts the attention locations, obtained in the manner of Class Activation Mapping (CAM), into an attention map used by an attention mechanism. The perception branch outputs the probability of each class by receiving the feature maps from the feature extractor together with the attention map.

For classification, ABN is designed to separate the branch that generates the attention map from the branch that outputs the probability of each class. This network design is applicable to other image recognition tasks, such as multi-task learning.
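The three-module structure above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the backbone layers, channel widths, and module names are assumptions, and only the overall wiring (CAM-style attention branch, residual attention `g(x) * (1 + M(x))`, perception branch) follows the description above.

```python
import torch
import torch.nn as nn

class AttentionBranchNetwork(nn.Module):
    """Minimal ABN sketch: feature extractor, attention branch, perception branch.
    Layer sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor: a few conv layers standing in for a ResNet backbone.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        # Attention branch: CAM-style 1x1 convs producing per-class response maps,
        # collapsed into a single-channel attention map M(x).
        self.att_conv = nn.Conv2d(128, num_classes, 1)
        self.att_map = nn.Sequential(nn.Conv2d(num_classes, 1, 1), nn.Sigmoid())
        # Perception branch: classifier on the attention-weighted feature maps.
        self.perception = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes)
        )

    def forward(self, x):
        g = self.feature_extractor(x)      # feature maps g(x)
        cam = self.att_conv(g)             # per-class response maps
        att_out = cam.mean(dim=(2, 3))     # attention-branch logits (GAP over maps)
        M = self.att_map(cam)              # attention map M(x) in [0, 1]
        g_att = g * (1.0 + M)              # residual attention mechanism
        per_out = self.perception(g_att)   # perception-branch logits
        return att_out, per_out, M
```

Training end-to-end then amounts to summing a classification loss on both branch outputs, e.g. `loss = ce(att_out, y) + ce(per_out, y)`, so the attention map is learned jointly with recognition.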

Results (image classification)

Our experimental results show that ABN outperforms the baseline models in accuracy on image classification tasks, including the CIFAR-10/100, SVHN, and ImageNet datasets.

As shown in this figure, Grad-CAM, CAM, and ABN highlight similar regions. In particular, the original image in the third column is a typical example because it contains multiple objects, such as a "seat belt" and an "Australian terrier". In this case, Grad-CAM (on a conventional ResNet152) and CAM fail, but ABN performs well. When visualizing the attention maps in the third column, the attention map of ABN highlights each object. This shows that the attention map can focus on a specific region even when multiple objects appear in an image.



We show more attention maps for image classification on the ImageNet dataset. These attention maps are generated by ResNet152 with ABN. Our attention maps highlight the regions of the target object while ignoring irrelevant regions, such as the background.

Results (facial attribute recognition)

As a multi-task learning setting, we evaluate ABN on multiple facial attribute recognition using the CelebA dataset, which consists of 40 facial attributes. As shown in this figure, our attention maps highlight specific locations such as the mouth, eyes, beard, and hair. These highlighted locations correspond to the specific facial attribute tasks, and it is conceivable that they contribute to the performance improvement of ABN.

Reinforcement Learning + ABN (Private work)

Surprisingly, ABN is also applicable to reinforcement learning, such as Atari games. In this work, we apply ABN to A3C on Atari games.

Deep reinforcement learning (RL) has great potential for determining optimal action selection in complicated environments. However, it is difficult to interpret the reason for an agent's action selection. In this work, we analyze the decision making of a deep RL agent by introducing a visual explanation method developed in the computer vision field. To this end, we propose a network architecture based on ABN and the actor-critic method. ABN generates an attention map that visually indicates the reason for the network output, and the attention map is also used by the attention mechanism to improve recognition performance. Meanwhile, the actor-critic method outputs both a policy and a state value from a single network. Combining the structures of ABN and the actor-critic method, we build two branches that respectively output a policy and a state value.
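The two-branch structure described above can be sketched as follows. This is a hedged illustration under stated assumptions: the layer sizes, the Atari-style 4-frame input, and the choice to attach the state value to the attention branch (which, per the text, estimates the state value while producing the attention map) are our own modeling choices, not the exact published architecture.

```python
import torch
import torch.nn as nn

class ActorCriticABN(nn.Module):
    """Illustrative ABN + actor-critic sketch: the attention branch estimates
    the state value and yields an attention map; the perception branch
    outputs the policy from attention-weighted features."""

    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        # Feature extractor: Atari-style conv stack (sizes are assumptions).
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        )
        # Attention branch: produces the attention map and the state value.
        self.att_conv = nn.Conv2d(64, 64, 1)
        self.att_map = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())
        self.value_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1)
        )
        # Perception branch: outputs policy logits over actions.
        self.policy_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_actions)
        )

    def forward(self, x):
        g = self.feature_extractor(x)
        a = torch.relu(self.att_conv(g))
        value = self.value_head(a)                       # state value V(s)
        M = self.att_map(a)                              # attention map over the screen
        policy_logits = self.policy_head(g * (1.0 + M))  # residual attention
        return policy_logits, value, M
```

In A3C-style training, `policy_logits` and `value` would feed the usual policy-gradient and value losses, while `M` can be upsampled to the input resolution and overlaid on game frames for analysis, as in the SpaceInvaders and MsPacman visualizations below.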

The proposed method outputs an attention map highlighting the regions where a higher reward factor exists while estimating a state value. Experimental results with Atari 2600 games demonstrate that the reason behind the decision making can be analyzed from the attention map obtained by the proposed method. Moreover, we show that our method enables the deep RL to achieve a higher control performance.


※ 2021/3/10: A follow-up version of this work is available on arXiv (I am not an author): https://arxiv.org/abs/2103.04067

SpaceInvaders


MsPacman


Bibtex

@InProceedings{Fukui2019,
  author    = {Fukui, Hiroshi and Hirakawa, Tsubasa and Yamashita, Takayoshi and Fujiyoshi, Hironobu},
  title     = {{Attention Branch Network: Learning of Attention Mechanism for Visual Explanation}},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2019}
}