Top-Down Visual Attention from Analysis by Synthesis

Baifeng Shi       Trevor Darrell        Xin Wang

         UC Berkeley                          UC Berkeley                     Microsoft Research

Motivation: Top-Down vs. Bottom-Up Attention

Figure 1. (a) Bottom-up attention. (b-c) Top-down attention.

Human visual attention is often task-guided, i.e., we tend to focus on different objects when performing different tasks. For example, when we answer different questions about the same image, we attend only to the objects that are relevant to each question (Fig. 1(b-c)).

This stands in contrast to the widely used self-attention, which is purely stimulus-driven, i.e., it highlights all salient objects in the image without task-guided selection (Fig. 1(a)).

While stimulus-driven bottom-up attention has shown promising results in visual representation learning, current vision transformers still lack task-guided top-down attention, which provides task-adaptive representations and could potentially improve task-specific performance.

Derivation: Top-Down Attention from Analysis by Synthesis

Figure 2. (a) Attention is equivalent to sparse reconstruction [1]. (b) AbS solves a similar sparse reconstruction problem where the sparse code is modulated by a top-down signal.

Analysis by Synthesis (AbS) states that human vision is a Bayesian inference system: our understanding of an image is influenced by our prior beliefs about the world.

Top-down attention does something similar, i.e., the intermediate representations are modulated by the high-level task, which can be formulated as a prior.

Indeed, we can derive top-down attention from AbS by noticing that: a) Attention is functionally equivalent to sparse reconstruction [1], and b) AbS solves a similar sparse reconstruction problem where the sparse code is modulated by a top-down signal.
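As a schematic illustration (following the standard sparse-coding form used in [1]; the exact objective and the way the prior enters may differ in the paper), attention corresponds to inferring a sparse code z for the tokens x under a dictionary D, and AbS adds a term that biases the code toward a top-down signal ξ:

```latex
% Attention as sparse reconstruction (schematic form, following [1]):
z^{\ast} = \arg\min_{z}\; \tfrac{1}{2}\,\lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1
% AbS solves a similar problem in which the sparse code is additionally
% biased toward a top-down prior \xi (one simple way to write this):
z^{\ast}_{\xi} = \arg\min_{z}\; \tfrac{1}{2}\,\lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1 - \langle \xi,\, D z \rangle
```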

This means that, by selecting different priors in AbS, we can direct the model to attend to different objects.

[1] Shi, Baifeng, et al. "Visual attention emerges from recurrent sparse reconstruction." arXiv preprint arXiv:2204.10962 (2022). 

AbSViT: ViT With Top-Down Attention

We propose AbSViT (Analysis-by-Synthesis ViT), which contains a feedforward and a feedback path. The feedforward path is a regular ViT, and the feedback path consists of a linear decoder in each layer.

Inference of AbSViT has four steps: 1) pass the image through the feedforward ViT; 2) reweight the output tokens by their similarity with a prior token ξ (e.g., a language embedding of the task description); 3) send the reweighted tokens back through the feedback decoders to the intermediate layers; and 4) run the feedforward path again, but this time each self-attention layer also receives a top-down input. The top-down input is added to the values in self-attention, while the queries and keys are left untouched. A minimal sketch of this loop is given below.
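To make the four steps concrete, here is a minimal PyTorch-style sketch of the inference loop. The class and argument names (AbSViTSketch, top_down, the sigmoid reweighting) are illustrative assumptions, not the authors' released implementation; in particular, the blocks are assumed to accept an optional top_down tensor that they add to the self-attention values.

```python
import torch
import torch.nn as nn

class AbSViTSketch(nn.Module):
    def __init__(self, blocks, dim):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)                     # feedforward ViT blocks
        self.decoders = nn.ModuleList(                          # one linear decoder per layer
            [nn.Linear(dim, dim) for _ in blocks])

    def forward(self, tokens, prior):
        # tokens: (B, N, D) input tokens; prior: (B, 1, D) prior token xi.
        # 1) first feedforward pass
        x = tokens
        for blk in self.blocks:
            x = blk(x)

        # 2) reweight output tokens by their similarity with the prior token
        sim = torch.sigmoid((x * prior).sum(-1, keepdim=True))  # (B, N, 1)
        fb = x * sim

        # 3) propagate the reweighted tokens back through the feedback
        #    decoders, producing a top-down signal for each layer
        top_down = []
        for dec in list(self.decoders)[::-1]:
            fb = dec(fb)
            top_down.insert(0, fb)

        # 4) second feedforward pass; each block also receives a top-down
        #    input, which it adds to its self-attention values
        x = tokens
        for blk, td in zip(self.blocks, top_down):
            x = blk(x, top_down=td)
        return x
```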

Training of AbSViT requires a variational loss so that it approximates Bayesian inference. Specifically, in addition to the supervised loss, there are two extra losses: a) a prior loss, i.e., maximizing the similarity between the final output and the prior token, and b) a reconstruction loss, i.e., the l-th layer's decoder must reconstruct the l-th layer's tokens from the (l+1)-th layer's tokens (sketched below).
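A rough sketch of the two extra terms, assuming feats[l] holds the layer-l tokens from the feedforward pass and decoders[l] is the layer-l feedback decoder; the names, shapes, and the specific similarity/reconstruction measures (cosine similarity, MSE) are illustrative choices, not necessarily the paper's exact loss.

```python
import torch.nn.functional as F

def prior_loss(output_tokens, prior):
    """Prior loss: push the final output toward the prior token.

    output_tokens: (B, N, D) tokens from the second feedforward pass.
    prior:         (B, D) prior token xi.
    """
    pooled = output_tokens.mean(dim=1)                         # (B, D)
    # maximizing similarity = minimizing negative cosine similarity
    return -F.cosine_similarity(pooled, prior, dim=-1).mean()

def reconstruction_loss(feats, decoders):
    """Reconstruction loss: layer l's decoder reconstructs layer-l tokens
    from layer-(l+1) tokens (MSE used here as an illustrative choice)."""
    loss = 0.0
    for l in range(len(feats) - 1):
        recon = decoders[l](feats[l + 1])
        loss = loss + F.mse_loss(recon, feats[l].detach())
    return loss / (len(feats) - 1)
```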

Experiments: Vision-Language Tasks

Table 1. Results on VQA and zero-shot image retrieval.

Figure 3. Attention map of AbSViT on VQA and its comparison to human attention.

Experiments: Classification, Semantic Segmentation, and Robustness

Table 2. Results on ImageNet classification and robustness benchmarks.

Table 3. Results on semantic segmentation.

Figure 4. Attention map of AbSViT and bottom-up ViT.