HAAV: Hierarchical Aggregation of Augmented Views
for Image Captioning
Georgia Institute of Technology
Abstract
A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings?
In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model’s data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level.
We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.
Motivation
- Given
Heterogeneous views, each represented by a sequence of feature tokens such as (1) image grid features, (2) detected objects, and (3) retrieved text descriptions.
- Efficiency
State-of-the-art VL models are typically transformer encoder-decoder models, which have undesirable quadratic computational complexity with respect to input sequence length. Therefore, as more views are incorporated, each represented by a sequence of features, we should carefully manage computation and model size.
Moreover, on the medium-scale MS-COCO image captioning benchmark (~0.6M training samples), we should take label efficiency into consideration when training the data-hungry transformer model to avoid negative effects such as overfitting.
- Effectiveness
Different views contain some shared and some complementary information of the input image. Therefore, it is important to model the effectiveness of views and adaptively weigh them according to their effectiveness for predicting each word.
In the left figure, when predicting the word “sofa” for the incomplete caption “black bags sitting on top of a ?”, if the detected-objects view fails to detect the sofa in the input image, the captioning model should down-weight this less effective view and rely on other, more effective views that properly encode the information about the sofa.
Proposed Method
- Efficiency: regard heterogeneous views as augmentations of the input image.
Naturally use a shared transformer encoder to encode each view independently
> O(|V|) linear computational complexity
> O(1) constant parameter complexity
Data augmentation increases data diversity and thus improves label efficiency.
Incorporate a contrastive loss across views to help representation learning of heterogeneous views and increase data efficiency
> Unlike other VL methods, which require annotated pairs (e.g. image-text pairs scraped from the internet)
> Works with unlabeled, image-only data to achieve better performance
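The shared-encoder idea and the cross-view contrastive loss can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the module sizes, the mean-pooling of views, and the InfoNCE-style pairing (views of the same image are positives, views of other images in the batch are negatives) are assumptions for the sketch. Encoding each view separately costs the sum of per-view quadratic terms, O(|V|) in the number of views, rather than the quadratic cost of one concatenated sequence, and the encoder parameters stay constant as views are added.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared transformer encoder for all views (O(1) parameters in |V|).
# d_model, nhead, num_layers are illustrative values.
d_model = 256
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

def encode_views(views):
    """views: list of (batch, seq_len_v, d_model) tensors, one per view."""
    # Each view is encoded independently, so the cost is sum_v O(L_v^2)
    # instead of O((sum_v L_v)^2) for one concatenated sequence.
    return [shared_encoder(v) for v in views]

def contrastive_loss(views_enc, temperature=0.1):
    """InfoNCE-style loss across views: mean-pool each encoded view,
    then contrast pooled embeddings across the batch."""
    # (num_views, batch, d_model), L2-normalized pooled embeddings
    z = F.normalize(torch.stack([v.mean(dim=1) for v in views_enc]), dim=-1)
    loss, count = 0.0, 0
    for i in range(len(views_enc)):
        for j in range(len(views_enc)):
            if i == j:
                continue
            logits = z[i] @ z[j].T / temperature  # (batch, batch)
            labels = torch.arange(z.size(1))      # positives on the diagonal
            loss = loss + F.cross_entropy(logits, labels)
            count += 1
    return loss / count
```

Because the loss contrasts views of the same image, it needs no annotated image-text pairs and can also be computed on unlabeled, image-only data.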
- Effectiveness: devise a hierarchical decoder layer to account for the effectiveness of heterogeneous views.
Two-tiered cross-attention modules
First aggregate within each view at the token level to model the effectiveness of each view.
Then aggregate across views at the view level to adaptively weigh each view according to its effectiveness.
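The two-tiered aggregation can be sketched as below. This is an assumed, simplified rendering of the idea rather than the authors' code: the class name `HierarchicalCrossAttn`, the head counts, and the reshaping details are illustrative. Tier one attends over the tokens of one view to produce a per-view summary for each decoding position; tier two attends over those summaries, and its attention weights are the adaptive per-view weights.

```python
import torch
import torch.nn as nn

class HierarchicalCrossAttn(nn.Module):
    """Hypothetical two-tiered cross-attention for a decoder layer."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        # Tier 1: attend over tokens within a single view
        self.token_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Tier 2: attend over the per-view summaries
        self.view_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, query, views):
        """query: (batch, tgt_len, d) decoder states;
        views: list of (batch, L_v, d) encoded views."""
        # Token level: one summary of each view per query position
        summaries = [self.token_attn(query, v, v)[0] for v in views]
        b, t, d = query.shape
        # Treat the |V| summaries as a short sequence per position
        s = torch.stack(summaries, dim=2).reshape(b * t, len(views), d)
        q = query.reshape(b * t, 1, d)
        # View level: weights give the adaptive importance of each view
        out, weights = self.view_attn(q, s, s)
        return out.reshape(b, t, d), weights.reshape(b, t, len(views))
```

The view-level attention weights sum to one over the views at every decoding step, which is what lets the decoder down-weight a less effective view.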
Quantitative Results
On MS-COCO, compared with trained-from-scratch models, our HAAV outperforms the previous state-of-the-art Xmodal-Ctx by 5.6% in CIDEr.
On MS-COCO, compared with larger models pre-trained at scale, our HAAV, despite being trained only on MS-COCO, achieves comparable or often better performance.
On Flickr30K, our HAAV outperforms the previous state-of-the-art ORT substantially, by 12.9% in CIDEr.
On Flickr30K, our HAAV can be trained in a novel semi-supervised way and achieves a further +3.9% CIDEr improvement.
Compared to other common multi-view aggregation approaches, our HAAV is computation-, parameter-, and label-efficient.
Compared to other common multi-view aggregation approaches, despite being more efficient, HAAV achieves the best performance.
HAAV on MS-COCO captioning compared with small trained-from-scratch models.
HAAV on MS-COCO captioning compared with larger models pre-trained at scale.
Supervised and our novel semi-supervised image captioning on Flickr30K.
Ablation study to demonstrate the computation and parameter efficiency of our HAAV. HAAV does not sacrifice performance in pursuit of efficiency and achieves the best performance.
HAAV only requires 50% of labeled data to achieve the same performance as other methods.
Qualitative Results
We design two control studies in the figure above to show how the view-level attention weights of CrossAttnLv2 vary adaptively according to the effectiveness of an input view.
- Add random noise to a view
In the first experiment, we add noise by randomly zeroing out tokens in a view, making it less effective, and expect the attention weight on that noised view to drop.
In the leftmost figure above, the weight on the noised view drops consistently at each word-prediction step compared to the same view without added noise. This means that our hierarchical decoder indeed learns to adaptively weigh the views according to their effectiveness at the view level.
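The noising used in this control study can be sketched as below. This is a hypothetical helper, not the authors' code: the function name `zero_out_tokens` and the default mask ratio are illustrative assumptions.

```python
import torch

def zero_out_tokens(view, ratio=0.5, generator=None):
    """Make a view less effective by zeroing a random fraction of its
    feature tokens. view: (batch, seq_len, d) tokens for one view."""
    b, L, _ = view.shape
    # Keep a token with probability (1 - ratio); zeroed tokens carry
    # no information, so the view becomes less useful for decoding.
    keep = (torch.rand(b, L, 1, generator=generator) >= ratio).float()
    return view * keep
```

Feeding the noised view through the decoder and comparing the view-level attention weights against the clean view is what produces the drop shown in the leftmost figure.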
- Mask a prominent region of the input image
In the second experiment, we randomly mask out a prominent region of the input image for a view. For example, we mask out the dog in the middle input image, whose caption is “a dog laying down beside a little couch”, to make the view less effective at the step of generating the word “dog”. We expect the weight on the masked view to drop at that step.
In the rightmost figure above, the weights for the masked view drop consistently across all attention heads compared to the same view without masking. This means that our hierarchical decoder indeed learns to adaptively weigh the input views according to their usefulness at the word level.