HAAV: Hierarchical Aggregation of Augmented Views
for Image Captioning

Georgia Institute of Technology

Abstract

A great deal of progress has been made in image captioning, driven by research into how to encode the input image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and, more recently, textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings become available and are incorporated, a natural question arises: how can we efficiently and effectively leverage this heterogeneous set of encodings?

In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model’s data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation, first aggregating within each view at the token level and then across views at the view level.
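As a rough sketch of this two-level aggregation (our own simplification, not the exact HAAV decoder layer: the module name HierarchicalCrossAttention, the use of PyTorch's nn.MultiheadAttention, and the head-averaged weights are illustrative assumptions), a decoder layer could first cross-attend to the tokens of each encoded view independently and then attend across the resulting per-view outputs:

import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    """Two-level aggregation sketch: token level within each view,
    then view level across the per-view outputs."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # One token-level cross-attention module shared across all views.
        self.token_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # View-level cross-attention over the per-view aggregates.
        self.view_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, caption_q, views):
        # caption_q: (B, T, D) decoder hidden states of the partial caption.
        # views: list of |V| tensors, each (B, L_v, D), one per encoded view.
        per_view = []
        for v in views:
            # Token level: attend over the tokens of a single view.
            out, _ = self.token_xattn(caption_q, v, v)   # (B, T, D)
            per_view.append(out)
        stacked = torch.stack(per_view, dim=2)           # (B, T, |V|, D)
        B, T, V, D = stacked.shape
        # View level: each caption position attends over its |V| view summaries.
        q = caption_q.reshape(B * T, 1, D)
        kv = stacked.reshape(B * T, V, D)
        fused, view_w = self.view_xattn(q, kv, kv)       # weights: (B*T, 1, V)
        return fused.reshape(B, T, D), view_w.reshape(B, T, V)

# Toy usage: three views of different lengths, batch of 2, caption length 7.
views = [torch.randn(2, L, 512) for L in (49, 36, 20)]
caption_q = torch.randn(2, 7, 512)
fused, view_weights = HierarchicalCrossAttention()(caption_q, views)
print(fused.shape, view_weights.shape)  # torch.Size([2, 7, 512]) torch.Size([2, 7, 3])

The per-view, per-step weights returned by the second stage correspond to the kind of view-level attention weights examined in the qualitative control studies below.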

We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to state-of-the-art methods, and conduct rigorous analyses to demonstrate the importance of each part of our design.

Motivation

Heterogeneous views of the input image, each represented by a sequence of feature tokens: (1) image grid features, (2) detected objects, and (3) retrieved text descriptions.

State-of-the-art VL models typically adopt a transformer encoder-decoder architecture, whose computational complexity grows quadratically with the input sequence length. Therefore, as more views are incorporated, each represented by a sequence of features, computation and model size must be carefully managed.

Moreover, on the medium-scale MS-COCO image captioning benchmark (~0.6M training samples), we should take label efficiency into consideration when training the data-hungry transformer model to avoid negative effects such as overfitting.

Different views contain partly shared and partly complementary information about the input image. Therefore, it is important to model the effectiveness of each view and to adaptively weigh the views accordingly when predicting each word.

In the figure on the left, consider predicting the word “sofa” for the incomplete caption “black bags sitting on top of a ?”. If the detected-objects view fails to detect the sofa in the input image, the captioning model should down-weigh this less effective view and rely on other, more effective views that properly encode information about the sofa.

Proposed Method

> O(|V|) computational complexity: linear in the number of views |V|

> O(1) parameter complexity: constant in the number of views

> Unlike other VL methods, HAAV does not require annotated pairs (e.g. image-text pairs scraped from the internet)

> Works with unlabeled image-only data to achieve better performance (see the contrastive-loss sketch below)
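As a minimal sketch of how a contrastive objective across encoded views of the same image could be computed from unlabeled images (the mean pooling, temperature, and symmetric InfoNCE form below are illustrative assumptions, not the exact HAAV loss):

import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(view_a, view_b, temperature=0.07):
    """InfoNCE-style loss between two encoded views of the same images.
    view_a, view_b: (B, L, D) token sequences from the shared encoder.
    Only images are needed (no captions), so unlabeled data can be used."""
    # Mean-pool the tokens into one vector per view (illustrative choice).
    za = F.normalize(view_a.mean(dim=1), dim=-1)   # (B, D)
    zb = F.normalize(view_b.mean(dim=1), dim=-1)   # (B, D)
    logits = za @ zb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    # Views of the same image are positives; other images in the batch are negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a grid-feature view and a detected-object view for 4 images.
loss = cross_view_contrastive_loss(torch.randn(4, 49, 512), torch.randn(4, 36, 512))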

Quantitative Results

HAAV on MS-COCO captioning compared with small models trained from scratch.

HAAV on MS-COCO captioning compared with larger models pre-trained at scale.

Supervised and our novel semi-supervised image captioning on Flickr30K.

Ablation study demonstrating the computation and parameter efficiency of HAAV. HAAV does not sacrifice performance in pursuit of efficiency; it achieves the best performance as well.

HAAV requires only 50% of the labeled data to achieve the same performance as other methods.

Qualitative Results

We design two control studies in the figure above to show how the view-level attention weights of CrossAttnLv2 vary adaptively according to the effectiveness of an input view.

In the first experiment, we add noise to a view by randomly zeroing out its tokens, making the view less effective, and expect the attention weights toward that noised view to drop.

In the leftmost figure above, the weights for the noised view drop consistently at each word-prediction step compared to the same view without added noise. This means that our hierarchical decoder indeed learns to adaptively weigh the views according to their effectiveness at the view level.
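A minimal sketch of the token-zeroing perturbation used in this control study is shown below; the drop probability and the helper names (model, decode_with_view_weights) are hypothetical placeholders:

import torch

def zero_out_tokens(view, drop_prob=0.3):
    """Control study 1: perturb one view by randomly zeroing whole tokens.
    view: (B, L, D) encoded view; returns a noised copy."""
    keep = (torch.rand(view.shape[:2], device=view.device) > drop_prob).float()
    return view * keep.unsqueeze(-1)

# Hypothetical comparison: decode with the clean and with the noised view and
# track the view-level attention weight assigned to that view at each step.
#   clean_w  = decode_with_view_weights(model, [v1, v2, v3])[:, :, 0]
#   noised_w = decode_with_view_weights(model, [zero_out_tokens(v1), v2, v3])[:, :, 0]
# A consistently lower noised_w indicates the decoder down-weighs the noised view.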

In the second experiment, we randomly mask out a prominent region of the input image for a view. For example, we mask out the dog in the middle input image with the caption “a dog laying down beside a little couch”, making the view less effective at the step that generates the word “dog”. We expect the weights toward the masked view to drop at that step.

In the rightmost figure above, the weights for the masked view drop consistently across all attention heads compared to the same view without masking. This means that our hierarchical decoder indeed learns to adaptively weigh the input views according to their usefulness at the word level.
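A minimal sketch of the region-masking perturbation at the grid-feature level, assuming a row-major H x W token layout (the function name, grid size, and bounding-box convention are our own assumptions):

import torch

def mask_region_tokens(grid_view, bbox, grid_hw=(7, 7)):
    """Control study 2: zero the grid-feature tokens whose cell center falls
    inside a prominent region. bbox = (x0, y0, x1, y1) in normalized [0, 1]
    coordinates; grid_view: (B, H*W, D) with a row-major H x W token layout."""
    H, W = grid_hw
    x0, y0, x1, y1 = bbox
    ys = (torch.arange(H).float() + 0.5) / H   # cell centers, vertical
    xs = (torch.arange(W).float() + 0.5) / W   # cell centers, horizontal
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    inside = (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)   # (H, W)
    keep = (~inside).reshape(1, H * W, 1).float().to(grid_view.device)
    return grid_view * keep

# Compare the per-head view-level weights at the step generating "dog",
# with and without the mask, and check for a consistent drop.
masked = mask_region_tokens(torch.randn(2, 49, 512), bbox=(0.2, 0.3, 0.7, 0.9))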

Resources

@inproceedings{kuo2023hierarchical,
  title={HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning},
  author={Chia-Wen Kuo and Zsolt Kira},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}