Beyond a Pre-Trained Object Detector:
Cross-Modal Textual and Visual Context for Image Captioning

Georgia Institute of Technology

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and, later, on fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all the necessary information is unrealistic, especially when the detector is transferred across datasets.

In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, the object detector outputs are fixed due to a frozen model and hence do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively that this can improve grounding.

We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.

What's Missing?


Most existing works model the image captioning problem with the graphical model shown in the top-left figure: given an input image X, a frozen, pre-trained object detector detects a set of objects O, and the caption Y is generated conditioned only on O.

From the graphical model, we can identify two major issues that arise from encoding the input image with only a frozen, pre-trained object detector: (1) the caption is conditioned solely on the detected objects O, which cannot capture all of the information needed for captioning (e.g., attributes and relationships between objects), especially when the detector is transferred across datasets; and (2) because the detector is frozen, its outputs are fixed and cannot be refined with respect to the input image X, which makes it harder for the captioning model to properly ground them.
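
To make these issues concrete, the conditional structure can be sketched as follows (illustrative notation, not taken verbatim from the paper; Det is the frozen detector, CLIP_img and CLIP_txt are the frozen CLIP encoders, and FC denotes the fully connected fusion layers described in the next section):

Baseline:  p(Y | X) ≈ p(Y | O),        where O = Det(X) is fixed
Ours:      p(Y | X) ≈ p(Y | O', D'),   where D = CLIP-Retrieve(X),
           O' = FC([O ; CLIP_img(X)]),  D' = FC([CLIP_txt(D) ; CLIP_img(X)])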

Proposed Method

Retrieved Text Descriptions

We parse the region-level textual descriptions from the Visual Genome dataset to construct the description database.
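
As a rough sketch, the database construction can look like the following (the file name region_descriptions.json and the "regions"/"phrase" fields follow the public Visual Genome release; the minimum-length filter is our own illustrative choice, not necessarily the paper's exact preprocessing):

# Build the description database from Visual Genome region descriptions.
import json

def build_description_database(path="region_descriptions.json", min_words=2):
    with open(path) as f:
        images = json.load(f)                     # one entry per Visual Genome image
    descriptions = set()
    for image in images:
        for region in image["regions"]:
            phrase = region["phrase"].strip().lower()
            if len(phrase.split()) >= min_words:  # drop trivial one-word phrases
                descriptions.add(phrase)
    return sorted(descriptions)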

We leverage CLIP to perform cross-modal retrieval between the input image and the textual descriptions in the description database. To retrieve finer-grained descriptions, we also use five-crop and nine-crop views of the image as additional queries.

After retrieving the set of text descriptions, we encode them with the text encoder from CLIP before passing them to the captioning model.
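
A minimal sketch of the retrieval-and-encoding step with the public CLIP package is shown below; the number of retrieved descriptions per query (k) and the way the image crops are generated are placeholders, not the paper's exact settings:

# Cross-modal retrieval of text descriptions with CLIP.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def retrieve_descriptions(image_views, descriptions, text_features, k=5):
    # image_views:   (V, 3, 224, 224) full image plus its five/nine crops, preprocessed
    # text_features: (N, D) pre-computed, L2-normalized CLIP text embeddings
    image_features = model.encode_image(image_views.to(device)).float()
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    sims = image_features @ text_features.to(device).float().T   # (V, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices                          # top-k descriptions per view
    retrieved = list(dict.fromkeys(descriptions[i] for i in topk.flatten().tolist()))
    # Re-encode the retrieved descriptions with CLIP's text encoder so they can be
    # fed to the captioning model as feature vectors.
    tokens = clip.tokenize(retrieved, truncate=True).to(device)
    return retrieved, model.encode_text(tokens)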

Model Architecture

We propose to model and strengthen the conditional relationship between the detected objects and the input image so that the features computed by the object detector can be refined before being sent to the captioning model. For this purpose, we first encode the input image into a global representation with the pre-trained image encoder from CLIP. We then concatenate each detected object feature with the CLIP-encoded global image feature along the feature dimension and pass the concatenated feature through an FC layer.

Since the text descriptions are also retrieved offline by a frozen, pre-trained CLIP model, we strengthen the conditional relationship between the retrieved textual descriptions and the input image in the same way.
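
A minimal PyTorch sketch of this conditioning step is shown below; the feature dimensions (2048-d detector features, 512-d CLIP embeddings) are typical placeholder values, not necessarily the paper's exact sizes:

# Condition detected-object and retrieved-description features on the global CLIP image feature.
import torch
import torch.nn as nn

class ImageConditioner(nn.Module):
    def __init__(self, feat_dim, clip_dim=512, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim + clip_dim, out_dim)

    def forward(self, features, clip_image_feature):
        # features:           (B, N, feat_dim)  detected objects or text descriptions
        # clip_image_feature: (B, clip_dim)     global CLIP image embedding
        expanded = clip_image_feature.unsqueeze(1).expand(-1, features.size(1), -1)
        fused = torch.cat([features, expanded], dim=-1)   # concat along the feature dimension
        return self.fc(fused)

# One instance for the detected objects, another for the retrieved descriptions.
condition_objects = ImageConditioner(feat_dim=2048)
condition_descriptions = ImageConditioner(feat_dim=512)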

The enhanced detected objects and textual descriptions serve as the input to an existing image captioning model without any change to that model. The captioning model can then be trained with the commonly used cross-entropy (maximum likelihood) loss for word prediction and fine-tuned with an RL loss using the CIDEr score as the reward, in the same way as before.
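
As an illustration, the two training stages can be sketched as follows; the self-critical-style baseline and the cider_score() helper are placeholders standing in for whatever the underlying captioning model and evaluation toolkit provide:

# Sketch of the two-stage training objective: cross-entropy, then CIDEr-based RL.
import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    # Stage 1: word-level cross-entropy (maximum likelihood) on ground-truth captions.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_idx)

def rl_loss(sample_logprobs, sampled_captions, greedy_captions, references, cider_score):
    # Stage 2: policy-gradient fine-tuning with CIDEr as the reward; the greedy
    # caption serves as the baseline (self-critical style).
    reward = torch.tensor([cider_score(c, r) for c, r in zip(sampled_captions, references)])
    baseline = torch.tensor([cider_score(c, r) for c, r in zip(greedy_captions, references)])
    advantage = (reward - baseline).to(sample_logprobs.device)
    return -(advantage.unsqueeze(-1) * sample_logprobs).mean()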

Results

With the complementary information provided by the retrieved text descriptions and the image conditioning, our method improves the baseline M2 (Meshed-Memory Transformer) model by +7.2% in CIDEr and +1.3% in BLEU-4, and compares favorably with all previous trained-from-scratch methods across all metrics.

When combined with VinVL, a stronger object detector pre-trained on a large corpus that combines multiple object detection datasets, our method still achieves better performance. This indicates that certain information is still missing from the stronger VinVL detector despite large-scale detector pre-training.

When combined with OSCAR, a larger captioning model with large-scale vision-and-language pre-training, our method again achieves better performance. This indicates that certain information is still missing even from a large-scale pre-trained captioning model.

Resources

@inproceedings{kuo2022pretrained,
  title={Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning},
  author={Chia-Wen Kuo and Zsolt Kira},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}