Beyond a Pre-Trained Object Detector:
Cross-Modal Textual and Visual Context for Image Captioning

Georgia Institute of Technology

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and, later, on fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all the necessary information is unrealistic, especially when the detector is transferred across datasets.

In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, the object detector outputs are fixed due to a frozen model and hence do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively that this can improve grounding.

We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.

What's Missing?


Most existing works model the image captioning problem with the graphical model shown in the top-left figure: given an input image X, a frozen, pre-trained object detector detects a set of objects O, and the caption Y is generated conditioned only on O.

From the graphical model, we can identify two major issues that arise from encoding the input image with only a frozen, pre-trained object detector: (1) the caption is conditioned solely on the detected objects O, which cannot capture all of the information needed for captioning (e.g., attributes and relationships between objects), especially when the detector is transferred across datasets; and (2) because the detector is frozen, its outputs are fixed and cannot be refined with respect to the input image X, which makes it harder for the captioning model to properly ground them.
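
To make these issues concrete, the conditional structure can be sketched as follows (illustrative notation, not taken verbatim from the paper; Det is the frozen detector, CLIP_img and CLIP_txt are the frozen CLIP encoders, and FC denotes the fully connected fusion layers described in the next section):

Baseline:  p(Y | X) ≈ p(Y | O),        where O = Det(X) is fixed
Ours:      p(Y | X) ≈ p(Y | O', D'),   where D = CLIP-Retrieve(X),
           O' = FC([O ; CLIP_img(X)]),  D' = FC([CLIP_txt(D) ; CLIP_img(X)])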

Proposed Method

Retrieved Text Descriptions

We parse the region-level textual descriptions from the Visual Genome dataset to construct the description database.
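
As a rough sketch, the database construction can look like the following (the file name region_descriptions.json and the "regions"/"phrase" fields follow the public Visual Genome release; the minimum-length filter is our own illustrative choice, not necessarily the paper's exact preprocessing):

# Build the description database from Visual Genome region descriptions.
import json

def build_description_database(path="region_descriptions.json", min_words=2):
    with open(path) as f:
        images = json.load(f)                     # one entry per Visual Genome image
    descriptions = set()
    for image in images:
        for region in image["regions"]:
            phrase = region["phrase"].strip().lower()
            if len(phrase.split()) >= min_words:  # drop trivial one-word phrases
                descriptions.add(phrase)
    return sorted(descriptions)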

We leverage CLIP to perform cross-modal retrieval between the input image and the textual descriptions in the description database. To retrieve finer-grained descriptions, we also use five-crop and nine-crop views of the image as additional queries.

After retrieving the set of text descriptions, we encode them with the text encoder from CLIP before passing them to the captioning model.
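
A minimal sketch of the retrieval-and-encoding step with the public CLIP package is shown below; the number of retrieved descriptions per query (k) and the way the image crops are generated are placeholders, not the paper's exact settings:

# Cross-modal retrieval of text descriptions with CLIP.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def retrieve_descriptions(image_views, descriptions, text_features, k=5):
    # image_views:   (V, 3, 224, 224) full image plus its five/nine crops, preprocessed
    # text_features: (N, D) pre-computed, L2-normalized CLIP text embeddings
    image_features = model.encode_image(image_views.to(device)).float()
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    sims = image_features @ text_features.to(device).float().T   # (V, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices                          # top-k descriptions per view
    retrieved = list(dict.fromkeys(descriptions[i] for i in topk.flatten().tolist()))
    # Re-encode the retrieved descriptions with CLIP's text encoder so they can be
    # fed to the captioning model as feature vectors.
    tokens = clip.tokenize(retrieved, truncate=True).to(device)
    return retrieved, model.encode_text(tokens)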

Model Architecture

We propose to model and strengthen the conditional relationship between the detected objects and the input image so that the features computed by the object detector can be refined before being sent to the captioning model. For this purpose, we first encode the input image into a global representation with the pre-trained image encoder from CLIP. We then concatenate each detected object feature with the CLIP-encoded global image feature along the feature dimension and pass the concatenated feature through an FC layer.

Since the text descriptions are also retrieved offline by a frozen, pre-trained CLIP model, we strengthen the conditional relationship between the retrieved textual descriptions and the input image in the same way.
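
A minimal PyTorch sketch of this conditioning step is shown below; the feature dimensions (2048-d detector features, 512-d CLIP embeddings) are typical placeholder values, not necessarily the paper's exact sizes:

# Condition detected-object and retrieved-description features on the global CLIP image feature.
import torch
import torch.nn as nn

class ImageConditioner(nn.Module):
    def __init__(self, feat_dim, clip_dim=512, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim + clip_dim, out_dim)

    def forward(self, features, clip_image_feature):
        # features:           (B, N, feat_dim)  detected objects or text descriptions
        # clip_image_feature: (B, clip_dim)     global CLIP image embedding
        expanded = clip_image_feature.unsqueeze(1).expand(-1, features.size(1), -1)
        fused = torch.cat([features, expanded], dim=-1)   # concat along the feature dimension
        return self.fc(fused)

# One instance for the detected objects, another for the retrieved descriptions.
condition_objects = ImageConditioner(feat_dim=2048)
condition_descriptions = ImageConditioner(feat_dim=512)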

The enhanced detected objects and textual descriptions serve as the input to an existing image captioning model without any change to that model. The captioning model can then be trained with the commonly used cross-entropy (maximum likelihood) loss for word prediction and fine-tuned with an RL loss using the CIDEr score as the reward, in the same way as before.
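
As an illustration, the two training stages can be sketched as follows; the self-critical-style baseline and the cider_score() helper are placeholders standing in for whatever the underlying captioning model and evaluation toolkit provide:

# Sketch of the two-stage training objective: cross-entropy, then CIDEr-based RL.
import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    # Stage 1: word-level cross-entropy (maximum likelihood) on ground-truth captions.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_idx)

def rl_loss(sample_logprobs, sampled_captions, greedy_captions, references, cider_score):
    # Stage 2: policy-gradient fine-tuning with CIDEr as the reward; the greedy
    # caption serves as the baseline (self-critical style).
    reward = torch.tensor([cider_score(c, r) for c, r in zip(sampled_captions, references)])
    baseline = torch.tensor([cider_score(c, r) for c, r in zip(greedy_captions, references)])
    advantage = (reward - baseline).to(sample_logprobs.device)
    return -(advantage.unsqueeze(-1) * sample_logprobs).mean()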

Results

With the complementary information provided by the retrieved text descriptions and the image conditioning, our method improves the baseline M2 (Meshed-Memory Transformer) model by +7.2% in CIDEr and +1.3% in BLEU-4, and compares favorably with all previous trained-from-scratch methods across all metrics.

When combined with VinVL, a stronger object detector pre-trained on a large corpus that combines multiple object detection datasets, our method still achieves better performance. This indicates that certain information is still missing from the stronger VinVL detector despite large-scale detector pre-training.

When combined with OSCAR, a larger captioning model with large-scale vision-and-language pre-training, our method again achieves better performance. This indicates that certain information is still missing even from a large-scale pre-trained captioning model.

Resources

@inproceedings{kuo2022pretrained,
  title={Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning},
  author={Chia-Wen Kuo and Zsolt Kira},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}