Video captioning has received increasing attention recently. It is essentially a multimodal task in which both auditory and visual content contribute to caption generation. Existing works have focused on either better video-to-language models or stronger audio-visual fusion, leaving underexplored the basic questions of to what extent different modalities contribute to a particular sentence and, furthermore, to individual words in that sentence. In this paper, we make the first attempt to design an interpretable and controllable audio-visual video captioning network to address these questions. Modality interpretability is achieved by first learning deep multimodal embedding features via separate audio-text and visual-text associations and then fusing the two via an attention-based weighting mechanism for word generation, where the aggregated weights indicate the respective modality contributions. Such an interpretable design allows us to generate diverse, controllable audio-visual sentences. By directly manipulating the aggregated modality-specific attention weights, the proposed network can produce audio-only, visual-only, and controllable audio-visual sentences. Extensive experiments demonstrate that the proposed framework offers both interpretability and controllability in audio-visual video captioning while achieving performance competitive with state-of-the-art video captioning methods.
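The attention-based weighting described above can be sketched as follows. This is not the authors' implementation; it is a minimal illustration, assuming scalar attention scores obtained from learned projection vectors `w_a` and `w_v`, a two-way softmax producing the aggregated modality weights, and an optional controller value in [0, 1] that overrides the learned audio weight.

```python
import numpy as np

def modality_aware_fusion(audio_feat, visual_feat, w_a, w_v, controller=None):
    """Illustrative sketch of modality-aware aggregation.

    Computes a scalar attention score per modality, normalizes the two
    scores with a softmax to obtain aggregated modality weights, and
    returns the weighted fusion. `controller`, if given, replaces the
    learned audio weight so the caption can be steered toward one
    modality (0 = visual-only, 1 = audio-only).
    """
    # Scalar attention scores for each modality (assumed linear scoring).
    score_a = float(audio_feat @ w_a)
    score_v = float(visual_feat @ w_v)
    # Two-way softmax -> aggregated modality weights that sum to 1.
    exp_a, exp_v = np.exp(score_a), np.exp(score_v)
    alpha_a = exp_a / (exp_a + exp_v)
    alpha_v = 1.0 - alpha_a
    if controller is not None:
        # Audio-visual controller directly sets the modality balance.
        alpha_a, alpha_v = float(controller), 1.0 - float(controller)
    fused = alpha_a * audio_feat + alpha_v * visual_feat
    return fused, (alpha_a, alpha_v)
```

Because the two weights sum to one, reading off `(alpha_a, alpha_v)` at each decoding step gives the per-word modality contribution, which is what makes the design interpretable.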
The proposed audio-visual interpretable and controllable video captioning framework. During testing, words in the sentence are predicted one by one. The input video frames contain only video game content, but there is the sound of a man speaking in the audio channel. The word man is inferred from the activated auditory modality, while the words playing and minecraft come mainly from the visual modality. The modality selection decision is made based on the values of the audio and visual activation energies. An audio-visual controller in the modality-aware aggregation module balances the importance of the audio and visual modalities during sentence prediction.
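The modality selection decision in the caption above can be sketched as a simple comparison of the per-word activation energies. This is an assumed interface, not the paper's code: each generated word is tagged with the modality whose aggregated attention weight is larger.

```python
def label_words(words, weights):
    """Illustrative sketch: tag each generated word with the modality
    whose aggregated attention weight (activation energy) dominates.

    `words` is the predicted sentence as a list of tokens; `weights` is
    a parallel list of (alpha_audio, alpha_visual) pairs collected at
    each decoding step.
    """
    labels = []
    for word, (alpha_a, alpha_v) in zip(words, weights):
        labels.append((word, "audio" if alpha_a > alpha_v else "visual"))
    return labels
```

For the example in the figure, a word like man would carry a dominant audio weight and be labeled audio-activated, while playing and minecraft would be labeled visual-activated.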
MMCNN and masked convolution. The residual network in the proposed MMCNN and an example illustrating the masked convolution operation in residual units.
Audio-visual interpretable video captioning results with modality selection visualizations. Audio-activated words and visual-activated words are highlighted in red and blue, respectively.
Audio-visual controllable video captioning results. During testing, we use a single trained model, setting the audio-visual controller to 0, 0.1, ..., 1 to generate different captions.
J. Goodman and M. Moore did this work during their REU program at the University of Rochester. This work was supported in part by NSF IIS 1741472, IIS 1813709, and CHE 1764415. This article solely reflects the opinions and conclusions of its authors and not those of the funding agencies.