Mining for meaning: from vision to language through multiple networks consensus

Abstract

Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level and about fluent form. Here we propose an approach to describing videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group of descriptions, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.

Paper

I. Duta, A. Nicolicioiu, S.V. Bogolin, M. Leordeanu, Mining for meaning: from vision to language through multiple networks consensus, arXiv 2018, accepted at BMVC 2018 - PDF

@inproceedings{duta2018mining,
  title={Mining for meaning: from vision to language through multiple networks consensus},
  author={Duta, Iulia and Nicolicioiu, Andrei Liviu and Bogolin, Simion-Vlad and Leordeanu, Marius},
  booktitle={British Machine Vision Conference (BMVC 2018)},
  year={2018}
}

Qualitative results

Here we present some qualitative results on the MSR-VTT dataset (J. Xu, T. Mei, T. Yao, and Y. Rui, Msr-vtt: A large video description dataset for bridging video and language, CVPR 2016).

Below you can find two short videos together with the captions produced by our system.

Below you can see some qualitative results showing 3 generated sentences from our models along with a few relevant human annotations. The generated sentences are fluent and related in content to the human annotations. Also, note how diverse the human annotations are, especially in form, while being highly meaningful.

Consensus vs. Ground Truth Ranking

Here we show qualitative examples of sentences generated by our 16 models for several videos. On the left side of each picture we show 4 frames sampled from the video. The right side of each picture is split into 3 cells: the upper cell shows the top 5 generated sentences sorted by consensus score (with the actual value shown at the start of each line), the middle cell lists the top 5 sentences ordered by CIDEr score with respect to the ground truth (actual value shown at the start of each line), and the bottom cell contains 5 randomly sampled human annotations.
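For illustration, the sketch below shows one way such a consensus re-ranking could be implemented: each candidate caption is scored by its average similarity to all other candidates, and the captions are ranked by this score. The n-gram overlap similarity used here is only a stand-in for the actual lexical similarity measure, and the code does not reproduce the exact two-phase procedure described in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of n-grams for a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def similarity(a, b, max_n=2):
    """Simple n-gram overlap similarity between two sentences (stand-in metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    score = 0.0
    for n in range(1, max_n + 1):
        ca, cb = ngrams(ta, n), ngrams(tb, n)
        overlap = sum((ca & cb).values())
        total = max(sum(ca.values()) + sum(cb.values()), 1)
        score += 2.0 * overlap / total
    return score / max_n


def consensus_rank(candidates):
    """Score each candidate by its average similarity to the other candidates
    and return all candidates sorted by this consensus score (highest first)."""
    scored = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        cons = sum(similarity(cand, o) for o in others) / max(len(others), 1)
        scored.append((cons, cand))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Hypothetical captions produced by different encoder-decoder models.
    captions = [
        "a man is playing a guitar on stage",
        "a person plays guitar",
        "a man is cooking in a kitchen",
        "a man plays a guitar in front of a crowd",
    ]
    for score, caption in consensus_rank(captions):
        print(f"{score:.3f}  {caption}")
```

In this toy example the outlier caption ("a man is cooking in a kitchen") receives the lowest consensus score, which is the intuition behind preferring the description that agrees most with the rest of the pool.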



Language reconstruction

Below we show results of the reconstruction branch of the Two-Wings model. This submodel receives as input a sentence from the annotations, applies a random permutation to the order of its words, removes half of them, and tries to reconstruct the original sentence. The first column contains the sentence to be reconstructed, the second column the remaining shuffled words used as input, and the third one the generated sentence. We can see that the generated captions are grammatically and semantically correct. Although the reconstruction does not match the target sentence, given only half of the words in a random order it would be practically impossible, even for a human, to recover the original sentence. But exact reconstruction is not the end goal of this branch: its main purpose is to learn to generate rich, diverse and coherent sentences. Consequently, in our experiments the Two-Wings Network produced on average more diverse sentences than the other network models.
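As an illustration, the input corruption described above could be implemented roughly as follows. The function name, the keep ratio and the seed handling are placeholders for exposition, not the code used to train the model.

```python
import random


def corrupt_sentence(sentence, keep_ratio=0.5, seed=None):
    """Shuffle the words of a sentence and keep only a fraction of them.

    This mimics the input given to the reconstruction branch: a random
    permutation of roughly half of the original words.
    """
    rng = random.Random(seed)
    words = sentence.split()
    kept = max(1, int(len(words) * keep_ratio))
    shuffled = words[:]
    rng.shuffle(shuffled)
    return shuffled[:kept]


if __name__ == "__main__":
    target = "a man is playing a guitar on stage"
    print("target:", target)
    print("input :", " ".join(corrupt_sentence(target, seed=0)))
```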

Two-Stage Network

In Figure 4 we present examples produced by the Two-Stage Model. The model consists of two parts: a first stage that predicts multiple labels for the video, followed by a second stage that generates a sentence based only on these labels. The model is initialized by training both parts independently and then fine-tuning them jointly. In each cell, the first row contains 3 frames sampled from the video, the second row contains results from the initial independent training and the third row contains the final results after fine-tuning. Within each row, the first column shows a ground-truth sentence, the second column shows the top K predicted labels with their corresponding probabilities and the last column shows the generated sentence.

Labels generated by our multi-label model have a high degree of accuracy. To improve the quality of the captions we fine-tuned the whole model end-to-end, obtaining significantly better results at the caption level. Since the end-to-end training puts a loss only on the final caption, with no intermediate loss on the multi-label prediction, the accuracy of the predicted labels decreases after fine-tuning while the quality of the generated captions improves. This is also visible in the qualitative examples in Figure 4: the fine-tuned, end-to-end model (third row) produces captions of better quality than the model with the two stages trained independently (second row), but is worse at predicting the intermediate word labels.
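To make the two-stage structure concrete, below is a minimal sketch of such a model in PyTorch. The layer sizes, the sigmoid multi-label head and the LSTM decoder conditioned only on the label probabilities are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn


class TwoStageCaptioner(nn.Module):
    """Illustrative two-stage captioner: video features -> label probabilities
    -> caption decoder that sees only the predicted labels."""

    def __init__(self, feat_dim=2048, num_labels=300, vocab_size=10000,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stage 1: multi-label prediction from pooled video features.
        self.label_net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )
        # Stage 2: LSTM decoder initialized from the label probabilities only.
        self.label_proj = nn.Linear(num_labels, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, captions):
        # video_feats: (batch, feat_dim); captions: (batch, seq_len) token ids.
        label_logits = self.label_net(video_feats)
        label_probs = torch.sigmoid(label_logits)
        # The decoder is conditioned only on the predicted labels.
        h0 = torch.tanh(self.label_proj(label_probs)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)
        hidden, _ = self.decoder(emb, (h0, c0))
        return label_logits, self.out(hidden)
```

Training the two parts independently corresponds to a multi-label loss on `label_logits` plus a caption loss on the decoder output, while end-to-end fine-tuning keeps only the caption loss; the latter is consistent with the drop in label accuracy discussed above.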

Team

Acknowledgments

This work was supported by Bitdefender and UEFISCDI, under projects PN-III-P4-ID-ERC-2016-0007 and PN-III-P2-2.1-PED-2016-1842.