Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

UNC Chapel Hill and Meta AI

Accepted to CVPR 2024

[Paper]    [Code]   [Dataset]   [Demo]   [HF]

We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. Our model uses a recursive video-language architecture, which enables it to exploit the synergy between different video hierarchies and process hour-long videos efficiently. Moreover, we introduce Ego4D-HCap, a hierarchical video captioning dataset created by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as long-range VideoQA on EgoSchema.

Hierarchical Video Captioning Task

We aim to generate hierarchical captions for a long-range video at three temporal granularities: short clip captions every few seconds focusing on specific human actions, medium-length segment descriptions every few minutes capturing intermediate steps within activities or storylines, and a long-range summary depicting the overall intent and goals of the actors.
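
To make the three granularities concrete, the sketch below shows one way a video's hierarchical captions could be represented in Python. The field names and timestamps are illustrative only and do not reflect the actual Ego4D-HCap annotation schema.

# Illustrative structure for one video's hierarchical captions.
# Field names and values are hypothetical, not the Ego4D-HCap schema.
hierarchical_captions = {
    "clip_captions": [            # every few seconds: atomic actions
        {"start": 0.0, "end": 4.0, "text": "C picks up a knife."},
        {"start": 4.0, "end": 9.0, "text": "C chops an onion on the cutting board."},
    ],
    "segment_descriptions": [     # every few minutes: intermediate steps
        {"start": 0.0, "end": 180.0, "text": "C prepares vegetables for cooking."},
    ],
    "video_summary":              # whole video: overall intent and goals
        "C cooks a meal in the kitchen, preparing and frying the ingredients.",
}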

Methodology

(Left) First, we generate captions for each short clip of the video using dense spatiotemporal features. (Middle) Video ReCap then produces segment descriptions using sparsely sampled features (e.g., CLS features) and the previously generated clip captions within each segment. (Right) Finally, the model generates the full video summary by utilizing sparsely sampled CLS features from the entire video together with the previously generated segment descriptions.
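
The recursion can be summarized in a short Python sketch. The stub functions below (caption_clip, describe_segment, summarize_video, and the two feature extractors) are hypothetical placeholders that only illustrate the data flow between the three levels, not the released model API.

from typing import List, Tuple

# Stubs standing in for the actual video encoder and text decoder;
# they only illustrate the data flow, not the real model.
def dense_video_features(clip):
    return clip  # dense spatiotemporal features of a short clip

def sparse_cls_features(span):
    return span  # sparsely sampled CLS features over a longer span

def caption_clip(feats) -> str:
    return "<clip caption>"

def describe_segment(feats, clip_captions: List[str]) -> str:
    return "<segment description>"

def summarize_video(feats, segment_descriptions: List[str]) -> str:
    return "<video summary>"

def video_recap(video_clips: List, segments: List[List[int]]) -> Tuple:
    """Hierarchical inference: clips -> segments -> full-video summary.

    `video_clips` is a list of short clips; `segments` lists, for each
    segment, the indices of the clips it covers (hypothetical interface).
    """
    # Level 1: caption every short clip from dense spatiotemporal features.
    clip_captions = [caption_clip(dense_video_features(c)) for c in video_clips]

    # Level 2: each segment description conditions on sparse CLS features
    # and the clip captions generated for that segment at level 1.
    segment_descriptions = [
        describe_segment(
            sparse_cls_features([video_clips[i] for i in seg]),
            [clip_captions[i] for i in seg],
        )
        for seg in segments
    ]

    # Level 3: the summary conditions on sparse CLS features of the whole
    # video and all previously generated segment descriptions.
    summary = summarize_video(sparse_cls_features(video_clips), segment_descriptions)
    return clip_captions, segment_descriptions, summary

Because levels 2 and 3 operate on sparsely sampled features plus already-generated text, rather than dense features over the full video, the recursion keeps hour-long inputs tractable.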

Results

Results on the Ego4D-HCap dataset

First, we observe that finetuned methods perform significantly better than the zero-shot baselines. Second, Video ReCap achieves the best results in video captioning across all three hierarchies, surpassing strong prior baselines such as LaViLa. Third, using LLM-generated pseudo annotations leads to a significant boost in performance. Lastly, the unified variant of the model produces competitive results while having significantly fewer trainable parameters than our standard variant.

Long-Range VideoQA on EgoSchema

Video ReCap achieves state-of-the-art results, outperforming the previous best method, InternVideo, by a substantial margin of 18.13%. Furthermore, leveraging the hierarchical captions produced by our model leads to a 5.96% boost in performance compared to using the captions generated by LaViLa.
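
Concretely, the hierarchical captions can serve as a text interface to an LLM for EgoSchema-style multiple-choice QA: the captions are flattened into a prompt together with the question and answer choices. The helper below is a hypothetical sketch of such prompt construction; the template is illustrative, not the exact one used in the paper.

from typing import List

def build_qa_prompt(clip_captions: List[str],
                    segment_descriptions: List[str],
                    summary: str,
                    question: str,
                    choices: List[str]) -> str:
    """Flatten hierarchical captions into an LLM prompt for multiple-choice QA.
    The prompt template is illustrative, not the paper's exact setup."""
    lines = ["Video summary: " + summary, "", "Segment descriptions:"]
    lines += [f"- {d}" for d in segment_descriptions]
    lines += ["", "Clip captions:"]
    lines += [f"- {c}" for c in clip_captions]
    lines += ["", "Question: " + question]
    lines += [f"({i}) {c}" for i, c in enumerate(choices)]
    lines += ["Answer with the number of the correct choice."]
    return "\n".join(lines)

The resulting prompt is then passed to an LLM, which selects one of the answer choices.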

Qualitative Results on Ego4D-HCap

Qualitative examples (a), (b), and (c).

Generally, clip captions depict atomic actions and objects, segment descriptions focus on intermediate concepts, and video summaries encapsulate the overall content and goals of the videos. While generating clip captions and segment descriptions is relatively easy, producing a good video summary is often challenging. Our model performs well on video summaries (a) and (b), but the generated summary in (c) could be further improved.

BibTeX

@article{islam2024video,
  title={Video ReCap: Recursive Captioning of Hour-Long Videos},
  author={Islam, Md Mohaiminul and Ho, Ngan and Yang, Xitong and Nagarajan, Tushar and Torresani, Lorenzo and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2402.13250},
  year={2024}
}