A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang*, Taixi Lu*, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu,
Mohit Bansal, Gedas Bertasius
UNC Chapel Hill
Accepted to EMNLP 2024 (Main Conference)
We present LLoVi, a simple yet effective Language-based Long-range Video question-answering (LVQA) framework. Our method decomposes the short- and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8 seconds in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to answer a given question. Furthermore, we propose a novel multi-round summarization prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our framework. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. The proposed multi-round summarization prompt also leads to a significant LVQA performance boost. Our method achieves the best-reported results on the EgoSchema dataset, best known for very long-form video question-answering. LLoVi also outperforms the previous state-of-the-art by 10.2% and 6.2% on NExT-QA and IntentQA for LVQA. Finally, we extend LLoVi to grounded VideoQA, which requires both QA and temporal localization, and show that it outperforms all prior methods on NExT-GQA. Code is available at https://github.com/CeeZh/LLoVi.
Most existing video models are designed for short videos, so they struggle when applied to longer ones.
Long-video understanding models often rely on complex designs, e.g., memory queues or state-space models. These complex designs make the models hard to reproduce and build upon.
Recently, Large Language Models (LLMs) have shown impressive long-range reasoning capabilities on a wide range of tasks, such as document understanding and long-horizon planning.
Can we leverage LLMs to design a simple, training-free, yet effective model for long video understanding?
Stage 1: given a long video input, we segment it into multiple short clips and convert them into short textual descriptions using a pretrained frame/clip-level visual captioner (e.g., BLIP-2, LaViLa).
Stage 2: we concatenate the temporally ordered captions from Stage 1 and feed them into an LLM (e.g., GPT-3.5, GPT-4, LLaMA) to perform long-range reasoning for LVQA. A minimal sketch of both stages follows.
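The sketch below is a minimal illustration of this two-stage design; `caption_clip` and `call_llm` are hypothetical placeholders for the captioner and LLM calls, not the actual repository API (see the code release for the real implementation).

```python
# Minimal sketch of the two-stage pipeline. Helper names are hypothetical
# placeholders, not the actual LLoVi repository API.

def caption_clip(clip) -> str:
    """Stage 1 helper: run a pretrained short-term visual captioner
    (e.g., LaViLa, BLIP-2) on one short clip and return its description."""
    raise NotImplementedError  # placeholder for the captioner


def call_llm(prompt: str) -> str:
    """Stage 2 helper: query an LLM (e.g., GPT-4) with a text prompt."""
    raise NotImplementedError  # placeholder for the LLM API


def llovi_answer(clips: list, question: str, choices: list[str]) -> str:
    # Stage 1: densely caption the temporally ordered short clips.
    captions = [caption_clip(clip) for clip in clips]

    # Stage 2: concatenate the captions and ask the LLM to answer.
    prompt = (
        "Below are temporally ordered captions of short clips from a long video.\n"
        + "\n".join(captions)
        + f"\nQuestion: {question}\n"
        + "Choices: " + "; ".join(choices)
        + "\nSelect the most likely choice."
    )
    return call_llm(prompt)
```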
Our decomposed LVQA framework brings several important advantages. First, our approach is simple as it does not rely on complex/specialized long-range video modeling operators. Second, our framework is training-free. Third, our framework enables us to leverage the strong existing short-term visual captioners and powerful zero-shot LLMs. Fourth, our method is highly flexible, i.e., it can incorporate various visual captioners and LLMs, and also benefit from future improvements in visual captioning/LLM model design.
Many modern LLMs (e.g., GPT-3.5, LLaMA) may struggle when provided with long (>1K words), noisy, and potentially redundant/irrelevant caption sequences. To address these issues, we investigate more specialized LLM prompts that ask an LLM first to summarize the noisy short-term visual captions and then answer a given question about the video.
Round 1: prompt the LLM to summarize the raw captions. Optionally, we can input the given question and answer candidates to guide the summary generation.
Round 2: prompt the LLM to answer the given question based on the summary instead of the raw captions.
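A minimal sketch of these two rounds is shown below, reusing the hypothetical `call_llm` helper from the pipeline sketch; the prompt wording is illustrative only, not the exact prompt used in the paper.

```python
# Minimal sketch of the two-round (summarize, then answer) prompting scheme.
# `call_llm` is a hypothetical placeholder for the LLM API call.

def multi_round_answer(captions: list[str], question: str, choices: list[str]) -> str:
    caption_text = "\n".join(captions)

    # Round 1: summarize the noisy short-term captions; optionally include
    # the question so the summary keeps question-relevant details.
    summary = call_llm(
        "Summarize the following video captions, keeping details relevant to "
        f"the question: {question}\nCaptions:\n{caption_text}"
    )

    # Round 2: answer the question from the summary instead of the raw captions.
    return call_llm(
        f"Video summary:\n{summary}\n"
        f"Question: {question}\n"
        "Choices: " + "; ".join(choices)
        + "\nSelect the most likely choice."
    )
```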
The choice of the visual captioner has a significant impact on the model’s performance. We observe that LaViLa provides the best results, outperforming BLIP-2, EgoVLP, and LLaVA. The Oracle baseline with ground truth captions outperforms LaViLa captions by a large margin (10.8%).
The choice of the LLM also has a significant impact on the model's performance. Our results indicate that GPT-4 achieves the best performance (61.2%), followed by Llama-3-70B (56.8%) and GPT-3.5 (55.2%). Thus, stronger LLMs are better at long-range modeling. We also observe that, despite having far fewer parameters, Llama-3-8B (52.2%) and Mistral-7B (50.8%) still achieve competitive performance.
We divide the input long video into consecutive clips of varying length. The highest accuracy is achieved with the shortest clips, and performance degrades as clip length increases. This indicates that splitting long videos into shorter segments, particularly 1-second clips, is the most effective choice for accuracy.
We divide the input long video into consecutive 1-second short clips and study the effect of different clip sampling rates. Sampling clips every 1s achieves the best accuracy, while sampling clips every 8s achieves the best efficiency (8x fewer sampled clips) with only a 2.0% accuracy drop. This suggests that we can effectively control the accuracy-efficiency trade-off of our framework by varying the clip sampling rate.
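As a rough illustration of this knob, the sketch below enumerates 1-second clip start times at a configurable stride; the function name and interface are assumptions for illustration, not the repository API.

```python
# Illustrative sketch of the accuracy-efficiency knob: fix 1-second clips and
# vary the sampling stride. A larger stride means fewer captioner calls and a
# shorter LLM prompt, at some accuracy cost.

def sample_clip_starts(video_len_s: float, clip_len_s: float = 1.0,
                       stride_s: float = 1.0) -> list[float]:
    """Return start times of clip_len_s-second clips sampled every stride_s seconds."""
    starts, t = [], 0.0
    while t + clip_len_s <= video_len_s:
        starts.append(t)
        t += stride_s
    return starts

# For a 180 s video: stride 1 s -> 180 clips to caption; stride 8 s -> 23 clips.
```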
Three variants of the multi-round summarization prompt (a sketch of how they differ follows this list):
(C) → S: use captions to get the summary.
(C, Q) → S: use captions and the question to get the summary.
(C, Q, A) → S: use captions, the question and choices to get the summary.
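The sketch below shows how the three variants differ: only the inputs passed to the Round-1 summarization call change. The `round1_prompt` helper and its template wording are hypothetical, not the exact prompt used in the paper.

```python
# Sketch of the three Round-1 (summarization) prompt variants; only the
# inputs given to the summarization call change.

def round1_prompt(captions: str, question: str | None = None,
                  choices: str | None = None) -> str:
    prompt = f"Video captions:\n{captions}\n"
    if question is None:
        # (C) -> S: summarize from captions alone.
        return prompt + "Summarize the captions."
    if choices is None:
        # (C, Q) -> S: let the question guide the summary.
        return prompt + f"Summarize the captions, focusing on: {question}"
    # (C, Q, A) -> S: also expose the answer candidates.
    return (prompt
            + f"Summarize the captions, focusing on: {question}\n"
            + f"Answer candidates: {choices}")
```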
Our results indicate that the (C, Q) → S variant works the best, significantly outperforming (+3.6%) the standard prompt. We hypothesize that additional inputs in the form of a question Q enable the LLM to generate a summary S tailored to the given question.
We compare our multi-round summarization prompt with other commonly used prompts such as Zero-shot Chain-of-Thought and Plan-and-Solve. Our results indicate that our multi-round summarization prompt achieves the best performance among all of these prompts. Furthermore, we note that it outperforms the standard prompt by a substantial 3.6% in LVQA accuracy, thus indicating the effectiveness of our prompt design.
LLoVi achieves state-of-the-art zero-shot performance on multiple LVQA benchmarks, including EgoSchema, NExT-QA, and IntentQA. It also outperforms all prior methods on the grounded LVQA benchmark NExT-GQA.
Result tables: EgoSchema, IntentQA, NExT-QA, NExT-GQA.
@article{zhang2023simple,
title={A simple llm framework for long-range video question-answering},
author={Zhang, Ce and Lu, Taixi and Islam, Md Mohaiminul and Wang, Ziyang and Yu, Shoubin and Bansal, Mohit and Bertasius, Gedas},
journal={arXiv preprint arXiv:2312.17235},
year={2023}
}