RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan Md Mohaiminul Islam Thomas Seidl Gedas Bertasius
LMU Munich, MCML and UNC-Chapel Hill
Accepted by ECCV 2024
[Paper] [Code] [PapersWithCode]
We propose RGNet, a novel architecture for fine-grained moment understanding and reasoning in long videos (20–120 minutes). Given a textual query, it predicts the temporal boundary of the corresponding moment in an hour-long video. Its core component is a novel transformer encoder, RG-Encoder, which unifies the clip retrieval and grounding stages through shared features and mutual optimization. This enables processing long videos at multiple levels of granularity, e.g., clips and frames. RGNet surpasses prior methods, achieving state-of-the-art performance on the long-video temporal grounding datasets MAD and Ego4D-NLQ.
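To make the two-stage idea concrete, here is a minimal, self-contained sketch (not the actual RGNet implementation) of retrieval followed by grounding: stage one scores each clip against the query and picks the best clip, stage two scores individual frames inside that clip to localize the moment boundary. The feature vectors, the mean-pooled clip score, and the 0.5 threshold are all illustrative assumptions; RGNet instead learns these jointly with its RG-Encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def ground_moment(query, clips, thresh=0.5):
    """Two-stage moment grounding sketch:
    (1) retrieve the clip whose frames are on average most similar to
        the query;
    (2) keep the span of frames inside that clip whose similarity
        exceeds `thresh` (an illustrative hyperparameter).
    Returns (clip_index, start_frame, end_frame)."""
    # Stage 1: clip retrieval -- mean frame similarity per clip.
    clip_scores = [
        sum(cosine(query, f) for f in frames) / len(frames)
        for frames in clips
    ]
    best = max(range(len(clips)), key=lambda i: clip_scores[i])

    # Stage 2: frame-level grounding inside the retrieved clip.
    frame_scores = [cosine(query, f) for f in clips[best]]
    hits = [i for i, s in enumerate(frame_scores) if s >= thresh]
    start, end = (hits[0], hits[-1]) if hits else (0, len(frame_scores) - 1)
    return best, start, end

# Toy 2-D features: the query matches clip 1, frames 1-2.
query = [1.0, 0.0]
clips = [
    [[0.0, 1.0], [0.1, 1.0]],              # clip 0: off-topic
    [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]],  # clip 1: moment at frames 1-2
]
print(ground_moment(query, clips))  # -> (1, 1, 2)
```

In a pipeline that runs the two stages separately, an error in retrieval cannot be corrected by grounding; RGNet's motivation for the shared RG-Encoder is precisely to optimize both stages together rather than in isolation.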