RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan Md Mohaiminul Islam Thomas Seidl Gedas Bertasius
LMU Munich, MCML and UNC-Chapel Hill
Accepted by ECCV 2024
[Paper] [Code] [PapersWithCode]
We propose RGNet, a novel architecture for fine-grained moment understanding and reasoning in long videos (20–120 minutes). Given a textual query, it predicts the temporal boundary of the corresponding moment in an hour-long video. Its core component is a novel transformer encoder, RG-Encoder, which unifies the clip retrieval and grounding stages through shared features and mutual optimization. This enables processing long videos at multiple levels of granularity, e.g., clips and frames. RGNet surpasses prior methods, achieving state-of-the-art performance on the long-video temporal grounding datasets MAD and Ego4D-NLQ.
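To make the two-stage idea concrete, here is a minimal, self-contained sketch (not the actual RGNet implementation) of retrieval followed by grounding: stage one scores each clip against the query and picks the best clip, stage two scores individual frames inside that clip to localize the moment boundary. The feature vectors, the mean-pooled clip score, and the 0.5 threshold are all illustrative assumptions; RGNet instead learns these jointly with its RG-Encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def ground_moment(query, clips, thresh=0.5):
    """Two-stage moment grounding sketch:
    (1) retrieve the clip whose frames are on average most similar to
        the query;
    (2) keep the span of frames inside that clip whose similarity
        exceeds `thresh` (an illustrative hyperparameter).
    Returns (clip_index, start_frame, end_frame)."""
    # Stage 1: clip retrieval -- mean frame similarity per clip.
    clip_scores = [
        sum(cosine(query, f) for f in frames) / len(frames)
        for frames in clips
    ]
    best = max(range(len(clips)), key=lambda i: clip_scores[i])

    # Stage 2: frame-level grounding inside the retrieved clip.
    frame_scores = [cosine(query, f) for f in clips[best]]
    hits = [i for i, s in enumerate(frame_scores) if s >= thresh]
    start, end = (hits[0], hits[-1]) if hits else (0, len(frame_scores) - 1)
    return best, start, end

# Toy 2-D features: the query matches clip 1, frames 1-2.
query = [1.0, 0.0]
clips = [
    [[0.0, 1.0], [0.1, 1.0]],              # clip 0: off-topic
    [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]],  # clip 1: moment at frames 1-2
]
print(ground_moment(query, clips))  # -> (1, 1, 2)
```

In a pipeline that runs the two stages separately, an error in retrieval cannot be corrected by grounding; RGNet's motivation for the shared RG-Encoder is precisely to optimize both stages together rather than in isolation.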