Open-ended Hierarchical Streaming Video Understanding with Vision Language Models
(* indicates equal contributions)
Hyolim Kang *, Yunsu Park *, Youngbeom Yoo, Yeeun Choi, Seon Joo Kim
Yonsei University
ICCV 2025
Paper | Code | Poster
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the integration of powerful generative models as the future of streaming action perception, with OpenHOUSE representing a key step in that direction.
(a) Illustration of OpenHOUSE. The lightweight Streaming module processes every frame, while VLM inference is triggered selectively at T=3,4,8,10, indicated by the yellow bell icon.
(b) Diagram depicting how our hybrid action boundary detection works. For instance, at T=2, a sudden drop in the progression score signals the action end, marking the current frame as a background frame. At the next timestep, a high actionness score (=1.0) classifies the frame as an action frame, yielding a background-to-action transition that marks the start of Ins2.
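To make the mechanism in (a) and (b) concrete, here is a minimal per-frame sketch, assuming the streaming module outputs an actionness score and a progression score for each frame; the threshold values and the streaming_module / vlm_describe helpers are illustrative assumptions, not the released implementation.

from dataclasses import dataclass

@dataclass
class StreamState:
    in_action: bool = False          # currently inside an action instance?
    start_t: int | None = None       # start frame of the ongoing instance
    prev_progression: float = 0.0    # progression score at the previous frame

def process_frame(state, t, frame_feat, streaming_module, vlm_describe,
                  actionness_thr=0.5, drop_thr=0.4):
    # The lightweight streaming module runs on every frame.
    actionness, progression = streaming_module(frame_feat)
    caption = None
    if state.in_action:
        # A sudden drop in the progression score signals the action end
        # (action-to-background transition); only then is the VLM triggered.
        if state.prev_progression - progression > drop_thr:
            caption = vlm_describe(state.start_t, t)
            state.in_action, state.start_t = False, None
    elif actionness >= actionness_thr:
        # Background-to-action transition: a high actionness score starts a new instance.
        state.in_action, state.start_t = True, t
    state.prev_progression = progression
    return caption

This sketch covers only a single level of the hierarchy; the full system also issues VLM calls for higher-level events.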
Dataset generation pipeline. Given substep instances from I1 to I5, the LLM clusters these substeps into three steps, generating step descriptions, timestamps, and a goal description. Afterward, human validation adds the missing annotation I3 to the step covering I1 to I2 and revises the step description for the step covering I4 to I5.
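As a hedged illustration of the grouping stage above, the snippet below shows how an LLM call could turn substep annotations into step-level pseudo-labels; the prompt wording, JSON schema, and call_llm helper are our assumptions, and only the overall flow (cluster substeps into steps with descriptions, timestamps, and a goal, then validate by hand) follows the pipeline.

import json

def group_substeps(substeps, call_llm):
    # substeps: list of dicts like {"id": "I1", "start": 3.2, "end": 7.9, "text": "..."}
    prompt = (
        "Group the following atomic actions into a small number of higher-level steps. "
        "Return JSON with a 'goal' string and a 'steps' list, where each step has a "
        "description, start and end times, and the ids of its substeps.\n\n"
        + "\n".join(f'{s["id"]} [{s["start"]:.1f}-{s["end"]:.1f}s]: {s["text"]}' for s in substeps)
    )
    raw = call_llm(prompt)        # any chat-completion API
    return json.loads(raw)        # step pseudo-labels, later corrected by human validation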
Evaluation of pseudo-labels from our dataset pipeline. All F1 scores are calculated against the ground truth step labels from the Ego4D-GoalStep (EgoGS) dataset.
The first row shows that the LLM effectively groups atomic actions into higher-level steps.
The second row shows that models trained on our pseudo-labels perform competitively with those trained on the actual ground truth labels.
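For reference, a common way to compute such a segment-level F1 is one-to-one matching of predicted and ground truth step segments under a temporal-IoU threshold; the sketch below uses greedy matching and a 0.5 threshold, which are our assumptions and may differ from the exact evaluation protocol.

def segment_f1(pred, gt, iou_thr=0.5):
    # pred, gt: lists of (start, end) step segments in seconds
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    matched, tp = set(), 0
    for p in pred:
        candidates = [j for j in range(len(gt)) if j not in matched and iou(p, gt[j]) >= iou_thr]
        if candidates:
            matched.add(max(candidates, key=lambda j: iou(p, gt[j])))
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0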
Ablation study of Hybrid Action Boundary Detection. Since Ego-Exo4D Keystep (EgEx) and Epic-Kitchens 100 (EK100) lack ground truth step labels, experiments are conducted using pseudo-step labels.
Across all configurations and datasets, incorporating our hybrid method yields significant gains.
This highlights the importance of our approach for processing procedural videos, where many instances are strictly adjacent.
Comparison with baseline methods and results on various datasets. † indicates datasets with step pseudo-labels from our pipeline, and "Exo." denotes EgEx's Exo view. "GT proposal" refers to experiments using ground truth action proposals as VLM input, without class information. Zero-shot VLM inference achieves around 30% accuracy in matching ground truth labels. InternVL2-40B-AWQ is used as the frozen VLM. Following the original implementation, YT-Temporal-pretrained Vid2Seq weights are used to initialize SDVC, which is then fine-tuned on EgoGS annotations, with each hierarchy level trained separately. We use the default configuration of SDVC.
Qualitative results. For the GT substep caption "Place broccoli strainer in pot", OpenHOUSE provides "The person is placing the steamer basket with the rinsed broccoli onto the stove.", which aligns better with the actual visual content.
We observe that the generated captions exhibit high semantic similarity to the ground truth (GT) captions while providing even more detailed descriptions.
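One simple way to quantify this similarity is cosine similarity between sentence embeddings of the generated and GT captions; the embedding model below is an arbitrary choice and not necessarily the metric used in the paper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
gt = "Place broccoli strainer in pot"
pred = "The person is placing the steamer basket with the rinsed broccoli onto the stove."
emb = model.encode([gt, pred], convert_to_tensor=True)
print(float(util.cos_sim(emb[0], emb[1])))  # values near 1.0 indicate high semantic similarity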
Code and Datasets will be released soon. Stay tuned!
@inproceedings{2025openhouse,
title = {Open-ended Hierarchical Streaming Video Understanding with Vision Language Models},
author = {Hyolim Kang and Yunsu Park and Youngbeom Yoo and Yeeun Choi and Seon Joo Kim},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025},
pages = {to appear},
url = {https://sites.google.com/view/yunsupark/projects/openhouse}
}