Zhu Xu1 Ting Lei1 Zhimin Li2 Guan Wang3 Qingchao Chen4 Yuxin Peng1 Yang Liu1*
1Wangxuan Institute of Computer Technology, Peking University
2Tencent Inc. 3Baidu Inc. 4National Institute of Health Data Science, Peking University
ICCV 2025
*Corresponding Author
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages in-domain knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps highlighting both object regions and interactive areas, making the attention maps relation-aware. We then propose an Inter-frame Attention Augmentation strategy that exploits neighboring frames and optical flow information to enhance these attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware in-domain knowledge for WS-DSGG. (2) Dual-stream Fusion Module: we integrate the category-specific attention maps into external detections to refine object localization and boost the confidence scores of object proposals. Extensive experiments demonstrate that TRKT significantly improves detection performance, providing more accurate and confident pseudo labels for WS-DSGG training.
TRKT comprises two integral phases. In the Relation-aware Knowledge Mining phase, the Object and Relation Class Decoders separately generate attention maps focusing on object-specific and relation-specific semantic regions, which are then fused to construct class-sensitive attention maps. In addition, Inter-frame Attention Augmentation (IAA) warps the previous frame's attention maps using cross-frame optical flow to generate motion-aware pseudo attention maps. The Dual-stream Fusion Module then uses the class-sensitive attention maps to refine the external detection results: its Localization Refinement Module improves bounding-box accuracy, while its Confidence Boosting Module raises the confidence scores of object proposals through attention projection. The refined detections are used to generate pseudo scene graphs for DSGG model training.
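The warping and confidence-boosting steps above can be sketched roughly as follows. This is a minimal illustration only: the paper's actual modules are learned, and the function names, array shapes, and the mixing weight `alpha` are our assumptions, not the published implementation.

```python
import numpy as np

def warp_attention(prev_attn, flow):
    """Backward-warp a previous-frame attention map with optical flow
    (nearest-neighbor sampling). prev_attn: (H, W); flow: (H, W, 2) in pixels.
    This approximates IAA's motion-aware pseudo attention maps."""
    H, W = prev_attn.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return prev_attn[src_y, src_x]

def boost_confidence(score, attn, box, alpha=0.5):
    """Raise a proposal's confidence using the mean attention inside its box
    (a stand-in for the attention-projection idea; alpha is hypothetical)."""
    x1, y1, x2, y2 = box
    region = attn[y1:y2, x1:x2]
    attn_score = float(region.mean()) if region.size else 0.0
    return (1 - alpha) * score + alpha * attn_score
```

For example, a low-confidence external proposal whose box covers a high-attention region would have its score pulled upward, making it a more reliable pseudo label.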
Performance comparison with the baseline on the Action Genome dataset for object detection.
Performance comparison with state-of-the-art methods on the Action Genome dataset for WS-DSGG.
Visualization of external object detections, class-sensitive token attention maps, and final detection results.
Visualization comparison of dynamic scene graphs generated by the baseline (PLA) and our TRKT.
Bibtex
@inproceedings{xu2025graph,
title={Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced In-domain Knowledge Transferring},
author={Zhu Xu and Ting Lei and Zhimin Li and Guan Wang and Qingchao Chen and Yuxin Peng and Yang Liu},
year={2025},
booktitle={ICCV},
organization={IEEE}
}