OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

Guan Wang1, Zhimin Li2, Qingchao Chen3, Yang Liu1

1Wangxuan Institute of Computer Technology, Peking University

2Tencent Inc.    3National Institute of Health Data Science, Peking University

Abstract

Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, to address another challenge of DSGG, capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design.

Method

Given the target frame and reference frames, OED directly generates scene graphs with spatial-temporal context in a set-prediction manner. First, a CNN backbone and a Transformer encoder are sequentially applied to extract the visual features of each frame. To extract and aggregate useful spatial context, we adopt a DETR-like architecture and associate learnable queries with the pair-wise features of candidate subject-object pairs. The pair-wise features then extract and aggregate spatial context in the Transformer decoder. To simultaneously improve the detection of blurred objects and the classification of predicates that depend on contextual frames, we introduce a progressively refined pair-wise feature interaction module (PRM), which selects and aggregates useful information from the reference frames into the pair-wise features of the target frame in a progressively refined way. PRM fuses this additional temporal context with the spatially aggregated pair-wise features of the target frame, yielding the final pair-wise features with spatial-temporal context. The pair-wise detection and predicate classification results form a list of triplets ⟨s, p, o⟩, which corresponds to the scene graph of the target frame.
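The following is a minimal PyTorch-style sketch of the pipeline described above. All names (OEDSketch, encode_frame, prm), the stand-in patchify backbone, and the single cross-attention layer used to model the PRM are illustrative assumptions rather than the released implementation; in particular, the actual PRM refines the target-frame pair features progressively rather than in one step.

```python
# Minimal sketch of the OED pipeline: per-frame spatial pair features via a
# DETR-like decoder, then temporal aggregation from reference frames.
# Module names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class OEDSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=36, num_predicates=26):
        super().__init__()
        # Stand-in for a CNN backbone: a 16x16 patchify convolution.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Learnable queries, each associated with one candidate subject-object pair.
        self.pair_queries = nn.Embedding(num_queries, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.spatial_decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        # PRM stand-in: cross-attention from target-frame pair features to
        # reference-frame pair features (the paper refines this progressively).
        self.prm = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Prediction heads producing the <subject, predicate, object> triplets.
        self.sub_box = nn.Linear(d_model, 4)
        self.obj_box = nn.Linear(d_model, 4)
        self.obj_cls = nn.Linear(d_model, num_classes)
        self.pred_cls = nn.Linear(d_model, num_predicates)

    def encode_frame(self, frame):
        # frame: (B, 3, H, W) -> visual tokens -> spatially aggregated pair features.
        tokens = self.backbone(frame).flatten(2).transpose(1, 2)   # (B, N, C)
        memory = self.encoder(tokens)
        queries = self.pair_queries.weight.unsqueeze(0).expand(frame.size(0), -1, -1)
        return self.spatial_decoder(queries, memory)               # (B, Q, C)

    def forward(self, target_frame, reference_frames):
        target_pairs = self.encode_frame(target_frame)
        ref_pairs = torch.cat([self.encode_frame(f) for f in reference_frames], dim=1)
        # Aggregate temporal context from reference frames into the target pair features.
        temporal, _ = self.prm(target_pairs, ref_pairs, ref_pairs)
        pairs = target_pairs + temporal
        return {
            "sub_boxes": self.sub_box(pairs).sigmoid(),
            "obj_boxes": self.obj_box(pairs).sigmoid(),
            "obj_logits": self.obj_cls(pairs),
            "pred_logits": self.pred_cls(pairs),
        }
```

As a usage sketch, `OEDSketch()(target, [ref1, ref2])` with frames of shape (B, 3, H, W) returns per-query subject/object boxes, object class logits, and predicate logits, which together form the predicted ⟨s, p, o⟩ triplets for the target frame.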

Publications

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation 

Guan Wang, Zhimin Li, Qingchao Chen, Yang Liu

CVPR, 2024

PDF  | Code

Bibtex

@inproceedings{oed_cvpr24,
  title={OED: Towards One-stage End-to-End Dynamic Scene Graph Generation},
  author={Guan Wang and Zhimin Li and Qingchao Chen and Yang Liu},
  year={2024},
  booktitle={CVPR},
  organization={IEEE}
}