COPILOT: Human-Environment Collision Prediction and Localization from Multi-view Egocentric Videos
Boxiao Pan, Bokui Shen*, Davis Rempe*, Despoina Paschalidou, Kaichun Mo, Yanchao Yang, Leonidas J. Guibas
(* Equal contribution)
Stanford University · NVIDIA · The University of Hong Kong
International Conference on Computer Vision (ICCV), 2023
Introduction
We propose the problem of predicting human-scene collisions from multi-view egocentric RGB videos captured by body-mounted cameras. Specifically, the problem consists of predicting: (1) whether a collision will happen in the next H seconds; (2) which body joints might be involved in the collision; and (3) where in the scene the collision might occur, expressed as a spatial heatmap.
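The three sub-tasks map naturally onto a multi-task output structure: a binary collision logit, per-joint logits, and a spatial heatmap. As a minimal sketch (the joint set, heatmap resolution, and head design here are hypothetical, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

NUM_JOINTS = 5      # hypothetical joint grouping; the paper's exact set may differ
HEATMAP_HW = (7, 7)  # hypothetical heatmap resolution

class CollisionHeads(nn.Module):
    """Sketch of multi-task heads for the three sub-tasks:
    (1) will a collision occur, (2) which joints, (3) scene heatmap."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.collision = nn.Linear(feat_dim, 1)        # binary collision logit
        self.joints = nn.Linear(feat_dim, NUM_JOINTS)  # per-joint logits
        h, w = HEATMAP_HW
        self.heatmap = nn.Linear(feat_dim, h * w)      # flattened heatmap logits

    def forward(self, feat):
        h, w = HEATMAP_HW
        return (
            self.collision(feat).squeeze(-1),          # (batch,)
            self.joints(feat),                         # (batch, NUM_JOINTS)
            self.heatmap(feat).view(-1, h, w),         # (batch, h, w)
        )

feat = torch.randn(2, 128)  # pooled video feature, e.g. from a transformer
collision_logit, joint_logits, heatmap_logits = CollisionHeads()(feat)
```

Training such heads jointly (e.g. with binary cross-entropy on the first two and a pixel-wise loss on the heatmap) is the usual multi-task setup; the specific losses used by COPILOT are described in the paper.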
To solve this problem, we present COPILOT, a COllision PredIction and LOcalization Transformer that tackles all three sub-tasks in a multi-task setting, effectively leveraging multi-view video inputs through a proposed 4D attention operation across space, time, and viewpoint.
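One common way to realize attention over space, time, and viewpoint is to factorize it, attending along one axis at a time. The sketch below illustrates this idea; it is an assumed factorization for exposition, not necessarily the exact 4D attention used by COPILOT:

```python
import torch
import torch.nn as nn

class Factorized4DAttention(nn.Module):
    """Illustrative attention over tokens of shape
    (batch, views, time, patches, dim), applied along each axis in turn.
    A hypothetical factorization; the paper's exact operation may differ."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, v, t, p, d = x.shape
        # Attend across viewpoints: fold (batch, time, patch) into the batch axis.
        xv = x.permute(0, 2, 3, 1, 4).reshape(b * t * p, v, d)
        xv, _ = self.view_attn(xv, xv, xv)
        x = xv.reshape(b, t, p, v, d).permute(0, 3, 1, 2, 4)
        # Attend across time.
        xt = x.permute(0, 1, 3, 2, 4).reshape(b * v * p, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, v, p, t, d).permute(0, 1, 3, 2, 4)
        # Attend across spatial patches.
        xs = x.reshape(b * v * t, p, d)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, v, t, p, d)

tokens = torch.randn(2, 4, 8, 16, 32)  # batch, views, time, patches, dim
out = Factorized4DAttention(dim=32)(tokens)
```

Factorizing the attention keeps the cost linear in the number of axes rather than attending over all views × frames × patches jointly, which is why this pattern is popular for video transformers.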
To train and evaluate the model, we further develop a synthetic data pipeline that simulates virtual humans walking and possibly colliding in photo-realistic 3D environments. This pipeline is then used to establish a large-scale dataset consisting of ~8.6M egocentric RGBD frames.
We perform extensive experiments that demonstrate COPILOT's promising performance, especially on sim-to-real transfer. Notably, we also apply COPILOT to a downstream collision avoidance task, and successfully reduce collision cases by 29% on scenes unseen during training.
Video Presentation
Data examples
The third-person view is not provided to the model.
Sim-to-real transfer
Per-frame collision predictions are overlaid on the observation videos.
Model predictions in simulation
Per-frame collision predictions are overlaid on the third-person rendering.
Collision avoidance assistance
Uncolored meshes show the motion history; orange is the original future, and blue is the future with collision avoidance assistance.