We introduce the problem of predicting human-scene collisions from multi-view egocentric RGB videos captured by body-mounted cameras. Specifically, the problem consists of predicting: (1) whether a collision will happen within the next H seconds; (2) which body joints might be involved in the collision; and (3) which regions of the scene might cause the collision, in the form of a spatial heatmap.
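For concreteness, the three outputs can be organized as below; this is a minimal sketch of the prediction targets under assumed shapes (J tracked joints, an h × w heatmap), with illustrative names rather than the paper's actual interface.

```python
# Minimal sketch of the three prediction targets described above.
# Shapes (J joints, h x w heatmap) and field names are assumptions.
from dataclasses import dataclass

import torch


@dataclass
class CollisionPrediction:
    will_collide: torch.Tensor   # shape (1,): probability of a collision within H seconds
    joint_scores: torch.Tensor   # shape (J,): per-joint collision likelihoods
    scene_heatmap: torch.Tensor  # shape (h, w): spatial likelihood of collision-causing regions
```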
To solve this problem, we present COPILOT, a COllision PredIction and LOcalization Transformer that tackles all three sub-tasks jointly in a multi-task setting, effectively leveraging multi-view video inputs through a 4D attention operation across space, time, and viewpoint.
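One simple way to realize attention across space, time, and viewpoint is to flatten all three axes into a single token sequence and apply standard self-attention over it; the sketch below illustrates that idea under assumed tensor shapes, though COPILOT's actual 4D attention operation may be structured or factorized differently.

```python
# Simplified sketch of joint attention over space, time, and viewpoint:
# multi-view video tokens of shape (batch, views, time, patches, dim) are
# flattened into one sequence so every token can attend across all three axes.
# Illustration only; not COPILOT's exact attention design.
import torch
import torch.nn as nn


class SpaceTimeViewAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, T, S, D) -- V camera views, T time steps, S spatial patches
        B, V, T, S, D = x.shape
        tokens = x.reshape(B, V * T * S, D)  # one joint sequence over all axes
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        return (tokens + out).reshape(B, V, T, S, D)  # residual, restore layout


# Usage: four camera views, 8 frames, 49 patches each, 256-dim tokens
x = torch.randn(2, 4, 8, 49, 256)
y = SpaceTimeViewAttention()(x)
```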
To train and evaluate the model, we further develop a synthetic data pipeline that simulates virtual humans walking through, and possibly colliding with, photo-realistic 3D environments. We use this pipeline to build a large-scale dataset of ~8.6M egocentric RGBD frames.
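As one hypothetical illustration of how such a pipeline could produce supervision, simulated contact events might be converted into per-frame labels for the horizon H as below; the function name and event format are assumptions, not the paper's actual implementation.

```python
# Hypothetical labeling step: derive the binary collision flag and per-joint
# labels for a frame at time t from contact events recorded by the simulator.
def label_frame(
    t: float,
    contacts: list[tuple[float, int]],  # (contact_time, joint_id) pairs from simulation
    horizon: float,                     # prediction horizon H, in seconds
    num_joints: int,
) -> tuple[bool, list[int]]:
    # Joints that make contact within (t, t + H]
    upcoming = {j for (tc, j) in contacts if t < tc <= t + horizon}
    will_collide = len(upcoming) > 0
    joint_labels = [int(j in upcoming) for j in range(num_joints)]
    return will_collide, joint_labels
```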
Extensive experiments demonstrate COPILOT's promising performance, including under sim-to-real transfer. Notably, we also apply COPILOT to a downstream collision avoidance task and successfully reduce collision cases by 29% on scenes unseen during training.