Architecture of the pose estimation network.
Point-of-view videos recorded by augmented reality glasses contain jitter because they are captured while users move through varying environments. Stabilizing such videos is difficult because conventional keypoint-based motion estimation is sensitive to environmental conditions: keypoint tracking is prone to failure in low-texture or dark environments.
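For concreteness, the keypoint-based baseline can be sketched as follows. This is a minimal illustration using OpenCV's ORB detector with RANSAC homography fitting, not code from the paper; the early exits mark exactly where low-texture or dark frames break the pipeline.

```python
# Minimal sketch of a conventional keypoint-based motion estimator
# (the baseline, not the paper's method): ORB features matched between
# frames, with a homography fitted via RANSAC. In low-texture or dark
# frames, too few keypoints are detected and the estimate degenerates.
import cv2
import numpy as np

def keypoint_motion(prev_gray: np.ndarray, curr_gray: np.ndarray):
    """Estimate frame-to-frame motion as a homography, or None on failure."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None or len(kp1) < 4 or len(kp2) < 4:
        return None  # too few keypoints: typical in dark/low-texture scenes
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    if len(matches) < 4:
        return None  # not enough correspondences to fit a homography
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H  # may still be None or unreliable given few or bad matches
```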
To overcome this limitation, we propose a neural network-based motion estimation method for video stabilization. Our network predicts frame-to-frame motion with high accuracy by focusing on global camera motion while ignoring local motion caused by moving objects. Motion prediction takes at most 10 ms, enabling real-time stabilization on modern smartphone hardware. We demonstrate that our method outperforms keypoint-based motion estimation and that the quality of the estimated motion is sufficient for video stabilization. Our network is trainable without ground truth and scales easily to large datasets.
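As a rough illustration of this idea, the following PyTorch sketch shows a compact CNN that regresses a global 2D affine motion from a concatenated frame pair and is trained with an unsupervised photometric loss, so no ground-truth motion is required. The architecture, the affine motion model, and the plain L1 loss are illustrative assumptions, not the paper's actual design.

```python
# Hedged sketch of a network that regresses global frame-to-frame motion,
# trainable without ground truth via a photometric warping loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 6)  # 2x3 affine motion parameters
        # Initialize the head so the network starts at the identity transform.
        nn.init.zeros_(self.head.weight)
        self.head.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, prev_gray, curr_gray):
        x = torch.cat([prev_gray, curr_gray], dim=1)  # (B, 2, H, W)
        theta = self.head(self.features(x).flatten(1))
        return theta.view(-1, 2, 3)

def photometric_loss(net, prev_gray, curr_gray):
    """Unsupervised loss: warp the previous frame with the predicted global
    motion and compare it to the current frame. A robust penalty or a mask
    over moving objects would help the network ignore local motion; this
    sketch keeps a plain L1 loss for brevity."""
    theta = net(prev_gray, curr_gray)
    grid = F.affine_grid(theta, curr_gray.shape, align_corners=False)
    warped = F.grid_sample(prev_gray, grid, align_corners=False)
    return F.l1_loss(warped, curr_gray)
```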
Two video frames (columns 1 and 2) are registered using the estimated pose; their difference is shown before (column 3) and after registration (column 4). The error is much smaller after registration (darker pixels indicate larger errors).
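Assuming the estimated pose between the two frames is expressed as a 3×3 homography (an assumption for illustration; the paper's motion model may differ), the registration check shown in the figure can be reproduced with a few lines of OpenCV:

```python
# Illustrative registration check: warp frame1 onto frame2 using the
# estimated motion H and compare difference images before and after.
import cv2
import numpy as np

def registration_error(frame1: np.ndarray, frame2: np.ndarray, H: np.ndarray):
    """Return (difference before registration, difference after)."""
    h, w = frame2.shape[:2]
    registered = cv2.warpPerspective(frame1, H, (w, h))
    diff_before = cv2.absdiff(frame1, frame2)
    diff_after = cv2.absdiff(registered, frame2)
    return diff_before, diff_after
```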
Pose estimation in challenging scenarios: (a) a large moving object occupies much of the frame; (b) lighting conditions change severely in a room. In both cases, keypoint-based tracking (top row) fails to estimate a correct pose, while our method (bottom row) successfully finds the frame-to-frame motion. The red rectangles depict the final image region kept after applying video stabilization; a wrong pose estimate results in a severe crop.
W. Lee, B. Yoo, D. Kim, J. Lee, S. Yim, T. Kwon, G. Lee, and J. Jeong, "Robust Camera Motion Estimation for Point-of-View Video Stabilization," in Virtual, Augmented and Mixed Reality: 13th International Conference, VAMR 2021, held as part of the 23rd HCI International Conference, HCII 2021, pp. 353–363, 2021.