GeoNVS: Geometry Grounded Video Diffusion
for Novel View Synthesis
† Corresponding Author
Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding the view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and it can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, respectively, with up to a 2x reduction in translation error and a 7x reduction in Chamfer Distance.
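The GS-Adapter pipeline described above (lift features to 3D Gaussians, render geometry-constrained novel-view features, adaptively fuse with diffusion features) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the toy weighted-blend renderer, and the sigmoid gating form are all assumptions made for illustration.

```python
import numpy as np

def render_gaussian_features(gaussian_feats, weights):
    """Toy stand-in for Gaussian-splat feature rendering: a normalized
    weighted blend of per-Gaussian features for each novel-view pixel.
    gaussian_feats: (N, C) features lifted from the input view.
    weights: (HW, N) per-pixel compositing weights (hypothetical)."""
    w = weights / np.clip(weights.sum(axis=1, keepdims=True), 1e-8, None)
    return w @ gaussian_feats  # (HW, C) geometry-constrained features

def adaptive_fuse(diff_feats, geo_feats, gate_w, gate_b):
    """Hypothetical gated fusion: a sigmoid gate decides, per pixel and
    channel, how much the geometry-constrained feature replaces the
    diffusion branch's own (possibly inconsistent) feature."""
    x = np.concatenate([diff_feats, geo_feats], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))
    return gate * geo_feats + (1.0 - gate) * diff_feats

rng = np.random.default_rng(0)
N, C, HW = 64, 8, 16                      # Gaussians, channels, pixels
gaussian_feats = rng.normal(size=(N, C))  # lifted input-view features
weights = rng.uniform(size=(HW, N))       # toy compositing weights
diff_feats = rng.normal(size=(HW, C))     # diffusion novel-view features

geo_feats = render_gaussian_features(gaussian_feats, weights)
fused = adaptive_fuse(diff_feats, geo_feats,
                      gate_w=rng.normal(size=(2 * C, C)) * 0.1,
                      gate_b=np.zeros(C))
print(fused.shape)  # (16, 8)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the diffusion and geometry features, so the geometry branch can only pull features toward the geometry-consistent rendering rather than inject arbitrary new content.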
(Left) Reconstructed point clouds and camera trajectories from the generated videos.
(Right) Comparison of generated novel-view images against baselines (SEVA and CameraCtrl).
Our model is compatible with diverse feed-forward geometry models in a zero-shot manner.
Feature-Space Modulation: By operating in feature space rather than at the input level, GeoNVS avoids the view-dependent color noise that often degrades structural consistency in prior work.
Plug-and-Play Design: The modular architecture supports zero-shot compatibility with various feed-forward geometry models (e.g., VGGT, DepthSplat, Pi3) without any additional training and can be adapted to different video diffusion backbones (e.g., SEVA, CameraCtrl).