GeoNVS: Geometry Grounded Video Diffusion
for Novel View Synthesis
† Corresponding Author
Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding the view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and it can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, respectively, with up to a 2x reduction in translation error and a 7x reduction in Chamfer Distance.
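The GS-Adapter pipeline described above (lift features to 3D Gaussians, render geometry-constrained novel-view features, adaptively fuse with diffusion features) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the toy weighted-blend renderer, and the sigmoid gating form are all assumptions made for illustration.

```python
import numpy as np

def render_gaussian_features(gaussian_feats, weights):
    """Toy stand-in for Gaussian-splat feature rendering: a normalized
    weighted blend of per-Gaussian features for each novel-view pixel.
    gaussian_feats: (N, C) features lifted from the input view.
    weights: (HW, N) per-pixel compositing weights (hypothetical)."""
    w = weights / np.clip(weights.sum(axis=1, keepdims=True), 1e-8, None)
    return w @ gaussian_feats  # (HW, C) geometry-constrained features

def adaptive_fuse(diff_feats, geo_feats, gate_w, gate_b):
    """Hypothetical gated fusion: a sigmoid gate decides, per pixel and
    channel, how much the geometry-constrained feature replaces the
    diffusion branch's own (possibly inconsistent) feature."""
    x = np.concatenate([diff_feats, geo_feats], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))
    return gate * geo_feats + (1.0 - gate) * diff_feats

rng = np.random.default_rng(0)
N, C, HW = 64, 8, 16                      # Gaussians, channels, pixels
gaussian_feats = rng.normal(size=(N, C))  # lifted input-view features
weights = rng.uniform(size=(HW, N))       # toy compositing weights
diff_feats = rng.normal(size=(HW, C))     # diffusion novel-view features

geo_feats = render_gaussian_features(gaussian_feats, weights)
fused = adaptive_fuse(diff_feats, geo_feats,
                      gate_w=rng.normal(size=(2 * C, C)) * 0.1,
                      gate_b=np.zeros(C))
print(fused.shape)  # (16, 8)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the diffusion and geometry features, so the geometry branch can only pull features toward the geometry-consistent rendering rather than inject arbitrary new content.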
(Left) Reconstructed point clouds and camera trajectories from the generated videos.
(Right) Comparison of generated novel-view images against baselines (SEVA and CameraCtrl).
Our model is compatible with diverse feed-forward geometry models in a zero-shot manner.
Feature-Space Modulation: By operating in feature space rather than at the input level, GeoNVS avoids the view-dependent color noise that often degrades structural consistency in prior work.
Plug-and-Play Design: The modular architecture supports zero-shot compatibility with various feed-forward geometry models (e.g., VGGT, DepthSplat, Pi3) without any additional training and can be adapted to different video diffusion backbones (e.g., SEVA, CameraCtrl).