DreamSparse: Escaping from Plato’s Cave with 2D Frozen Diffusion Model given Sparse Views

Paul Yoo, Jiaxian Guo, Yutaka Matsuo, Shixiang Shane Gu

The University of Tokyo

[arxiv]  [Github]

Abstract


Synthesizing novel view images from a few views is a challenging but practical problem. Existing methods often struggle to produce high-quality results or require per-object optimization in such few-view settings because of the insufficient information provided. In this work, we explore leveraging the strong 2D priors in pre-trained diffusion models for synthesizing novel view images. 2D diffusion models, nevertheless, lack 3D awareness, leading to distorted image synthesis and compromised identity. To address these problems, we propose DreamSparse, a framework that enables a frozen pre-trained diffusion model to generate geometry- and identity-consistent novel view images. Specifically, DreamSparse incorporates a geometry module designed to capture 3D features from sparse views as a 3D prior. A spatial guidance model then converts these 3D feature maps into spatial information for the generative process, which guides the pre-trained diffusion model to generate geometrically consistent images without tuning it. Leveraging the strong image priors in pre-trained diffusion models, DreamSparse can synthesize high-quality novel views for both object- and scene-level images and generalize to open-set images. Experimental results demonstrate that our framework effectively synthesizes novel view images from sparse views and outperforms baselines on both trained and open-set category images.
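For readers who want a concrete picture of the pipeline described above, here is a minimal PyTorch-style sketch of the two components named in the abstract: a geometry module that builds 3D-aware features from sparse posed views, and a spatial guidance model that converts them into conditioning for the frozen diffusion backbone. The module interfaces, tensor shapes, and the way the guidance is injected into the denoiser are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the DreamSparse pipeline, assuming simplified modules.
# All names, shapes, and the guidance-injection point are placeholders.
import torch
import torch.nn as nn


class GeometryModule(nn.Module):
    """Aggregates sparse context views and their poses into a 3D-aware feature map."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Camera pose folded into per-view features (placeholder embedding).
        self.pose_mlp = nn.Linear(16, feat_dim)

    def forward(self, views: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # views: (N, 3, H, W) sparse context views; poses: (N, 4, 4) cameras.
        feats = self.encoder(views)                      # (N, C, H, W)
        pose_feat = self.pose_mlp(poses.flatten(1))      # (N, C)
        feats = feats + pose_feat[:, :, None, None]
        # Aggregate across views into a single target-view feature map.
        return feats.mean(dim=0, keepdim=True)           # (1, C, H, W)


class SpatialGuidanceModel(nn.Module):
    """Converts 3D feature maps into spatial guidance for the frozen diffusion model."""

    def __init__(self, feat_dim: int = 64, guidance_dim: int = 320):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, guidance_dim, 1)

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        return self.proj(feats_3d)                       # (1, G, H, W)


def guided_denoise_step(frozen_unet, x_t, t, guidance):
    """One denoising step. Here the guidance is simply added to the denoiser
    input; how the real model hooks the guidance into the frozen UNet
    (e.g. feature-level injection) is a simplification in this sketch."""
    with torch.no_grad():  # the pre-trained diffusion model is never tuned
        return frozen_unet(x_t + guidance, t)


if __name__ == "__main__":
    views = torch.randn(2, 3, 64, 64)                 # two sparse context views
    poses = torch.randn(2, 4, 4)
    geometry = GeometryModule()
    guide = SpatialGuidanceModel(guidance_dim=3)      # match image channels here
    guidance = guide(geometry(views, poses))
    frozen_unet = lambda x, t: torch.zeros_like(x)    # stand-in for the frozen UNet
    x_t = torch.randn(1, 3, 64, 64)
    eps = guided_denoise_step(frozen_unet, x_t, torch.tensor([10]), guidance)
    print(eps.shape)                                  # torch.Size([1, 3, 64, 64])
```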


Single-Image Object-Centric & Scene-Level Novel View Synthesis Results

Without Inference-time Distillation


To generate novel views along a circular trajectory, we condition on all previously generated samples on the fly: conditioning simply appends each generated view to the set of context views. Because our model is generative and does not need to fit a 3D representation such as a NeRF, synthesizing a novel view at inference time is a single forward pass with no further training (for both unseen and open-set category objects), taking roughly 2–3 seconds per frame on a single A100-40GB GPU.
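A minimal sketch of this on-the-fly conditioning loop, assuming a hypothetical `synthesize_view(views, poses, target_pose)` wrapper around one forward pass of the model:

```python
# Sketch of autoregressive inference along a circular trajectory: each newly
# synthesized view is appended to the context set before generating the next
# pose. `synthesize_view` is a hypothetical stand-in for the model forward pass.
import math


def circular_trajectory(num_frames, radius=2.0, height=0.5):
    """Yield camera positions evenly spaced on a circle around the object."""
    for i in range(num_frames):
        theta = 2.0 * math.pi * i / num_frames
        yield (radius * math.cos(theta), height, radius * math.sin(theta))


def render_orbit(synthesize_view, context_views, context_poses, num_frames=36):
    views, poses = list(context_views), list(context_poses)
    generated = []
    for target_pose in circular_trajectory(num_frames):
        # Single forward pass per frame; no per-scene optimization or NeRF fit.
        novel_view = synthesize_view(views, poses, target_pose)
        generated.append(novel_view)
        # Condition on everything generated so far by growing the context set.
        views.append(novel_view)
        poses.append(target_pose)
    return generated
```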

Context View | Synthesized

Novel View Synthesis Results with NeRF

After Distillation

Ground Truth Image | SparseFusion | Ours

Single-Image Object-Centric & Scene-Level Novel View Synthesis Results (Unseen Test Sample)

Hydrant Scene Novel View Synthesis Results

With N=5 Context Views (Without Test-Time Distillation)


GT | Synthesized

Additional Novel View Synthesis Results for Donuts (Without Distillation)

Context View | Synthesized

Depth Estimation Results from the Geometry Module

GT RGB | Depth Estimate

Text-Guided Scene Editing & Style Transfer

Context View | "ghibli inspired" | "oil painting" | "digital illustration" | "in snow"

Open-Set Category Results (Unseen Test Sample)

Context Views

Training Domain Novel View Synthesis Results (Unseen Test Sample)