ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

We propose Asynchronous Score Distillation (ASD), a novel score distillation method for training text-to-3D generators in an unsupervised manner. ASD is stable to train and can scale up to 100k prompts. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and superior prompt consistency, especially with large prompt corpora.

Figure 1. Overview of Asynchronous Score Distillation (ASD). As illustrated in the left sub-figure, ASD can be employed for prompt-specific generation by optimizing 3D representations for each prompt, as well as for prompt-amortized generation by training a text-to-3D generator. The right sub-figure depicts how ASD uses the difference in noise predictions at asynchronous timesteps to update the 3D network parameters. 
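
As a rough illustration of the right sub-figure, the sketch below computes a weighted difference between two noise predictions made for the same noisy input at two different timesteps. It is a toy under stated assumptions, not the released implementation: `toy_noise_predictor`, `asynchronous_residual`, the direction of the timestep shift (`t + delta_t`), the clamping to `t_max`, and the scalar `weight` are all hypothetical names and choices made for this example.

```python
import numpy as np


def toy_noise_predictor(x_t, t):
    """Hypothetical stand-in for a frozen 2D diffusion model's noise prediction."""
    # Not a real model: just a smooth function of the noisy input and timestep.
    return np.tanh(x_t) * (t / 1000.0)


def asynchronous_residual(eps_fn, x_t, t, delta_t, weight=1.0, t_max=999):
    """Weighted difference of noise predictions at two different timesteps.

    Assumptions for illustration: both predictions share the same noisy
    input x_t, and the second timestep is t + delta_t, clamped to t_max.
    """
    t_shifted = min(t + delta_t, t_max)
    eps_at_t = eps_fn(x_t, t)
    eps_at_shifted = eps_fn(x_t, t_shifted)
    return weight * (eps_at_t - eps_at_shifted)


rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 4))  # stands in for a noised rendered image

residual = asynchronous_residual(toy_noise_predictor, x_t, t=500, delta_t=100)
zero_shift = asynchronous_residual(toy_noise_predictor, x_t, t=500, delta_t=0)
```

In an actual training loop, a residual of this kind would be backpropagated through the differentiable renderer to the 3D network parameters. With `delta_t=0` the two predictions coincide and the residual vanishes, which is why the timesteps need to differ.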

PS: You can change the resolution of the following videos for more detail 🙏

Teaser

Teaser.mp4

Demo 1. Top rows: Asynchronous Score Distillation (ASD) for prompt-specific text-to-3D generation. Bottom row: ASD for prompt-amortized generation, which learns a text-to-3D generator on multiple prompts without 3D ground truth. ASD can scale the training corpus up to as many as 100k text prompts.

Results with iNGP / Hyper-iNGP

Results with iNGP, Hyper-iNGP.mp4

Demo 2. Qualitative comparison of prompt-specific (with iNGP as the 3D representation) and prompt-amortized (with Hyper-iNGP as the 3D generator) text-to-3D results from SDS, CSD, VSD and our ASD.

Results with 3DConv-Net

Results with 3DConv-net.mp4

Demo 3. Qualitative comparison among CSD, VSD and our ASD (with 3DConv-Net as the generator) on the AT2520 and DF415 corpora. SDS is not compared because it encounters numerical instability in this experiment.

Ablation Study

Ablation_Study.mp4

Demo 4. Qualitative results of the ablation study on the timestep interval.

Scalability

Scalability.mp4

Demo 5. Scalability comparison with CSD and VSD on the CP100k corpus.

More Results with MVDream

MVDream Compare.mp4

Demo 6. Qualitative comparison between SDS* and ASD on prompt-specific text-to-3D generation, with iNGP as the 3D representation and MVDream as the 2D diffusion prior.