CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Longwen Zhang1,2*  Ziyu Wang1,2*  Qixuan Zhang1,2+  Qiwei Qiu1,2  Anqi Pang1  Haoran Jiang1,2  Wei Yang3  Lan Xu1#  Jingyi Yu1#

1ShanghaiTech University        2Deemos Technology        3Huazhong University of Science and Technology

*Equal contributions. +Project leader. #Corresponding author. 

SIGGRAPH 2024 Journal & Real-time Live!


Abstract

In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and effort. To narrow this gap, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). At its core is a large-scale generative model, composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), that extracts rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module built from pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra-large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D-native geometry generator with 1.5 billion parameters. For appearance generation, CLAY produces physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K-resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.
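To make the phrase "neural fields to represent continuous and complete surfaces" concrete, the following is a minimal sketch of what a field-based shape representation looks like from the consumer's side. It is not CLAY's model: the learned network is replaced by an analytic signed-distance function for a sphere (an assumption made purely for illustration), and the names `sphere_field` and `sample_occupancy` are hypothetical. The point it shows is that a field is a function of 3D position, so the same shape can be queried at any grid resolution.

```python
import math


def sphere_field(x, y, z, radius=0.5):
    """Stand-in for a learned field: signed distance to a sphere at the origin.

    Negative inside the surface, positive outside, zero on the surface.
    """
    return math.sqrt(x * x + y * y + z * z) - radius


def sample_occupancy(field, resolution, extent=1.0):
    """Query `field` at the cell centers of a resolution^3 grid over [-extent, extent]^3.

    Returns the set of (i, j, k) cells whose center lies inside the surface.
    """
    step = 2.0 * extent / resolution
    inside = set()
    for i in range(resolution):
        x = -extent + (i + 0.5) * step
        for j in range(resolution):
            y = -extent + (j + 0.5) * step
            for k in range(resolution):
                z = -extent + (k + 0.5) * step
                if field(x, y, z) < 0.0:
                    inside.add((i, j, k))
    return inside


# The same continuous field, discretized at two different resolutions.
coarse = sample_occupancy(sphere_field, resolution=8)
fine = sample_occupancy(sphere_field, resolution=16)

# Occupied volume estimated from each grid (cell count times cell volume).
coarse_volume = len(coarse) * (2.0 / 8) ** 3
fine_volume = len(fine) * (2.0 / 16) ** 3
```

In a learned setting the analytic function would be replaced by a decoder network conditioned on a latent code, but the querying pattern stays the same: the surface is recovered by evaluating the field on whatever grid the downstream meshing step needs.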

Overview

An overview of our CLAY framework for 3D generation. Central to the framework is a large generative model trained on extensive 3D data, capable of transforming textual descriptions into detailed 3D geometries. The model is further enhanced by physically-based material generation and versatile modal adaptation, enabling the creation of 3D assets from diverse concepts and ensuring their realistic rendering in digital environments.
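As a rough sense of scale for the material output, the sketch below tallies the uncompressed size of one texture set with the three modalities named in the abstract (diffuse, roughness, metallic). Several values are assumptions not stated on this page: that "2K" means 2048 x 2048 pixels, that diffuse is stored as three channels and the other two maps as one channel each, and that each channel uses 8 bits. `TextureMap` and `texture_set_bytes` are hypothetical names used only here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextureMap:
    name: str
    channels: int


# Channel counts are assumptions: RGB diffuse, single-channel roughness and metallic.
PBR_MAPS = (
    TextureMap("diffuse", 3),
    TextureMap("roughness", 1),
    TextureMap("metallic", 1),
)


def texture_set_bytes(maps, side=2048, bytes_per_channel=1):
    """Uncompressed size in bytes of a square texture set with the given maps."""
    total_channels = sum(m.channels for m in maps)
    return total_channels * side * side * bytes_per_channel


total = texture_set_bytes(PBR_MAPS)
print(f"{total / 2**20:.0f} MiB uncompressed")  # prints "20 MiB uncompressed"
```

Under these assumptions a single asset carries five channels at 2048 x 2048, which is why production pipelines typically store such maps in compressed image formats rather than raw arrays.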

Results

Evolution of human innovation, from primitive tools and cultural artifacts to modern electronics and futuristic imaginings, generated by CLAY.

Sample creations using CLAY, with conditions marked in sky blue and input geometries for respective conditioning (if available) in sandy brown.

Citation

@misc{zhang2024clay,
      title={CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets},
      author={Longwen Zhang and Ziyu Wang and Qixuan Zhang and Qiwei Qiu and Anqi Pang and Haoran Jiang and Wei Yang and Lan Xu and Jingyi Yu},
      year={2024},
      eprint={2406.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
