The creation of 3D artifacts has traditionally been the domain of highly skilled designers and artists, bound by a complex and time-consuming workflow. The existing process, from ideation to production, requires a diverse range of technical skills and mastery over multiple computational systems. This multi-layered approach—involving rule-setting, exploration, simulation, and optimization—is not only slow but also inherently exclusive, limiting creative input to trained professionals. Consequently, the barrier to entry remains high, stifling collaboration and slowing down the pace of innovation.
To address these fundamental challenges, we have developed a novel generative AI framework that revolutionizes 3D asset creation. Our approach streamlines the entire pipeline by leveraging a sophisticated Encoder-Decoder architecture that translates natural language descriptions directly into detailed 3D models. This paradigm shift replaces the convoluted traditional workflow with an intuitive, single-system solution.
Our system is designed to achieve four key objectives:
Promote Natural Engagement: By using text as the primary input, we lower the technical barrier, allowing creators to express their vision in a simple and intuitive way.
Democratize the Process: Our model opens the door for multiple users—including writers, producers, and developers without specialized 3D skills—to actively participate in the asset generation process.
Consolidate Systems: It eliminates the need for a complex chain of software tools, integrating the generation process into one unified and efficient system.
Accelerate Production: By directly converting ideas into assets, we drastically reduce the time from concept to a usable 3D model.
In essence, our text-to-3D system is not just an incremental improvement but a transformative leap towards making 3D content creation faster, more accessible, and profoundly more collaborative.
Our dataset strategy progressed through three deliberate phases to maximize model performance. We began with ModelNet40 for foundational class-based point cloud generation, establishing a robust baseline in a controlled environment. To transition to text conditioning, we scaled up our training with the massive Objaverse dataset; while its scale was invaluable for initial text-to-3D alignment, the crude captions and inconsistent mesh quality imposed a clear ceiling on fidelity. To surpass this limitation, we ultimately curated a custom high-fidelity dataset by scraping higher-quality models, converting them into Signed Distance Fields (SDFs), and pairing them with detailed, descriptive captions. This final, purpose-built dataset provides the clean, nuanced data necessary for our model to generate exceptionally detailed and accurate 3D artifacts.
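The mesh-to-SDF conversion step of this curation pipeline can be sketched as follows. This is a minimal illustration, assuming each scraped asset is available as a watertight triangle mesh; the use of trimesh, the grid resolution, and the normalization scheme are assumptions for the sketch, not a description of our exact tooling.

```python
# Illustrative sketch: convert a watertight mesh into a dense SDF grid.
# File path, resolution, and normalization are placeholders for this example.
import numpy as np
import trimesh

def mesh_to_sdf_grid(mesh_path: str, resolution: int = 64) -> np.ndarray:
    mesh = trimesh.load(mesh_path, force="mesh")

    # Normalize the mesh into a unit cube so all assets share a common scale.
    bounds = mesh.bounds                      # (2, 3): min and max corners
    mesh.apply_translation(-bounds.mean(axis=0))
    mesh.apply_scale(1.0 / (bounds[1] - bounds[0]).max())

    # Build a regular grid of query points covering the normalized mesh.
    lin = np.linspace(-0.5, 0.5, resolution)
    points = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)

    # trimesh returns positive distances inside the surface and negative outside;
    # negate if your downstream convention expects the opposite sign.
    sdf = trimesh.proximity.signed_distance(mesh, points)
    return sdf.reshape(resolution, resolution, resolution)
```

In practice the resulting grids are cached to disk alongside their captions, since signed-distance queries over dense grids are far too slow to run inside the training loop.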
For our initial trial, we implemented a U-Net Denoising Diffusion Probabilistic Model (DDPM) to establish a baseline for 3D shape generation. Instead of generating a 3D volume directly, this model was designed to predict a triplanar representation: a set of three orthogonal 2D feature planes (XY, YZ, and XZ) that efficiently encode the geometry of an object. The generation process begins with randomly sampled noise in the triplane space, which the U-Net iteratively denoises, guided by a conditioning vector derived from a class label (e.g., "airplane," "chair") processed by a text encoder. During training, the model learns by minimizing the L1 loss between its predicted triplanes and the ground-truth triplanes extracted from the training data. Once denoising is complete, the predicted triplane features are sampled onto a dense 3D grid to form a scalar field, and the marching cubes algorithm, a classic computer graphics technique, extracts a coherent and continuous mesh from that field, converting the model's compact 2D output into a usable 3D asset.
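This extraction step can be sketched as below. The bilinear sampling of each plane and the small MLP that maps the aggregated features to a scalar value are assumptions that mirror common triplane pipelines, not a description of our exact decoder.

```python
# Illustrative sketch of triplane decoding followed by marching cubes.
# `mlp` is a hypothetical small network mapping C plane features to one scalar.
import torch
import torch.nn.functional as F
from skimage import measure

def query_triplanes(planes, points, mlp):
    """planes: (3, C, H, W) feature planes for XY, YZ, XZ.
    points: (N, 3) query coordinates in [-1, 1]."""
    xy, yz, xz = points[:, [0, 1]], points[:, [1, 2]], points[:, [0, 2]]
    feats = []
    for plane, coords in zip(planes, (xy, yz, xz)):
        # grid_sample expects (B, C, H, W) input and a (B, 1, N, 2) grid.
        sampled = F.grid_sample(plane.unsqueeze(0), coords.view(1, 1, -1, 2),
                                align_corners=True)          # (1, C, 1, N)
        feats.append(sampled.squeeze(0).squeeze(1).T)         # (N, C)
    return mlp(sum(feats)).squeeze(-1)                        # (N,) scalar field

def triplanes_to_mesh(planes, mlp, resolution=128):
    lin = torch.linspace(-1.0, 1.0, resolution)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)
    with torch.no_grad():
        values = query_triplanes(planes, grid, mlp)
    volume = values.reshape(resolution, resolution, resolution).cpu().numpy()
    # Marching cubes extracts an iso-surface of the 3D scalar field.
    # Use level=0.0 for an SDF-like field, level=0.5 for an occupancy field.
    verts, faces, _, _ = measure.marching_cubes(volume, level=0.0)
    return verts, faces
```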
Building on the insights from Trial 1, we recognized the computational demands and efficiency challenges of the Denoising Diffusion Probabilistic Model for generating high-fidelity triplanes. To address these limitations and improve the model's capacity for complex text-to-3D generation, we transitioned to a transformer-based architecture. For computational efficiency, and to leverage existing language-understanding capabilities, the Trial 2 model incorporates a pre-trained transformer encoder. This encoder processes a rich text prompt (e.g., "Low poly Necron gauss rifle from the dawn of war series") and encodes it into a compact, meaningful latent vector of size 384. The core innovation of this trial lies in our transformer decoder, which is trained to take the latent vector produced by the pre-trained encoder and directly synthesize the triplanar representation of the desired 3D object. Using a pre-trained encoder significantly reduces the training burden, allowing us to focus computational resources on training the decoder to accurately translate abstract text features into concrete 2D geometric projections. During training, we optimize the decoder with a combination of two loss terms: an L1 loss for pixel-level accuracy against the ground-truth triplanes, and a perceptual loss (VGG16) that compares high-level features and encourages visually coherent, realistic triplanes. As in Trial 1, the predicted triplanes are converted into a final 3D mesh with the marching cubes algorithm, producing the desired 3D model from a textual description. This setup enables more efficient, higher-quality generation of complex 3D assets directly from text.
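A minimal sketch of the combined Trial 2 objective is shown below, assuming the predicted and ground-truth triplanes arrive as batched tensors of shape (B, 3, C, H, W); the reduction of each plane to three channels before the VGG16 feature extractor, the choice of VGG layer slice, and the loss weighting are assumptions made to keep the example self-contained.

```python
# Illustrative sketch of the Trial 2 objective: L1 reconstruction loss plus a
# VGG16-based perceptual loss. Weighting and channel handling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class TriplaneLoss(nn.Module):
    def __init__(self, perceptual_weight: float = 0.1):
        super().__init__()
        # Freeze an early slice of VGG16 features as a fixed perceptual metric.
        self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.perceptual_weight = perceptual_weight

    def forward(self, pred, target):
        """pred, target: (B, 3, C, H, W) triplanes (XY, YZ, XZ)."""
        l1 = F.l1_loss(pred, target)

        # Fold the three planes into the batch dimension, then collapse each
        # plane's feature channels to 3 so they can pass through VGG16
        # (channel averaging is an assumption for this sketch).
        b, n, c, h, w = pred.shape
        pred_img = pred.reshape(b * n, c, h, w).mean(1, keepdim=True).repeat(1, 3, 1, 1)
        tgt_img = target.reshape(b * n, c, h, w).mean(1, keepdim=True).repeat(1, 3, 1, 1)
        perceptual = F.l1_loss(self.vgg(pred_img), self.vgg(tgt_img))

        return l1 + self.perceptual_weight * perceptual
```

The perceptual weight is a hyperparameter: too small and the term has no effect, too large and the decoder trades pixel accuracy for texture-level similarity.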
Following the successful implementation of our transformer architecture in Trial 2, the next iteration focuses on improving the quality and smoothness of the generated geometry. While triplanes are efficient, they can produce discrete or blocky artifacts. Trial 3 addresses this by replacing the triplanar output with a direct prediction of a Signed Distance Field (SDF) and its corresponding normal map. The core architecture remains a computationally efficient pre-trained encoder paired with a custom transformer decoder, but the decoder is now trained to synthesize a continuous SDF volume from the text prompt's latent vector. SDFs are well suited to representing smooth, continuous, and watertight surfaces, which is a significant step towards generating production-ready meshes. The model is trained to minimize a combination of L1 and L2 losses between the predicted and ground-truth SDFs, and the final mesh is still extracted with the marching cubes algorithm. This trial aims to leverage the power of transformers to create geometrically superior models directly from text.
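A compact sketch of the Trial 3 objective is shown below, under the assumption that the predicted and ground-truth SDFs are dense voxel grids; the relative weighting of the two terms is also an assumption.

```python
# Illustrative sketch of the Trial 3 SDF objective (L1 + L2 terms).
# The l2_weight default is an assumption, not a tuned value.
import torch
import torch.nn.functional as F

def sdf_loss(pred_sdf: torch.Tensor,
             gt_sdf: torch.Tensor,
             l2_weight: float = 1.0) -> torch.Tensor:
    """pred_sdf, gt_sdf: (B, D, H, W) signed-distance grids."""
    l1 = F.l1_loss(pred_sdf, gt_sdf)    # robust term, tolerant of outliers
    l2 = F.mse_loss(pred_sdf, gt_sdf)   # strongly penalizes large deviations
    return l1 + l2_weight * l2
```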
Our future work aims to evolve the model from a text-only system into a versatile multimodal framework capable of generating high-fidelity 3D assets from either a text prompt or a single image in seconds. This significantly broadens its creative and practical applications, from pure text-based ideation to 3D reconstruction from a single photograph. Inspired by the LDM paper's pipeline, the multimodal model will incorporate a new conditional branch for image inputs. When a single image is provided, it will first be processed by multi-view diffusion models such as MVDream and ImageDream to generate a set of consistent 2D views from different angles. A pre-trained vision transformer will then encode these views, and their combined features will serve as a robust geometric condition for our transformer decoder, guiding it to reconstruct the SDF. By adopting this strategy, we aim to create a single, powerful pipeline that handles both abstract text descriptions and concrete visual references, unifying the process of creating versatile 3D assets.
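The planned image-conditioning branch could look roughly like the sketch below. The multi-view generation step (MVDream / ImageDream) is abstracted away, the choice of a torchvision ViT backbone, the mean-pooling of per-view embeddings, and the projection to the 384-dimensional latent used by the text encoder are all assumptions for illustration rather than committed design decisions.

```python
# Illustrative sketch of a multi-view image-conditioning branch.
# All design choices here (backbone, pooling, projection) are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class MultiViewCondition(nn.Module):
    def __init__(self, cond_dim: int = 384):
        super().__init__()
        vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        vit.heads = nn.Identity()   # expose the 768-d class-token embedding
        self.vit = vit.eval()
        for p in self.vit.parameters():
            p.requires_grad_(False)
        # Project to the same latent size the text encoder produces (384).
        self.proj = nn.Linear(768, cond_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        """views: (B, V, 3, 224, 224) images rendered from V viewpoints."""
        b, v, c, h, w = views.shape
        feats = self.vit(views.reshape(b * v, c, h, w))   # (B*V, 768)
        feats = feats.reshape(b, v, -1).mean(dim=1)       # pool across views
        return self.proj(feats)                           # (B, cond_dim)
```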
Our next major initiative is the development of a 3D Co-Creative AI, a project that moves beyond single-shot generation to create a truly agentic AI for 3D asset production. This AI is designed to function as a collaborative partner, capable of interpreting complex, high-level creative goals and managing the entire production pipeline. As illustrated in the workflow, the process begins when the agent deconstructs a user's request, identifying the main objects, their desired states, and any technical constraints. Following this planning phase, the agent generates each required component as a separate base model before performing modifications and assembling them into a cohesive final scene. The final step in its workflow is to optimize this assembled scene to meet all of the initial user-defined constraints, delivering a complete and polished 3D asset. By automating this entire pipeline—from deconstruction and generation to assembly and optimization—the 3D Co-Creative AI aims to streamline the content creation process, allowing users to focus on high-level vision rather than manual, step-by-step execution.
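The agent's workflow can be summarized by the orchestration sketch below. Every type and stage named here (ObjectSpec, ScenePlan, and the plan, generate, modify, assemble, and optimize callables) is a hypothetical placeholder standing in for the pipeline stages described above, not an existing API.

```python
# High-level sketch of the planned co-creative agent loop.
# The five pipeline stages are passed in as callables; all names are
# hypothetical placeholders for the stages described in the text.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ObjectSpec:
    prompt: str        # what to generate
    state: str = ""    # desired modification, e.g. "weathered"

@dataclass
class ScenePlan:
    objects: list[ObjectSpec] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)   # e.g. polygon budget

def co_creative_pipeline(user_request: str,
                         plan: Callable, generate: Callable, modify: Callable,
                         assemble: Callable, optimize: Callable):
    # 1. Deconstruct the request into objects, desired states, and constraints.
    scene_plan: ScenePlan = plan(user_request)

    # 2. Generate each required component as a separate base model.
    components = [generate(spec.prompt) for spec in scene_plan.objects]

    # 3. Apply per-object modifications, then assemble into one scene.
    components = [modify(c, spec.state) for c, spec in zip(components, scene_plan.objects)]
    scene = assemble(components)

    # 4. Optimize the assembled scene against the user-defined constraints.
    return optimize(scene, scene_plan.constraints)
```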