3Ts model for architectural designers
(In collaboration with David, Chia, and Graham)
Website: https://github.com/1gfelton/3T3D?tab=readme-ov-file
The primary objective of this project is to develop a collaborative and intuitive computational design framework that promotes sustainable experimentation within the creative process. Existing computational tools frequently operate as opaque, "black box" systems, limiting meaningful interaction predominantly to experienced users. This restricts the democratization of the design process, reducing accessibility and collaboration opportunities.
Addressing this issue, our research proposes a more diffused and democratic interaction model, aimed at facilitating intuitive participation from a diverse range of users. A critical challenge we identify is the underappreciation of computationally generated design artifacts, often dismissed as mere mechanical outputs lacking genuine creative value. However, as Daniel Cardoso Llach emphasizes in his work "Sculpting of Spaces," the development of computational tools itself is a creative act that inherently generates valuable design artifacts.
Consequently, our project focuses on creating computational tools that are intuitive and accessible, fostering inclusive and meaningful collaborations among users with varied expertise. By redefining computational outputs as intrinsically creative contributions, we strive to elevate the creative process in the early stages of design.
Our research is founded on recent advancements in 3D-aware synthesis, particularly the work of Deng et al., who combined conditional generative models and Neural Radiance Fields (NeRFs) using an efficient triplanar representation. While their focus was novel view synthesis, we envision its potential as a "sketch-to-3D" tool for designers, provided the underlying 3D data can be converted into a standard mesh.
Building on this, Shue et al. integrated the triplanar concept into a diffusion architecture to generate novel designs. However, their model was optimized for object-centric datasets (e.g., ShapeNet) and is not well suited to structured environments. This limitation is the direct motivation for our work: we are extending triplanar diffusion techniques to the architectural domain, aiming to create a model that can generate complex spatial designs while respecting critical design constraints.
During our literature review, we were unable to find an existing dataset of sufficient quality for architectural design, so we built a custom data-generation pipeline. We first produced thousands of architectural renderings, then used TripoSR to generate a 3D model from each. For every generated model, we rendered views from the top, front, and side and converted them into sketch-like images with Informative Drawings.
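To make the view-extraction step concrete, below is a minimal sketch of how top, front, and side views can be produced from a generated mesh. It uses trimesh and matplotlib as stand-ins for the rendering setup we actually used; the axis mapping per view depends on the mesh's up-axis convention, and TripoSR and Informative Drawings are run separately via their own repositories.

```python
import matplotlib.pyplot as plt
import trimesh

# Axis pairs kept per view; this mapping assumes a Z-up mesh and may need to be
# adapted to the convention of the meshes produced by TripoSR.
VIEWS = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}

def save_orthographic_views(mesh_path, out_dir):
    mesh = trimesh.load(mesh_path, force="mesh")
    v, f = mesh.vertices, mesh.faces
    for name, (a, b) in VIEWS.items():
        fig, ax = plt.subplots(figsize=(4, 4))
        # Project the mesh onto the chosen plane and draw its wireframe;
        # Informative Drawings is applied to these images afterwards.
        ax.triplot(v[:, a], v[:, b], f, linewidth=0.2, color="black")
        ax.set_aspect("equal")
        ax.axis("off")
        fig.savefig(f"{out_dir}/{name}.png", dpi=200, bbox_inches="tight")
        plt.close(fig)
```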
Both baselines were selected because they align closely with our project goals; brief descriptions follow:
• Pix2Pix3D — A 3D-aware conditional generative model designed for controllable, multi-view-consistent image synthesis. Given a segmentation or edge map, the model synthesizes images of the corresponding scene from different viewpoints while maintaining geometric consistency. It achieves this by learning a 3D scene representation that encodes color, density, and semantic labels at each 3D point, allowing it to render both images and their corresponding pixel-aligned label maps. The underlying 3D representation is parameterized with triplane features and decoded by a lightweight MLP-based volume renderer.
• Triplanar Diffusion — A diffusion model that generates 3D objects from their triplane representations. A shared MLP decodes the triplane features into occupancy fields representing the 3D structure. A 2D diffusion model (DDPM backbone) is then trained on the normalized triplane features to generate new samples; once generated, the triplane features are decoded to reconstruct detailed 3D shapes. (A minimal sketch of triplane-to-occupancy decoding follows this list.)
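Since both baselines hinge on the same idea, sampling per-point features from three axis-aligned planes and decoding them into an occupancy value, here is a minimal PyTorch sketch of that decoding step. Channel counts and MLP sizes are illustrative, not taken from either paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneOccupancyDecoder(nn.Module):
    """Query occupancy at 3D points from triplane features (a sketch of the idea,
    not the exact implementation of either baseline)."""
    def __init__(self, channels=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # occupancy logit per query point
        )

    def forward(self, planes, points):
        # planes: (B, 3, C, H, W) feature planes for the XY, XZ, YZ planes
        # points: (B, N, 3) query coordinates normalized to [-1, 1]
        feats = []
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):
            grid = points[:, :, [a, b]].unsqueeze(1)             # (B, 1, N, 2)
            sampled = F.grid_sample(planes[:, i], grid,
                                    mode="bilinear", align_corners=False)
            feats.append(sampled.squeeze(2).transpose(1, 2))     # (B, N, C)
        return self.mlp(torch.cat(feats, dim=-1))                # (B, N, 1) logits
```

Summing or concatenating the three bilinearly sampled plane features before a small MLP is what keeps per-point queries cheap relative to storing a full 3D feature volume.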
For this project, we built a model that reconstructs detailed 3D surfaces from a few 2D images. We build on DINOv2, a state-of-the-art pre-trained Vision Transformer (ViT): its ability to implicitly capture scene geometry, combined with its training efficiency, makes it a strong foundation for our work. The model processes images and generates a 3D mesh through a streamlined encoder-decoder architecture. Our pipeline can be broken down into three main stages: encoding the input images, decoding them into a 3D representation, and finally extracting the 3D mesh.
Encoder: Understanding the Scene
The process begins with the DINOv2 ViT acting as our powerful image encoder.
Input: The model takes three distinct views of an object as input.
Feature Extraction: Each image is independently processed by the DINOv2 backbone, which extracts a rich set of feature embeddings. This captures high-level information about the object's shape, texture, and position in each view.
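As an illustration, the encoding step can be sketched roughly as follows. It assumes the official facebookresearch/dinov2 torch.hub entry point and its forward_features output keys; the backbone variant and input resolution here are illustrative, not necessarily those used in our experiments.

```python
import torch

# Load a small DINOv2 backbone via torch.hub (assumes the official
# facebookresearch/dinov2 hub entry point; our experiments may use a larger variant).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

@torch.no_grad()
def encode_views(views):
    """views: (3, 3, 224, 224) tensor holding the top/front/side images,
    ImageNet-normalized. Returns per-view patch embeddings, which for
    ViT-S/14 at 224 px have shape (3, 256, 384)."""
    features = backbone.forward_features(views)
    # "x_norm_patchtokens" holds the normalized patch embeddings for each view.
    return features["x_norm_patchtokens"]
```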
Decoder: Building the 3D Representation
The decoder's job is to fuse the information from the three views and translate it into a format that defines a 3D shape.
Fusion: The feature embeddings from each of the three views are first projected into a common dimension and then summed together, creating a single, unified feature representation.
Core Transformer: This combined representation is fed into a custom Transformer decoder, which refines the features and learns the object's complete 3D structure.
Upsampling: The model then upsamples this representation through a series of transposed convolutions, progressively increasing the spatial resolution to produce the triplane features that are later decoded into a volumetric occupancy field.
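A rough PyTorch sketch of this decoder is shown below. A standard TransformerEncoder stands in for our custom decoder, and all dimensions (token grid size, channel counts, number of layers) are illustrative assumptions rather than the project's exact configuration.

```python
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    """Fuse per-view DINOv2 tokens and upsample them into triplane features."""
    def __init__(self, in_dim=384, model_dim=256, plane_ch=32):
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)          # shared projection per view
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_planes = nn.Sequential(                   # e.g. 16x16 tokens -> 64x64 planes
            nn.ConvTranspose2d(model_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3 * plane_ch, 4, stride=2, padding=1),
        )
        self.plane_ch = plane_ch

    def forward(self, view_tokens):
        # view_tokens: (B, 3, N, in_dim) patch embeddings for the three views
        fused = self.proj(view_tokens).sum(dim=1)         # project, then sum -> (B, N, D)
        fused = self.transformer(fused)                   # refine the fused tokens
        B, N, D = fused.shape
        side = int(N ** 0.5)                              # assumes a square patch grid
        grid = fused.transpose(1, 2).reshape(B, D, side, side)
        planes = self.to_planes(grid)                     # (B, 3*C, 4*side, 4*side)
        return planes.reshape(B, 3, self.plane_ch, 4 * side, 4 * side)
```

The square-grid reshape mirrors the ViT patch layout, and the transposed convolutions recover the spatial resolution lost to patching before the triplane is handed to the occupancy decoder.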
Mesh Extraction: From Volume to Surface
The final step is to convert the model's abstract volumetric output into a clean, usable 3D mesh.
Marching Cubes Algorithm: We use the classic and efficient Marching Cubes algorithm to extract a high-quality 3D surface mesh directly from the volumetric occupancy field predicted by our model. This results in the final 3D reconstruction.
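A minimal sketch of this extraction step, assuming the occupancy field has been sampled on a regular grid and using scikit-image's marching_cubes together with trimesh (threshold and grid bounds are illustrative defaults):

```python
import trimesh
from skimage import measure

def occupancy_to_mesh(occupancy, threshold=0.5, grid_min=-1.0, grid_max=1.0):
    """occupancy: (R, R, R) array of predicted occupancy probabilities."""
    verts, faces, normals, _ = measure.marching_cubes(occupancy, level=threshold)
    # Map voxel indices back to the coordinate range the model was queried on.
    verts = grid_min + verts / (occupancy.shape[0] - 1) * (grid_max - grid_min)
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
```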
Based on our initial experiments, several avenues for future work emerge. Firstly, the observed difficulty in effectively training the model using the combined Normal Map and Signed Distance Field (SDF) triplane representation suggests a need for further investigation. Exploring architectural modifications to the decoder and different feature fusion techniques could potentially improve the learning of the richer geometric features. Secondly, while the Binary Occupancy model demonstrated promising results, its quantitative performance (Mean Chamfer Distance of 0.200) indicates room for improvement compared to state-of-the-art methods. Further experimentation involving extended training durations, hyperparameter optimization, and potentially exploring more sophisticated data augmentation techniques for the sketch inputs could help bridge this performance gap. Additionally, investigating alternative input representations beyond edge maps, such as rough volumetric sketches or incorporating textual prompts, could broaden the model’s applicability as an architectural design tool.
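For reference, the Chamfer Distance reported above can be computed, up to normalization choices, as a symmetric nearest-neighbour distance between point sets sampled from the predicted and ground-truth meshes. A minimal sketch using SciPy follows; the exact sampling density and normalization in our evaluation may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two (N, 3) point sets sampled from the
    predicted and ground-truth meshes, using squared nearest-neighbour distances."""
    d_ab, _ = cKDTree(pts_b).query(pts_a)   # nearest ground-truth point per prediction
    d_ba, _ = cKDTree(pts_a).query(pts_b)   # nearest prediction per ground-truth point
    return np.mean(d_ab ** 2) + np.mean(d_ba ** 2)
```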