3D Computer Vision has shifted from explicit geometric prediction to neural rendering and, most recently, to generative pipelines distilled from large-scale 2D priors. To understand the fundamental trade-offs in this domain, I conducted a series of experiments traversing three distinct paradigms: supervised single-view prediction, optimization-based neural rendering, and diffusion-guided generation. This progression mirrors the broader field's movement: starting with discrete geometric supervision, moving toward continuous implicit fields learned from 2D observations, and finally leveraging the semantic knowledge of foundation models to hallucinate 3D structures. The following analysis details the architectural implementations and insights gained from treating 3D generation as a problem of representation, differentiable rendering, and optimization.
In my first experiment, I treated 3D generation as a direct supervised learning problem, aiming to map a single RGB image to 3D structures using voxels, point clouds, and triangular meshes. To implement this, I utilized a shared backbone architecture where a pretrained ResNet-18 encoder compresses the input RGB image into a compact 512-dimensional feature vector. This vector serves as the foundation for three distinct, parallel decoding branches. For volumetric reconstruction, the feature vector is projected and upsampled through a stack of 3D transposed convolutions to generate a 32x32x32 voxel grid, supervised by Binary Cross Entropy with Logits Loss. In parallel, the point cloud branch employs a Multi-Layer Perceptron (MLP) to directly regress the 3D coordinates of a fixed number of points, optimizing the geometry via Chamfer Loss. Finally, the mesh branch adopts a template-deformation approach in which an MLP predicts vertex displacements for a spherical template; this branch is trained with a composite objective of Chamfer Loss for geometric alignment and Laplacian Smoothness Loss to keep the resulting surface coherent. Comparing these outputs side by side highlighted the inherent trade-offs: voxels are computationally expensive and resolution-limited, point clouds are efficient but lack topology, and meshes provide coherent surfaces but rely heavily on template priors.
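To make the shared-backbone design concrete, below is a minimal sketch of how the encoder and the three decoding heads could be wired together in PyTorch. The layer widths, point count (`n_points=1024`), and template vertex count (`n_verts=642`, an icosphere at subdivision level 3) are illustrative assumptions rather than the exact configuration used in the experiment.

```python
import torch
import torch.nn as nn
import torchvision

class SingleViewTo3D(nn.Module):
    def __init__(self, n_points=1024, n_verts=642):
        super().__init__()
        # Pretrained ResNet-18 backbone; drop the classification head so the
        # encoder emits a 512-d global feature per image.
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])

        # Voxel branch: project the feature to a small 3D grid, then upsample
        # with transposed 3D convolutions to a 32x32x32 occupancy-logit grid.
        self.voxel_fc = nn.Linear(512, 256 * 4 * 4 * 4)
        self.voxel_decoder = nn.Sequential(
            nn.ConvTranspose3d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),
        )

        # Point branch: MLP regressing the xyz coordinates of n_points points.
        self.point_decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3),
        )

        # Mesh branch: MLP predicting per-vertex offsets for a sphere template.
        self.mesh_decoder = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
        )
        self.n_points, self.n_verts = n_points, n_verts

    def forward(self, images):                           # images: (B, 3, H, W)
        feat = self.encoder(images).flatten(1)           # (B, 512)
        voxels = self.voxel_decoder(
            self.voxel_fc(feat).view(-1, 256, 4, 4, 4))  # (B, 1, 32, 32, 32)
        points = self.point_decoder(feat).view(-1, self.n_points, 3)
        offsets = self.mesh_decoder(feat).view(-1, self.n_verts, 3)
        return voxels, points, offsets
```

In training, the voxel logits would be supervised with `nn.BCEWithLogitsLoss` against ground-truth occupancy, while the predicted points and the offset sphere template would be supervised with a Chamfer distance (e.g., `pytorch3d.loss.chamfer_distance`), with `pytorch3d.loss.mesh_laplacian_smoothing` regularizing the mesh branch.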
Moving away from direct prediction, I explored Neural Rendering, where the 3D scene is represented as a continuous function optimized to match 2D observations. This method shifts the focus from predicting geometry to learning a field that renders correctly. I implemented a structured "Volume SDF" in which, instead of a massive black-box MLP, the implicit field was defined by analytic primitives such as spheres or boxes. Their defining parameters, namely shape, pose, and size, were exposed as learnable tensors and refined via gradient descent. This SDF was converted into a renderable density field via a trainable mapping controlled by $\alpha$ and $\beta$ parameters and associated with a learnable color feature vector. Training this representation required only 2D images: a differentiable volume renderer composited samples along rays to predict pixel colors, and an MSE reconstruction loss against ground-truth images allowed gradients to backpropagate through the entire rendering pipeline. This experiment demonstrated why implicit fields feel "generative": they represent a resolution-independent, continuous 3D function that can be queried from arbitrary viewpoints, unlike the fixed discretization of supervised methods.
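As an illustration, here is a minimal single-primitive version of that idea: one learnable sphere whose SDF is converted to density with a Laplace-CDF mapping in the style of VolSDF and rendered by alpha-compositing along rays. The exact mapping, the uniform sampling scheme, and the constant per-volume color are assumptions made to keep the sketch short.

```python
import torch
import torch.nn as nn

class SphereSDFVolume(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable primitive parameters: center and radius of one sphere.
        self.center = nn.Parameter(torch.zeros(3))
        self.radius = nn.Parameter(torch.tensor(0.5))
        # Learnable SDF-to-density parameters and a per-volume color.
        self.alpha = nn.Parameter(torch.tensor(10.0))
        self.beta = nn.Parameter(torch.tensor(0.05))
        self.color = nn.Parameter(torch.rand(3))

    def sdf(self, points):                            # points: (N, 3)
        return torch.linalg.norm(points - self.center, dim=-1) - self.radius

    def density(self, points):
        # sigma(x) = alpha * Psi_beta(-sdf(x)), with Psi_beta the CDF of a
        # zero-mean Laplace distribution of scale beta (VolSDF-style mapping).
        s = -self.sdf(points)
        beta = self.beta.abs() + 1e-6
        psi = torch.where(s <= 0,
                          0.5 * torch.exp(s / beta),
                          1.0 - 0.5 * torch.exp(-s / beta))
        return self.alpha.abs() * psi                 # (N,)

    def render_rays(self, origins, dirs, n_samples=64, near=0.1, far=4.0):
        # Sample points along each ray, query density, and alpha-composite.
        t = torch.linspace(near, far, n_samples, device=origins.device)
        pts = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]
        sigma = self.density(pts.reshape(-1, 3)).view(pts.shape[:2])  # (R, S)
        delta = (far - near) / n_samples
        alpha = 1.0 - torch.exp(-sigma * delta)       # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1),
            dim=-1)[:, :-1]                           # accumulated transmittance
        weights = alpha * trans                       # (R, S)
        return weights.sum(-1, keepdim=True) * self.color   # (R, 3), constant color
```

An MSE loss between the colors returned by `render_rays` and the corresponding ground-truth pixels is then enough to drive gradients back into the sphere parameters, the $\alpha$/$\beta$ mapping, and the color.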
In the final experiment, I linked classical reconstruction with modern generative AI by implementing a pipeline where a 3D representation is optimized by the "knowledge" contained within a frozen text-to-image model. I designed a "Teacher-Student" framework using Score Distillation Sampling (SDS), where the 3D representation acts as a "student" optimized by a frozen Stable Diffusion 2.1 "teacher". The training process involved rendering views from the 3D model, encoding them into latent space, perturbing them with noise, and passing them to the diffusion model's UNet to predict noise residuals. These residuals provided the gradients needed to update the underlying 3D parameters, specifically the density and color outputs of an MLP with harmonic embeddings. To ensure 3D consistency and prevent the "Janus problem," I employed view-dependent text embeddings, dynamically modifying the text prompt conditioning based on the camera's azimuth during the optimization loop. This approach highlights the power of using 2D foundation models as differentiable objective functions, allowing for geometry generation from text alone.
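The core of that loop can be sketched as a single SDS step against the frozen teacher. The snippet below uses the Hugging Face diffusers API; the azimuth thresholds, the timestep range, and the omission of classifier-free guidance and the usual $w(t)$ weighting are simplifying assumptions, and `rendered_rgb` stands in for a differentiable render produced by the 3D MLP.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base").to(device)
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
for p in unet.parameters():
    p.requires_grad_(False)            # the teacher stays frozen

def view_dependent_embedding(base_prompt, azimuth_deg):
    # Crude view-dependent conditioning to mitigate the Janus problem:
    # append a direction phrase chosen from the camera azimuth.
    if abs(azimuth_deg) < 60:
        suffix = ", front view"
    elif abs(azimuth_deg) > 120:
        suffix = ", back view"
    else:
        suffix = ", side view"
    tokens = pipe.tokenizer(base_prompt + suffix, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt")
    return pipe.text_encoder(tokens.input_ids.to(device))[0]  # (1, 77, 1024)

def sds_loss(rendered_rgb, text_emb):
    """rendered_rgb: (1, 3, 512, 512) differentiable render in [0, 1] on `device`."""
    # Encode the render into the teacher's latent space.
    latents = vae.encode(rendered_rgb * 2 - 1).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    # Perturb the latents with noise at a random timestep.
    t = torch.randint(20, 981, (1,), device=device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    # The frozen UNet predicts the noise; its residual is the SDS signal.
    with torch.no_grad():
        noise_pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    grad = noise_pred - noise
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`, so
    # backward() pushes it through the VAE encoder and the differentiable
    # renderer into the 3D parameters.
    return (grad * latents).sum()
```

Each optimization step then renders a view at a sampled azimuth, evaluates `sds_loss(render, view_dependent_embedding(prompt, azimuth))`, and backpropagates, so that only the density and color MLP on the 3D side receives updates.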
Together, these three experiments provided a comprehensive overview of the modern 3D vision landscape, progressing from Explicit Reconstruction, where geometry is predicted directly from data, to Implicit Neural Fields, where geometry is an emergent property of optimizing a continuous function, and finally to Generative Optimization, where 3D structures are distilled from 2D semantic priors. The overarching lesson is that learning in 3D is converging on a unified principle: the combination of a flexible underlying representation—whether voxels, SDFs, or MLPs—and a differentiable rendering engine. This pairing allows us to apply powerful optimization signals, be it geometric loss, photometric consistency, or semantic scores from diffusion models, to reconstruct and generate complex 3D worlds.