In addressing software dependencies, we faced intricate incompatibility issues among CUDA, PyTorch, and cuDNN, all essential for the renderer to function properly. We also spent time consulting the official documentation for CUDA and cuDNN. We ended up using CUDA 11.8, cuDNN 8.0, and PyTorch 11.6.
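As a quick sanity check (not part of the original codebase), a snippet like the following prints the CUDA and cuDNN builds that PyTorch actually sees, which is how most of these version mismatches surface in practice:

```python
# Minimal environment check: confirm the PyTorch build, the CUDA toolkit it was
# compiled against, and the cuDNN version it can load.
import torch

print("PyTorch:", torch.__version__)            # installed PyTorch build
print("CUDA (build):", torch.version.cuda)       # CUDA toolkit PyTorch was compiled against
print("cuDNN:", torch.backends.cudnn.version())  # cuDNN version visible to PyTorch
print("GPU available:", torch.cuda.is_available())

# The renderer needs a working CUDA device and a driver that matches the toolkit.
assert torch.cuda.is_available(), "CUDA device not visible; check driver/toolkit compatibility."
```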
We spent time digesting the codebase, since a solid understanding of the hierarchical code structure reduces our debugging work and gives us a better starting point for modification and optimization. We identified the following core components, on which we spent most of our time understanding the mechanisms and looking for possible flaws and problems.
Zero1to3 serves as a guidance mechanism during the training of NeRF. It essentially provides additional cues to the NeRF model, ensuring that the rendered outputs remain consistent across multiple novel views rather than only the front view. In essence, zero1to3 is a pre-trained model that takes views rendered by NeRF from different camera poses and computes a loss between each sampled view and the corresponding prediction derived from the reference view. Notably, zero1to3 works in the latent space: it does not directly predict what the current view should look like. Instead, it predicts adjustments (in the form of noise) to the latent representations, which are then used to guide the NeRF's training. This guidance is conditioned on the difference, i.e., the relative camera transform, between the current view and the reference view.
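To make the mechanism concrete, below is a minimal sketch of a score-distillation-style guidance step in latent space. The callable `predict_noise` stands in for the pre-trained Zero-1-to-3 network and `latents` for the encoded NeRF rendering; these names are placeholders, not the actual interface of the codebase.

```python
import torch

def zero123_guidance_step(latents, ref_image, rel_pose, t, alphas_cumprod, predict_noise):
    """One guidance step in latent space (sketch).

    latents       : latent encoding of the NeRF-rendered novel view (requires grad).
    predict_noise : placeholder callable wrapping the pre-trained Zero-1-to-3 UNet,
                    conditioned on the reference image and the relative camera pose.
    """
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t]

    # Standard diffusion forward process: corrupt the latents at timestep t.
    noisy_latents = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise

    with torch.no_grad():
        # The frozen guidance model predicts the injected noise, given the
        # reference view and the relative pose of the sampled camera.
        noise_pred = predict_noise(noisy_latents, t, ref_image, rel_pose)

    # The "adjustment" pushed back into NeRF is a weighted difference between
    # predicted and injected noise; no gradient flows through the diffusion model.
    w = 1.0 - a_t
    grad = w * (noise_pred - noise)

    # Surrogate loss whose gradient w.r.t. the latents is exactly `grad`.
    return (grad.detach() * latents).sum()
```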
We made an important upgrade here, transitioning from the CarveKit library to Clipdrop by Stability AI for background removal. This shift reduced unwanted truncation of the target object and, more importantly, the residual background. Initially we made the change because of the evident truncation and residue issues, but the fix turned out to matter for the final result: the guidance model treats any residual background as part of the object, greatly jeopardizing the training of NeRF.
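For illustration, the background-removal call can be as simple as the sketch below; the endpoint URL, header name, and form-field name are assumptions based on Clipdrop's public API description and should be verified against the official documentation.

```python
import requests

def remove_background(image_path: str, api_key: str, out_path: str) -> None:
    # Send the input image to the remove-background endpoint and save the
    # returned cutout (PNG with transparent background).
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://clipdrop-api.co/remove-background/v1",  # assumed endpoint
            files={"image_file": f},                          # assumed field name
            headers={"x-api-key": api_key},                   # assumed auth header
        )
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)
```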
The core idea of this framework somewhat resembles a paradigm in which training is guided by partial and noisy labels and the objective is to learn robustly despite label imperfections. There is, however, a distinctive difference in our task. Our scenario is highly constrained: we work with only one front-view image, which serves as our partial label, yet we still care about the quality of the output. Given the bottleneck identified in the previous section, we know the labels produced by preprocessing are defective, so refining and perfecting them becomes pivotal, as this directly influences the quality of our 3D reconstructions. Within this bottleneck, the crux hinges on two main operations (see the sketch after the list):
Depth Estimation: Managed efficiently by the DPT class from the dpt library.
Normal Estimation: Also handled by the DPT class from the dpt library.
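As a point of reference, the sketch below runs a comparable DPT depth model through the publicly documented MiDaS torch.hub entry point; the dpt library used in our pipeline wraps a similar model, but its exact class names and interface may differ. Normal estimation follows the same pattern with a normal-prediction head.

```python
import cv2
import torch

# Load a DPT-based depth model and its matching preprocessing transform.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
midas.eval()

# Read the front-view image (path is illustrative) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("front_view.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    depth = midas(transform(img))  # (1, H', W') relative depth prediction
    # Resize the prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze()
```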
Figures 2–5.
Single Image to 3D Reconstruction poses significant challenges, especially when dealing with hand-drawn cartoon images such as the iconic Pikachu. Although a hand-drawn image offers an artistic impression of a 3D subject, the trained models often exhibit flattened features and lack vital information, particularly on the back of the body. This deficiency arises because hand-drawn images simulate 3D effects through shading, so the estimated depth and normal maps lose essential depth information and appear flattened compared to those of true 3D models. This limitation highlights the inherent challenge of converting 2D hand-drawn images into accurate 3D representations.
To address this issue, we investigated the ordering of background removal in the pipeline and found that training the model with a modified order yields thicker representations. As depicted in Fig. 5, this approach significantly improves Pikachu's overall form. However, as shown in the last column of Fig. 5, the facial region can still appear abruptly flattened because the front image saturates during training, aligning entirely with the flat information in the input. Consequently, achieving accurate facial depth for hand-drawn images remains a direction for future exploration. This study sheds light on the challenges and potential solutions in Single Image to 3D Reconstruction, offering insights for improving the accuracy and fidelity of 3D representations derived from hand-drawn images. By further investigating the impact of background-removal ordering and refining the training process, we aim to enhance the quality and depth perception of 3D reconstructions from hand-drawn inputs, bridging the gap between 2D artistic expression and accurate 3D representation.