JUST NeRF IT!

Omri Green, Dheeraj Bhogisetty, Siyuan Huang, Shreya Bang, Joel John

OBJECTIVE

  • The objective of this project is to create a continuous volumetric (3D) scene from a sparse (discrete) set of images of the scene taken from different viewing angles, and to generate novel views from this small number of input images. This can be achieved using NeRF (Neural Radiance Fields).


  • Further, we can add labels to the generated scene, enabling users to replace existing objects with other volumetric structures or simply add new objects to the scene.


INTRODUCTION

NeRF (Neural Radiance Fields) is a view-synthesis method that does not rely on a CNN. In this method a static scene is represented as a continuous 5D function that outputs the radiance emitted in each direction (θ, φ) at each point (x, y, z) in space, along with a density at each point. An MLP (multilayer perceptron) is then optimized to represent this function, which means the 3D scene itself is represented by the weights of the MLP. To render this Neural Radiance Field from a particular viewpoint we:

  • Pass camera rays through the scene and sample a set of 3D points along the ray.

  • Use these points and their corresponding viewing directions as input to the network to produce an output set of colors and densities.

  • Use classical volume rendering techniques to accumulate those colors and densities into a 2D image (see the sketch after this list).
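
The steps above can be condensed into a short volume-rendering routine. The following is a minimal, illustrative PyTorch sketch, not the exact code used in this project; the nerf_mlp argument is a hypothetical network that maps sampled points and viewing directions to colors and densities (see the architecture section below).

    import torch

    def render_rays(nerf_mlp, rays_o, rays_d, near=2.0, far=6.0, n_samples=64):
        # Sample depths uniformly along each ray between the near and far planes.
        t = torch.linspace(near, far, n_samples)                          # (S,)
        pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]  # (R, S, 3)
        dirs = rays_d[:, None, :].expand_as(pts)                          # (R, S, 3)

        # Query the network for color and density at every sample point.
        rgb, sigma = nerf_mlp(pts, dirs)          # rgb: (R, S, 3), sigma: (R, S)

        # Classical volume rendering: alpha-composite the samples along each ray.
        delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])         # (S,)
        alpha = 1.0 - torch.exp(-sigma * delta)                           # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
            dim=-1)[:, :-1]                                               # transmittance
        weights = alpha * trans                                           # per-sample contribution
        return (weights[..., None] * rgb).sum(dim=1)                      # (R, 3) pixel colors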

RELATED WORK

In recent literature there are two lines of work in volume rendering. One uses traditional discrete representations such as triangle meshes or voxel grids. The other uses multilayer perceptron networks to represent the 3D shape/scene, wherein the shape/scene is encoded in the weights of the network and each 3D spatial location is mapped to an implicit representation of the shape.

Representing 3D shapes using Neural Networks

In recent works, 3D shapes have been represented as level sets by optimizing deep networks that map (x, y, z) coordinates to signed distance functions. One such network is the "Occupancy Network", where the 3D surface is represented as the continuous decision boundary of a deep neural network classifier. However, these techniques are limited to simple shapes with low geometric complexity and tend to produce oversmoothed renderings. The proposed 5D radiance field can represent more complex, higher-resolution geometry.

Synthesising novel views and rendering

Given a large set of images, interpolation techniques can be used to generate novel views. With fewer images, computer vision techniques have to be used to synthesise novel views. One approach is to use mesh-based methods such as differentiable rasterizers or path tracers. Another is to use volumetric representations such as voxel grids. Recent methods have combined deep networks, CNNs, and voxel grids to synthesise novel views.

These methods have produced impressive results, but they are limited by the poor time and space complexity of discrete sampling. This limitation is overcome by representing the volume (shape/scene) within the weights of a fully connected neural network, which produces higher-quality renderings at a fraction of the storage cost of volumetric methods.

PIPELINE

The input images taken from the camera are first run through COLMAP to estimate the camera poses. From each camera pose we form the 5D input vector required by NeRF and pass it, along with the corresponding image, to the network. The network is then trained on these images. After training, novel views can be rendered by inputting a new position and viewing direction. Training itself reduces to a simple regression loop, as sketched below.
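
Conceptually, training regresses rendered ray colors against the ground-truth pixels. The following is a minimal sketch under the same assumptions as the render_rays sketch above; the ray batches and their target pixel colors are assumed to be precomputed from the COLMAP poses.

    import torch

    def train_step(model, optimizer, rays_o, rays_d, target_rgb):
        # Render a batch of rays with the current network weights.
        pred_rgb = render_rays(model, rays_o, rays_d)      # sketch defined earlier
        # Photometric loss: mean squared error against the true pixel colors.
        loss = torch.mean((pred_rgb - target_rgb) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()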

DATASET GENERATION AND COLMAP

  • 5D input vector - contains the spatial location and the viewing direction, i.e., the pose of the camera.

  • Camera pose - estimated using computer vision methods.

  • COLMAP - an open-source Structure-from-Motion (SfM) package.

  • COLMAP calculates the intrinsic and extrinsic camera parameters for the images. The output is a JSON file that maps each image to its pose. (A scripted example of a COLMAP run follows this list.)
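
A minimal sketch of how such a COLMAP run can be scripted, assuming the standard COLMAP command-line interface is installed; the exact commands and the conversion of the resulting poses to JSON may differ in our pipeline (Instant NeRF, for instance, ships a colmap2nerf.py helper for that last step).

    import subprocess

    def run_colmap(image_dir: str, workspace: str) -> None:
        # Estimate camera poses for a folder of images with the COLMAP CLI.
        db = f"{workspace}/database.db"
        # 1. Detect and describe keypoints in every image.
        subprocess.run(["colmap", "feature_extractor",
                        "--database_path", db, "--image_path", image_dir], check=True)
        # 2. Match features between all image pairs.
        subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
        # 3. Incremental SfM: recover intrinsics and extrinsics (the poses).
        subprocess.run(["colmap", "mapper", "--database_path", db,
                        "--image_path", image_dir,
                        "--output_path", f"{workspace}/sparse"], check=True)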

NeRF ARCHITECTURE

  • The network takes in a 5D vector that contains the spatial location and viewing direction, and outputs the volume density and view-dependent emitted radiance at that point. The network can be divided into two parts: the density network and the color network.

  • The density network consists of 8 fully connected layers of 256 neurons each, with ReLU activations. Its output is the volume density and a 256-dimensional feature vector. This feature vector is concatenated with the ray's viewing direction and passed to the color network.

  • The color network consists of a single layer of 128 neurons with ReLU activations, which outputs the view-dependent RGB color. (A sketch of this architecture follows.)
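
A minimal PyTorch sketch of this two-part architecture. The layer sizes follow the description above; other details (the sigmoid on the RGB output, the ReLU on density) are assumptions, and the positional encoding and skip connection used by the original NeRF are omitted for brevity.

    import torch
    import torch.nn as nn

    class NeRFMLP(nn.Module):
        def __init__(self, pos_dim=3, dir_dim=3, width=256):
            super().__init__()
            # Density network: 8 fully connected layers of 256 neurons with ReLU.
            layers, in_dim = [], pos_dim
            for _ in range(8):
                layers += [nn.Linear(in_dim, width), nn.ReLU()]
                in_dim = width
            self.density_net = nn.Sequential(*layers)
            self.sigma_head = nn.Linear(width, 1)        # volume density
            self.feature_head = nn.Linear(width, width)  # 256-d feature vector
            # Color network: one 128-neuron layer on [feature, viewing direction].
            self.color_net = nn.Sequential(
                nn.Linear(width + dir_dim, 128), nn.ReLU(),
                nn.Linear(128, 3), nn.Sigmoid())         # view-dependent RGB

        def forward(self, x, d):
            h = self.density_net(x)
            sigma = torch.relu(self.sigma_head(h)).squeeze(-1)
            feat = self.feature_head(h)
            rgb = self.color_net(torch.cat([feat, d], dim=-1))
            return rgb, sigma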

DEMO AND OBSERVATIONS

  • During our testing we made several observations about NeRF, listed below.

  • Reflections can be captured by NeRF by its very nature: because it models the behavior of light for every viewing angle, it learns how reflections appear in the environment.

  • Fully converged scenes require input images from all directions, because the algorithm cannot infer how something looks from a direction for which it has no data.

  • Convergence for real-world scenes is typically not as good as for synthetic scenes. This is likely due to slight imperfections introduced while capturing the dataset, and because real-world datasets often contain fewer images than synthetic ones.

  • NeRF also does not work for dynamic scenes in its current implementation: it learns from static images, and since those images are not all taken at exactly the same time, any motion in the scene introduces inconsistencies. This could be mitigated by having many cameras capture images simultaneously, but that is impractical for most situations.

CHALLENGES AND LIMITATIONS

While working on this final project we encountered a multitude of problems and challenges. The initial NeRF implementation we used had a very long run time; processing a single image would have taken several days. To solve this we switched to Instant NeRF, a recent iteration of the NeRF algorithm that is much faster, achieving comparable results in minutes rather than days. However, Instant NeRF introduced other issues, including the fact that it is written in CUDA, which made it nearly impossible to add our extra heuristic to the algorithm. Convergence for real-world scenes was also difficult to achieve, requiring over 40 images to reach a satisfactory result. Finally, Instant NeRF only runs satisfactorily on newer GPUs such as the RTX 30 series.

FUTURE SCOPE

In the future we plan to enable users to add, remove, and/or replace objects in the 3D volumetric scene generated by NeRF. We hope to achieve this by using Tiny Object Loader to add 3D objects to the scene. Removing objects will be done by modifying the NeRF network to include an object label which, when detected, makes the labelled portion of the scene 'see-through' for all intents and purposes. To replace objects, a combination of the two approaches will be used: first the object is removed, then another object is placed at its centroid, allowing for a fully modifiable scene for use in applications ranging from VR to home improvement.