Semantic implicit neural scene representations through feature fields

Demircan Tas, Rohit Sanatani / 6.S980 Special Subject - Machine Learning for Inverse Graphics

Abstract

We use two-dimensional semantic segmentation models to train three-dimensional neural radiance fields that output semantic classes for each position in a space.

Self-supervised 2D image feature extractors can generate attention-based latent-space representations with semantic structure (Caron et al. 2021). Neural radiance fields, by contrast, are neither object-centric nor semantically meaningful. Feature fields can be optimized to map points in space to the semantic features extracted by such self-supervised models (Kobayashi et al. 2022).

For our preliminary model, we train NeRFs using the outputs of supervised semantic segmentation models to create semantic fields.

Problem Definition

While there has been significant progress in the development of implicit neural field representations, such representations do not allow for semantic editing. The task of identifying and modifying specific semantic objects defined by feature fields has applications across different domains. Such workflows could lead to software that edits architectural spaces through semantic classes that a model was never explicitly trained to identify.

Semantic NeRF (Zhi et al. 2021) is able to predict both view-dependent color as well as semantic values for novel views.

Recent approaches using self-supervised vision transformers learn object- and instance-based latent codes, where specific latent codes correspond to particular instances of objects or semantic categories in image space (Atito et al. 2021, Caron et al. 2021, Chen et al. 2021, Li et al. 2021).

State-of-the-art Object Scene Representation Transformers (OSRTs) (Sajjadi et al. 2022) allow such latent codes to be learned from complex scenes with a high diversity of object types and backgrounds. Moreover, language-driven semantic segmentation models allow for the extraction of unseen categories not known to the model at training time. All of these advances, however, have been largely restricted to 2D feature extraction.
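To make the language-driven idea concrete: such models embed each pixel and a free-form text query into a shared feature space, and a pixel is assigned to the query whose embedding it is most similar to. The sketch below illustrates that mechanism only; the per-pixel feature map and query embeddings are hypothetical stand-ins, not the API of any specific model.

```python
import torch
import torch.nn.functional as F

def query_conditional_mask(pixel_features: torch.Tensor,
                           query_embeddings: torch.Tensor,
                           threshold: float = 0.5) -> torch.Tensor:
    """Label pixels by cosine similarity to text-query embeddings.

    pixel_features:   (H, W, D) per-pixel features from a language-aligned model
                      (hypothetical input, e.g. an LSeg-style encoder output).
    query_embeddings: (Q, D) embeddings of free-form text queries in the same space.
    Returns an (H, W) tensor of query indices, with -1 where no query passes the threshold.
    """
    feats = F.normalize(pixel_features, dim=-1)
    queries = F.normalize(query_embeddings, dim=-1)
    sims = torch.einsum("hwd,qd->hwq", feats, queries)   # cosine similarities per pixel/query
    best_sim, best_idx = sims.max(dim=-1)
    return torch.where(best_sim > threshold, best_idx, torch.full_like(best_idx, -1))
```

Because the queries are supplied at inference time, categories that were never part of a fixed label set can still be segmented this way.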

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R. and Ng, R., 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), pp.99-106.

Model Architecture

We use the outputs of existing supervised semantic segmentation models to optimize a 3D feature field, alongside the optimization of a neural radiance field.

We propose a study in which the outputs of existing 2D self-supervised transformer models such as DINO (Caron et al. 2021) are used to optimize a 3D feature field (Kobayashi et al. 2022), alongside the optimization of a neural radiance field. Such a field is trained to output a feature vector for each 3D coordinate, in addition to density and color. In our preliminary experiments, training is carried out using the outputs of a pretrained 2D pixel-level model, the dense prediction transformer (DPT). For semantic editing, query-conditional segmentation is carried out to isolate 3D regions for geometric transformation or deletion. Such a feature field allows for the semantic decomposition of the 3D space, and thus makes it possible to semantically edit or manipulate objects within the scene.
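As a concrete illustration, the sketch below shows one way such a field can be structured: a NeRF-style MLP that predicts density and color per point, extended with an additional head for per-point semantic logits. This is a minimal PyTorch sketch under our own assumptions; the SemanticNeRF class name, layer sizes, and input encodings are illustrative, not the exact architecture of the cited works.

```python
import torch
import torch.nn as nn

class SemanticNeRF(nn.Module):
    """Minimal NeRF-style MLP with an extra semantic head (illustrative sketch)."""

    def __init__(self, pos_dim=63, dir_dim=27, hidden=256, num_classes=21):
        super().__init__()
        # Trunk conditioned on the (positionally encoded) 3D position.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # view-independent density
        self.rgb_head = nn.Sequential(                     # view-dependent color
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )
        self.sem_head = nn.Linear(hidden, num_classes)     # view-independent semantic logits

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))
        sem_logits = self.sem_head(h)  # supervised via 2D segmentation after volume rendering
        return sigma, rgb, sem_logits
```

The semantic head here outputs class logits directly, matching our supervised preliminary setup; replacing it with a D-dimensional feature head would correspond to the DINO-feature variant described above.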

Zhi, S., Laidlow, T., Leutenegger, S. and Davison, A.J., 2021. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15838-15847).

Data generation

For our first test, we took pictures of common objects indoors. While the data was good enough for camera pose estimation, the number of images and the stability of the lighting were not sufficient to train a coherent NeRF.

Camera Pose Estimation (Agisoft MetaShape)

Point Cloud Reconstruction

input

Supervised Semantic Segmentation - Dense Prediction Transformer (DPT)

Ranftl, R., Bochkovskiy, A. and Koltun, V., 2021. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12179-12188).
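For reference, 2D segmentation maps of this kind can be obtained from off-the-shelf DPT checkpoints. The snippet below is a minimal sketch using the Hugging Face transformers implementation, assuming the Intel/dpt-large-ade checkpoint (ADE20K classes); it is not necessarily the exact pipeline we used.

```python
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForSemanticSegmentation

# Assumed checkpoint: DPT fine-tuned for semantic segmentation on ADE20K.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade").eval()

image = Image.open("frame_0001.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (1, num_classes, h, w)

# Upsample logits back to the input resolution and take the per-pixel argmax.
logits = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False)
labels = logits.argmax(dim=1)[0]             # (H, W) class indices
```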

Data generation - Synthetic Data in Maya

For the second test, we used our parametric MIT classroom scene built in Maya. This approach lets us test for generalization across different scenes in the future. Given the time constraints of this project, we picked one classroom and took 200 renders of the interior. This experiment did not yield satisfactory results due to mismatches in camera extrinsics.
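A likely source of such mismatches is the camera convention: the renderer's export and a NeRF pipeline can disagree on axis orientation and on whether a matrix maps camera to world or world to camera. The sketch below shows the kind of conversion involved, assuming an OpenCV-style world-to-camera matrix being converted to the OpenGL-style camera-to-world matrix that many NeRF codebases expect; the exact flips depend on the exporter and are an assumption here.

```python
import numpy as np

def opencv_w2c_to_opengl_c2w(w2c: np.ndarray) -> np.ndarray:
    """Convert a 4x4 OpenCV-style world-to-camera matrix into an OpenGL-style
    camera-to-world matrix (x right, y up, camera looking down -z).

    This is an assumed convention mapping; verify against the exporter before use.
    """
    c2w = np.linalg.inv(w2c)
    # Flip the y and z camera axes: OpenCV uses y-down / z-forward,
    # while OpenGL-style NeRF conventions use y-up / z-backward.
    c2w[:3, 1] *= -1.0
    c2w[:3, 2] *= -1.0
    return c2w
```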

Parametric Classroom Generator (Autodesk Maya), rendered using Arnold for Maya

synthetic RGB data

synthetic semantic segmentation

synthetic depth

NeRF Outputs

RGB from our NeRF

Semantic Segmentation from our NeRF

Replica Dataset

To establish a baseline for our approach, we used the Replica dataset, which provides indoor scenes with matching semantic labels and camera extrinsics and is commonly used in NeRF research.

Replica RGB data

Replica semantic segmentation

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S. and Clarkson, A., 2019. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.

Resulting NeRF

Even with relatively few training steps, we were able to train a NeRF that outputs coherent RGB and semantic values.
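The semantic outputs are supervised in the same way as color: per-point logits are composited along each ray with the volume-rendering weights, and the rendered class distribution is compared against the 2D label. Below is a minimal sketch of this step, assuming per-ray sample densities, colors, and logits are already available; the variable names and the 0.04 loss weight are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def composite_and_losses(sigma, rgb, sem_logits, deltas, gt_rgb, gt_label, sem_weight=0.04):
    """Volume-render color and semantic logits along rays, then compute losses.

    sigma:      (R, S)     per-sample densities
    rgb:        (R, S, 3)  per-sample colors
    sem_logits: (R, S, C)  per-sample semantic logits
    deltas:     (R, S)     distances between consecutive samples
    gt_rgb:     (R, 3)     ground-truth pixel colors
    gt_label:   (R,)       ground-truth (possibly noisy) 2D class labels
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                        # per-sample opacity
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                         # volume-rendering weights

    rendered_rgb = (weights[..., None] * rgb).sum(dim=1)            # (R, 3)
    rendered_logits = (weights[..., None] * sem_logits).sum(dim=1)  # (R, C)

    rgb_loss = F.mse_loss(rendered_rgb, gt_rgb)
    sem_loss = F.cross_entropy(rendered_logits, gt_label)
    return rgb_loss + sem_weight * sem_loss
```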

RGB from our NeRF

Semantic Segmentation from our NeRF

With noisy 2D segmentation as input

We extended our experiments to include noisy segmentation data created using the dense prediction transformer. The resulting NeRF, while relatively more stable, still exhibits semantic ambiguity: the same point can return different semantic labels across renders. We attribute this to multiple classes being assigned to the same coordinate from different viewing angles.

input 2D segmentation (dense prediction transformer)

semantic segmentation from our NeRF

semantic uncertainty from our NeRF
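One way to expose this ambiguity quantitatively is to treat the rendered per-pixel class distribution as a categorical distribution and report its entropy, which is how we read the uncertainty visualization above. The sketch below (names are ours) shows the computation under that assumption.

```python
import torch

def semantic_entropy(rendered_logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel entropy of the rendered class distribution.

    rendered_logits: (..., C) volume-rendered semantic logits.
    Returns entropy in nats; higher values indicate more view-dependent ambiguity.
    """
    probs = torch.softmax(rendered_logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-10))).sum(dim=-1)
```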

Future Work

We aim to resolve the neural rendering ambiguity, export the outputs as meshes, and extend our work to self-supervised semantic segmentation models. Acquiring separate meshes from images of a scene, together with sparse segmentation, would be valuable for architectural design studies.
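For the mesh output in particular, one plausible route is to sample the learned density field on a regular grid and run marching cubes. The sketch below assumes a query_density(points) function exposed by the trained field (our hypothetical interface) and uses scikit-image's marching_cubes; the iso-level would need tuning per scene.

```python
import numpy as np
from skimage import measure

def density_field_to_mesh(query_density, bbox_min, bbox_max, resolution=256, level=50.0):
    """Extract a triangle mesh from a trained density field via marching cubes.

    query_density: callable taking an (N, 3) array of points and returning (N,) densities
                   (assumed interface to the trained NeRF).
    bbox_min/max:  3-vectors bounding the region to mesh, in world coordinates.
    level:         density iso-value for the surface; scene-dependent.
    """
    bbox_min = np.asarray(bbox_min, dtype=np.float32)
    bbox_max = np.asarray(bbox_max, dtype=np.float32)

    xs = np.linspace(bbox_min[0], bbox_max[0], resolution)
    ys = np.linspace(bbox_min[1], bbox_max[1], resolution)
    zs = np.linspace(bbox_min[2], bbox_max[2], resolution)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)        # (R, R, R, 3)
    densities = query_density(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

    spacing = (bbox_max - bbox_min) / (resolution - 1)
    verts, faces, normals, _ = measure.marching_cubes(densities, level=level, spacing=tuple(spacing))
    verts += bbox_min  # shift vertices back into world coordinates
    return verts, faces, normals
```

Restricting the grid query to points whose rendered semantic label matches a chosen class would be the natural extension toward the per-class meshes mentioned above.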