Juan Luis Gonzalez Bello

About me

I am a passionate computer vision/ML/AI postdoctoral researcher at the Korea Advanced Institute of Science & Technology (KAIST), specializing in high- and low-level computer vision. I studied Electrical Engineering at UNAM and received my Ph.D. from KAIST in August 2023. With six years of research experience, I primarily work on cutting-edge topics such as novel-view synthesis, self-supervised monocular/stereo depth estimation, neural radiance fields, 3D Gaussian splatting, and image reconstruction. As first author, I have contributed to several noteworthy papers published in prestigious international conferences and journals, including ICIP, ICLR, NeurIPS, CVPR, and TPAMI. Beyond research, my professional experience ranges from project management and manufacturing at P&G (full-time) to computer vision research at Meta and Adobe (internships).

Email: juanluisgb at kaist dot ac dot kr

Office: Room #1106, N24 (LG Innovation Hall)

[LinkedIn] [Twitter] [Google Scholar] [CV] [GitHub]

News

ICLR2020: Single View Deep 3D Pan
NeurIPS2020: Forget About the LiDAR
CVPR2021: PLADE-Net
CVPR2024: NVSVDE-Net
IEEE Access: ProNeRF

Research interests

Novel-view synthesis, self-supervised monocular and stereo depth estimation, neural radiance fields, 3D Gaussian splatting, and image reconstruction.

Scientific publications

Novel View Synthesis with View-Dependent Effects from a Single Image

Juan Luis Gonzalez Bello and Munchurl Kim

CVPR2024

In this paper, we address single image-based novel view synthesis (NVS) and, for the first time, integrate view-dependent effects (VDE) into the process. Our approach leverages camera motion priors to model VDE, treating negative disparity as the representation of these effects in the scene. By identifying that specularities align with camera motion, we infuse VDEs into the input images by aggregating pixel colors along the negative-depth regions of the epipolar lines. Additionally, we introduce a 'relaxed volumetric rendering' approximation that enhances efficiency by computing densities in a single pass for NVS from single images. Notably, our method learns single-image NVS from image sequences alone, making it a fully self-supervised learning approach that requires no depth or camera pose annotations. We present extensive experimental results and show that our proposed method can learn NVS with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets.
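For intuition, here is a minimal sketch of the epipolar color aggregation idea, under strong simplifying assumptions that are not the paper's implementation: purely horizontal camera motion (so epipolar lines are image rows) and uniform blending weights (the actual method predicts per-pixel weights with a network).

```python
# Toy epipolar aggregation of colors at "negative-disparity" offsets (assumptions:
# horizontal motion only, uniform blending weights).
import numpy as np

def aggregate_vde(image, max_neg_disp=8, num_samples=8):
    """image: (H, W, 3) float array. Returns a rough VDE layer of the same shape."""
    acc = np.zeros_like(image)
    for d in np.linspace(1, max_neg_disp, num_samples):
        # sample colors along the (horizontal) epipolar line at increasing offsets
        acc += np.roll(image, -int(round(d)), axis=1)
    return acc / num_samples

vde_layer = aggregate_vde(np.random.rand(64, 128, 3))
print(vde_layer.shape)  # (64, 128, 3)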

ProNeRF: Learning Efficient Projection-Aware Ray Sampling for Fine-Grained Implicit Neural Radiance Fields [PDF]

Juan Luis Gonzalez Bello, Minh Quan Viet Vu, and Munchurl Kim

Accepted for publication in IEEE Access (April 2024)

Recent advances in neural rendering have shown that although computationally expensive and slow for training, implicit compact models can accurately learn a scene's geometries and view-dependent appearances from multiple views. To maintain such a small memory footprint but achieve faster inference times, recent works have adopted "sampler" networks that adaptively sample a small subset of points along each ray in the implicit neural radiance fields (NeRF), effectively reducing the number of network forward passes to render a ray color. Although these methods achieve up to a 10x reduction in rendering time, they still suffer from considerable quality degradation compared to vanilla NeRF. In contrast, we propose a new projection-aware neural radiance field model, referred to as ProNeRF, which provides an optimal trade-off between the memory footprint (similar to NeRF), speed (faster than HyperReel), and quality (better than K-Planes). ProNeRF is equipped with a novel projection-aware sampling (PAS) network together with a new training strategy for ray exploration and exploitation, allowing for efficient fine-grained particle sampling. Our exploration and exploitation training strategy allows ProNeRF to learn the color and density distributions of full scenes, while also learning efficient ray sampling focused on the highest-density regions.

ProNeRF yields state-of-the-art metrics, being 15 to 23x faster with 0.65dB higher PSNR than the vanilla NeRF and showing 0.95dB higher PSNR performance compared to the best published sampler-based method, HyperReel.
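The sketch below illustrates the general sampler-guided rendering idea (not ProNeRF itself): a small network predicts a handful of sample depths per ray, only those points are queried in the radiance field, and the colors are alpha-composited. The tiny MLPs and the mocked per-ray "projection-aware" features (`ray_feats`) are assumptions for illustration only.

```python
# Toy sampler-guided ray rendering sketch (assumed toy networks, random weights).
import torch
import torch.nn as nn

def tiny_mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

n_samples, feat_dim = 8, 16
sampler = tiny_mlp(3 + 3 + feat_dim, n_samples)   # (origin, direction, features) -> sample depths
field = tiny_mlp(3, 4)                            # 3D point -> (r, g, b, sigma)

def render_rays(rays_o, rays_d, ray_feats, near=2.0, far=6.0):
    # The sampler predicts a few depths per ray instead of dense stratified sampling
    t = torch.sigmoid(sampler(torch.cat([rays_o, rays_d, ray_feats], dim=-1)))
    t, _ = torch.sort(near + (far - near) * t, dim=-1)            # (R, S)
    pts = rays_o[:, None] + rays_d[:, None] * t[..., None]        # (R, S, 3)
    out = field(pts)
    rgb, sigma = torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])
    delta = torch.diff(t, dim=-1, append=torch.full_like(t[:, :1], 1e10))
    alpha = 1.0 - torch.exp(-sigma * delta)                       # standard alpha compositing
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    return (trans[..., None] * alpha[..., None] * rgb).sum(dim=1)  # (R, 3)

color = render_rays(torch.rand(4, 3), torch.rand(4, 3), torch.rand(4, feat_dim))
print(color.shape)  # torch.Size([4, 3])
```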

Self-Supervised SVDE from Videos with Depth Variance to Shifted Positional Information [PDF]

Juan Luis Gonzalez Bello, Jaeho Moon, and Munchurl Kim

IEEE Transactions on Image Processing (TIP, March 2024).

Recently, much attention has been drawn to learning the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion. One of the most challenging aspects of this task is handling independently moving objects, as they break the rigid-scene assumption. In this work, for the first time, we show that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. Our proposed moving object (MO) masks, which are induced by depth variance to shifted positional information (SPI) and referred to as "SPIMO" masks, are very robust and consistently remove the independently moving objects in the scenes, allowing for robust and consistent learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for depth discretization, improving the fine granularity and accuracy of the final aggregated depth maps. Finally, we employ existing boosting techniques in a new way to further self-supervise the moving object depths. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with 4 to 8x fewer parameters than the previous SOTA that learns from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.
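As a rough illustration of the SPIMO cue (the toy depth network and plain coordinate channels below are assumptions, not the paper's pipeline), one can run a depth network whose input includes positional channels under several horizontal shifts and threshold the per-pixel variance of the predicted depths:

```python
# Sketch: flag likely moving objects via depth variance to shifted positional inputs.
import torch
import torch.nn as nn

depth_net = nn.Sequential(nn.Conv2d(5, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())

def coord_grid(h, w, shift=0.0):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([xs + shift, ys], dim=0)[None]    # (1, 2, H, W)

def spimo_style_mask(image, shifts=(-0.2, 0.0, 0.2), thresh=0.05):
    h, w = image.shape[-2:]
    depths = [depth_net(torch.cat([image, coord_grid(h, w, s)], dim=1)) for s in shifts]
    var = torch.stack(depths, dim=0).var(dim=0)           # depth variance across shifted positions
    return (var > thresh).float()                         # 1 = likely independently moving

mask = spimo_style_mask(torch.rand(1, 3, 64, 64))
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```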

Shape of Depth: Blind Image Filtering for Depth Map Refinement

Juan Luis Gonzalez Bello, Kevin James Blackburn-Matzen, Simon Niklaus, Oliver Wang, Munchurl Kim 

Work made during Adobe Internship

For the first time, we propose to learn the shape of depth (SOD), a general geometric restoration filter that can be used to align depth maps to image edges. Contrary to the recent deep-learning-based depth enhancement literature, our method is blind to the target (unrefined) depth map. Non-blind methods, which are often trained with the depth as input, can overfit to the range and shape of the training depth maps, resulting in limited edge restoration and poor generalization to different types of input depth. The "blindness" of our method helps overcome this over-fitting by decoupling the prediction from the strong, incorrect geometry bias in the target depth maps. Our method produces refined depth maps with fine detail at boundaries and generalizes to new types of (unrefined) depth maps better than non-blind methods. To this end, we propose a SOD-net that predicts the parameters of our novel deformable adaptive joint bilateral filter, which can then be applied to low-quality depth maps for refinement. To show the effectiveness of our method, we present state-of-the-art results on the iBims-1 dataset for the depth restoration task, with the best trade-off between accuracy preservation and depth boundary refinement. We also show that these filters generalize to other piece-wise smooth tensors, such as segmentation maps.
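For context, the sketch below is the classical (non-learned, non-deformable) joint bilateral filter guided by the RGB image, i.e., the baseline behind this kind of edge-aware depth refinement; the paper's learned deformable adaptive filter is not reproduced here.

```python
# Classical joint bilateral filter: spatial Gaussian times a range weight computed
# from the *guide* image, applied to the (unrefined) depth map.
import numpy as np

def joint_bilateral(depth, guide, radius=4, sigma_s=3.0, sigma_r=0.1):
    """depth: (H, W), guide: (H, W, 3); returns an edge-aligned refined depth map."""
    H, W = depth.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    d = np.pad(depth, radius, mode="edge")
    g = np.pad(guide, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(depth)
    for i in range(H):
        for j in range(W):
            dp = d[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            gp = g[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # range weights come from the guide image, not from the depth itself
            rng = np.exp(-((gp - guide[i, j])**2).sum(-1) / (2 * sigma_r**2))
            w = spatial * rng
            out[i, j] = (w * dp).sum() / w.sum()
    return out

refined = joint_bilateral(np.random.rand(32, 32), np.random.rand(32, 32, 3))
print(refined.shape)  # (32, 32)
```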

Self-Supervised Deep Monocular Depth Estimation with Ambiguity Boosting [PDF]

Juan Luis Gonzalez Bello and Munchurl Kim 

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, November 2021)

We propose a novel two-stage training strategy with ambiguity boosting for the self-supervised learning of single-view depths from stereo images. Our proposed two-stage learning strategy first aims to obtain a coarse depth prior by training an auto-encoder network for a stereoscopic view synthesis task. This prior knowledge is then boosted and used to self-supervise the model in the second stage of training via our novel ambiguity boosting loss. Our ambiguity boosting loss is a confidence-guided data augmentation loss that improves the accuracy and consistency of the generated depth maps under several transformations of the single-image input. To show the benefits of the proposed two-stage training strategy with boosting, our two previous depth estimation (DE) networks, one with t-shaped adaptive kernels and the other with exponential disparity volumes, are extended with our new learning strategy, referred to as DBoosterNet-t and DBoosterNet-e, respectively. Our self-supervised DBoosterNets are competitive with, and in some cases even better than, the most recent supervised SOTA methods, and are remarkably superior to previous self-supervised methods for monocular DE on the challenging KITTI dataset. We present extensive experimental results showing the efficacy of our method for the self-supervised monocular DE task.
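A minimal sketch in the spirit of such a confidence-guided consistency loss (assumptions: a toy depth network and a single horizontal-flip transformation; the paper's boosted prior and its set of input transformations are richer than this):

```python
# Confidence-weighted consistency between predictions on the input and its flip.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())

def boosting_style_loss(image):
    d = net(image)
    d_flip = torch.flip(net(torch.flip(image, dims=[-1])), dims=[-1])
    diff = (d - d_flip).abs()
    conf = torch.exp(-diff / (d.detach() + 1e-6))   # high where the two predictions agree
    return (conf * diff).mean()                     # penalize inconsistency, weighted by confidence

loss = boosting_style_loss(torch.rand(1, 3, 64, 64))
loss.backward()
```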

PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [PDF]

Juan Luis Gonzalez Bello and Munchurl Kim 

CVPR2021

In this paper, we propose a self-supervised, single-view, pixel-level accurate depth estimation network called PLADE-Net. PLADE-Net is the first work to show unprecedented accuracy levels, exceeding 95% in terms of the d1 metric on the challenging KITTI dataset. Our PLADE-Net is based on a new network architecture with neural positional encoding and a novel loss function that borrows from the closed-form solution of the matting Laplacian to learn pixel-level accurate depth estimation from stereo images. Neural positional encoding allows our PLADE-Net to obtain more consistent depth estimates by letting the network reason about location-specific image properties such as lens and projection distortions. Our novel distilled matting Laplacian loss allows our network to predict sharp depths at object boundaries and more consistent depths in highly homogeneous regions. Our proposed method outperforms all previous self-supervised single-view depth estimation methods by a large margin on the challenging KITTI dataset. Furthermore, our PLADE-Net, naively extended for stereo inputs, outperforms the most recent self-supervised stereo methods, even without any advanced blocks such as 1D correlations, 3D convolutions, or spatial pyramid pooling. We present extensive ablation studies and experiments that support our method's effectiveness on the KITTI, CityScapes, and Make3D datasets.
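As a rough sketch of neural positional encoding at the input (the embedding size and toy prediction head below are assumptions, not PLADE-Net's architecture), normalized pixel coordinates can be passed through a small learned embedding and concatenated with the image before depth prediction:

```python
# Learned embedding of normalized (x, y) coordinates concatenated to the image.
import torch
import torch.nn as nn

class NPEDepthNet(nn.Module):
    def __init__(self, pe_dim=8):
        super().__init__()
        self.pos_embed = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, pe_dim))
        self.head = nn.Sequential(nn.Conv2d(3 + pe_dim, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())

    def forward(self, x):
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        pe = self.pos_embed(torch.stack([xs, ys], dim=-1))       # (H, W, pe_dim)
        pe = pe.permute(2, 0, 1)[None].expand(b, -1, -1, -1)     # (B, pe_dim, H, W)
        return self.head(torch.cat([x, pe], dim=1))

depth = NPEDepthNet()(torch.rand(2, 3, 64, 64))
print(depth.shape)  # torch.Size([2, 1, 64, 64])
```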

Forget About the LiDAR: Self-Supervised Depth Estimators with MED Probability Volumes [PDF]

Juan Luis Gonzalez Bello and Munchurl Kim 

NeurIPS2020

We propose a method to "Forget About the LiDAR" (FAL) for the training of monocular depth estimators from stereo images, with Mirrored Exponential Disparity (MED) probability volumes, from which we obtain geometrically inspired occlusion maps with our novel Mirrored Occlusion Module (MOM). Contrary to previous methods that learn single image depth estimation (SIDE) by regressing disparity in the linear space, our network, called FAL-net, regresses disparity by binning it into the exponential space, which allows for better detection of distant and nearby objects. We define a two-step training strategy for our FAL-net: it is first trained for view synthesis and then fine-tuned for depth estimation with our MOM. Our FAL-net is remarkably lightweight and outperforms the previous state-of-the-art methods with 8x fewer parameters and 3x faster inference speeds on the challenging KITTI dataset. To the best of our knowledge, the presented method performs best among all previous self-supervised methods.
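The core binning idea can be sketched as follows (toy numbers; the exact level spacing and volume size used by FAL-net may differ): disparity levels are spaced exponentially, the network outputs a probability volume over those levels, and the final disparity is the per-pixel expectation over the volume.

```python
# Exponentially spaced disparity levels and expectation over a probability volume.
import math
import torch

def exp_disparity_levels(d_min=0.01, d_max=0.3, n=33):
    # uniform in log space -> exponentially spaced disparities
    return torch.exp(torch.linspace(math.log(d_min), math.log(d_max), n))

levels = exp_disparity_levels()                               # (N,)
logits = torch.randn(1, len(levels), 32, 32)                  # stand-in network output (B, N, H, W)
prob = torch.softmax(logits, dim=1)                           # disparity probability volume
disparity = (prob * levels.view(1, -1, 1, 1)).sum(dim=1)      # (B, H, W) expected disparity
print(disparity.shape)  # torch.Size([1, 32, 32])
```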

AdaMM-DepthNet: Unsupervised Adaptive Depth Estimation Guided by Min and Max Depth Priors for Monocular Images

Juan Luis Gonzalez Bello and Munchurl Kim 

2020 Fall Conference of the Korean Institute of Broadcast and Media Engineers (KIBME)

Unsupervised deep learning methods have shown impressive results for the challenging monocular depth estimation task, a field of study that has gained attention in recent years. A common approach for this task is to train a deep convolutional neural network (DCNN) via an image synthesis sub-task, where additional views are utilized during training to minimize a photometric reconstruction error. Previous unsupervised depth estimation networks are trained within a fixed depth estimation range, irrespective of its possible range for a given image, leading to suboptimal estimates. To overcome this limitation, we first propose an unsupervised adaptive depth estimation method guided by minimum and maximum (min-max) depth priors for a given input image. Incorporating min-max depth priors can drastically reduce the depth estimation complexity and produce depth estimates with higher accuracy. Moreover, we propose a novel network architecture for adaptive depth estimation, called AdaMM-DepthNet, which incorporates the min-max depth priors at its front end. The same extension is also made to Monodepth and Deep3D to show the effectiveness of the min-max depth priors for unsupervised depth estimation. Intensive experimental results demonstrate that adaptive depth estimation can significantly boost accuracy with fewer parameters compared to conventional approaches that use a fixed minimum and maximum depth range.
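At its simplest, the min-max prior reduces to re-mapping the network's sigmoid output into a per-image depth range instead of a fixed global one; the interface below is assumed for illustration.

```python
# Map a sigmoid output in [0, 1] to a per-image [d_min, d_max] depth range.
import torch

def adaptive_depth(sigmoid_out, d_min, d_max):
    return d_min + (d_max - d_min) * sigmoid_out

depth = adaptive_depth(torch.rand(2, 1, 64, 64),               # network output in [0, 1]
                       d_min=torch.full((2, 1, 1, 1), 1.0),    # per-image min-depth prior
                       d_max=torch.full((2, 1, 1, 1), 20.0))   # per-image max-depth prior
print(depth.shape)  # torch.Size([2, 1, 64, 64])
```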

Pan-Sharpening with Color-Aware Perceptual Loss and Guided Re-Colorization [PDF]

Juan Luis Gonzalez Bello, Soomin Seo and Munchurl Kim 

ICIP2020

We present a novel color-aware perceptual (CAP) loss for learning the task of pan-sharpening. Our CAP loss is designed to focus on the deep features of a pre-trained VGG network that are more sensitive to spatial details and ignore color information, allowing the network to extract structural information from the PAN image while keeping the color from the lower-resolution MS image. Additionally, we propose "guided re-colorization", which generates a pan-sharpened image with real colors from the MS input by "picking" the closest MS pixel color for each pan-sharpened pixel, as a human operator would do in manual colorization. Such a re-colorized (RC) image is completely aligned with the pan-sharpened (PS) network output and can be used as a self-supervision signal during training or to enhance the colors of the PS image at test time. We present several experiments where our network trained with the CAP loss generates natural-looking pan-sharpened images with fewer artifacts and outperforms the state-of-the-art methods on the WorldView3 dataset in terms of the ERGAS, SCC, and QNR metrics.
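A brute-force sketch of the guided re-colorization step (the window size and exact matching criterion are assumptions): for each pan-sharpened (PS) pixel, pick the closest color, in RGB distance, from a local window of the upsampled multispectral (MS) image.

```python
# Nearest-color search in a local window of the upsampled MS image.
import numpy as np

def guided_recolor(ps, ms_up, radius=2):
    """ps, ms_up: (H, W, 3) arrays; ms_up is the MS image upsampled to the PS size."""
    H, W, _ = ps.shape
    out = np.empty_like(ps)
    for i in range(H):
        for j in range(W):
            y0, y1 = max(0, i - radius), min(H, i + radius + 1)
            x0, x1 = max(0, j - radius), min(W, j + radius + 1)
            window = ms_up[y0:y1, x0:x1].reshape(-1, 3)
            k = np.argmin(((window - ps[i, j]) ** 2).sum(axis=-1))
            out[i, j] = window[k]   # real MS color closest to the PS pixel
    return out

rc = guided_recolor(np.random.rand(32, 32, 3), np.random.rand(32, 32, 3))
print(rc.shape)  # (32, 32, 3)
```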

Deep 3D Pan via local adaptive "t-shaped" convolutions with global and local adaptive dilations [PDF]

Juan Luis Gonzalez Bello and Munchurl Kim 

ICLR2020

We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or "Deep 3D Pan", with "t-shaped" adaptive kernels equipped with globally and locally adaptive dilations. Our proposed network, the monster-net, is devised with a novel t-shaped adaptive kernel that efficiently incorporates the global camera shift and handles the local 3D geometries of the target image's pixels for the synthesis of natural-looking 3D panned views from a single 2D input image. Extensive experiments were performed on the KITTI, CityScapes, and our indoor VICLAB_STEREO datasets to prove the efficacy of our method. Our monster-net significantly outperforms the state-of-the-art (SOTA) method by a large margin in all metrics (RMSE, PSNR, and SSIM) and reconstructs more reliable image structures with coherent geometry in the synthesized views. Moreover, the disparity information that can be extracted from the "t-shaped" kernel is much more reliable than that of the SOTA for the unsupervised monocular depth estimation task, confirming the effectiveness of our method.

A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning [PDF]

Juan Luis Gonzalez Bello and Munchurl Kim 

ICIP2019

We propose a novel encoder-decoder architecture that outperforms previous unsupervised monocular depth estimation networks by (i) taking ambiguities into account, (ii) efficiently fusing encoder and decoder features with rectangular convolutions, and (iii) applying domain transformations between the encoder and decoder. Our architecture outperforms the Monodepth baseline in all metrics, even with a considerable reduction in parameters. Furthermore, our architecture is capable of estimating a full disparity map in a single forward pass, whereas the baseline requires two passes. We perform extensive experiments to verify the effectiveness of our method on the KITTI dataset.

Theses

González Bello, Juan Luis. (2017). "Programación de interfaz hombre máquina en sistema operativo Android para control y monitoreo inalámbrico de sistemas a base de microcontrolador" (Human-machine interface programming on the Android operating system for wireless control and monitoring of microcontroller-based systems). (Bachelor's thesis). Universidad Nacional Autónoma de México, México. [link]

Advisor: Prof. José Luis Barbosa Pacheco (UNAM)

González Bello, Juan Luis. (2023). "A Study on Free-view Image Synthesis with View-Dependent Effects based on Camera Motion and Local Context Priors". (Ph.D. dissertation). Korea Advanced Institute of Science and Technology (KAIST), South Korea. [link]

Advisor: Prof. Munchurl Kim (KAIST)