Analyzing Models for Depth Map Prediction from Monocular Images

Hello, this is my website for presenting the materials for my Advanced ML Final Project for the Fall 2020 semester. I worked alone on this project, in which I evaluated depth map prediction models on monocular images (a single image of a scene, rather than multiple views). I selected a couple of recent papers on monocular depth prediction and read them to understand how each model works and whether any interesting data pre-processing steps set it apart from other models. Since my own computer could not support training these models, I found publicly available implementations and existing trained models online, ran them on my own machine, and evaluated them on a popular benchmark (NYU Depth Dataset @ https://cs.nyu.edu/~silberman/datasets/nyu_depth_v1.html). I then compared the prediction errors of the two models, looked at how a few sample images came out, and made inferences as to why each model performed differently.
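
As background for the evaluation setup, the labeled portion of the dataset ships as a MATLAB .mat file containing aligned RGB images and depth maps. Below is a minimal sketch of how a few (image, depth) pairs can be pulled out of it; the file name and the 'images'/'depths' keys are assumptions based on the dataset's documented layout, and MATLAB's column-major storage means the axes have to be reordered.

```python
# Sketch of loading a handful of (image, depth) pairs from the NYU Depth
# labeled .mat file for evaluation. The file name and the 'images'/'depths'
# keys are assumptions based on the dataset's documented MATLAB layout.
import h5py
import numpy as np

def load_nyu_samples(mat_path, num_samples=10):
    """Return lists of RGB images (H, W, 3) and depth maps (H, W) in meters."""
    with h5py.File(mat_path, "r") as f:
        # MATLAB stores arrays column-major, so axes need to be reordered.
        images = np.array(f["images"][:num_samples])   # (N, 3, W, H)
        depths = np.array(f["depths"][:num_samples])   # (N, W, H)
    rgb = [np.transpose(img, (2, 1, 0)) for img in images]   # -> (H, W, 3)
    depth = [np.transpose(d, (1, 0)) for d in depths]        # -> (H, W)
    return rgb, depth

# Hypothetical file name; use the path of your downloaded labeled .mat file.
rgb_samples, depth_samples = load_nyu_samples("nyu_depth_data_labeled.mat")
```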

The two models I used came from different papers. The first was "Unsupervised Monocular Depth Estimation with Left-Right Consistency" (Monodepth, implementation: https://github.com/mrharicot/monodepth, paper: https://arxiv.org/abs/1609.03677). This is an unsupervised model, so no ground-truth depths are provided for the training images. The intuition behind it is that if a model can be trained to predict what a binocular (stereo) camera would see, i.e. both the left and right images, then it can also reconstruct the depth map of the scene. At training time, the network takes one image of a rectified stereo pair and predicts the disparity map that most consistently reconstructs the other image, and the loss is computed from how well that reconstruction matches the real image.
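
To make the idea concrete, here is a minimal sketch (not the authors' code) of the reconstruction step: the predicted disparity map is used to warp the right image into an estimate of the left image, and the photometric difference between the real and reconstructed left image acts as the training signal. The function names and the nearest-neighbour sampling are my own simplifications; Monodepth itself uses differentiable bilinear sampling plus additional loss terms (an SSIM/L1 appearance loss, disparity smoothness, and a left-right disparity consistency check).

```python
# A minimal sketch of the left-right reconstruction idea behind Monodepth.
import numpy as np

def warp_right_to_left(right_img, disparity):
    """Reconstruct the left view by sampling the right image shifted by disparity.

    right_img: (H, W, 3) array, disparity: (H, W) array in pixels.
    Nearest-neighbour sampling keeps the sketch short; the real model uses
    differentiable bilinear sampling so gradients can flow through the warp.
    """
    h, w, _ = right_img.shape
    cols = np.arange(w)[None, :].repeat(h, axis=0)          # pixel x-coordinates
    src_cols = np.clip(np.round(cols - disparity), 0, w - 1).astype(int)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return right_img[rows, src_cols]

def reconstruction_loss(left_img, right_img, disparity):
    """L1 photometric loss between the real left image and its reconstruction."""
    reconstructed_left = warp_right_to_left(right_img, disparity)
    return np.abs(left_img.astype(float) - reconstructed_left.astype(float)).mean()
```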

The second model was "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" (implementation: https://github.com/DhruvJawalkar/Depth-Map-Prediction-from-a-Single-Image-using-a-Multi-Scale-Deep-Network, paper: https://arxiv.org/abs/1406.2283). This model first predicts a coarse depth map for the whole image using a global (full image) view of the scene: convolutional and max-pooling layers shrink the input in the lower and middle layers, and fully connected upper layers integrate information from the entire image. This global view is important for capturing cues like vanishing points, object locations, and room alignment. The coarse prediction is then passed through a local fine-scale network, which refines it using local features such as edges.
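
The sketch below shows the rough shape of that two-stage design in PyTorch. The layer sizes and resolutions are illustrative choices of mine, not the exact ones from the paper; the point is that the coarse network ends in fully connected layers (so every output unit sees the whole image), while the fine network only applies local convolutions to the image concatenated with the coarse prediction.

```python
# Illustrative two-stage (coarse + fine) depth network; sizes are not the paper's.
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layers give every output unit a global receptive field.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 28 * 38, 1024), nn.ReLU(),
            nn.Linear(1024, 56 * 76),
        )

    def forward(self, x):                      # x: (N, 3, 224, 304)
        return self.fc(self.features(x)).view(-1, 1, 56, 76)

class FineNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, image_small, coarse_depth):
        # image_small should be resized to the coarse map's resolution first.
        return self.refine(torch.cat([image_small, coarse_depth], dim=1))
```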

Overall, neither model performs especially well on our sample data from the NYU Depth Dataset, but this makes sense given the monocular setting: it is much harder to recover depth from a single image when a machine lacks the depth cues humans rely on, such as an understanding of standard object sizes.
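
The error comparison between the two models can be made concrete with standard metrics from the depth-estimation literature, such as RMSE and absolute relative error, computed over valid ground-truth pixels after resizing each prediction to the ground-truth resolution. The sketch below shows one way to set this up; the depth-range mask and variable names are my own choices for illustration.

```python
# Sketch of comparing a predicted depth map against NYU ground truth.
import numpy as np

def depth_errors(pred, gt, min_depth=0.1, max_depth=10.0):
    """Return (rmse, abs_rel) over valid ground-truth pixels; depths in meters."""
    mask = (gt > min_depth) & (gt < max_depth)   # ignore missing/implausible depths
    pred, gt = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    return rmse, abs_rel
```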