Object detection combined with depth estimation can significantly help visually impaired people, and such applications require a depth estimation model that is both accurate and fast. Autonomous vehicles rely on bulky, power-hungry sensors such as LIDAR, but these sensors are not suitable for small robots because of power and size limitations. Estimating depth from a single monocular image is therefore an attractive option for small robots.
Our approach is to build the model using depthwise separable convolution layers to reduce computation time, and to train it on images captured under varied environmental conditions so that it applies to a wider range of scenarios.
The results show that the FastDepth model we trained on outdoor images compares favorably in runtime and accuracy with the best depth estimation models, and the proposed FastDepthV2 model is significantly smaller than the state-of-the-art MidasV2 model [1].
A teaser for the results obtained:
High Level Project Flow:
Fig 1. Monocular Depth Estimation Working Model
Introduction:
Depth sensing is a critical function for building environment maps, localization, and estimating distances to objects. Simultaneous Localization and Mapping (SLAM), one of the most common capabilities of autonomous mobile robots, helps a robot build a map and understand its environment. Most current SLAM systems rely on LIDAR sensors; a few other SLAM solutions use IMUs, monocular cameras, depth cameras, or GPS devices as input sensors.
Stereo cameras and LIDAR are the sensors typically used to estimate depth. However, they consume a lot of power and are bulky, so using them on small mobile robots and UAVs is practically impossible. Estimating depth from a single monocular camera is the most feasible option. MidasV2 [1] is one of the most accurate depth estimation models. Its authors use ResNet-50 as the encoder and train the model on a mixture of different dataset types to obtain the best results. However, its execution time is more than 100 ms per image.
There have been many other attempts to estimate depth from monocular vision, but the proposed models are typically too complex, with execution times around 100 ms per image. Wofk et al. [2] propose FastDepth, a depth estimation model designed specifically for embedded systems. The model uses MobileNet as its encoder and nearest-neighbor interpolation in its decoder, and its inference time is very low, in the range of 10 ms on an NVIDIA AGX Xavier. We have extended the work of [2] so that the model also works well for outdoor images.
The general architecture of monocular depth estimation is shown in Fig. 1. It consists of two main sections, an encoder and a decoder. The encoder extracts features from the image and is generally an image classification model; the decoder scales up and interpolates these features to create the depth image. The two main challenges in using depth estimation on small autonomous robots are speed and accuracy across varied environmental conditions. The Midas model designed by the Intel Intelligent Systems Lab [1] is highly accurate but complex.
The FastDepth paper [2] implements fast monocular depth estimation for embedded systems. The model uses depthwise separable convolution layers instead of regular convolutions, which reduces the computation by the factor

1/N + 1/Dk²,

where ‘Dk’ is the size of the kernel inside the layer and ‘N’ is the number of kernels.
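For a quick sense of scale, here is a minimal sketch of this factor for one assumed layer (Dk = 3 and N = 256 are illustrative values, not taken from the paper):

```python
# Relative cost of a depthwise separable convolution vs. a standard convolution,
# using the reduction factor 1/N + 1/Dk^2 quoted above. Values are illustrative.
Dk, N = 3, 256                      # assumed kernel size and number of kernels
reduction = 1 / N + 1 / Dk ** 2
print(f"relative cost: {reduction:.3f} (about {1 / reduction:.1f}x fewer multiply-adds)")
# relative cost: 0.115 (about 8.7x fewer multiply-adds)
```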
Research problems to address:
The computation time of the MidasV2 model is comparatively high, making it slow, and the size of the MidasV2 hybrid model is on the order of gigabytes.
FastDepth monocular depth estimation takes much less time, but the released model is trained only for a specific (indoor) environment and is not robust, which limits its applications.
Proposed Approach and Implementation:
Key Implementation Ideas:
Build the model using depthwise separable convolution layers to reduce the computational requirements.
Train the model on varied environmental conditions to broaden its range of applications.
The datasets used for this are the NYU Depth Dataset V2 [4] for depth estimation and the KITTI Vision Benchmark Suite [3] for varied environmental conditions.
Monocular Depth Estimation:
The general architecture of monocular depth estimation is shown in Fig. 1. The authors of FastDepth [2] designed a fast model for monocular depth estimation. Compared to the widely used and highly accurate MidasV2 model developed by Intel ISL [1], the FastDepth model is fast and is designed especially for resource-constrained embedded devices. The full FastDepth CNN is shown in Fig. 2.
Fig 2. FastDepth CNN [2]
The architecture consists of two main sections: the encoder and the decoder. The encoder extracts features from the image and is generally an image classification model; in this case, MobileNet [5] is used for image classification.
Encoder:
MobileNets [5]: The general trend has been to build deeper and more complicated networks, such as ResNets and VGG, in order to achieve higher accuracy. However, these advances do not necessarily make networks more efficient with respect to size and speed. In many real-world applications, such as robotics, self-driving cars, and augmented reality, recognition must run in a timely fashion on a computationally limited platform. To address this challenge, MobileNets [5] are built primarily from depthwise separable convolutions (explained in the section “Depthwise Separable Convolution” below), which reduce the computation in the first few layers, improve speed, and keep the models relatively small.
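As a rough illustration of the encoder's role in PyTorch, the sketch below uses torchvision's MobileNetV2 as a stand-in backbone (FastDepth [2] uses the original MobileNet, which torchvision does not ship); only the feature extractor is kept and the classifier head is dropped:

```python
import torch
import torchvision

# Stand-in encoder: torchvision's MobileNetV2 feature extractor (FastDepth uses
# MobileNetV1; this only illustrates the encoder's role, not the exact model).
backbone = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
encoder = backbone.features                 # keep convolutional features, drop the classifier

x = torch.randn(1, 3, 224, 224)             # one RGB frame at the 224 x 224 input size used here
features = encoder(x)                       # low-resolution feature map handed to the decoder
print(features.shape)                       # torch.Size([1, 1280, 7, 7])
```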
Decoder:
The decoder scales up and interpolates the encoded features to create the depth image. The decoder network consists of five cascading upsample layers and a single pointwise layer at the end. Each upsample layer performs a 5 × 5 convolution, and the nearest-neighbor interpolation placed after the convolution doubles the spatial resolution of the intermediate feature maps; because interpolation comes after the convolution, the convolutional layers process feature maps at the lower resolution. The depthwise decomposition, explained in the section “Depthwise Separable Convolution” below, further lowers the complexity of all convolutional layers, keeping the decoder simple and fast.
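A minimal sketch of one such decoder stage, with channel counts chosen purely for illustration (the exact widths and number of stages follow [2]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    """One illustrative decoder stage: a 5x5 depthwise separable convolution
    followed by nearest-neighbor interpolation that doubles the resolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=5,
                                   padding=2, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = F.relu(self.bn(self.pointwise(self.depthwise(x))))
        return F.interpolate(x, scale_factor=2, mode="nearest")

# A 7x7 encoder feature map becomes 14x14 after one stage (channel counts assumed).
out = UpsampleBlock(1280, 512)(torch.randn(1, 1280, 7, 7))
print(out.shape)   # torch.Size([1, 512, 14, 14])
```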
For our experiments, we used the pre-trained model provided by [2] and built our model on top of it. The implementation uses Python's PyTorch library. The disadvantage of the pre-trained model from [2] is that it is trained only on the NYU dataset [4], so it works only for indoor images.
Depthwise Separable Layer:
The Depthwise Separable Convolution [6] layer reduces the number of parameters and computations in a convolution, increasing efficiency. The convolution operation is divided into two steps: depthwise convolution and pointwise convolution. Consider an input image of size 'Df * Df * M' and a kernel of size 'Dk * Dk * M * N'. Applying the convolution as shown in Table 1 gives an output of size 'Dg * Dg * N', where 'Dg' is given by:

Dg = floor((Df + 2P − D * (Dk − 1) − 1) / S) + 1

Here 'Df' is the input dimension, 'P' is the padding, 'D' is the dilation, 'Dk' is the kernel size, and 'S' is the stride. The number of computations required to produce this output is shown in Table 1.
For a depthwise separable convolution, the operation is divided into two parts to obtain the same output size 'Dg * Dg * N'. In the depthwise layer, the kernel size is 'Dk * Dk * 1 * M' for 'M' input channels, and its output is 'Dg * Dg * M'. As the name suggests, the second layer is a pointwise convolution with a '1 * 1' kernel; for 'M' input channels and 'N' filters, the kernel size is '1 * 1 * M * N'. After the pointwise convolution we obtain the required output of size 'Dg * Dg * N'. The two-stage convolution is illustrated in the figure in Table 1.
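As a concrete check of the savings, the sketch below compares the parameter counts of a standard convolution and its depthwise separable counterpart for one assumed layer (M = 64, N = 128, Dk = 3; values chosen for illustration):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

M, N, Dk = 64, 128, 3   # input channels, output channels, kernel size (illustrative)

standard = nn.Conv2d(M, N, kernel_size=Dk, padding=1, bias=False)
depthwise_separable = nn.Sequential(
    nn.Conv2d(M, M, kernel_size=Dk, padding=1, groups=M, bias=False),  # depthwise: Dk * Dk * 1 * M
    nn.Conv2d(M, N, kernel_size=1, bias=False),                        # pointwise: 1 * 1 * M * N
)

print(count_params(standard))             # 73728 = Dk * Dk * M * N
print(count_params(depthwise_separable))  # 8768  = Dk * Dk * M + M * N
```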
Experimentation and Results:
Below are some images from the NYU Depth Dataset V2 [4] showing the original image, the depth estimated by MidasV2 [1], and the depth estimated by our implementation of FastDepth [2].
The following modifications were made to the MidasV2 model so that we could compare it with our model:
The Midas model generates an inverse-depth output, i.e. the closer the object, the larger the output value, whereas in the ground truth the value is small when the object is close. The Midas model also converts the output from actual depth to relative depth and sets the depth cap to 80 meters. To calculate the accuracy of the MidasV2 model for comparison with our model, we scale the output and add a bias to convert the relative depth into actual depth (a sketch of this scale-and-bias alignment is given below). With the Midasv2_large model we obtain an accuracy of 88%, while the accuracy of Midasv2_small is 22%.
Another observation is that the time MidasV2 needs to generate the depth changes with the input size, because the input image is only scaled to a fixed aspect ratio. With our model, we instead resize the input image to a fixed size of 224 x 224 x 3 (see the preprocessing sketch below), which keeps the time required to calculate the depth constant at about 8 ms.
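A minimal sketch of the scale-and-bias step described above, assuming a least-squares fit in inverse-depth space (the exact fitting procedure may differ in detail; function and variable names are placeholders):

```python
import numpy as np

def align_inverse_depth(pred_inv, gt_depth, depth_cap=80.0):
    """Fit a scale s and bias t so that s * pred_inv + t best matches the
    ground-truth inverse depth (least squares), then convert back to metric depth.
    Illustrative only; it mirrors the scale-and-bias idea described above."""
    valid = (gt_depth > 0) & (gt_depth < depth_cap)
    gt_inv = 1.0 / gt_depth[valid]
    A = np.stack([pred_inv[valid], np.ones_like(gt_inv)], axis=1)
    s, t = np.linalg.lstsq(A, gt_inv, rcond=None)[0]
    aligned_inv = np.clip(s * pred_inv + t, 1.0 / depth_cap, None)
    return 1.0 / aligned_inv        # metric depth, capped at depth_cap meters
```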
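And a sketch of the fixed-size preprocessing we assume for FastDepthV2 (the file name is a placeholder):

```python
from PIL import Image
from torchvision import transforms

# Every input frame is resized to 224 x 224 before inference, so the depth
# computation time stays roughly constant regardless of the source resolution.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # fixed spatial size expected by the model
    transforms.ToTensor(),           # PIL image -> 3 x 224 x 224 float tensor in [0, 1]
])

frame = Image.open("example_frame.png").convert("RGB")   # placeholder input image
batch = preprocess(frame).unsqueeze(0)                   # shape: [1, 3, 224, 224]
```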
Training :
Implemented a Python script to train the model on outdoor images from the KITTI dataset [3] (over 93,000 depth maps recorded using LIDAR, with the corresponding left and right stereo RGB images).
We used Google Colab for training, which provides an NVIDIA Tesla K80 GPU for acceleration.
To train the model, we used the PyTorch framework.
For optimization, we used the Adaptive Moment Estimation (Adam) optimizer, which works well with its default hyper-parameters and is suited to non-stationary objectives and problems with very noisy or sparse gradients.
The hyper-parameters were set to: learning rate = 0.0001, batch size = 64, and epochs = 500.
We use RMSE (Root Mean Square Error) as the error metric and evaluate our model against MidasV2_small and MidasV2_hybrid for accuracy and timing (a minimal training-loop sketch follows this list).
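Below is a minimal training-loop sketch for the setup listed above; the tiny network and random batch are placeholders standing in for FastDepthV2 and the KITTI data loader:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder network standing in for FastDepthV2 (MobileNet encoder + upsample decoder).
model = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1))

def rmse_loss(pred, target):
    # Root Mean Square Error, the error metric listed above
    return torch.sqrt(F.mse_loss(pred, target))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate from the list above

# Dummy batch standing in for one KITTI (rgb, depth) pair; batch size is 64 in practice.
rgb = torch.randn(4, 3, 224, 224)
depth = torch.rand(4, 1, 224, 224) * 80.0

for epoch in range(2):                  # 500 epochs in the actual training run
    optimizer.zero_grad()
    loss = rmse_loss(model(rgb), depth)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: RMSE {loss.item():.3f}")
```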
Validation :
We conducted validation on 4944 outdoor images from the KITTI dataset [3], keeping the image size at 192 * 640, to validate all three models: MidasV2_small, MidasV2_hybrid, and FastDepthV2 (a sketch of the evaluation metrics is given below).
The results for the various validation parameters are listed in Table 2.
Fig 3. Model Comparisons
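For reference, here is a sketch of the validation metrics, assuming the reported accuracy is the standard δ1 threshold metric (this definition is our assumption; the report does not state it explicitly):

```python
import torch

def rmse(pred, gt):
    # Root Mean Square Error over all pixels
    return torch.sqrt(torch.mean((pred - gt) ** 2))

def delta1_accuracy(pred, gt, eps=1e-6):
    # Assumed accuracy metric: fraction of pixels where max(pred/gt, gt/pred) < 1.25
    ratio = torch.maximum(pred / (gt + eps), gt / (pred + eps))
    return (ratio < 1.25).float().mean()

# Dummy tensors at the 192 x 640 validation resolution used above.
pred = torch.rand(1, 1, 192, 640) * 80.0
gt = torch.rand(1, 1, 192, 640) * 80.0
print(rmse(pred, gt).item(), delta1_accuracy(pred, gt).item())
```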
Testing :
The testing experiments were conducted on the MidasV2_small, MidasV2_hybrid, and FastDepthV2 models, keeping the image size at 375 * 1242.
The average testing time per image for each model is listed in Table 3.
Conclusion:
Trained a fast monocular depth estimation model for outdoor images.
Compared the runtime and accuracy with top models such as MidasV2_large and MidasV2_small, and found that the execution time is far lower: around 8 ms, compared with 90 ms for MidasV2_small and 45000 ms for MidasV2_large.
The accuracy obtained is around 44.37% for FastDepthV2, whereas MidasV2_large achieves around 88%.
We can see that a model using depthwise separable convolution layers, when trained on varied environmental conditions, gives much faster results with reasonable accuracy.
The size of the proposed FastDepthV2 model is significantly smaller than that of the highly accurate models, making it well suited to applications where model size is a constraint.
References:
[1] Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer - https://arxiv.org/abs/1907.01341
[2] FastDepth: Fast Monocular Depth Estimation on Embedded Systems - https://arxiv.org/abs/1903.03273
[3] Vision meets Robotics: The KITTI Dataset - International Journal of Robotics Research (IJRR), 2013
[4] NYU Depth Dataset V2 - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
[5] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications - https://arxiv.org/pdf/1704.04861.pdf
[6] Depthwise Separable Convolutions for Neural Machine Translation - https://arxiv.org/abs/1706.03059
Model is also available at: https://github.com/burhanb7/FastDepth_model
Team Members:
Burhanuddin Bharmal - burhanuddinb@vt.edu
Shruti Dongare - dshruti20@vt.edu
Shreyas Joshi - sjdan@vt.edu