
Disparity Estimation by Deep Learning

The FlowNet of Dosovitskiy et al. uses an encoder-decoder architecture with additional cross-links between contracting and expanding network parts, where the encoder computes abstract features from receptive fields of increasing size, and the decoder reestablishes the original resolution via an expanding upconvolutional architecture. The information is first spatially compressed in a contractive part of the network and then refined in an expanding part.

To perform the refinement, an 'upconvolution' is applied to the feature maps, and the result is concatenated with the corresponding feature maps from the contractive part of the network and with an upsampled coarser flow prediction (if available). In this way the network preserves both the high-level information passed on from the coarser feature maps and the fine local information provided by the lower-layer feature maps.

In the expanding part, upconvolutions, convolutions, and loss layers alternate during training.
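Below is a minimal PyTorch-style sketch of one such refinement step (not the authors' implementation): the decoder features are upconvolved, concatenated with the matching encoder features and the upsampled coarser flow prediction, and a flow map is predicted at the new resolution. The module name RefinementBlock and the channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementBlock(nn.Module):
    """One step of the expanding part: upconvolve decoder features, concatenate
    the corresponding encoder (skip) features and the upsampled coarser flow,
    then predict a flow map at the finer resolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 'upconvolution': transposed convolution that doubles the resolution
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        # flow predictor operating on the concatenated feature map (+2 flow channels)
        self.predict_flow = nn.Conv2d(out_ch + skip_ch + 2, 2, kernel_size=3, padding=1)

    def forward(self, x, skip, coarse_flow=None):
        up = F.relu(self.upconv(x))
        if coarse_flow is None:
            # no coarser prediction yet: use a zero-flow placeholder
            coarse_flow = torch.zeros(up.size(0), 2, up.size(2), up.size(3), device=up.device)
        else:
            # bring the coarser flow prediction up to the current resolution
            coarse_flow = F.interpolate(coarse_flow, scale_factor=2, mode='bilinear',
                                        align_corners=False)
        fused = torch.cat([up, skip, coarse_flow], dim=1)  # long-range link + coarse prediction
        flow = self.predict_flow(fused)
        return fused, flow  # 'fused' feeds the next refinement step, 'flow' goes to a loss
```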

FlowNet (An Encoder-Decoder Network for Optical Flow Estimation)

The DispNet follows the architecture of the FlowNet: the network consists of a contractive part and an expanding part with long-range links between them. The contracting part contains convolutional layers, several of which use a stride of 2, resulting in a total downsampling factor of 64; this allows the network to estimate large displacements. The expanding part of the network then gradually and nonlinearly upsamples the feature maps, also taking into account the features from the contractive part. This is done by a series of up-convolutional and convolutional layers. Note that there is no data bottleneck in the network, as information can also pass through the long-range connections between contracting and expanding layers.
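As a rough illustration of the downsampling factor, the sketch below stacks six stride-2 convolutions, giving 2^6 = 64. It assumes the simple variant that takes the concatenated left/right RGB pair (6 channels) as input; the channel widths are made up for the example and are not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, stride=1):
    # 3x3 convolution + ReLU; stride=2 halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

# Contracting part: six stride-2 stages -> total downsampling factor 2**6 = 64
contracting = nn.Sequential(
    conv(6, 64, stride=2),     # input: left and right RGB images stacked along channels
    conv(64, 128, stride=2),
    conv(128, 256, stride=2),
    conv(256, 512, stride=2),
    conv(512, 512, stride=2),
    conv(512, 1024, stride=2),
)

x = torch.randn(1, 6, 384, 768)   # a stereo pair
features = contracting(x)
print(features.shape)             # torch.Size([1, 1024, 6, 12]) -- 64x smaller in each dimension
```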

A loss weight schedule can be beneficial: start training with a loss weight of 1 assigned to the lowest resolution loss and a weight of 0 for all other losses (that is, all other losses are switched off). During training, progressively increase the weights of losses with higher resolution and deactivate the low resolution losses. This enables the network to first learn a coarse representation and then proceed with finer resolutions without losses constraining intermediate features.
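A minimal sketch of such a schedule is shown below, with an L1 loss per scale and hand-picked iteration thresholds; the thresholds, weights, and helper names are assumptions for illustration, not the values used in the paper.

```python
import torch.nn.functional as F

def multiscale_loss(predictions, gt_disparity, weights):
    """predictions: list of predicted disparity maps, coarsest first.
    Losses whose weight is 0 are effectively switched off."""
    total = 0.0
    for pred, w in zip(predictions, weights):
        if w == 0:
            continue
        # resize the ground truth to the prediction's resolution
        # (a full implementation would also rescale the disparity values)
        gt = F.interpolate(gt_disparity, size=pred.shape[-2:], mode='bilinear',
                           align_corners=False)
        total = total + w * F.l1_loss(pred, gt)
    return total

# Illustrative schedule: start with only the coarsest loss active, then
# progressively shift the weight towards the finer-resolution losses.
schedule = {
    0:       [1.0, 0.0, 0.0, 0.0],   # iteration threshold -> per-scale weights (coarse ... fine)
    100000:  [0.5, 1.0, 0.0, 0.0],
    200000:  [0.0, 0.5, 1.0, 0.0],
    300000:  [0.0, 0.0, 0.5, 1.0],
}

def weights_at(iteration):
    # use the most recent schedule entry that has been reached
    key = max(k for k in schedule if k <= iteration)
    return schedule[key]
```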

DispNet (An Encoder-Decoder Network for Disparity Estimation)

Kendall et al. form a stereo matching cost volume from learned CNN feature representations, similar in spirit to the method of Žbontar and LeCun, learn to incorporate context information by applying 3-D convolutions over this volume, and regress the disparity from the result, as shown below.

GC-Net (Geometry and Context Network) stereo regression architecture
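A minimal sketch of the main ingredients follows, assuming a concatenation-based cost volume, a small stack of 3-D convolutions for context aggregation, and a differentiable soft-argmin disparity regression; the layer counts, channel widths, and maximum disparity are illustrative and do not reproduce the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_cost_volume(feat_left, feat_right, max_disp):
    """Concatenate left features with right features shifted by each candidate
    disparity, giving a 5-D volume of shape (batch, 2*C, max_disp, H, W)."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = feat_left
            volume[:, c:, d] = feat_right
        else:
            volume[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            volume[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return volume

class SoftArgmin(nn.Module):
    """Differentiable disparity regression: softmax over negated costs,
    then the expected disparity value."""
    def forward(self, cost):                                  # cost: (batch, max_disp, H, W)
        prob = F.softmax(-cost, dim=1)
        disp_values = torch.arange(cost.size(1), device=cost.device, dtype=cost.dtype)
        return (prob * disp_values.view(1, -1, 1, 1)).sum(dim=1)

# Context aggregation by 3-D convolution over the cost volume, ending in a
# single cost value per (disparity, y, x) position.
regularize = nn.Sequential(
    nn.Conv3d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv3d(32, 1, kernel_size=3, padding=1),
)

feat_l = torch.randn(1, 32, 64, 128)                          # left CNN features
feat_r = torch.randn(1, 32, 64, 128)                          # right CNN features
cost = regularize(build_cost_volume(feat_l, feat_r, max_disp=48)).squeeze(1)
disparity = SoftArgmin()(cost)                                # (1, 64, 128)
```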

Laina et al. proposed a fully convolutional residual network for depth prediction from a single RGB image, modeling the ambiguous mapping between monocular images and depth maps and trained end-to-end. The network also learns efficient feature-map up-sampling, as shown below.

A fully convolutional network with ResNet for depth prediction
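As an illustration of learned up-sampling, here is a sketch of a residual up-sampling block in the spirit of the up-projection blocks of Laina et al.; the 2x enlargement uses nearest-neighbor upsampling as a simple stand-in for unpooling, and the kernel sizes are assumptions rather than the paper's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpProjection(nn.Module):
    """Residual up-sampling block: enlarge the feature map 2x, then combine a
    two-convolution branch with a single-convolution projection branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        up = F.interpolate(x, scale_factor=2, mode='nearest')  # 2x enlargement
        branch = self.conv2(F.relu(self.conv1(up)))
        return F.relu(branch + self.proj(up))                  # residual combination
```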

Garg et al. also train a convolutional encoder for the task of predicting the depth map of the source image. To exploit stereo images for training the encoder-decoder network, they explicitly generate an inverse warp of the target image, using the predicted depth and the known inter-view displacement, to reconstruct the source image; the photometric error of this reconstruction is the training loss for the encoder.

FCN for depth estimation
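A minimal sketch of the inverse warp and the photometric reconstruction loss for a rectified stereo pair is given below. It assumes the predicted depth has already been converted to a horizontal disparity in pixels (disparity = focal length x baseline / depth), uses an L1 photometric error (the paper's exact penalty may differ), and the helper names are made up for the example.

```python
import torch
import torch.nn.functional as F

def inverse_warp(target_img, disparity):
    """Reconstruct the source (left) view by sampling the target (right) view at
    positions shifted horizontally by the predicted disparity (in pixels).
    Assumes a rectified pair where x_right = x_left - disparity."""
    b, _, h, w = target_img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=target_img.device, dtype=target_img.dtype),
        torch.arange(w, device=target_img.device, dtype=target_img.dtype),
        indexing='ij')
    xs = xs.unsqueeze(0) - disparity.squeeze(1)          # shift the sampling positions
    ys = ys.unsqueeze(0).expand_as(xs)
    # normalize coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(target_img, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)

def photometric_loss(source_img, target_img, disparity):
    # the reconstruction error supervises the depth prediction without ground truth
    reconstruction = inverse_warp(target_img, disparity)
    return F.l1_loss(reconstruction, source_img)
```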

Hazirbas et al. proposed Deep Depth From Focus (DDFF), which estimates depth from a focal stack with an encoder-decoder network.

DDFFNet (Depth from Focus)

Some Results

Stereo pairs from FlyingThings3D (with ground truth), KITTI 2012, and KITTI 2015.

Estimation with stereo RGB images: predicted disparity maps for the three stereo pairs.

Estimation with a single RGB image: predicted depth maps on FlyingThings3D, KITTI 2012, and KITTI 2015.

References

1. I. Laina et al., "Deeper Depth Prediction with Fully Convolutional Residual Networks", arXiv:1606.00373.

2. C. Hazirbas et al., "Deep Depth From Focus", arXiv:1704.01085.

3. C. Godard et al., "Unsupervised Monocular Depth Estimation with Left-Right Consistency", arXiv:1609.03677.

4. Y. Cao et al., "Estimating Depth from Monocular Images as Classification Using Deep Fully Convolutional Residual Networks", arXiv:1605.02305.

5. R. Garg et al., "Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue", arXiv:1603.04992.

6. N. Mayer et al., "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation", CVPR 2016.

7. P. Fischer et al., "FlowNet: Learning Optical Flow with Convolutional Networks", arXiv:1504.06852.

8. B. Ummenhofer et al., "DeMoN: Depth and Motion Network for Learning Monocular Stereo", arXiv:1612.02401.

9. J. Žbontar, Y. LeCun, "Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches", arXiv:1510.05970.

10. W. Luo, A. G. Schwing, R. Urtasun, "Efficient Deep Learning for Stereo Matching", CVPR 2016.

11. A. Newell, K. Yang, J. Deng, "Stacked Hourglass Networks for Human Pose Estimation", ECCV 2016.

12. A. Kendall et al., "End-to-End Learning of Geometry and Context for Deep Stereo Regression", arXiv:1703.04309.

13. P. Knoebelreiter et al., "End-to-End Training of Hybrid CNN-CRF Models for Stereo", arXiv:1611.10229.