Current metrics for the monocular depth prediction problem evaluate only the prediction error and accuracy of individual points. State-of-the-art methods under these metrics often lack object detail in scenes, which matters when the predictions are applied to robotics or 3D reconstruction. To generate depth maps that are both accurate and sharp, we first introduce an attention-based encoder-decoder network that automatically fuses high-level semantic features with low-level edge-sensitive features. Motivated by the observation that relative depth methods produce better object outlines, we then propose a relative loss based on ground-truth depth values to improve both numerical performance and image quality. This loss function embodies a global supervision strategy and is widely extensible. Additionally, we propose two new image-quality metrics that quantify the reconstruction of small objects and the refinement of edges. The proposed method achieves state-of-the-art results on both indoor and outdoor benchmarks, i.e., NYU Depth V2 and Make3D. Qualitative results show that our method preserves sharper edge outlines and richer object details. The code will be made publicly available.
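The abstract does not specify the exact form of the relative loss, so the following is only a minimal sketch of one common way to supervise relative (ordinal) depth relations derived from the ground truth, written in PyTorch with hypothetical names; the actual loss in the paper may differ.

```python
# Minimal sketch of a pairwise relative-depth loss (hypothetical formulation).
# Pairs of pixels are sampled, and the predicted ordering of each pair is
# supervised by the ground-truth ordering -- a global, relative signal rather
# than a purely per-pixel one.
import torch


def relative_depth_loss(pred, gt, num_pairs=10000):
    """pred, gt: (B, H, W) depth maps; returns a scalar loss tensor."""
    b, h, w = gt.shape
    pred_flat = pred.reshape(b, -1)
    gt_flat = gt.reshape(b, -1)

    # Randomly sample pixel-index pairs for each image in the batch.
    idx_a = torch.randint(0, h * w, (b, num_pairs), device=gt.device)
    idx_b = torch.randint(0, h * w, (b, num_pairs), device=gt.device)

    pred_a = torch.gather(pred_flat, 1, idx_a)
    pred_b = torch.gather(pred_flat, 1, idx_b)
    gt_a = torch.gather(gt_flat, 1, idx_a)
    gt_b = torch.gather(gt_flat, 1, idx_b)

    # Target ordering from ground truth: +1 if a is farther, -1 if closer, 0 if equal.
    target = torch.sign(gt_a - gt_b)
    diff = pred_a - pred_b
    ordered = target != 0

    # Ranking-style loss on ordered pairs; L1 on (rare) equal-depth pairs.
    loss_rank = torch.log1p(torch.exp(-target[ordered] * diff[ordered])).mean()
    loss_equal = diff[~ordered].abs().mean() if (~ordered).any() else diff.new_tensor(0.0)
    return loss_rank + loss_equal
```

A loss of this kind can be added to a standard per-pixel depth loss with a weighting factor, which is one way such global supervision can be combined with existing pipelines.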