Unsupervised Deep Event Stereo for Depth Estimation
Bio-inspired event cameras have been considered effective alternatives to traditional frame-based cameras for stereo depth estimation, especially in challenging conditions such as low-light or high-speed environments. Recently, deep learning-based supervised event stereo matching methods have achieved significant performance improvements over traditional event stereo methods. However, supervised methods depend on ground-truth disparity maps for training, and it is difficult to secure large amounts of ground-truth disparity data. A feasible alternative is to devise an unsupervised event stereo method that can be trained without ground-truth disparity maps. To this end, we propose the first unsupervised event stereo matching method that predicts dense disparity maps; it is trained by transforming the depth estimation problem into a warping-based reconstruction problem. We propose a novel unsupervised loss function that forces the network to minimize the feature-level epipolar correlation difference between the ground-truth intensity images and the warped images. Moreover, we propose a novel event embedding mechanism that utilizes both temporally and spatially neighboring events to capture spatio-temporal relationships among events for stereo matching. Experimental results reveal that the proposed method outperforms baseline unsupervised methods by significant margins (e.g., up to 16.88% improvement) and achieves results comparable to existing supervised methods. Extensive ablation studies validate the efficacy of the proposed modules and architectural choices.
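The training scheme rests on view synthesis: if the predicted disparity is correct, the right view warped toward the left camera should reproduce the left intensity image, so a reconstruction error can stand in for disparity supervision. Below is a minimal PyTorch sketch of such a disparity-based warp for a rectified stereo pair; the function name and interface are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right_img, disparity):
    """Warp a right-view image (B, C, H, W) to the left view using a
    left-view disparity map (B, 1, H, W); assumes a rectified stereo pair."""
    b, _, h, w = right_img.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=right_img.device),
        torch.linspace(-1.0, 1.0, w, device=right_img.device),
        indexing="ij",
    )
    # For a rectified pair, a left pixel (x, y) sees the right-image pixel (x - d, y).
    x_src = xs.unsqueeze(0) - 2.0 * disparity.squeeze(1) / max(w - 1, 1)
    y_src = ys.unsqueeze(0).expand_as(x_src)
    grid = torch.stack((x_src, y_src), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(right_img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```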
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume 32, Issue 11, November 2022)
Paper link: https://ieeexplore.ieee.org/document/9819909
Fig. 1. (a) Events in space and time. (b) Events overlaid on the left intensity frame (for visualization purposes). (c) Disparity map estimated by the proposed method. (d) Ground-truth disparity map. Note that, in (b), the most recent 15,000 events from the left camera are overlaid on the corresponding intensity frame for better visualization (red represents positive events, blue represents negative events).
We propose a novel end-to-end unsupervised deep model for event-based stereo matching. To the best of our knowledge, this is the first work to propose an unsupervised event stereo matching approach that predicts dense disparity maps.
We propose an epipolar feature correlation difference loss for effective unsupervised learning of our event stereo model (one plausible instantiation is sketched after this list).
We also propose an event embedding sub-network that yields better event representations by considering both temporally and spatially neighboring events, which provide useful information for event-based stereo matching.
The proposed method has been evaluated on the publicly available Multi-Vehicle Stereo Event Camera (MVSEC) dataset [22] and the DSEC dataset [60]. The experimental results reveal that the proposed method outperforms traditional hand-crafted methods while showing results comparable to deep learning-based supervised methods.
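The paper's exact formulation of the epipolar feature correlation difference loss is not reproduced in this summary. One plausible instantiation, shown below purely for illustration, builds a per-pixel correlation profile along the horizontal epipolar line for the ground-truth intensity features against themselves and against the warped-image features, then penalizes the difference between the two profiles. The function names, offset range, and L1 penalty are all assumptions.

```python
import torch
import torch.nn.functional as F

def epipolar_correlation(a, b, max_offset):
    """Per-pixel correlation of features `a` with features `b` shifted along
    the horizontal epipolar line. a, b: (B, C, H, W) -> (B, 2*max_offset+1, H, W)."""
    profiles = []
    for d in range(-max_offset, max_offset + 1):
        # torch.roll wraps at the image border; a full implementation
        # would mask or pad the border columns instead.
        shifted = torch.roll(b, shifts=d, dims=3)
        profiles.append((a * shifted).mean(dim=1))  # channel-averaged correlation
    return torch.stack(profiles, dim=1)

def correlation_diff_loss(gt_feat, warped_feat, max_offset=4):
    """Match the correlation profile of the warped features against the
    ground-truth features to the ground-truth self-correlation profile."""
    c_gt = epipolar_correlation(gt_feat, gt_feat, max_offset)
    c_warp = epipolar_correlation(gt_feat, warped_feat, max_offset)
    return F.l1_loss(c_warp, c_gt)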
Fig. 2. Overall architecture of the unsupervised event stereo model. The proposed method takes asynchronous, sparse stereo event data as input and embeds it into event features using the event embedding sub-network. The embedded event features are then fed to a stereo matching sub-network to produce a dense disparity map. Note that our method does not need any ground-truth disparity maps: a novel epipolar feature correlation difference loss, along with local appearance matching losses, supervises the event stereo network during training.
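The caption mentions local appearance matching losses alongside the correlation loss, without spelling out their form here. A common choice in unsupervised stereo, shown below as an assumption rather than the paper's definition, mixes a windowed SSIM-style error with an L1 photometric error between the warped reconstruction and the target image.

```python
import torch
import torch.nn.functional as F

def local_appearance_loss(recon, target, alpha=0.85, window=3):
    """Hypothetical appearance matching term: weighted mix of a windowed
    SSIM-style error and an L1 photometric error (weights are assumptions)."""
    pad = window // 2
    # Local means, variances, and covariance via average pooling.
    mu_x = F.avg_pool2d(recon, window, 1, pad)
    mu_y = F.avg_pool2d(target, window, 1, pad)
    sigma_x = F.avg_pool2d(recon * recon, window, 1, pad) - mu_x * mu_x
    sigma_y = F.avg_pool2d(target * target, window, 1, pad) - mu_y * mu_y
    sigma_xy = F.avg_pool2d(recon * target, window, 1, pad) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilization constants
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x * mu_x + mu_y * mu_y + c1) * (sigma_x + sigma_y + c2))
    ssim_err = torch.clamp((1 - ssim) / 2, 0, 1).mean()
    l1_err = (recon - target).abs().mean()
    return alpha * ssim_err + (1 - alpha) * l1_err
```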
Fig. 3. Spatio-temporal event embedding mechanism. (a) First-in-first-out queue that accumulates the N most recent events at each spatial location by their arrival time [8]; it is a 4D tensor of size H × W × N × 2. The accumulated event data is fed to the spatio-temporal event embedding module. (b) Illustration of the spatio-temporal event embedding module at a spatial position (y, x). Spatial and temporal weights are calculated from the input timestamps, and weighted-sum operations over the polarities are then performed to extract spatial and temporal event features. Note that, in this example, the spatial window radius is 1 (i.e., a 3 × 3 spatial window) for simple visualization.
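To make the queue-and-weighting idea of Fig. 3 concrete, here is a minimal sketch of the temporal branch: the events queued at a pixel are aggregated by a weighted sum of their polarities, with weights derived from their timestamps. The exponential-decay weighting is an assumption for illustration; in the paper the spatial and temporal weights are produced by the learned embedding sub-network, and a spatial branch aggregates the local neighborhood analogously.

```python
import torch

def temporal_event_embedding(timestamps, polarities, t_now, tau=0.05):
    """Temporal branch of a spatio-temporal event embedding (illustrative).
    timestamps, polarities: (H, W, N) tensors from the per-pixel FIFO queue
    holding the N most recent events; returns an (H, W) feature map.

    Here the weights are a fixed exponential decay in event age; in the
    paper they are computed by a learned embedding sub-network."""
    age = t_now - timestamps                     # time since each event fired
    weights = torch.softmax(-age / tau, dim=-1)  # newer events weigh more
    return (weights * polarities).sum(dim=-1)    # weighted sum of polarities
```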
Fig. 8. Visual comparison with the baseline unsupervised event stereo method. From left: (a) the most recent 15,000 events from the left camera overlaid on an intensity image for better visualization (red represents positive events, blue represents negative events), (b) disparity output of the baseline method, (c) disparity output of the proposed method, and (d) ground-truth disparity. Best viewed in color and zoomed in.
Fig. 11. Qualitative comparison of an unsupervised frame-based stereo method (PASMNet [11]) and our proposed event-based method under challenging lighting conditions (Zurich sequence of the DSEC dataset [60]). From left: intensity image, output of PASMNet, output of our method, and the corresponding ground truth. Our method produces noticeably better disparity maps than the intensity-based counterpart.
This work was supported by SK Hynix.