Deep Event Stereo Leveraged by Event-to-Image Translation
AAAI 2021 (acceptance rate: 21%)
Depth estimation in real-world applications requires precise responses to fast motion and challenging lighting conditions. Event cameras use bio-inspired, event-driven sensors that provide instantaneous and asynchronous measurements of pixel-level log-intensity changes, which makes them suitable for depth estimation under such challenging conditions. However, because event cameras provide asynchronous and spatially sparse event data, it is difficult to estimate an accurate dense disparity map in stereo event camera setups, especially on local structures or edges. In this study, we develop a novel deep event stereo network that reconstructs spatial intensity image features from embedded event streams and leverages the event features with the reconstructed image features to compute dense disparity maps. To this end, we propose a novel event-to-image translation network with a cross-semantic attention mechanism that calculates the global semantic context of the event features for intensity image reconstruction. In addition, a feature aggregation module is developed for accurate disparity estimation; it modulates the event features with the reconstructed image features through a stacked dilated spatially-adaptive denormalization mechanism. Experimental results reveal that our method outperforms state-of-the-art methods by significant margins in both quantitative and qualitative measures.
Proposed Method and Contributions
We propose a novel end-to-end deep event stereo architecture that generates spatial image features from input event data and uses them as guidance for accurate stereo matching. We leverage the reconstructed image features to provide dense spatial intensity information that is absent from the asynchronous and sparse event data. In summary, the contributions of this work are as follows:
We propose a deep event stereo network that extracts event features leveraged by the reconstructed image features for dense disparity map estimation.
A novel image reconstruction sub-network is proposed to reconstruct image features from the event features; it is based on a dual-path encoder-decoder network with a semantic attention mechanism.
A feature aggregation sub-network is proposed to incorporate the reconstructed image features into the event features via spatially adaptive modulation, using a stacked dilated SPatially-Adaptive DEnormalization (stacked dilated SPADE) mechanism.
Proposed Architecture for End-to-End Deep Event Stereo
Overall architecture of the proposed deep event stereo network
The event embedding sub-network contains a kernel network with continuous fully connected layers, following (Tulyakov et al. 2019), for left/right event-to-feature embedding. The embedded event features are then fed to both the image reconstruction sub-network and the feature aggregation sub-network as inputs. The proposed image reconstruction sub-network takes the event features as input and uses a dual-path encoder-decoder network with a novel attention mechanism to reconstruct the corresponding left and right intensity images and to obtain image features of the same shape as the event features. The proposed feature aggregation sub-network takes the embedded event features and the reconstructed image features as inputs and fuses them with a stacked dilated SPADE mechanism to obtain a final fused and aggregated feature, which is then fed into a stereo matching sub-network to obtain dense disparity maps. Note that the event embedding and stereo matching sub-networks follow the same methods as the previous study (Tulyakov et al. 2019). Thus, the following subsections introduce the proposed sub-networks (i.e., the image reconstruction sub-network and the feature aggregation sub-network) in detail.
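As a reading aid, the following minimal sketch traces this data flow in PyTorch; the class and argument names are hypothetical placeholders, not the authors' code:

```python
import torch.nn as nn

class EventStereoPipeline(nn.Module):
    """Skeleton of the overall data flow. Module names and interfaces are
    placeholders for illustration, not the authors' implementation."""

    def __init__(self, embed, reconstruct, aggregate, match):
        super().__init__()
        self.embed = embed              # event embedding sub-network
        self.reconstruct = reconstruct  # image reconstruction sub-network
        self.aggregate = aggregate      # feature aggregation sub-network
        self.match = match              # stereo matching sub-network

    def forward(self, events_left, events_right):
        # 1. Embed the raw left/right event streams into dense feature maps.
        f_ev_l, f_ev_r = self.embed(events_left), self.embed(events_right)
        # 2. Reconstruct intensity images; keep the intermediate image
        #    features, which share the spatial shape of the event features.
        img_l, f_img_l = self.reconstruct(f_ev_l)
        img_r, f_img_r = self.reconstruct(f_ev_r)
        # 3. Fuse event and image features (stacked dilated SPADE).
        f_l = self.aggregate(f_ev_l, f_img_l)
        f_r = self.aggregate(f_ev_r, f_img_r)
        # 4. Estimate a dense disparity map from the fused features.
        return self.match(f_l, f_r)
```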
Event-to-Image Translation
Image Reconstruction Sub-Network
The image reconstruction sub-network is based on a dual-path encoder-decoder network.
Attention branch - captures the global contextual relationships among the features using the novel semantic attention mechanism.
Regular branch - extracts features with enlarged receptive fields using dilated convolutions.
The outputs of the regular and attention branches are concatenated channel-wise and fed into a single decoder.
The decoder consists of several convolution and up-sampling layers (i.e., an up-sampling operation followed by a convolution operation) and outputs a reconstructed intensity image together with the associated image features.
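A minimal sketch of this dual-path design is given below, assuming PyTorch and illustrative channel sizes, strides, and dilation rates; the attention branch is reduced to a plain convolution that stands in for the semantic attention mechanism:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathReconstructor(nn.Module):
    """Sketch of the dual-path encoder-decoder. Channel sizes, strides, and
    dilation rates are assumptions; the attention branch is reduced to a
    plain convolution standing in for the semantic attention mechanism."""

    def __init__(self, c_in=64, c=64):
        super().__init__()
        # Regular branch: dilated convolutions enlarge the receptive field.
        self.regular = nn.Sequential(
            nn.Conv2d(c_in, c, 3, stride=2, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=4, dilation=4),
            nn.ReLU(inplace=True),
        )
        # Attention branch: placeholder for the semantic attention mechanism.
        self.attention = nn.Sequential(
            nn.Conv2d(c_in, c, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(2 * c, c, 3, padding=1)  # after up-sampling
        self.to_image = nn.Conv2d(c, 1, 3, padding=1)     # intensity head

    def forward(self, f_event):
        # Two encoder paths over the same input, fused channel-wise.
        fused = torch.cat([self.regular(f_event), self.attention(f_event)], dim=1)
        # Decoder block: up-sample back to the input resolution, then convolve.
        up = F.interpolate(fused, scale_factor=2, mode='bilinear',
                           align_corners=False)
        f_img = F.relu(self.decoder(up))   # image features, same shape as input
        image = self.to_image(f_img)       # reconstructed intensity image
        return image, f_img
```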
Semantic attention module
Spatial Context Attention
Calculates the dependency across all channels for each spatial location in a feature map.
We thereby obtain spatial context information in terms of channel dependency.
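As a hedged illustration, one simple way to realize such per-location channel dependency is a softmax over the channel axis; the 1x1 projection and the residual form below are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SpatialContextAttention(nn.Module):
    """One simple realization of per-location channel dependency: a softmax
    over the channel axis. The 1x1 projection and the residual form are
    assumptions, not the paper's exact formulation."""

    def __init__(self, c):
        super().__init__()
        self.proj = nn.Conv2d(c, c, kernel_size=1)  # mixes channels per location

    def forward(self, x):
        # At every spatial location, compute a dependency weight across
        # all channels and re-weight the input accordingly.
        w = torch.softmax(self.proj(x), dim=1)      # (B, C, H, W)
        return x * w + x                            # residual re-weighting
```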
Cross-Semantic Attention
The cross-semantic attention block performs a re-calibration of the global context of input features.
It calculates channel-wise statistics from both input features and uses them to re-calibrate each feature with the global context of the other.
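A minimal sketch of this idea, assuming a squeeze-and-excitation style gate driven by pooled statistics of both inputs (the layer sizes and reduction ratio are assumptions):

```python
import torch
import torch.nn as nn

class CrossSemanticAttention(nn.Module):
    """Sketch of cross-feature channel re-calibration, assuming a
    squeeze-and-excitation style gate driven by pooled statistics of both
    inputs. Layer sizes and the reduction ratio are assumptions."""

    def __init__(self, c, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # channel-wise statistics
        self.mlp = nn.Sequential(
            nn.Linear(2 * c, c // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c),
            nn.Sigmoid(),
        )

    def forward(self, x, y):
        # Global channel statistics from *both* features drive the gate.
        s = torch.cat([self.pool(x).flatten(1), self.pool(y).flatten(1)], dim=1)
        gate = self.mlp(s).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return x * gate  # x re-calibrated with the global context of x and y
```

Applying the same block with the arguments swapped would re-calibrate the other feature in the same way.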
Feature Aggregation
Feature Aggregation Sub-Network
This sub-network takes the embedded event feature and the reconstructed image feature as inputs, and generates a fused and aggregated feature.
The proposed aggregation sub-network is based on the spatially-adaptive denormalization (SPADE) method (Park et al. 2019), which modulates an input feature using a conditional feature with learned scale and shift parameters.
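For reference, a minimal SPADE layer can be sketched as follows; the channel sizes and hidden width are illustrative, and the conditional input here is the reconstructed image feature rather than the segmentation map used in the original formulation:

```python
import torch.nn as nn

class SPADE(nn.Module):
    """Minimal SPADE layer: normalize the input feature without affine
    parameters, then modulate it with per-pixel scale (gamma) and shift
    (beta) maps predicted from a conditional feature."""

    def __init__(self, c_feat, c_cond, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(c_feat, affine=False)  # parameter-free norm
        self.shared = nn.Sequential(
            nn.Conv2d(c_cond, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, c_feat, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, c_feat, 3, padding=1)

    def forward(self, f_event, f_image):
        h = self.shared(f_image)  # conditional (reconstructed image) feature
        # Spatially-adaptive denormalization: per-pixel scale and shift.
        return self.norm(f_event) * (1 + self.to_gamma(h)) + self.to_beta(h)
```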
Stacked Dilated SPADE Block
We modify SPADE to fit the stereo matching task. Specifically, to compensate for the insufficient structural information of the event feature, the proposed conditional normalization applies stacked dilated convolutions.
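A sketch of this variant is shown below; the dilation rates and hidden width are assumptions:

```python
import torch.nn as nn

class StackedDilatedSPADE(nn.Module):
    """Sketch of the stacked dilated variant: the conditional branch stacks
    dilated convolutions so the predicted scale/shift maps aggregate a larger
    structural context than a single 3x3 convolution. Dilation rates and the
    hidden width are assumptions."""

    def __init__(self, c_feat, c_cond, hidden=128, dilations=(1, 2, 4)):
        super().__init__()
        self.norm = nn.BatchNorm2d(c_feat, affine=False)
        layers, c_prev = [], c_cond
        for d in dilations:  # stack of dilated convs over the image feature
            layers += [nn.Conv2d(c_prev, hidden, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
            c_prev = hidden
        self.shared = nn.Sequential(*layers)
        self.to_gamma = nn.Conv2d(hidden, c_feat, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, c_feat, 3, padding=1)

    def forward(self, f_event, f_image):
        h = self.shared(f_image)  # structural context from the image feature
        return self.norm(f_event) * (1 + self.to_gamma(h)) + self.to_beta(h)
```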
Experiments and Results
Qualitative Results
Qualitative comparison with recent event-based methods on the Indoor Flying dataset
Quantitative Results
Results for sparse disparity estimation. Note that blank entries in the table denote values that are not reported in the associated papers.
Results for dense disparity estimation. Note that the baseline method is DDES (Tulyakov et al. 2019)