BEV-MODNet

Monocular Camera based Bird’s Eye View Moving Object Detection for Autonomous Driving

Hazem Rashed, Mariam Essam, Maha Mohamed, Ahmad El-Sallab, Senthil Yogamani

Detection of moving objects is a very important task in autonomous driving systems. After the perception phase, motion planning is typically performed in Bird's Eye View (BEV) space. This requires projecting objects detected on the image plane onto the top-view BEV plane. Such a projection is prone to errors due to the lack of depth information and noisy mapping in far-away regions. CNNs can leverage the global context of the scene to learn a better projection. In this work, we explore end-to-end Moving Object Detection (MOD) on the BEV map directly using monocular images as input. To the best of our knowledge, such a dataset does not exist, and we create an extended KITTI-raw dataset consisting of 12.9k images with annotations of moving object masks in BEV space for five classes. The dataset is intended to be used for class-agnostic, motion-cue-based object detection, and the classes are provided as meta-data for better tuning. We design and implement a two-stream RGB and optical flow fusion architecture which outputs motion segmentation directly in BEV space. We compare it with inverse perspective mapping of state-of-the-art motion segmentation predictions on the image plane. We observe a significant improvement of 13% in mIoU using the simple baseline implementation. This demonstrates the ability to directly learn motion segmentation output in BEV space. To encourage further research, the annotations will be made public.

Moving object detection has gained significant attention recently, especially for autonomous driving applications. Motion information can be used as a signal for class-agnostic detection. For example, current systems come with appearance-based vehicle and pedestrian detectors. They will not be able to detect unseen classes such as animals, which can cause accidents. Motion cues can be used to detect any moving object regardless of its class, and hence the system can use them to highlight unidentified risks.

Sensor fusion is typically used to obtain accurate and robust perception. A common representation for fusing all sensors is the BEV map, which defines the location of objects relative to the ego-vehicle from a top-view perspective. BEV maps also provide a better representation than the image view, as they minimize occlusions between objects that lie along the same line of sight from the sensor. In the case of visual perception in the image view, a projection function is applied to map detections to the top-view BEV space.
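As a concrete illustration of such a projection, the sketch below applies a flat-ground Inverse Perspective Mapping (IPM) with OpenCV. The homography values, file names, and BEV grid size are hypothetical placeholders, not our calibration; it is a minimal sketch of the general technique rather than the exact mapping used in this work.

import cv2
import numpy as np

# Illustrative placeholder homography. In practice H is derived from the camera
# intrinsics and extrinsics under a flat-ground assumption (road plane z = 0).
H = np.array([[4.0e-01, -1.2e+00, 3.5e+02],
              [1.0e-02, -2.1e+00, 6.0e+02],
              [1.0e-05, -3.0e-03, 1.0e+00]], dtype=np.float64)

image = cv2.imread("frame_000123.png")   # hypothetical front-view KITTI frame
bev_size = (400, 600)                    # (width, height) of the BEV grid in pixels

# Warp the front-view image (or a predicted mask) onto the ground plane.
# Pixels belonging to points above the road (e.g. a vehicle roof) violate the
# flat-ground assumption and get smeared away from the ego-vehicle.
bev = cv2.warpPerspective(image, H, bev_size, flags=cv2.INTER_LINEAR)
cv2.imwrite("frame_000123_bev.png", bev)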

Such a projection is usually error prone due to the absence of depth information. Deep learning, on the other hand, can be used to mitigate this inaccuracy by learning object representations directly in BEV space. There have been efforts to explore deep learning for BEV object detection using the camera sensor, and there have also been efforts in motion segmentation in the front view. However, there is no literature on end-to-end learning of BEV motion segmentation. In this work, we attempt to tackle this limitation through the following contributions:

  • We create a dataset comprising 12.9k images containing BEV pixel-wise annotations for moving and static vehicles across five classes.

  • We design and implement a simple end-to-end baseline architecture demonstrating reasonable performance (a minimal sketch of the two-stream design is given after this list).

  • We compare our results against the conventional Inverse Perspective Mapping (IPM) approach and show a significant improvement of over 13%.
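As referenced above, the sketch below outlines one way such a two-stream RGB and optical flow fusion network could be structured. Layer widths, input resolutions, and the fusion-by-concatenation choice are illustrative assumptions, not the exact BEV-MODNet configuration.

import torch
import torch.nn as nn

class TwoStreamBEVMotionNet(nn.Module):
    """Illustrative two-stream encoder-decoder; sizes are placeholders."""
    def __init__(self, num_classes=2):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.rgb_encoder = encoder(3)    # appearance stream
        self.flow_encoder = encoder(2)   # optical-flow stream (u, v channels)
        # Fuse the two streams by channel concatenation, then decode to a BEV grid.
        self.decoder = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_classes, 1),   # per-cell moving/background logits
        )

    def forward(self, rgb, flow):
        fused = torch.cat([self.rgb_encoder(rgb), self.flow_encoder(flow)], dim=1)
        # BEV-space logits; the image-to-BEV view transform is learned implicitly
        # from supervision with BEV motion masks.
        return self.decoder(fused)

# Example forward pass with hypothetical input resolutions.
rgb = torch.randn(1, 3, 256, 512)
flow = torch.randn(1, 2, 256, 512)
logits = TwoStreamBEVMotionNet()(rgb, flow)   # shape: (1, 2, 128, 256)

Concatenating the two encoder outputs is the simplest fusion choice; the decoder then learns to map fused image-view features to the BEV grid directly from the supervision in BEV space.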

Dataset Preparation

TBD

Dataset Samples

TBD

Results

Below are sample results of our approach evaluated on our generated dataset.
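For reference, the quantitative comparison against IPM reported earlier is in terms of mean IoU. The sketch below shows one straightforward way such a per-class IoU average could be computed over BEV label maps; the array shapes and class IDs are illustrative assumptions.

import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in prediction or ground truth.

    pred, gt: integer label maps of identical shape (H_bev, W_bev).
    """
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = (pred == c), (gt == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:          # class absent in both maps: skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Hypothetical BEV label maps: 0 = background, 1 = moving object.
pred = np.random.randint(0, 2, size=(600, 400))
gt = np.random.randint(0, 2, size=(600, 400))
print(mean_iou(pred, gt, num_classes=2))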