Velocity estimation primarily stacks concepts from object tracking, optical flow estimation, and distance regression, combining them under a single umbrella using deep neural architectures.
Object tracking is one of the fundamental problems in computer vision and has been deployed in a wide range of applications. The Lucas-Kanade method (2) uses the brightness constancy assumption to track points of interest through time. Median Flow (4) extends the Lucas-Kanade method with the concept of forward-backward error. Scene flow (5) represents the 3D motion of each point in the image. A more conventional method (6) decomposes the scene into piecewise rigid motion planes and solves an optimization problem.
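To make the forward-backward error concrete, the sketch below tracks points forward with pyramidal Lucas-Kanade, tracks the results back to the first frame, and measures the round-trip displacement; points with a large error are deemed unreliable, as in Median Flow (4). This is a minimal sketch using OpenCV's `calcOpticalFlowPyrLK` (an implementation assumption), not the tracker's reference code:

```python
import numpy as np
import cv2

def forward_backward_error(prev_gray, next_gray, points):
    """Median-Flow-style reliability check.
    points: (N, 1, 2) float32 array of corners in prev_gray."""
    # Forward pass: previous frame -> next frame
    fwd_pts, st_fwd, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, points, None)
    # Backward pass: next frame -> previous frame
    bwd_pts, st_bwd, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd_pts, None)
    # Forward-backward error: distance between original and back-tracked points
    fb_err = np.linalg.norm(points - bwd_pts, axis=-1).ravel()
    valid = (st_fwd.ravel() == 1) & (st_bwd.ravel() == 1)
    return fwd_pts, fb_err, valid
```

In Median Flow, points whose forward-backward error exceeds the median are discarded before estimating the object's motion.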
Optical flow estimation is essential for efficient object tracking. Large CNN-based approaches such as Flownet (7) and Flownet2 (8) stack multiple networks and rely on a warping operation. PWC-Net (9), a faster network, applies the idea of spatial pyramids to Flownet2 and uses cost volumes efficiently.
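To illustrate the cost-volume idea, the following is a minimal sketch (in PyTorch, an implementation assumption) of the local correlation volume that PWC-Net-style networks build between feature maps of the two frames:

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat1, feat2, max_disp=4):
    """Local correlation cost volume between two feature maps.
    feat1, feat2: (B, C, H, W) feature tensors from consecutive frames."""
    b, c, h, w = feat1.shape
    # Pad the second feature map so every displacement can be sampled
    feat2_pad = F.pad(feat2, [max_disp] * 4)
    volumes = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat2_pad[:, :, dy:dy + h, dx:dx + w]
            # Correlation = mean over channels of the elementwise product
            volumes.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(volumes, dim=1)  # (B, (2*max_disp+1)^2, H, W)
```

The search range can stay small because the volume is computed at every level of a spatial pyramid on warped features, which is what keeps PWC-Net efficient.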
Two types of deep learning based approaches were developed for inter-vehicle depth estimation: monocular depth estimation and 3D object detection. A U-Net (13) based architecture was initially proposed to predict dense depth using Conditional Random Fields. DORN (14) instead discretizes the depth range and casts depth estimation as an ordinal regression problem. M3D-RPN (15) constructs a 3D region proposal network that generates 3D bounding boxes from features computed in 2D image space. A more recent paper (16) estimates depth using spatial and temporal information in video.
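As an example of depth discretization, DORN's spacing-increasing discretization (SID) places bin edges exponentially so that nearby depths receive finer resolution than distant ones. The helper below is a minimal sketch of the SID bin edges (variable names are our own for illustration):

```python
import numpy as np

def sid_thresholds(alpha, beta, num_bins):
    """Spacing-increasing discretization of the depth range [alpha, beta]
    into num_bins ordinal intervals, following the SID rule in DORN (14).
    Returns the num_bins + 1 bin edges t_i = exp(log(alpha) + i/K * log(beta/alpha))."""
    i = np.arange(num_bins + 1)
    return np.exp(np.log(alpha) + i * np.log(beta / alpha) / num_bins)

# e.g. sid_thresholds(1.0, 80.0, 64) spans 1 m to 80 m with finer near-range bins
```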
Velocity is also estimated by leveraging synthetic data in (17). A very recent approach (18) introduces MSANet to predict vehicle velocity and inter-vehicle distance. (10) regresses velocity in autonomous driving settings from trajectory features extracted with Monodepth, the Median Flow tracker, and Flownet. A multi-scale plane-fitting based visual flow algorithm that is robust to the aperture problem is described in (11). While many of these networks use separate backbones to predict optical flow and velocity, (12) uses a single network that leverages depth information when predicting optical flow. Although today's autonomous vehicles use sensor fusion to estimate depth and velocity, these systems weight cameras most heavily. Our aim is to compute depth and velocity from images alone with a single network, making it computationally efficient in real time.
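As a point of reference for what trajectory-based velocity regression must recover, a naive baseline simply fits a line to per-frame depth estimates of a tracked vehicle; the slope is the relative longitudinal velocity. The helper below is hypothetical and only illustrates the geometry, not the network described in this paper:

```python
import numpy as np

def longitudinal_velocity(depths, timestamps):
    """Least-squares estimate of relative longitudinal velocity from
    per-frame metric depths of a tracked vehicle (illustrative baseline).
    depths: (T,) depth estimates, one per frame; timestamps: (T,) seconds."""
    # Fit depth ~ v * t + d0; the slope v is the relative velocity in m/s
    v, _ = np.polyfit(timestamps, depths, deg=1)
    return v
```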
Figure 2: 2D bounding box and 3D bounding box from two different perspectives
The above approaches regress depth using the 2D bounding box or the projection of the 3D bounding box on the image (see Figure 2 above). As Liu et al. (22) point out, these keypoints have no real contextual meaning, and their 2D locations vary with changes in camera viewpoint and object orientation. The keypoints may lie on the ground, in the sky, or on trees, making it difficult for the network to distinguish keypoints from other image pixels. Hence, we use keypoints on the car itself, so that their projections and their averaged center also lie on the car.
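To see why on-body keypoints behave well, note that projecting points fixed to the car body with a pinhole model and averaging them yields a center that also falls on the car in the image. The sketch below is a minimal illustration; the intrinsic matrix K and the keypoint coordinates are assumptions for the example:

```python
import numpy as np

def project_keypoint_center(keypoints_3d, K):
    """Project 3D keypoints defined on the car body into the image and
    average them (illustrative sketch, not the paper's pipeline).
    keypoints_3d: (N, 3) points in the camera frame with depth Z > 0.
    K: (3, 3) camera intrinsic matrix."""
    # Pinhole projection: u = fx * X/Z + cx, v = fy * Y/Z + cy
    uv = (K @ keypoints_3d.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    # Averaging projections of on-body keypoints keeps the center on the car
    return uv.mean(axis=0)
```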