Figure 1: Today's autonomous cars rely largely on fusing LiDAR, RADAR, and camera data. LiDAR and RADAR data are far more expensive to collect, clean, and maintain than camera data, which hampers the scalability of such systems, and stereo setups add computational load that hurts response time. Camera data is also abundant in the autonomous-driving domain, and vision-based techniques are more mature and better proven than LiDAR/RADAR processing techniques.
Advanced Driver Assistance Systems (ADAS) are intelligent systems designed to assist the driver in many ways: improving reaction times, compensating for negligence, and avoiding mishaps. These systems provide vital information about traffic, obstructions, and potential collisions. Inter-vehicle distance and relative velocity with respect to the self- or ego-vehicle are the two most essential pieces of information for ADAS (1). With the growing enthusiasm around driverless vehicles, research into estimating these two quantities has also increased significantly.
One representative solution for these assistive systems is the fusion of sensors such as LiDARs, RADARs, and cameras. LiDARs are good at estimating the positions and velocities of other vehicles in ideal environments. However, the data is very expensive to collect, clean, and update, and processing it requires heavy compute, making it less scalable than cameras. In unpleasant weather such as rain, snow, or fog, or even with sudden minor aberrations in the surroundings, LiDAR performance degrades significantly, leading to undesirable outputs. RADARs offer even better accuracy, but they suffer from the same problems as LiDARs and can give abrupt readings at crucial times, begging the question: when the fused sensors give contradicting readings, which one do we trust? Deep learning has provided strong alternatives in many visual applications. Distance regression problems, which include depth estimation and 3D bounding boxes, have been widely studied. Relative velocity estimation, or scene flow, can be viewed as the 3D-motion counterpart of this distance regression problem: estimating the motion of other vehicles relative to the ego-vehicle. Stereo cameras are commonly used for depth estimation, but studies show that stereo-vision-based object scene flow estimation in videos suffers from high computational cost. Therefore, for real-time scenarios, monocular depth and velocity estimation has been proposed as a viable alternative.
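As a concrete illustration of this link between depth and relative velocity (a sketch under standard assumptions, not a result from the cited works), under a pinhole camera model a detected vehicle's pixel location $(u, v)$ with predicted monocular depth $Z$ can be back-projected to a 3D position in the camera frame, and the relative velocity approximated by finite differences over the frame interval $\Delta t$:

$$P_t = \begin{bmatrix} Z\,(u - c_x)/f_x \\ Z\,(v - c_y)/f_y \\ Z \end{bmatrix}, \qquad v_{\text{rel}} \approx \frac{P_t - P_{t-\Delta t}}{\Delta t},$$

where $(f_x, f_y, c_x, c_y)$ are the camera intrinsics.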
In this work, we adopt several recent deep learning baselines for depth and motion estimation, which consist of two main stages: (1) distance regression and (2) optical flow estimation. In the current baselines, distance regression depends on bounding-box detections and velocity estimation depends on optical flow. We use a robust deep depth estimator together with a non-learning-based method to estimate velocity. Additionally, we estimate depth and velocity concurrently using transformers and provide real-time inference for both tasks. A minimal sketch of the non-learning-based velocity step is given below. We conduct the study with monocular cameras on the TuSimple and KITTI datasets.
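The sketch below shows one plausible formulation of such a non-learned velocity step, assuming a pinhole camera with known intrinsics, per-frame metric depth maps from a monocular network, and per-frame 2D bounding boxes for the tracked vehicle. The helper names and the KITTI-like intrinsics in the usage comment are illustrative, not the exact pipeline used in this work.

```python
# Minimal sketch (illustrative, not the exact method used here): estimating
# inter-vehicle distance and relative velocity from per-frame monocular
# depth maps and 2D bounding boxes, assuming a pinhole camera model.
import numpy as np

def box_to_position(depth_map, box, fx, fy, cx, cy):
    """Back-project the center of a detected vehicle box to a 3D point (meters).

    depth_map : HxW array of metric depth predicted by a monocular network.
    box       : (x1, y1, x2, y2) bounding box in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Use the median depth inside the box to be robust to background pixels.
    patch = depth_map[int(y1):int(y2), int(x1):int(x2)]
    z = float(np.median(patch))
    return np.array([z * (u - cx) / fx, z * (v - cy) / fy, z])

def relative_velocity(pos_prev, pos_curr, dt):
    """Finite-difference estimate of velocity relative to the ego-vehicle (m/s)."""
    return (pos_curr - pos_prev) / dt

# Example usage with hypothetical detections and KITTI-like intrinsics:
# p0 = box_to_position(depth_t0, box_t0, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
# p1 = box_to_position(depth_t1, box_t1, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
# v  = relative_velocity(p0, p1, dt=0.1)   # ~10 Hz camera => dt = 0.1 s
```

Distance to the tracked vehicle falls out of the same computation as the Z-component (or the norm) of the back-projected point, so no additional learned component is needed once the depth map and detections are available.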