Summary:
A majority of modern approaches to depth and velocity estimation are data intensive, i.e., they require a variety of ground-truth annotations. In this project, we try to reduce this dependence by implicitly learning decoupled representations of the object.
We showcase two approaches. (a) The first uses the monocular 3D detection architecture MonoFlex to estimate depth, and then the Kalman-filter-based AB3DMOT tracker to estimate object velocity. MonoFlex regresses both the 3D bounding-box keypoints and a direct depth estimate, which are ensembled into the final depth; we remove the direct depth regression and rely on keypoint regression alone. AB3DMOT tracks the 3D bounding box with a Kalman filter whose state includes velocity, and we read the velocity estimate from this filter state as our final output.
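To make the velocity read-out concrete, below is a minimal sketch (not the actual AB3DMOT code; class and parameter names are illustrative) of a constant-velocity Kalman filter over a 3D box centre. The key point is that velocity is part of the state vector, so it can be read off directly after each update rather than regressed separately.

    import numpy as np

    class BoxKalmanTracker:
        """Minimal constant-velocity Kalman filter over a 3D box centre.

        State: [x, y, z, vx, vy, vz]; the velocity estimate is read
        directly from the filter state after each update.
        """

        def __init__(self, center, dt=0.1):
            self.x = np.hstack([np.asarray(center, float), np.zeros(3)])
            self.P = np.eye(6) * 10.0                       # state covariance
            self.F = np.eye(6)                              # transition model
            self.F[:3, 3:] = np.eye(3) * dt                 # x' = x + v * dt
            self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe centre only
            self.Q = np.eye(6) * 1e-2                       # process noise
            self.R = np.eye(3) * 1e-1                       # measurement noise

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q

        def update(self, center):
            y = np.asarray(center, float) - self.H @ self.x   # innovation
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
            self.x = self.x + K @ y
            self.P = (np.eye(6) - K @ self.H) @ self.P

        @property
        def velocity(self):
            return self.x[3:]                                 # [vx, vy, vz]

    # Usage: feed one detected box centre per frame.
    trk = BoxKalmanTracker(center=[10.0, 1.5, 20.0], dt=0.1)
    for c in ([10.1, 1.5, 20.5], [10.2, 1.5, 21.0]):
        trk.predict()
        trk.update(c)
    print(trk.velocity)  # converges toward the true velocity over more frames

Since position is the only measured quantity, the velocity becomes observable through the prediction step that couples position and velocity, which is why a few frames of updates are needed before the estimate stabilises.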
(b) The second uses a Multiscale Vision Transformer (MViT), built from spatio-temporal attention blocks, to directly regress object depth and velocity. In contrast to standard multi-head attention, MViT pools the sequence of latent tensors to reduce the sequence length; this pooling lowers the computational cost during training, while the rest of the pipeline follows the ViT model. The main reason for regressing depth and velocity directly, without keypoints, is to be able to leverage other depth-prediction datasets to satisfy the transformer's appetite for data.
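As a rough illustration of the pooling idea, here is a simplified single-stage sketch, not the official MViT implementation: it pools only the keys and values with a depthwise strided 1-D convolution over the token axis, whereas MViT uses 3-D convolutional pooling over spatio-temporal tokens and also pools queries between stages.

    import torch
    import torch.nn as nn

    class PoolingAttention(nn.Module):
        """Simplified MViT-style attention: keys and values are pooled
        with a strided conv before attention, so the attention cost
        drops with the pooled sequence length."""

        def __init__(self, dim, num_heads=4, kv_stride=2):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, dim * 3)
            # strided depthwise conv acts as the pooling operator on tokens
            self.pool = nn.Conv1d(dim, dim, kernel_size=3,
                                  stride=kv_stride, padding=1, groups=dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                        # x: (B, N, C)
            B, N, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # pool keys/values along the sequence: (B, N, C) -> (B, N', C)
            k = self.pool(k.transpose(1, 2)).transpose(1, 2)
            v = self.pool(v.transpose(1, 2)).transpose(1, 2)

            def heads(t):                            # (B, L, C) -> (B, h, L, d)
                return t.view(B, t.shape[1], self.num_heads,
                              self.head_dim).transpose(1, 2)

            q, k, v = heads(q), heads(k), heads(v)
            attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            out = attn.softmax(dim=-1) @ v           # (B, h, N, d)
            out = out.transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

With kv_stride=2 the attention matrix shrinks from N x N to N x (N/2), which is where the training-time saving from pooling comes from.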
Conclusion:
Experiments on the KITTI benchmark show that our first approach reduces its dependence on annotated data while incurring only a 3.7% decrease in depth-estimation accuracy.
For MViT, an ablation study was performed with 4-frame and 8-frame inputs. This approach does not perform as well as expected, achieving an MAE of 0.61 for the depth estimate and 0.51 for the velocity estimate with a 4-frame input.