Future directions for this project include:
Using transformers to predict dense depth maps instead of predicting depth at only one point. Depth map datasets (such as KITTI) can be used for this.
Using optical flow maps as input to transformers to predict velocity. Optical flow segmentation datasets can be used for this.
Modeling velocity and depth explicitly in the prediction head rather than regressing them directly.
Fitting a mesh around the car to remove the dependence on depth labels, similar to the world sheet in the MonoFlex approach.
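As a rough illustration of what modeling depth and velocity in the head (rather than by direct regression) could look like, the sketch below derives depth from the pinhole camera relation z = f * H / h and velocity from the change in depth between two frames. This is only one possible formulation, not the project's method: the function names `depth_from_height` and `relative_velocity`, the assumption that the head predicts the object's pixel height, and all numeric values are hypothetical.

```python
def depth_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole-camera depth in meters: z = f * H / h.

    focal_px      -- focal length in pixels (assumed known from calibration)
    real_height_m -- assumed physical height of the object in meters
    pixel_height  -- object height in the image in pixels (hypothetical head output)
    """
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height


def relative_velocity(depth_prev_m, depth_curr_m, dt_s):
    """Finite-difference velocity along the optical axis in m/s.

    Positive values mean the object is moving away from the camera.
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    return (depth_curr_m - depth_prev_m) / dt_s


# Illustrative values only: a 1.5 m tall car seen by a camera with f = 700 px.
z_prev = depth_from_height(700.0, 1.5, 50.0)  # car appears 50 px tall
z_curr = depth_from_height(700.0, 1.5, 52.5)  # 0.1 s later it appears 52.5 px tall
v = relative_velocity(z_prev, z_curr, 0.1)    # negative: the car is approaching
```

With a formulation like this, the network would predict image-space quantities such as pixel height, and depth and velocity would follow from camera geometry, which ties the two outputs together instead of treating them as independent regression targets.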