Results :-
For the MViT, an ablation study was performed with inputs of 4 frames and 2 frames. The framework did not achieve highly accurate results: with 4-frame input, the minimum MAE was 0.51 for depth and 0.17 for velocity. Figure 12 shows the training loss and the validation depth and velocity MAE. The loss decreases over the first 1000 iterations but then stagnates in later iterations. Similar behaviour can be seen in the validation depth and velocity MAE.
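As a reminder of what the reported metric measures, the following is a minimal sketch of computing MAE separately for depth and velocity, assuming MAE here is the usual mean absolute error over per-vehicle predictions. The function name `mean_absolute_error` and all sample values are hypothetical and are not taken from the experiments.

```python
def mean_absolute_error(predictions, targets):
    """Return the mean of |prediction - target| over paired values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# Made-up validation samples, one entry per vehicle (illustrative only).
pred_depth = [10.0, 20.5, 31.0]
true_depth = [10.5, 20.0, 30.0]
pred_velocity = [1.0, -0.5, 2.0]
true_velocity = [1.25, -0.5, 1.75]

# Depth and velocity are scored independently, as in the reported results.
depth_mae = mean_absolute_error(pred_depth, true_depth)
velocity_mae = mean_absolute_error(pred_velocity, true_velocity)
print(f"depth MAE: {depth_mae:.3f}, velocity MAE: {velocity_mae:.3f}")
```

Scoring the two quantities separately keeps their errors comparable across runs even though depth and velocity are on different scales.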
Discussion :-
The validation values, however, fluctuate considerably, which suggests that the training data is insufficient. The model could therefore be trained on depth maps and optical flow maps separately, using the many available datasets (for the whole image rather than only the center points of the cars).
One promising remedy is to train the model on a depth dataset first and then fine-tune it for velocity.
Depth and velocity could also be modelled more effectively than by simply modifying the head and regressing the values directly.
Figure 10: (a) Training loss considering input every two frames. (b) Validation depth MAE considering input every two frames. (c) Validation velocity MAE considering input every two frames.
Figure 11: (a) Training loss considering input every four frames. (b) Validation depth MAE considering input every four frames. (c) Validation velocity MAE considering input every four frames.