Results :-
The evaluation of the proposed approach against the baseline Monoflex method for 3D detection on the described KITTI dataset is shown in the table. Using 22 key-points gives the best results, with little difference when 89 key-points are used. Simply increasing the number of key-points therefore appears to plateau in accuracy; a segmented depth map may help instead, although segmentation requires more image labels, which are difficult to obtain. The final AP is also lower than that of the original baseline, which suggests that an uncoupled representation of the object could be the best way to learn the bounding box.
Discussion :-
Our proposed frameworks perform worse than the baseline Monoflex results on the KITTI benchmark. Our most successful method used 22 key-points to regress depth directly and produced a decrease in MAE of 3.7% compared to the baseline. Our least successful method directly regressed the 8 bounding-box points.
Training was faster than for the original baseline: the baseline needs 35000 iterations to converge, whereas our approach converges in around 10000 iterations, about 3.5 times fewer. Additionally, inference time improves by 10%, since the five heads are reduced to two.
Figure 9 in the Appendix shows how the overall IoU for cars increases during training. This indicates that the model is able to fit the box to the car properly.
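For readers unfamiliar with the metric, the following is a minimal sketch of how an IoU between a predicted and a ground-truth box can be computed. It assumes axis-aligned 3D boxes given as (x_min, y_min, z_min, x_max, y_max, z_max), which is a simplification: the KITTI evaluation uses boxes rotated about the vertical axis, and the function name is hypothetical rather than taken from our code.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes.

    Each box is (x_min, y_min, z_min, x_max, y_max, z_max).
    Simplifying assumption: no rotation, unlike real KITTI boxes.
    """
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


if __name__ == "__main__":
    predicted = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    ground_truth = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
    print(iou_3d_axis_aligned(predicted, ground_truth))
```

An IoU that rises towards 1 during training means the predicted box overlaps the labelled box more and more closely.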
Losses for the different depth estimates are compared in the Appendix. The direct depth estimator works equivalently for all methods, and the depth loss from corner points also converges. The depth estimates from the added key-points (22 and 89) fail to predict the depth accurately (as can be inferred from the additional key-point MAE), and their uncertainty is high, as shown in Figures 10 and 11. The depth prediction from these key-points needs to be modelled more appropriately.
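To illustrate why a high predicted uncertainty matters, the sketch below combines several depth estimates using inverse-uncertainty weighting. This weighting scheme is an assumption made for illustration and is not confirmed by the text above as the exact combination used in our experiments; the function name and example values are likewise hypothetical.

```python
def combine_depths(depths, sigmas):
    """Combine depth estimates, weighting each by 1 / sigma.

    depths: depth estimates in metres (e.g. direct, corner-based, key-point-based).
    sigmas: predicted uncertainty of each estimate; must be positive.
    Assumption: inverse-uncertainty weighting, chosen here for illustration only.
    """
    if len(depths) != len(sigmas) or not depths:
        raise ValueError("depths and sigmas must be non-empty and of equal length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")
    weights = [1.0 / s for s in sigmas]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)


if __name__ == "__main__":
    # A confident estimate (sigma = 1.0) and an uncertain one (sigma = 4.0).
    print(combine_depths([10.0, 20.0], [1.0, 4.0]))
```

Under this assumption, an estimate with high uncertainty contributes little to the combined depth, so key-point depth estimates with high uncertainty would add little beyond the direct estimator.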
Figure 10 shows that the key-point loss and key-point MAE are low, which suggests that this approach is able to learn the key-points properly.
Figure 11 shows that the key-point loss and key-point MAE are high, which suggests that this approach is not able to learn the key-points properly and that the key-points need to be modelled in a more meaningful way.
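As a reference for the key-point MAE reported in these figures, the following is a minimal sketch of one way to compute it. It assumes key-points are 2D image coordinates (u, v) and that the error is averaged over all coordinates of all key-points; the exact averaging used in our training logs may differ, and the function name is hypothetical.

```python
def keypoint_mae(predicted, target):
    """Mean absolute error over all coordinates of all key-points.

    predicted, target: equal-length lists of (u, v) pixel coordinates.
    Assumption: the error is averaged over both coordinates of every key-point.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (pu, pv), (tu, tv) in zip(predicted, target):
        total += abs(pu - tu) + abs(pv - tv)
    return total / (2 * len(predicted))


if __name__ == "__main__":
    pred = [(0.0, 0.0), (2.0, 2.0)]
    gt = [(1.0, 0.0), (2.0, 4.0)]
    print(keypoint_mae(pred, gt))
```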