Monoflex :- Since directly regressing 3D keypoints from a single 2D image is difficult, four losses that are robust to scale-factor variance are considered, and an uncertainty term is attached to each of them. The losses investigated are as follows.
1) Inverse Sigmoid Loss :- Directly regressing absolute depth requires an unbounded network output, which is (empirically) unstable to learn. The output z_o is therefore squashed to (0, 1) by a sigmoid and converted to absolute depth as z = 1/σ(z_o) − 1. An L1 loss is then applied to the decoded depth.
2) Uncertainty-based L1 Loss :- This models the uncertainty (the inverse of confidence) of the multiple depth estimates produced by the network: up to constant factors, the L1 error of each estimate is divided by a predicted uncertainty σ_dep and a log σ_dep penalty is added, i.e. L_dep = |z − z*|/σ_dep + log σ_dep.
Less confidence ⟹ higher σ_dep ⟹ smaller contribution of that estimate to L_dep.
3) BerHu Loss :- The reverse Huber loss behaves like the L1 loss for small errors and switches to a quadratic penalty above a threshold, so large errors are amplified relative to a plain L1 loss.
4) Scale Invariant Loss :- L_SI = (1/n) Σ_i d_i² − (λ/n²) (Σ_i d_i)², where d_i = log(predicted depth) − log(ground-truth depth). It measures the relationships between points in the scene irrespective of the absolute global scale. Code sketches of these four losses are given below.
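A minimal PyTorch-style sketch of loss (1), assuming the standard inverse-sigmoid depth decoding z = 1/σ(z_o) − 1; the function and argument names are illustrative, not taken from the Monoflex code.

```python
import torch
import torch.nn.functional as F

def inverse_sigmoid_depth_loss(z_o, depth_gt):
    # Squash the unbounded output z_o to (0, 1) with a sigmoid, then map it
    # back to an absolute depth in (0, +inf) and apply an L1 loss.
    depth_pred = 1.0 / torch.sigmoid(z_o) - 1.0
    return F.l1_loss(depth_pred, depth_gt)
```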
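A sketch of loss (2) under the common Laplacian aleatoric-uncertainty formulation (L1 term divided by σ_dep plus a log σ_dep penalty); constant factors are omitted and the names are placeholders.

```python
import torch

def uncertainty_l1_loss(depth_pred, depth_gt, log_sigma_dep):
    # Low confidence (large sigma_dep) shrinks the L1 term of that estimate,
    # while the log(sigma_dep) penalty stops the network from predicting
    # arbitrarily large uncertainty everywhere.
    sigma_dep = torch.exp(log_sigma_dep)
    return (torch.abs(depth_pred - depth_gt) / sigma_dep + log_sigma_dep).mean()
```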
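A sketch of loss (3); setting the threshold c as a fraction of the current maximum error is a common convention assumed here, not a detail given in the text.

```python
import torch

def berhu_loss(pred, target, c_ratio=0.2):
    # BerHu (reverse Huber): L1 below the threshold c, quadratic above it,
    # so small errors behave like L1 while large errors are amplified.
    diff = torch.abs(pred - target)
    c = c_ratio * diff.max().detach() + 1e-8
    quadratic = (diff ** 2 + c ** 2) / (2.0 * c)
    return torch.where(diff <= c, diff, quadratic).mean()
```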
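A sketch of loss (4), following the Eigen-style scale-invariant log-depth formulation; the weight λ = 0.5 is a common default, not a value stated in the text.

```python
import torch

def scale_invariant_log_loss(pred, target, lam=0.5):
    # Assumes strictly positive depths. d = log(pred) - log(gt); subtracting
    # the squared mean of d removes the global-scale component of the error.
    d = torch.log(pred) - torch.log(target)
    return (d ** 2).mean() - lam * d.mean() ** 2
```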
MViT :- Since MViT is a simple baseline that directly regresses depth and velocity (optical-flow maps), two losses are used.
1) Structural Similarity Index (SSIM) :- When comparing images, the mean squared error (MSE) is simple to compute but is not highly indicative of perceived similarity. SSIM addresses this shortcoming by taking local structure and texture into account.
2) Root Mean Square Error (RMSE) :- Since velocity values are large, the contribution of each error is proportional to its squared magnitude, so large errors dominate. RMSE is the square root of the mean squared deviation between the predicted and ground-truth values. Sketches of both losses are given below.
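A simplified PyTorch sketch of an SSIM loss using a box-filter window rather than the Gaussian window of the original SSIM; the window size and constants c1, c2 are the usual defaults and are not necessarily those used by the MViT baseline.

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, window_size=11, c1=0.01 ** 2, c2=0.03 ** 2):
    # pred, target: (B, C, H, W) maps scaled to [0, 1].
    pad = window_size // 2
    # Local means, variances and covariance from a uniform sliding window.
    mu_x = F.avg_pool2d(pred, window_size, stride=1, padding=pad)
    mu_y = F.avg_pool2d(target, window_size, stride=1, padding=pad)
    var_x = F.avg_pool2d(pred * pred, window_size, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(target * target, window_size, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * target, window_size, stride=1, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    # Loss form: 1 - mean SSIM, so identical images give a loss of 0.
    return 1.0 - ssim_map.mean()
```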
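A minimal RMSE sketch in the same style; because errors are squared before averaging, large velocity errors dominate the loss.

```python
import torch
import torch.nn.functional as F

def rmse_loss(pred, target):
    # Square root of the mean squared deviation between prediction and target.
    return torch.sqrt(F.mse_loss(pred, target))
```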