A more robust depth predictor, based on an ensemble of depth estimates, is deployed. The task is monocular 3D detection, which involves predicting the 3D location (x, y, z), the dimensions (h, w, l) and the orientation θ of each object. The aim of this project is to predict dimensions and orientation implicitly while directly regressing 3D key-points. Additionally, a direct depth regressor predicts depth from visual cues in the image.
Framework :-
The network architecture extends CenterNet, in which objects are identified by their representative points, predicted as peaks of a heatmap. MonoFlex adds multiple regression branches that share the same backbone and regress object properties including the 2D bounding box, dimensions, orientation, key-points and depth. The final depth is an uncertainty-guided combination of the directly regressed depth and geometry-based estimates computed from the orientation, dimensions and 2D key-points. The overview of the model is similar to the network shown in Figure 6. After replicating MonoFlex, the contribution of the different depth estimates was measured, as shown in Table 1. This approach predicts a decoupled representation of objects, i.e. dimensions, orientation, depth and key-points on the 2D image.
However, as Table 1 shows, approximately 80% of the final depth is explained by the direct depth regressor alone.
Hence, we considered direct regression of 3D key-points, i.e. a coupled representation of the object, instead of regressing individual object properties.
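For reference, the sketch below shows one way the uncertainty-guided combination of depth estimates can be computed: each estimate is weighted by its inverse predicted uncertainty and the weights are normalized per object. This is a minimal PyTorch sketch; the function name and tensor shapes are our own, not the exact MonoFlex implementation.

```python
import torch

def combine_depths(depths: torch.Tensor, uncertainties: torch.Tensor) -> torch.Tensor:
    """Uncertainty-guided combination of per-object depth estimates (sketch).

    depths:        (N, K) K depth estimates per object (direct + geometric).
    uncertainties: (N, K) predicted uncertainty for each estimate.
    """
    weights = 1.0 / uncertainties.clamp(min=1e-6)          # larger uncertainty -> smaller weight
    weights = weights / weights.sum(dim=1, keepdim=True)   # normalized per object
    return (weights * depths).sum(dim=1)                   # (N,) fused depth
```

The normalized weights of the individual estimates are the quantities summarized in Table 1.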
Direct Regression of 3D Key-points :-
The standard architecture of MonoFlex deploys five heads after the CenterNet backbone. These heads are replaced with two heads: one that directly regresses depth and one that regresses the 8 vertices of the 3D bounding box. The coordinates of the 3D bounding box are predicted in the camera coordinate system. The aim is to learn the dimensions and orientation implicitly. After predicting the eight key-points, two estimates of depth are obtained by averaging the diagonals of the 3D bounding box.
The corresponding network architecture is shown in Figure 7.
Figure 6: Depth prediction using MonoFlex
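As an illustration of this geometric depth, the following sketch derives the two depth estimates from the eight regressed vertices. The corner ordering (corner i opposite corner 7 - i) is an assumption and would need to match the actual regression target.

```python
import torch

def depth_from_box_diagonals(corners_cam: torch.Tensor) -> torch.Tensor:
    """Two depth estimates from the eight regressed 3D box vertices (sketch).

    corners_cam: (N, 8, 3) vertices in camera coordinates, assumed ordered so
                 that corner i and corner 7 - i are diagonally opposite.
    Averaging the z-coordinates of a diagonally opposite pair recovers the
    depth of the box centre, giving one estimate per diagonal.
    """
    z = corners_cam[..., 2]                        # (N, 8) depth of each vertex
    depth_a = 0.5 * (z[:, 0] + z[:, 7])            # first diagonal
    depth_b = 0.5 * (z[:, 1] + z[:, 6])            # second diagonal
    return torch.stack([depth_a, depth_b], dim=1)  # (N, 2)
```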
Regress 14 more key-points :-
The 3D key-points corresponding to the bounding box corners may not have a proper representation in the image, since those pixels often belong to the road or background rather than to the car. Hence, 14 additional key-points are sampled at fixed offsets from the box center using the box dimensions: 6 points are sampled on the 6 faces of the cuboid and 8 more points are sampled on the diagonals, at one fourth of the diagonal length. These 14 points are assumed to carry more visual cues than the 8 regressed corner key-points. After the key-points are regressed, two estimates of depth are obtained by averaging the two diagonals of the cuboid, and one more estimate is obtained by averaging the additional key-points.
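A sketch of how the 14 additional points could be generated as regression targets, assuming KITTI-style camera coordinates (x right, y down, z forward), dimensions ordered (h, w, l) and rotation about the vertical axis; the exact axis conventions and diagonal placement used in the project may differ.

```python
import numpy as np

def sample_extra_keypoints(center, dims, yaw):
    """14 extra 3D key-points for one box: 6 face centres + 8 diagonal points (sketch).

    center: (3,) box centre in camera coordinates.
    dims:   (3,) box dimensions (h, w, l).
    yaw:    rotation about the vertical (y) axis.
    """
    h, w, l = dims
    # Face centres in the object frame (x -> length, y -> height, z -> width; assumed).
    faces = np.array([[ l / 2, 0, 0], [-l / 2, 0, 0],
                      [0,  h / 2, 0], [0, -h / 2, 0],
                      [0, 0,  w / 2], [0, 0, -w / 2]])
    # Corners in the object frame; a point at one fourth of the full diagonal
    # length from the centre lies halfway between the centre and each corner.
    corners = np.array([[sx * l / 2, sy * h / 2, sz * w / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    diag_points = 0.5 * corners
    offsets = np.vstack([faces, diag_points])              # (14, 3) in object frame
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])                             # rotation about y
    return offsets @ R.T + np.asarray(center)              # (14, 3) in camera frame
```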
Regress 89 more semantically meaningful key-points :-
After the regression of the 14 key-points, more semantically meaningful key-points are regressed to represent a car. These points are obtained using LiDAR data. According to the labeled 3D bounding boxes, the individual 3D points of each object are first segmented out of the raw point cloud. The ground points are then removed using ground-plane estimation (RANSAC-based plane fitting), yielding the “clean” 3D points for each vehicle. The corresponding points on the segmented image are obtained by projecting these points onto the image plane. The 3D key-points are generated by constraining the LiDAR points with the 2D image and optimizing a reprojection loss and a reconstruction loss. This approach generates 3009 key-points, of which 89 are sampled at regular intervals, since regressing all 3009 points could overfit the data. In addition to the two estimates of depth obtained from the corners, another estimate of depth is obtained by adding convolution layers. The corresponding key-points on the image are shown in Figure 7.
Figure 7: (a) 14 Key-points on the car. (b) 89 Semantically meaningful key-points.
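The following sketch outlines the ground removal and projection step described above, using Open3D's RANSAC plane fitting. The thresholds, the per-vehicle plane fit and the regular-interval sub-sampling to 89 points are illustrative choices, not the exact pipeline used to produce the 3009 points.

```python
import numpy as np
import open3d as o3d

def clean_and_project(vehicle_points, P2, num_keypoints=89):
    """Remove ground points from a segmented vehicle cloud and project it (sketch).

    vehicle_points: (M, 3) LiDAR points inside one labeled 3D box, assumed to be
                    already transformed into the camera coordinate frame.
    P2:             (3, 4) camera projection matrix (KITTI-style).
    """
    # RANSAC-based plane fitting; inliers are treated as ground and dropped.
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(vehicle_points))
    _, ground_idx = pcd.segment_plane(distance_threshold=0.1,
                                      ransac_n=3, num_iterations=200)
    keep = np.ones(len(vehicle_points), dtype=bool)
    keep[ground_idx] = False
    clean = vehicle_points[keep]

    # Sub-sample at regular intervals (e.g. 89 out of ~3009 generated points).
    step = max(len(clean) // num_keypoints, 1)
    sampled = clean[::step][:num_keypoints]

    # Project the "clean" points onto the image plane.
    homo = np.hstack([sampled, np.ones((len(sampled), 1))])   # (K, 4)
    uvw = homo @ P2.T                                          # (K, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]
    return sampled, uv
```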
Removing the dependency on Optical Flow to predict velocity - AB3DMOT :-
After regressing the key-points with the proposed MonoFlex, a non-learning-based state estimation approach that is both efficient and simple in design, AB3DMOT, is used; it deploys a constant-velocity Kalman filter coupled with the Hungarian method. AB3DMOT has an additional constraint mechanism called Birth and Death Memory, which considers a matched track valid only after it has been detected for a given threshold number of frames; the reported velocity is therefore zero until that threshold frame is reached. Unmatched detections and unmatched trajectories are used to create new trajectories and to delete disappeared trajectories. For the associated trajectories, the Kalman filter predicts and updates the trajectory states, and the trajectories are matched with the 3D detections using the Hungarian algorithm. The algorithm is shown in Figure 8.
Figure 8: Monoflex and AB3DMOT
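A simplified, NumPy-only sketch of the tracking loop described above. The actual AB3DMOT implementation runs a full Kalman filter over a larger state (position, orientation, dimensions and velocity) and associates detections by 3D IoU; here the centre-distance cost, the crude velocity update and the thresholds min_hits and max_age are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    """One tracked object with a constant-velocity state (simplified sketch)."""
    def __init__(self, box):
        self.pos = np.asarray(box[:3], dtype=float)  # 3D centre of the detection
        self.vel = np.zeros(3)                       # velocity starts at zero
        self.hits, self.misses = 1, 0

    def predict(self):
        self.pos = self.pos + self.vel               # constant-velocity motion model
        return self.pos

    def update(self, box):
        new_pos = np.asarray(box[:3], dtype=float)
        self.vel = new_pos - self.pos                # crude velocity estimate (no Kalman gain)
        self.pos = new_pos
        self.hits += 1
        self.misses = 0

def associate(tracks, detections, max_dist=2.0):
    """Hungarian matching on centre distance (AB3DMOT itself uses 3D IoU)."""
    if not tracks or not len(detections):
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.linalg.norm(
        np.array([t.pos for t in tracks])[:, None, :] -
        np.asarray(detections)[None, :, :3], axis=2)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]
    unmatched_t = [i for i in range(len(tracks)) if i not in {r for r, _ in matches}]
    unmatched_d = [j for j in range(len(detections)) if j not in {c for _, c in matches}]
    return matches, unmatched_t, unmatched_d

def step(tracks, detections, min_hits=3, max_age=2):
    """One tracking step: predict, associate, update, then Birth and Death Memory."""
    for t in tracks:
        t.predict()
    matches, unmatched_t, unmatched_d = associate(tracks, detections)
    for r, c in matches:
        tracks[r].update(detections[c])
    for i in unmatched_t:
        tracks[i].misses += 1                                 # death: age out stale tracks
    tracks = [t for t in tracks if t.misses <= max_age]
    tracks += [Track(detections[j]) for j in unmatched_d]     # birth: start new tracks
    # Only tracks seen for at least min_hits frames are reported.
    return tracks, [t for t in tracks if t.hits >= min_hits]
```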