The baseline (Mono-velocity), shown in Fig. 5, is trained end to end. Its key novelty is the integration of multiple visual cues from any two time frames, namely geometric cues and optical-flow cues. The baseline also proposes a vehicle-centric sampling mechanism, a pre-processing step that alleviates distortion in the motion field. The network primarily consists of three parts, as shown below:
Vehicle-Centric Mechanism: After a bounding box is detected in the current frame, the box region is cropped and resized to a fixed input size. The same process is applied to the previous frame, which is also used in the calculations.
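The crop-and-resize step can be sketched as follows. This is only an illustration: the function name `crop_and_resize`, the `(l, t, r, b)` box layout, the 64x64 output size and the nearest-neighbour resampling are assumptions made for the example, not details taken from the baseline.

```python
import numpy as np


def crop_and_resize(frame, box, out_hw=(64, 64)):
    """Crop box = (l, t, r, b) in pixels from an H x W (x C) frame and
    resize the crop to out_hw with nearest-neighbour sampling."""
    l, t, r, b = box
    h, w = frame.shape[:2]
    # Clamp the box to the image so slicing never goes out of range.
    l, r = max(0, l), min(w, r)
    t, b = max(0, t), min(h, b)
    if r <= l or b <= t:
        raise ValueError("bounding box is empty after clamping")
    crop = frame[t:b, l:r]
    out_h, out_w = out_hw
    # Map every output row/column back to a source row/column of the crop.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows[:, None], cols]


# The same routine is applied to the current and the previous frame.
current = np.zeros((720, 1280, 3), dtype=np.uint8)
previous = np.zeros((720, 1280, 3), dtype=np.uint8)
patch_curr = crop_and_resize(current, (600, 300, 700, 380))
patch_prev = crop_and_resize(previous, (598, 301, 696, 379))
```

Both patches end up with the same shape regardless of the original box size, which is what lets them be fed to a fixed-size network input.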
The vehicle characteristics are extracted from the features (fi) of a deep network consisting of the PWC encoder, ROI Align and two convolution layers. These features are substituted into Equation 1 below, together with the bottom (b), top (t), right (r) and left (l) pixel positions of the bounding box.
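The learned regression of Equation 1 is not reproduced here. As a rough illustration of why the box edges carry depth information at all, the sketch below uses the standard pinhole relation between an object's physical height and its height in pixels. The function name and the assumed physical vehicle height of 1.5 m are hypothetical and are not part of the baseline.

```python
def depth_from_box_height(t, b, fy, real_height_m=1.5):
    """Pinhole estimate of depth in metres from the top (t) and bottom (b)
    pixel rows of a bounding box, given the focal length fy in pixels.
    real_height_m is an assumed physical vehicle height."""
    pixel_height = b - t
    if pixel_height <= 0:
        raise ValueError("bottom edge must lie below top edge")
    return fy * real_height_m / pixel_height


# A box 100 px tall seen with fy = 1000 px.
print(depth_from_box_height(t=200, b=300, fy=1000.0))
```

A purely geometric estimate like this is sensitive to the assumed height and to box noise, which is one motivation for combining the box positions with learned features instead.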
Once the depth is obtained from distance regression, as given by Equation 1, the current velocity in the x and y directions is calculated using the corrected depths of the current and previous frames. The bounding boxes and vehicle characteristics are obtained from the deep features. cx, cy, fx and fy are defined by the camera matrix, while ui and vi are the pixel velocities in the x and y directions.
Our approach gives better results than this baseline. See Modified Monoflex and its results for the comparison.
Figure 5: Network architecture proposed in Vehicle-Centric Approach (18)
Equation 1