Implementation details for Monoflex: We adopt the same modified DLA-34 (23) as our backbone network for Monoflex. All input images are padded to the same size of 384 × 1280. Every prediction head attached to the backbone consists of one 3 × 3 × 256 convolution layer, BatchNorm, ReLU, and another 1 × 1 × c_o convolution layer, where c_o is the output size. The model is trained with the AdamW (24) optimizer, with an initial learning rate of 0.0003 and a weight decay of 0.00001. We train the model for at least 10,000 iterations with a batch size of 8 on a single Quadro RTX 6000 GPU, and the learning rate is divided by 10 after iterations 5,000 and 8,000. Random horizontal flipping with a probability of 0.5 is the only data augmentation used.
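For reference, a minimal PyTorch-style sketch of one such prediction head and the stated training schedule is given below; the function name, the 64-channel backbone feature size, and the 3-channel output are illustrative assumptions, not the actual Monoflex code.

```python
import torch
import torch.nn as nn

def make_prediction_head(in_channels: int, out_channels: int) -> nn.Sequential:
    """One head: 3x3x256 conv -> BatchNorm -> ReLU -> 1x1 conv to c_o channels."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_channels, kernel_size=1),
    )

# Example head on top of 64-channel DLA-34 features (channel counts assumed).
head = make_prediction_head(in_channels=64, out_channels=3)

# Optimizer and step schedule as stated in the text, stepped per iteration.
optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5000, 8000], gamma=0.1)
```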
Implementation details for MViT: The backbone for this implementation is similar to MViT-B in the original MViT paper, with a space-time cube of shape 3 × 7 × 7, followed by four stages of Multi-Head Pooling Attention and MLP blocks, and a final fully connected layer that regresses the confidence score, depth, and velocity. The input frames were resized to 224 × 621 and cropped to 224 × 224 to match the input size expected by MViT. Color jitter augmentation at the 224 × 224 scale was used during training, similar to the original MViT baseline. The model was trained for 400 epochs using AdamW with an initial learning rate of 0.0001 and a weight decay of 0.05 on an AWS g4dn.4xlarge instance.
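A minimal sketch of the input tokenization and regression head described above is shown next, assuming a PyTorch-style implementation; the stride and padding of the cube projection, the 96-dimensional embedding, and the 768-dimensional final MViT-B feature size are assumptions rather than details from the authors' code.

```python
import torch
import torch.nn as nn

class CubeEmbed(nn.Module):
    """Space-time cube tokenization with a 3 x 7 x 7 kernel (stride/padding assumed)."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 96):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=(3, 7, 7),
                              stride=(2, 4, 4),
                              padding=(1, 3, 3))

    def forward(self, x):                      # x: (B, 3, T, 224, 224)
        x = self.proj(x)                       # (B, C, T', H', W')
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, C)

# Final fully connected layer regressing confidence score, depth, and velocity
# from the pooled transformer feature (768 channels assumed for MViT-B).
regression_head = nn.Linear(768, 3)

# Optimizer settings as stated in the text (MViT attention stages omitted here).
model = nn.ModuleList([CubeEmbed(), regression_head])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```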
Evaluation Metrics:
3D detections are evaluated by the average precision of 3D bounding boxes (AP3D). For the validation data, we report AP3D|R40 for a comprehensive representation, where R40 denotes 40 equidistant recall points between 0 and 1. The IoU threshold used for precision is 0.7, since this experiment considers only clearly visible cars.
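For reference, a small NumPy sketch of the AP|R40 computation over 40 equidistant recall points is given below; the function name and its inputs are illustrative and not tied to a specific evaluation toolkit.

```python
import numpy as np

def ap_r40(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP over 40 equidistant recall points.

    recalls/precisions: points on the precision-recall curve obtained from
    ranked detections matched to ground truth at 3D IoU >= 0.7.
    """
    recall_points = np.linspace(1.0 / 40, 1.0, 40)  # 40 equidistant points
    ap = 0.0
    for r in recall_points:
        mask = recalls >= r
        # Interpolated precision: best precision at any recall >= r, 0 if none.
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 40.0
    return ap
```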
The absolute depth and velocity values are evaluated using the Mean Absolute Error (MAE), which is standard for comparing regressed values.
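A corresponding MAE sketch for the depth and velocity regressions (array names are placeholders):

```python
import numpy as np

def mean_absolute_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Average of |prediction - ground truth| over all evaluated objects."""
    return float(np.mean(np.abs(pred - target)))
```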