Task-Oriented Pre-Training for Drivable Area Detection
Fulong Ma, Guoyang Zhao, Weiqing Qi, Ming Liu, and Jun Ma
HKUST(GZ) HKUST
[paper] [code]
This webpage is under construction.
Motivation
Traditional pre-training and self-training methods require large datasets and substantial computational resources, yet they learn only shared, generic features through prolonged training, making it difficult to capture deeper task-specific features.
For learning-based methods, data annotation is labor-intensive, time-consuming, and costly. Improving the performance of task-specific models without manual data annotation is therefore highly valuable.
Method
The overall architecture of our method. On the left side of the dashed line, segmentation proposals are generated with the SAM model and then used to fine-tune the CLIP model. On the right side of the dashed line, the segmentation proposals are combined with text prompts to classify the corresponding image patches, and the patch that most closely matches the drivable area is selected.
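To make the proposal-scoring step concrete, below is a minimal sketch using the off-the-shelf segment_anything and clip packages. It uses zero-shot CLIP rather than the fine-tuned CLIP described above, and the checkpoint path and prompt strings are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load SAM and CLIP. The checkpoint file and model variants are placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical text prompts; the paper's actual prompt set may differ.
prompts = ["a photo of a drivable road surface", "a photo of something else"]
text_tokens = clip.tokenize(prompts).to(device)

def select_drivable_mask(image_rgb: np.ndarray) -> np.ndarray:
    """Return the SAM proposal whose patch CLIP scores as most road-like."""
    proposals = mask_generator.generate(image_rgb)  # list of mask dicts
    best_mask, best_score = None, -1.0
    with torch.no_grad():
        text_feat = clip_model.encode_text(text_tokens)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        for prop in proposals:
            x, y, w, h = [int(v) for v in prop["bbox"]]  # XYWH box of the mask
            # Zero out non-proposal pixels, then crop the bounding box.
            patch = image_rgb.copy()
            patch[~prop["segmentation"]] = 0
            patch = patch[y:y + h, x:x + w]
            if patch.size == 0:
                continue
            img_in = clip_preprocess(Image.fromarray(patch)).unsqueeze(0).to(device)
            img_feat = clip_model.encode_image(img_in)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            # Softmax over prompts; index 0 is the "drivable road" prompt.
            probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
            score = probs[0, 0].item()
            if score > best_score:
                best_score, best_mask = score, prop["segmentation"]
    return best_mask  # boolean HxW pseudo-label for the drivable area
```

The selected mask can then serve as a pseudo-label for pre-training a downstream segmentation model, which is the role the right-hand side of the figure plays in our pipeline.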
Qualitative Results
Qualitative comparison between models trained with and without our task-oriented pre-training. Our method consistently improves performance across three single-modal architectures (CNN, Transformer, and Mamba, represented by UNet, SegFormer, and VM-UNet, respectively). It also boosts performance in three multimodal settings (image-LiDAR, image-depth, and image-text, corresponding to PLARD, SNE-RoadSeg, and LViT).
Quantitative Results
Comparison of three classic single-modal segmentation models on the KITTI road dataset for drivable area detection, trained with our pre-training method, with no pre-training, and with conventional ImageNet pre-training. The best results are shown in bold type.
Comparison of three multi-modal models with different inputs for drivable area detection, trained with our pre-training method, with no pre-training, and with conventional ImageNet pre-training. The best results are shown in bold type.
Comparison between our method and self-training methods such as MoCo, SimCLR, MAE, and DINO. The best results are shown in bold type.
Video
Video coming soon.