Junsu Kim - vlmpl

VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model [CVPRW'24] [PAPER]

Researchers and engineers dealing with incremental or continual learning where each new stage adds additional classes
Practitioners exploring ways to refine pseudo-labels for object detection (e.g., repeated labeling errors, noisy annotations)
Those interested in vision-language models (VLMs) and how they can assist conventional computer vision tasks

VLM-Assisted Refinement: Introduces a novel pipeline that filters out incorrect pseudo ground-truth boxes using a large vision-language model, without extra training
Prompt Tuning for Region Verification: Constructs dedicated prompts that include each box's coordinates and object name to confirm or reject each pseudo-labeled instance
Enhanced Pseudo-Labeling: Demonstrates how verifying old-class labels with a VLM significantly reduces pseudo-labeling mistakes, which helps mitigate catastrophic forgetting in multi-stage class-incremental object detection (CIOD)
Replay-Free Approach: Achieves strong incremental detection results without replaying past real or synthetic data
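
The verification step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `PseudoBox` type, the prompt wording, the `ask_vlm` callable, and the yes/no answer parsing are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class PseudoBox:
    """A pseudo ground-truth box from the previous-stage detector (hypothetical type)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def build_prompt(box: PseudoBox) -> str:
    # Hypothetical wording: the source only states that the prompt carries
    # the bounding box coordinates and the object name.
    return (
        f"Is the object in region [{box.x1}, {box.y1}, {box.x2}, {box.y2}] "
        f"a {box.label}? Answer yes or no."
    )


def filter_pseudo_labels(
    image: Any,
    boxes: List[PseudoBox],
    ask_vlm: Callable[[Any, str], str],
) -> List[PseudoBox]:
    """Keep only the boxes that the VLM confirms; the VLM itself is not trained."""
    kept = []
    for box in boxes:
        # ask_vlm is an assumed interface: (image, prompt) -> free-text answer.
        answer = ask_vlm(image, build_prompt(box)).strip().lower()
        if answer.startswith("yes"):
            kept.append(box)
    return kept
```

In use, `ask_vlm` would wrap whichever vision-language model is available. Anything other than an answer beginning with "yes" is treated as a rejection here, which is a conservative choice for this sketch rather than something stated in the source.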

Long-Term Model Maintenance: Useful for camera surveillance, robotics, or any system updated with new objects regularly, since errors from older tasks get filtered out
Data-Limited Environments: In setups where storing a large replay buffer is infeasible, VLM-based verification helps maintain old-class accuracy
Streamlined Labeling: Potentially reduces human annotation overhead, since the system filters the older model's incorrect predictions using the VLM's pretrained vision-language knowledge

Multi-Scenario Reliability: The method corrects pseudo-labeling errors in both dual-scenario and multi-scenario incremental setups
Prompt-Driven Verification: By harnessing the VLM's large-scale knowledge, it reduces the labeling mistakes inherited from previous models
No Extra Training for Classification: The VLM step requires no additional network training, only well-crafted prompts for region-based queries
Replay-Free State-of-the-Art: Delivers leading performance on PASCAL VOC and MS COCO without storing or replaying real old data
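
To make the replay-free setup concrete, the sketch below shows one way the training targets for a new stage could be assembled from the current stage's images alone: new-class ground truth plus VLM-verified old-class pseudo labels. The function names, the target dictionary layout, and the IoU-based duplicate check are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[str, Box]                # (class name, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_stage_targets(
    new_gt: List[Labeled],
    verified_pseudo: List[Labeled],
    iou_thresh: float = 0.7,  # assumed value, chosen only for this example
) -> List[Dict[str, object]]:
    """Combine new-class ground truth with verified old-class pseudo labels."""
    targets = [{"label": l, "box": b, "source": "ground_truth"} for l, b in new_gt]
    for label, box in verified_pseudo:
        # Assumption: skip a pseudo box that nearly coincides with a real annotation.
        if any(iou(box, gt_box) >= iou_thresh for _, gt_box in new_gt):
            continue
        targets.append({"label": label, "box": box, "source": "pseudo"})
    return targets
```

No images or annotations from earlier stages are stored in this sketch; old-class supervision comes only from the previous model's verified predictions on the current data.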