MM-SeR:
Multimodal Self-Refinement for Lightweight Image Captioning
What is it?
When describing an image, humans first take in the overall scene, then glance at specific regions to notice finer details. Our Sharp-Eyed Refinement framework mimics this coarse-to-fine habit: the lightweight captioning specialist first produces an initial description, then revises and improves it.
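To make the draft-then-refine idea concrete, here is a minimal sketch of the control flow only. It is not the MM-SeR implementation: `caption_with_refinement`, `draft_fn`, `refine_fn`, and the toy stand-ins below are hypothetical names introduced for illustration, and a real system would put a captioning model behind each callable.

```python
from typing import Any, Callable


def caption_with_refinement(
    image: Any,
    draft_fn: Callable[[Any], str],
    refine_fn: Callable[[Any, str], str],
    rounds: int = 1,
) -> str:
    """Two-stage captioning: draft from the whole image, then revise.

    draft_fn and refine_fn are hypothetical placeholders for model calls.
    """
    # Stage 1: initial description from the overall scene.
    caption = draft_fn(image)
    # Stage 2: look at the image again and revise the current caption.
    for _ in range(rounds):
        caption = refine_fn(image, caption)
    return caption


# Toy stand-ins (assumptions, not real models): the "image" is a dict
# with a coarse scene label and a list of fine-grained details.
def toy_draft(image: dict) -> str:
    return f"a photo of {image['scene']}"


def toy_refine(image: dict, caption: str) -> str:
    # Append any detail that the current caption does not mention yet.
    missing = [d for d in image["details"] if d not in caption]
    if not missing:
        return caption
    return caption + " with " + " and ".join(missing)


if __name__ == "__main__":
    image = {"scene": "a kitchen", "details": ["a red kettle"]}
    print(caption_with_refinement(image, toy_draft, toy_refine))
```

Passing the draft and refine steps in as callables keeps the sketch independent of any particular model or library; only the ordering (describe first, then revise with the image still in view) is taken from the description above.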