MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

What is it?

Humans first take in the overall scene, then glance at specific regions to pick up finer details. Our Sharp-Eyed Refinement framework mimics this tendency: the captioning specialist writes an initial description, then revises and improves it.
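
The draft-then-revise idea can be sketched as a simple two-stage loop. This is only an illustrative outline, not the framework's actual implementation: `caption_with_refinement`, `draft_fn`, `refine_fn`, and the toy stand-ins below are hypothetical names, and the "image" is a plain dictionary rather than real pixel data.

```python
from typing import Any, Callable


def caption_with_refinement(
    image: Any,
    draft_fn: Callable[[Any], str],
    refine_fn: Callable[[Any, str], str],
    rounds: int = 1,
) -> str:
    """Draft a caption, then revise it `rounds` times while re-inspecting the image."""
    caption = draft_fn(image)  # first pass: overall scene
    for _ in range(rounds):
        # later passes: look at the image again, conditioned on the current caption
        caption = refine_fn(image, caption)
    return caption


# Toy stand-ins (hypothetical): a real system would call a captioning model here.
def toy_draft(image: dict) -> str:
    return f"a {image['scene']}"


def toy_refine(image: dict, caption: str) -> str:
    detail = image["detail"]
    # Add the finer detail only if the caption does not already mention it.
    return caption if detail in caption else f"{caption} with {detail}"


toy_image = {"scene": "dog on a beach", "detail": "a red frisbee"}
result = caption_with_refinement(toy_image, toy_draft, toy_refine, rounds=2)
print(result)  # a dog on a beach with a red frisbee
```

Passing the drafting and refinement steps as callables keeps the sketch agnostic about which model performs each pass; in this toy setup the second round leaves the caption unchanged because the detail is already present.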