ONE MORE GLANCE WITH SHARP EYES:

Rethinking Lightweight Captioning as a Practical Visual Specialist

What is it?

Humans first take in the overall scene, then glance at specific regions to notice finer details. Our Sharp-Eyed Refinement framework mimics this human tendency, allowing the captioning specialist to take a second look at the image and revise its initial description.
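The draft-then-refine idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: `caption_with_refinement`, `draft_fn`, `refine_fn`, and the toy stand-ins below are hypothetical names introduced only to show the control flow, and a real system would plug in an actual captioning model for both steps.

```python
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not the paper's API):
#   draft_fn(image) -> str            produces a first-pass caption
#   refine_fn(image, caption) -> str  looks at the image again and revises the caption
DraftFn = Callable[[Any], str]
RefineFn = Callable[[Any, str], str]


def caption_with_refinement(
    image: Any,
    draft_fn: DraftFn,
    refine_fn: RefineFn,
    rounds: int = 1,
) -> str:
    """Generate a draft caption, then revise it `rounds` times."""
    caption = draft_fn(image)
    for _ in range(rounds):
        caption = refine_fn(image, caption)
    return caption


# Toy stand-ins so the sketch runs without any model.
def toy_draft(image: Any) -> str:
    return "a dog"


def toy_refine(image: Any, caption: str) -> str:
    # Pretend the second glance notices a detail the draft missed.
    detail = " on a red couch"
    return caption if detail in caption else caption + detail


result = caption_with_refinement(None, toy_draft, toy_refine)
print(result)
```

Passing the two stages in as callables keeps the sketch independent of any particular model; in practice the same lightweight captioner could plausibly serve both roles, but that is an assumption here rather than something this page states.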

[📖 arXiv] [💻 code]