Abstract
Vision-Language Models (VLMs) have demonstrated remarkable performance across a wide range of computer vision tasks that require understanding the overall semantics of an image, such as image captioning and visual question answering. However, they still struggle with region-level scene understanding, which involves distinguishing and interpreting specific objects or regions within an image. To address this limitation, recent studies construct large-scale datasets of region annotations, such as bounding boxes and segmentation masks, paired with textual descriptions, and fine-tune VLMs on these datasets. Such approaches not only incur substantial costs for dataset construction and model training, but may also degrade the generalization capability that VLMs acquired through pretraining. In this paper, we propose a method that enables region-level scene understanding by interpreting the internal representations of VLMs, without additional dataset construction or fine-tuning.