Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models


Zuyan Liu*   Yuhao Dong*   Yongming Rao   Jie Zhou   Jiwen Lu   

 Tsinghua University    Tencent   

* Equal Contribution

[Paper (arXiv)]      [Code (GitHub)]


Chain-of-Spot (CoS) is a novel approach that enhances feature extraction by focusing on the key regions of interest (ROI) within the image that correspond to the posed questions or instructions. This technique allows VLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. 
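As a rough sketch of this idea (not the released implementation), the helper below crops a normalized ROI box from an image and resizes the crop back to the original size, so the focused view can be encoded at the same resolution as the full image. The function name and the normalized-coordinate box format are assumptions made for illustration.

from PIL import Image

def crop_roi(image: Image.Image, roi_xyxy: tuple) -> Image.Image:
    """Crop a region of interest given as a normalized (x1, y1, x2, y2) box and
    resize the crop back to the original image size (coordinate format assumed),
    yielding a more detailed view at the same resolution as the full image."""
    w, h = image.size
    x1, y1, x2, y2 = roi_xyxy
    box = (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))
    return image.crop(box).resize((w, h))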

Fig. 1: Chain-of-Spot encourages Large Vision-Language Models to identify the region of interest (ROI) in the image conditioned on the question and to reason in an interactive manner, thereby improving their visual understanding. 


Approach

Fig. 2: Procedure of Chain-of-Spot. We combine the original visual tokens with the question in the first instruction and ask the LLM to generate the region of interest (ROI) in the image. We then feed the cropped ROI together with the question again to generate a better response. 
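For concreteness, a minimal two-turn inference sketch follows. The model.generate interface, the prompt wording, and the box output format are assumptions (the exact instruction templates are given in the paper and code), and crop_roi is the helper sketched above.

import re

def parse_box(text: str) -> tuple:
    """Pull the first four numbers out of the model's reply and treat them as a
    normalized [x1, y1, x2, y2] box (the exact output format is an assumption)."""
    x1, y1, x2, y2 = map(float, re.findall(r"-?\d*\.?\d+", text)[:4])
    return (x1, y1, x2, y2)

def chain_of_spot_answer(model, image, question: str) -> str:
    # Instruction 1: ask the model where the region of interest is,
    # conditioned on the question.
    roi_prompt = (f"To answer the question: '{question}', "
                  "where is the region of interest in the image?")
    roi_box = parse_box(model.generate(images=[image], prompt=roi_prompt))

    # Instruction 2: ask the same question again, now with the cropped ROI view
    # alongside the original image, to produce the final answer.
    roi_image = crop_roi(image, roi_box)
    return model.generate(images=[image, roi_image], prompt=question)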

Quantitative and Qualitative Results

We perform our main experiments on 11 widely used and challenging multimodal benchmarks, comparing against our baseline LLaVA-1.5 and other vision-language models to demonstrate the superiority of our method.

We further employ visualizations to illustrate how Chain-of-Spot identifies regions of interest pertinent to the query within images, which in turn improves overall performance on multimodal tasks.

Table 1: Comparisons with vision-language models on visual question answering datasets. Our Chain-of-Spot (CoS) consistently improves the vanilla LLaVA-1.5 on all benchmarks under different language model sizes. The best results are highlighted in bold and the second best are underlined.

Table 2: Comparisons with vision-language models on multimodal benchmarks. LLaVA-1.5 with Chain-of-Spot (CoS) achieves state-of-the-art performance on all multimodal benchmarks, surpassing previous LVLMs by a large margin. The best results are highlighted in bold and the second best are underlined.

Fig. 3: Visualizations of Chain-of-Spot. Chain-of-Spot identifies reasonable regions of interest conditioned on the given questions. 

Fig. 4: Generation comparisons after applying Chain-of-Spot. Chain-of-Spot corrects the focus and the answers of the LLaVA model on complex visual questions. 


BibTeX

@article{liu2024chain,
  title={Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models},
  author={Liu, Zuyan and Dong, Yuhao and Rao, Yongming and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2403.12966},
  year={2024}
}