Chain-of-Spot: Interactive Reasoning Improves
Large Vision-Language Models
Zuyan Liu* Yuhao Dong* Yongming Rao Jie Zhou Jiwen Lu
Tsinghua University Tencent
* Equal Contribution
[Paper (arXiv)] [Code (GitHub)]
Chain-of-Spot (CoS) is a novel approach that enhances feature extraction by focusing on the key regions of interest (ROI) in the image that correspond to the posed question or instruction. This technique allows VLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features.
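The two-step procedure (see Fig. 2 below) is easy to sketch in code. The following is a minimal, hypothetical Python sketch: vlm_generate(image, prompt) is a stand-in for any LLaVA-style model's generation call, and the prompt wording and the [x0, y0, x1, y1] normalized-coordinate format are illustrative assumptions, not the authors' exact prompts.

import re
from PIL import Image

def chain_of_spot(image: Image.Image, question: str, vlm_generate) -> str:
    # Instruction 1: ask the model where to look, conditioned on the question.
    # (Hypothetical prompt wording; the paper trains the model for this step.)
    roi_prompt = (
        f"{question}\n"
        "To answer the question, where is the region of interest in the image? "
        "Reply with normalized coordinates [x0, y0, x1, y1]."
    )
    roi_text = vlm_generate(image, roi_prompt)

    # Parse four floats in [0, 1]; fall back to the full image on failure.
    nums = re.findall(r"\d*\.?\d+", roi_text)
    if len(nums) >= 4:
        x0, y0, x1, y1 = (max(0.0, min(1.0, float(v))) for v in nums[:4])
    else:
        x0, y0, x1, y1 = 0.0, 0.0, 1.0, 1.0

    # Crop the ROI at the original resolution, so the second pass sees
    # finer-grained detail (the "multi-granularity" features).
    w, h = image.size
    left, right = sorted((int(x0 * w), int(x1 * w)))
    top, bottom = sorted((int(y0 * h), int(y1 * h)))
    roi = image.crop((left, top, right, bottom))

    # Instruction 2: ask the question again with the cropped view. A full
    # implementation would feed both the original and cropped image tokens.
    answer_prompt = f"{question}\nFocus on the highlighted region to answer."
    return vlm_generate(roi, answer_prompt)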
Fig. 1: Chain-of-Spot encourages Large Vision-Language Models to identify the region of interest (ROI) in the image conditioned on the question and to reason in an interactive manner, thereby improving visual understanding.
Fig. 2: Procedure of Chain-of-Spot. We combine the original visual tokens with the question in Instruction 1, asking the LLM to generate the region of interest (ROI) in the image. Then we feed the cropped image together with the question again to generate a better response.
Fig. 3: Visualizations of Chain-of-Spot. Chain-of-Spot identifies reasonable regions of interest conditioned on the given questions.
Fig. 4: Generation comparisons after implementing Chain-of-Spot. Chain-of-Spot corrects the focus and the answers of the LLaVA model on complex visual question-answering cases.
@article{liu2024chain,
title={Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models},
author={Liu, Zuyan and Dong, Yuhao and Rao, Yongming and Zhou, Jie and Lu, Jiwen},
journal={arXiv preprint arXiv:2403.12966},
year={2024}
}