Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Zuyan Liu Benlin Liu Jiahui Wang Yuhao Dong

Guangyi Chen Yongming Rao Ranjay Krishna Jiwen Lu

Tsinghua University University of Washington Carnegie Mellon University

Mohamed bin Zayed University of Artificial Intelligence Tencent Allen Institute for AI

Instruction encoding accounts for most of the theoretical computation cost, yet its actual latency is negligible. This underscores that not only the model weights but also the KV cache used during output generation can become a significant bottleneck.
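To make the memory pressure concrete, here is a back-of-the-envelope sketch of how KV cache size grows with sequence length. The default configuration (32 layers, 32 heads, head dimension 128, 2-byte fp16 entries) is an assumption that roughly matches a 7B LLaMA-style decoder, and `kv_cache_bytes` is a hypothetical helper, not part of any released code.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,      # assumed 7B-style decoder depth
                   num_heads: int = 32,       # assumed number of attention heads
                   head_dim: int = 128,       # assumed per-head dimension
                   bytes_per_value: int = 2,  # fp16
                   batch_size: int = 1) -> int:
    """Bytes needed to store keys and values for `seq_len` cached tokens."""
    # Factor of 2: one key vector and one value vector per token, layer, and head.
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    for n in (1, 2048):
        print(f"{n} tokens -> {kv_cache_bytes(n) / 2**20:.1f} MiB")
```

Under these assumptions each cached token costs 0.5 MiB, so a 2048-token context already occupies 1 GiB per sequence, and the cost scales linearly with batch size.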

We propose Elastic Cache, which applies cache merging guided by the importance scores of instruction tokens, complemented by a fixed-point elimination strategy in the output generation phase. These designs yield significant inference acceleration while maintaining generation quality.
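The two ideas can be illustrated with a minimal pure-Python sketch. It is one possible reading, not the authors' implementation, and it rests on several assumptions: importance scores are given as one number per token, the most important tokens are kept as anchors, each dropped token is merged into its nearest anchor by averaging, and generation-time eviction removes the entry at a caller-chosen fixed index. The names `merge_cache` and `evict_fixed_point` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def merge_cache(keys: Sequence[Vector],
                values: Sequence[Vector],
                importance: Sequence[float],
                budget: float) -> Tuple[List[Vector], List[Vector], List[int]]:
    """Shrink a per-token KV cache to roughly `budget` * len(keys) entries.

    Assumption: the highest-scoring tokens become anchors, and every other
    token is averaged into the anchor closest to it in position.
    """
    n = len(keys)
    n_keep = max(1, int(n * budget))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    anchors = sorted(ranked[:n_keep])  # restore positional order

    # Assign every position (anchors included) to its nearest anchor.
    groups: Dict[int, List[int]] = {a: [] for a in anchors}
    for i in range(n):
        nearest = min(anchors, key=lambda a: abs(a - i))
        groups[nearest].append(i)

    def mean(vectors: Sequence[Vector], idx: List[int]) -> Vector:
        dim = len(vectors[0])
        return [sum(vectors[i][d] for i in idx) / len(idx) for d in range(dim)]

    merged_keys = [mean(keys, groups[a]) for a in anchors]
    merged_values = [mean(values, groups[a]) for a in anchors]
    return merged_keys, merged_values, anchors


def evict_fixed_point(cache: list, max_len: int, position: int) -> list:
    """Assumed generation-time policy: while the cache exceeds `max_len`,
    drop the entry at a fixed index `position`."""
    while len(cache) > max_len:
        del cache[position]
    return cache


if __name__ == "__main__":
    k = [[1.0], [3.0], [5.0], [7.0]]
    v = [[10.0], [30.0], [50.0], [70.0]]
    scores = [0.9, 0.1, 0.2, 0.8]
    print(merge_cache(k, v, scores, budget=0.5))
```

Averaging keeps some information from the dropped tokens instead of discarding it outright, which is the intuition behind merging rather than pruning. The averaging rule and the nearest-anchor assignment are illustrative choices.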

Quantitative and Qualitative Results

We conduct extensive experiments to demonstrate the effectiveness and efficiency of Elastic Cache. We use two mainstream LVLMs, LLaVA-1.5 and Qwen-VL, as our backbones and apply Elastic Cache to instruction-following chat generation datasets.

To analyze our method's performance in real-world text generation scenarios, we evaluate the robustness of our caching strategy on in-the-wild data under constrained conditions.

Fig. 1: Results on visual instruction-following tasks. We evaluate Elastic Cache together with baselines on PPL (lower is better) and ROUGE (higher is better). We test LLaVA-1.5 at different sizes (a), (b) and Qwen-VL-7B (c) on visual tasks. Elastic Cache consistently outperforms the baselines.

Fig. 2: Generations on an image recognition question. We fix the KV cache budget at 0.5. The Local and H2O cache pruning methods fail to generate reasonable results under this setting, while Elastic Cache maintains its generation ability and produces a detailed and correct description of the image.

BibTeX

@article{liu2024elastic,
  title={Efficient Inference of Vision Instruction-Following Models with Elastic Cache},
  author={Liu, Zuyan and Liu, Benlin and Wang, Jiahui and Dong, Yuhao and Chen, Guangyi and Rao, Yongming and Krishna, Ranjay and Lu, Jiwen},
  journal={arXiv preprint arXiv:2407.18121},
  year={2024}
}
