ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Oct 1, 2024
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Abstract
In this paper, we present ZipVL, an efficient inference framework for large vision-language models (LVLMs) that addresses both computation and memory bottlenecks through a dynamic ratio allocation strategy for important tokens. This ratio is determined adaptively from the layer-specific distribution of attention scores rather than from fixed hyper-parameters, which improves efficiency on less complex tasks while maintaining high performance on more challenging ones. We then select important tokens based on their normalized attention scores and perform the attention mechanism solely on those tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we apply mixed-precision quantization to the KV cache: high-bit quantization is used for the caches of important tokens, while low-bit quantization is applied to those of less important tokens. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
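The two ideas in the abstract, an adaptive ratio of important tokens and a mixed-precision KV cache, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The abstract does not say how the ratio is derived from the attention distribution, so the sketch assumes one plausible rule: keep the smallest set of tokens whose normalized attention scores sum to at least a threshold `tau`. The function names, the value of `tau`, the 4-bit and 2-bit widths, and the per-token min/max uniform quantizer are all hypothetical choices made for illustration.

```python
from typing import List


def select_important_tokens(scores: List[float], tau: float = 0.95) -> List[int]:
    """Return indices of the smallest token set whose normalized scores reach tau.

    ASSUMPTION: the cumulative-mass rule and `tau` are illustrative only.
    A peaked score distribution yields few tokens; a flat one yields many,
    so the kept ratio adapts to the distribution instead of being fixed.
    """
    total = sum(scores)
    if total <= 0:
        return list(range(len(scores)))  # nothing to rank by: keep every token
    # Visit tokens from highest to lowest attention score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i] / total  # accumulate normalized score
        if mass >= tau:
            break
    return sorted(kept)


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Uniform asymmetric quantize-then-dequantize of one vector to `bits` bits.

    ASSUMPTION: per-vector min/max scaling, chosen for simplicity.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def compress_kv(cache: List[List[float]], important: List[int],
                high_bits: int = 4, low_bits: int = 2) -> List[List[float]]:
    """Quantize each token's cache row: more bits for important tokens.

    ASSUMPTION: the 4-bit / 2-bit split is a placeholder, not a reported setting.
    """
    keep = set(important)
    return [fake_quantize(row, high_bits if t in keep else low_bits)
            for t, row in enumerate(cache)]
```

With a peaked distribution such as `[0.5, 0.3, 0.1, 0.05, 0.05]` and `tau=0.75`, only the first two tokens are kept, whereas `tau=0.99` keeps all five. The kept fraction therefore follows the shape of the attention scores rather than a preset ratio, which is the behavior the abstract describes at a high level.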
Type
Publication
arXiv preprint arXiv:2410.08584