SGLang: Prefix Prompt Cache Design

Johney Zheng, Nov 19, 2024



Paper

Motivation

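In multi-turn conversations, prompts share a large number of identical prefixes, so reusing the prefix prompt cache becomes important. The design has to cover efficient prefix search, reuse, insertion, and eviction, as well as the corresponding cache-aware scheduling.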

Key Points

SGLang stores the cache in a prefix tree. Basic characteristics:
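
To make the prefix search, insertion, and eviction operations concrete, here is a minimal sketch of a prefix tree over token IDs. It rests on several assumptions that are not taken from SGLang's code: one token per node (a real radix tree would compress runs of tokens into a single edge), least-recently-used eviction restricted to leaf nodes, and a logical clock instead of wall time. The names `TrieNode`, `PrefixCache`, `match_prefix`, `insert`, and `evict` are hypothetical.

```python
import itertools
from typing import Dict, List, Optional


class TrieNode:
    """One node per token (assumption; a radix tree would merge token runs)."""

    def __init__(self, parent: Optional["TrieNode"] = None, token: Optional[int] = None):
        self.parent = parent
        self.token = token
        self.children: Dict[int, "TrieNode"] = {}
        self.last_access = 0  # logical timestamp used for LRU eviction


class PrefixCache:
    """Hypothetical prefix cache; illustrates the idea, not SGLang's API."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self._clock = itertools.count(1)
        self.size = 0  # number of cached tokens (nodes, excluding the root)

    def match_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        now = next(self._clock)
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_access = now  # touching a node keeps it "recent"
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> int:
        """Cache `tokens`; return how many tokens were newly added."""
        now = next(self._clock)
        node, added = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = TrieNode(node, tok)
                node.children[tok] = child
                added += 1
            child.last_access = now
            node = child
        self.size += added
        return added

    def evict(self, num_tokens: int) -> int:
        """Evict up to `num_tokens` tokens, least recently used leaves first."""
        evicted = 0
        while evicted < num_tokens and self.size > 0:
            # Rescanning all leaves is O(n) per evicted token; fine for a sketch.
            victim = min(self._leaves(), key=lambda n: n.last_access)
            del victim.parent.children[victim.token]
            self.size -= 1
            evicted += 1
        return evicted

    def _leaves(self) -> List[TrieNode]:
        stack, leaves = [self.root], []
        while stack:
            node = stack.pop()
            if node is not self.root and not node.children:
                leaves.append(node)
            stack.extend(node.children.values())
        return leaves
```

Evicting only leaves keeps every remaining node reachable from the root, so a shared prefix survives for as long as any longer sequence that extends it is still cached.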

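In addition, the scheduling policy gives the highest priority to the matched prefix length: requests with a longer cached prefix are scheduled first.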
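
The scheduling rule can be sketched as a sort over the waiting queue. This is an illustration under assumptions, not SGLang's scheduler: the cache is modeled as a plain list of cached token sequences, ties fall back to arrival order, and `matched_prefix_len` and `schedule` are hypothetical names.

```python
from typing import List, Sequence


def matched_prefix_len(request: Sequence[int], cached: Sequence[Sequence[int]]) -> int:
    """Longest prefix of `request` shared with any cached token sequence."""
    best = 0
    for seq in cached:
        n = 0
        for a, b in zip(request, seq):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best


def schedule(waiting: List[List[int]], cached: Sequence[Sequence[int]]) -> List[List[int]]:
    """Order waiting requests by matched prefix length, longest first.

    sorted() is stable even with reverse=True, so requests with equal
    matched lengths keep their arrival order (an assumed tie-break).
    """
    return sorted(waiting, key=lambda req: matched_prefix_len(req, cached), reverse=True)
```

For example, with `[1, 2, 3, 4]` cached, a request `[1, 2, 3, 4, 5]` (4 matched tokens) is placed ahead of `[1, 2, 7]` (2 matched tokens), and both run before requests with no match.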

vLLM combines PagedAttention (PA) with a prefix cache: each KV block is identified by hash(prefix tokens + block tokens), and a hash map ties prefix-cache entries to physical blocks:
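
The hash-based mapping can be illustrated with a short sketch. It is written under assumptions and is not vLLM's implementation: the block size of 4 is arbitrary, Python's built-in `hash` on a tuple stands in for the real hash function, collisions are ignored, only full blocks are cached, and `block_hashes` and `BlockTable` are hypothetical names.

```python
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Hash every full block together with all tokens that precede it."""
    hashes = []
    for end in range(block_size, len(tokens) + 1, block_size):
        # tokens[:end] is "prefix tokens + block tokens" for this block.
        hashes.append(hash(tuple(tokens[:end])))
    return hashes


class BlockTable:
    """Hypothetical map from block hash to physical block id."""

    def __init__(self) -> None:
        self.hash_to_block: Dict[int, int] = {}
        self.next_block_id = 0

    def allocate(self, tokens: List[int]) -> List[int]:
        """Return physical block ids for the full blocks of `tokens`.

        A block whose hash is already known is reused; otherwise a new
        physical block id is assigned.
        """
        block_ids = []
        for h in block_hashes(tokens):
            if h not in self.hash_to_block:
                self.hash_to_block[h] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.hash_to_block[h])
        return block_ids
```

Because the prefix is part of the hash input, two blocks with identical tokens but different preceding context get different physical blocks, which is what makes reuse safe. Rehashing the whole prefix for every block is quadratic in sequence length; chaining the previous block's hash into the next one is a possible way to avoid that.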

Statistics

Performance gains under multi-call workloads:

This post is licensed under CC BY 4.0 by the author.