This paper is included in the Proceedings of the 23rd USENIX Conference on File and Storage Technologies.
Weijian Chen, Shuibing He, 6 authors, Gang Chen
0 citations
TLDR
The paper proposes IMPRESS, an importance-informed multi-tier prefix KV storage system that reduces I/O delay for LLM inference by loading only the important prefix KVs. It also introduces an I/O-efficient algorithm for identifying important KVs, which reduces time to first token (TTFT) during model inference.
