SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification

TLDR

This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference through speculative inference and token tree verification. It significantly reduces the end-to-end latency and computational requirements of serving generative LLMs while provably preserving model quality.