Speculative Sampling via Exponential Races

S. Kobus, Deniz Gündüz

2025 · DOI: 10.48550/arXiv.2504.15475
arXiv.org · 0 citations

TLDR

A surprising connection is established between speculative decoding and channel simulation, which aims to simulate a noisy channel using as few bits as possible. This connection is used to provide an information-theoretic analysis of the speed-up that speculative decoding can achieve.
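To make the channel-simulation connection more concrete, here is a minimal sketch (Python with NumPy, our choice) of an exponential race over a finite vocabulary. Each token x "arrives" at time E_x / p(x), with the E_x drawn i.i.d. from Exp(1), and the first arrival is an exact sample from p. Reusing the same E_x for a draft distribution q and a target distribution p gives two samples with the correct marginals that coincide often when q is close to p. No coupling can make them coincide with probability above 1 - TV(p, q). The helper names and toy distributions are hypothetical, and the sketch illustrates the general technique, not the paper's ERSD algorithm.

```python
import numpy as np

def exponential_race(probs, exp_noise):
    """Index of the first arrival when symbol x arrives at time E_x / p(x).

    If the E_x are i.i.d. Exp(1), the winner is distributed according to probs
    (equivalent to the Gumbel-max trick with G_x = -log E_x).
    """
    probs = np.asarray(probs, dtype=float)
    with np.errstate(divide="ignore"):
        arrival_times = exp_noise / probs  # p(x) = 0 means x never arrives (inf)
    return int(np.argmin(arrival_times))

def coupled_draft_target(q_draft, p_target, rng):
    """Hypothetical helper: draw a draft token and a target token from ONE shared race.

    Both samples have the correct marginals; they agree more often the closer
    q_draft is to p_target.
    """
    shared_noise = rng.exponential(scale=1.0, size=len(p_target))
    return exponential_race(q_draft, shared_noise), exponential_race(p_target, shared_noise)

# Usage on a toy 4-token vocabulary (made-up distributions).
rng = np.random.default_rng(0)
q = np.array([0.5, 0.2, 0.2, 0.1])  # draft model
p = np.array([0.4, 0.3, 0.2, 0.1])  # target model
samples = [coupled_draft_target(q, p, rng) for _ in range(20_000)]
agree_rate = np.mean([d == t for d, t in samples])
target_freq = np.bincount([t for _, t in samples], minlength=4) / len(samples)
print(f"agreement ~ {agree_rate:.3f}, upper bound 1 - TV = {1 - 0.5 * np.abs(p - q).sum():.3f}")
```

Because the draft and target share the same random variables, no extra communication is needed to decide which token both "see"; this is the kind of shared-randomness coupling that channel simulation relies on.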

Abstract

Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims to simulate a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative decoding method via exponential races (ERSD) that matches state-of-the-art performance.
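For context, the standard accept/reject rule used in speculative sampling can be sketched as follows (Python with NumPy; function names and distributions are hypothetical, and this is the common baseline, not ERSD). The draft proposes x ~ q, which is accepted with probability min(1, p(x)/q(x)); on rejection a replacement is drawn from the normalized residual max(0, p - q). The output is then exactly distributed as p, and the acceptance probability is alpha = sum_x min(p(x), q(x)) = 1 - TV(p, q). The helper `expected_tokens_per_call` uses a common simplified model in which each of the k drafted tokens is accepted independently with the same probability alpha. It is for illustration only and is not the speed-up relation derived in the paper.

```python
import numpy as np

def speculative_step(p_target, q_draft, rng):
    """One token of standard speculative sampling (accept/reject), not ERSD.

    Returns (token, accepted); the token is distributed exactly as p_target.
    """
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(q_draft, dtype=float)
    x = rng.choice(len(q), p=q)                # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
        return int(x), True
    residual = np.maximum(p - q, 0.0)          # on rejection, resample from the
    residual /= residual.sum()                 # normalized residual max(0, p - q)
    return int(rng.choice(len(p), p=residual)), False

def expected_tokens_per_call(alpha, k):
    """Expected tokens per target-model call with k drafted tokens, assuming
    (simplification) a constant, independent per-token acceptance rate alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Usage on a toy 4-token vocabulary (made-up distributions).
rng = np.random.default_rng(0)
q = np.array([0.5, 0.2, 0.2, 0.1])  # draft model
p = np.array([0.4, 0.3, 0.2, 0.1])  # target model
draws = [speculative_step(p, q, rng) for _ in range(20_000)]
freq = np.bincount([t for t, _ in draws], minlength=4) / len(draws)
accept_rate = np.mean([a for _, a in draws])
alpha = np.minimum(p, q).sum()  # = 1 - TV(p, q) = 0.9 for these toy values
print(freq, accept_rate, alpha, expected_tokens_per_call(alpha, k=4))
```

Under this simplified model the gain saturates at 1 / (1 - alpha) tokens per target call as k grows, which gives some intuition for why the benefit of drafting more tokens levels off for large k.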