2023 AcceleratingLlmInferencewithStagedSpeculativeDecoding

From GM-RKB

Subject Headings: LLM Inference, Staged Speculative Decoding, Small-Batch LLM Inference

Notes

Cited By

Quotes

Abstract

Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
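The abstract builds on standard speculative decoding, in which a cheap draft model proposes several tokens and the large target model verifies them in a single batched pass. A minimal sketch of that base technique is below, using toy stand-in "models" over a small vocabulary and greedy acceptance; the function and model names are hypothetical, and the paper's actual contributions (tree-structured draft batches and a second speculation stage, plus probabilistic acceptance) are not implemented here.

```python
import random

# Tiny vocabulary for the toy example.
VOCAB = list(range(8))

def toy_probs(context, seed):
    # Deterministic pseudo-distribution over VOCAB given a context of int tokens.
    rng = random.Random((hash(context) ^ seed) & 0xFFFFFFFF)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def draft_model(context):
    # Stand-in for a small, fast draft model.
    return toy_probs(context, seed=1)

def target_model(context):
    # Stand-in for the large target model; here it happens to agree with the
    # draft model, so every drafted token is accepted.
    return toy_probs(context, seed=1)

def speculative_step(context, k=4):
    """One round of greedy speculative decoding: the draft model proposes k
    tokens autoregressively; the target model then checks them (in a real
    system, one batched forward pass) and keeps the longest agreeing prefix,
    appending one corrected token at the first disagreement."""
    drafted = []
    ctx = list(context)
    for _ in range(k):
        probs = draft_model(tuple(ctx))
        tok = max(VOCAB, key=lambda t: probs[t])
        drafted.append(tok)
        ctx.append(tok)

    accepted = []
    ctx = list(context)
    for tok in drafted:
        probs = target_model(tuple(ctx))
        target_tok = max(VOCAB, key=lambda t: probs[t])
        if target_tok != tok:
            accepted.append(target_tok)  # target corrects and the round ends
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

print(speculative_step((0,)))
```

Because verification of all k drafted tokens happens in one batched target-model call, each round emits several tokens for roughly the cost of a single large-model step; the paper's tree-structured batch and second speculation stage raise the expected number of accepted tokens per round further.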

References

Benjamin Spector, Christopher Ré (2023). "Accelerating LLM Inference with Staged Speculative Decoding." doi:10.48550/arXiv.2308.04623