EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction
EntMTP is a training-free scheduler that dynamically switches between tree-based attention topologies based on local generation entropy, speculating deeply in low-entropy regions and conservatively in high-entropy regions. It maximizes throughput without sacrificing quality, achieving a 1.15x speedup over Hydra and a peak 1.36x speedup over Medusa across the HumanEval, ShareGPT, GSM8K, and Litbench benchmarks.
[Submitted on 25 Jun 2026]
By Carrie Chen
Abstract: Multi-token prediction (MTP) has been shown to increase data density during training and improve downstream text-generation quality, and it serves as the de facto approach for self-speculative decoding. Existing foundation and open-source models that use MTP heads commit to a static tree-based attention topology throughout the entire generation sequence, meaning the speculation depth, and thus the compute required during verification, stays constant regardless of the context. This is fundamentally misaligned with the entropy patterns of natural language, where low-entropy regions often support reliable multi-step drafting while high-entropy regions require more conservative speculation. To address this, we propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that toggles between tree-based attention topologies, drawn from a set of task-specific Pareto-optimal trees, conditioned on a running estimate of local generation entropy. By matching speculation depth to context predictability, EntMTP maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing generation quality. When evaluated on the HumanEval, ShareGPT, GSM8K, and Litbench benchmarks, EntMTP consistently achieves a 1.15x speedup over the Hydra baseline and a peak speedup of 1.36x over the Medusa baseline.
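The scheduling idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the entropy estimator, the thresholds, or the tree set, so everything concrete here is an assumption made for illustration: the running estimate is an exponential moving average of per-step Shannon entropy, and the names `EntropyScheduler`, `token_entropy`, the tree labels, and the threshold values are hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyScheduler:
    # (upper entropy bound, tree label), sorted by bound.
    # Hypothetical values; the paper's actual tree set and thresholds are not given here.
    thresholds: Tuple[Tuple[float, str], ...] = ((0.5, "deep"), (1.5, "medium"))
    fallback: str = "shallow"        # used for high entropy or before any update
    decay: float = 0.8               # EMA decay (assumed estimator, not the paper's)
    running: Optional[float] = None  # running estimate of local entropy

    def update(self, probs):
        """Fold the entropy of the latest step into the running estimate."""
        h = token_entropy(probs)
        if self.running is None:
            self.running = h
        else:
            self.running = self.decay * self.running + (1.0 - self.decay) * h
        return self.running

    def pick_tree(self):
        """Return the label of the tree topology to use for the next draft step."""
        if self.running is None:
            return self.fallback
        for bound, label in self.thresholds:
            if self.running < bound:
                return label
        return self.fallback


sched = EntropyScheduler()
sched.update([0.97, 0.01, 0.01, 0.01])  # a confident, low-entropy step
print(sched.pick_tree())
```

In a real decoder, the label returned by `pick_tree` would select the attention mask and candidate layout used for the next draft-and-verify step; a deeper tree pays off only when the drafted tokens are likely to be accepted.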
Comments: 7 pages, 5 figures
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2606.27550 [cs.CL]
(or arXiv:2606.27550v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2606.27550
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Carrie Chen
[v1] Thu, 25 Jun 2026 20:54:27 UTC (1,173 KB)