The MTP sweet spot moves as context fills: full-context benchmarks on Strix Halo
This article investigates how the optimal MTP (Multi-Token Prediction) draft depth shifts as context length increases on Strix Halo hardware. The author benchmarks decode performance at various context depths, finding that MTP acceptance rate remains constant but verification cost grows with context, causing the optimal depth to decrease. It also compares ROCm and Vulkan backends and highlights the significance of prefill time in long-context workloads.
A few weeks ago kmarble published one of the cleaner full-context benchmarks I've seen on this hardware: three Qwen3 models, two backends, empty context versus 76k tokens, on a Minisforum MS-S1 Max, the same Ryzen AI MAX+ 395 / gfx1151 silicon I run. The headline was sharp. At empty context, ROCm decodes the 35B MoE at 46 tok/s, about 1.4× Vulkan's 32.7. Fill the KV cache to 76k tokens and ROCm collapses to 16.6 tok/s, a 64% drop. Vulkan, on the same machine, barely flinches: 32.7 → 28.9, down just 12%. And the fix kmarble landed on: turn on MTP. ROCm+MTP claws most of the loss back, to 37.5 tok/s, and that became his production recommendation.
It's a good result. I can't use it.
My box — a Bosgame M5, same chip, different board — cannot run ROCm at all. That was the subject of an earlier post: Issue #6182 reproduces in every ROCm configuration on this board, pre-model-load, fourteen out of fourteen. Same silicon as kmarble's, opposite ROCm outcome, because that failure is board-specific, not chip-specific. (As of this writing, no fix has shipped — current ROCm is 7.2.4.) So kmarble's cure — ROCm+MTP — is off the table for me and for anyone on a #6182 board.
kmarble's own explanation for why ROCm collapses at depth — which I can't test without ROCm — was that the HIP backend pays a higher per-KV-access overhead, so once the cache is full it can't keep up with the bandwidth demand, while Vulkan's simpler shader path can. Whether that's the real mechanism or a gfx1151-specific ROCm regression is unsettled; either way it's his to explain, not mine to reproduce here.
The Vulkan path is a different matter. Mesa's RADV driver is OS-side and board-agnostic, so I expected my Vulkan numbers to track kmarble's closely. They mostly do — with one divergence I can't fully explain, which I'll get to. The more interesting result is something kmarble's setup couldn't show, because he ran a single fixed draft depth (n=2): the MTP sweet spot I found last week doesn't survive full context. It moves.
I ran the Vulkan side across four context depths. Here's what reproduced, what didn't, and why your draft depth needs to change as your context grows.
Setup
Same two models as the MTP post: Qwen3.6-27B (dense) and Qwen3.6-35B-A3B (MoE), both UD-Q5_K_XL, on llama.cpp b9295, Vulkan/RADV, the Bosgame M5 (gfx1151, 96 GB unified VRAM). Four context depths — empty, 32k, 64k, 76k — at non-MTP and at the MTP sweet spot from last week (n=3 dense, n=2 MoE), plus a draft-depth sweep at the deep contexts. Decode measured over 600 output tokens, median of runs 2-N with run 1 dropped as warmup.
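The aggregation rule above (drop run 1 as warmup, take the median of the rest) is easy to pin down in a few lines. This is a minimal sketch; the function name and the sample readings are illustrative, not measurements from this post.

```python
from statistics import median

def steady_state_tps(runs_tps):
    """Median decode t/s over runs 2..N; run 1 is treated as warmup."""
    if len(runs_tps) < 2:
        raise ValueError("need at least one run after the warmup run")
    return median(runs_tps[1:])

# Hypothetical readings: the slow first run does not affect the result.
sample = [31.0, 42.5, 42.9, 42.7]
print(steady_state_tps(sample))
```

The median is used rather than the mean so that a single stalled run after warmup does not drag the reported figure.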
Three honest deltas from kmarble up front, because they mean you should read the shape of my curves against his, not the absolute numbers:
Build: kmarble ran b9188 (the week MTP merged). I'm on b9295, which folds in the MTP optimizations I tracked across the last post.
Quant: kmarble used Q8. I kept UD-Q5_K_XL for continuity with my series — lighter, faster, possibly different acceptance.
Output length: I measured 600 tokens for steady-state decode; kmarble ran 5000, closer to a realistic agentic output window. Decode t/s is comparable; wall-times are not.
Same chip, different board, newer build, lighter quant. Where the pattern reproduces through all that, it's robust. Where it diverges, the divergence is the finding.
One practical note before the data: full-context benching is dominated by prefill, and naively it doesn't fit a day. The thing that made this tractable is llama.cpp's prompt-cache reuse (--cache-reuse 256 --slot-prompt-similarity 0.55). With it, the first run at each depth pays the full prefill once; subsequent identical-prefix runs re-prefill only the unique instruction tail — a 99.9% reduction. Per-request n_max override, for the record, does not work in b9295: draft depth is a server-start flag, so each n_max change required its own server restart.
The decode-drop curve
Decode throughput vs context depth, Vulkan, both models, MTP and non-MTP. Bosgame M5, gfx1151, b9295.
The core reproduction first: does Vulkan's context-stability hold on a different board?
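| Model / config | empty (t/s) | 32k (t/s) | 64k (t/s) | 76k (t/s) | drop (empty→76k) |
|---|---|---|---|---|---|
| 35B-A3B MoE, non-MTP | 54.22 | 46.07 | 40.47 | 38.59 | −29% |
| 35B-A3B MoE, MTP n=2 | 59.75 | 49.18 | 44.18 | 42.73 | −28% |
| 27B Dense, non-MTP | 10.33 | 9.39 | 8.62 | 8.36 | −19% |
| 27B Dense, MTP n=3 | 16.52 | 14.51 | 13.73 | 12.69 | −23% |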
Yes and no. The drop reproduces — decode does degrade with context, and MTP stays ahead of non-MTP at every depth on both models. But the magnitudes split:
MTP drop matches kmarble closely. My 35B MTP falls 28% (his −27%); my Dense MTP falls 23% (his −20%). Within one to three points.
Non-MTP drops about twice as steeply on my board. My 35B non-MTP falls 29%, where kmarble's Vulkan non-MTP fell only 12%. My Dense non-MTP falls 19% to his 9%.
I can't isolate why. Three things differ at once — board (Bosgame M5 vs MS-S1 Max), build (b9295 vs b9188), quant (Q5 vs Q8) — and I don't have the cross-product to attribute the steeper non-MTP slope to any one of them. My honest guess is the lighter quant: Q5's smaller weights make decode more memory-bandwidth-limited than Q8 (it starts at 54 t/s versus his 33), so as the KV cache grows, cache traffic competes more directly with weight-loading for the same bandwidth — and bites harder. That's a hypothesis, not a measurement. If you're on a different board and see a different slope, that's a data point worth comparing. "Vulkan holds at full context" is true directionally on my board; "Vulkan barely moves" — kmarble's stronger claim — does not reproduce here for non-MTP.
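For anyone comparing their own board against these slopes, the drop column is just the relative change between the empty and 76k readings. A small sketch using the numbers from the table above; the helper name is mine.

```python
def drop_pct(empty_tps, deep_tps):
    """Percent change in decode t/s from empty context to a deep context."""
    return (deep_tps / empty_tps - 1.0) * 100.0

# (empty, 76k) decode t/s from the decode-drop table
rows = {
    "MoE non-MTP": (54.22, 38.59),
    "MoE MTP n=2": (59.75, 42.73),
    "Dense non-MTP": (10.33, 8.36),
    "Dense MTP n=3": (16.52, 12.69),
}
for name, (empty, deep) in rows.items():
    print(f"{name}: {drop_pct(empty, deep):+.0f}%")
```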
Acceptance is not the failure mode
The obvious explanation for a decode-drop at depth would be that the MTP draft head gets worse at predicting tokens as the context fills, so fewer drafts are accepted and speculation buys less. That's the intuitive story. It's wrong here.
MTP acceptance rate vs context depth.
| Model | empty | 32k | 64k | 76k | Δ |
|---|---|---|---|---|---|
| MoE n=2 | 52.0% | 49.9% | 50.2% | 50.7% | flat |
| Dense n=3 | 55.7% | 55.6% | 57.7% | 56.1% | flat |
Acceptance is flat across all four depths. The draft head accepts the same fraction of its proposals at 76k as it does at empty context. Whatever is dragging decode down, it is not the draft head losing its grip on the sequence.
That points the finger at the verification side. Each decode step — speculative or not — has to attend over the entire KV cache, and at 76k that's a lot of memory traffic per token. The cost scales with depth regardless of whether you're decoding one token at a time or verifying a speculative batch. This matters for what comes next, because it means the benefit of MTP (acceptance) stays constant with context while the cost of MTP (verifying a batch against an ever-larger cache) grows. When a constant benefit meets a rising cost, the optimum moves.
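The "constant benefit, rising cost" argument can be made concrete with a toy model. This is a sketch under loud assumptions, not a fit to my data: it assumes each drafted token is accepted independently with probability p and that drafting stops at the first rejection, it ignores the cost of producing the drafts, and the `batch_cost` values (extra verify cost per drafted token, relative to a plain one-token step) are made up to stand in for "small cache" and "huge cache".

```python
def expected_tokens(p, n):
    """Tokens emitted per verify step: accepted drafts plus the one token
    the verifier always produces (toy model, independent acceptance)."""
    return sum(p**k for k in range(n + 1))

def relative_throughput(p, n, batch_cost):
    """Throughput relative to no speculation. A plain step costs 1 unit;
    each drafted token adds batch_cost units to the verify step."""
    return expected_tokens(p, n) / (1.0 + batch_cost * n)

def best_depth(p, batch_cost, depths=range(0, 5)):
    return max(depths, key=lambda n: relative_throughput(p, n, batch_cost))

p = 0.5  # roughly the MoE acceptance rate above
for label, cost in [("cheap verify", 0.1), ("expensive verify", 0.3)]:
    curve = {n: round(relative_throughput(p, n, cost), 2) for n in range(5)}
    print(label, "best n =", best_depth(p, cost), curve)
```

With acceptance held fixed, raising only the per-token verify cost moves the toy optimum from n=2 to n=1 and pushes n=4 below the no-speculation baseline. That is the same direction as the measurements, though the model says nothing about where the real crossover sits.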
(One side note: my acceptance rates — ~50% on the MoE, ~56% on the dense — sit well below kmarble's reported figures (88-100% on short prompts, 76-78% at full context). That's likely the Q5 quant producing weaker drafts than his Q8, or different prompt content. It doesn't change the depth story — flat is flat — but it's the reason over-drafting bites so hard below, and worth flagging.)
The sweet spot walks with depth
This is the part kmarble's benchmark couldn't surface, because he ran a single fixed n=2 everywhere. Last week I found the sweet spots were n=2 for the MoE and n=3 for the dense model, measured at modest context. They don't hold at 76k.
Draft-depth sweep at empty vs 76k. Does the peak move?
35B-A3B MoE — the peak shifts down:
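| ctx | n=0 (t/s) | n=1 (t/s) | n=2 (t/s) | n=4 (t/s) | winner |
|---|---|---|---|---|---|
| empty | 54.22 | n/a | 59.75 | n/a | n=2 |
| 32k | 46.07 | n/a | 49.18 | n/a | n=2 |
| 64k | 40.47 | 46.72 | 44.18 | 35.24 | n=1 |
| 76k | 38.59 | 44.94 | 42.73 | 33.21 | n=1 |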
At empty context, n=2 wins, exactly as last week. By 64k, n=1 has overtaken it (46.7 vs 44.2 t/s), and it keeps a lead of about 5% at 76k (44.9 vs 42.7). Look at the n=4 column: at 64k and 76k, drafting four tokens deep produces 35.2 and 33.2 t/s, below the 40.5 and 38.6 you get with no speculation at all. When acceptance is ~50% and every verify step is expensive because the cache is huge, proposing four tokens to get two accepted is a net loss. You paid to verify a five-token batch and threw half of it away, at full-context prices. n=1 wins at depth because it gambles the least: propose one token, and you're rarely far wrong.
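To read the deep-context rows the way the paragraph above does, divide each column by the n=0 baseline: anything under 1.0 means speculation is costing you throughput. A short sketch over the 64k and 76k MoE rows from the table; the helper names are mine.

```python
# Decode t/s by draft depth, 35B-A3B MoE, deep contexts only.
moe = {
    "64k": {0: 40.47, 1: 46.72, 2: 44.18, 4: 35.24},
    "76k": {0: 38.59, 1: 44.94, 2: 42.73, 4: 33.21},
}

def speedups(row):
    """Throughput of each draft depth relative to no speculation (n=0)."""
    return {n: tps / row[0] for n, tps in row.items()}

def winner(row):
    """Draft depth with the highest decode t/s in this row."""
    return max(row, key=row.get)

for ctx, row in moe.items():
    rel = ", ".join(f"n={n}: {s:.2f}x" for n, s in speedups(row).items())
    print(f"{ctx}: winner n={winner(row)} ({rel})")
```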
27B Dense — the peak flattens:
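| ctx | n=0 (t/s) | n=1 (t/s) | n=3 (t/s) | n=4 (t/s) | winner |
|---|---|---|---|---|---|
| empty | 10.33 | n/a | 16.52 | n/a | n=3 |
| 32k | 9.39 | n/a | 14.51 | n/a | n=3 |
| 64k | 8.62 | 12.59 | 13.73 | 12.35 | n=3 |
| 76k | 8.36 | 12.25 | 12.69 | 12.31 | plateau |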
The dense model is gentler about it. n=3 stays nominally best, but by 76k the curve has flattened so much that n=1, n=3, and n=4 all land within a few percent of each other (a 3.6% spread top to bottom). The sharp empty-context peak at n=3 has melted into a plateau — the depth you pick barely matters anymore, as long as you're not at n=0.
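The plateau claim rests on one number: the spread between the best and worst speculative depth. A small sketch of that arithmetic, measuring the spread relative to the slowest depth (my choice of denominator), using the dense rows from the table.

```python
def spread_pct(tps_values):
    """Gap between the best and worst reading, as a percent of the worst."""
    values = list(tps_values)
    return (max(values) / min(values) - 1.0) * 100.0

# 27B Dense decode t/s at the speculative depths (n=0 excluded)
dense_64k = {1: 12.59, 3: 13.73, 4: 12.35}
dense_76k = {1: 12.25, 3: 12.69, 4: 12.31}

print(f"64k spread: {spread_pct(dense_64k.values()):.1f}%")
print(f"76k spread: {spread_pct(dense_76k.values()):.1f}%")
```

The same calculation on the 64k row gives a much wider spread, which is why I call 76k a plateau and 64k still a peak.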
Put together: the trap from last week was over-deep drafts at any context. The finding this week is that the trap gets worse with context, and the safe depth moves down as you fill the cache. For long-context workloads on this hardware, the right move is to back off draft depth — n=1 on the MoE, anything in the n=1-to-n=3 plateau on the dense model. Tuning MTP once and assuming it travels across context lengths is exactly the assumption that breaks here.
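Because draft depth is fixed at server start on b9295, the practical form of this advice is choosing a depth from the context you expect before launching. A hedged sketch: the lookup holds only the winners measured above, the function name is hypothetical, "32k/64k/76k" are taken as round thousands of tokens, and rounding up to the next measured depth is my conservative assumption, since the crossover between 32k and 64k was not measured.

```python
import bisect

# Best measured draft depth per context depth, from the two sweeps above.
# Dense at 76k is a plateau; n=3 is kept as the nominal best.
MEASURED_BEST = {
    "moe":   [(0, 2), (32_000, 2), (64_000, 1), (76_000, 1)],
    "dense": [(0, 3), (32_000, 3), (64_000, 3), (76_000, 3)],
}

def pick_draft_depth(model, expected_ctx_tokens):
    """Winner at the smallest measured context >= the expected one."""
    table = MEASURED_BEST[model]
    depths = [ctx for ctx, _ in table]
    i = bisect.bisect_left(depths, expected_ctx_tokens)
    i = min(i, len(table) - 1)  # past 76k: reuse the deepest measurement
    return table[i][1]

print(pick_draft_depth("moe", 10_000))   # short sessions
print(pick_draft_depth("moe", 40_000))   # between measured points
```

Rounding up means an unmeasured 40k workload gets the 64k answer (n=1 on the MoE), which errs toward the shallower draft.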
Prefill is the real cost of long context
Decode gets the headline, but it's not where the wall-clock goes. As context grows, prefill throughput falls and the absolute prefill time balloons:
Prefill throughput vs context depth.
| Depth | MoE PP (t/s) | MoE prefill time | Dense PP (t/s) | Dense prefill time |
|---|---|---|---|---|
| empty | 115.4 | n/a | 30.5 | n/a |
| 32k | 90.2 | ~6 min | 29.3 | ~18 min |
| 64k | 76.1 | ~14 min | 25.8 | ~41 min |
| 76k | 73.1 | ~17 min | 24.7 | ~51 min |
At 76k, the first message of a session costs roughly 17 minutes of prefill on the MoE and a brutal 51 minutes on the dense model before the first output token appears. That sounds disqualifying, and for one kind of workload it is. But it depends entirely on how you use the box:
Interactive long-context chat — paste a big document, then converse: you pay prefill once per session. With prompt-cache reuse on, every follow-up turn re-prefills only the new tokens — about 60 milliseconds in my tests. The 17-minute ingest is a one-time tax, then it's fast. Usable.
Agentic batch work — many independent requests, each with large fresh context: you pay full prefill on every call. Here decode t/s is almost irrelevant; prefill throughput is the number that decides whether the workload is viable at all. At ~25 t/s prefill, the dense model is simply not a long-context batch engine, and the MoE is marginal.
So the decode-drop story — kmarble's title, and the first half of this post — is real but secondary for a lot of real usage. If you're running long-context agents, the question isn't "does decode hold at 76k." It's "can I afford to prefill 76k as often as my workload demands." On this hardware the answer is "yes for interactive, no for high-throughput batch."
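The interactive-versus-batch split comes down to how often you divide context size by prefill throughput. A back-of-envelope sketch using the 76k figures above; it assumes "76k" means 76,000 tokens, a 600-token output as in my runs, and the best MTP decode rates from the sweeps. The function names are mine.

```python
def prefill_minutes(ctx_tokens, pp_tps):
    """Time to ingest a fresh context at a given prefill throughput."""
    return ctx_tokens / pp_tps / 60.0

def requests_per_hour(ctx_tokens, pp_tps, out_tokens, decode_tps):
    """Fresh-context batch work: every request pays full prefill plus decode."""
    seconds = ctx_tokens / pp_tps + out_tokens / decode_tps
    return 3600.0 / seconds

CTX = 76_000  # assumption: "76k" taken as 76,000 tokens
print(f"MoE prefill:   {prefill_minutes(CTX, 73.1):.0f} min")
print(f"Dense prefill: {prefill_minutes(CTX, 24.7):.0f} min")
print(f"MoE batch:   {requests_per_hour(CTX, 73.1, 600, 44.94):.1f} req/h")
print(f"Dense batch: {requests_per_hour(CTX, 24.7, 600, 12.69):.1f} req/h")
```

In both cases decode accounts for under 2% of the per-request time, which is the sense in which decode t/s is almost irrelevant for fresh-context batch work.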
What the wrong board actually costs
Here's the comparison kmarble's data and mine make possible — and it didn't come out the way I expected. I went in assuming Vulkan-only was a handicap I'd have to quantify. Best path against best path at 76k on the 35B (his rows: Q8, b9188 — mine: Q5, b9295):
| Path | 35B @ 76k (t/s) | Available on my board? |
|---|---|---|
| kmarble ROCm+MTP | 37.5 | No (#6182) |
| kmarble Vulkan+MTP | 34.3 | n/a (his hardware) |
| My Vulkan + MTP n=1 (best path) | 44.9 | Yes |
| My Vulkan + MTP n=2 (last week's sweet spot) | 42.7 | Yes |
| My Vulkan non-MTP | 38.6 | Yes |
On paper my best available path at 76k (Vulkan, MTP backed off to n=1, 44.9 t/s) lands above kmarble's unreachable ROCm+MTP number (37.5). I'm not going to wave that around as "Vulkan beats ROCm," because it isn't — the comparison is contaminated three ways, all in my favor. The biggest one is quant: Q5 versus Q8 alone is worth a double-digit throughput margin on bandwidth-bound decode, plausibly enough to account for most or all of the gap on its own. Add a newer bui
[truncated for AI cost control]