2026-06-09 · Rewritten on-site · 6 min read · Updated: 2026-06-09

Saving Money on Inference

This article explores how prompt caching reduces inference costs in AI conversations, especially for agentic applications like reading assistants. By caching the KV values of repeated context, providers can avoid recomputing the same tokens, leading to 2.8x-3.3x cost savings in multi-turn dialogues. The article explains the technical details of KV caching and presents real pricing examples.

Source: Hacker News AI · Author: stonecharioteer


The repeated-prefix problem

A normal chat asks one question and gets one answer. A reading assistant does something more expensive: it keeps dragging the same context forward. Every follow-up needs the system prompt, book metadata, retrieved passages, reader state, previous answers, and whatever the user is referring to with words like “that” or “they”.

That means the expensive part of an agentic conversation is often not the new question. It is paying for the model to reread the same prefix over and over.

We care about this because Merrilin is still a passion project we are funding ourselves. That does not mean we want to make the product timid or ration the parts that make it useful. It means the opposite: if we can stop wasting money on repeated context, we can afford more reading sessions, more experiments, and better models where they actually matter.

The problem is the prefix, not the follow-up

Figure: each row lines up a chat turn with what the provider has to process, showing why a tiny follow-up can still carry a large hidden cost. Legend: fresh input without cache, cached-read prefix, new tail.

Prompt caching is the provider-level version of not recomputing that same prefix. The model still sees the full conversation, but the billing and compute path changes: repeated context becomes a cached read, and only the new tail of the prompt is treated as expensive fresh input.
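As a rough illustration of the prefix/tail split described above, the sketch below counts how much of a new prompt repeats the previous one. It is only a toy: `split_prompt` is a hypothetical helper, whitespace-split words stand in for real tokenizer output, and actual providers apply their own matching rules that this ignores.

```python
def split_prompt(previous_prompt, current_prompt):
    """Return (cached_prefix_len, new_tail_len) for two token lists."""
    shared = 0
    for old, new in zip(previous_prompt, current_prompt):
        if old != new:
            break  # the shared prefix ends at the first mismatch
        shared += 1
    return shared, len(current_prompt) - shared


# Toy "tokens": whitespace-split words, not real tokenizer output.
turn_1 = "SYSTEM you are a reading assistant BOOK chapter one text USER who is the narrator".split()
follow_up = "ASSISTANT the narrator is unnamed USER why does that matter".split()
turn_2 = turn_1 + follow_up  # the whole earlier prompt is carried forward

cached, fresh = split_prompt(turn_1, turn_2)
print(cached, fresh)  # 15 10
```

The second turn sends 25 tokens, but only the 10-token tail is new. Everything before it is a candidate for a cached read.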

What the model is caching

To see why that is possible, we need to look at what the model is actually caching.

Modern transformer architectures use what is known as attention to predict the next token based on a given input. Each layer of the transformer needs to compute three matrices, which are then used by the attention block.

For an input sequence packed into a matrix $X \in \mathbb{R}^{n \times d}$ (one row per token, each row a $d$-dimensional embedding), the layer learns three weight matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ and projects the input through each of them to produce $Q$, $K$, and $V$:

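In the illustration, each of the three matrices is $6 \times 6$: one row per token for six tokens, with $d_k = 6$.

$$Q = X \, W_Q, \qquad Q \in \mathbb{R}^{6 \times 6}$$

$$K = X \, W_K, \qquad K \in \mathbb{R}^{6 \times 6}$$

$$V = X \, W_V, \qquad V \in \mathbb{R}^{6 \times 6}$$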

Each row of Q, K, and V corresponds to one input token — the same row index across all three matrices represents the same token.

$Q$ (queries) is what the current token is “looking for”, $K$ (keys) is what each token “advertises” about itself, and $V$ (values) is the actual content that gets mixed into the output. Attention then combines them:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

That is the calculation for just one layer of a transformer, and LLMs have multiple layers. The output of one layer becomes the input $X$ of the next, and each layer has to compute its own Q/K/V matrices.
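To make the formula concrete, here is a minimal single-head sketch in plain Python. It follows the equation as written above, so it leaves out the causal mask that decoder-style LLMs add on top. The helper names (`matmul`, `softmax`, `attention`) and the tiny identity weights are assumptions chosen for readability, not anything from a real model.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V, with no mask."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)               # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three tokens with d = d_k = 2 and identity projections (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity)
print(len(output), len(output[0]))  # 3 2
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of V.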

Figure: input flows through all 32 layers to the output. A 32-layer transformer (only 3 layers shown; there are 29 more between L2 and L32). Each layer has its own attention block with its own K and V; the FFN sits between layers. Only K and V are cached. Cache cost grows as $2 \times L \times n \times d_k$; for a model like DeepSeek V4 Pro (61 layers) at long context, this is the dominant memory bottleneck of inference.
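The cache-size formula is easy to turn into a back-of-the-envelope calculator. In the sketch below, `kv_cache_bytes` is a hypothetical helper, and the inputs are illustrative assumptions rather than measurements: 32 layers, a 4,000-token prompt, $d_k = 1{,}024$, and 2 bytes per stored value (a 16-bit format).

```python
def kv_cache_bytes(layers, tokens, d_k, bytes_per_value=2):
    """Bytes to hold K and V for every layer: 2 * L * n * d_k values."""
    return 2 * layers * tokens * d_k * bytes_per_value


# Illustrative inputs only; real models differ in depth and head layout.
size = kv_cache_bytes(layers=32, tokens=4_000, d_k=1_024)
print(size / 2**20, "MiB")  # 500.0 MiB
```

The cost is linear in every factor, so doubling the context length doubles the memory for that one sequence.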

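We do not need to keep the Q matrices: each new token only uses its own query, but it attends to the keys and values of every earlier token. So we can save K and V for the tokens already processed (the K/V cache) and only compute K/V for the new tokens in the prompt.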

Figure: K and V for one layer with a cached prefix of 6 tokens plus 2 new tokens. Cached rows come from the cache with no compute; new rows are computed now ($W_K \cdot x$, $W_V \cdot x$). Cached rows skip both matmuls, and that saving applies per layer. For a 32-layer model with a 4k-token cached prefix, that's 256,000 matmuls skipped per generation step.
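The idea of projecting only the new rows can be sketched for a single layer as follows. `LayerKVCache` and `project` are hypothetical names, and the small integer weights are made up so the comparison is exact. The point is only that six cached rows plus two freshly projected rows match a from-scratch projection of all eight.

```python
def project(rows, weight):
    """Project each token row through a weight matrix (rows @ weight)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)] for row in rows]


class LayerKVCache:
    """K and V rows for one layer, extended as new tokens arrive."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.k, self.v = [], []

    def extend(self, new_rows):
        """Project only the new tokens, append them, and return how many were computed."""
        self.k.extend(project(new_rows, self.w_k))
        self.v.extend(project(new_rows, self.w_v))
        return len(new_rows)


# Eight toy token embeddings (d = 2) and made-up integer weights.
x = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, 2], [2, 1], [3, 1]]
w_k = [[1, 2], [3, 4]]
w_v = [[0, 1], [1, 0]]

cache = LayerKVCache(w_k, w_v)
cache.extend(x[:6])             # first turn: 6 tokens projected and stored
computed = cache.extend(x[6:])  # follow-up: only the 2 new tokens are projected
print(computed, cache.k == project(x, w_k))  # 2 True
```

In a causal decoder the same reasoning carries to deeper layers, because the layer input for a prefix token depends only on the tokens before it, so appending new tokens does not invalidate the cached rows.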

This is all great, but you might be wondering: why cache these values? Isn’t it just better to compute them?

Well, no. Compute is expensive, especially when every step redoes large matrix multiplications over the whole prefix, so the total work grows as $O(n^2)$ in the conversation length. For example, recomputing K for one layer of Llama 70B with, say, 20,000 input tokens means multiplying a 20,000 × 8,192 matrix (the input $X$) by an 8,192 × 1,024 weight matrix ($W_K$; Llama 70B uses Grouped Query Attention, which shrinks $d_{kv}$ from 8,192 down to 1,024). That single matmul is about 336 GFLOPs for K and another 336 for V, so call it ~672 GFLOPs per layer.

Stack 80 layers and you’re at roughly 54 TFLOPs of compute just to populate the KV state for one forward pass. On an H100 at peak BF16 (~990 TFLOPs/s), that’s around 55 ms. Doable once. But imagine doing that for each generation step of your chat. Say a turn adds about 500 tokens and each one recomputes the prefix from scratch: that is 500 × ~55 ms, or about 27 seconds of H100 time on K/V projections alone, before any attention math, FFN, or sampling has even happened. With a cache, you only have to compute the K/V values for the 500 new tokens, which takes roughly 1.3 TFLOPs in total (80 layers × 2 matmuls × 500 tokens × ~16.8 MFLOPs each). That’s about 1.35 ms on an H100, a ~20,000× reduction from the 27-second number above. And the cache survives to make the next turn cheap too.
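The arithmetic above can be checked in a few lines. The constants are the ones quoted in the text (80 layers, model width 8,192, $d_{kv}$ of 1,024, ~990 TFLOPs/s peak). The function names are hypothetical, and peak throughput is an optimistic assumption that real kernels rarely reach.

```python
LAYERS, D_MODEL, D_KV = 80, 8_192, 1_024
H100_PEAK_FLOPS = 990e12  # peak BF16 figure quoted in the text


def projection_flops(tokens, d_model=D_MODEL, d_kv=D_KV):
    """FLOPs for one (tokens x d_model) @ (d_model x d_kv) matmul, 2 per multiply-add."""
    return 2 * tokens * d_model * d_kv


def kv_flops(tokens):
    """K and V projections across all layers for `tokens` input tokens."""
    return LAYERS * 2 * projection_flops(tokens)


full = kv_flops(20_000)  # recompute the whole 20,000-token prefix
tail = kv_flops(500)     # project only the 500 new tokens

print(f"full prefix: {full / 1e12:.1f} TFLOPs, {full / H100_PEAK_FLOPS * 1e3:.1f} ms")
print(f"new tail:    {tail / 1e12:.2f} TFLOPs, {tail / H100_PEAK_FLOPS * 1e3:.2f} ms")
print(f"500 full recomputes vs one cached tail: {500 * full // tail}x")
```

The last line is where the ~20,000× figure comes from: 500 recomputes of a 20,000-token prefix against a single pass over 500 tokens.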

These compute savings are also passed on to the consumer. Most LLM providers allow you to write to cache and reuse that cache for subsequent turns. This is great for agentic workflows because they often have to perform multiple turns to complete their task, and caching will help keep their costs low.

What this looks like across turns

One special case worth calling out: providers often keep an application’s system prompt in cache automatically. Since the same system prompt is reused across thousands of sessions for any given application, the provider can keep it warm by default, so you effectively pay cache-read prices for it from the very first call, without having to opt in or annotate anything. Whether that happens without explicit cache markers varies by provider; the pricing examples in this article assume it does.

Figure: a 5-turn agentic conversation, drawn once without caching and once with caching. The amber band at the start of every bar is the system prompt (always cached), which providers usually cache for free. The grey hatched portion is conversation history cached from prior turns. Only the solid-blue tail of each bar (the assistant's last response, a tool result, a new user message) is new this turn and needs K/V projection.

Back to Merrilin

The shape above is not hypothetical for Merrilin. The real problem starts at the bottom of the stack: a reader is confused about something specific, but answering that specific question requires a lot of repeated context.

The full chat transcripts were too redundant to embed again, so here are the two real sessions reduced to the part that matters for caching: the user turn, what Merrilin had to preserve, and how the billing shape changes as the prefix grows.

Figure: real Merrilin turns, aligned with their hidden prefix. The visible user message is small, but each row still depends on system instructions, book context, retrieved passages, and prior answers. Legend: fresh input without cache, cached-read prefix, new tail.

This is the billing pattern we care about. The prefix remains visible to the model, but it stops being treated like brand-new input on every turn. For a product like Merrilin, where follow-up questions are the core interaction, that distinction matters more than shaving a few words off a prompt.

What the bill looks like

Let’s price the 5-turn conversation above using Claude Opus 4.7: $5 / MTok base input, $6.25 / MTok for 5-minute cache writes (1.25× base), and $0.50 / MTok for cache reads (0.1× base) (Anthropic API pricing, May 2026).

The important thing to notice is that every turn re-reads the entire cached prefix — that’s how the model “sees” the conversation history each time. Only the new content at the tail of each turn is a cache write. And without caching, every one of those turns would be billed at the full input rate. Per-turn comparison:

| Turn | Tokens (reads + writes) | Without caching | With caching |
|---|---|---|---|
| 1 | 1,000 (800 + 200) | $0.00500 | $0.00165 |
| 2 | 1,800 (1,000 + 800) | $0.00900 | $0.00550 |
| 3 | 2,200 (1,800 + 400) | $0.01100 | $0.00340 |
| 4 | 2,450 (2,200 + 250) | $0.01225 | $0.00266 |
| 5 | 3,150 (2,450 + 700) | $0.01575 | $0.00560 |
| Total | 10,600 (8,250 + 2,350) | $0.05300 | $0.01881 |

(Without caching: tokens × $5 / MTok. With caching: reads × $0.50 + writes × $6.25, divided by 1M.)
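The table can be reproduced with a short script. The prices and per-turn token counts are the ones listed above; the function and variable names are just illustrative.

```python
BASE, CACHE_WRITE, CACHE_READ = 5.00, 6.25, 0.50  # dollars per million tokens

# (cached-read tokens, cache-write tokens) for each of the five turns.
TURNS = [(800, 200), (1_000, 800), (1_800, 400), (2_200, 250), (2_450, 700)]


def uncached_cost(reads, writes):
    """Every token is billed at the base input rate."""
    return (reads + writes) * BASE / 1_000_000


def cached_cost(reads, writes):
    """Prefix tokens bill as cache reads, the new tail as a cache write."""
    return (reads * CACHE_READ + writes * CACHE_WRITE) / 1_000_000


for turn, (reads, writes) in enumerate(TURNS, start=1):
    print(f"turn {turn}: ${uncached_cost(reads, writes):.5f} -> ${cached_cost(reads, writes):.5f}")

total_uncached = sum(uncached_cost(r, w) for r, w in TURNS)
total_cached = sum(cached_cost(r, w) for r, w in TURNS)
print(f"total: ${total_uncached:.5f} -> ${total_cached:.5f} ({total_uncached / total_cached:.2f}x)")
```

Swapping in another provider's rates only means changing the three constants at the top.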

So with caching this conversation costs $0.01881 vs. $0.05300 without: a ~2.82× reduction, saving about $0.0342 each time. Across a million agentic conversations of this size, that’s roughly $34,188 saved purely on input billing. Notice that the absolute saving per turn gets bigger as the conversation grows: turn 1 saves $0.00335, while turn 5 alone went from $0.01575 to $0.00560, saving $0.01015 (a 2.8× cut on the most expensive turn). A 10-turn conversation would lop off even more.

The same workload on OpenAI GPT-5.5 — $5 / MTok input, $0.50 / MTok cached input (OpenAI API pricing, May 2026) — works out as:

Without caching: 10,600 × $5 / MTok = $0.0530

With caching: 2,350 × $5 / MTok + 8,250 × $0.50 / MTok = $0.01588

That’s a ~3.3× reduction, slightly bigger than Opus because OpenAI doesn’t charge a cache-write premium: first-time tokens just bill at the standard input rate. Anthropic charges 1.25× for the cache write itself, and you recover that premium on subsequent reads. Both providers price cached reads at ~10% of base input, so the headline ratio lands in the same neighborhood regardless of who you bill against.
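How quickly the write premium pays for itself can be sketched in units of the base input price. `prefix_cost` is a hypothetical helper; it assumes the token is written once and every later use is a cache read that arrives while the cache entry is still alive.

```python
def prefix_cost(uses, write_mult=1.25, read_mult=0.1):
    """Cost of one prefix token sent `uses` times, in units of the base input price.

    Returns (uncached, cached): uncached pays full price on every use,
    cached pays one write and then a read for each later use.
    """
    uncached = uses * 1.0
    cached = write_mult + (uses - 1) * read_mult
    return uncached, cached


for uses in (1, 2, 5):
    uncached, cached = prefix_cost(uses)
    print(f"{uses} use(s): uncached {uncached:.2f}, cached {cached:.2f}")

# With no write premium (write_mult=1.0), a single use costs the same either way.
print(prefix_cost(1, write_mult=1.0))
```

With a 1.25× write and 0.1× reads, a token sent once costs more with caching (1.25 vs 1.00), but the second use already puts caching ahead (1.35 vs 2.00).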

Where this leaves us

Prompt caching is not something I would reach for on every request. If the prompt is short, or if every turn is mostly new information, there is not much to save.

For us, the motivation is practical. Merrilin has to feel generous to use, and generosity is hard if every follow-up pays full price for context the model already saw. We would rather spend our budget on better retrieval, better answers, and more room for people to ask messy follow-up questions than on recomputing the same prefix.

It starts paying off when the product has memory. Merrilin is full of that kind of request: the same instructions, the same book, the same reader state, the same retrieved passages, and then one small follow-up at the end. Coding agents have a similar shape with repo instructions and tool output. Research workflows have it with source material and notes.

The trick is mostly discipline. Keep the stable parts stable. Do not reshuffle long instructions or retrieved context for no reason. Put the changing user message at the end. If the provider can recognize the prefix, you stop paying full price for the same tokens every turn.
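One way to picture that discipline is a prompt builder that always emits the stable parts first and in the same order. The sketch below is a hypothetical layout, not Merrilin's actual code: `build_prompt`, `shared_prefix`, and the placeholder strings are all assumptions, and whole parts stand in for tokens.

```python
def build_prompt(system, book_context, passages, history, user_message):
    """Assemble prompt parts stable-first so consecutive turns share a prefix."""
    parts = [system, book_context]
    parts.extend(passages)      # keep retrieval order fixed across turns
    parts.extend(history)       # append-only: never rewrite earlier turns
    parts.append(user_message)  # the changing part goes last
    return parts


def shared_prefix(a, b):
    """Number of leading parts two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


turn_1 = build_prompt("sys", "book", ["p1", "p2"], [], "q1")
turn_2 = build_prompt("sys", "book", ["p1", "p2"], ["q1", "a1"], "q2")
shuffled = build_prompt("sys", "book", ["p2", "p1"], ["q1", "a1"], "q2")

print(shared_prefix(turn_1, turn_2))    # 5
print(shared_prefix(turn_1, shuffled))  # 2
```

With a stable order, all five parts of the first turn are reusable on the second. Swapping two retrieved passages cuts the reusable prefix to the system prompt and book context, even though the content is otherwise identical.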

That is all prompt caching is doing here. It does not make the model smarter. It just stops charging us as if the model has never seen the conversation before.