2026-06-06 15:38 UTC | In-site rewrite | 6 min read | Updated: 2026-06-30 13:03 UTC

Every AI Agent Feature Is a Cache Invalidation Surface

Yafei Lee, founder of OpenClacky, an open-source AI agent in Ruby, shares how building features like skills, memory, sub-agents, browser automation, dynamic model switching, and long-running sessions led to severe prompt caching issues. Over two years and three architecture generations (first two failed), they converged on seven engineering decisions that achieved 90%+ cache hit rates. The article details the failures of RAG and multi-agent orchestration, and the first three decisions: double cache markers, frozen system prompt, and single meta-tool.

Source: Hacker News AI | Author: gemHunter

I'm Yafei Lee, founder of OpenClacky, an open-source AI agent written in Ruby. We wanted an agent with skills, memory, sub-agents, browser automation, dynamic model switching, and long-running sessions. Each of those features made prompt caching worse in a different way.

That was the real architecture problem. Not how to call an LLM, not how to add another tool, not how to orchestrate more agents — how to keep the cache prefix stable while the product keeps changing.

Every agent feature is also a cache invalidation surface. Skills load new system context. Peer-agent workflows fork the prefix. Browser automation adds volatile tool output. Compression rewrites history. Model switching can fragment the cache namespace unless model-specific state stays out of the system prompt. If you're building a capable agent and your cache hit rate is much lower than expected, this is probably why.

Over two years and three architecture generations (the first two failed), we converged on seven engineering decisions that let us hit 90%+ cache rates across real tasks — while keeping all those features intact. What follows is the complete story: what broke, what we tried, and what actually worked.

Generation 1: RAG Everything (2024 – early 2025)

Our first agent was a textbook RAG system. We embedded the user's codebase, docs, and conversation history into a vector store. Every query went through hybrid retrieval, re-ranking, and query rewriting before the LLM saw anything.

It sounded right. It wasn't.

The costs never stopped climbing, and the data was always stale. Every codebase update required re-embedding. Real-time sync was unreliable, so the vector store lagged behind the actual code. We were paying more and more to search an index that was increasingly wrong.

And 90% recall is not good enough. One in ten retrievals returned the wrong context. For an agent that chains multiple steps, that error compounds fast. A wrong file in step 2 means a wrong edit in step 3 means a wasted retry in step 4. We estimated that 97% recall might be the bare minimum for an agent to be net-positive, and we were nowhere close. On top of that, the vector database was one more component that could crash, lag, or return garbage. Every extra piece between the user and the LLM is a place where latency hides and errors compound.
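A back-of-the-envelope sketch of how that compounding plays out. It assumes each step depends on exactly one retrieval and that retrievals fail independently, which is a simplification and not something measured in this post; `chain_success` is an illustrative name.

```ruby
# Probability that every retrieval in a chain of `steps` is correct,
# assuming independent retrievals with the same per-step recall.
def chain_success(recall, steps)
  recall**steps
end

[0.90, 0.97].each do |recall|
  [1, 4, 8].each do |steps|
    pct = (chain_success(recall, steps) * 100).round(1)
    puts "recall=#{recall} steps=#{steps} -> #{pct}% clean runs"
  end
end
```

Under those assumptions, a four-step chain at 90% recall finishes without a bad retrieval roughly two times in three, while at 97% it is closer to nine in ten.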

For coding agents working over local repos, we killed RAG entirely. No embeddings, no vector store, no retrieval pipeline. If the agent needs context, it reads files directly or searches with grep. If your documentation needs to be accessible to an agent, make it readable on a website. Don't shred it into embeddings.
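As a rough illustration of what "searches with grep" can look like as an agent tool, here is a minimal sketch in plain Ruby. The helper name `search_repo`, its parameters, and the output format are all hypothetical and are not taken from the OpenClacky source.

```ruby
# Hypothetical helper: search files under `root` for a pattern and return
# "path:line_number: text" strings, similar in spirit to `grep -rn`.
def search_repo(root, pattern, glob: "**/*", limit: 50)
  regex = pattern.is_a?(Regexp) ? pattern : Regexp.new(Regexp.escape(pattern))
  hits = []
  Dir.glob(File.join(root, glob)).sort.each do |path|
    next unless File.file?(path)
    File.foreach(path).with_index(1) do |line, lineno|
      # Skip lines that are not valid text (for example binary files).
      next unless line.valid_encoding? && line.match?(regex)
      hits << "#{path}:#{lineno}: #{line.strip}"
      return hits if hits.size >= limit
    end
  end
  hits
end
```

A call such as `search_repo(".", "inject_session_context", glob: "**/*.rb")` always reads the current files on disk, so there is no index that can go stale.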

Generation 2: Multi-Agent Orchestration (mid-2025)

The next idea was straight from the SWE-bench leaderboard playbook: a Planner agent, a Coder agent, a Reviewer agent, and a Tester agent, all coordinated through a message bus with role-specific prompts.

We got decent SWE-bench scores. The product was terrible.

Every agent handoff was a cache miss. Each sub-agent had its own system prompt and cache namespace, so every handoff wiped the receiving agent's cache prefix. Cache misses were not the only cost: each handoff also forced us to serialize rich state into a smaller message, and useful context was lost at the boundary.

A task that one agent could finish in 4 minutes took 14 minutes with four. The coordination overhead was real: agents waited for each other, re-read context the previous agent had already processed, and occasionally contradicted each other's decisions.

Cost was 6× higher. Four separate cache namespaces, four system prompts, constant serialization. The "divide work among specialists" intuition that works for human teams doesn't transfer to LLMs. A single frontier model is already a generalist. You're not dividing labor; you're multiplying overhead.

Debugging was a nightmare. When the final output was wrong, which agent caused it? The Planner gave ambiguous instructions? The Coder misinterpreted them? The Reviewer missed the bug? We spent more time tracing failures through the pipeline than we spent on the original task. At least with a single agent, when something goes wrong, you read one conversation and find the mistake.

SWE-bench scores didn't predict user satisfaction. We could tune the multi-agent pipeline to pass specific benchmarks, but the failure modes that annoyed real users (slow iteration, losing context across handoffs, inconsistent code style) weren't what benchmarks measured.

We killed role-based multi-agent orchestration. One main agent, one conversation, one cache namespace. Sub-agents survived only as isolated skill execution contexts, invoked through a single stable tool.

Two generations, same conclusion: the model is already smart enough. What it needs isn't more models, it's a better harness.

The Seven Decisions

Generation 3 started from a question: what if we optimized everything around a single agent's cache hit rate? Not as a cost hack, but as an architectural principle. High cache hits mean the model sees consistent context, responds faster, and costs less. Every decision below serves that goal.

(The code is open source. Links to the exact files implementing each decision are at the end of this post.)

Decision 1: History Growth Breaks Prefix Matching → Double Cache Markers

Prompt caching works by prefix matching. The LLM provider stores a hash of the message prefix; if your next request shares that prefix, you get the cached rate (depending on the provider, cached tokens are priced at a fraction of normal input tokens). The way you tell the provider where to cache is by placing cache_control markers on specific messages.
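For readers who have not used these markers, here is a minimal sketch of what a marked message can look like. It assumes an Anthropic-style Messages API request, where `cache_control` sits on a content block; the helper `to_block_message` is a made-up name for illustration.

```ruby
require "json"

# Wrap a message's text in a content block, optionally flagged as a
# cache breakpoint (Anthropic-style `cache_control` field, assumed here).
def to_block_message(role, text, cache: false)
  block = { type: "text", text: text }
  block[:cache_control] = { type: "ephemeral" } if cache
  { role: role, content: [block] }
end

messages = [
  to_block_message("user", "List the failing specs."),
  to_block_message("assistant", "Three specs fail in user_spec.rb."),
  to_block_message("user", "Fix the first one.", cache: true) # breakpoint at the tail
]

puts JSON.pretty_generate(messages.last)
```

Everything up to and including the marked block is what the provider is asked to cache as a prefix.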

The naive approach is one marker on the last message. It breaks in three ways:

(1) History grows monotonically. You mark message N. Next turn, message N+1 is appended. The content at the position of your old marker has changed, so it's a cache miss on the entire history.

(2) Tool call retries. The model's last tool call errors out, or the user hits Ctrl-C. The "last message" gets discarded, and your marker vanishes with it.

(3) Mid-session model switches. The user switches from Sonnet to Opus. You want to share as much prefix as possible across models. Any unnecessary marker movement becomes a cache miss event.

We hit problem (1) first. The fix progression is visible in our git log:

8ff66cc fix: cache
6ea99fe fix: prompt cache
e9a3602 feat: prompt cache works fine
7734c97 feat: try 2 point cache

The first three commits were incremental patches. The last one was the structural fix: two markers instead of one.

How double markers work

Every turn, we mark two consecutive messages, not one:

Turn N:   [..., msg_A, msg_B(*), msg_C(*)]
                          ↑         ↑
                      marker 1  marker 2

Turn N+1: [..., msg_A, msg_B(*), msg_C(*), msg_D(*)]
                          ↑         ↑         ↑
                          │         │         new marker
                          │         (still there)
                          (still there)

On turn N+1, the provider tries to match the marker on msg_C and hits everything before it (system prompt + tools + full history minus the last message). We place a new marker on msg_D for the next turn.

This is a rolling double buffer: at any moment we hold two breakpoints — one being "read" (from the previous turn) and one being "written" (at the current tail). Next turn, the old "write" becomes the new "read," and we write a fresh one at the new tail. There's never a moment where both buffers are invalid simultaneously.

Why exactly 2, not 3 or 4

Each additional marker costs a cache write at write-tier pricing. The only failure boundary we need to cover is the "old tail / new tail" edge, and two markers is exactly the minimum for that. A third marker lands further back in the prefix, writing a segment that will never be read independently. 2 covers the boundary. 3 is redundant.

Surviving tool call retries

This is the second benefit, and the actual motivation behind commit 7734c97. When the model retries a tool call (error, Ctrl-C, broken stream), the last message gets discarded. With a single marker, that's an immediate cache miss. With double markers, the second-to-last marker usually survives, so single-step rollback still hits cache. Three markers would survive two-step rollbacks, but the cost doesn't justify the edge case.
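The difference between one and two markers can be seen in a toy simulation. This is a sketch under strong assumptions: the `ToyProviderCache` class below is invented for illustration and models a provider that only recognizes prefixes ending exactly at a previously marked message. Real providers have their own matching rules, so treat it as a model of the reasoning above rather than of any specific API.

```ruby
require "digest"
require "set"

class ToyProviderCache
  def initialize
    @written = Set.new
  end

  # Marks the last `markers` messages, reports how many leading messages
  # were already cached, then records the newly marked prefixes.
  def request(messages, markers:)
    first = [messages.size - markers, 0].max
    marked = (first...messages.size).to_a
    digests = marked.to_h { |i| [i, Digest::SHA256.hexdigest(messages[0..i].join("\n"))] }
    hit = marked.reverse.find { |i| @written.include?(digests[i]) }
    @written.merge(digests.values)
    hit ? hit + 1 : 0
  end
end

def simulate(markers)
  cache = ToyProviderCache.new
  cache.request(%w[sys u1 a1 u2], markers: markers)            # turn N
  grown = cache.request(%w[sys u1 a1 u2 a2], markers: markers) # turn N+1
  # a2 is discarded (error or Ctrl-C) and replaced by a retry
  retried = cache.request(%w[sys u1 a1 u2 a2_retry], markers: markers)
  { grown: grown, retried: retried }
end

p simulate(1)
p simulate(2)
```

In this model a single marker reuses zero messages in both the growth and the retry case, while two markers reuse the first four messages in both, because the marker on the second-to-last message is still there to match against.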

Messages that must never be marked

Our marker selection logic has one hard rule: skip any message tagged system_injected: true. These are ephemeral messages (session context blocks, compression instructions) that won't exist in the same form next turn. A marker on them is a write that will never be read back. The selector walks backward from the tail, skips system_injected messages, and stops when it has two real conversation messages.
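A minimal sketch of that backward walk. The message shape (hashes with a `:system_injected` key) and the name `cache_marker_indices` are assumptions made for the example, not the project's actual data model.

```ruby
# Returns the indices of the messages that should carry a cache marker:
# the last `count` messages that are not tagged system_injected.
def cache_marker_indices(history, count: 2)
  picked = []
  (history.size - 1).downto(0) do |i|
    next if history[i][:system_injected]
    picked << i
    break if picked.size == count
  end
  picked.reverse
end

history = [
  { role: "user", content: "Run the tests." },
  { role: "assistant", content: "Two failures." },
  { role: "user", content: "[Session context: ...]", system_injected: true },
  { role: "user", content: "Fix them." }
]

p cache_marker_indices(history) # => [1, 3]
```

The injected message at index 2 is skipped, so the markers land on the two nearest real conversation messages.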

Decision 2: Dynamic Session State Breaks System Prompts → Frozen System Prompt

Engineering discipline: our agent's system prompt is built once at session start, then byte-frozen. Any requirement to put dynamic information in the system prompt gets redirected elsewhere.

This is the foundation of the entire cache strategy. If the system prompt changes, every subsequent cache entry is invalidated. There is no partial fix.

But at least four kinds of information naturally "want" to live in the system prompt:

Current date, working directory, OS — the model needs these for correct commands.

Current model ID — helpful for self-adaptive behavior.

Newly installed skills — the model needs to see skill names to invoke them.

Updated user preferences (USER.md / SOUL.md) — the agent's personality and user context.

All four can change mid-session. If any of them is in the system prompt, a single change invalidates everything.
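One way to enforce "built once, then byte-frozen" in Ruby is sketched below. The `Session` class, its constructor arguments, and the digest guard are illustrative assumptions; the post does not show how OpenClacky implements the freeze.

```ruby
require "digest"

# Hypothetical session object: the system prompt is rendered once and frozen.
class Session
  attr_reader :system_prompt, :prompt_digest

  def initialize(skills:, base_instructions:)
    @system_prompt = render(base_instructions, skills).freeze
    @prompt_digest = Digest::SHA256.hexdigest(@system_prompt)
  end

  # Cheap guard to call before each request: fails loudly if anything
  # replaced the prompt with different bytes.
  def assert_prompt_unchanged!
    return if Digest::SHA256.hexdigest(@system_prompt) == @prompt_digest
    raise "system prompt changed mid-session; cache prefix is invalid"
  end

  private

  # Sorting keeps the rendering deterministic for the same set of skills.
  def render(base, skills)
    "#{base}\n\nAvailable skills:\n" + skills.sort.map { |s| "- #{s}" }.join("\n")
  end
end
```

Freezing the string turns an accidental in-place mutation into a `FrozenError` at the point of the mistake, instead of a silent drop in cache hit rate several turns later.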

The [session context] block

Instead of the system prompt, we inject this information as a regular user message in the conversation history:

[Session context: Today is 2026-05-13, Tuesday. Current model: claude-sonnet-4-6. OS: macOS. Working directory: /Users/.../project]

This message is tagged system_injected: true. It won't be selected by cache markers (Decision 1), won't count as a real user turn, and gets discarded during compression. Injection is date-gated: one per day, plus one on model switch. Most sessions see exactly one.
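The gating rule (once per day, plus once on a model switch) can be sketched as follows. `SessionContextGate` and its method are hypothetical names; only the message format and the `system_injected` tag come from the description above.

```ruby
require "date"

# Hypothetical gate: inject a session-context message at most once per
# calendar day, plus once whenever the model changes.
class SessionContextGate
  def initialize
    @last_date = nil
    @last_model = nil
  end

  # Returns the message to append, or nil when nothing changed.
  def message_for(today:, model:, os:, cwd:)
    return nil if today == @last_date && model == @last_model

    @last_date = today
    @last_model = model
    {
      role: "user",
      system_injected: true,
      content: "[Session context: Today is #{today.iso8601}, #{today.strftime('%A')}. " \
               "Current model: #{model}. OS: #{os}. Working directory: #{cwd}]"
    }
  end
end
```

Calling `message_for` before every turn is cheap: it returns `nil` on all but the first turn of the day or the first turn after a model switch, so the history prefix stays untouched.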

A bug that took a day to find

Our first implementation of inject_session_context was eager. It fired during agent construction, before the system prompt was built. This meant @history.empty? returned false, so run() skipped system prompt construction entirely. The agent sent its first request with a "today is Tuesday" message but no system prompt. Behavior was subtly broken for a day before we traced it.

The fix was one line: inject after the system prompt is built. The code comment that survived:

# IMPORTANT: Skip injection when the system prompt hasn't been built yet.
# Otherwise, appending a user message to an empty history makes
# @history.empty? false, which causes run() to skip building the
# system prompt entirely.

Assembly order matters more than content. You can spend weeks designing each piece of the prefix, but if the assembly sequence is wrong by one step, the entire cache strategy is void.
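A stripped-down sketch of the ordering guard, to make the failure mode concrete. `MiniAgent` is an invented class; it assumes, as the `@history.empty?` check suggests, that the system prompt is stored as the first history entry. The real `run()` certainly does more than this.

```ruby
# Hypothetical, simplified agent showing the ordering guard.
class MiniAgent
  attr_reader :history

  def initialize
    @history = []
    @system_prompt_built = false
  end

  def inject_session_context(text)
    # Skip while the system prompt is missing; run() injects afterwards.
    return false unless @system_prompt_built

    @history << { role: "user", content: text, system_injected: true }
    true
  end

  def run(user_input)
    if @history.empty?
      @history << { role: "system", content: "You are a coding agent." }
      @system_prompt_built = true
      inject_session_context("[Session context: ...]")
    end
    @history << { role: "user", content: user_input }
    @history.map { |m| m[:role] }
  end
end

agent = MiniAgent.new
agent.inject_session_context("[Session context: ...]") # too early: ignored
p agent.run("Fix the failing spec.")
```

Without the early return in `inject_session_context`, the premature call would leave one message in the history, `run` would see a non-empty history, and the system prompt would never be added.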

How skill discovery works without touching the system prompt

Skills are rendered into the system prompt at session start, then frozen. A skill installed mid-session won't appear until the next session. We accept this friction. Re-rendering the system prompt on every skill install would invalidate the cache for all users on all sessions on every turn. Skill installation is low-frequency; cache hits are per-turn. The tradeoff is clear.

That said, invoke_skill reads each SKILL.md at call time, not at session start. So if a user explicitly asks for a newly installed skill, the system can still find and execute it, though it won't auto-discover it from the skill listing.
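Reading the skill file at call time can be as simple as the sketch below. The directory layout (`<skills_dir>/<name>/SKILL.md`) and the function name `load_skill` are assumptions for illustration; the post only states that SKILL.md is read when the skill is invoked.

```ruby
# Hypothetical lookup: resolve a skill by name at call time by reading
# <skills_dir>/<name>/SKILL.md from disk, so a skill installed after
# session start can still be executed when asked for by name.
def load_skill(skills_dir, name)
  # Reject names that could escape the skills directory.
  raise ArgumentError, "invalid skill name: #{name.inspect}" unless name.match?(/\A[\w-]+\z/)

  path = File.join(skills_dir, name, "SKILL.md")
  return nil unless File.file?(path)

  File.read(path)
end
```

Because nothing is cached in memory, the frozen system prompt and the set of executable skills are allowed to drift apart within a session: the listing is stale, the lookup is not.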

Decision 3: Skills and Sub-Agents Bloat History → One Meta-Tool

invoke_skill is one of our 16 tools and does more work than any other. It provides
