2026-06-03 17:41 UTC | In-site rewrite | 5 min read | Updated: 2026-06-30 13:03 UTC

Lean Inference: Lean Manufacturing Principles Applied to AI

This article applies lean manufacturing principles to AI inference, identifying seven wastes in LLM inference and proposing core principles like just-in-time context, standardized work, takt time, and prompt caching. A repo analysis agent case study shows a 13x cost reduction and 3.3x latency improvement.

Source: Hacker News AI | Author: robmay

Rob May

Jun 03, 2026

Here’s a production scenario that should feel familiar: your agent hits a simple routing decision—does this user query need a database lookup or a calculator?—and it fires off a GPT-4o call with a 12,000-token context window stuffed with documentation it will never read, waits 4 seconds for a response, gets back malformed JSON, retries twice, and burns $0.40 to answer a question that a regex could have handled.

Multiply that across 10,000 daily requests. Congratulations—you’ve built an inference money pit.

The AI engineering community collectively discovered that “just throw it at a frontier model” works great in demos and collapses in production. Agents enter retry death spirals. Context windows bloat with irrelevant RAG results. Sequential LLM calls stack latency until users abandon the workflow. The tools are extraordinarily powerful, and we are using them with the efficiency of a factory floor that nobody has ever walked with a stopwatch.

Lean Manufacturing fixed this problem for physical production 40 years ago. It’s time to apply the same discipline to inference.

Lean Inference Workflows are the systematic application of Lean/TPS (Toyota Production System) principles to the design of LLM-powered agent architectures. Not as metaphor—as engineering discipline.

The 7 Wastes of LLM Inference

Taiichi Ohno’s muda framework identified seven categories of waste in manufacturing. The five covered below map cleanly onto the failure modes we build into agents every day.

Overproduction — The Frontier Model Default

The most expensive waste is calling a 70B+ frontier model for tasks that don’t need it. Routing a support ticket to the right queue? That’s an 8B classification task. Extracting structured fields from a form submission? That’s a fine-tuned 3B model with a JSON schema. Summarizing a 500-word support thread? You don’t need GPT-4o.

The cost asymmetry is staggering. Claude Sonnet runs ~3x the cost of Haiku per token. GPT-4o costs roughly an order of magnitude more per token than GPT-4o-mini. When you reflexively reach for the frontier model on every step of a 15-step agent loop, you’re not just overspending; you’re adding latency at every node. If your task is a common one, you can even move to small language models (SLMs), which are faster and can be two orders of magnitude cheaper.

Treat your agent’s model selection the same way a traffic engineer treats routing decisions—based on payload size, complexity score, and confidence threshold, not habit.
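A minimal sketch of that routing decision. The tier names (`small`, `mid`, `frontier`) and every threshold are hypothetical, and the complexity score and router confidence are assumed to come from an upstream heuristic or small classifier that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt_tokens: int        # payload size
    complexity: float         # 0.0 (trivial) to 1.0 (hard), from an upstream heuristic
    router_confidence: float  # how sure the upstream classifier is about `complexity`


def pick_model(task: Task) -> str:
    """Return a hypothetical model tier name for the task.

    Thresholds are illustrative assumptions, not tuned values.
    """
    # Low confidence in the complexity estimate: escalate rather than guess.
    if task.router_confidence < 0.6:
        return "frontier"
    # Small, simple payloads go to the cheapest tier.
    if task.complexity < 0.3 and task.prompt_tokens < 2_000:
        return "small"
    # Moderately complex or mid-sized payloads use the middle tier.
    if task.complexity < 0.7 and task.prompt_tokens < 16_000:
        return "mid"
    return "frontier"
```

The point of the sketch is that the decision is an explicit, testable function of measurable inputs, so the frontier model becomes the exception path rather than the default.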

Inventory — RAG Bloat

Your vector database returns the top-20 chunks, and you shove all 20 into the context window “just in case.” That’s inventory waste: stockpiling inputs you probably won’t use, forcing the model to process them, inflating your input token count, and degrading retrieval precision in the process. More context isn’t better—it’s a longer assembly line with more defect opportunities.

Controlled inventory means retrieving fewer, better chunks via re-ranking (a cross-encoder pass over your top-k candidates), then truncating aggressively before injection.
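One way to sketch the re-rank-then-truncate step. The `overlap_score` function is only a stand-in so the example runs without a model; a real pipeline would plug a cross-encoder score into the `score` parameter. The word count is a crude proxy for tokens, and `top_n` and `max_words` are illustrative defaults.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query words present in the chunk.

    A real pipeline would call a cross-encoder here instead.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def select_context(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
    max_words: int = 200,
) -> list[str]:
    """Re-rank retrieved chunks, keep the best `top_n`, and cap total size."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked[:top_n]:
        words = len(chunk.split())  # crude proxy for a token count
        if used + words > max_words:
            break  # budget exhausted: stop injecting context
        selected.append(chunk)
        used += words
    return selected
```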

Waiting — Sequential Blocking

Tool calls that could run in parallel are running in series. You need to fetch a user’s account history, check their subscription tier, and retrieve their recent support tickets. Instead of three parallel async calls, you have three sequential blocking calls: 300ms + 280ms + 310ms = 890ms of pure waiting. Run concurrently, the same three calls would finish in about 310ms, the latency of the slowest one.

The fix is concurrent execution with async/await: the asyncio.gather (Python) or Promise.all (JavaScript) call you should have made. In a multi-step agent DAG, every synchronous bottleneck is a latency tax.
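The difference can be sketched with `asyncio`. Here `fake_call` is a hypothetical stand-in for an I/O-bound tool call, and the sleep durations mirror the 300ms, 280ms, and 310ms figures from the example above.

```python
import asyncio
import time


async def fake_call(name: str, latency_s: float) -> str:
    """Stand-in for an I/O-bound tool call (HTTP request, DB query)."""
    await asyncio.sleep(latency_s)
    return name


async def fetch_sequential() -> list[str]:
    # Each await blocks the next: total is roughly the sum of latencies.
    return [
        await fake_call("account_history", 0.30),
        await fake_call("subscription_tier", 0.28),
        await fake_call("recent_tickets", 0.31),
    ]


async def fetch_parallel() -> list[str]:
    # All three start together: total is roughly the slowest call.
    return list(await asyncio.gather(
        fake_call("account_history", 0.30),
        fake_call("subscription_tier", 0.28),
        fake_call("recent_tickets", 0.31),
    ))


def timed(coro_fn) -> tuple[list[str], float]:
    """Run an async function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Calling `timed(fetch_sequential)` and `timed(fetch_parallel)` returns the same three results in the same order; only the wall-clock time differs. `asyncio.gather` helps only when the calls are independent and I/O-bound, which is the case described here.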

Defects — Malformed Outputs and Retry Loops

An agent asks for a JSON tool call. The model returns Markdown-wrapped JSON with an extra trailing comma. Your parser throws. The orchestrator retries. The model hallucinates a different schema on the retry. You’re now three LLM calls deep on a task that should have been one.

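Defects in inference are uniquely expensive because retries aren’t cheap reruns; they’re full-price LLM calls on an already-failed path. Structured outputs remove most of this failure class. OpenAI’s response_format with a strict JSON schema constrains decoding at the token level, so a completed response conforms to the schema. Anthropic’s tool use schemas and the instructor library for Python steer the model toward a schema and validate the result, so malformed output is caught at a defined boundary rather than deep in the orchestrator.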

Over-Processing — Unnecessary Chain-of-Thought

CoT is a forcing function for reasoning. It is not a default that belongs in every prompt. A routing classifier does not need to explain its reasoning to itself before assigning a ticket category. A field extractor does not need reasoning tokens. Stripping CoT from non-reasoning tasks can cut your output token count by 40–60% on those steps, typically with no measurable quality loss.

Core Principles of Lean Inference

Just-In-Time Context: The Pull System

In Lean manufacturing, a pull system means downstream demand triggers upstream production—nothing gets built until it’s needed. JIT Context means your agent fetches context exactly when a step requires it, scoped precisely to what that step needs.

The anti-pattern is the “God Context”: a single massive system prompt that pre-loads everything the agent might need across all possible execution paths. You pay the full token tax on every call, even when 80% of that context is never accessed.

The Lean pattern:

Semantic caching at the retrieval layer: If a semantically similar query was answered 30 seconds ago, return the cached retrieval result instead of making a fresh DB round-trip.

Re-ranking before injection: Run a cross-encoder (a fast, cheap model like ms-marco-MiniLM-L-6-v2) over your retrieved chunks before injecting them into the LLM context. Top-3 precision beats top-20 recall for most tasks.

Step-scoped context: Each node in your agent DAG gets only the context its specific tool call requires. The summarization node doesn’t need the tool definitions. The routing node doesn’t need the document corpus.
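Step-scoped, pull-based context might look like the sketch below. The loader functions, step names, and returned strings are all hypothetical; the `calls` list exists only to make visible which loaders actually ran.

```python
from typing import Callable

# Records which loaders ran, purely for demonstration.
calls: list[str] = []


def load_tool_definitions() -> str:
    # Hypothetical loader; a real one would read a registry or config.
    calls.append("tools")
    return "TOOLS: lookup_db, calculator"


def load_documents() -> str:
    # Hypothetical loader; a real one would query and re-rank a vector store.
    calls.append("docs")
    return "DOCS: <re-ranked top-3 chunks>"


# Each step declares the only context it is allowed to pull.
STEP_CONTEXT: dict[str, list[Callable[[], str]]] = {
    "route": [load_tool_definitions],
    "summarize": [load_documents],
}


def build_context(step: str) -> str:
    """Fetch context lazily, only for the step that is about to run."""
    return "\n".join(loader() for loader in STEP_CONTEXT[step])
```

Because loaders are called only inside `build_context`, the routing step never pays for the document corpus and the summarization step never pays for tool definitions, which is the opposite of a pre-loaded "God Context".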

Standardized Work: Deterministic Guardrails

Lean’s standardized work principle says that defined, repeatable processes reduce variation and defects. In agent architecture, this translates to: make your LLM do as little undirected reasoning as possible.

Logic that can be encoded deterministically should be. Your state machine transitions, routing rules, retry budgets, and tool call sequencing should live in code—not in a prompt asking the model to figure it out.
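As an illustration, transitions and the retry budget can be written as a pure function. The state names and the `MAX_RETRIES` value are assumptions made for this sketch, not taken from any particular framework.

```python
from enum import Enum


class State(Enum):
    ROUTE = "route"
    CALL_TOOL = "call_tool"
    VALIDATE = "validate"
    DONE = "done"
    HANDOFF = "handoff"


MAX_RETRIES = 2  # retry budget lives in code, not in the prompt


def next_state(state: State, ok: bool, retries: int) -> State:
    """Pure transition function: same inputs always give the same next state."""
    if state is State.ROUTE:
        return State.CALL_TOOL
    if state is State.CALL_TOOL:
        return State.VALIDATE
    if state is State.VALIDATE:
        if ok:
            return State.DONE
        # Failed validation: retry while budget remains, else escalate.
        return State.CALL_TOOL if retries < MAX_RETRIES else State.HANDOFF
    return state  # DONE and HANDOFF are terminal
```

A function like this can be unit-tested exhaustively, which is not possible when the model decides the next step from a prompt.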

Tools like LangGraph let you encode agent control flow as an explicit graph: nodes are LLM calls or tool invocations, edges are conditional transitions, and the state machine is a first-class object you can inspect, test, and version-control. This is categorically different from a single ReAct loop where the model decides everything.

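Structured outputs (via instructor, OpenAI’s Structured Outputs with strict: true, or Anthropic’s tool schemas) are the manufacturing equivalent of a jig: they constrain the output to the valid shape. Where the provider enforces the schema during decoding, as with strict: true, malformed output becomes structurally impossible rather than probabilistically unlikely; where the schema is checked after generation, as with instructor’s validation, defects are caught before they reach downstream steps.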
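The jig idea can be illustrated with a standard-library validator that runs on the model’s raw output. This is a post-hoc check, not provider-side constrained decoding and not the instructor API; the `RoutingDecision` shape, the queue names, and the priority range are all hypothetical.

```python
import json
from dataclasses import dataclass

ALLOWED_QUEUES = {"billing", "technical", "account"}  # hypothetical queue names


@dataclass(frozen=True)
class RoutingDecision:
    queue: str
    priority: int


def parse_routing(raw: str) -> RoutingDecision:
    """Parse model output and reject anything outside the expected shape."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on malformed JSON.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"queue", "priority"}:
        raise ValueError("expected exactly the keys 'queue' and 'priority'")
    if not isinstance(data["queue"], str) or data["queue"] not in ALLOWED_QUEUES:
        raise ValueError(f"unknown queue: {data['queue']!r}")
    # type(...) is int rejects booleans, which isinstance(..., int) would accept.
    if type(data["priority"]) is not int or not 1 <= data["priority"] <= 5:
        raise ValueError("priority must be an integer from 1 to 5")
    return RoutingDecision(queue=data["queue"], priority=data["priority"])
```

Every failure surfaces as a `ValueError` at one boundary, so the orchestrator can apply its retry budget in a single place instead of discovering bad output several steps later.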

Takt Time: The Latency Budget

Takt time in manufacturing is the maximum allowable time per unit to meet customer demand. In agent design, every workflow should have an explicit latency budget per step and per full execution path.

Define your takt time first. If your end-to-end SLA is 2 seconds and you have 6 agent steps, your average per-step budget is ~333ms. That budget forces architectural decisions:

Can this step use a smaller model to hit the latency target?

Should this step be parallelized?

Does this step even need an LLM, or is a heuristic or cached result sufficient?
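The budget arithmetic above can be made executable so that measured latencies are checked against it. This sketch assumes an even split across steps; the step names in the usage note are hypothetical.

```python
def per_step_budget_ms(sla_ms: float, steps: int) -> float:
    """Even split of the end-to-end budget across steps."""
    return sla_ms / steps


def over_budget(step_latencies_ms: dict[str, float], sla_ms: float) -> list[str]:
    """Return the steps whose measured latency exceeds the even-split budget."""
    budget = per_step_budget_ms(sla_ms, len(step_latencies_ms))
    return [name for name, ms in step_latencies_ms.items() if ms > budget]
```

For a 2,000ms SLA over 6 steps, `per_step_budget_ms(2000, 6)` gives roughly 333ms. Feeding `over_budget` a mapping such as `{"route": 80, "retrieve": 400, ...}` returns the steps that are candidates for a smaller model, parallelization, or replacement with a heuristic. An even split is the simplest policy; a weighted split is equally valid as long as the total stays within the SLA.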

DAG decomposition is your primary tool here. A complex task that looks like a single LLM call is often a DAG of 4–6 smaller model calls that can execute in parallel, each with a faster time to first token (TTFT) on a smaller model, combining to lower overall latency than the single big call.
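To see why a DAG can beat a serial chain, compare the sum of step latencies with the critical path. The step names and latencies below are invented for illustration.

```python
from functools import lru_cache

# Hypothetical DAG: step -> (latency in ms, steps it depends on)
DAG: dict[str, tuple[int, list[str]]] = {
    "route": (80, []),
    "read_a": (300, ["route"]),
    "read_b": (280, ["route"]),
    "read_c": (310, ["route"]),
    "summarize": (400, ["read_a", "read_b", "read_c"]),
}


@lru_cache(maxsize=None)
def finish_time(step: str) -> int:
    """Earliest finish time of a step if independent steps run in parallel."""
    latency, deps = DAG[step]
    return latency + max((finish_time(d) for d in deps), default=0)


# Running every step one after another costs the sum of all latencies.
sequential_ms = sum(latency for latency, _ in DAG.values())
# Running independent steps concurrently costs only the longest dependency chain.
parallel_ms = max(finish_time(step) for step in DAG)
```

With these numbers the serial total is 1,370ms while the critical path (route, then the slowest read, then summarize) is 790ms. The latency budget should be checked against the critical path, not the step count.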

Prompt Caching as Kanban

Anthropic’s prompt caching and OpenAI’s equivalent cache system are Kanban cards for inference: reusable, pre-positioned work items that don’t need to be re-manufactured from scratch.

Your system prompt, tool definitions, and static knowledge base content are the same across thousands of requests. Cache them. On Anthropic’s API, a cache hit on a 10,000-token system prompt costs 10% of the base input token price. Over millions of calls, this is not a micro-optimization—it’s a cost structure change.

Design your prompts with cache-friendly prefix ordering: static tool definitions and system prompt first, dynamic context last. (Anthropic’s cache prefix is built in the order tools, system, then messages.) Anything that changes per-request must come after anything that doesn’t.
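A back-of-the-envelope sketch of the cost effect. The 10% cached-read multiplier is the figure quoted above; the per-million-token price is a parameter you supply, and cache-write surcharges are deliberately left out, so treat the result as an approximation.

```python
def input_cost(
    static_tokens: int,
    dynamic_tokens: int,
    price_per_mtok: float,
    cache_hit: bool,
    cached_read_multiplier: float = 0.10,
) -> float:
    """Input cost in dollars for one request.

    On a cache hit the static prefix is billed at `cached_read_multiplier`
    times the base price; dynamic tokens are always billed at the base price.
    Cache-write surcharges are ignored in this sketch.
    """
    static_rate = cached_read_multiplier if cache_hit else 1.0
    billable = static_tokens * static_rate + dynamic_tokens
    return billable * price_per_mtok / 1_000_000


def build_prompt(static_parts: list[str], dynamic: str) -> str:
    # Static parts first so the shared prefix is byte-identical across requests.
    return "\n\n".join([*static_parts, dynamic])
```

With a 10,000-token static prefix, 500 dynamic tokens, and an assumed price of $3 per million input tokens, a miss costs about $0.0315 and a hit about $0.0045, roughly a 7x difference on input cost for that request.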

Before and After: Repo Analysis Agent

Before (Naive Architecture)

A single ReAct loop. One GPT-4o call per step. Full repository context dumped into the window on every iteration. Tool definitions re-sent each time. Sequential file reads. No output validation. Average: 14 seconds, ~85,000 tokens, ~$1.20 per run.

After (Lean Architecture)

A small router model (8B, fine-tuned) classifies the task type and selects the appropriate specialist pipeline — adds 80ms, saves 60% of downstream model costs

Prompt caching on tool definitions and system context — 90% cache hit rate after warmup

Parallel tool execution for file reads — 4 simultaneous reads instead of sequential

Structured output enforcement via instructor — zero retry loops in 500-run benchmark

Strict step budget: 6 steps max, with a fallback to human handoff at budget exhaustion

Result: 4.2 seconds average, ~18,000 tokens, ~$0.09 per run. Same output quality score on eval suite. 13x cost reduction. 3.3x latency improvement.
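The headline ratios follow directly from the before and after figures, as this short check shows. The daily-savings line assumes the 10,000 requests per day from the opening scenario, which the case study itself does not state.

```python
before = {"seconds": 14.0, "tokens": 85_000, "dollars": 1.20}
after = {"seconds": 4.2, "tokens": 18_000, "dollars": 0.09}

# Improvement factor for each metric: before divided by after.
ratios = {key: before[key] / after[key] for key in before}

# Assumption: 10,000 runs per day, borrowed from the opening scenario.
daily_savings = (before["dollars"] - after["dollars"]) * 10_000
```

This gives about 13.3x on cost, 3.3x on latency, and 4.7x on tokens, with roughly $11,100 saved per day at the assumed volume.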

Conclusion

The next frontier of AI engineering isn’t a bigger context window or a more capable base model. It’s the discipline to use what we already have without waste.

Every unnecessary frontier model call, every bloated RAG context, every sequential blocking operation, every retry loop from malformed output—these are engineering failures, not model failures. We built them. We can fix them.

Lean Inference isn’t a philosophy—it’s a set of concrete architectural decisions you can make this sprint. Audit your agent’s token burn by step. Map your sequential calls. Add structured outputs. Right-size your models. Cache your static prompts.

Build leaner. Run faster. Spend less. Ship better agents.