AI News HubLIVE
站内改写6 min read

Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon

SAAR (Session-Aware Agentic Routing) is a session-aware model selection policy in vLLM Semantic Router. It reduces model switches by 79.29%, eliminates 3,836 unsafe switches, and cuts estimated physical-model cost by 78.71% across 21,600 deterministic turns. In 2,896 live AMD ROCm requests, it preserves session continuity with 0 observed violations.

SourceHacker News AIAuthor: matt_d

Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also needs to know when switching models would break the session.

This post introduces Session-Aware Agentic Routing (SAAR), a session-aware model selection policy in vLLM Semantic Router. SAAR keeps semantic routing, but adds router-owned session memory, hard locks around tool loops and non-portable provider state, safe reset boundaries, prefix-cache-aware switch pricing, and replayable traces.

Across 21,600 deterministic turns, SAAR cuts model switches by 79.29%, eliminates 3,836 unsafe switches, and reduces estimated physical-model cost by 78.71%. Across 2,896 live AMD ROCm requests, it preserves session continuity with 0 observed violations.

Figure 1: Long-horizon agents need routing decisions that understand the session trajectory, not only the latest prompt.

From Prompt Routing To Session Routing

vLLM Semantic Router started from a simple systems observation: not every request should take the same path through an inference stack. A short factual question, a security-sensitive prompt, a multimodal request, a hard reasoning task, and a domain-specific query may all deserve different treatment.

The first generation of that idea was prompt routing. The router extracted signals from the current request, matched a routing decision, and selected an appropriate path. Iris made those signals composable. Athena made the router more strategic by expanding model selection, memory, replay, long-context signals, multimodal primitives, and AMD ROCm deployment paths.

Agents change the unit of routing again.

A coding or research agent is not one prompt. It is a session. It plans, calls tools, receives tool outputs, edits files, runs tests, recovers from errors, pauses, resumes, and often sends very short follow-up messages such as "continue", "fix it", "run that again", or "use the previous result." Those turns are meaningful only because of the trajectory that came before them.

That is why this milestone matters for Semantic Router. The router is no longer answering only:

Which model should handle this request?

For agent traffic, the router also has to answer:

Is it safe to switch models inside this session right now?

That second question is what SAAR is designed to handle.

Why Single-Turn Routing Breaks Down For Agents

Single-turn routing can be locally correct and still be wrong for the session.

Consider a typical tool-using agent loop:

TurnWhat the client sendsWhat a prompt router seesWhat a session router must remember

1"Refactor this module and run the tests."A coding taskThe session has started on a physical model

2The model emits a tool callA model responseThe next tool result belongs to the same model

3The client sends the tool resultA terse observationThe model that asked for the tool should receive the result

4The user says "fix the failing case"A short follow-upThe instruction depends on prior code, test output, and routing state

5The session idles and resumes laterA new short messageThe router can reconsider whether the old model is still worth holding

The latest message alone does not contain enough information. A prompt router may decide that the tool result looks cheap and send it to a smaller model. It may see a generic "continue" and re-run the normal selector. It may miss that provider-managed continuation state belongs to one physical backend. It may discard a warm prefix cache for a frontier model because the current message is short.

Each of those mistakes has a different failure mode:

A tool result can go to a model that did not make the tool call.

A non-portable continuation id can be sent to the wrong physical backend.

A long, warm session can lose prefix locality and become unnecessarily expensive.

A logical model such as auto can become hard to debug because users no longer know which physical model actually served the turn.

The important point is not that agents should never switch models. They should. A good router should still move from a cheap model to a stronger model when the task becomes harder, and it should move back when the session reaches a safe boundary. The problem is that the router needs session context to know which moments are safe.

The SAAR Design

SAAR keeps the existing Semantic Router decision pipeline. Signals are still extracted from the request, decisions are still matched, and model-selection algorithms still rank candidate models inside a matched decision.

SAAR adds a session-control layer around that result.

Figure 2: SAAR combines router memory, hard locks, reset boundaries, switch economics, and replayable traces before selecting a physical model.

There are five pieces:

PieceWhat it stores or decidesWhy it matters

Router memoryLast physical model, matched decision, phase, switch count, idle time, cache evidence, and replay metadataGives the router session context without becoming application memory

Hard locksPrevent switching during active tool loops or non-portable provider-managed statePreserves correctness before optimizing cost or quality

Reset boundariesAllow reselection after idle timeout or decision driftPrevents session-aware routing from degrading into sticky sessions

Switch economicsPrices handoff cost, switch history, remaining-turn priors, and prefix-cache checkoutMakes switching asymmetric across model tiers and session lengths

Replay tracesRecords why the router stayed, switched, or refused to switchMakes a logical model such as auto inspectable

This is a model-selection policy, not an endpoint load balancer. Semantic Router can choose a model or cluster through the gateway contract. Endpoint membership, health checks, and load balancing inside a cluster remain infrastructure responsibilities.

The Most Important Rule: Sometimes The Router Must Not Switch

The safest model switch is not always the one with the best score on the latest prompt. For agent traffic, some turns are continuity-constrained.

Figure 3: Tool loops and provider-managed continuation state are hard continuity constraints; idle and decision-drift boundaries permit safe reselection.

SAAR treats two cases as hard locks:

Tool-loop continuity. If a physical model asked for a tool call, the tool result should return to that same physical model. The follow-up observation is not a fresh prompt; it is part of a local execution loop.

Provider-managed state. If the request carries non-portable continuation state, such as a response identifier that belongs to one backend, SAAR holds the previous physical model instead of silently moving the state elsewhere.

These rules are intentionally stronger than cost rules. If a switch is unsafe, the router should not "buy" its way out with a cheaper model.

SAAR also defines the opposite boundary: when the router may switch again. Idle timeout and decision drift reopen the selection. If an agent pauses long enough, the value of continuity decays. If the matched decision changes because the user moved from code editing to synthesis or from retrieval to debugging, the old model choice should not stick forever.

This distinction is the heart of session-aware agentic routing:

SituationSAAR behaviorReason

Tool call is waiting for a tool resultHold the previous physical modelThe tool result belongs to that model's local reasoning loop

Request carries non-portable provider stateHold the previous physical modelThe state may not be valid on another backend

Session has idled past the configured boundaryAllow reselectionContinuity pressure has decayed

Matched routing decision changesAllow reselectionThe task shape changed

Session is long and warm on an expensive modelRaise the switch thresholdPrefix locality is valuable

Cheap short retry on a small modelLower the switch thresholdCheckout cost is small

Router Memory Is Not User Memory

The phrase "router memory" can be misleading, so the boundary is important.

SAAR memory is not conversation memory, retrieval memory, or user profile memory. It does not summarize the conversation and it does not try to remember facts for the model. Its job is narrower: keep enough routing state to make the next model-selection decision safe and explainable.

For each session, the router tracks facts such as:

the last physical model selected behind the logical model;

the last matched routing decision;

whether the session is in a normal, tool-loop, provider-state, idle-reset, or drift-reset phase;

how many recent switches happened;

the latest context length and cache evidence;

a replay id that links the response back to the router's decision trace.

That scope keeps the system operationally useful without turning the router into a second agent memory layer. Application memory should remain in the application. Retrieval memory should remain in the retrieval stack. SAAR memory exists only to make routing across turns coherent.

Prefix Cache Makes Model Switching Asymmetric

For long agent sessions, model switching is not just a quality decision. It is also an input-side systems decision.

Figure 4: The same switch has a different cost depending on model tier, session length, and physical prefix reuse.

A short retry on a cheap model and a 40-turn warm session on a frontier model should not be treated the same way. The latter has accumulated a valuable prefix. Switching away from it may require the next physical model to pay a much larger input cost even if the visible user message is short.

SAAR therefore prices a cached-input checkout delta: the gap between normal prompt input price and cached-input price for the physical model under consideration. The longer and more expensive the session, the stricter the policy becomes about discarding prefix locality.

This also clarifies cached-token accounting for a routed logical model. If the user calls auto, the router may map that logical name to different physical models over time. A cache hit reported by one backend is physical evidence for that backend. It is not automatically transferable to another backend. SAAR keeps backend-reported cached tokens separate from router-estimated reuse, and it does not rewrite upstream usage fields.

That separation is useful operationally. Operators can still inspect physical cache behavior while the router uses its own memory to decide whether switching is worth the checkout cost.

How A Request Moves Through SAAR

The serving path stays familiar. Clients send requests to the OpenAI-compatible gateway, usually with a logical model name such as auto. To enable session-aware routing, they also send a stable session identifier such as x-session-id.

SAAR then handles each turn in this order:

Read the current request, session id, tool-call context, provider-state markers, and candidate model set.

Run the normal Semantic Router signal and decision pipeline.

Produce a base model-selection result from the configured method, such as hybrid scoring.

Load the previous session routing state from router memory.

Apply hard locks for tool loops and provider-managed state.

Check idle timeout and decision drift boundaries.

Adjust switch scores using prefix-cache checkout cost and switch history.

Select the physical model and emit diagnostics.

Update router memory and write a replay trace.

The configuration lives inside a routing decision's model-selection algorithm:

routing: decisions:

  • name: agentic_routing

modelRefs:

  • model: qwen3-8b
  • model: qwen3-32b

algorithm: type: session_aware session_aware: base_method: hybrid idle_timeout_seconds: 300 tool_loop_hard_lock: true context_portability_hard_lock: true decision_drift_reset: true prefix_cache_weight: 0.20 switch_history_weight: 0.04

The values are intentionally policy knobs, not one-size-fits-all constants. A customer-service assistant with short sessions may use a more permissive idle

[truncated for AI cost control]