2026-05-31 13:20 UTC · In-site rewrite · 5 min read · Updated: 2026-06-30 13:03 UTC

The Self-Evolving Model Router

A composable six-tier dispatch architecture that dynamically selects language models through policy enforcement, prompt-aware retrieval, rule-based filtering, predictive re-ranking, contextual bandit learning, and challenger exploration, with graceful degradation and continuous online/offline learning.

Source: Hacker News AI · Author: suhaselcuk

WHITE PAPER · v1.0 · May 2026 · VDF-WP-2026-002

A composable, six-tier dispatch architecture that turns model selection from a static configuration into a continuously-learning decision — combining policy enforcement, prompt-aware retrieval, rule-based filtering, predictive re-ranking, contextual bandits, and challenger exploration under a single, gracefully-degrading routing surface.

Read time 20 min

License CC BY 4.0

ABSTRACT

Enterprise dispatch of large language models has historically been a configuration decision: operators bind a model to a workload and live with the choice. Real fleets, however, are non-stationary. Provider quotas oscillate, latency drifts on shared cloud endpoints, capabilities evolve as new model families arrive weekly, and the cost-quality-energy frontier shifts under the operator's feet[15][13]. A static binding is therefore a slowly-failing decision, and the problem is not solved by adding an A/B test on top of a static dispatcher — it is solved by treating routing itself as a non-stationary contextual decision.

This white paper documents how VDF AI Networks operationalises that view. Every request flows through a six-tier dispatcher: policy enforcement, prompt-aware retrieval shortlisting, rule-based filtering with a multi-objective scorer, predictive re-ranking on per-arm history, contextual-bandit selection under a disjoint-per-arm LinUCB learner[2], and challenger exploration that dual-routes a small fraction of traffic for live preference learning. Each tier is independently feature-gated and degrades to the next-simpler strategy when its signal is unavailable. The composition, not any single tier, is the contribution.

The router is self-evolving in three coupled senses. First, online learning: every completed request becomes a reward observation that updates the chosen arm via a rank-one Sherman–Morrison update[10]. Second, failure feedback: failures are folded back as a bounded penalty rather than dropped. Third, offline retraining: an offline trainer batches the run vault to re-derive priors that are atomically swapped into the live policy. We describe the design parameters, the graceful-degradation envelope, and the position of the work relative to the recent cost-quality routing literature. The paper is a design account and deliberately avoids over-claiming measured outcomes.

Keywords contextual bandits · LinUCB · model routing · disjoint per-arm learning · prompt-embedding retrieval · multi-objective scoring · online/offline learning duality · LLM serving · graceful degradation · policy-bound dispatch

AT A GLANCE

Six numbers that anchor the paper

Decision tiers

6

independently feature-gated layers in the dispatch stack

Context dim

sparse hashed features encoded per request

Exploration

α = 0.8

UCB confidence bonus on the contextual bandit

Window

~200 obs

per-model rolling latency and throughput window

Challenger

~2%

of traffic dual-routed for live preference learning

Failure reward

0.15

bounded penalty fed back to the bandit on timeout or error

FIGURE 1

The six-tier router — per-request lifecycle

Inputs arrive from the workflow specification on the left and exit as a routing decision and an ordered failover list on the right. Every tier is feature-gated; the dashed return loop depicts the online/offline learning duality that gives the router its name.

Fig. 1. Per-request routing lifecycle. Each tier is feature-gated and fails open to the next-simpler strategy when its signal is unavailable. The dashed return loop shows the online reward update and the offline retraining cycle that re-derives priors.

SECTION 1

Introduction & motivation

Three things change beneath an enterprise dispatcher in any given quarter. Provider quotas and rate limits drift, sometimes overnight; latency on shared cloud endpoints fluctuates with datacentre load and is correlated across tenants but invisible to any individual one; and the model catalog itself evolves: new families arrive, established ones are deprecated, and the price-quality frontier moves[15]. None of these changes is visible to a dispatcher that selects models by static configuration.

A buyer accepting this state of affairs typically responds in one of three ways: pin the safest model and pay the premium, pin the cheapest model and absorb the variance, or layer an offline A/B test on top of a static dispatcher and update the configuration by hand. None of the three scales. The first wastes capacity; the second wastes outcomes; the third turns the dispatcher into a manual rebalancing job. What is needed is a routing layer that treats the choice of model as a non-stationary contextual decision — one that absorbs the drift instead of papering over it.

The Self-Evolving Model Router is the dispatch tier of VDF AI Networks. It is designed around the observation that every routing decision is a bandit problem with a context vector and a stream of delayed, partial rewards[2][5]: the dispatcher chooses an arm, the runtime returns an outcome (a quality score, a latency, an error), and the policy must update to make better choices the next time the same context recurs. The router solves the bandit problem with a per-arm linear UCB learner inside a broader, gracefully-degrading envelope of five sibling tiers, all of which can shape, accept, or override the bandit's recommendation depending on what the system knows about the request.
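A minimal sketch of the per-arm linear UCB score described above, assuming each arm stores an inverse Gram matrix `a_inv` and a reward vector `b`. The helper names, the two-arm state, and the 2-dimensional context are illustrative and not taken from the router; the exploration weight 0.8 is the value the paper lists among its design parameters.

```python
import math

ALPHA = 0.8  # exploration weight quoted in the paper's design parameters


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ucb_score(a_inv, b, x, alpha=ALPHA):
    """Disjoint LinUCB score for one arm: theta.x + alpha * sqrt(x^T A^-1 x)."""
    theta = matvec(a_inv, b)                                 # ridge estimate A^-1 b
    mean = dot(theta, x)                                     # predicted reward
    width = math.sqrt(max(dot(x, matvec(a_inv, x)), 0.0))    # confidence width
    return mean + alpha * width


def choose_arm(arms, x, alpha=ALPHA):
    """Return the name of the arm with the highest upper confidence bound."""
    return max(
        arms,
        key=lambda name: ucb_score(arms[name]["a_inv"], arms[name]["b"], x, alpha),
    )


# Hypothetical state: "model-a" is well explored, "model-b" has never been tried.
arms = {
    "model-a": {"a_inv": [[0.1, 0.0], [0.0, 0.1]], "b": [5.0, 0.0]},
    "model-b": {"a_inv": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]},
}
context = [1.0, 0.0]
chosen = choose_arm(arms, context)
```

In this toy state the unexplored arm wins because its confidence width outweighs the explored arm's higher predicted reward; with `alpha=0.0` the choice reverts to pure exploitation.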

Scope and non-goals

This paper covers serving-time dispatch only. It does not propose a new bandit algorithm; the underlying linear UCB scheme is well-established[2][3]. The contribution is the composition — how policy, retrieval, multi-objective scoring, predictive re-ranking, online bandit learning, and challenger exploration are layered into one dispatcher with a clear graceful-degradation envelope and an online/offline learning duality. Where the design borrows from the literature we cite rather than re-derive. Empirical numbers beyond the design parameters are deliberately out of scope; the paper is a documented engineering pattern, not a benchmark.

SECTION 2

Background & related work

The theoretical backbone is the contextual multi-armed bandit. Auer[1] introduced the upper-confidence-bound family for the stochastic bandit; Li, Chu, Langford and Schapire[2] generalised it to the contextual case as LinUCB; Chu, Li, Reyzin and Schapire[3] gave the theoretical analysis of contextual bandits with linear payoffs. Agarwal et al.[4] established efficient algorithms for general contextual bandits, and the surveys of Lattimore and Szepesvári[5] and Slivkins[6] are the canonical references. The dispatcher's learning core is a faithful application of this line of work to a model-selection problem.

The exploration–exploitation literature offers two practical alternatives to UCB: Thompson sampling[8] and ε-greedy variants. Thompson sampling is attractive when posterior sampling is cheap and the reward distribution is well-modelled. UCB, by contrast, is deterministic given the recorded arm statistics, which makes it the better fit when telemetry is the primary debugging surface: every decision can be reproduced from those statistics, and that matters when an operator has to explain why one model was chosen over another. We chose UCB for its operational reproducibility, not for any sample-efficiency claim.

Within the LLM-routing literature, three lines of work are immediately relevant. FrugalGPT[13] frames routing as a cost-quality cascade; Hybrid LLM[14] formulates it as a query-router that switches between a strong and a weak model based on a difficulty estimator; RouteLLM[15] learns the router from preference data. All three are valuable, and all three concentrate the routing intelligence in a single learned function on a single objective axis. The contribution of the present paper is orthogonal: rather than propose another single-objective router, we describe a multi-objective, composable dispatcher in which the learned function is one tier among six.

A related but distinct body of work is the mixture-of-experts (MoE) literature. Shazeer et al.[11] and Fedus, Zoph and Shazeer[12] gate inside a model between expert sub-networks. Our dispatcher gates between independent models — different providers, different families, different deployment topologies. The two problems share a vocabulary (gating, routing, experts) but live at different levels of the stack.

The online-update mechanism — a rank-one Sherman–Morrison update to the per-arm regularised inverse Gram matrix — is the classical numerical-linear-algebra technique surveyed by Hager[10]. It permits exact incremental learning without recomputing matrix inverses, which is what makes the in-process online loop practical at serving rates.
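The rank-one update can be sketched as follows. This is an illustration under stated assumptions, not the router's code: the ridge prior is taken as the identity matrix, the function and variable names are hypothetical, and the failure reward of 0.15 is the bounded penalty quoted in the paper's design parameters.

```python
FAILURE_REWARD = 0.15  # bounded penalty the paper feeds back on timeout or error


def sherman_morrison_update(a_inv, b, x, reward):
    """Update one arm in place after observing `reward` for context `x`.

    Applies (A + x x^T)^-1 = A^-1 - (A^-1 x)(x^T A^-1) / (1 + x^T A^-1 x)
    and b <- b + reward * x. A is symmetric, so x^T A^-1 equals (A^-1 x)^T.
    """
    n = len(x)
    ax = [sum(a_inv[i][j] * x[j] for j in range(n)) for i in range(n)]  # A^-1 x
    denom = 1.0 + sum(x[i] * ax[i] for i in range(n))                    # 1 + x^T A^-1 x
    for i in range(n):
        for j in range(n):
            a_inv[i][j] -= ax[i] * ax[j] / denom
    for i in range(n):
        b[i] += reward * x[i]


# Assumed prior: A = I, so A^-1 = I, and b = 0.
a_inv = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
sherman_morrison_update(a_inv, b, [1.0, 2.0], reward=0.9)             # successful call
sherman_morrison_update(a_inv, b, [1.0, 0.0], reward=FAILURE_REWARD)  # timeout or error
```

Each update costs O(d²) in the context dimension d, against O(d³) for a fresh inversion, and the failed call still contributes an observation instead of being dropped.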

What is consistently absent from the prior literature is a routing-layer account that ties policy, capability, cost, latency, energy, and continuous learning into one composable dispatch with a documented graceful-degradation envelope. That gap is what this paper documents.

SECTION 3

System architecture overview

The router is an in-process library rather than a standalone microservice. It is invoked once per node per request inside the orchestration engine, returns a routing decision with an ordered candidate list, and observes the reward asynchronously after the runtime completes the call. Two persistence surfaces support it: an in-memory rolling latency window of approximately two hundred observations per model — thread-safe, cleared on restart, used for live p50, p95, time-to-first-token, throughput, and timeout-rate statistics — and a vault-backed bandit state that stores, for each arm, the regularised inverse Gram matrix, the running reward vector, an observation count, and a cumulative reward sum.
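A rolling window of this kind might look like the sketch below. The class name, method names, and the nearest-rank percentile rule are assumptions made for illustration; only the window size of roughly two hundred observations, the thread safety, and the reported statistics come from the description above.

```python
import threading
from collections import deque


class LatencyWindow:
    """Thread-safe rolling window of recent call outcomes for one model."""

    def __init__(self, size=200):
        self._obs = deque(maxlen=size)  # oldest observations fall off automatically
        self._lock = threading.Lock()

    def record(self, latency_ms, timed_out=False):
        """Append one observation; safe to call from multiple threads."""
        with self._lock:
            self._obs.append((float(latency_ms), bool(timed_out)))

    @staticmethod
    def _percentile(sorted_values, q):
        """Nearest-rank percentile for an integer q in 0..100."""
        rank = max(1, (q * len(sorted_values) + 99) // 100)  # integer ceiling
        return sorted_values[rank - 1]

    def snapshot(self):
        """Return summary statistics, or None when the window is empty."""
        with self._lock:
            obs = list(self._obs)  # copy so sorting happens outside the lock
        if not obs:
            return None            # no signal yet; the caller should fall back
        latencies = sorted(lat for lat, _ in obs)
        return {
            "count": len(obs),
            "p50_ms": self._percentile(latencies, 50),
            "p95_ms": self._percentile(latencies, 95),
            "timeout_rate": sum(1 for _, t in obs if t) / len(obs),
        }


window = LatencyWindow(size=200)
for i in range(1, 301):  # 300 calls; only the most recent 200 are retained
    window.record(latency_ms=i, timed_out=(i % 10 == 0))
stats = window.snapshot()
```

Returning `None` for an empty window gives a freshly restarted process an explicit "no signal" value, which is what lets a downstream tier degrade rather than act on stale or missing statistics.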

Hot-reload is a first-class capability. The orchestration engine reloads the bandit state from the vault on a configurable cadence (default approximately thirty seconds), so a fresh offline retrain can land in production without restarting workers. The latency window remains process-local and is rebuilt naturally from live traffic.
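One way to picture the reload cadence is a holder that swaps in a fresh state object once the interval has elapsed. This is a sketch under assumptions: the class name, the `loader` callable standing in for the vault read, the lazy reload-on-read strategy, and the choice to keep the old state when a reload fails are all hypothetical and not confirmed by the paper.

```python
import threading
import time


class HotReloadingState:
    """Holds bandit state and refreshes it from a loader on a fixed cadence."""

    def __init__(self, loader, interval_s=30.0, clock=time.monotonic):
        self._loader = loader        # hypothetical callable that reads the vault
        self._interval_s = interval_s
        self._clock = clock          # injectable so the cadence can be tested
        self._lock = threading.Lock()
        self._state = loader()
        self._loaded_at = clock()

    def get(self):
        """Return the current state, reloading first if the cadence has elapsed."""
        with self._lock:
            now = self._clock()
            if now - self._loaded_at >= self._interval_s:
                try:
                    fresh = self._loader()
                except Exception:
                    fresh = None     # vault unavailable: keep serving the old state
                if fresh is not None:
                    self._state = fresh  # one reference swap, never a partial state
                self._loaded_at = now
            return self._state


# Usage with a fake clock and a loader that yields two versions, then fails.
versions = iter([{"version": 1}, {"version": 2}])
fake_now = [0.0]
holder = HotReloadingState(lambda: next(versions), interval_s=30.0,
                           clock=lambda: fake_now[0])
```

Because readers only ever receive a whole state object, a retrained policy becomes visible in a single step and no worker restart is needed.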

Failover is enumerative, not re-routed. The router returns up to five ordered candidates per decision, and the engine walks the list until one succeeds. The ordering deliberately places provider-diverse alternates first, because correlated provider outages are the most expensive failure mode in production and only a change of provider escapes them. Same-family fallbacks come second, on the principle that a near-equivalent model in the same family incurs less context-switch cost than a complete provider change. The ordering is exposed in telemetry alongside the routing reason code.
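The ordering rule and the list walk can be sketched as below. This is one reading of the description, not the router's implementation: the `Candidate` fields, the model and provider names, and the assumption that `ranked` arrives sorted best-first are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    provider: str
    family: str
    score: float


def failover_order(ranked, max_candidates=5):
    """Order: primary, other providers, same family on the same provider, the rest."""
    if not ranked:
        return []
    primary, rest = ranked[0], ranked[1:]
    other_provider = [c for c in rest if c.provider != primary.provider]
    same_family = [c for c in rest
                   if c.provider == primary.provider and c.family == primary.family]
    leftover = [c for c in rest
                if c.provider == primary.provider and c.family != primary.family]
    return ([primary] + other_provider + same_family + leftover)[:max_candidates]


def dispatch(ordered, call):
    """Walk the list until one candidate succeeds; return (candidate, result)."""
    errors = []
    for cand in ordered:
        try:
            return cand, call(cand)
        except Exception as exc:  # any failure moves on to the next candidate
            errors.append((cand.name, repr(exc)))
    raise RuntimeError(f"all candidates failed: {errors}")


# Hypothetical ranking, best score first.
ranked = [
    Candidate("a-large", "prov-a", "a", 0.9),
    Candidate("a-small", "prov-a", "a", 0.8),
    Candidate("b-large", "prov-b", "b", 0.7),
    Candidate("c-large", "prov-c", "c", 0.6),
    Candidate("a-legacy", "prov-a", "a-old", 0.5),
    Candidate("d-large", "prov-d", "d", 0.4),
]
ordered = failover_order(ranked)
```

With this ranking a full outage at the primary's provider is escaped on the second attempt, while the same-family fallback still makes the five-candidate cut.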

The router carries no notion of session affinity, no warm-pool management, and no batching. It is a pure decision function with side effects only on its own bandit state. This is a deliberate architectural choice: every other concern (caching, batching, autoscaling) is owned by tiers above or below, and the dispatcher remains testable as a function of its inputs alone.

SECTION 4

Methodology — six tiers

Each subsection names one tier, describes the signal it consumes, and identifies how it degrades when that signal is unavailable. Graceful degradation is not a feature added late; it is the central design constraint that lets the dispatcher remain stable under simultaneously-failing dependencies.
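The fail-open behaviour can be illustrated with a small pipeline in which every tier is gated by a flag and is skipped when it has no usable signal. This is a sketch only: the tier functions, flag names, and candidate fields are hypothetical, and the policy tier is left out because it is the one tier that may halt a request.

```python
def run_tiers(request, candidates, tiers, flags):
    """Apply each enabled tier in order; skip any tier whose signal is unavailable."""
    applied = []
    current = list(candidates)
    for name, tier in tiers:
        if not flags.get(name, False):
            continue                  # feature gate is off
        try:
            result = tier(request, current)
        except Exception:
            result = None             # a failing dependency counts as "no signal"
        if result:                    # None or empty: keep the previous ordering
            current = result
            applied.append(name)
    return current, applied


def rule_filter(request, cands):
    """Hypothetical rule tier: drop candidates above the request's cost ceiling."""
    return [c for c in cands if c["cost"] <= request["max_cost"]]


def rerank_by_latency(request, cands):
    """Hypothetical predictive tier: prefer the lowest observed p95 latency."""
    return sorted(cands, key=lambda c: c["p95_ms"])


def bandit(request, cands):
    """Hypothetical bandit tier whose state has not been loaded yet."""
    raise LookupError("bandit state not loaded")


tiers = [("rules", rule_filter), ("rerank", rerank_by_latency), ("bandit", bandit)]
flags = {"rules": True, "rerank": True, "bandit": True}
candidates = [
    {"name": "m1", "cost": 3, "p95_ms": 900},
    {"name": "m2", "cost": 1, "p95_ms": 400},
    {"name": "m3", "cost": 9, "p95_ms": 200},
]
final, applied = run_tiers({"max_cost": 5}, candidates, tiers, flags)
```

The returned `applied` list plays the role of a routing reason code: it records which tiers actually shaped the decision, so a degraded decision remains explainable.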

Policy enforcement — the inviolable layer

Pinned models and regulated-domain allow-lists are evaluated before any scoring. A request that targets a regulated workload but cannot be served by an approved candidate halts with an explicit, machine-readable reason code; a soft mismatch (no candidate carries a requested capability) degrades with a logged relaxation rather than a silent failure. Policy is the only tier that can return an unrecoverable error.

Routing layer · Policy short-circuit