Model routing has changed shape in the last few months, but not as one clean wave. Research routers now consume internal activations, prefix-cache state, step-level confidence, and full trajectory state. Shipped systems mostly expose request-level routers, fallbacks, static role assignment, and serving schedulers. The useful question is therefore not whether “routing” won; it is which layer is routing, which objective it optimizes, and whether that evidence transfers to long-horizon agents.
This is a survey of that work, plus the production systems and vendor primitives that shipped alongside it. The aim is to make the literature legible without collapsing different claims into one frontier: model-selection routers are not cache schedulers, cascades are not pre-generation routers, and agent role dispatch is not the same as dynamic per-step model switching. At the end we collect a short set of candidate experiments — hypotheses worth trying, not commitments — drawn from the most credible ideas in the field.
Routers used to look at the query text and almost nothing else. That is no longer the strongest research signal for model selection. Recent work routes on internal model activations1, on the prefix-cache state of the serving stack2, and on confidence trajectories that emerge between reasoning steps3. NVIDIA's prefill-activations router, for instance, closes 45.58% of the gap between the strongest standalone model and an oracle while saving 74.31% over always-call-the-largest1. Query-only routing remains the cheap baseline — and kNN-over-past-performance is a surprisingly strong one — but it is no longer where the frontier evidence sits.
The unit of decision has shrunk in research. Where a 2025 router typically picked a model per query, recent papers pick per reasoning step or even per token. TRIM3 reports 5–6× cost-efficiency gains on MATH-500 and AIME by routing between the steps of a single chain-of-thought. ConfSpec4 stacks a step-level cascade on top of token-level speculative decoding for 2.24× end-to-end speedup with no quality drop. KAD36 goes a level further, framing per-token deferral as a 0–1 knapsack. Production has moved more slowly: most deployed routers still decide per request, per session, or per static agent role.
And the field is openly split on architecture. Centralized capability matrices — Topaz5, RouteProfile6, Dimension-Direct Routing14 — landed in the same few weeks as DiSRouter7, which dispenses with the central matrix and instead trains each model to assess its own competence. Both camps have credible papers, and the dispute is not yet resolved.
Vocabulary note. The word router covers anything that picks which model invocation handles a query — pre-generation routers, cascades, capability-matrix dispatchers, serving schedulers, fallback gateways, and agent-level role selectors. Authors disagree on the boundaries. This survey keeps the terms, but separates the evidence by layer whenever the distinction changes what can be concluded.
The rest of the post walks the literature stream by stream — signals, granularity, learned routers, cascades, capability matrices, self-routing, the cache layer, multi-objective work, agent dispatch, the vendor layer — and closes with experiments that force each claim onto a real workload, with terminal success, retry cost, cache behavior, and drift measured together.
Model routing is not a single research field. It is seven adjacent threads — learned routers, cascading inference, capability matrices, self-routing, cache-aware scheduling, agent-level dispatch, and vendor gateways — that have historically published in different venues with different vocabularies. The first job of this section is to name those threads and place the systems we cite into them.
The Trinity / Huawei Dynamic Routing Survey8 (2026-02-23, revised 2026-04-21) is the first cross-stream synthesis to land in this window. It introduces a six-paradigm taxonomy and a three-dimension framework (when, what, and how to route) and is cited by most of the work that follows.
The threads also overlap by mechanism, not only by topic. RouteProfile6 and DiSRouter7 appear within a week of each other and argue against each other directly — one defending matrix profiles, the other defending self-confidence as the primary signal. Dynamo 1.02 and llm-d44 both consume vLLM's KVEvents API, which was a serving-systems detail eighteen months ago and is a routing signal now. TRIM3's step-level routing is the result "Is Escalation Worth It?"13 is responding to. The figure below places each system in its primary stream and draws an edge wherever two of them share a signal, a cache, or a benchmark.
Before naming the routers, we should name what they read. A router is a function from signal to model selection. Different signals carry different amounts of information and have different acquisition costs. The 2026-era routers organize themselves not by their architecture but by which signal they trust.
Seven signals matter. They form a rough hierarchy of information richness × acquisition cost:

1. Query text: always available and nearly free, but the weakest discriminator at the frontier.
2. Query embedding plus past per-model performance (the kNN baseline).
3. Internal activations read during the prefill pass.
4. Draft confidence: token log-probabilities of a cheap generation.
5. Step-level confidence trajectories between reasoning steps.
6. Full trajectory state: tool results, interaction history, remaining budget.
7. Prefix-cache and serving-stack state: KV block location, queue depth, worker load.
Signals 3 through 7 are stronger than signals 1 and 2 in their respective regimes, but only after the objective is fixed. Cache state is a cost and latency signal, not a correctness signal. A confidence trajectory can be useful for escalation, but it is not the same as a calibrated probability of final task success. Query text alone, in May 2026, is a baseline rather than a competitive input for frontier model selection.
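Stated as a type, the abstraction underneath every system in this survey is small. A minimal sketch, with class and field names chosen here for illustration rather than taken from any one paper; the table that follows fills in the objective and failure-mode columns per layer:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RoutingSignals:
    # Signal numbering follows the hierarchy above; all fields illustrative.
    query_text: str                              # 1: always present, cheapest
    query_embedding: Optional[list] = None       # 2: one embedding call away
    prefill_activations: Optional[list] = None   # 3: requires model internals
    draft_logprob: Optional[float] = None        # 4: paid for by a cheap generation
    step_confidences: Optional[list] = None      # 5: exists only mid-trajectory
    trajectory_state: Optional[dict] = None      # 6: agent loops only
    kv_overlap_fraction: Optional[float] = None  # 7: cost/latency, not correctness

@dataclass
class RoutingDecision:
    model: str              # which model invocation handles the query
    reason: str             # the auditable trace the compliance section below requires
    estimated_cost_usd: float

Router = Callable[[RoutingSignals], RoutingDecision]
```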
| Router layer | Objective | Primary signal | Failure mode |
|---|---|---|---|
| Query-level model router | quality per dollar | query text, embedding, capability vector | OOD brittleness; public benchmark mismatch |
| Agent-step router | terminal success per dollar | trajectory state, tool results, remaining budget | early cheap mistakes poison later steps |
| Cascade / verifier | avoid unnecessary strong calls | draft answer plus confidence or verifier score | generation tax; verifier false negatives |
| Cache-aware scheduler | TTFT, goodput, serving cost | prefix hash, KV location, queue depth | quality ignored unless combined with model routing |
| Gateway / fallback | availability, policy, compliance | provider health, region, tenant rules | hidden model changes alter quality and style |
| Self-router | decentralized competence selection | self-assessment, logprob, activations | miscalibration and confident hallucination |
The EU AI Act's general-purpose-model transparency obligations are in force, and US enterprise procurement explicitly asks for region-pinned audit logs. The Digital Omnibus of 2026-05-07 pushed the Annex III high-risk-system deadline out to 2026-12-02, but the GPAI obligations (Art. 53, 55) have been live since 2025-08-02 and are what most enterprise routing decisions trip on firstLA-21. Compliance needs a single audit trail. The gateway is the only component that sees every model invocation, every cache hit, every fallback hop — so it should own that log. Otherwise auditors are joining logs across three vendors and two clouds.
The Act is not just a logging requirement; specific articles directly constrain which model can run where, on whose data, with what disclosure. The mapping the gateway has to enforce:
| Article | Obligation | Routing decision it constrains |
|---|---|---|
| Art. 9 | Risk management lifecycle for high-risk systems | Provider must have documented risk governance — narrows the eligible model pool per workload class. |
| Art. 10 | Data governance, documented training-data lineage, prohibited-data exclusion | Model selection must exclude providers without published training-data summaries. |
| Art. 13 | Transparency to deployers, instructions for use | Routing response must carry model card, version, and provider identity through to the deployer. |
| Art. 14 | Human oversight, override, intervention | Router must expose abstain/escalate as typed outputs and route to a model that supports stop/pause tokens. |
| Art. 15 | Accuracy, robustness, cybersecurity per NIST baseline | Disallow models below the workload's accuracy floor; enforce non-CLOUD-Act geography for sensitive workloads via fallback chains that cannot cross zones. |
| Art. 50 | Disclosure of AI interaction; marking of synthetic content | Inject AI-system identification at response time; flag generated content before it reaches the end user. |
| Art. 53 | GPAI provider documentation obligations (Annex XI) | Route only to GPAI providers with published model cards, training summaries, evaluation results. |
| Art. 55 | Systemic-risk GPAI: adversarial testing, 24-hour incident reporting | Gate use of systemic-risk GPAI on the organisation's ability to meet the 24-hour reporting SLA. |
Several vendor primitives gate region pinning today. Azure Foundry exposes routing.models as a subset selector for compliance and data residency93, and Foundry's data-zone guarantees plus zero-prompt-storage policy25 are Microsoft's anchor against GDPR-extension auditors. AWS Bedrock ships Cross-Region Inference with SCP-policy enforcement98, so the data-zone boundary is enforced at the Control-Tower layer rather than asserted by the application. Vertex AI applies VPC Service Controls to context caches92, which means the residency of a cached prefix is bounded by the same network perimeter as the model endpoint. NVIDIA Dynamo exposes prefix_id for KV-cache pinning and latency_sensitivity as a routing hint for SLO-band selection97. Anthropic's cache-isolation granularity — per-API-key versus per-organization — is not disclosed in current docs, which leaves a common procurement question unanswered, and Anthropic-direct has no documented EU-resident inference path as of 2026-05-08; Anthropic's announced EU Sovereign offering is on the roadmap, not in production. Among gateways, Requesty (2026-05) and Kong AI Gateway (2025-11) are the only two with published article-by-article AI Act mappings114,115; AWS, Azure, Vertex, OpenAI ship building blocks (region pinning, guardrails, zero retention) but leave the article-to-decision mapping as homework for the integrator.
A logging convention that the Pharos Production 2026 synthesis66 finds across most viable production systems: every dispatch carries a small fixed set of fields — request id, routed model, routed region, cache-hit bit, routing-signal value, cost, latency — and the log lands in a per-tenant data zone matching the user's residency contract rather than a global pool. The point Pharos draws out from 25-plus production systems is that compliance, routing, and observability only converge cleanly at the gateway layer; anything that tries to bolt them together higher up tends to grow disagreements between vendors and clouds.
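A minimal sketch of that record as code, assuming nothing beyond the fields Pharos lists. The writer callback and exact field names are illustrative:

```python
import json, time, uuid

def log_dispatch(model: str, region: str, cache_hit: bool,
                 routing_signal: float, cost_usd: float, latency_ms: float,
                 tenant_zone_writer) -> str:
    """Fixed-field per-dispatch record; the convention is the field set, not the schema."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "routed_model": model,             # e.g. "claude-sonnet-4.6"
        "routed_region": region,           # must match the tenant's residency contract
        "cache_hit": cache_hit,
        "routing_signal": routing_signal,  # the value the router actually acted on
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    }
    # Land in the per-tenant data zone, never a global pool.
    tenant_zone_writer(json.dumps(record))
    return record["request_id"]
```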
Aside — the ngrok 2026 verdict62 is the cleanest summary of where this lands: agentic workflows in regulated industries make the gateway non-negotiable. By the time an agent has fired 20–50 LLM calls per user action, there is no version of "we will figure out the audit log later" that survives a compliance review.
Through 2025 the dominant question was which classifier predicts which model wins. By April 2026 that's the wrong question. RouteLMT11 (ACL 2026 Industry, 2026-04-24) and TRACER34 (2026-04-16) land within days of each other and arrive independently at the same conclusion: predict the marginal gain Δ = Q_large − Q_small, not the absolute quality Q_large. RouteLMT does it with an in-model LoRA adapter probing the small model's prompt-token representations — no external classifier, no hypothesis decoding. TRACER does it with a lightweight surrogate trained on production logs and a parity gate at threshold α: the surrogate answers only when its agreement with the teacher exceeds α; otherwise the query escalates. On a 77-class intent benchmark the surrogate covers 83–100% of traffic, and on one 150-class workload it fully replaces the teacher.
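Both decision rules are simple enough to state directly. A sketch of the two shapes, with every function name and threshold invented here for illustration:

```python
def route_with_parity_gate(query, surrogate, teacher, agreement_estimator, alpha=0.9):
    """TRACER-shaped rule (our naming): the surrogate answers only when its
    estimated agreement with the teacher clears the parity threshold alpha."""
    p_agree = agreement_estimator(query)   # lightweight model trained on production logs
    if p_agree >= alpha:
        return surrogate(query)            # cheap path
    return teacher(query)                  # escalate

def route_on_marginal_gain(query, small, large, delta_predictor, tau=0.05):
    """RouteLMT-shaped rule (our naming): call the large model only when the
    predicted delta = Q_large - Q_small exceeds a tuned threshold tau."""
    if delta_predictor(query) > tau:
        return large(query)
    return small(query)
```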
The other learned-router story is NVIDIA's Prefill Activations Router1, posted on 2026-03-21. Rather than reading the query text, it reads the model's internal hidden states during the prefill pass. The Encoder-Target Decoupling trick lets an open-weight encoder (Llama-3) produce routing features for a closed-source target (GPT-5); a SharedTrunkNet MLP predicts correctness probabilities across all candidates simultaneously. Headlines: 45.58% of the gap to oracle closed, 74.31% cost saved versus always-call-the-largest. The implication is unsettling and, in retrospect, obvious: the best signal for routing was inside the model the whole time.
Four more pieces fill in the picture. MTRouter27 (ACL 2026, 2026-04-26) is the first router to encode full multi-turn interaction history as the routing signal, surpassing GPT-5 quality on ScienceWorld at 58.7% lower cost (and 43.4% on HLE). DialRouter31 uses Monte Carlo Tree Search over dialogue branches to discover an emergent specialization by turn position — some models are better at early-turn context-setting, others at mid-dialogue reasoning, others at final-turn summarization, a structure single-turn routers cannot see. BayesianRouter32 fuses an offline Bradley-Terry head with online Thompson sampling for reward-model routing, and ParetoBandit33 closes a long-standing gap by enforcing a dollar-denominated cost ceiling in closed loop — budget compliance within 0.4% across a 530× cost range, with a cold-started new model integrated in roughly 142 steps.
The most-cited industry pilot is RouteNLP35 (ACL 2026 Industry, 2026-04-26): an 8-week deployment at a customer-service division processing ~5,000 queries / day. The novel piece is a distillation-routing co-optimization loop — failure clusters from the router feed targeted distillation of the cheap models, which then feeds the next router-retraining round. Result: 58% cost cut, 91% response acceptance, and p99 latency from 1,847 ms to 387 ms. Routing is no longer a one-time engineering decision; it is a continuous loop coupled to the retraining pipeline.
The negative controls matter as much as the wins. LLMRouterBench110 reports that top routers can cut cost at matched best-single performance, but also that several recent and commercial routers fail to beat a simple best-single baseline under a unified harness. Unsolvability Ceiling108 shows why: judge noise, truncation, parse failures, and unsolvable-label artifacts can manufacture routing opportunity. A routing number is only comparable when the model pool, task distribution, judge, token budget, and retry policy are the same.
The embarrassing counter-finding belongs to an EACL 2026 short paper9: a plain kNN average over past per-model performance, indexed by query embedding, matches the sophisticated learned routers across SPROUT, RouterBench, LiveBench, BigGenBench, and EmbedLLM with 1% of the training data, out of distribution. Heuristics aren't dead; they were under-leveraged.
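The baseline is worth having as code because it is so small. A sketch of the pattern as we read it, not the paper's exact recipe:

```python
import numpy as np

def knn_route(query_emb: np.ndarray,
              past_embs: np.ndarray,    # (N, d) embeddings of past queries
              past_scores: np.ndarray,  # (N, M) observed quality per model
              costs: np.ndarray,        # (M,) price per call
              k: int = 32,
              lam: float = 0.0) -> int:
    """kNN-over-past-performance baseline: average each model's observed
    quality over the k nearest past queries, then pick the best
    quality-minus-cost tradeoff (lam=0 recovers pure quality routing)."""
    sims = past_embs @ query_emb / (
        np.linalg.norm(past_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    nearest = np.argsort(-sims)[:k]
    expected_quality = past_scores[nearest].mean(axis=0)  # (M,)
    return int(np.argmax(expected_quality - lam * costs))
```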
We do not need a more complex model. We need to predict the right quantity. — paraphrase, RouteLMT, ACL 2026
Khayyam's five-question framework87 — logic, data, time, explainability, resource — is the cleanest way to decide whether a router should be rules-based or ML-based, and the empirical claim that anchors it is direct: most product teams at sub-100K daily requests hit the barrier at the data and resource questions before they hit any architectural one. Heuristics are not a placeholder; for that regime they are the answer. The point pairs naturally with the EACL kNN result9: most teams should keep routing rule-based or kNN-based until scale and labeled data justify training anything more elaborate.
The reference implementation that comes up most often is the LiteLLM Complexity Router pattern. Four tiers — SIMPLE → MEDIUM → COMPLEX → REASONING — with under 1 ms of routing overhead and no external dependency. The natural comparison is the Semantic Auto Router, which depends on an embedding API and pays 100 to 500 ms per decision for the round-trip. FIG. 15 lays out the latency cost across mechanisms. No learned-router architecture currently comes within an order of magnitude of the complexity router's decision latency, so the real tradeoff is whether per-query semantic routing is worth the embedding cost — and below roughly 100K daily requests, it usually isn't.
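A rules-only tier classifier of this shape fits in a dozen lines, which is the point. The tier names are from the pattern; the heuristics and model mapping below are illustrative stand-ins, not LiteLLM's implementation:

```python
import re

TIER_MODELS = {  # illustrative mapping, not LiteLLM's
    "SIMPLE": "small-fast", "MEDIUM": "mid-tier",
    "COMPLEX": "large", "REASONING": "reasoning-tuned",
}

def complexity_tier(prompt: str) -> str:
    """Rules-only tiering: no embedding call, microseconds per decision."""
    n_tokens = len(prompt.split())  # crude token-count proxy
    wants_reasoning = bool(re.search(r"\b(prove|derive|step[- ]by[- ]step)\b", prompt, re.I))
    has_code = "```" in prompt or bool(re.search(r"\bdef |class |SELECT\b", prompt))
    if wants_reasoning:
        return "REASONING"
    if has_code or n_tokens > 800:
        return "COMPLEX"
    if n_tokens > 150:
        return "MEDIUM"
    return "SIMPLE"

def route(prompt: str) -> str:
    return TIER_MODELS[complexity_tier(prompt)]
```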
One counter-result worth knowing about is the GIL ceiling. LiteLLM degrades at around 300 to 500 RPS under sustained load — Python's GIL bottleneck — while Bifrost, written in Go, reaches roughly 50× lower gateway latency at 5,000 RPS (11 µs vs hundreds of µs)68. This isn't a critique of learned routing; it's a fact about the runtime of the gateway. Routing-decision cost depends on the implementation language as much as the algorithm.
The Portkey 2026 in-prod aggregate67 (vendor-published, so directional rather than authoritative) reports multi-LLM team adoption jumping from 23% to 40% in ten months across 650-plus teams and 2T-plus tokens, with semantic caching saving an average of ~38% on LLM cost and average tokens per request quadrupling. Read directionally, that's the field shifting from "pick one model" to "route between several" inside a year. The dominant operational pattern at sub-100K-RPM tenants is still rule-based dispatch with a semantic cache underneath, not learned-router-as-a-service.
Aside — the "five to ten simple rules cover 80% of routing needs" claim is real (LogRocket field practitioners; absorbed into ref-87) but it has a sharp edge: that 20% is where the bandit, the kNN baseline, and the capability matrix earn their keep. The complexity router is the floor, not the ceiling.
The dominant production routing unit is still usually the request or session; the research frontier is the step. TRIM3 (LinkedIn / CMU, ICLR 2026) routes between the steps of a single chain-of-thought trace using a process reward model that scores per-step correctness confidence; only the steps likely to derail the solution go to a larger model. The simplest threshold variant already yields 5× cost efficiency on MATH-500 over query-level routing at matched accuracy, and up to 6× on AIME. The TRIM authors phrase the insight cleanly: "expensive calls confined to precisely those steps where stronger models prevent cascading errors."
ConfSpec4 (arXiv:2602.18447) exploits a different asymmetry: generating a correct step is hard, but verifying one is a constrained discriminative task that the small draft model handles well within its competence range. ConfSpec produces both the step and a confidence score for it; high-confidence steps are accepted, low-confidence steps are escalated to the target model. The framework is explicitly orthogonal to token-level speculative decoding, and the two compose multiplicatively for 2.24× end-to-end speedup at no quality loss — a rare stackable result.
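The control loop TRIM and ConfSpec share reduces to a few lines. A sketch under our own simplifications: the confidence source is a PRM score in TRIM and draft self-confidence in ConfSpec, and the names and stop convention here are ours:

```python
def solve_stepwise(problem, draft_model, target_model, confidence,
                   max_steps=32, threshold=0.8):
    """Step-level cascade sketch: the draft model proposes each reasoning step;
    low-confidence steps are regenerated by the target model before the
    trace continues, confining expensive calls to the risky steps."""
    trace = [problem]
    for _ in range(max_steps):
        step = draft_model(trace)                # propose the next step cheaply
        if confidence(trace, step) < threshold:  # PRM score or draft self-confidence
            step = target_model(trace)           # escalate just this step
        trace.append(step)
        if step.strip().startswith("ANSWER:"):   # illustrative stop convention
            break
    return trace
```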
CascadeDebate12 (2026-04-14) inserts a third option at the escalation boundary — deliberate. When the confidence router flags uncertainty, instead of immediately escalating it activates a lightweight ensemble of agents at the same scale to debate; only failed deliberation triggers escalation. Across five benchmarks, CascadeDebate reports +26.75% accuracy over strong single-model cascades; an online threshold optimizer alone contributes another 20.98–52.33% relative gain over fixed-threshold policies. In practice this means static thresholds are brittle under production distribution shift.
The granularity goes finer in two directions. Confidence Leaps37 (EACL 2026) shows that confidence is non-monotonic: it stays flat and then spikes at a discrete moment of insight that's detectable by a token-entropy drop. The trace up to the leap transfers across model families as a prefix, which hints at a class of reasoning-transfer protocols nobody has built yet. KAD36 (Paris-Saclay / INSA Rennes, EACL 2026) frames per-token deferral as a 0–1 knapsack with primal and dual approximations; the dual gives an adaptive threshold that tightens or loosens with the budget's shadow price. Saguaro39 (Stanford / Princeton / Together AI, 2026-03-03) does an analogous trick at the token level on the speculation side, predicting the verification outcome during verification and pre-computing speculations for each anticipated result — 30% faster than optimized speculative-decoding baselines, up to 5× over autoregressive.
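The knapsack framing has a compact greedy approximation worth seeing. A sketch, not KAD's algorithm: the primal is approximated by a ratio-greedy pass, and the last admitted ratio plays the role of the dual's shadow-price threshold:

```python
def knapsack_deferral(gains, costs, budget):
    """Per-token deferral as a 0-1 knapsack (greedy sketch): defer the tokens
    with the best expected-gain-per-cost ratio until the escalation budget
    is spent. Assumes strictly positive costs."""
    order = sorted(range(len(gains)), key=lambda i: gains[i] / costs[i], reverse=True)
    chosen, spent, threshold = set(), 0.0, 0.0
    for i in order:
        if spent + costs[i] > budget:
            break
        chosen.add(i)
        spent += costs[i]
        threshold = gains[i] / costs[i]  # tightens as the budget binds
    return chosen, threshold             # threshold ~ the dual's shadow price
```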
A counter-paper landed on 2026-05-07, one day before this report goes out. "Is Escalation Worth It?"13 gives a decision-theoretic characterization of two-model cascades using constrained optimization and Lagrangian duality. The cost-quality frontier turns out to be piecewise concave on decreasing-benefit regions of the confidence support; for a pool of k models, the achievable frontier is the pointwise envelope of all C(k,2) pairwise cascades. Validated across MATH, MMLU, TriviaQA, SimpleQA, and LiveCodeBench with eight models from five providers, the result is sobering: a lightweight pre-generation router beats the best cascade policy on four of five datasets. The reason is structural — cascades pay the cheap model's generation cost before the escalation decision is made, and that generation tax is unavoidable in a cascade no matter how thresholds are tuned. Pre-generation routers bypass it.
Reading these alongside Routing, Cascades & User Choice38 (2026-02-10, U Ottawa / NVIDIA), which models routing as a Stackelberg game between provider and user (see §12 for the build-side implication), gives two papers with opposite normative conclusions but the same underlying observation: cascading isn't free.
Worth flagging: Diminishing Returns of Early-Exit40 (2026-03-24) argues that modern LLMs trained with improved recipes have less early-exit potential than older ones. Dense transformers retain more layer redundancy than MoE or SSM architectures, alignment training homogenizes later layers, and base-pretrained models above 20B still have early-exit headroom but their fine-tuned descendants do not. The 2026 frontier is later in the stack, not earlier.
Routing without a verifier is optimism with extra steps. — paraphrase, CascadeDebate, 2026-04-14
Capability matrices stopped being a thought experiment in April. Topaz5 (Georgia Tech, CHI 2026 HCXAI Spotlight, 2026-04-04) builds an explicit M[model × skill] matrix by synthesizing public benchmark performance across diverse tasks, and routes each sub-task in an agentic workflow to the matching specialist. The fact that this lands at CHI rather than NeurIPS is itself meaningful: interpretable routing is becoming a user-experience requirement rather than only an engineering one. Topaz's routing decisions produce full execution traces showing how skill-match scores were weighted against cost, with developer-facing natural-language explanations generated automatically — so an operator can audit why GPT-4o was chosen for step 3 but Haiku for step 7. The system also avoids the cold-start problem most matrix routers run into: it bootstraps from public benchmarks rather than learning a proprietary profile from interaction history, so it's deployable on day one without any labeled routing data.
RouteProfile6 (arXiv:2605.00180, 2026-04-30) treats LLM profiling as an independent research problem and formalises a four-axis design space: organisational form (flat vector vs. structured graph), representation type (discrete benchmark scores vs. dense learned embeddings), aggregation depth (domain-level vs. query-level signals), and learning configuration (fixed vs. trainable). Across three router families — classifier, preference-based, embedding-based — structured profiles (GNN-based) consistently win, query-level signals beat domain-level proxies, and trainable structured profiles generalize best when a novel LLM joins the pool mid-deployment. The reframing is what carries: before asking which router, RouteProfile argues you should ask which profile format.
Dimension-Direct Routing14 (Research Square, 2026-04-07) takes the strongest matrix-first position: a 12-model × 15-dimension capability matrix replaces the kNN-over-embeddings step entirely. The router predicts which of 15 dimensions the query primarily needs and then looks up the matrix; LLM-as-Judge across four quality dimensions reports +25.9% depth and +17.4% completeness over the embedding baseline. The paper is unusually honest about failure modes: semantic accumulation bias (the router over-weights the last few turns in long conversations and misidentifies the primary capability needed) and cross-domain routing instability (queries on the boundary of two dimensions trigger thrashing between specialists). Explicit matrices are explainable, but fragile — they break when the world adds an axis the matrix doesn't have.
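The lookup step itself is nearly a one-liner over the matrix, which is why the approach is so auditable. A sketch with our own naming, not the paper's code:

```python
import numpy as np

def dimension_direct_route(query_dim_probs: np.ndarray,  # (D,) predicted capability needs
                           M: np.ndarray,                # (models, D) capability matrix
                           costs: np.ndarray,            # (models,) price per call
                           lam: float = 0.0) -> int:
    """Matrix-lookup routing sketch: score each model by how well its
    capability row covers the predicted dimension mix, trade off against
    cost. Every score is inspectable, which is the explainability claim."""
    fit = M @ query_dim_probs  # (models,)
    return int(np.argmax(fit - lam * costs))
```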
The axis set itself has expanded in two directions. Route-To-Reason29 (USTC, ACM WWW 2026, 2026-04-12) is the first system to embed reasoning strategy — chain-of-thought, tool-use, few-shot, direct — as a first-class capability axis alongside model identity. RTR learns dense vectors of (model, strategy) pairs jointly and selects a strategy-model bundle, reporting 60% cost reduction by matching lightweight models with cheap strategies for simple queries. The embedding space surfaces something a flat model × benchmark matrix collapses: some models have narrow strategy "comfort zones" (excel at chain-of-thought, degrade with direct answers) while others are strategy-agnostic. SkillRouter30 (2026-03-23) addresses the dual problem at the tool layer. As agent skill registries grow past 80,000 entries, exposing all skills at inference becomes infeasible, but exposing only names and short descriptions hides the implementation details that turn out to be the routing signal. Hiding the full skill body costs 31 to 44 percentage points of routing accuracy across architectures; SkillRouter retrieves and reranks over the full skill body and hits 74.0% Hit@1 with 13× fewer parameters and 5.8× faster than the strongest base pipeline.
The orchestration-layer take on the same problem is GraphPlanner28 (UIUC, ICLR 2026, 2026-04-26), an RL-trained MDP that selects both an LLM backbone and an agent role (planner, executor, summarizer) at each step, over a heterogeneous graph (GARNet) that captures interaction memories across queries, agent instances, and responses. Reported numbers: +9.3% accuracy across 14 tasks, with GPU memory dropping from 186.26 GiB to 1.04 GiB.
The surveyed vendor primitives do not expose modality as a first-class routing knob. No major model host (Bedrock, Foundry, Vertex AI) offers a "this is a vision query, route differently" primitive at the model-selection layer: AWS Bedrock supports vision on Claude and Titan-Multimodal but doesn't surface modality-specific routing; OpenAI, Anthropic, and Google Gemini all serve vision through the same endpoint with no modal-aware dispatch. Gateway products like OpenRouter's video routing (§11) sit one level up — they pick a provider, not a model within a provider, which is a different layer of the routing stack. The only formal evaluation harness for routing across modalities is MMR-Bench (§18, LA-15), and that's pre-window. Modality today is an attribute the routing layer infers from the request payload, not a knob the model host exposes.
On document-conversion workloads (PDF→Markdown is the canonical example), Dynamo's multimodal worker selection15 picks the worker holding the highest cache overlap including image content blocks, so consecutive pages of the same document route to the same worker — the cache-locality argument from §09 reapplied across modalities. The Dynamo osl hint97 (expected output sequence length) lets the scheduler reserve decode capacity for the long Markdown output that a PDF page produces. Vertex's context-caching92 is agnostic to modality and survives cross-modal calls; the cache resource is the same whether the cached prefix is text, image, or both. None of these primitives does modal-quality dispatch — there is no "send vision-only queries to model A, vision+text to model B" — only cache-locality and SLO-hint routing.
A bet in §15 (BET 07) is that this gap closes by mid-2027 — a modality-dispatch knob landing as a first-class API primitive in at least one of AWS Bedrock, Azure Foundry, or NVIDIA Dynamo. Until then, the multimodal routing surface is constructed at the application layer from the primitives above.
While Topaz and RouteProfile invest in centralized capability matrices, DiSRouter7 (Shanghai Jiao Tong / Kai Yu lab, ICLR 2026, 2026-04-22) rejects the premise. Each LLM in the fleet undergoes Self-Awareness Training — a calibration objective that teaches the model to score its own competence on a query within a trustworthy range — and then a distributed protocol lets the most-confident model take the query. There is no central matrix to maintain, no benchmark suite to run on every fleet expansion. Adding a new model means training the new model, not retraining the router. The hard condition is calibration: without it, self-assessment routes fluent overconfidence rather than competence.
Architecturally, this is the cleanest answer this quarter to a problem central-matrix designs tend to understate: the maintenance burden compounds with every model. RouteProfile's own design-space study makes the point indirectly — profile choice is not free, structured profiles need ongoing curation, and central matrices encode capability information about models that change every week. DiSRouter pushes that information into the models themselves, where it is updated each time the model is fine-tuned anyway.
The tradeoff has teeth. Explainability gets harder when the routing decision lives inside ten different models' hidden states. A miscalibrated self-assessor causes silent quality regressions with no external audit trail. And cross-organizational coordination — when no single party owns all the models in the fleet — becomes the bottleneck the central matrix used to absorb. The likeliest 2026–2027 outcome is hybrid: a sparse central matrix capturing broad archetypes, with distributed self-assessment within each archetype for fine-grained selection. GraphPlanner's28 heterogeneous graph, which captures both model-level and role-level capability structure, already gestures at that convergence — and §15's predictions take the bet that at least one frontier lab ships a self-routing fleet in production before the end of 2026.
DiSRouter7 argues that each model can emit a calibrated self-assessment and the most-confident model can take the query, without a central matrix to coordinate them. Appendix A.9 makes the argument that matters here: "high-capability closed-source LLMs (e.g., GPT-4) may already possess strong intrinsic self-awareness and can be integrated without additional training." If that's true, a router can be built from frozen signals already exposed at inference, with no fine-tuning required.
Four such signals appear in the literature, ordered by how much evidence supports them. Average token log-probability reaches AUROC 0.65–0.83 in-distribution and 0.72–0.83 out-of-distribution10; it's a property of the generation rather than the query, which means it generalizes across query distributions. A GSA v3 single-token YES/NO probe reaches AUROC 0.56–0.72 at about 500 ms latency10 and can run before the cheap generation as a pre-generation filter. Self-consistency spread across N samples is more accurate but pays N× the cost, which limits it to high-stakes spot checks. Activation probes — linear probes on residual-stream activations — reach AUROC above 0.70 (§18, LA-16), with the NVIDIA Prefill Activations Router1 as the in-window production-grade cousin.
One vendor constraint to flag: Anthropic does not expose raw logprobs at the time of writing, which means the logprob signal is unavailable on Claude calls. That's a routing-architecture limitation rather than a model-quality one, and the §12 semver-lie failure mode74 compounds it — even when logprobs are exposed elsewhere, they shift across minor snapshot bumps.
FIG. 16 walks the deployment topology. An incoming query first hits a pre-generation filter (the GSA v3 probe). If the probe's confidence is below threshold_low, the query routes directly to the strong model and the cheap path is skipped entirely. Otherwise the cheap model generates a response and emits per-token log-probabilities. The average logprob is compared to a calibrated threshold: a high average returns the cheap response, a low one escalates to the strong model. On a generic enterprise workload the expected operating point is around 94% of strong-model quality at roughly 18% reduction in expensive calls; on TriviaQA-shaped data, where the cheap model is more accurate than the strong/weak prior typically assumes, savings are larger10. Two tradeoffs sit underneath these numbers. The cheap generation is paid for before the escalation decision, so the cost floor is non-zero. And the logprob threshold drifts with every snapshot bump, so it needs recalibration against a held-out canary at each transition.
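FIG. 16's topology, as code. The thresholds are illustrative and, per the drift caveat above, need recalibration against a held-out canary at every snapshot bump:

```python
def shadow_self_route(query, probe, cheap_model, strong_model,
                      threshold_low=0.3, logprob_floor=-0.35):
    """Pre-generation probe first, then a logprob gate on the cheap generation.
    Both thresholds are placeholders to be calibrated per workload."""
    if probe(query) < threshold_low:           # GSA-style YES/NO probe, ~500 ms
        return strong_model(query)             # skip the cheap path entirely
    text, token_logprobs = cheap_model(query)  # cheap generation with logprobs exposed
    avg_logprob = sum(token_logprobs) / max(len(token_logprobs), 1)
    if avg_logprob >= logprob_floor:           # calibrated on a held-out canary set
        return text
    return strong_model(query)                 # escalate; the cheap generation is sunk cost
```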
A related idea worth surfacing: Tian Pan88 argues that abstention should live in the router as a typed output — { answer: string } | { abstain: { reason, missing_information } } — rather than as a system-prompt rule. A typed abstention unlocks four downstream routing decisions: escalate to a stronger model, escalate to a tool, escalate to a human queue, or ask the user for clarification. DiSRouter's Self-Awareness Training implements the same idea on the model side; the typed-output contract makes it composable across a multi-model topology without baking it into a prompt.
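The contract is small enough to write down. A Python rendering of the typed output; the dispatch policy below is our illustration of the four downstream moves, not Pan's code:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Answer:
    answer: str

@dataclass
class Abstention:
    reason: str
    missing_information: str

RouterOutput = Union[Answer, Abstention]

def dispatch(out: RouterOutput, escalate_model, escalate_tool, escalate_human, ask_user):
    """A typed abstention is routable; a prompt-level 'I can't answer' is not.
    The branch conditions here are an illustrative policy only."""
    if isinstance(out, Answer):
        return out.answer
    if out.missing_information:
        return ask_user(out.missing_information)   # clarify with the user
    if out.reason == "needs_tool":
        return escalate_tool(out)                  # hand to a tool
    if out.reason == "high_stakes":
        return escalate_human(out)                 # human review queue
    return escalate_model(out)                     # stronger model
```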
One failure mode worth being explicit about: a model can be confidently wrong. High log-probability is not the same as correctness — RLHF-trained models often hallucinate at high confidence — and a shadow self-router will return those confident hallucinations at the cheap tier. SMARTCAL109 is the relevant tool-use warning: self-aware tool use improved only after explicit recalibration reduced overconfidence and tool abuse. Self-assessment is a routing primitive, not an audit primitive. Anything deployed on top of it still needs a separate verification layer (parity gates, eval harnesses, human review on regulated workloads) to catch the confident-wrong case.
The most consequential infrastructure story of the quarter is that the prefix-cache became a primary serving signal in production stacks. It is not a substitute for quality routing; it changes the cost and latency terms that a model router should optimize. vLLM's KVEvents API standardized the surface — every cache block allocation and eviction emits an event with the block hash, worker id, tier, and action — and a half-dozen routers built on top of it. NVIDIA Dynamo 1.0 GA15 (2026-03-16, production-ready 2026-04-17) ships a cluster-wide Flash Indexer that maintains a real-time map of which KV blocks live on which workers, updated by KVEvents emissions, at 170M ops/s with sub-millisecond lookups. The KV-aware router weights queue depth and KV-cache overlap fraction; an Agent Hints API lets each request declare latency sensitivity, expected output sequence length, and a cache-pinning duration; a multimodal router downloads images, encodes them, and selects the worker with the highest cache overlap including image content blocks. Llama 3.1 on Dynamo + the NeMo Agent Toolkit reports 4× lower TTFT and 1.5× higher throughput.
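The selection rule these stacks share is a weighted score over cache overlap and load. A sketch with our own weights and field names; in Dynamo the overlap comes from the Flash Indexer, in llm-d from KVEvents introspection:

```python
def pick_worker(prefix_blocks: set, workers: list, alpha=1.0, beta=0.5):
    """KV-aware worker selection sketch: favour the worker already holding the
    request's prefix blocks, penalise queue depth. `blocks_held` would be
    kept current by a KVEvents-fed index."""
    def score(w):
        overlap = len(prefix_blocks & w["blocks_held"]) / max(len(prefix_blocks), 1)
        return alpha * overlap - beta * w["queue_depth"]
    return max(workers, key=score)
```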
llm-d ships the same idea inside Google's Kubernetes Inference Gateway. Its Predicted-Latency Scheduler44 (2026-03-13) runs an XGBoost model continuously retrained on Vertex AI with live serving telemetry — input length, output-length estimate, queue depth, KV utilization, prefix-cache hit ratio — and reports −70% P50 TTFT versus heuristic queue-depth routing. The companion Precise Prefix Cache Routing45 guide subscribes to the KVEvents stream so the EPP knows which blocks each worker actually holds at any moment, not which it used to hold; 80–95% cache-hit rates on multi-turn conversation against 12% for random load balancing and 40–60% for hash-routing without eviction awareness.
Ranvier16 (2026-03-16) rebuilds the idea engine-agnostically. The data structure is an Adaptive Radix Tree rather than a hash table, so a prompt that partially matches a cached system prompt routes to the worker that has the most of it — not just exact matches. Benchmarks on 13B models report 79–85% P99 latency reduction over round-robin; the P99 gap is larger than P50 because P99 is exactly where the worst-case re-prefill penalty lives. A single deployable binary, OpenAI-API compatible, no vendor lock-in.
The user-visible quote of the quarter belongs to Augment Prism17 (2026-05-02). Switching models mid-conversation evicts the prompt cache; the next call pays roughly 10× the marginal token cost. A naive per-turn router that switches frequently can be more expensive than always using the frontier model. Prism's solution is a small planner that fires only when it predicts a switch is worth the eviction cost, and it stays sticky across the agent loop — for tool-result follow-ups (about 96% of all chat-host turns) the planner reuses its prior decision and runs on only ~4% of turns. The planner itself costs $0.91 per $2,649 of total spend (0.03% overhead). Net: matches the best individual frontier model on quality at 20–30% lower cost per task. Model-switching has become a first-class cost the router must price.
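The gate Prism's planner implements reduces to one comparison. A sketch under our assumptions; Prism's actual planner also predicts the quality side of the gain, and the 10× figure is the re-prefill premium quoted above:

```python
def should_switch(predicted_gain_usd: float, cached_tokens: int,
                  uncached_price_per_tok: float, cache_discount=0.9):
    """Eviction-cost-aware switch gate: switching models evicts the prefix
    cache, so the next call re-pays the cached prefix at full price
    (cache_discount ~ 0.9 if cached tokens cost ~1/10 of uncached)."""
    eviction_cost = cached_tokens * uncached_price_per_tok * cache_discount
    return predicted_gain_usd > eviction_cost
```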
The frontier of cache-aware routing is PrfaaS43 (Moonshot AI, 2026-04-16). Prefill-decode disaggregation has historically pinned both phases to the same datacenter — sometimes the same rack — because the KV transfer bandwidth is too high. Kimi Linear's hybrid attention shrinks KV cache size enough that cross-DC transfer over commodity Ethernet becomes practical; PrfaaS selectively offloads only long-context prefills to remote compute-dense clusters, monitoring queue depth and Ethernet link utilization to avoid congestion. +54% throughput, −64% P90 TTFT against local-only PD on a 20× scaled-up evaluation cluster.
| System | Date | Mechanism | Headline |
|---|---|---|---|
| NVIDIA Dynamo 1.0 GA | 2026-03-16 | Flash Indexer + 4-tier KV hierarchy + Thompson sampling | 170M ops/s · 4× TTFT |
| llm-d (predicted-latency) | 2026-03-13 | XGBoost TTFT/TPOT predictor in Vertex AI | −70% TTFT p50 |
| llm-d (precise prefix-cache) | 2026-03 | Direct KVEvents introspection | 80–95% hit rate vs. 12% random |
| Ranvier | 2026-03-16 | ART-based prefix router · OSS · engine-agnostic | 79–85% P99 ↓ |
| Augment Prism | 2026-05-02 | Cache-aware coding-agent router; sticky, eviction-cost-aware planner | 20–30% cost ↓ at matched frontier quality |
| PrfaaS (Moonshot AI) | 2026-04-16 | Cross-datacenter KV routing | +54% throughput · −64% P90 TTFT |
The cache-as-router argument of §09 is real, but it ships as a long list of vendor-specific API knobs that a builder has to actually invoke. The table below is the cookbook — every primitive we use, what it gates, when we invoke it, and which routing-or-cache property it controls.
| Vendor | Primitive | What it gates | When we invoke it | Routing / cache role |
|---|---|---|---|---|
| OpenAI | prompt cache | per-prefix automatic cache | every request >1024 tokens | break-even at first cache hit86 |
| Anthropic | cache_control | explicit cache breakpoints · 5m/1h TTL | system / tool / sliding-conv boundaries | three-breakpoint BP1/BP2/BP3 architecture91,84 |
| Anthropic | max_tokens: 0 | cache pre-warm without billing output | pre-deploy warm-up of long system prompts | amortise 25% write premium91 |
| AWS Bedrock | cross-region inference | capacity-aware geographic routing | compliance-band traffic | ~10% savings · SCP-gated98 |
| AWS Bedrock | Intelligent Prompt Routing | vendor-managed model selection | baseline option for generic workloads | OpenAI-compatible cross-provider endpoint46 |
| Azure Foundry | routing.mode | balanced / cost / quality preset | tenant-level routing-policy default | data-residency via routing.models subset93 |
| Google Vertex | cachedContent | explicit cache resource · TTL | long-context Gemini calls | 90% discount on 2.5+92 |
| Google Vertex | VPC Service Controls | data-residency on cache resources | EU and regulated tenancy | compliance is a routing decision92 |
| Portkey | strategy: fallback | ordered provider list with retry | edge gateway · all traffic | three-tier fallback chain24 |
| Portkey | strategy: loadbalance · hash_fields | deterministic hash-based dispatch | session-affinity workloads | cache-locality preservation |
| Portkey | strategy.conditions.query | MongoDB-style metadata routing | per-request policy override | conditional model selection94 |
| OpenRouter | X-OpenRouter-Cache | gateway-layer exact-match cache · 300s | idempotent / repeated traffic | HIT/MISS observability header95 |
| Vercel | providerOptions.gateway.models | ordered fallback array | edge ai-gateway · per request | modelAttempts response metadata96 |
| NVIDIA Dynamo | nvext.agent_hints | latency_sensitivity · osl · priority | every backend dispatch | routing hints into the scheduler97 |
| NVIDIA Dynamo | prefix_id · CachePinType | KV-cache pinning per prefix | PDF→MD page sequences | cache locality is the routing key97 |
| LangChain | Command(goto=…) · Send(…) | single-agent dispatch · parallel fan-out | LangGraph application layer | routing is code-explicit, not LLM-decided99 |
| LlamaIndex | RouterQueryEngine | LLMSingleSelector · PydanticMultiSelector | application-layer retrieval | JSON or function-call selection100 |
| Anthropic | Managed Agents · session | brain/hands decoupling · session pin | multi-turn agent workloads | session-level cache and routing26 |
The procurement-honest answer: no vendor publishes the added latency of its routing step — Azure Model Router's LLM-inference overhead, Bedrock IPR prediction time, cache-lookup latency on the cache-resource APIs. The build position is unchanged: every primitive we invoke is gated by our own measured-latency dashboard, not by the vendor's marketing number. This is the cookbook content, not the editorial position; the editorial position remains own the router, integrate the primitives.
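Since the table's fallback primitives all share one shape, here it is once, provider-agnostic. A sketch only: real gateways distinguish retryable errors (429s, timeouts) from hard failures, which this elides, and the attempt trail mimics the modelAttempts-style metadata some of them return:

```python
import time

def call_with_fallbacks(prompt, providers, max_retries_per=1, backoff_s=0.5):
    """Walk an ordered provider list, retry transient failures with backoff,
    and record every attempt for the audit log. `providers` is a list of
    (name, callable) pairs in priority order."""
    attempts = []
    for name, call in providers:
        for retry in range(max_retries_per + 1):
            try:
                result = call(prompt)
                attempts.append((name, retry, "ok"))
                return result, attempts
            except Exception as err:  # production code would filter on error class
                attempts.append((name, retry, repr(err)))
                time.sleep(backoff_s * (2 ** retry))
    raise RuntimeError(f"all providers failed: {attempts}")
```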
Multi-objective routing got real this quarter. BOute18 (MLSys 2026 oral, 2026-02-11) treats routing and GPU placement as a single coupled non-convex problem: query routing thresholds under quality and latency constraints, heterogeneous GPU resource allocation, parallelism strategies. A multi-objective Bayesian optimization loop wraps a serving simulator and converges in far fewer trials than RL or grid baselines. Result: up to 157% throughput gain (59% on average), or equivalently 15–61% cost savings (38% on average) at matched quality. No routing table is hand-tuned; the optimizer discovers thresholds for the specific GPU mix and cost target.
AdaServe19 (EuroSys '26, 2026-04-26–30) makes per-request speculative-decoding trees a routing decision. Each request carries an SLO in metadata; AdaServe looks up a pre-profiled SLO-to-tree-config mapping and assigns a customized draft configuration — tighter SLO yields a larger, more aggressive speculative tree, relaxed SLO yields minimal speculation. The routing question becomes "how much speculative compute for this request?", a continuous knob rather than a discrete model selection. 4.3× fewer SLO violations, 1.9× higher goodput against best-performing baselines.
BEAM42 (MLSys 2026 oral) adds energy as a first-class SLO. Running atop vLLM with sub-millisecond event-driven scheduling, BEAM evaluates the energy cost of each candidate batch / worker / DVFS action and picks the lowest-energy action that meets all per-request SLOs. 51% end-to-end GPU energy reduction without violating SLOs.
CEDAR20 (GreenSys @ EuroSys '26, 2026-04-26) puts carbon on the objective list alongside cost and latency. The DRL agent observes real-time grid carbon intensity, spot pricing per endpoint, and queue depth across a multi-region deployment; the reward function explicitly distinguishes marginal grid carbon (the schedulable kind) from average carbon (the uncontrollable kind), which most prior carbon-routing papers conflate. Result: 26% cost reduction and 27% carbon reduction at negligible SLO degradation.
The scheduling-side companion is Kairos41 (2026-05-04), urgency-based SLO scheduling for prefill–decode disaggregation. The prefill side computes urgency = slack_to_TTFT_SLO / request_length and greedily picks the highest-urgency chunks first; this directly solves head-of-line blocking, where a 128K-token request would otherwise starve 8K requests at deadline. The decode side runs slack-guided adaptive batching: when slack is positive, short requests decouple from stragglers and run ahead. At the QPS = 3.0 inflection on Minimax-M2.5, end-to-end SLO attainment goes from 55.8% (DistServe) to 89.6% (Kairos), a +33.8% swing.
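The prefill-side rule is small enough to show. A sketch of the urgency ordering as we read it, with illustrative field names; we treat the lowest slack-per-token as most urgent:

```python
def prefill_order(requests, now):
    """Urgency-ordered prefill sketch: slack-to-TTFT-SLO divided by request
    length, smallest ratio first. An 8K request at its deadline therefore
    runs ahead of a 128K prefill that still has slack, which is the
    head-of-line-blocking fix."""
    def slack_per_token(r):
        slack = r["ttft_deadline"] - now         # seconds before the SLO is blown
        return slack / max(r["prompt_tokens"], 1)
    return sorted(requests, key=slack_per_token)  # most urgent first
```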
Every Pareto plot you see in this literature is shaped by the workload the authors had on hand. The frontier moves when you change the workload. Treat each curve as an instrument, not as ground truth.
The strongest negative result of the quarter came from Google21 on 2026-04-26, consolidating an earlier 180-configuration study52 with field experience from 30+ production deployments51. The verdict: multi-agent collaboration helps on parallelizable, divergent, or cross-context-window tasks, but hurts sequential planning by 39–70%. Independent agents amplify errors 17× as a single mistake propagates through subsequent stages. Hub-and-spoke architectures — one strong orchestrator routing to many specialists — won decisively over open-mesh agent collaboration. The "swarm" framing is dead in production; "dispatch" is the live frame.
The agent-specific routing question is narrower than the multi-agent question: which steps are safe to route down without damaging terminal success? The strongest direct evidence is not from coding-agent product guides but from agent benchmarks. Ares104 routes reasoning effort per step and matches high-effort baselines on several agentic tasks while cutting reasoning tokens by roughly 35–45%. BAAR / BoPO105 trains a budget-aware small/large policy for ScienceWorld, ALFWorld, and AppWorld; it improves the Pareto curve but still trails always-large under tight budgets. Dynamic Mix Precision106 shows the same structure at the precision layer on ALFWorld: many steps can run cheap, but critical steps need the expensive representation. HORIZON107 explains why: most long-horizon failures are process failures, so early routing mistakes can corrupt the whole trajectory.
The six-row step taxonomy below is our synthesis from the routing implications of those four papers together with the role-routing patterns Augment, Cursor 3, and Cline ship; it is not lifted from any single source. The closest published role taxonomies — HuggingGPT's decomposition / execution / aggregation roles and GraphPlanner's Planner / Executor / Summariser28 — cover the endpoints (decomposition and final synthesis) but not the middle of the loop where most of the cost actually lives.
| Agent step | Routing implication | Typical failure if routed too cheap |
|---|---|---|
| Initial decomposition | high-leverage; usually protect with stronger model | wrong subgoal poisons every later action |
| Search / retrieval fanout | cheap and parallelizable when verifier exists | query drift; misses decisive evidence |
| Mechanical edit or extraction | safe to downroute if schema and tests are tight | format errors and retries erase savings |
| Tool-error diagnosis | often needs stronger reasoning; feedback is sparse | loops on the wrong repair hypothesis |
| Compression / summary | cheap model may work, but losses are sticky | drops constraints needed later |
| Final synthesis | depends on user-visible quality and risk | correct trajectory becomes a weak answer |
Coding agents converged on a stable role-routing pattern by mid-April48. Treat this as static role assignment, not evidence of dynamic per-step routing inside one trajectory. The split that ships in production:
| Agent role | Typical model tier | Rationale |
|---|---|---|
| Planning / orchestration | Claude Opus 4.6 | Ambiguous decomposition; errors cascade downstream |
| Code implementation | Claude Sonnet 4.6 | 80%+ SWE-bench with 30% fewer tokens than Opus |
| File navigation / search | Claude Haiku 4.5 | High-frequency, short structured queries |
| Code review | GPT-5.2 | Strong instruction adherence on multi-file diffs |
Augment, Cursor 3 (2026-04-02, parallel agent dispatch with a local-cloud split)49, and Cline ship variations of this and report roughly 51% cost reduction versus monolithic single-model agents. Treat that number as product evidence, not a benchmark result. The danger to avoid is the obvious one: assigning planning to a weak model causes errors in task decomposition that swamp every downstream cost saving.
AG2's four-layer handoff stack22 (2026-03-05) became the most-copied open-source reference architecture. On every agent turn, four mechanisms are evaluated in strict priority order: (1) context-based conditions (deterministic state-variable routing, no LLM call) → (2) LLM-based conditions (the agent's own LLM evaluates handoff predicates as tools) → (3) tool-based handoffs (a tool returns a routing signal) → (4) after-work fallback. The ordering matters: the cheapest routing always fires first, and a well-structured production system executes most decisions at layer 1 — zero LLM tokens, zero added latency.
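The four-layer ordering, as code. A sketch with our own naming; AG2's actual predicates and tool contracts are richer:

```python
def next_hop(agent, state, llm_eval, last_tool_signal, fallback):
    """Evaluate the four handoff mechanisms in strict priority order.
    The cheapest check fires first; a well-structured system resolves
    most turns at layer 1 with zero LLM tokens."""
    # 1. context-based conditions: deterministic state predicates, no LLM call
    for predicate, target in agent.context_conditions:
        if predicate(state):
            return target
    # 2. LLM-based conditions: the agent's own LLM evaluates handoff predicates
    target = llm_eval(agent, state)
    if target is not None:
        return target
    # 3. tool-based handoff: a tool result carries an explicit routing signal
    if last_tool_signal is not None:
        return last_tool_signal
    # 4. after-work fallback
    return fallback
```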
The frontier-lab response was infrastructural. Anthropic Managed Agents26 (2026-04-08) decouples brain from hands: the harness becomes stateless, reading from a durable session log and routing tool calls to whichever sandbox or MCP server is available, while containers (the hands) are provisioned only when actually needed. Three stable interfaces — execute(name, input), emitEvent(id, event), wake(sessionId) — let many brains share hands and many hands answer to one brain. The latency story: p50 TTFT −60%, p95 TTFT −90%, because inference no longer waits for container startup. Claude Code subagents50 (2026-04-30) push the routing surface further into natural language: the parent's description field is the routing rule, no classifier and no routing table; the spawned subagent reads the description, takes its own context window, runs in parallel, and returns. Week 17's forked-subagent flag (CLAUDE_CODE_FORK_SUBAGENT=1) lets the fork inherit the full conversation context for deep parallel exploration.
GraphPlanner28 formalizes the dispatch pattern as an MDP over (model, role) pairs with graph-encoded interaction memory; AgentFloor53 (2026-05-01) is the first agentic-routing benchmark with a 6-tier capability ladder, which gives the field a shared yardstick it has lacked.
The role-routed coding-agent split (planner / implementer / reviewer) is now the consensus production pattern, but it is a working compromise, not a settled design. It rewards teams with a clear architecture and punishes them with three handoff seams where context goes stale, prompts diverge, and small role misclassifications produce large bills. The Slate report names a version of this experience23; in the surveyed work, it is the gap between the role-split's average-case win and its tail-case failures.
Vendor infrastructure became the dominant production story this quarter. The defining M&A story is Portkey: $15M Series A on 2026-02-19 (Bessemer, Uncorrelated Ventures); gateway open-sourced at over 1T tokens/day on 2026-03-24; acquired by Palo Alto Networks on 2026-04-30 (terms not disclosed; press estimates have circulated in the $700M range but are not vendor-confirmed)24. Seventy days from Series A to exit. The deal establishes the AI gateway as a security-critical control plane for autonomous agents, embedded in PANW's Prisma AIRS platform — which is itself the most candid signal that the routing layer has become infrastructure-of-record, not a developer convenience.
Microsoft Foundry Model Router25 (2026-03-18, expanded 2026-04-28) is a trained ML router (not a rules engine) dispatching across up to 18 underlying LLMs — Claude, GPT-5.x, Grok, DeepSeek, Llama, OSS — from a single Azure endpoint. It analyzes prompt complexity, task type, and latency target; honors data-zone boundaries; and stores no prompts. AWS Bedrock46 shipped a cross-provider OpenAI-compatible endpoint on 2026-04-28, putting GPT models on Bedrock alongside Anthropic and Mistral.
OpenRouter's April was a four-step sprint47: video routing on 2026-04-15, Workspaces (per-workspace routing defaults) on 2026-04-22, Agent SDK on 2026-04-24, Response Caching on 2026-04-30. Martian55 reached approximately $1.3B on the secondary market on 2026-04-04, cementing routing-as-a-service as a standalone business category.
The frontier-model labs responded by building routing inside their own products. Anthropic Managed Agents26 (2026-04-08) decouples brain and hands as in §10 (p95 TTFT −90%); Claude Code subagents50 (2026-04-30) use the description field as a natural-language routing rule with parallel fork-join dispatch. Anthropic Adaptive Thinking and Gemini 3 Deep Think56 push routing inside the model itself: the model autonomously adjusts its reasoning budget, bypassing an external router for the easy / hard split. These are production-relevant primitives, but they are not public evidence that dynamic per-step model switching beats a single strong model in mixed tool-use agents. The production pattern is more conservative: request-level model routers, user-visible model pickers, fallbacks, static role assignment, and model-internal effort knobs.
OpenAI's GPT-5 router rollout is the useful cautionary case112. A router can be technically cost-effective and still fail the product contract if users experience opaque changes in quality, personality, latency, or refusal behavior. Production routers need override controls, audit logs, and rollback paths for the same reason they need accuracy metrics.
The OSS layer kept growing in parallel. NadirClaw (3-tier proxy router, 454★) and bitrouter (Rust agent-native proxy, 79★) ship as deployable binaries; Dr.LLM (ICLR '26) released dynamic layer routing59 on 2026-04-24. The pattern is the same at every layer: routing is the value, the model fleet is the commodity.
Most routers degrade not from external attack but from drift inside the system. Four modes have been named clearly enough this quarter to be worth knowing by name. The detectors below are proposed operating checks, not universal thresholds; each workload needs its own baseline, alert budget, and remediation cadence.
A fifth candidate — refusal verbosity and retry-rate shift76 — folds into the calibration row in practice, since the same conformal-replay remediation covers it.
The Pharos field synthesis66 looks at 25+ production systems and lands on a sobering corollary: the eval harness is the only reliable drift detector, regardless of whether the routing layer is bought or built.
A small but consequential thread of work argues that routers should predict counterfactual marginal gain — what the stronger model would have produced if we had called it — rather than absolute model quality. The NeurIPS 2025 paper that introduced the framingLA-2 treats routing as off-policy estimation on observational logs, since production traffic only ever shows you the outcome of the model you actually called.
The in-window descendant is MTRouter27, which learns a joint history-model embedding from logged trajectories and treats each step's routing decision as a regret-minimisation problem. Reported numbers: 58.7% cost reduction vs GPT-5 on ScienceWorld, 43.4% on HLE. The pre-generation analogue is RouteLMT11, which puts a LoRA on the small model's prompt-token representations to predict Δ = Q_large − Q_small rather than Q_large directly. Same shape, different deployment stage.
The framing matters because logged production traffic is observational, not interventional — every log is the outcome of a policy that already biased which model saw which query. Treating routing as causal inference is what makes offline re-fit on logged data legitimate rather than self-confirming.
Three off-policy estimators from the contextual-bandits literature do the work. Let μ(a|x) be the logging policy that actually chose model a on query x, and π(a|x) be the target policy we want to evaluate from logs alone. Let r be the observed reward (terminal-success bit, judge score, or whatever the harness records).
| Estimator | Formula | Property that matters in production |
|---|---|---|
| IPS | V̂_IPS(π) = (1/N) Σᵢ [π(aᵢ|xᵢ) / μ(aᵢ|xᵢ)] · rᵢ | Unbiased under positivity (μ(a|x) > 0 wherever π(a|x) > 0). Variance scales as O(1/μ²), which blows up when the logging policy almost never chose an arm. Standard mitigation is weight clipping at a fixed constant c. |
| SNIPS | V̂_SNIPS(π) = (Σᵢ wᵢ rᵢ) / (Σᵢ wᵢ), wᵢ = π/μ | Self-normalises by the weight sum, trading a small bias for substantially lower variance. Recovers a value in [0, 1] when rewards are. |
| Doubly Robust | V̂_DR = (1/N) Σᵢ [q̂(xᵢ, π(xᵢ)) + (π/μ)·(rᵢ − q̂(xᵢ, aᵢ))] | Unbiased if either the propensity model μ or the reward model q̂ is correctly specified — not necessarily both. When q̂ is even roughly accurate, the residual rᵢ − q̂ is small, which crushes variance even when μ is rough. |
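In code, each estimator is a few lines. A minimal sketch under the notation above, with per-row vectorised inputs (`pi` and `mu` are the target and logging propensities of the logged action, `r` the rewards); the function names and the optional clipping argument are ours:

```python
import numpy as np

def ips(pi, mu, r, clip=None):
    """(1/N) * sum((pi/mu) * r); unbiased under positivity."""
    w = pi / mu
    if clip is not None:
        w = np.minimum(w, clip)   # trade a little bias for bounded variance
    return np.mean(w * r)

def snips(pi, mu, r, clip=None):
    """Self-normalised IPS: divide by the weight sum instead of N."""
    w = pi / mu
    if clip is not None:
        w = np.minimum(w, clip)
    return np.sum(w * r) / np.sum(w)

def dr(pi, mu, r, q_logged, q_target):
    """Doubly robust: reward-model baseline plus IPS-weighted residual.
    q_logged = q-hat(x_i, a_i) for the logged action;
    q_target = the reward model's estimate under the target policy."""
    w = pi / mu
    return np.mean(q_target + w * (r - q_logged))
```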
The gap in the routing literature is visible from this angle: of the 2025–2026 LLM-routing systems surveyed — FrugalGPT, RouteLLM, LLM-Blender, Smoothie, Router-R1, MTRouter, RouteLMT, BEST-Route, LLMRouterBench — none use IPS, SNIPS, or DR as the training objective. Routers are typically fit with supervised ERM on logged outcomes or with RL on a heuristic reward, both of which inherit the logging policy's bias. The Tsiourvas et al. ancestorLA-2 is the one paper in the lineage explicitly framed against this gap; nothing in-window has closed it.
To make the variance behaviour concrete: imagine a logged routing dataset of N=200 queries, 100 routed to a cheap arm A and 100 to an expensive arm B, under a logging policy μ(A)=0.80, μ(B)=0.20. (The equal realised counts are fixed by construction to keep the arithmetic legible; a real log under this μ would skew towards A.) Observed rewards: r̄_A = 0.65, r̄_B = 0.80. We want to estimate the value of a target policy π that flips to a uniform 50/50 split.
| Estimator | Calculation | Value |
|---|---|---|
| V̂_IPS | w_A = 0.5/0.8 = 0.625; w_B = 0.5/0.2 = 2.50. (0.625·0.65·100 + 2.50·0.80·100)/200 | 1.203 (unbounded scale) |
| V̂_SNIPS | numerator 240.625, denominator 312.5 | 0.770 (back in [0,1]) |
| V̂_DR | using empirical means as q̂: 0.5·0.65 + 0.5·0.80; residuals zero by construction | 0.725 (exact) |
Now change one number: μ(B) = 0.02 instead of 0.20 — the logging policy almost never chose the expensive arm, which is what happens in real cost-conscious deployments. w_B rises to 25.0. V̂_IPS jumps to 10.20 (uninterpretable noise). V̂_SNIPS holds at 0.796 on this fixed dataset: self-normalisation keeps the point estimate bounded, but it is now dominated by the hundred rows carrying weight 25.0, so its variance across resampled logs explodes. Clipping w_B at 5.0 stabilises both: V̂_IPS_clipped = 2.20, V̂_SNIPS_clipped = 0.783. The clipping introduces bias, but estimating a finite quantity beats estimating an infinite-variance one. V̂_DR is the cheapest of the three to stabilise on real logs: the reward model absorbs most of the variance the IPS weight would otherwise carry, so the estimate stays bounded even when propensities are small.
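The arithmetic is cheap to verify. A standalone script that inlines the weights (rows are fixed at the arm means, matching the worked example):

```python
import numpy as np

# 100 logged pulls of arm A (mu=0.8, mean reward 0.65) and 100 of arm B
# (mu=0.2, mean reward 0.80); target policy is a uniform 50/50 split.
pi = np.full(200, 0.5)
mu = np.array([0.80] * 100 + [0.20] * 100)
r = np.array([0.65] * 100 + [0.80] * 100)
q = r.copy()                                 # q-hat = empirical arm means

w = pi / mu
print(np.mean(w * r))                        # IPS    -> 1.203
print(np.sum(w * r) / np.sum(w))             # SNIPS  -> 0.770
print(np.mean(0.5 * 0.65 + 0.5 * 0.80 + w * (r - q)))  # DR -> 0.725 (residuals zero)

# Rare-arm regime: mu(B) = 0.02 on the same fixed rows.
mu2 = np.array([0.80] * 100 + [0.02] * 100)
w2 = pi / mu2
print(np.mean(w2 * r))                       # IPS            -> 10.203
print(np.sum(w2 * r) / np.sum(w2))           # SNIPS          -> 0.796 (bounded, high-variance)
wc = np.minimum(w2, 5.0)                     # clip weights at 5.0
print(np.mean(wc * r))                       # clipped IPS    -> 2.203
print(np.sum(wc * r) / np.sum(wc))           # clipped SNIPS  -> 0.783
```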
A marginal-gain router needs counterfactual evidence: randomised exploration, shadow calls to stronger models, canary slices where every model answers, or a held-out all-model replay set. The production consensus across Augment, Folkman, and the routing-as-bandit literature is a 5% uniform-random shadow slice — large enough to guarantee positivity (every arm's propensity is bounded below by a known constant), small enough that the shadow cost stays under a few percent of the bill, and simple enough that the propensity model is just a constant. Lower-volume systems run 1–2%; uncertainty-weighted exploration is more sample-efficient but loses the constant-propensity simplicity, and the bookkeeping gets harder. Clipping importance weights at a fixed constant in the 5–10 range is the standard variance fix; without exploration and clipping, the router mostly learns to justify the incumbent policy.
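A sketch of the logging side, assuming a generic `router` callable and dict-shaped log entries (all names ours, not from any cited system). The only load-bearing property is that shadow entries carry a known constant propensity; the reward field is attached later by the eval harness, and the estimators run on the shadow entries only:

```python
import random

def route_with_shadow(query, router, arms, shadow_fraction=0.05, log=None):
    """Route one query, diverting a uniform-random shadow slice.
    Off-policy estimates are computed on the shadow entries only, where
    every arm's propensity is a known constant by construction."""
    if random.random() < shadow_fraction:
        arm = random.choice(arms)                 # uniform exploration
        entry = {"slice": "shadow", "arm_selected": arm,
                 "propensity": 1.0 / len(arms)}   # within-slice propensity
    else:
        arm = router(query)                       # incumbent policy, untouched
        entry = {"slice": "production", "arm_selected": arm, "propensity": None}
    if log is not None:
        log.append(entry)                         # harness attaches "reward" later
    return arm, entry
```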
The same machinery gives a usable drift signal. Run V̂_IPS per day on the last 30 days of routing logs against a fixed baseline value (the policy the team committed to ship); alarm when the daily regret exceeds a threshold for several consecutive days. The 5% uniform shadow slice is what makes the per-day estimate well-defined.
```python
class Alarm(Exception):
    """Raised when the drift detector fires; carries the offending days."""
    def __init__(self, kind, days, recommend):
        super().__init__(f"{kind}: days={days}; recommended action: {recommend}")
        self.days, self.recommend = days, recommend


def compute_VIPS(logs, target_policy_scores, propensity_fn, clip=5.0):
    """Off-policy IPS estimate from a routing log slice."""
    N, weighted, weights = len(logs), [], []
    for entry in logs:
        a, r = entry["arm_selected"], entry["reward"]
        pi = target_policy_scores.get(a, 0.0)   # target policy's mass on this arm
        mu = propensity_fn(a)                   # logging policy's mass on this arm
        w = min(pi / mu, clip)                  # clipped importance weight
        weighted.append(w * r)
        weights.append(w)
    return sum(weighted) / N, weights


def detect_regret_drift(logs_by_day,
                        shadow_fraction=0.05,   # uniform mu for every arm
                        baseline_value=0.0,     # committed policy's V-hat
                        regret_threshold=0.05,
                        consecutive=3,
                        lookback_days=30):
    """Alarm if IPS-estimated regret exceeds threshold for N days in a row.

    Assumes logs_by_day holds the shadow-slice entries, where every arm's
    logging propensity is the known constant shadow_fraction."""
    prop_fn = lambda a: shadow_fraction
    streak, alarmed = 0, []
    for d in sorted(logs_by_day)[-lookback_days:]:
        logs = logs_by_day[d]
        if len(logs) < 50:                      # too few rows for a stable estimate
            continue
        counts = {}
        for e in logs:
            counts[e["arm_selected"]] = counts.get(e["arm_selected"], 0) + 1
        pi = {a: c / len(logs) for a, c in counts.items()}   # today's deployed policy
        V, _ = compute_VIPS(logs, pi, prop_fn)
        regret = baseline_value - V
        if regret > regret_threshold:
            streak += 1
            alarmed.append(d)
            if streak >= consecutive:
                raise Alarm("regret drift", days=alarmed,
                            recommend="hold policy; audit recent retrain")
        else:
            streak, alarmed = 0, []
    return {"status": "ok", "streak": streak}
```
This is the smallest operational version of off-policy evaluation: a single estimator (IPS), a single propensity (the shadow fraction), a single decision (raise or hold). Replacing IPS with DR is a one-line change once a reward model exists. The reason to run it is that without it, any router retrained on logged data is updating against a target the previous router already shaped — the "self-confirming" failure mode the Tsiourvas paper names. With it, the team has a per-day number that says the last 30 days are still net-positive relative to the committed baseline, which is the answer the question "is the router working in production?" was asking for in the first place.
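The DR swap, sketched as a hypothetical extension of `compute_VIPS` above; `reward_model(entry, arm)` is an assumed callable returning q̂ for any (query, arm) pair, which is the one new dependency:

```python
def compute_VDR(logs, target_policy_scores, propensity_fn, reward_model, clip=5.0):
    """Doubly-robust counterpart of compute_VIPS (sketch, not a cited API)."""
    total = 0.0
    for entry in logs:
        a, r = entry["arm_selected"], entry["reward"]
        # reward-model baseline: expected q-hat under the target policy
        baseline = sum(p * reward_model(entry, arm)
                       for arm, p in target_policy_scores.items())
        w = min(target_policy_scores.get(a, 0.0) / propensity_fn(a), clip)
        total += baseline + w * (r - reward_model(entry, a))   # weighted residual
    return total / len(logs)
```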
None of this is exotic. IPS dates to 1952; DR to 2011; SNIPS to 2015. All three are standard in causal inference and contextual bandits, and across the 2025–2026 LLM-routing systems surveyed here, none are used as the training objective. The opportunity is to import them, not invent them.
The events covered by this survey, plotted on the time axis. The shape itself is informative: a sparse February (BOute, Portkey's Series A, the Trinity / Huawei survey); a dense mid-March infrastructure burst (Dynamo 1.0, MS Foundry, llm-d, Ranvier, Portkey OSS); a late-April academic flood (ICLR 2026 in Singapore, ACL 2026 Industry, EuroSys '26 in Edinburgh, all overlapping); and a closing week of consequential vendor events (Portkey's acquisition, AgentFloor, Augment Prism, Kairos).
One of the strongest findings of the past few months is also one of the least resolved: RouteProfile6 shows that the format of a model capability profile matters more than the router mechanism on top of it. But the field hasn't converged on what that format should look like — discrete benchmark scores, dense learned embeddings, and structured hybrids all have credible papers behind them. The matrix-versus-self-routing argument (Topaz5 and Dimension-Direct14 against DiSRouter7) is similarly unsettled. What follows is six predictions about where these debates land, and seven bets about how the vendor and benchmark layers respond.
Two further calls round out the list: vendor lock-in tension will peak in Q3 2026 as MS Foundry Model Router's25 incumbency advantage on Azure clashes with the OpenRouter / Bedrock cross-provider plays, and the benchmarks (LLMRouterBench, AgentFloor) will unify by year-end. Both follow from the six above and do not need separate predictions.
Main claims use items dated in the strict window 2026-02-08 → 2026-05-08. A small number of numbered references are pre-window context (each explicitly footnoted as such); lineage references for late-2025 ancestor work appear separately in §18.
The main report (§01–§17) is bounded strictly to 2026-02-08 → 2026-05-08. That thirteen-week window is the report's discipline; this appendix names the late-2025 architectural and benchmark substrate the in-window papers stand on. Late-2025 items are cited here for ancestry only — they are not used as evidence for any in-window claim, and they carry the LA-N prefix to keep them visually distinct from the numbered references above.
| LA-ID | Title | Venue / Date | Inherited by | What 2026 added |
|---|---|---|---|---|
| Online routing parents | ||||
| LA-1 | PORT — Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving | NeurIPS 2025 · 2025-09-02 | ParetoBandit 33, AdaServe 19 | Training-free online routing with competitive-ratio guarantee; ANN query features + bootstrap optimisation became the substrate ParetoBandit's primal-dual pacer extends. |
| Causal routing parents | ||||
| LA-2 | Causal LLM Routing — Regret Minimization from Observational Data | NeurIPS 2025 · 2025-12-02 | MTRouter 27; §13 causal routing | Routes by marginal causal gain (counterfactual improvement), not absolute model quality; interval-conditioned architecture and end-to-end regret minimisation from observational logs. |
| RL multi-round routing parents | ||||
| LA-3 | Router-R1 — LLM-Native Multi-Round Routing | NeurIPS 2025 poster · 2025-12-09 | GraphPlanner 28 | Router is itself an LLM that interleaves "think" and "route" actions across multi-round contexts; routing reframed as reasoning, not classification. |
| Token/step granularity parents | ||||
| LA-4 | R2R — Token-Level Small-Large Model Routing | NeurIPS 2025 · arXiv:2505.21600 · 2025-11-15 | TRIM 3 | Token-level divergence detection + automatic routing-label generation; TRIM moved the granularity from token to step (more deployable). |
| LA-5 | Lookahead Routing — Predicting Output Representations | NeurIPS 2025 poster · 2025-12-04 | NVIDIA Prefill Activations Router 1 | Predicts latent output representations using causal/masked LMs; the in-window ref-1 substitutes real internal states for predicted ones. |
| Confidence / self-routing parents | ||||
| LA-8 | DiSRouter — Distributed Self-Routing for LLM Selections | ICLR 2026 · arXiv:2510.19208 · 2025-10-22 | DiSRouter 7; shadow self-router (§08b) | Already cited in the main text as ref 7; listed here for lineage completeness. Appendix A.9's "frontier models possess strong intrinsic self-awareness without fine-tuning" is the key premise for the shadow self-router thesis. |
| LA-9 | Self-REF — Learning to Route LLMs with Confidence Tokens | ICML 2025 · 2025-05 | Abstention as a route 88 | Established that LLMs can learn to emit explicit confidence tokens predicting downstream correctness — the lineage parent for the typed-abstention contract. |
| LA-16 | LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations | arXiv:2602.09924 · 2026-02-06 | Shadow self-router (§08b); zero-shot confidence 10 | Pre-Feb-8 by 2 days. AUROC > 0.70 from linear probes on residual-stream activations; cross-model encoder pattern that enables shadow self-router on closed APIs without fine-tuning. |
| Benchmark parents | ||||
| LA-6 | RouterEval — 200M Record Comprehensive Benchmark | EMNLP 2025 Findings · 2025-11-12 | RouterArena; Dynamic Routing Survey 8 | First benchmark documenting model-level scaling — as candidate pool grows, capable router exceeds best individual model. |
| LA-7 | RouterArena — Open Platform for Comprehensive Router Comparison | ICLR 2026 · arXiv:2510.00202 · 2025-10-01 (ongoing) | Capability matrix §06; drift §12 | Live leaderboard infrastructure for router comparison — 44 categories, Bloom difficulty levels, 5 metrics, commercial-router inclusion; the missing primitive for auto-refreshed capability matrices. |
| LA-10 | RouteLLM — Strong/Weak Cascade Foundation | ICLR 2025 (Berkeley / Anyscale) · 2025-01 | DiSRouter 7; Martian 55; Not Diamond | Foundational ancestor cited by DiSRouter and many 2026 systems; established the strong/weak model cascade concept that the productised commercial routers extend. |
| LA-14 | LLMRouterBench — Massive Benchmark and Unified Framework | arXiv:2601.07206 · 2026-01-12 | §13 drift / operational tooling | Pre-Feb-8 by 27 days. Unified evaluation harness (400K+ instances, 21 datasets, 33 models, 10 routing baselines); cited inside lineage appendix only. |
| LA-15 | MMR-Bench — Multimodal LLM Routing Benchmark | arXiv:2601.17814 · 2026-01-28 | Multimodal dispatch (§07b); BET 08 | Pre-Feb-8 by 11 days. Multimodal routing benchmark referenced by EquiRouter 71 and others. Lineage parent for §07b multimodal-dispatch gap. |
| Field-substrate parents | ||||
| LA-11 | Hidden Cost of LLM Drift: How to Detect Subtle Shifts Before Quality Drops | insightfinder.com/blog · 2025-12-08 | 90-day degradation 75; drift taxonomy 77 | Drift detection cost ~1–2% overhead; undetected drift costs 5–20% revenue over 90 days; 10–20× ROI on drift monitoring. |
| LA-12 | ZenML — What 1,200 Production Deployments Reveal About LLMOps in 2025 | zenml.io/blog · 2025-12-19 | Gateway framing 62; outage postmortem 65 | Aggregates Amazon Rufus multi-model evolution, GetOnStack $127/wk → $47K/mo recursive-loop incident, Cursor Tab 400M req/day, OpenTelemetry instrumentation patterns. |
| LA-13 | Hugging Face TGI Maintenance Mode | huggingface.co · 2025-12-15 | Fleet-rotation framing 69 | Foundational fleet-rotation evidence: the most-widely-used open-source inference framework went into maintenance; teams that built on TGI face migration cost. |
| LA-17 | Semantic Router | github.com · 2024 → 2026 (ongoing) | Heuristic routing §05b; rule-vs-ML framework 87 | Lineage no-train system (utterance-embedding routing); 10–100× faster than LLM-based routing; primary tradeoff: utterances go stale on distribution shift. |
| LA-18 | AWS Bedrock Intelligent Prompt Routing — original docs | docs.aws.amazon.com · 2025-11-15 | AWS Bedrock 46, cross-region 98 | Documented as "may not always provide optimal routing for unique or specialized use cases" — vendor-managed black-box that cannot adapt to application-specific distribution. |
| LA-19 | AWS Bedrock 1-Hour Prompt Caching | aws.amazon.com · 2026-01-26 | Bedrock cross-region 98 | Pre-Feb-8 by 13 days. Closes gap with Anthropic Direct on cache TTL; combined with Bedrock IPR (LA-18) is the most powerful Bedrock-native composition. |
| LA-20 | Red Hat — Master KV Cache Aware Routing with llm-d | developers.redhat.com · 2025-10-07 | llm-d predicted-latency 44; precise prefix-cache 45 | EPP scoring of vLLM decode pods, 87.4% cache hit rate, 88% TTFT reduction, 99.92% session affinity. |
| Regulatory parents | ||||
| LA-21 | EU AI Act (Reg. 2024/1689) + 2026-05-07 Digital Omnibus amendment | EUR-Lex · 2024-07-12 / Commission · 2026-05-07 | §03b audit surface; gateway routing decisions (Art. 9, 10, 13, 14, 15, 50, 53, 55) | Original Annex III deadline 2026-08-02 was pushed to 2026-12-02 by the Digital Omnibus of 2026-05-07. GPAI obligations (Art. 53, 55) remain live since 2025-08-02; systemic-risk reporting clock is unchanged. The article-by-article mapping in §03b is grounded in the EUR-Lex text. |
| Cascade economics parents | ||||
| LA-22 | FrugalGPT — How to Use Large Language Models While Reducing Cost and Improving Performance | arXiv:2305.05176 · 2023-05 (TMLR · 2024-12) | §04b "80% rule"; cascade tier discussion | Three-tier learned cascade (GPT-J → J1-L → GPT-4) trained against a DistilBERT scoring function. The headline result behind the "80% rule" naming: 80% cost reduction on HEADLINES at matched or improved accuracy; across-dataset range 50–98%. The 2026 cascade literature (BoundaryRouter, ConfSpec, Mahmood et al. ICLR 2026) reads as direct extension of this baseline. |
One sentence per cluster names the in-window descendant. Online routing parents (LA-1) feed ParetoBandit and AdaServe. Causal routing (LA-2) feeds MTRouter27 and the §14 causal-correction discussion. RL multi-round routing (LA-3) feeds GraphPlanner28. Token/step granularity parents (LA-4, LA-5) feed TRIM3 and the NVIDIA Prefill Activations Router1. Confidence and self-routing parents (LA-8, LA-9, LA-16) feed DiSRouter7 and the shadow-self-routing pattern in §08b. Benchmark parents (LA-6, LA-7, LA-14, LA-15) feed §07b multimodal dispatch and §13 drift instrumentation. Field-substrate parents (LA-11–LA-13, LA-17–LA-20) feed the in-window gateway, outage, drift, vendor, and KV-cache references throughout the survey.