Model routing has changed shape in the last few months, but not as one clean wave. Research routers now consume internal activations, prefix-cache state, step-level confidence, and full trajectory state. Shipped systems mostly expose request-level routers, fallbacks, static role assignment, and serving schedulers. The useful question is therefore not whether “routing” won; it is which layer is routing, which objective it optimizes, and whether that evidence transfers to long-horizon agents.
This is a survey of that work, plus the production systems and vendor primitives that shipped alongside it. The aim is to make the literature legible without collapsing different claims into one frontier: model-selection routers are not cache schedulers, cascades are not pre-generation routers, and agent role dispatch is not the same as dynamic per-step model switching. At the end we collect a short set of candidate experiments — hypotheses worth trying, not commitments — drawn from the most credible ideas in the field.
Routers used to look at the query text and almost nothing else. That is no longer the strongest research signal for model selection. Recent work routes on internal model activations1, on the prefix-cache state of the serving stack2, and on confidence trajectories that emerge between reasoning steps3. NVIDIA's prefill-activations router, for instance, closes 45.58% of the gap between the strongest standalone model and an oracle while saving 74.31% over always-call-the-largest1. Query-only routing remains the cheap baseline — and kNN-over-past-performance is a surprisingly strong one — but it is no longer where the frontier evidence sits.
The unit of decision has shrunk in research. Where a 2025 router typically picked a model per query, recent papers pick per reasoning step or even per token. TRIM3 reports 5–6× cost-efficiency gains on MATH-500 and AIME by routing between the steps of a single chain-of-thought. ConfSpec4 stacks a step-level cascade on top of token-level speculative decoding for 2.24× end-to-end speedup with no quality drop. KAD36 goes a level further, framing per-token deferral as a 0–1 knapsack. Production has moved more slowly: most deployed routers still decide per request, per session, or per static agent role.
And the field is openly split on architecture. Centralized capability matrices — Topaz5, RouteProfile6, Dimension-Direct Routing14 — landed in the same few weeks as DiSRouter7, which dispenses with the central matrix and instead trains each model to assess its own competence. Both camps have credible papers, and the dispute is not yet resolved.
Vocabulary note. The word router covers anything that picks which model invocation handles a query — pre-generation routers, cascades, capability-matrix dispatchers, serving schedulers, fallback gateways, and agent-level role selectors. Authors disagree on the boundaries. This survey keeps the terms, but separates the evidence by layer whenever the distinction changes what can be concluded.
The rest of the post walks the literature stream by stream — signals, granularity, learned routers, cascades, capability matrices, self-routing, the cache layer, multi-objective work, agent dispatch, the vendor layer — and closes with experiments that force each claim onto a real workload, with terminal success, retry cost, cache behavior, and drift measured together.
Model routing is not a single research field. It is seven adjacent threads — learned routers, cascading inference, capability matrices, self-routing, cache-aware scheduling, agent-level dispatch, and vendor gateways — that have historically published in different venues with different vocabularies. The first job of this section is to name those threads and place the systems we cite into them.
The Trinity / Huawei Dynamic Routing Survey8 (2026-02-23, revised 2026-04-21) is the first cross-stream synthesis to land in this window. It introduces a six-paradigm taxonomy and a three-dimension framework (when, what, and how to route) and is cited by most of the work that follows.
The threads also overlap by mechanism, not only by topic. RouteProfile6 and DiSRouter7 appear within a week of each other and argue against each other directly — one defending matrix profiles, the other defending self-confidence as the primary signal. Dynamo 1.02 and llm-d44 both consume vLLM's KVEvents API, which was a serving-systems detail eighteen months ago and is a routing signal now. TRIM3's step-level routing is the result "Is Escalation Worth It?"13 is responding to. The figure below places each system in its primary stream and draws an edge wherever two of them share a signal, a cache, or a benchmark.
Before naming the routers, we should name what they read. A router is a function from signal to model selection. Different signals carry different amounts of information and have different acquisition costs. The 2026-era routers organize themselves not by their architecture but by which signal they trust.
Seven signals matter. They form a rough hierarchy of information richness × acquisition cost:

1. Query text: always available and nearly free, but the weakest discriminator at the frontier.
2. Query embedding plus past per-model performance (the kNN baseline).
3. Internal activations read during the prefill pass.
4. Draft confidence: token log-probabilities of a cheap generation.
5. Step-level confidence trajectories between reasoning steps.
6. Full trajectory state: tool results, interaction history, remaining budget.
7. Prefix-cache and serving-stack state: KV block location, queue depth, worker load.
Signals 3 through 7 are stronger than signals 1 and 2 in their respective regimes, but only after the objective is fixed. Cache state is a cost and latency signal, not a correctness signal. A confidence trajectory can be useful for escalation, but it is not the same as a calibrated probability of final task success. Query text alone, in May 2026, is a baseline rather than a competitive input for frontier model selection.
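Stated as a type, the abstraction underneath every system in this survey is small. A minimal sketch, with class and field names chosen here for illustration rather than taken from any one paper; the table that follows fills in the objective and failure-mode columns per layer:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RoutingSignals:
    # Signal numbering follows the hierarchy above; all fields illustrative.
    query_text: str                              # 1: always present, cheapest
    query_embedding: Optional[list] = None       # 2: one embedding call away
    prefill_activations: Optional[list] = None   # 3: requires model internals
    draft_logprob: Optional[float] = None        # 4: paid for by a cheap generation
    step_confidences: Optional[list] = None      # 5: exists only mid-trajectory
    trajectory_state: Optional[dict] = None      # 6: agent loops only
    kv_overlap_fraction: Optional[float] = None  # 7: cost/latency, not correctness

@dataclass
class RoutingDecision:
    model: str              # which model invocation handles the query
    reason: str             # the auditable trace the compliance section below requires
    estimated_cost_usd: float

Router = Callable[[RoutingSignals], RoutingDecision]
```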
| Router layer | Objective | Primary signal | Failure mode |
|---|---|---|---|
| Query-level model router | quality per dollar | query text, embedding, capability vector | OOD brittleness; public benchmark mismatch |
| Agent-step router | terminal success per dollar | trajectory state, tool results, remaining budget | early cheap mistakes poison later steps |
| Cascade / verifier | avoid unnecessary strong calls | draft answer plus confidence or verifier score | generation tax; verifier false negatives |
| Cache-aware scheduler | TTFT, goodput, serving cost | prefix hash, KV location, queue depth | quality ignored unless combined with model routing |
| Gateway / fallback | availability, policy, compliance | provider health, region, tenant rules | hidden model changes alter quality and style |
| Self-router | decentralized competence selection | self-assessment, logprob, activations | miscalibration and confident hallucination |
The EU AI Act's general-purpose-model transparency obligations are in force, and US enterprise procurement explicitly asks for region-pinned audit logs. The Digital Omnibus of 2026-05-07 pushed the Annex III high-risk-system deadline out to 2026-12-02, but the GPAI obligations (Art. 53, 55) have been live since 2025-08-02 and are what most enterprise routing decisions trip on firstLA-21. Compliance needs a single audit trail. The gateway is the only component that sees every model invocation, every cache hit, every fallback hop — so it should own that log. Otherwise auditors are joining logs across three vendors and two clouds.
The Act is not just a logging requirement; specific articles directly constrain which model can run where, on whose data, with what disclosure. The mapping the gateway has to enforce:
| Article | Obligation | Routing decision it constrains |
|---|---|---|
| Art. 9 | Risk management lifecycle for high-risk systems | Provider must have documented risk governance — narrows the eligible model pool per workload class. |
| Art. 10 | Data governance, documented training-data lineage, prohibited-data exclusion | Model selection must exclude providers without published training-data summaries. |
| Art. 13 | Transparency to deployers, instructions for use | Routing response must carry model card, version, and provider identity through to the deployer. |
| Art. 14 | Human oversight, override, intervention | Router must expose abstain/escalate as typed outputs and route to a model that supports stop/pause tokens. |
| Art. 15 | Accuracy, robustness, cybersecurity per NIST baseline | Disallow models below the workload's accuracy floor; enforce non-CLOUD-Act geography for sensitive workloads via fallback chains that cannot cross zones. |
| Art. 50 | Disclosure of AI interaction; marking of synthetic content | Inject AI-system identification at response time; flag generated content before it reaches the end user. |
| Art. 53 | GPAI provider documentation obligations (Annex XI) | Route only to GPAI providers with published model cards, training summaries, evaluation results. |
| Art. 55 | Systemic-risk GPAI: adversarial testing, 24-hour incident reporting | Gate use of systemic-risk GPAI on the organisation's ability to meet the 24-hour reporting SLA. |
Several vendor primitives gate region pinning today. Azure Foundry exposes routing.models as a subset selector for compliance and data residency93, and Foundry's data-zone guarantees plus zero-prompt-storage policy25 are Microsoft's anchor against GDPR-extension auditors. AWS Bedrock ships Cross-Region Inference with SCP-policy enforcement98, so the data-zone boundary is enforced at the Control-Tower layer rather than asserted by the application. Vertex AI applies VPC Service Controls to context caches92, which means the residency of a cached prefix is bounded by the same network perimeter as the model endpoint. NVIDIA Dynamo exposes prefix_id for KV-cache pinning and latency_sensitivity as a routing hint for SLO-band selection97. Anthropic's cache-isolation granularity — per-API-key versus per-organization — is not disclosed in current docs, which leaves a common procurement question unanswered, and Anthropic-direct has no documented EU-resident inference path as of 2026-05-08; Anthropic's announced EU Sovereign offering is on the roadmap, not in production. Among gateways, Requesty (2026-05) and Kong AI Gateway (2025-11) are the only two with published article-by-article AI Act mappings114,115; AWS, Azure, Vertex, OpenAI ship building blocks (region pinning, guardrails, zero retention) but leave the article-to-decision mapping as homework for the integrator.
A logging convention that the Pharos Production 2026 synthesis66 finds across most viable production systems: every dispatch carries a small fixed set of fields — request id, routed model, routed region, cache-hit bit, routing-signal value, cost, latency — and the log lands in a per-tenant data zone matching the user's residency contract rather than a global pool. The point Pharos draws out from 25-plus production systems is that compliance, routing, and observability only converge cleanly at the gateway layer; anything that tries to bolt them together higher up tends to grow disagreements between vendors and clouds.
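A minimal sketch of that record as code, assuming nothing beyond the fields Pharos lists. The writer callback and exact field names are illustrative:

```python
import json, time, uuid

def log_dispatch(model: str, region: str, cache_hit: bool,
                 routing_signal: float, cost_usd: float, latency_ms: float,
                 tenant_zone_writer) -> str:
    """Fixed-field per-dispatch record; the convention is the field set, not the schema."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "routed_model": model,             # e.g. "claude-sonnet-4.6"
        "routed_region": region,           # must match the tenant's residency contract
        "cache_hit": cache_hit,
        "routing_signal": routing_signal,  # the value the router actually acted on
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    }
    # Land in the per-tenant data zone, never a global pool.
    tenant_zone_writer(json.dumps(record))
    return record["request_id"]
```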
Aside — the ngrok 2026 verdict62 is the cleanest summary of where this lands: agentic workflows in regulated industries make the gateway non-negotiable. By the time an agent has fired 20–50 LLM calls per user action, there is no version of "we will figure out the audit log later" that survives a compliance review.
Through 2025 the dominant question was which classifier predicts which model wins. By April 2026 that's the wrong question. RouteLMT11 (ACL 2026 Industry, 2026-04-24) and TRACER34 (2026-04-16) land within days of each other and arrive independently at the same conclusion: predict the marginal gain Δ = Q_large − Q_small, not the absolute quality Q_large. RouteLMT does it with an in-model LoRA adapter probing the small model's prompt-token representations — no external classifier, no hypothesis decoding. TRACER does it with a lightweight surrogate trained on production logs and a parity gate at threshold α: the surrogate answers only when its agreement with the teacher exceeds α; otherwise the query escalates. On a 77-class intent benchmark the surrogate covers 83–100% of traffic, and on one 150-class workload it fully replaces the teacher.
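Both decision rules are simple enough to state directly. A sketch of the two shapes, with every function name and threshold invented here for illustration:

```python
def route_with_parity_gate(query, surrogate, teacher, agreement_estimator, alpha=0.9):
    """TRACER-shaped rule (our naming): the surrogate answers only when its
    estimated agreement with the teacher clears the parity threshold alpha."""
    p_agree = agreement_estimator(query)   # lightweight model trained on production logs
    if p_agree >= alpha:
        return surrogate(query)            # cheap path
    return teacher(query)                  # escalate

def route_on_marginal_gain(query, small, large, delta_predictor, tau=0.05):
    """RouteLMT-shaped rule (our naming): call the large model only when the
    predicted delta = Q_large - Q_small exceeds a tuned threshold tau."""
    if delta_predictor(query) > tau:
        return large(query)
    return small(query)
```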
The other learned-router story is NVIDIA's Prefill Activations Router1, posted on 2026-03-21. Rather than reading the query text, it reads the model's internal hidden states during the prefill pass. The Encoder-Target Decoupling trick lets an open-weight encoder (Llama-3) produce routing features for a closed-source target (GPT-5); a SharedTrunkNet MLP predicts correctness probabilities across all candidates simultaneously. Headlines: 45.58% of the gap to oracle closed, 74.31% cost saved versus always-call-the-largest. The implication is unsettling and, in retrospect, obvious: the best signal for routing was inside the model the whole time.
Four more pieces fill in the picture. MTRouter27 (ACL 2026, 2026-04-26) is the first router to encode full multi-turn interaction history as the routing signal, surpassing GPT-5 quality on ScienceWorld at 58.7% lower cost (and 43.4% on HLE). DialRouter31 uses Monte Carlo Tree Search over dialogue branches to discover an emergent specialization by turn position — some models are better at early-turn context-setting, others at mid-dialogue reasoning, others at final-turn summarization, a structure single-turn routers cannot see. BayesianRouter32 fuses an offline Bradley-Terry head with online Thompson sampling for reward-model routing, and ParetoBandit33 closes a long-standing gap by enforcing a dollar-denominated cost ceiling in closed loop — budget compliance within 0.4% across a 530× cost range, with a cold-started new model integrated in roughly 142 steps.
The most-cited industry pilot is RouteNLP35 (ACL 2026 Industry, 2026-04-26): an 8-week deployment at a customer-service division processing ~5,000 queries / day. The novel piece is a distillation-routing co-optimization loop — failure clusters from the router feed targeted distillation of the cheap models, which then feeds the next router-retraining round. Result: 58% cost cut, 91% response acceptance, and p99 latency from 1,847 ms to 387 ms. Routing is no longer a one-time engineering decision; it is a continuous loop coupled to the retraining pipeline.
The negative controls matter as much as the wins. LLMRouterBench110 reports that top routers can cut cost at matched best-single performance, but also that several recent and commercial routers fail to beat a simple best-single baseline under a unified harness. Unsolvability Ceiling108 shows why: judge noise, truncation, parse failures, and unsolvable-label artifacts can manufacture routing opportunity. A routing number is only comparable when the model pool, task distribution, judge, token budget, and retry policy are the same.
The embarrassing counter-finding belongs to an EACL 2026 short paper9: a plain kNN average over past per-model performance, indexed by query embedding, matches the sophisticated learned routers across SPROUT, RouterBench, LiveBench, BigGenBench, and EmbedLLM with 1% of the training data, out of distribution. Heuristics aren't dead; they were under-leveraged.
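The baseline is worth having as code because it is so small. A sketch of the pattern as we read it, not the paper's exact recipe:

```python
import numpy as np

def knn_route(query_emb: np.ndarray,
              past_embs: np.ndarray,    # (N, d) embeddings of past queries
              past_scores: np.ndarray,  # (N, M) observed quality per model
              costs: np.ndarray,        # (M,) price per call
              k: int = 32,
              lam: float = 0.0) -> int:
    """kNN-over-past-performance baseline: average each model's observed
    quality over the k nearest past queries, then pick the best
    quality-minus-cost tradeoff (lam=0 recovers pure quality routing)."""
    sims = past_embs @ query_emb / (
        np.linalg.norm(past_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    nearest = np.argsort(-sims)[:k]
    expected_quality = past_scores[nearest].mean(axis=0)  # (M,)
    return int(np.argmax(expected_quality - lam * costs))
```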
We do not need a more complex model. We need to predict the right quantity. — paraphrase, RouteLMT, ACL 2026
Khayyam's five-question framework87 — logic, data, time, explainability, resource — is the cleanest way to decide whether a router should be rules-based or ML-based, and the empirical claim that anchors it is direct: most product teams at sub-100K daily requests hit the barrier at the data and resource questions before they hit any architectural one. Heuristics are not a placeholder; for that regime they are the answer. The point pairs naturally with the EACL kNN result9: most teams should keep routing rule-based or kNN-based until scale and labeled data justify training anything more elaborate.
The reference implementation that comes up most often is the LiteLLM Complexity Router pattern. Four tiers — SIMPLE → MEDIUM → COMPLEX → REASONING — with under 1 ms of routing overhead and no external dependency. The natural comparison is the Semantic Auto Router, which depends on an embedding API and pays 100 to 500 ms per decision for the round-trip. FIG. 15 lays out the latency cost across mechanisms. No learned-router architecture currently comes within an order of magnitude of the complexity router's decision latency, so the real tradeoff is whether per-query semantic routing is worth the embedding cost — and below roughly 100K daily requests, it usually isn't.
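A rules-only tier classifier of this shape fits in a dozen lines, which is the point. The tier names are from the pattern; the heuristics and model mapping below are illustrative stand-ins, not LiteLLM's implementation:

```python
import re

TIER_MODELS = {  # illustrative mapping, not LiteLLM's
    "SIMPLE": "small-fast", "MEDIUM": "mid-tier",
    "COMPLEX": "large", "REASONING": "reasoning-tuned",
}

def complexity_tier(prompt: str) -> str:
    """Rules-only tiering: no embedding call, microseconds per decision."""
    n_tokens = len(prompt.split())  # crude token-count proxy
    wants_reasoning = bool(re.search(r"\b(prove|derive|step[- ]by[- ]step)\b", prompt, re.I))
    has_code = "```" in prompt or bool(re.search(r"\bdef |class |SELECT\b", prompt))
    if wants_reasoning:
        return "REASONING"
    if has_code or n_tokens > 800:
        return "COMPLEX"
    if n_tokens > 150:
        return "MEDIUM"
    return "SIMPLE"

def route(prompt: str) -> str:
    return TIER_MODELS[complexity_tier(prompt)]
```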
One counter-result worth knowing about is the GIL ceiling. LiteLLM degrades at around 300 to 500 RPS under sustained load — Python's GIL bottleneck — while Bifrost, written in Go, reaches roughly 50× lower gateway latency at 5,000 RPS (11 µs vs hundreds of µs)68. This isn't a critique of learned routing; it's a fact about the runtime of the gateway. Routing-decision cost depends on the implementation language as much as the algorithm.
The Portkey 2026 in-prod aggregate67 (vendor-published, so directional rather than authoritative) reports multi-LLM team adoption jumping from 23% to 40% in ten months across 650-plus teams and 2T-plus tokens, with semantic caching saving an average of ~38% on LLM cost and average tokens per request quadrupling. Read directionally, that's the field shifting from "pick one model" to "route between several" inside a year. The dominant operational pattern at sub-100K-RPM tenants is still rule-based dispatch with a semantic cache underneath, not learned-router-as-a-service.
Aside — the "five to ten simple rules cover 80% of routing needs" claim is real (LogRocket field practitioners; absorbed into ref-87) but it has a sharp edge: that 20% is where the bandit, the kNN baseline, and the capability matrix earn their keep. The complexity router is the floor, not the ceiling.
The dominant production routing unit is still usually the request or session; the research frontier is the step. TRIM3 (LinkedIn / CMU, ICLR 2026) routes between the steps of a single chain-of-thought trace using a process reward model that scores per-step correctness confidence; only the steps likely to derail the solution go to a larger model. The simplest threshold variant already yields 5× cost efficiency on MATH-500 over query-level routing at matched accuracy, and up to 6× on AIME. The TRIM authors phrase the insight cleanly: "expensive calls confined to precisely those steps where stronger models prevent cascading errors."
ConfSpec4 (arXiv:2602.18447) exploits a different asymmetry: generating a correct step is hard, but verifying one is a constrained discriminative task that the small draft model handles well within its competence range. ConfSpec produces both the step and a confidence score for it; high-confidence steps are accepted, low-confidence steps are escalated to the target model. The framework is explicitly orthogonal to token-level speculative decoding, and the two compose multiplicatively for 2.24× end-to-end speedup at no quality loss — a rare stackable result.
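The control loop TRIM and ConfSpec share reduces to a few lines. A sketch under our own simplifications: the confidence source is a PRM score in TRIM and draft self-confidence in ConfSpec, and the names and stop convention here are ours:

```python
def solve_stepwise(problem, draft_model, target_model, confidence,
                   max_steps=32, threshold=0.8):
    """Step-level cascade sketch: the draft model proposes each reasoning step;
    low-confidence steps are regenerated by the target model before the
    trace continues, confining expensive calls to the risky steps."""
    trace = [problem]
    for _ in range(max_steps):
        step = draft_model(trace)                # propose the next step cheaply
        if confidence(trace, step) < threshold:  # PRM score or draft self-confidence
            step = target_model(trace)           # escalate just this step
        trace.append(step)
        if step.strip().startswith("ANSWER:"):   # illustrative stop convention
            break
    return trace
```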
CascadeDebate12 (2026-04-14) inserts a third option at the escalation boundary — deliberate. When the confidence router flags uncertainty, instead of immediately escalating it activates a lightweight ensemble of agents at the same scale to debate; only failed deliberation triggers escalation. Across five benchmarks, CascadeDebate reports +26.75% accuracy over strong single-model cascades; an online threshold optimizer alone contributes another 20.98–52.33% relative gain over fixed-threshold policies. In practice this means static thresholds are brittle under production distribution shift.
The granularity goes finer in two directions. Confidence Leaps37 (EACL 2026) shows that confidence is non-monotonic: it stays flat and then spikes at a discrete moment of insight that's detectable by a token-entropy drop. The trace up to the leap transfers across model families as a prefix, which hints at a class of reasoning-transfer protocols nobody has built yet. KAD36 (Paris-Saclay / INSA Rennes, EACL 2026) frames per-token deferral as a 0–1 knapsack with primal and dual approximations; the dual gives an adaptive threshold that tightens or loosens with the budget's shadow price. Saguaro39 (Stanford / Princeton / Together AI, 2026-03-03) does an analogous trick at the token level on the speculation side, predicting the verification outcome during verification and pre-computing speculations for each anticipated result — 30% faster than optimized speculative-decoding baselines, up to 5× over autoregressive.
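The knapsack framing has a compact greedy approximation worth seeing. A sketch, not KAD's algorithm: the primal is approximated by a ratio-greedy pass, and the last admitted ratio plays the role of the dual's shadow-price threshold:

```python
def knapsack_deferral(gains, costs, budget):
    """Per-token deferral as a 0-1 knapsack (greedy sketch): defer the tokens
    with the best expected-gain-per-cost ratio until the escalation budget
    is spent. Assumes strictly positive costs."""
    order = sorted(range(len(gains)), key=lambda i: gains[i] / costs[i], reverse=True)
    chosen, spent, threshold = set(), 0.0, 0.0
    for i in order:
        if spent + costs[i] > budget:
            break
        chosen.add(i)
        spent += costs[i]
        threshold = gains[i] / costs[i]  # tightens as the budget binds
    return chosen, threshold             # threshold ~ the dual's shadow price
```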
A counter-paper landed on 2026-05-07, one day before this report goes out. "Is Escalation Worth It?"13 gives a decision-theoretic characterization of two-model cascades using constrained optimization and Lagrangian duality. The cost-quality frontier turns out to be piecewise concave on decreasing-benefit regions of the confidence support; for a pool of k models, the achievable frontier is the pointwise envelope of all C(k,2) pairwise cascades. Validated across MATH, MMLU, TriviaQA, SimpleQA, and LiveCodeBench with eight models from five providers, the result is sobering: a lightweight pre-generation router beats the best cascade policy on four of five datasets. The reason is structural — cascades pay the cheap model's generation cost before the escalation decision is made, and that generation tax is unavoidable in a cascade no matter how thresholds are tuned. Pre-generation routers bypass it.
Reading these alongside Routing, Cascades & User Choice38 (2026-02-10, U Ottawa / NVIDIA), which models routing as a Stackelberg game between provider and user (see §12 for the build-side implication), gives two papers with opposite normative conclusions but the same underlying observation: cascading isn't free.
Worth flagging: Diminishing Returns of Early-Exit40 (2026-03-24) argues that modern LLMs trained with improved recipes have less early-exit potential than older ones. Dense transformers retain more layer redundancy than MoE or SSM architectures, alignment training homogenizes later layers, and base-pretrained models above 20B still have early-exit headroom but their fine-tuned descendants do not. The 2026 frontier is later in the stack, not earlier.
Routing without a verifier is optimism with extra steps. — paraphrase, CascadeDebate, 2026-04-14
Capability matrices stopped being a thought experiment in April. Topaz5 (Georgia Tech, CHI 2026 HCXAI Spotlight, 2026-04-04) builds an explicit M[model × skill] matrix by synthesizing public benchmark performance across diverse tasks, and routes each sub-task in an agentic workflow to the matching specialist. The fact that this lands at CHI rather than NeurIPS is itself meaningful: interpretable routing is becoming a user-experience requirement rather than only an engineering one. Topaz's routing decisions produce full execution traces showing how skill-match scores were weighted against cost, with developer-facing natural-language explanations generated automatically — so an operator can audit why GPT-4o was chosen for step 3 but Haiku for step 7. The system also avoids the cold-start problem most matrix routers run into: it bootstraps from public benchmarks rather than learning a proprietary profile from interaction history, so it's deployable on day one without any labeled routing data.
RouteProfile6 (arXiv:2605.00180, 2026-04-30) treats LLM profiling as an independent research problem and formalises a four-axis design space: organisational form (flat vector vs. structured graph), representation type (discrete benchmark scores vs. dense learned embeddings), aggregation depth (domain-level vs. query-level signals), and learning configuration (fixed vs. trainable). Across three router families — classifier, preference-based, embedding-based — structured profiles (GNN-based) consistently win, query-level signals beat domain-level proxies, and trainable structured profiles generalize best when a novel LLM joins the pool mid-deployment. The reframing is what carries: before asking which router, RouteProfile argues you should ask which profile format.
Dimension-Direct Routing14 (Research Square, 2026-04-07) takes the strongest matrix-first position: a 12-model × 15-dimension capability matrix replaces the kNN-over-embeddings step entirely. The router predicts which of 15 dimensions the query primarily needs and then looks up the matrix; LLM-as-Judge across four quality dimensions reports +25.9% depth and +17.4% completeness over the embedding baseline. The paper is unusually honest about failure modes: semantic accumulation bias (the router over-weights the last few turns in long conversations and misidentifies the primary capability needed) and cross-domain routing instability (queries on the boundary of two dimensions trigger thrashing between specialists). Explicit matrices are explainable, but fragile — they break when the world adds an axis the matrix doesn't have.
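The lookup step itself is nearly a one-liner over the matrix, which is why the approach is so auditable. A sketch with our own naming, not the paper's code:

```python
import numpy as np

def dimension_direct_route(query_dim_probs: np.ndarray,  # (D,) predicted capability needs
                           M: np.ndarray,                # (models, D) capability matrix
                           costs: np.ndarray,            # (models,) price per call
                           lam: float = 0.0) -> int:
    """Matrix-lookup routing sketch: score each model by how well its
    capability row covers the predicted dimension mix, trade off against
    cost. Every score is inspectable, which is the explainability claim."""
    fit = M @ query_dim_probs  # (models,)
    return int(np.argmax(fit - lam * costs))
```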
The axis set itself has expanded in two directions. Route-To-Reason29 (USTC, ACM WWW 2026, 2026-04-12) is the first system to embed reasoning strategy — chain-of-thought, tool-use, few-shot, direct — as a first-class capability axis alongside model identity. RTR learns dense vectors of (model, strategy) pairs jointly and selects a strategy-model bundle, reporting 60% cost reduction by matching lightweight models with cheap strategies for simple queries. The embedding space surfaces something a flat model × benchmark matrix collapses: some models have narrow strategy "comfort zones" (excel at chain-of-thought, degrade with direct answers) while others are strategy-agnostic. SkillRouter30 (2026-03-23) addresses the dual problem at the tool layer. As agent skill registries grow past 80,000 entries, exposing all skills at inference becomes infeasible, but exposing only names and short descriptions hides the implementation details that turn out to be the routing signal. Hiding the full skill body costs 31 to 44 percentage points of routing accuracy across architectures; SkillRouter retrieves and reranks over the full skill body and hits 74.0% Hit@1 with 13× fewer parameters and 5.8× faster than the strongest base pipeline.
The orchestration-layer take on the same problem is GraphPlanner28 (UIUC, ICLR 2026, 2026-04-26), an RL-trained MDP that selects both an LLM backbone and an agent role (planner, executor, summarizer) at each step, over a heterogeneous graph (GARNet) that captures interaction memories across queries, agent instances, and responses. Reported numbers: +9.3% accuracy across 14 tasks, with GPU memory dropping from 186.26 GiB to 1.04 GiB.
The surveyed vendor primitives do not expose modality as a first-class routing knob. No major model host (Bedrock, Foundry, Vertex AI) offers a "this is a vision query, route differently" primitive at the model-selection layer: AWS Bedrock supports vision on Claude and Titan-Multimodal but doesn't surface modality-specific routing; OpenAI, Anthropic, and Google Gemini all serve vision through the same endpoint with no modal-aware dispatch. Gateway products like OpenRouter's video routing (§11) sit one level up — they pick a provider, not a model within a provider, which is a different layer of the routing stack. The only formal evaluation harness for routing across modalities is MMR-Bench (§18, LA-15), and that's pre-window. Modality today is an attribute the routing layer infers from the request payload, not a knob the model host exposes.
On document-conversion workloads (PDF→Markdown is the canonical example), Dynamo's multimodal worker selection15 picks the worker holding the highest cache overlap including image content blocks, so consecutive pages of the same document route to the same worker — the cache-locality argument from §09 reapplied across modalities. The Dynamo osl hint97 (expected output sequence length) lets the scheduler reserve decode capacity for the long Markdown output that a PDF page produces. Vertex's context-caching92 is agnostic to modality and survives cross-modal calls; the cache resource is the same whether the cached prefix is text, image, or both. None of these primitives does modal-quality dispatch — there is no "send vision-only queries to model A, vision+text to model B" — only cache-locality and SLO-hint routing.
A bet in §15 (BET 07) is that this gap closes by mid-2027 — a modality-dispatch knob landing as a first-class API primitive in at least one of AWS Bedrock, Azure Foundry, or NVIDIA Dynamo. Until then, the multimodal routing surface is constructed at the application layer from the primitives above.
While Topaz and RouteProfile invest in centralized capability matrices, DiSRouter7 (Shanghai Jiao Tong / Kai Yu lab, ICLR 2026, 2026-04-22) rejects the premise. Each LLM in the fleet undergoes Self-Awareness Training — a calibration objective that teaches the model to score its own competence on a query within a trustworthy range — and then a distributed protocol lets the most-confident model take the query. There is no central matrix to maintain, no benchmark suite to run on every fleet expansion. Adding a new model means training the new model, not retraining the router. The hard condition is calibration: without it, self-assessment routes fluent overconfidence rather than competence.
Architecturally, this is the cleanest answer this quarter to a problem central-matrix designs tend to understate: the maintenance burden compounds with every model. RouteProfile's own design-space study makes the point indirectly — profile choice is not free, structured profiles need ongoing curation, and central matrices encode capability information about models that change every week. DiSRouter pushes that information into the models themselves, where it is updated each time the model is fine-tuned anyway.
The tradeoff has teeth. Explainability gets harder when the routing decision lives inside ten different models' hidden states. A miscalibrated self-assessor causes silent quality regressions with no external audit trail. And cross-organizational coordination — when no single party owns all the models in the fleet — becomes the bottleneck the central matrix used to absorb. The likeliest 2026–2027 outcome is hybrid: a sparse central matrix capturing broad archetypes, with distributed self-assessment within each archetype for fine-grained selection. GraphPlanner's28 heterogeneous graph, which captures both model-level and role-level capability structure, already gestures at that convergence — and §15's predictions take the bet that at least one frontier lab ships a self-routing fleet in production before the end of 2026.
DiSRouter7 argues that each model can emit a calibrated self-assessment and the most-confident model can take the query, without a central matrix to coordinate them. Appendix A.9 makes the argument that matters here: "high-capability closed-source LLMs (e.g., GPT-4) may already possess strong intrinsic self-awareness and can be integrated without additional training." If that's true, a router can be built from frozen signals already exposed at inference, with no fine-tuning required.
Four such signals appear in the literature, ordered by how much evidence supports them. Average token log-probability reaches AUROC 0.65–0.83 in-distribution and 0.72–0.83 out-of-distribution10; it's a property of the generation rather than the query, which means it generalizes across query distributions. A GSA v3 single-token YES/NO probe reaches AUROC 0.56–0.72 at about 500 ms latency10 and can run before the cheap generation as a pre-generation filter. Self-consistency spread across N samples is more accurate but pays N× the cost, which limits it to high-stakes spot checks. Activation probes — linear probes on residual-stream activations — reach AUROC above 0.70 (§18, LA-16), with the NVIDIA Prefill Activations Router1 as the in-window production-grade cousin.
One vendor constraint to flag: Anthropic does not expose raw logprobs at the time of writing, which means the logprob signal is unavailable on Claude calls. That's a routing-architecture limitation rather than a model-quality one, and the §12 semver-lie failure mode74 compounds it — even when logprobs are exposed elsewhere, they shift across minor snapshot bumps.
FIG. 16 walks the deployment topology. An incoming query first hits a pre-generation filter (the GSA v3 probe). If the probe's confidence is below threshold_low, the query routes directly to the strong model and the cheap path is skipped entirely. Otherwise the cheap model generates a response and emits per-token log-probabilities. The average logprob is compared to a calibrated threshold: a high average returns the cheap response, a low one escalates to the strong model. On a generic enterprise workload the expected operating point is around 94% of strong-model quality at roughly 18% reduction in expensive calls; on TriviaQA-shaped data, where the cheap model is more accurate than the strong/weak prior typically assumes, savings are larger10. Two tradeoffs sit underneath these numbers. The cheap generation is paid for before the escalation decision, so the cost floor is non-zero. And the logprob threshold drifts with every snapshot bump, so it needs recalibration against a held-out canary at each transition.
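FIG. 16's topology, as code. The thresholds are illustrative and, per the drift caveat above, need recalibration against a held-out canary at every snapshot bump:

```python
def shadow_self_route(query, probe, cheap_model, strong_model,
                      threshold_low=0.3, logprob_floor=-0.35):
    """Pre-generation probe first, then a logprob gate on the cheap generation.
    Both thresholds are placeholders to be calibrated per workload."""
    if probe(query) < threshold_low:           # GSA-style YES/NO probe, ~500 ms
        return strong_model(query)             # skip the cheap path entirely
    text, token_logprobs = cheap_model(query)  # cheap generation with logprobs exposed
    avg_logprob = sum(token_logprobs) / max(len(token_logprobs), 1)
    if avg_logprob >= logprob_floor:           # calibrated on a held-out canary set
        return text
    return strong_model(query)                 # escalate; the cheap generation is sunk cost
```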
A related idea worth surfacing: Tian Pan88 argues that abstention should live in the router as a typed output — { answer: string } | { abstain: { reason, missing_information } } — rather than as a system-prompt rule. A typed abstention unlocks four downstream routing decisions: escalate to a stronger model, escalate to a tool, escalate to a human queue, or ask the user for clarification. DiSRouter's Self-Awareness Training implements the same idea on the model side; the typed-output contract makes it composable across a multi-model topology without baking it into a prompt.
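The contract is small enough to write down. A Python rendering of the typed output; the dispatch policy below is our illustration of the four downstream moves, not Pan's code:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Answer:
    answer: str

@dataclass
class Abstention:
    reason: str
    missing_information: str

RouterOutput = Union[Answer, Abstention]

def dispatch(out: RouterOutput, escalate_model, escalate_tool, escalate_human, ask_user):
    """A typed abstention is routable; a prompt-level 'I can't answer' is not.
    The branch conditions here are an illustrative policy only."""
    if isinstance(out, Answer):
        return out.answer
    if out.missing_information:
        return ask_user(out.missing_information)   # clarify with the user
    if out.reason == "needs_tool":
        return escalate_tool(out)                  # hand to a tool
    if out.reason == "high_stakes":
        return escalate_human(out)                 # human review queue
    return escalate_model(out)                     # stronger model
```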
One failure mode worth being explicit about: a model can be confidently wrong. High log-probability is not the same as correctness — RLHF-trained models often hallucinate at high confidence — and a shadow self-router will return those confident hallucinations at the cheap tier. SMARTCAL109 is the relevant tool-use warning: self-aware tool use improved only after explicit recalibration reduced overconfidence and tool abuse. Self-assessment is a routing primitive, not an audit primitive. Anything deployed on top of it still needs a separate verification layer (parity gates, eval harnesses, human review on regulated workloads) to catch the confident-wrong case.
The most consequential infrastructure story of the quarter is that the prefix-cache became a primary serving signal in production stacks. It is not a substitute for quality routing; it changes the cost and latency terms that a model router should optimize. vLLM's KVEvents API standardized the surface — every cache block allocation and eviction emits an event with the block hash, worker id, tier, and action — and a half-dozen routers built on top of it. NVIDIA Dynamo 1.0 GA15 (2026-03-16, production-ready 2026-04-17) ships a cluster-wide Flash Indexer that maintains a real-time map of which KV blocks live on which workers, updated by KVEvents emissions, at 170M ops/s with sub-millisecond lookups. The KV-aware router weights queue depth and KV-cache overlap fraction; an Agent Hints API lets each request declare latency sensitivity, expected output sequence length, and a cache-pinning duration; a multimodal router downloads images, encodes them, and selects the worker with the highest cache overlap including image content blocks. Llama 3.1 on Dynamo + the NeMo Agent Toolkit reports 4× lower TTFT and 1.5× higher throughput.
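The selection rule these stacks share is a weighted score over cache overlap and load. A sketch with our own weights and field names; in Dynamo the overlap comes from the Flash Indexer, in llm-d from KVEvents introspection:

```python
def pick_worker(prefix_blocks: set, workers: list, alpha=1.0, beta=0.5):
    """KV-aware worker selection sketch: favour the worker already holding the
    request's prefix blocks, penalise queue depth. `blocks_held` would be
    kept current by a KVEvents-fed index."""
    def score(w):
        overlap = len(prefix_blocks & w["blocks_held"]) / max(len(prefix_blocks), 1)
        return alpha * overlap - beta * w["queue_depth"]
    return max(workers, key=score)
```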
llm-d ships the same idea inside Google's Kubernetes Inference Gateway. Its Predicted-Latency Scheduler44 (2026-03-13) runs an XGBoost model continuously retrained on Vertex AI with live serving telemetry — input length, output-length estimate, queue depth, KV utilization, prefix-cache hit ratio — and reports −70% P50 TTFT versus heuristic queue-depth routing. The companion Precise Prefix Cache Routing45 guide subscribes to the KVEvents stream so the EPP knows which blocks each worker actually holds at any moment, not which it used to hold; 80–95% cache-hit rates on multi-turn conversation against 12% for random load balancing and 40–60% for hash-routing without eviction awareness.
Ranvier16 (2026-03-16) rebuilds the idea engine-agnostically. The data structure is an Adaptive Radix Tree rather than a hash table, so a prompt that partially matches a cached system prompt routes to the worker that has the most of it — not just exact matches. Benchmarks on 13B models report 79–85% P99 latency reduction over round-robin; the P99 gap is larger than P50 because P99 is exactly where the worst-case re-prefill penalty lives. A single deployable binary, OpenAI-API compatible, no vendor lock-in.
The user-visible quote of the quarter belongs to Augment Prism17 (2026-05-02). Switching models mid-conversation evicts the prompt cache; the next call pays roughly 10× the marginal token cost. A naive per-turn router that switches frequently can be more expensive than always using the frontier model. Prism's solution is a small planner that fires only when it predicts a switch is worth the eviction cost, and it stays sticky across the agent loop — for tool-result follow-ups (about 96% of all chat-host turns) the planner reuses its prior decision and runs on only ~4% of turns. The planner itself costs $0.91 per $2,649 of total spend (0.03% overhead). Net: matches the best individual frontier model on quality at 20–30% lower cost per task. Model-switching has become a first-class cost the router must price.
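The gate Prism's planner implements reduces to one comparison. A sketch under our assumptions; Prism's actual planner also predicts the quality side of the gain, and the 10× figure is the re-prefill premium quoted above:

```python
def should_switch(predicted_gain_usd: float, cached_tokens: int,
                  uncached_price_per_tok: float, cache_discount=0.9):
    """Eviction-cost-aware switch gate: switching models evicts the prefix
    cache, so the next call re-pays the cached prefix at full price
    (cache_discount ~ 0.9 if cached tokens cost ~1/10 of uncached)."""
    eviction_cost = cached_tokens * uncached_price_per_tok * cache_discount
    return predicted_gain_usd > eviction_cost
```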
The frontier of cache-aware routing is PrfaaS43 (Moonshot AI, 2026-04-16). Prefill-decode disaggregation has historically pinned both phases to the same datacenter — sometimes the same rack — because the KV transfer bandwidth is too high. Kimi Linear's hybrid attention shrinks KV cache size enough that cross-DC transfer over commodity Ethernet becomes practical; PrfaaS selectively offloads only long-context prefills to remote compute-dense clusters, monitoring queue depth and Ethernet link utilization to avoid congestion. +54% throughput, −64% P90 TTFT against local-only PD on a 20× scaled-up evaluation cluster.
| System | Date | Mechanism | Headline |
|---|---|---|---|
| NVIDIA Dynamo 1.0 GA | 2026-03-16 | Flash Indexer + 4-tier KV hierarchy + Thompson sampling | 170M ops/s · 4× TTFT |
| llm-d (predicted-latency) | 2026-03-13 | XGBoost TTFT/TPOT predictor in Vertex AI | −70% TTFT p50 |
| llm-d (precise prefix-cache) | 2026-03 | Direct KVEvents introspection | 80–95% hit rate vs. 12% random |
| Ranvier | 2026-03-16 | ART-based prefix router · OSS · engine-agnostic | 79–85% P99 ↓ |
| Augment Prism | 2026-05-02 | Cache-aware coding-agent router; sticky, eviction-cost-aware planner | 20–30% cost ↓ at matched frontier quality |
| PrfaaS (Moonshot AI) | 2026-04-16 | Cross-datacenter KV routing | +54% throughput · −64% P90 TTFT |
The cache-as-router argument of §09 is real, but it ships as a long list of vendor-specific API knobs that a builder has to actually invoke. The table below is the cookbook — every primitive we use, what it gates, when we invoke it, and which routing-or-cache property it controls.
| Vendor | Primitive | What it gates | When we invoke it | Routing / cache role |
|---|---|---|---|---|
| OpenAI | prompt cache | per-prefix automatic cache | every request >1024 tokens | break-even at first cache hit86 |
| Anthropic | cache_control | explicit cache breakpoints · 5m/1h TTL | system / tool / sliding-conv boundaries | three-breakpoint BP1/BP2/BP3 architecture91,84 |
| Anthropic | max_tokens: 0 | cache pre-warm without billing output | pre-deploy warm-up of long system prompts | amortise 25% write premium91 |
| AWS Bedrock | cross-region inference | capacity-aware geographic routing | compliance-band traffic | ~10% savings · SCP-gated98 |
| AWS Bedrock | Intelligent Prompt Routing | vendor-managed model selection | baseline option for generic workloads | OpenAI-compatible cross-provider endpoint46 |
| Azure Foundry | routing.mode | balanced / cost / quality preset | tenant-level routing-policy default | data-residency via routing.models subset93 |
| Google Vertex | cachedContent | explicit cache resource · TTL | long-context Gemini calls | 90% discount on 2.5+92 |
| Google Vertex | VPC Service Controls | data-residency on cache resources | EU and regulated tenancy | compliance is a routing decision92 |
| Portkey | strategy: fallback | ordered provider list with retry | edge gateway · all traffic | three-tier fallback chain24 |
| Portkey | strategy: loadbalance · hash_fields | deterministic hash-based dispatch | session-affinity workloads | cache-locality preservation |
| Portkey | strategy.conditions.query | MongoDB-style metadata routing | per-request policy override | conditional model selection94 |
| OpenRouter | X-OpenRouter-Cache | gateway-layer exact-match cache · 300s | idempotent / repeated traffic | HIT/MISS observability header95 |
| Vercel | providerOptions.gateway.models | ordered fallback array | edge ai-gateway · per request | modelAttempts response metadata96 |
| NVIDIA Dynamo | nvext.agent_hints | latency_sensitivity · osl · priority | every backend dispatch | routing hints into the scheduler97 |
| NVIDIA Dynamo | prefix_id · CachePinType | KV-cache pinning per prefix | PDF→MD page sequences | cache locality is the routing key97 |
| LangChain | Command(goto=…) · Send(…) | single-agent dispatch · parallel fan-out | LangGraph application layer | routing is code-explicit, not LLM-decided99 |
| LlamaIndex | RouterQueryEngine | LLMSingleSelector · PydanticMultiSelector | application-layer retrieval | JSON or function-call selection100 |
| Anthropic | Managed Agents · session | brain/hands decoupling · session pin | multi-turn agent workloads | session-level cache and routing26 |
The procurement-honest answer: no vendor publishes the added latency of its routing step — Azure Model Router's LLM-inference overhead, Bedrock IPR prediction time, cache-lookup latency on the cache-resource APIs. The build position is unchanged: every primitive we invoke is gated by our own measured-latency dashboard, not by the vendor's marketing number. This is the cookbook content, not the editorial position; the editorial position remains own the router, integrate the primitives.
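Since the table's fallback primitives all share one shape, here it is once, provider-agnostic. A sketch only: real gateways distinguish retryable errors (429s, timeouts) from hard failures, which this elides, and the attempt trail mimics the modelAttempts-style metadata some of them return:

```python
import time

def call_with_fallbacks(prompt, providers, max_retries_per=1, backoff_s=0.5):
    """Walk an ordered provider list, retry transient failures with backoff,
    and record every attempt for the audit log. `providers` is a list of
    (name, callable) pairs in priority order."""
    attempts = []
    for name, call in providers:
        for retry in range(max_retries_per + 1):
            try:
                result = call(prompt)
                attempts.append((name, retry, "ok"))
                return result, attempts
            except Exception as err:  # production code would filter on error class
                attempts.append((name, retry, repr(err)))
                time.sleep(backoff_s * (2 ** retry))
    raise RuntimeError(f"all providers failed: {attempts}")
```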
Multi-objective routing got real this quarter. BOute18 (MLSys 2026 oral, 2026-02-11) treats routing and GPU placement as a single coupled non-convex problem: query routing thresholds under quality and latency constraints, heterogeneous GPU resource allocation, parallelism strategies. A multi-objective Bayesian optimization loop wraps a serving simulator and converges in far fewer trials than RL or grid baselines. Result: up to 157% throughput gain (59% on average), or equivalently 15–61% cost savings (38% on average) at matched quality. No routing table is hand-tuned; the optimizer discovers thresholds for the specific GPU mix and cost target.
AdaServe19 (EuroSys '26, 2026-04-26–30) makes per-request speculative-decoding trees a routing decision. Each request carries an SLO in metadata; AdaServe looks up a pre-profiled SLO-to-tree-config mapping and assigns a customized draft configuration — tighter SLO yields a larger, more aggressive speculative tree, relaxed SLO yields minimal speculation. The routing question becomes "how much speculative compute for this request?", a continuous knob rather than a discrete model selection. 4.3× fewer SLO violations, 1.9× higher goodput against best-performing baselines.
BEAM42 (MLSys 2026 oral) adds energy as a first-class SLO. Running atop vLLM with sub-millisecond event-driven scheduling, BEAM evaluates the energy cost of each candidate batch / worker / DVFS action and picks the lowest-energy action that meets all per-request SLOs. 51% end-to-end GPU energy reduction without violating SLOs.
CEDAR20 (GreenSys @ EuroSys '26, 2026-04-26) puts carbon on the objective list alongside cost and latency. The DRL agent observes real-time grid carbon intensity, spot pricing per endpoint, and queue depth across a multi-region deployment; the reward function explicitly distinguishes marginal grid carbon (the schedulable kind) from average carbon (the uncontrollable kind), which most prior carbon-routing papers conflate. Result: 26% cost reduction and 27% carbon reduction at negligible SLO degradation.
The scheduling-side companion is Kairos41 (2026-05-04), urgency-based SLO scheduling for prefill–decode disaggregation. The prefill side computes urgency = slack_to_TTFT_SLO / request_length and greedily picks the highest-urgency chunks first; this directly solves head-of-line blocking, where a 128K-token request would otherwise starve 8K requests at deadline. The decode side runs slack-guided adaptive batching: when slack is positive, short requests decouple from stragglers and run ahead. At the QPS = 3.0 inflection on Minimax-M2.5, end-to-end SLO attainment goes from 55.8% (DistServe) to 89.6% (Kairos), a +33.8% swing.
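The prefill-side rule is small enough to show. A sketch of the urgency ordering as we read it, with illustrative field names; we treat the lowest slack-per-token as most urgent:

```python
def prefill_order(requests, now):
    """Urgency-ordered prefill sketch: slack-to-TTFT-SLO divided by request
    length, smallest ratio first. An 8K request at its deadline therefore
    runs ahead of a 128K prefill that still has slack, which is the
    head-of-line-blocking fix."""
    def slack_per_token(r):
        slack = r["ttft_deadline"] - now         # seconds before the SLO is blown
        return slack / max(r["prompt_tokens"], 1)
    return sorted(requests, key=slack_per_token)  # most urgent first
```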
Every Pareto plot you see in this literature is shaped by the workload the authors had on hand. The frontier moves when you change the workload. Treat each curve as an instrument, not as ground truth.
The strongest negative result of the quarter came from Google21 on 2026-04-26, consolidating an earlier 180-configuration study52 with field experience from 30+ production deployments51. The verdict: multi-agent collaboration helps on parallelizable, divergent, or cross-context-window tasks, but hurts sequential planning by 39–70%. Independent agents amplify errors 17× as a single mistake propagates through subsequent stages. Hub-and-spoke architectures — one strong orchestrator routing to many specialists — won decisively over open-mesh agent collaboration. The "swarm" framing is dead in production; "dispatch" is the live frame.
The agent-specific routing question is narrower than the multi-agent question: which steps are safe to route down without damaging terminal success? The strongest direct evidence is not from coding-agent product guides but from agent benchmarks. Ares104 routes reasoning effort per step and matches high-effort baselines on several agentic tasks while cutting reasoning tokens by roughly 35–45%. BAAR / BoPO105 trains a budget-aware small/large policy for ScienceWorld, ALFWorld, and AppWorld; it improves the Pareto curve but still trails always-large under tight budgets. Dynamic Mix Precision106 shows the same structure at the precision layer on ALFWorld: many steps can run cheap, but critical steps need the expensive representation. HORIZON107 explains why: most long-horizon failures are process failures, so early routing mistakes can corrupt the whole trajectory.
The six-row step taxonomy below is our synthesis from the routing implications of those four papers together with the role-routing patterns Augment, Cursor 3, and Cline ship; it is not lifted from any single source. The closest published role taxonomies — HuggingGPT's decomposition / execution / aggregation roles and GraphPlanner's Planner / Executor / Summariser28 — cover the endpoints (decomposition and final synthesis) but not the middle of the loop where most of the cost actually lives.
| Agent step | Routing implication | Typical failure if routed too cheap |
|---|---|---|
| Initial decomposition | high-leverage; usually protect with stronger model | wrong subgoal poisons every later action |
| Search / retrieval fanout | cheap and parallelizable when verifier exists | query drift; misses decisive evidence |
| Mechanical edit or extraction | safe to downroute if schema and tests are tight | format errors and retries erase savings |
| Tool-error diagnosis | often needs stronger reasoning; feedback is sparse | loops on the wrong repair hypothesis |
| Compression / summary | cheap model may work, but losses are sticky | drops constraints needed later |
| Final synthesis | depends on user-visible quality and risk | correct trajectory becomes a weak answer |
Coding agents converged on a stable role-routing pattern by mid-April48. Treat this as static role assignment, not evidence of dynamic per-step routing inside one trajectory. The split that ships in production:
| Agent role | Typical model tier | Rationale |
|---|---|---|
| Planning / orchestration | Claude Opus 4.6 | Ambiguous decomposition; errors cascade downstream |
| Code implementation | Claude Sonnet 4.6 | 80%+ SWE-bench with 30% fewer tokens than Opus |
| File navigation / search | Claude Haiku 4.5 | High-frequency, short structured queries |
| Code review | GPT-5.2 | Strong instruction adherence on multi-file diffs |
Augment, Cursor 3 (2026-04-02, parallel agent dispatch with a local-cloud split)49, and Cline ship variations of this and report roughly 51% cost reduction versus monolithic single-model agents. Treat that number as product evidence, not a benchmark result. The danger to avoid is the obvious one: assigning planning to a weak model causes errors in task decomposition that swamp every downstream cost saving.
AG2's four-layer handoff stack22 (2026-03-05) became the most-copied open-source reference architecture. On every agent turn, four mechanisms are evaluated in strict priority order: (1) context-based conditions (deterministic state-variable routing, no LLM call) → (2) LLM-based conditions (the agent's own LLM evaluates handoff predicates as tools) → (3) tool-based handoffs (a tool returns a routing signal) → (4) after-work fallback. The ordering matters: the cheapest routing always fires first, and a well-structured production system executes most decisions at layer 1 — zero LLM tokens, zero added latency.
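The four-layer ordering, as code. A sketch with our own naming; AG2's actual predicates and tool contracts are richer:

```python
def next_hop(agent, state, llm_eval, last_tool_signal, fallback):
    """Evaluate the four handoff mechanisms in strict priority order.
    The cheapest check fires first; a well-structured system resolves
    most turns at layer 1 with zero LLM tokens."""
    # 1. context-based conditions: deterministic state predicates, no LLM call
    for predicate, target in agent.context_conditions:
        if predicate(state):
            return target
    # 2. LLM-based conditions: the agent's own LLM evaluates handoff predicates
    target = llm_eval(agent, state)
    if target is not None:
        return target
    # 3. tool-based handoff: a tool result carries an explicit routing signal
    if last_tool_signal is not None:
        return last_tool_signal
    # 4. after-work fallback
    return fallback
```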
The frontier-lab response was infrastructural. Anthropic Managed Agents26 (2026-04-08) decouples brain from hands: the harness becomes stateless, reading from a durable session log and routing tool calls to whichever sandbox or MCP server is available, while containers (the hands) are provisioned only when actually needed. Three stable interfaces — execute(name, input), emitEvent(id, event), wake(sessionId) — let many brains share hands and many hands answer to one brain. The latency story: p50 TTFT −60%, p95 TTFT −90%, because inference no longer waits for container startup. Claude Code subagents50 (2026-04-30) push the routing surface further into natural language: the parent's description field is the routing rule, no classifier and no routing table; the spawned subagent reads the description, takes its own context window, runs in parallel, and returns. Week 17's forked-subagent flag (CLAUDE_CODE_FORK_SUBAGENT=1) lets the fork inherit the full conversation context for deep parallel exploration.
GraphPlanner28 formalizes the dispatch pattern as an MDP over (model, role) pairs with graph-encoded interaction memory; AgentFloor53 (2026-05-01) is the first agentic-routing benchmark with a 6-tier capability ladder, which gives the field a shared yardstick it has lacked.
The role-routed coding-agent split (planner / implementer / reviewer) is now the consensus production pattern, but it is a working compromise, not a settled design. It rewards teams with a clear architecture and punishes them with three handoff seams where context goes stale, prompts diverge, and small role misclassifications produce large bills. The Slate report names a version of this experience23; in the surveyed work, it is the gap between the role-split's average-case win and its tail-case failures.
Vendor infrastructure became the dominant production story this quarter. The defining M&A story is Portkey: $15M Series A on 2026-02-19 (Bessemer, Uncorrelated Ventures); gateway open-sourced at over 1T tokens/day on 2026-03-24; acquired by Palo Alto Networks on 2026-04-30 (terms not disclosed; press estimates have circulated in the $700M range but are not vendor-confirmed)24. Seventy days from Series A to exit. The deal establishes the AI gateway as a security-critical control plane for autonomous agents, embedded in PANW's Prisma AIRS platform — which is itself the most candid signal that the routing layer has become infrastructure-of-record, not a developer convenience.
Microsoft Foundry Model Router25 (2026-03-18, expanded 2026-04-28) is a trained ML router (not a rules engine) dispatching across up to 18 underlying LLMs — Claude, GPT-5.x, Grok, DeepSeek, Llama, OSS — from a single Azure endpoint. It analyzes prompt complexity, task type, and latency target; honors data-zone boundaries; and stores no prompts. AWS Bedrock46 shipped a cross-provider OpenAI-compatible endpoint on 2026-04-28, putting GPT models on Bedrock alongside Anthropic and Mistral.
OpenRouter's April was a four-step sprint47: video routing on 2026-04-15, Workspaces (per-workspace routing defaults) on 2026-04-22, Agent SDK on 2026-04-24, Response Caching on 2026-04-30. Martian55 reached approximately $1.3B on the secondary market on 2026-04-04, cementing routing-as-a-service as a standalone business category.
The frontier-model labs responded by building routing inside their own products. Anthropic Managed Agents26 (2026-04-08) decouples brain and hands as in §10 (p95 TTFT −90%); Claude Code subagents50 (2026-04-30) use the description field as a natural-language routing rule with parallel fork-join dispatch. Anthropic Adaptive Thinking and Gemini 3 Deep Think56 push routing inside the model itself: the model autonomously adjusts its reasoning budget, bypassing an external router for the easy / hard split. These are production-relevant primitives, but they are not public evidence that dynamic per-step model switching beats a single strong model in mixed tool-use agents. The production pattern is more conservative: request-level model routers, user-visible model pickers, fallbacks, static role assignment, and model-internal effort knobs.
OpenAI's GPT-5 router rollout is the useful cautionary case112. A router can be technically cost-effective and still fail the product contract if users experience opaque changes in quality, personality, latency, or refusal behavior. Production routers need override controls, audit logs, and rollback paths for the same reason they need accuracy metrics.
The OSS layer kept growing in parallel. NadirClaw (3-tier proxy router, 454★) and bitrouter (Rust agent-native proxy, 79★) ship as deployable binaries; Dr.LLM (ICLR '26) released dynamic layer routing59 on 2026-04-24. The pattern is the same at every layer: routing is the value, the model fleet is the commodity.
Most routers degrade not from external attack but from drift inside the system. Four modes have been named clearly enough this quarter to be worth knowing by name. The detectors below are proposed operating checks, not universal thresholds; each workload needs its own baseline, alert budget, and remediation cadence.
A fifth candidate — refusal verbosity and retry-rate shift76 — folds into the calibration row in practice, since the same conformal-replay remediation covers it.
The Pharos field synthesis66 looks at 25+ production systems and lands on a sobering corollary: the eval harness is the only reliable drift detector, regardless of whether the routing layer is bought or built.
A small but consequential thread of work argues that routers should predict counterfactual marginal gain — what the stronger model would have produced if we had called it — rather than absolute model quality. The NeurIPS 2025 paper that introduced the framingLA-2 treats routing as off-policy estimation on observational logs, since production traffic only ever shows you the outcome of the model you actually called.
The in-window descendant is MTRouter27, which learns a joint history-model embedding from logged trajectories and treats each step's routing decision as a regret-minimisation problem. Reported numbers: 58.7% cost reduction vs GPT-5 on ScienceWorld, 43.4% on HLE. The pre-generation analogue is RouteLMT11, which puts a LoRA on the small model's prompt-token representations to predict Δ = Q_large − Q_small rather than Q_large directly. Same shape, different deployment stage.
The framing matters because logged production traffic is observational, not interventional — every log is the outcome of a policy that already biased which model saw which query. Treating routing as causal inference is what makes offline re-fit on logged data legitimate rather than self-confirming.
Three off-policy estimators from the contextual-bandits literature do the work. Let μ(a|x) be the logging policy that actually chose model a on query x, and π(a|x) be the target policy we want to evaluate from logs alone. Let r be the observed reward (terminal-success bit, judge score, or whatever the harness records).
| Estimator | Formula | Property that matters in production |
|---|---|---|
| IPS | V̂_IPS(π) = (1/N) Σᵢ [π(aᵢ|xᵢ) / μ(aᵢ|xᵢ)] · rᵢ | Unbiased under positivity (μ(a|x) > 0 wherever π(a|x) > 0). Variance scales as O(1/μ²), which blows up when the logging policy almost never chose an arm. Standard mitigation is weight clipping at a fixed constant c. |
| SNIPS | V̂_SNIPS(π) = (Σᵢ wᵢ rᵢ) / (Σᵢ wᵢ), wᵢ = π/μ | Self-normalises by the weight sum, trading a small bias for substantially lower variance. Recovers a value in [0, 1] when rewards are. |
| Doubly Robust | V̂_DR = (1/N) Σᵢ [q̂(xᵢ, π(xᵢ)) + (π/μ)·(rᵢ − q̂(xᵢ, aᵢ))] | Unbiased if either the propensity model μ or the reward model q̂ is correctly specified — not necessarily both. When q̂ is even roughly accurate, the residual rᵢ − q̂ is small, which crushes variance even when μ is rough. |
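In code, each estimator is a few lines. A minimal sketch under the notation above, with per-row vectorised inputs (`pi` and `mu` are the target and logging propensities of the logged action, `r` the rewards); the function names and the optional clipping argument are ours:

```python
import numpy as np

def ips(pi, mu, r, clip=None):
    """(1/N) * sum((pi/mu) * r); unbiased under positivity."""
    w = pi / mu
    if clip is not None:
        w = np.minimum(w, clip)   # trade a little bias for bounded variance
    return np.mean(w * r)

def snips(pi, mu, r, clip=None):
    """Self-normalised IPS: divide by the weight sum instead of N."""
    w = pi / mu
    if clip is not None:
        w = np.minimum(w, clip)
    return np.sum(w * r) / np.sum(w)

def dr(pi, mu, r, q_logged, q_target):
    """Doubly robust: reward-model baseline plus IPS-weighted residual.
    q_logged = q-hat(x_i, a_i) for the logged action;
    q_target = the reward model's estimate under the target policy."""
    w = pi / mu
    return np.mean(q_target + w * (r - q_logged))
```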
The gap in the routing literature is visible from this angle: of the 2025–2026 LLM-routing systems surveyed — FrugalGPT, RouteLLM, LLM-Blender, Smoothie, Router-R1, MTRouter, RouteLMT, BEST-Route, LLMRouterBench — none use IPS, SNIPS, or DR as the training objective. Routers are typically fit with supervised ERM on logged outcomes or with RL on a heuristic reward, both of which inherit the logging policy's bias. The Tsiourvas et al. ancestorLA-2 is the one paper in the lineage explicitly framed against this gap; nothing in-window has closed it.
To make the variance behaviour concrete: imagine a logged routing dataset of N=200 queries, 100 routed to a cheap arm A and 100 to an expensive arm B, under a logging policy μ(A)=0.80, μ(B)=0.20. (The equal realised counts are fixed by construction to keep the arithmetic legible; a real log under this μ would skew towards A.) Observed rewards: r̄_A = 0.65, r̄_B = 0.80. We want to estimate the value of a target policy π that flips to a uniform 50/50 split.
| Estimator | Calculation | Value |
|---|---|---|
| V̂_IPS | w_A = 0.5/0.8 = 0.625; w_B = 0.5/0.2 = 2.50. (0.625·0.65·100 + 2.50·0.80·100)/200 | 1.203 (unbounded scale) |
| V̂_SNIPS | numerator 240.625, denominator 312.5 | 0.770 (back in [0,1]) |
| V̂_DR | using empirical means as q̂: 0.5·0.65 + 0.5·0.80; residuals zero by construction | 0.725 (exact) |
Now change one number: μ(B) = 0.02 instead of 0.20 — the logging policy almost never chose the expensive arm, which is what happens in real cost-conscious deployments. w_B rises to 25.0. V̂_IPS jumps to 10.20 (uninterpretable noise). V̂_SNIPS holds at 0.796 on this fixed dataset: self-normalisation keeps the point estimate bounded, but it is now dominated by the hundred rows carrying weight 25.0, so its variance across resampled logs explodes. Clipping w_B at 5.0 stabilises both: V̂_IPS_clipped = 2.20, V̂_SNIPS_clipped = 0.783. The clipping introduces bias, but estimating a finite quantity beats estimating an infinite-variance one. V̂_DR is the cheapest of the three to stabilise on real logs: the reward model absorbs most of the variance the IPS weight would otherwise carry, so the estimate stays bounded even when propensities are small.
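The arithmetic is cheap to verify. A standalone script that inlines the weights (rows are fixed at the arm means, matching the worked example):

```python
import numpy as np

# 100 logged pulls of arm A (mu=0.8, mean reward 0.65) and 100 of arm B
# (mu=0.2, mean reward 0.80); target policy is a uniform 50/50 split.
pi = np.full(200, 0.5)
mu = np.array([0.80] * 100 + [0.20] * 100)
r = np.array([0.65] * 100 + [0.80] * 100)
q = r.copy()                                 # q-hat = empirical arm means

w = pi / mu
print(np.mean(w * r))                        # IPS    -> 1.203
print(np.sum(w * r) / np.sum(w))             # SNIPS  -> 0.770
print(np.mean(0.5 * 0.65 + 0.5 * 0.80 + w * (r - q)))  # DR -> 0.725 (residuals zero)

# Rare-arm regime: mu(B) = 0.02 on the same fixed rows.
mu2 = np.array([0.80] * 100 + [0.02] * 100)
w2 = pi / mu2
print(np.mean(w2 * r))                       # IPS            -> 10.203
print(np.sum(w2 * r) / np.sum(w2))           # SNIPS          -> 0.796 (bounded, high-variance)
wc = np.minimum(w2, 5.0)                     # clip weights at 5.0
print(np.mean(wc * r))                       # clipped IPS    -> 2.203
print(np.sum(wc * r) / np.sum(wc))           # clipped SNIPS  -> 0.783
```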
A marginal-gain router needs counterfactual evidence: randomised exploration, shadow calls to stronger models, canary slices where every model answers, or a held-out all-model replay set. The production consensus across Augment, Folkman, and the routing-as-bandit literature is a 5% uniform-random shadow slice — large enough to guarantee positivity (every arm's propensity is bounded below by a known constant), small enough that the shadow cost stays under a few percent of the bill, and simple enough that the propensity model is just a constant. Lower-volume systems run 1–2%; uncertainty-weighted exploration is more sample-efficient but loses the constant-propensity simplicity, and the bookkeeping gets harder. Clipping importance weights at a fixed constant in the 5–10 range is the standard variance fix; without exploration and clipping, the router mostly learns to justify the incumbent policy.
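A sketch of the logging side, assuming a generic `router` callable and dict-shaped log entries (all names ours, not from any cited system). The only load-bearing property is that shadow entries carry a known constant propensity; the reward field is attached later by the eval harness, and the estimators run on the shadow entries only:

```python
import random

def route_with_shadow(query, router, arms, shadow_fraction=0.05, log=None):
    """Route one query, diverting a uniform-random shadow slice.
    Off-policy estimates are computed on the shadow entries only, where
    every arm's propensity is a known constant by construction."""
    if random.random() < shadow_fraction:
        arm = random.choice(arms)                 # uniform exploration
        entry = {"slice": "shadow", "arm_selected": arm,
                 "propensity": 1.0 / len(arms)}   # within-slice propensity
    else:
        arm = router(query)                       # incumbent policy, untouched
        entry = {"slice": "production", "arm_selected": arm, "propensity": None}
    if log is not None:
        log.append(entry)                         # harness attaches "reward" later
    return arm, entry
```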
The same machinery gives a usable drift signal. Run V̂_IPS per day on the last 30 days of routing logs against a fixed baseline value (the policy the team committed to ship); alarm when the daily regret exceeds a threshold for several consecutive days. The 5% uniform shadow slice is what makes the per-day estimate well-defined.
```python
class Alarm(Exception):
    """Raised when the drift detector fires; carries the offending days."""
    def __init__(self, kind, days, recommend):
        super().__init__(f"{kind}: days={days}; recommended action: {recommend}")
        self.days, self.recommend = days, recommend


def compute_VIPS(logs, target_policy_scores, propensity_fn, clip=5.0):
    """Off-policy IPS estimate from a routing log slice."""
    N, weighted, weights = len(logs), [], []
    for entry in logs:
        a, r = entry["arm_selected"], entry["reward"]
        pi = target_policy_scores.get(a, 0.0)   # target policy's mass on this arm
        mu = propensity_fn(a)                   # logging policy's mass on this arm
        w = min(pi / mu, clip)                  # clipped importance weight
        weighted.append(w * r)
        weights.append(w)
    return sum(weighted) / N, weights


def detect_regret_drift(logs_by_day,
                        shadow_fraction=0.05,   # uniform mu for every arm
                        baseline_value=0.0,     # committed policy's V-hat
                        regret_threshold=0.05,
                        consecutive=3,
                        lookback_days=30):
    """Alarm if IPS-estimated regret exceeds threshold for N days in a row.

    Assumes logs_by_day holds the shadow-slice entries, where every arm's
    logging propensity is the known constant shadow_fraction."""
    prop_fn = lambda a: shadow_fraction
    streak, alarmed = 0, []
    for d in sorted(logs_by_day)[-lookback_days:]:
        logs = logs_by_day[d]
        if len(logs) < 50:                      # too few rows for a stable estimate
            continue
        counts = {}
        for e in logs:
            counts[e["arm_selected"]] = counts.get(e["arm_selected"], 0) + 1
        pi = {a: c / len(logs) for a, c in counts.items()}   # today's deployed policy
        V, _ = compute_VIPS(logs, pi, prop_fn)
        regret = baseline_value - V
        if regret > regret_threshold:
            streak += 1
            alarmed.append(d)
            if streak >= consecutive:
                raise Alarm("regret drift", days=alarmed,
                            recommend="hold policy; audit recent retrain")
        else:
            streak, alarmed = 0, []
    return {"status": "ok", "streak": streak}
```
This is the smallest operational version of off-policy evaluation: a single estimator (IPS), a single propensity (the shadow fraction), a single decision (raise or hold). Replacing IPS with DR is a one-line change once a reward model exists. The reason to run it is that without it, any router retrained on logged data is updating against a target the previous router already shaped — the "self-confirming" failure mode the Tsiourvas paper names. With it, the team has a per-day number that says the last 30 days are still net-positive relative to the committed baseline, which is the answer the question "is the router working in production?" was asking for in the first place.
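The DR swap, sketched as a hypothetical extension of `compute_VIPS` above; `reward_model(entry, arm)` is an assumed callable returning q̂ for any (query, arm) pair, which is the one new dependency:

```python
def compute_VDR(logs, target_policy_scores, propensity_fn, reward_model, clip=5.0):
    """Doubly-robust counterpart of compute_VIPS (sketch, not a cited API)."""
    total = 0.0
    for entry in logs:
        a, r = entry["arm_selected"], entry["reward"]
        # reward-model baseline: expected q-hat under the target policy
        baseline = sum(p * reward_model(entry, arm)
                       for arm, p in target_policy_scores.items())
        w = min(target_policy_scores.get(a, 0.0) / propensity_fn(a), clip)
        total += baseline + w * (r - reward_model(entry, a))   # weighted residual
    return total / len(logs)
```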
None of this is exotic. IPS dates to 1952; DR to 2011; SNIPS to 2015. All three are standard in causal inference and contextual bandits, and across the 2025–2026 LLM-routing systems surveyed here, none are used as the training objective. The opportunity is to import them, not invent them.
The events covered by this survey, plotted on the time axis. The shape itself is informative: a sparse February (BOute, Portkey's Series A, the Trinity / Huawei survey); a dense mid-March infrastructure burst (Dynamo 1.0, MS Foundry, llm-d, Ranvier, Portkey OSS); a late-April academic flood (ICLR 2026 in Singapore, ACL 2026 Industry, EuroSys '26 in Edinburgh, all overlapping); and a closing week of consequential vendor events (Portkey's acquisition, AgentFloor, Augment Prism, Kairos).
One of the strongest findings of the past few months is also one of the least resolved: RouteProfile6 shows that the format of a model capability profile matters more than the router mechanism on top of it. But the field hasn't converged on what that format should look like — discrete benchmark scores, dense learned embeddings, and structured hybrids all have credible papers behind them. The matrix-versus-self-routing argument (Topaz5 and Dimension-Direct14 against DiSRouter7) is similarly unsettled. What follows is six predictions about where these debates land, and seven bets about how the vendor and benchmark layers respond.
Two further calls round out the list: vendor lock-in tension will peak in Q3 2026 as MS Foundry Model Router's25 incumbency advantage on Azure clashes with the OpenRouter / Bedrock cross-provider plays, and the benchmarks (LLMRouterBench, AgentFloor) will unify by year-end. Both follow from the six above and do not need separate predictions.
Main claims use items dated in the strict window 2026-02-08 → 2026-05-08. A small number of numbered references are pre-window context (each explicitly footnoted as such); lineage references for late-2025 ancestor work appear separately in §18.
The main report (§01–§17) is bounded strictly to 2026-02-08 → 2026-05-08. That thirteen-week window is the report's discipline; this appendix names the late-2025 architectural and benchmark substrate the in-window papers stand on. Late-2025 items are cited here for ancestry only — they are not used as evidence for any in-window claim, and they carry the LA-N prefix to keep them visually distinct from the numbered references above.
| LA-ID | Title | Venue / Date | Inherited by | What 2026 added |
|---|---|---|---|---|
| Online routing parents | ||||
| LA-1 | PORT — Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving | NeurIPS 2025 · 2025-09-02 | ParetoBandit 33, AdaServe 19 | Training-free online routing with competitive-ratio guarantee; ANN query features + bootstrap optimisation became the substrate ParetoBandit's primal-dual pacer extends. |
| Causal routing parents | ||||
| LA-2 | Causal LLM Routing — Regret Minimization from Observational Data | NeurIPS 2025 · 2025-12-02 | MTRouter 27; §13 causal routing | Routes by marginal causal gain (counterfactual improvement), not absolute model quality; interval-conditioned architecture and end-to-end regret minimisation from observational logs. |
| RL multi-round routing parents | ||||
| LA-3 | Router-R1 — LLM-Native Multi-Round Routing | NeurIPS 2025 poster · 2025-12-09 | GraphPlanner 28 | Router is itself an LLM that interleaves "think" and "route" actions across multi-round contexts; routing reframed as reasoning, not classification. |
| Token/step granularity parents | ||||
| LA-4 | R2R — Token-Level Small-Large Model Routing | NeurIPS 2025 · arXiv:2505.21600 · 2025-11-15 | TRIM 3 | Token-level divergence detection + automatic routing-label generation; TRIM moved the granularity from token to step (more deployable). |
| LA-5 | Lookahead Routing — Predicting Output Representations | NeurIPS 2025 poster · 2025-12-04 | NVIDIA Prefill Activations Router 1 | Predicts latent output representations using causal/masked LMs; the in-window ref-1 substitutes real internal states for predicted ones. |
| Confidence / self-routing parents | ||||
| LA-8 | DiSRouter — Distributed Self-Routing for LLM Selections | ICLR 2026 · arXiv:2510.19208 · 2025-10-22 | DiSRouter 7; shadow self-router (§08b) | Already cited in the main text as ref 7; listed here for lineage completeness. Appendix A.9's "frontier models possess strong intrinsic self-awareness without fine-tuning" is the key premise for the shadow self-router thesis. |
| LA-9 | Self-REF — Learning to Route LLMs with Confidence Tokens | ICML 2025 · 2025-05 | Abstention as a route 88 | Established that LLMs can learn to emit explicit confidence tokens predicting downstream correctness — the lineage parent for the typed-abstention contract. |
| LA-16 | LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations | arXiv:2602.09924 · 2026-02-06 | Shadow self-router (§08b); zero-shot confidence 10 | Pre-Feb-8 by 2 days. AUROC > 0.70 from linear probes on residual-stream activations; cross-model encoder pattern that enables shadow self-router on closed APIs without fine-tuning. |
| Benchmark parents | ||||
| LA-6 | RouterEval — 200M Record Comprehensive Benchmark | EMNLP 2025 Findings · 2025-11-12 | RouterArena; Dynamic Routing Survey 8 | First benchmark documenting model-level scaling — as candidate pool grows, capable router exceeds best individual model. |
| LA-7 | RouterArena — Open Platform for Comprehensive Router Comparison | ICLR 2026 · arXiv:2510.00202 · 2025-10-01 (ongoing) | Capability matrix §06; drift §12 | Live leaderboard infrastructure for router comparison — 44 categories, Bloom difficulty levels, 5 metrics, commercial-router inclusion; the missing primitive for auto-refreshed capability matrices. |
| LA-10 | RouteLLM — Strong/Weak Cascade Foundation | ICLR 2025 (Berkeley / Anyscale) · 2025-01 | DiSRouter 7; Martian 55; Not Diamond | Foundational ancestor cited by DiSRouter and many 2026 systems; established the strong/weak model cascade concept that the productised commercial routers extend. |
| LA-14 | LLMRouterBench — Massive Benchmark and Unified Framework | arXiv:2601.07206 · 2026-01-12 | §13 drift / operational tooling | Pre-Feb-8 by 27 days. Unified evaluation harness (400K+ instances, 21 datasets, 33 models, 10 routing baselines); cited inside lineage appendix only. |
| LA-15 | MMR-Bench — Multimodal LLM Routing Benchmark | arXiv:2601.17814 · 2026-01-28 | Multimodal dispatch (§07b); BET 08 | Pre-Feb-8 by 11 days. Multimodal routing benchmark referenced by EquiRouter 71 and others. Lineage parent for §07b multimodal-dispatch gap. |
| Field-substrate parents | ||||
| LA-11 | Hidden Cost of LLM Drift: How to Detect Subtle Shifts Before Quality Drops | insightfinder.com/blog · 2025-12-08 | 90-day degradation 75; drift taxonomy 77 | Drift detection cost ~1–2% overhead; undetected drift costs 5–20% revenue over 90 days; 10–20× ROI on drift monitoring. |
| LA-12 | ZenML — What 1,200 Production Deployments Reveal About LLMOps in 2025 | zenml.io/blog · 2025-12-19 | Gateway framing 62; outage postmortem 65 | Aggregates Amazon Rufus multi-model evolution, GetOnStack $127/wk → $47K/mo recursive-loop incident, Cursor Tab 400M req/day, OpenTelemetry instrumentation patterns. |
| LA-13 | Hugging Face TGI Maintenance Mode | huggingface.co · 2025-12-15 | Fleet-rotation framing 69 | Foundational fleet-rotation evidence: the most-widely-used open-source inference framework went into maintenance; teams that built on TGI face migration cost. |
| LA-17 | Semantic Router | github.com · 2024 → 2026 (ongoing) | Heuristic routing §05b; rule-vs-ML framework 87 | Lineage no-train system (utterance-embedding routing); 10–100× faster than LLM-based routing; primary tradeoff: utterances go stale on distribution shift. |
| LA-18 | AWS Bedrock Intelligent Prompt Routing — original docs | docs.aws.amazon.com · 2025-11-15 | AWS Bedrock 46, cross-region 98 | Documented as "may not always provide optimal routing for unique or specialized use cases" — vendor-managed black-box that cannot adapt to application-specific distribution. |
| LA-19 | AWS Bedrock 1-Hour Prompt Caching | aws.amazon.com · 2026-01-26 | Bedrock cross-region 98 | Pre-Feb-8 by 13 days. Closes gap with Anthropic Direct on cache TTL; combined with Bedrock IPR (LA-18) is the most powerful Bedrock-native composition. |
| LA-20 | Red Hat — Master KV Cache Aware Routing with llm-d | developers.redhat.com · 2025-10-07 | llm-d predicted-latency 44; precise prefix-cache 45 | EPP scoring of vLLM decode pods, 87.4% cache hit rate, 88% TTFT reduction, 99.92% session affinity. |
| Regulatory parents | ||||
| LA-21 | EU AI Act (Reg. 2024/1689) + 2026-05-07 Digital Omnibus amendment | EUR-Lex · 2024-07-12 / Commission · 2026-05-07 | §03b audit surface; gateway routing decisions (Art. 9, 10, 13, 14, 15, 50, 53, 55) | Original Annex III deadline 2026-08-02 was pushed to 2026-12-02 by the Digital Omnibus of 2026-05-07. GPAI obligations (Art. 53, 55) remain live since 2025-08-02; systemic-risk reporting clock is unchanged. The article-by-article mapping in §03b is grounded in the EUR-Lex text. |
| Cascade economics parents | ||||
| LA-22 | FrugalGPT — How to Use Large Language Models While Reducing Cost and Improving Performance | arXiv:2305.05176 · 2023-05 (TMLR · 2024-12) | §04b "80% rule"; cascade tier discussion | Three-tier learned cascade (GPT-J → J1-L → GPT-4) trained against a DistilBERT scoring function. The headline result behind the "80% rule" naming: 80% cost reduction on HEADLINES at matched or improved accuracy; across-dataset range 50–98%. The 2026 cascade literature (BoundaryRouter, ConfSpec, Mahmood et al. ICLR 2026) reads as direct extension of this baseline. |
One sentence per cluster names the in-window descendant. Online routing parents (LA-1) feed ParetoBandit and AdaServe. Causal routing (LA-2) feeds MTRouter27 and the §14 causal-correction discussion. RL multi-round routing (LA-3) feeds GraphPlanner28. Token/step granularity parents (LA-4, LA-5) feed TRIM3 and the NVIDIA Prefill Activations Router1. Confidence and self-routing parents (LA-8, LA-9, LA-16) feed DiSRouter7 and the shadow-self-routing pattern in §08b. Benchmark parents (LA-6, LA-7, LA-14, LA-15) feed §07b multimodal dispatch and §13 drift instrumentation. Field-substrate parents (LA-11–LA-13, LA-17–LA-20) feed the in-window gateway, outage, drift, vendor, and KV-cache references throughout the survey.