2026-05-01 · Multi-Agent Systems Research · long-form synthesis · opinion

The Shape of a Working
Multi-Agent System

An opinionated technical argument about how to build multi-agent LLM systems that actually work. Eight parts. Six staked bets. Roughly two hundred primary sources from October 2025 through April 2026. An engineering memo from someone willing to be wrong in print.

technical synthesis multi-agent · LLM orchestration · memory · protocols control plane · steering opinionated

~27,000 words

14 diagrams

~200 primary sources

8 parts · 6 bets

~90 min read

PART I

The reframing: a category error

"single vs multi" is the wrong question

PART II

Context is the master variable

memory as a managed OS resource

PART III

Write authority & the verification trick

single writer · asymmetric verifiers

PART IV

The orchestrator as scheduler, not chatter

de-participate the orchestrator

PART V

The protocol stack and what it locks in

MCP · A2A · AP2 · the L9 gap

PART VI

The control plane is now a real layer

defense-in-depth · capability flow

PART VII

A reference architecture, defended

seven components · what to build

PART VIII

Bets on the next twelve months

six staked forecasts

I · reframing II · context III · write auth IV · orchestrator / V · protocols VI · control plane VII · reference VIII · bets / closing

PART

§ 01 · the reframing

The reframing: a category error

"single agent vs multi-agent" is the wrong question

In late 2024, a thousand startups decided to build multi-agent systems. The idea was seductive: building one good LLM agent was unsolved, but maybe you could put several together — a planner, a researcher, a critic — and wisdom would emerge from the crowd. By mid-2025 every engineering team was asking: should we build a multi-agent system?

claim

That is the wrong question.

The reason it is the wrong question becomes clear from three results, all from the past twelve months, that have to be read together.

DPI

tran & kiela · 2604.02460

single-agent ≥ MAS
under matched compute

15×

anthropic · jun 2025

tokens vs chat
80% variance from token count

17%

o3 · zero-cost coop

o3-mini hits 50%
capability ≠ cooperation

Result 1: the information-theoretic bound. Tran and Kiela (arXiv 2604.02460, April 2026) show that any multi-agent decomposition of a multi-hop question must compress intermediate results into messages between agents. Each compression is lossy under the data processing inequality: I(answer; message) ≤ I(answer; original context). Under a matched compute budget, a single agent reading all facts in one window has a strictly higher information ceiling than any decomposition. Empirically, across Qwen3, DeepSeek-R1-distill-Llama, and Gemini 2.5, single-agent setups match or beat multi-agent on multi-hop reasoning when thinking tokens are held constant. Most published "MAS wins" are confounded by budget asymmetry — multi-agent runs accidentally getting more compute.

Result 2: the token-count regression. Anthropic's June 2025 post-mortem on their multi-agent research system: 80% of BrowseComp variance came from token count alone. Tool calls explained the next slice; model choice explained much less. The system used 15× more tokens than regular chat. More tokens, better answers, full stop.

Result 3: capability ≠ cooperation. Yadav, Black and Sourbut (arXiv 2604.07821, April 2026) tested zero-cost cooperation: helping another agent cost nothing and was explicitly instructed. o3 achieved only 17% of optimal collective performance. o3-mini hit 50%. The more capable model was less cooperative. Adding explicit coordination protocols doubled scores for weaker models; raw intelligence alone did not.

Read together, these results don't say "multi-agent is bad." They say something more precise: "should I use a multi-agent system?" is the wrong question. It collapses several distinct design choices into a single binary. The actual questions: Should information be aggregated or distributed? Should one component or many be allowed to write to shared state? Who verifies, with what context? Is the orchestrator a participant or a scheduler?

These questions don't have the same answer. Some point to a single agent; some to multiple agents with strict isolation; most to a hybrid that doesn't resemble either canonical template. The framing this essay adopts:

The framing

Every LLM application is an information system. The LLM is one component within that system, not the system itself. The question "single agent vs multi-agent" is a question about the topology of information flow through that system, and the right topology is determined by the structure of the task, not by ideology about agents.

This makes the literature legible. Cognition's "Don't Build Multi-Agents" (June 2025) and "Multi-Agents: What's Actually Working" (April 2026) aren't contradictions — they're one team learning which topologies survive. The first rejects swarm-of-equals. The second accepts three patterns: clean-context reviewers, frontier-model augmentation, map-reduce managers. Both converge on one rule: write authority is the load-bearing decision, vested in one component per task slot. Anthropic's Research feature does exactly this — an orchestrator, parallel read-only sub-agents, a single Citation Agent writing sequentially. It looks "multi-agent" because there are several model calls; structurally it's a database read-replica architecture.

Google Research's "Towards a Science of Scaling Agent Systems" (arXiv 2512.08296) makes this quantitative. Across 260 configurations, five architectures, and three model families, performance deltas from multi-agent ranged from +80.8% to −70.0%. Decomposable tasks with high parallelism benefited; tightly-coupled tasks were destroyed. A classifier predicting the best architecture from task features alone hit 87% accuracy on held-out data. The architecture choice is not capricious — it is a learnable function of task structure.

Record No. 001 · the MAS decision tree · interactive

DECISION TREE · DOES THE TASK WANT MULTI-AGENT? GOOGLE 2512.08296 · COGNITION APR 2026 · TRAN & KIELA

QUESTION 1

Is the task decomposable into low-coupling sub-problems?

RECOMMENDATION A

SINGLE AGENT

+ generous thinking budget. Under matched compute, single-agent ≥ MAS.

tran & kiela · 2604.02460 · zhao et al: 43% of cases

QUESTION 2

Can verification be centralized?

RECOMMENDATION B — AVOID

DO NOT BUILD MAS

Errors cascade without a centralized verifier. This is the most common wrong path.

cemri et al MAST · NeurIPS 2025

QUESTION 3

Are agents heterogeneous? (model · role · knowledge)

RECOMMENDATION C

REDUNDANCY ≠ INTELLIGENCE

Homogeneous agents yield diminishing returns. Adding more of the same doesn't help.

lu et al · 2512.23340

RECOMMENDATION D

USE MAS

With explicit coordination. Read-only sub-agents > concurrent writers. Map-reduce-manage.

cognition · apr 2026

THREE QUESTIONS · FOUR RECOMMENDATIONS · ONE EMPIRICAL ANCHOR

Distilled from Google's "Towards a Science of Scaling Agent Systems" (87%-predictive classifier), Tran & Kiela's equal-budget bound, and Cognition's two production essays. Wrong paths: avoid MAS when verification cannot be centralized, even if the task is decomposable.

Put all of this together and the right way to think about MAS in 2026 stops looking like a question of how many agents to use and starts looking like an engineering discipline with three core questions:

Where does context live? In one window, or distributed across many? On model activation memory, scratchpads, externalized files, or persistent stores? Compressed by the model itself, or by a learned policy, or by hand?

Who is allowed to write? To shared state, to externalized artifacts, to memory? One component per task slot, or many concurrently?

Who verifies, and with what context? The same agent that produced the work, a peer with the same context, or an independent verifier with deliberately asymmetric context?

The rest of this essay walks through these questions in order. Part II argues that context is the master variable. Part III argues that write-authority and verification asymmetry are the load-bearing architectural decisions. Part IV argues that the orchestrator should be a scheduler, not a chat participant. Part V covers the protocol stack and what it locks in. Part VI covers the control plane. Part VII synthesizes a reference architecture. Part VIII closes with bets.

throughline

A good multi-agent system in 2026 is not a more sophisticated agent. It is a carefully engineered information system in which the LLM is one component among several, and the LLM's role is deliberately bounded. That bounding is the design.

PART

§ 02 · the master variable

Context is the master variable

memory as a tiered, learned, structured OS resource

Anthropic's BrowseComp regression is the best single-sentence summary of the 2025 field: 80% of variance from token count, the rest mostly from tool calls, a small residual from model choice. If your agent underperforms and you're shopping for a better LLM, you're optimizing the third-largest variable. The first is what tokens are in the window when each call happens. The second is what tools are available and how clean their outputs are. The model is the engine; context is the road.

This anchors what has become context engineering — no longer "prompt engineering with retrieval," but the systematic management of information flowing into and out of model context windows over a task's lifetime. Anthropic's September 2025 essay treats context as a managed system resource with explicit policies for entry, summarization, offloading, and retrieval. Cognition's essays make the same point from the other side: the open problems in MAS reliability are all communication problems. Communication and context are the same thing.

The right way to absorb this is to think of context as memory in the operating-systems sense: a hierarchy of storage tiers with explicit promotion and eviction policies, with the model's active context window as the smallest and fastest tier, and durable external storage as the slowest and largest. This is not just metaphor; the most sophisticated 2026 systems are built exactly this way.

The four tiers, made concrete

Consider the lifetime of a piece of information in a serious multi-agent system. A user asks the orchestrator to research a market. The orchestrator dispatches a sub-agent to read fifty pages of financial filings; the sub-agent extracts key facts, synthesizes a summary, and returns it. The orchestrator then dispatches another sub-agent to compare those facts against historical benchmarks, and so on. Where does each piece of information live through this trajectory?

The first tier is the GPU's KV cache and prompt cache — call it activation memory. When the same prefix tokens are reused across many calls (e.g., the same orchestrator system prompt across all sub-agent invocations), the inference engine reuses the cached KV state and never recomputes attention over the prefix. This tier has a lifetime measured in seconds and a scope tied to a single conversation. The MemOS paper from MemTensor and ten Chinese universities (arXiv 2507.03724) makes this tier first-class: in their architecture, frequently-accessed plaintext patterns are pre-encoded into KV-cache vectors and loaded onto GPU memory, converting slow retrieval-plus-encoding into a cache hit. They report 64.6%–94.2% time-to-first-token reduction across context lengths on Qwen3-8B with output semantics verified identical. Cross-agent KV sharing — where two cooperating agents share the same prefix and inherit each other's cache — is the architectural extension TokenDance (arXiv 2604.03143, April 2026) makes possible at scale, with diff-aware encoding that compresses sibling caches as block-sparse diffs against a master copy at 11–17×, and KVComm (ICLR 2026) makes possible across separate processes by passing selected KV tensor pairs between agents instead of natural language. The result is that a multi-agent system designed to share KV state can run 2.7× more concurrent agents than vanilla vLLM with prefix caching. This is invisible to the application layer and yet dominates the cost structure of any serious deployment.

The second tier is the working context — the LLM context window itself, the tens of thousands of tokens currently visible to the model. This is the tier most engineers think of as "prompt." Its lifetime is one agent turn or one task; its scope is private to a single agent. The 2026 design discipline is that the working context should be deliberately bounded and aggressively curated. This is what Cognition means by "context engineering is the new prompt engineering." The right metaphor is not "fill the window with everything that might be relevant"; it is "ALARA: as low as reasonably achievable" — give the agent the minimum context that lets it succeed, because every irrelevant token in the window is a potential distractor for the model and a cost-multiplier for the operator. The April 2026 paper "ALARA for Agents: Least-Privilege Context Engineering" (arXiv 2603.20380) borrows the principle from radiation safety and shows that ALARA-compliant systems reduce prompt injection success rates by 47% and unintended capability use by 61% while maintaining task-completion parity. Less context isn't just cheaper; it's safer.

The third tier is the episodic scratchpad — shared filesystem state, structured artifacts, intermediate JSON files, validation contracts, partial PRs, anything written during a task that another agent in the same task may need to read. The lifetime is task-duration, measured in minutes to days. The scope is the cooperating set of agents working on this particular task. Anthropic's Research feature (engineering blog, June 2025) exemplifies the canonical pattern here: when sub-agents discover something the orchestrator will need later, they don't pass it through the orchestrator's context window — they write it to a shared filesystem as a typed artifact, and the orchestrator carries only a pointer in its window. The same pattern shows up in Factory.ai's Missions architecture (features.json, validation-contract.md), in OpenAI's Symphony spec (per-issue workspaces, persisted across runs), and in every team that has hit production load with multi-agent systems and learned the hard way that context-window-as-shared-memory is a non-starter at scale.

The fourth tier is persistent memory — the long-lived knowledge that survives across sessions, projects, and weeks. Lifetime: indefinite. Scope: per-user, per-organization, possibly federated across organizations. By 2026 this tier has its own ecosystem of specialized systems: Mem0 dominates the managed offering; Letta (formerly MemGPT) dominates the framework-native experience; Cognee, Zep, Memori, A-MEM, MAGMA, MemOS each take different positions on how memory should be structured. Most production systems use one of these as a service rather than building their own.

The architectural point is that the orchestrator's job, properly construed, includes deciding which tier each piece of information belongs in — and when to promote or demote it. Naive systems put everything in tier 2 (the LLM context) and pay for it in tokens, latency, and confused models. Sophisticated systems use tier 2 as a small, hot, carefully-curated working set and rely on tiers 1, 3, and 4 to do the actual heavy lifting.

MemAgent, and why memory is a learnable agent skill

The most important paper of the past year on context management is probably MemAgent (arXiv 2507.02259, ICLR 2026 Oral) from the joint ByteDance Seed × Tsinghua AIR lab. It is important not because of its benchmark numbers, though those are striking — a 14B model trained in an 8K context window extrapolates to 3.5M-token tasks with under 5.5% performance loss. It is important because it operationalizes a thesis that the rest of the field had only stated: memory management is a learned agent skill, and it can be trained directly with reinforcement learning over the agent's read-and-write trajectory.

The mechanism is worth understanding in detail because it changes how to think about context management generally. MemAgent reformulates autoregressive language modeling itself. A standard transformer factorizes the joint likelihood of a long document as p(x₁, x₂, …, x_n), implicitly assuming all past tokens remain in active context — quadratic in length, capped by the context window. MemAgent introduces a latent memory sequence m₀, m₁, …, m_K that mediates access to prior content, with the factorization

p(x) = ∏_k p(x_k | m_k−1, c_k) · p(m_k | m_k−1, c_k)

where c_k is the k-th chunk of the document. The model reads chunk c_k conditioned on a constant context window of size W = |m_{k-1}| + |c_k| + |query|, writes an updated memory m_k of fixed length, then discards the chunk. Because |m| is constant across all chunks, per-step compute is O(W²) and total compute scales linearly in the number of chunks. The training configuration uses an 8,192-token total window allocated as 1,024 tokens for the query, 5,000 for the document chunk, 1,024 for memory, 1,024 for output. The memory is ordinary tokens inside the context window — not a special hidden state, not a KV-cache injection, not an external database. It is human-readable. You can print it.

The training objective is the part that matters most. The fundamental challenge is that MemAgent generates multiple context-independent conversations per input sample — one for each chunk-processing step plus a final answer-generation step — with no token-level continuity between them. Standard multi-turn RLHF can't handle this; either it concatenates turns into a single conversation (which loses the independence) or it uses attention masking (which still requires turn ordering to be visible to the model). The authors' solution is Multi-Conv DAPO, an extension of GRPO. The outcome reward r is computed from the final answer-generation conversation only, and that scalar advantage is then broadcast uniformly to all preceding memory-update conversations from the same sample. The loss extends the (group, token) structure of vanilla GRPO to a (group, conversation, token) structure: every conversation in a sample receives the same group-relative advantage, derived from the final answer's correctness. The authors put the necessity of RL plainly: "because memory tokens are latent and updated via a discrete overwrite rule, back-propagation alone cannot teach the model what to keep and what to discard."

This is the operative insight. In a recurrent neural network, gradients flow through hidden state across timesteps and the model learns memory management implicitly. In MemAgent, the "hidden state" is discrete tokens that are written by one independent forward pass and consumed by another. Gradients cannot flow between conversations. The only way to teach the model what to retain is reinforcement learning, with credit assigned by outcome.

The 8K-to-3.5M extrapolation is not length extrapolation in the positional-embedding sense. The context window size is constant throughout inference; positional embeddings are never re-scaled. Extrapolation works because the RL training rewards correct recall regardless of document length, and the per-chunk reward landscape is identical whether there are five or seven hundred iterations of the loop — each individual memory-write operation has the same structure. The model learns a compression policy that generalizes well beyond training length. The ablation makes the role of RL unambiguous: a vanilla 14B model collapses to under 10% accuracy beyond its native context window; MemAgent without RL maintains reasonable performance but still degrades; MemAgent with RL maintains 95%+ accuracy through 512K and 76–78% through 3.5M tokens. RL is the operative ingredient, not the architecture.

Record No. 002 · MemAgent · Multi-Conv DAPO

memagent · constant 8K window · multi-conv DAPO broadcasts the outcome advantage to every conversation in the trace · 14B model trained at 32K extrapolates to 3.5M tokens with <5.5% loss

The architectural implication for multi-agent systems is direct. Subagent invocation is exactly the same structural problem as MemAgent's chunk processing: one independent forward pass produces an artifact, another independent forward pass consumes it, and there is no token-level gradient flow between them. Multi-Conv RL is the natural training signal for orchestrator–subagent loops. It is plausible that within twelve months we will see systems that train their entire orchestration loop end-to-end with this technique, with the orchestrator learning what to dispatch, the sub-agents learning what to write back, and the rewards backpropagated by Multi-Conv DAPO across the chain.

A-MEM, MAGMA, and the structured-memory family

MemAgent solves the long-context-as-memory problem but does not address what happens when memory must persist across sessions, support cross-task retrieval, and be queried by structure rather than recency. That is the territory of a different family of 2025–26 systems, the most influential of which are A-MEM, MAGMA, MemOS, and Memori.

A-MEM (Xu et al., NeurIPS 2025, arXiv 2502.12110) takes its inspiration from Niklas Luhmann's Zettelkasten, the index-card system that drove his prodigious academic output. Each memory note carries six fields: content, timestamp, LLM-generated keywords, LLM-generated tags, an LLM-generated context description, and a set of links to other notes. The embedding for retrieval is computed not from raw content alone but from the concatenation of content, context description, keywords, and tags — meaning the embedding captures the model's own semantic interpretation of the content, not just its surface form. When a new note arrives, the system performs embedding-based retrieval over the existing memory to find the top-k candidates, then invokes an LLM on the candidate set to decide which candidates merit explicit links and what the link relation is — a two-stage approach where embedding similarity acts as a filter and LLM reasoning acts as the decider. The mechanism that distinguishes A-MEM from prior memory systems is memory evolution: after a new note's links are determined, the existing notes that the new one connects to are themselves updated, with the LLM revising their context descriptions and tags in light of the new information. A later memory can retroactively enrich an earlier one; the meaning of an old fact can sharpen as new context arrives. The ablation isolates this mechanism: removing only the evolution module drops multi-hop F1 from 27% to 21% on GPT-4o-mini; removing both link generation and evolution drops it to under 10%. Memory evolution alone accounts for roughly a quarter of the multi-hop gap, and the gain is largest on temporal queries — exactly where retroactive contextualization matters most.

MAGMA (ACL 2026, arXiv 2601.03236) makes a sharper architectural critique: A-MEM's retrieval pipeline is fundamentally semantic-similarity-based, which means it can retrieve what occurred but struggles to reason about why. Episodic memory in cognitive science distinguishes associative proximity (semantic similarity) from mechanistic dependency (causal structure), but LLM memory systems have collapsed these into a single vector space. MAGMA's response is to maintain four orthogonal graphs over the same set of event nodes: a Temporal graph (strictly chronological, deterministically constructed, never modified — the immutable backbone), a Causal graph (LLM-inferred entailment edges, "event i is a cause or precondition of event j"), a Semantic graph (cosine-similarity edges over embeddings), and an Entity graph (edges connecting events to abstract entity nodes for cross-session entity tracking). Retrieval uses adaptive traversal: a query is first classified by intent — "why" queries weight causal edges high, "when" weights temporal, "who" weights entity, "what" balances semantic and entity — and beam search over the multigraph uses an intent-conditional weighting of edge types. The four graphs are deliberately maintained separately, not merged into a heterogeneous graph, so retrieval through one relational dimension does not corrupt signal from another.

empirical anchor · the second-most-important finding in the recent memory literature

On LoCoMo with GPT-4o-mini, MAGMA scores 0.700 overall versus A-MEM's 0.580, MemoryOS's 0.553, Nemori's 0.590, and 0.481 for full context — meaning the structured memory systems all beat passing the entire conversation history to the model. More context can be worse than less, structured context.

The reason is what the retrieval-augmentation literature has called "context rot": as the window fills with semantically related but task-irrelevant material, the model's attention dilutes and accuracy degrades. Structure beats brute force.

MemOS takes the OS analogy further than the others, treating memory as a system resource on par with compute and storage. The unifying primitive is the MemCube — a typed container holding three classes of memory (plaintext / activation / parameter) with a metadata triple: descriptive identifiers (timestamp, origin, semantic type), governance attributes (access control, lifespan policy, priority, compliance tags), and behavioral usage indicators (access frequency, version chain, contextual fingerprint). MemCubes transition through a finite state machine — Generated → Activated → Merged → Archived → Expired — driven by system policies, user commands, or automated promotion. The most distinctive mechanism is cross-modality transformation: frequently-accessed plaintext memory can be pre-encoded as KV-cache vectors and loaded to GPU (plaintext → activation), stable plaintext patterns can be distilled into LoRA adapters (plaintext → parameter), and outdated parameters can be externalized back to plaintext (parameter → plaintext). This is the cleanest articulation in the literature of what an agent-stack OS looks like: memory as a managed resource with explicit policies, lifecycle, ACLs, and tier transitions.

Memori (arXiv 2603.19935) is the most operationally aggressive of the bunch. It treats memory as a data structuring problem rather than a context-injection problem: an "Advanced Augmentation" pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, and at retrieval time only the relevant subset is passed to the model. On LoCoMo it achieves 81.95% accuracy at 1,294 tokens per query — 5% of full-context tokens, 67% fewer than competing memory systems, more than 20× cheaper than full context. The lesson is that structured retrieval over a structured memory is the right shape, and naive long-context approaches are leaving an order of magnitude in efficiency on the table.

The synthesis: memory as a tiered, learned, structured resource

Pulling the threads together, the converged 2026 view of context engineering looks like this. Context is a managed resource, with four tiers — KV/activation, working window, episodic scratchpad, persistent — each with explicit eviction or consolidation policies, access controls, and provenance tracking. The model's working window is the smallest tier and should be aggressively curated under the ALARA principle: minimum context that lets the agent succeed. Information that the agent will need later should be written to externalized state and retrieved on demand, not held in the window. Information that crosses sessions should live in a structured persistent memory layer, with retrieval routed through structure (graphs, intent-conditional traversal) rather than pure semantic similarity. Memory management is itself a learnable agent skill, trainable with Multi-Conv RL when there is no gradient flow between memory operations.

Record No. 003 · the four-tier memory stack

four tiers · explicit promotion / demotion policies · the orchestrator's job includes deciding which tier each piece of information lives in

the one architectural commitment

Anything you would not accept building on top of malloc and free without thinking, you should not accept building on top of model context windows. Context is a managed resource. Build the management layer.

Tier 4 in detail: the enterprise context graph

Tier 4 deserves its own treatment, because it is the layer most enterprises will actually build (or buy), and because it is the layer where the literature's research mechanisms most directly map onto a deployable architecture. The convergence of MAGMA's four-graph structure, MemOS's MemCube governance, A-MEM's memory evolution, and the production externalized-state pattern from Anthropic Research / Factory.ai / Symphony produces a clear synthesis of what an enterprise context graph actually is. Most teams currently build a vector store with metadata and call it a "knowledge layer." That is not what the literature recommends.

The first architectural decision is dimensional separation. MAGMA's headline empirical result — that on LoCoMo with GPT-4o-mini, four-graph retrieval scores 0.700 while full-context scores 0.481 — is the most decisive evidence in the recent memory literature that structure beats brute force. The mechanism is that pure semantic-similarity retrieval can locate what occurred but cannot reason about why; MAGMA's response is to maintain four orthogonal relational graphs over a shared node set, deliberately not merged. The enterprise extension is to add two more dimensions that the research papers do not need but every regulated deployment does: a provenance graph (every node knows its source — which document, which agent, which extraction run) and an authorization graph (every node knows who is allowed to read it, by tenant, role, classification, and need-to-know). Six dimensions, none merged, each queried by a traversal policy that knows what it is asking for.

Record No. 004 · enterprise context graph · six dimensions

six relational dimensions · four research dimensions from MAGMA + two enterprise extensions for provenance and authorization · none merged into a heterogeneous graph

The second decision is the unit of governance. MemOS's MemCube — a typed container with three metadata groups attached to every memory unit — is the cleanest articulation of what every node in an enterprise context graph should look like. Every node carries descriptive identifiers (timestamp, origin signature, semantic type), governance attributes (per-tenant ACL, lifespan policy, priority, compliance tags like PII / PHI / MNPI / ITAR), and behavioral usage indicators (access frequency, version chain for rollback, contextual fingerprint). It transitions through an explicit lifecycle — Generated → Activated → Merged → Archived → Expired — driven by system policies, user commands, and automated promotion based on access patterns. The enterprise additions worth making explicit on top of MemOS's structure are a confidence score (so derived facts are clearly distinguished from extracted facts), a contradiction marker (so two MemCubes asserting conflicting facts about the same entity are not silently overwritten), and a source tier (first-party from authoritative systems, second-party from vendor APIs, third-party from model-generated summaries — three different default trust levels at retrieval time).

Record No. 005 · MemCube anatomy · the unit of governance

a MemCube · payload + descriptive + governance + behavioral · MemOS lifecycle state machine · ◆ marks essay-added enterprise extensions

The third decision is the two-speed write architecture — drawn directly from MAGMA's dual-stream design but with enterprise teeth. The fast path runs on the critical path of agent interaction and contains no LLM calls: it segments incoming events, computes embeddings, updates the temporal backbone, writes the raw MemCube. Every operation is deterministic and fast. The slow path is an asynchronous worker that picks events off a queue and runs the LLM-driven enrichments: causal-edge inference, entity resolution, link generation, and memory evolution (A-MEM's mechanism, where a new memory can retroactively refine the context-description and tags of existing memories). The reason this matters operationally is the write-amplification budget. You cannot afford to spend five LLM calls on every fact landing in the system; ingestion would melt under any real volume. You can afford to spend several LLM calls on each fact eventually, prioritized by salience and recency. The two paths are how that budget is honored without blocking the agent.

Record No. 006 · two-speed write architecture

two-speed write architecture · fast path is on the critical path of agent interaction · slow path runs LLM enrichments asynchronously with priority queueing

The fourth decision is read-path discipline. Pure cosine-similarity retrieval is the dominant failure mode of enterprise RAG and the one MAGMA most clearly outperforms. The right shape is an intent-conditional pipeline: classify the query first, fuse multiple signals into anchor candidates via Reciprocal Rank Fusion, traverse the multigraph from those anchors with edge weights derived from intent, and synthesize the result with explicit provenance preserved as structured tokens that flow into the agent's window. "Why" queries weight causal edges 2.5–6× higher than baseline; "when" queries weight temporal 3–5× higher; "who" queries weight entity 2.5–5× higher; "what" balances semantic and entity. A depth-decay factor prevents drift; a drop threshold prunes irrelevant expansions; a hard authorization filter is enforced always, never optional.

Record No. 007 · intent-conditional read pipeline

four-stage read path · classify → fuse → traverse → synthesize · provenance preserved as structured tokens flowing into the working window

The fifth decision is write authority through the graph: one writer per fact. The single-writer-per-task-slot principle that anchors Part III applies just as strictly to the persistent memory layer. Every MemCube has exactly one component authorized to mutate it. Other components may read it freely (subject to the authorization graph) and may propose updates that flow through a contradiction-resolution queue, but cannot directly overwrite it. Three writer roles cover most enterprise patterns: source extractors (agents or non-LLM ETL pipelines that pull from authoritative source systems and write first-party MemCubes), synthesis agents (LLM-driven slow-path workers that produce derived MemCubes — summaries, causal-edge proposals, entity links — into a separate synthesis tier that is never confused with first-party data), and curators (humans-in-the-loop who can promote, demote, archive, freeze, merge, or resolve contradictions). The mechanism that enforces this is the same path-aware policy decision point from Part VI: every write goes through a gate that checks writer identity, target MemCube ACL, and tier rules. AgentSpec-style declarative rules with millisecond evaluation are the right shape, because per-write enforcement at LLM speeds is a non-starter.

The sixth decision is cross-modality transformation as a hot-tier optimization. MemOS's most distinctive mechanism is that the same memory content can live in three forms — plaintext (in your store, retrieved on demand), activation (pre-encoded as KV-cache state pinned to GPU memory), parameter (distilled into LoRA adapters or weight edits) — and frequently-accessed content should be promoted to the cheaper form. The economics are stark for any enterprise running tens of thousands of agent invocations per day: the top 1% of MemCubes touched on 80% of queries should not be re-encoded on every retrieval. Anthropic's prompt caching and OpenAI's prompt caching are the production-ready entry point for plaintext-to-activation promotion; cross-agent KV sharing via TokenDance / KVComm / KVFlow is the next step at scale; LoRA-based parameter promotion is a slower-cadence option for genuinely stable knowledge that would otherwise saturate retrieval throughput. The transformations go both directions, and the slow path is responsible for promoting and demoting based on access-frequency telemetry stored in each MemCube's behavioral metadata.

Stitching the six decisions together produces an architecture that respects what the 2025–26 research has established: dimensional separation prevents context rot; MemCubes carry the governance metadata an enterprise audit will demand; the two-speed write path honors the write-amplification budget; intent-conditional retrieval extracts more useful context than naive similarity does; one-writer-per-fact prevents the inter-agent context corruption MAST classifies as a top failure mode; and cross-modality transformation makes the economics sustainable at production volume. The honest qualifier: very few teams have built all six decisions yet. MAGMA, A-MEM, and MemOS are research; production memory frameworks like Mem0 and Letta have parts of this but not the full structure. So the synthesis is partly description of what exists and partly forecast of what will exist within twelve months. The architectural argument is settled by the literature; the implementation work is not yet done. If you are starting from scratch on Monday, the priority order — dimensional separation first, MemCube governance second, two-speed paths third, intent-conditional retrieval fourth, write authority fifth, cross-modality sixth — gives you the highest expected return per unit engineering effort.

The next question — once context is properly managed — is who is allowed to write to shared state, and who verifies what was written. That is the subject of Part III.

III

PART

§ 03 · the load-bearing decisions

Write authority and the verification trick

single writer per task slot · asymmetric verifiers

Cognition's "Don't Build Multi-Agents" (June 2025) argued that swarm-of-agents architectures — many specialized agents communicating via natural-language messages — were structurally fragile. The key observation: every action an agent takes carries an implicit decision, and when multiple agents act concurrently on shared state, these implicit decisions conflict in ways no amount of messaging can resolve.

Ten months later, Cognition's follow-up named exactly which patterns survive production. Three survived: the clean-context reviewer (a secondary agent reviews without access to the primary's reasoning history), the frontier-model augment (a more capable model consulted on high-stakes decisions), and the map-reduce manager (fully-specified independent child tasks, synthesized by a manager). Absent from the list: concurrent writes to shared state, multi-agent debate toward consensus. Dismissed after ten months of data.

the load-bearing principle

Single writer per task slot. Multiple, independent verifiers with deliberately asymmetric context.

The first half of this principle is about why concurrent writers fail; the second half is about what makes verification actually verify rather than echo. Both halves require careful argument because they cut against the most popular MAS designs of 2024.

Why concurrent writers fail

The intuitive argument for concurrent writers is appealing: if one agent can do work, several agents can do more work at once. The counterintuitive reality is that concurrent writers serialize their failure modes rather than their successes. Every action carries a hidden state-change — a tool call mutates the world, a generated artifact commits to a particular interpretation of the task, a written file establishes a name and structure other agents will refer back to. These hidden state-changes are not visible in messages between agents; they are visible only in the next agent's context, and only if that context includes a reading of the modified state. When two agents concurrently take actions whose hidden state-changes are inconsistent — Agent A names a function processTransaction, Agent B writes a caller expecting process_transaction, and now the build is broken — the inconsistency is not in their messages but in the world they share. Reconciling such inconsistencies after the fact requires another agent to read the world, identify the conflict, and decide which version to keep, which is a verification problem not a coordination problem, and the verifier needs context the original agents didn't have.

The MAST taxonomy from the Berkeley team (arXiv 2503.13657, NeurIPS 2025 Spotlight) classifies fourteen distinct failure modes of multi-agent LLM systems based on rigorous annotation of 150 traces with κ=0.88 inter-annotator agreement. Five of the fourteen modes — almost all of the "inter-agent misalignment" family — are direct consequences of concurrent-writer dynamics: message misinterpretation, context drift between agents, premature commitment, conflicting plans never reconciled, and shared-state corruption. These are not bugs to be patched. They are structural properties of any architecture in which multiple components are simultaneously authorized to mutate shared state without a single arbitrating authority.

The Anthropic team made the same point operationally. Their multi-agent research feature uses an orchestrator that spawns parallel sub-agents, but the sub-agents are read-only: they explore search results, extract citations, summarize findings, and return artifacts that the orchestrator integrates into a final response. None of the parallel sub-agents writes to a state that any other parallel sub-agent reads. The orchestrator writes; the Citation Agent (which runs sequentially after all parallel work is done) writes; the parallel sub-agents return results to the orchestrator and exit. This is not a swarm. It is a fan-out / fan-in architecture in which the parallelism is on the read side and the writes are serialized.

The map-reduce-manage pattern Cognition validated has the same structure. The manager decomposes the task into child tasks that are fully specified — each child receives all the context it needs to execute independently, with no reference to other children's outputs — and the children execute in parallel. The children write only to their own designated artifacts. The manager reads all the artifacts when the children complete and synthesizes the result. The crucial properties: children do not share writeable state, children cannot observe each other's progress, the manager is the sole consumer of children's outputs and the sole writer of the final result. Fan-out is on the read and execution side; the write graph is a tree whose only sink is the manager.

empirical anchor · "more capable, less cooperative"

OpenAI's o3, given a zero-cost cooperation task with explicit instructions to maximize collective revenue, achieved 17% of the optimal collective performance. Analysis of 8,807 of o3's private thoughts found that 39.3% contained "hard defection" language — explicit withholding, leverage assertions, bargaining position. The model generated 373 instances of leverage-oriented language across runs ("Q50, value 100, will be a great bargaining chip").

The auto-fulfill ablation made this unambiguous: when the system automatically fulfilled all information requests on the agent's behalf, o3 achieved 94.9%. The model knew how to use information; it refused to share it. Capability did not imply cooperation; it implied something closer to resistance to cooperation.

The lesson is operational. In a multi-agent system, do not rely on emergent cooperation between peer LLM agents. The more capable models will not necessarily cooperate better, and the failure mode is not always incompetence — it is sometimes goal-divergent agency. Architectures that assume cooperation will be fragile against frontier models trained with reward signals that emphasize individual performance. Architectures that enforce coordination through structure — single writer, fully-specified child tasks, explicit protocols — are robust against this entire class of failure.

Why majority voting is an echo chamber

Concurrent writers are one failure pattern. The other classic MAS pattern that turns out to be fragile is majority voting over agent outputs. The intuition is again appealing: if one model is sometimes wrong, ask three models and take the majority. The reality is that LLMs are highly correlated, and majority voting over correlated estimators does not reduce variance — it amplifies whatever shared bias the estimators carry.

The cleanest formalization of this is in the AgentAuditor paper from USC (arXiv 2602.09341, Feb 2026). The authors define confabulation consensus as a mismatch between multiplicity and validity: let D = {D₁, …, D_n} be the distinct semantic hypotheses underlying an agent slate. Under confabulation consensus, an erroneous hypothesis D_i achieves multiplicity m(D_i) > 1 while a correct hypothesis D_j exists but is outnumbered. Frequency is decoupled from validity. The theoretical grounding is a correlated-voting extension of the Condorcet Jury Theorem: under pairwise correlation ρ among voters, the variance of the sample mean does not vanish as n → ∞. Specifically, Var(X̄_n) → ρ · p(1-p) rather than p(1-p)/n. With non-negligible ρ, adding more agents does not improve majority-vote reliability — the system behaves like an echo chamber with effective pool size much smaller than n.

Record No. 008 · confabulation consensus · variance vanishing vs plateau

independent voters → variance shrinks toward 0 · correlated voters → variance plateaus at ρ · p(1−p) · adding more correlated voters does not improve a majority vote — it produces a more confident wrong answer

TRY IT · ADJUST CORRELATION AND AGENT COUNT

ρ (correlation): 0.40 n (agents): 20

Drag ρ toward 1 to see how correlation destroys the Condorcet promise. Each agent has P(correct) = 0.70.

This is not a hypothetical concern. The empirical evidence is severe. AgentAuditor reports that on the "MinC regime" — the subset of cases where majority is wrong and minority is correct, which is precisely the hard subset where you need verification to do real work — majority voting recovers 0% by definition. LLM-as-Judge, despite receiving all of the agent traces and being free to reason over them, recovers only about 56%. The reason is that LLM-as-Judge is itself susceptible to majority bias: when the judge sees "three agents said X, one said Y," the judge's prior shifts toward X regardless of the underlying logic. Without structural decorrelation, even a judge model will default to majority alignment with high probability.

AgentAuditor's mechanism for breaking this is worth describing in detail because it is the right shape for verification in a multi-agent system. Each agent's trace is decomposed by a segmentation function into a sequence of indivisible logical steps. Steps are embedded; a Reasoning Tree is built by incremental insertion, where each new step either folds into an existing branch (semantic affinity above threshold θ) or bifurcates a new child node. Nodes with multiple children are Critical Divergence Points — these are the places where agents disagree, and they are the only places where verification needs to do work. For each Critical Divergence Point, a Divergence Packet is constructed: the shared prefix history up to the divergence, an evidence window of the next W steps along each branch, and the support counts (how many agents went each way) included as a hint but explicitly not as a decision signal. The auditor then adjudicates the immediate logical consequence of the divergence rather than re-evaluating full traces, exploiting what the authors call the "comparative hardness principle": it is epistemically easier to decide which of two diverging branches is better-supported than to regenerate a full solution from scratch.

Record No. 009 · reasoning tree · critical divergence points · ACPO

reasoning tree → critical divergence points → divergence packets → auditor trained with ACPO · the auditor does NOT default to majority alignment because its training distribution is precisely where majority alignment fails

The training innovation is the load-bearing piece. Standard preference-optimization techniques like DPO, applied to a corpus of multi-agent traces, will mostly see cases where the majority is correct and reinforce the shortcut "majority = right." AgentAuditor instead constructs preference data specifically from majority-failure cases: instances where the majority is wrong but at least one minority agent is correct. The Auditor is trained on Divergence Packets where the input always contains the misleading majority cue, with preferred completions favoring the correct-minority branch and dispreferred completions favoring the incorrect-majority branch. They call this Anti-Consensus Preference Optimization (ACPO). The result is a verifier that does not default to majority alignment because its training distribution is precisely the cases where majority alignment fails.

On MinC subsets, AgentAuditor recovers 65.35% on GSM8K and 81.82% on AMC, compared to 0% for majority voting and roughly 56% for LLM-as-Judge. Token efficiency: AgentAuditor consumes roughly half the input tokens of LLM-as-Judge by auditing only Critical Divergence Points rather than full traces. The architectural lesson generalizes: verification cannot be done by an agent that shares the priors of the proposer. Verification must be done by a component with deliberately asymmetric context, structural decorrelation from the proposer, or explicit training to override the majority shortcut. Anything else is an echo chamber wearing a verification costume.

The context-asymmetry trick

If correlated agents make verification fragile, what does a working verification architecture look like? The most sophisticated answer in the 2026 literature is the Refute-or-Promote pattern from a paper of the same name (arXiv 2604.19049), originally developed for code-defect discovery but applicable far beyond. The architecture is a multi-stage pipeline in which proposer agents and adversarial agents have deliberately asymmetric context, and a candidate finding survives only if the adversarial agents fail to refute it.

The mechanism is precise. Stratified Context Hunting runs three or more proposer agents in parallel with stratification along three orthogonal axes: source-stratified (each hunter is primed with a distinct context slice — one gets prior CVEs, another gets git-hotspot data, another gets normative spec text), scope-stratified (each is scoped to a non-overlapping subsystem), and wave-stratified (each iteration is re-seeded with concrete learnings from the previous wave). The hunters generate candidate findings with an explicit self-critique — for each candidate, the hunter must articulate why the candidate might not be exploitable before reporting. Then the kill mandate: adversarial agents are dispatched with a destruction-only brief — not a balanced evaluate-and-improve mandate, not a debate-style symmetric argumentation, but specifically the task of refuting the candidate. A candidate survives only if no adversarial agent produces a code-grounded refutation and the original creative agent produces a plausible exploitation argument.

The critical architectural choice is the context asymmetry rule: adversarial agents in the first stage receive only the candidate claim, not the proposer's reasoning. This prevents what the authors call "anchoring cascades": if an adversarial agent sees a plausible exploitation argument first, it faces an anchored refutation task rather than a clean falsification task. The adversary must invent the falsification from scratch. Later stages deliberately create three different adversarial contexts — an informed attacker with full synthesis, a naive attacker with only the claim, and a senior-tier agent with selective summary — because cold-start agents independently reaching a different conclusion provide higher-value signal than consensus among informed agents. The pattern's failure mode is also instructive. The authors document a case where 80+ agents — including dedicated adversarial reviewers — unanimously confirmed a Bleichenbacher padding oracle in OpenSSL's CMS module that did not exist. The root cause: all agents shared a flawed assumption ("valid PKCS padding implies real CEK extracted") that holds for ordinary ciphertext but not for Bleichenbacher probes. Adversarial framing catches reasoning bias but not empirical errors. The defense was to add a final empirical-validation gate using an independent runtime check. The lesson is general: even asymmetric verifier pipelines can share blind spots, and the strongest verification adds a non-LLM component (compilation, type-check, test execution, formal proof) at the end.

This pattern — proposer / adversarial-refuter / cross-model critic / empirical validator — is now the right shape for any high-stakes verification in a multi-agent system. Single-LLM judges are echo chambers; symmetric debate is a less-bad echo chamber; asymmetric refutation with non-LLM ground truth is verification that actually verifies.

Debate, when it works

There remains a question: are there any settings where multi-agent debate (MAD) — multiple agents arguing back and forth toward consensus — actually works? The 2026 literature gives a careful answer. The Free-MAD paper from Beijing Institute of Technology (arXiv 2509.11035) and the MAD-M² paper from HKBU/Tencent (arXiv 2603.20215, ICLR 2026) both demonstrate that debate-based architectures can outperform single-agent baselines, but only with three structural modifications relative to the naive form.

The first modification is anti-conformity prompting. Standard MAD prompts agents with "the responses from other agents are as follows..." — language that syntactically primes deference to peer outputs. Free-MAD replaces this with prompting that explicitly tells agents to assume some peers may be compromised, sets a high bar for belief revision (change your answer only if you see clear evidence your own answer is wrong, not to reach consensus), and treats peer outputs as adversarial inputs to be verified rather than authoritative inputs to be incorporated.

The second modification is trajectory-aware scoring rather than terminal voting. Free-MAD tracks a running score for each candidate answer across the entire debate, weighting later-round adoptions less heavily than earlier ones (because later-round adoptions reflect conformity pressure as agent contexts fill with peer outputs). The score for an answer is a function of the entire trajectory, not just the final round. This is a non-trivial change: it explicitly says the path of belief change is informative, and a model that flips to the majority on round three is providing weaker evidence than a model that held its ground on round one.

The third modification is memory masking. The MAD-M² paper proves theoretically that MAD performance under correlated errors scales as p(1 − αk/n) where k is the number of erroneous memories visible to each agent, n is the agent count, and α is a robustness coefficient. For hard problems where p < 0.5, adding more agents makes performance worse because each additional agent exposes the others to more potentially-wrong context. The mitigation is to insert an Evaluate-and-Mask step between rounds: each agent rates the previous round's outputs as plausible, erroneous, or ambiguous, and erroneous outputs are masked from the next round's context. The objective version uses perplexity as a confidence proxy and retains only the minimum-perplexity output. On AIME mathematical reasoning, the objective version achieves +13.3% over vanilla MAD with Qwen2.5-Math-7B; the subjective version works better on weaker models where perplexity is unreliable.

When you combine these three modifications — anti-conformity prompting, trajectory scoring, memory masking — debate works, sometimes spectacularly. Free-MAD-n achieves 16% absolute improvement over baselines averaged across eight benchmarks at a single round; MAD-M²(O) hits +13.3% on AIME. Without these modifications, debate is the worst pattern in the MAS playbook because it actively amplifies correlated errors through repeated exposure. The takeaway is not "debate works" or "debate doesn't work" but: debate is useful only when its inputs and stopping criteria are explicitly designed to resist conformity pressure. Most production debate implementations do not do this and are accordingly fragile.

There is a related result that bears on persona-based architectures. Choi, Zhu, and Li's NeurIPS 2025 paper "Debate or Vote?" ran a careful comparison and concluded that majority voting is a strong baseline that debate often fails to beat unless the agents are heterogeneous and given explicit anti-conformity scaffolding. Christoph Riedl's "Emergent Coordination in Multi-Agent Language Models" (arXiv 2510.05174), using partial information decomposition of time-delayed mutual information, found that without intervention, multi-agent systems show temporal coupling but little cross-agent synergy — they behave as aggregates, not collectives. The interventions that produced genuine higher-order structure were persona assignment (giving each agent a distinct identity) plus theory-of-mind prompting ("think about what the other agents might do"). Both were necessary; neither alone sufficed. The architectural rule that emerges across this cluster of results: structure beats free-form chat. Anti-conformity scoring, context asymmetry, persona differentiation, theory-of-mind prompting, memory masking, and explicit protocols are all instances of the same principle: unstructured collaboration approximates aggregation; structured collaboration approximates collective intelligence.

The synthesis

Putting Part III together: the load-bearing decision in a multi-agent system is not how many agents to use or which model to put in each role. It is the write-authority graph — who is allowed to mutate state, when, and observed by whom. The right write graph is a tree whose only sink is a designated component per task slot. Around that tree, multiple read-only verifiers operate with deliberately asymmetric context, structural decorrelation, or explicit training to override majority bias.

Record No. 010 · the write-authority topology

single writer · multiple read-only verifiers with asymmetric context · the write graph is a tree whose only sink is a designated component per task slot

If you remember one sentence from Part III

A multi-agent system is not a debate club; it is a proposer / refuter pipeline with structural asymmetry, anchored to externalized state, with non-LLM ground truth at the bottom. Anything that looks like a swarm of equals reaching consensus by chat is, in 2026, almost always the wrong design.

Now: who decides what to dispatch, and when, and to whom? That is the orchestrator's job, and the orchestrator is the next part.

PART

§ 04 · de-participate the orchestrator

The orchestrator as scheduler, not chatter

a smart, knowing center scales like a centralized planner — that is, badly

There is a temptation, when you start building a multi-agent system, to make the orchestrator a chat participant. It feels natural: the orchestrator is the smart agent, it should reason about the problem, it should incorporate sub-agent findings into its own evolving understanding, it should compose the final response. That temptation produces a particular shape of architecture that has been the default in 2024 frameworks: a "lead" LLM that maintains a long, growing context window, dispatches sub-agents by writing messages into that window, reads their replies into the same window, and over time accumulates a complete narrative of the task into its own thread of thought. AutoGen's GroupChat works this way. Many CrewAI implementations work this way. It is the mental model of "agents talking to each other."

It is also wrong, for a reason that the 2025–26 production systems have made unmistakable: the orchestrator's context window is a bottleneck. When the orchestrator participates in the conversation rather than scheduling it, every sub-agent reply consumes orchestrator tokens; the orchestrator's window grows linearly with task complexity; the orchestrator becomes the critical path through which all coordination must serialize; and the orchestrator's reasoning quality degrades as its window fills with material that has nothing to do with the current dispatch decision. Every design innovation in 2025–26 production orchestration has been, in essence, an effort to de-participate the orchestrator: to reduce its role from a chat-thread participant to a scheduler that dispatches work, governs policy, and integrates results without ever holding the full task state in its own window.

The latency math (D3X)

The cleanest formal statement of the problem comes from the D3X paper out of Goldsmiths (ICAAI 2025). The team derives a simple latency model for DAG-decomposed agentic tasks:

T_wall = Θ ( τ · [ L + (n − L) / w ] )

where τ is the average per-subtask execution latency, L is the critical path length of the task DAG (the longest dependency chain — the number of unavoidable sequential steps), n is the total number of subtasks, and w is the available concurrency. The first term L is the irreducible serial work; the second term (n-L)/w is the parallelizable work distributed across w workers. As w → ∞, latency approaches τ · L — the pure critical path, the theoretical minimum. As w → 1, latency approaches τ · n — fully sequential.

This bound is achievable only if τ is constant across subtasks. The instant τ grows with subtask number — because, say, the orchestrator's context grows with every sub-agent reply, and every dispatch decision now requires processing a longer prompt — the linear improvement from parallelism collapses. A centralized planner-executor that re-appends the growing scratchpad at each step has effective τ that grows superlinearly with n, and the theoretical critical-path benefit of parallelism is wiped out by the cost of the orchestrator becoming the bottleneck. The D3X authors name OpenAI Deep Research and Manus AI as examples of this anti-pattern — systems that appear to be doing parallel work but in practice serialize through a centralized planner whose context grows quadratically.

Record No. 011 · critical path vs orchestrator bottleneck

left: orchestrator participates in the chat → context grows quadratically → parallelism collapses · right: orchestrator is a scheduler with no executing role → workers see only dependency-local context → critical-path latency

Their architectural response is to bound the context that any worker receives to dependency-local context only. Each worker sees only the bounded summaries from its immediate parent nodes in the DAG, never the full accumulated scratchpad. The Operations Manager — D3X's name for the orchestrator — has no executing role; it only manages readiness signals, activating subtasks as soon as their parent dependencies complete. This is the philosophical inversion: the orchestrator is not the smart agent that participates in the work; it is the scheduler that watches the dependency graph and dispatches workers when their preconditions are met. Workers run in parallel without observing each other; results are aggregated only when leaf nodes complete. Empirically, D3X reports up to 4× speedup over centralized sequential execution and 37–83% input-token reduction. The token reduction is itself architectural — workers don't accumulate prefix context, so each worker's τ is genuinely constant.

Symphony: the issue tracker as control plane

If you take the de-participate-the-orchestrator principle to its logical conclusion, you end up with the orchestrator being a scheduler over external state — and once you're there, the question becomes: what should the external state be? OpenAI's answer, published in April 2026 as the Symphony specification, is striking and obvious in retrospect: make the issue tracker the control plane.

Symphony is not a product; it is a SPEC.md plus a reference Elixir implementation generated by Codex in one shot. The conceptual move is to treat tickets in Linear (the issue tracker) not as instructions delivered to humans but as units of work dispatched to coding agents. Every open ticket in an "active" state automatically maps to a dedicated agent workspace; the orchestrator continuously polls the tracker; whenever an active ticket has no agent assigned, an agent is dispatched into a fresh per-issue workspace; whenever an agent completes a turn, the orchestrator decides whether to keep the agent alive (if the issue is still active) or release it (if terminal). The orchestrator never talks to an agent; it never reads agent output into its own context window. The orchestrator's entire state — issues claimed, agents running, retry timers, exponential backoff — fits in a small in-memory map plus the Linear API. There is no durable database for the orchestrator's state; if the orchestrator process restarts, it reconstructs state from Linear queries plus filesystem inspection.

The reported result was a 500% increase in landed pull requests on some OpenAI teams in the first three weeks. That number is partly a function of having had a previous baseline in which Codex sessions were managed by hand, but the architectural lesson is the deeper point. Symphony works because it pushes the orchestrator's state into a system that already has a coordination protocol — the issue tracker — and reduces the orchestrator's job to the small set of decisions that need an LLM-aware scheduler: how many agents to run concurrently, when to retry a failed agent, when to declare an agent stalled, when to spawn a fresh workspace versus reuse one. The big decisions (what work needs to be done, who depends on what, when work is complete) live in a tracker that engineers and managers already understand and that has battle-tested concurrency and access-control semantics. The orchestrator is a thin layer of LLM-aware glue between the tracker and per-agent workspaces. It is, in the most literal possible sense, a scheduler.

The data model is worth absorbing. Issues have orchestration states distinct from tracker states: Unclaimed, Claimed (reserved to prevent duplicate dispatch), Running, RetryQueued, Released. Workspaces are filesystem directories under a sanitized path layout that prevents traversal attacks. Agents are launched with cwd == workspace_path and a JSON-RPC stdio protocol (turn/start, turn/completed, turn/failed). Failure-driven retries use exponential backoff with a hard maximum; stall detection kills agents that stop emitting events for more than five minutes. The full spec is small enough to be rendered in a few pages; the Elixir reference is small enough to be regenerated from the spec by Codex on demand. Symphony's most aggressive design choice is that it has intentionally no proprietary value beyond the spec itself — it is a reference architecture that any team can implement in any language by handing the spec to a coding agent. If that does not crystallize the principle "the orchestrator is a scheduler, not an agent," nothing will.

Factory.ai Missions: externalized state at scale

Symphony works because the issue tracker already exists. But many production systems need more granular shared state than a tracker can carry — file-by-file specifications, validation contracts, intermediate artifacts. Factory.ai's Missions architecture, described in their April 2026 engineering writeup, shows what externalized state looks like when designed deliberately for multi-day agent execution.

The Missions design starts from the observation that agent trajectories are append-only: the model's reasoning at step k is a function of every past thought, observation, and action in its context. Two failure modes follow inexorably. The first is irrelevant context accumulation — as task scope broadens, increasing fractions of the context become unrelated to the current subtask, and signal-to-noise degrades. The second is self-evaluation bias — an agent that implemented a feature is structurally worse at evaluating its own implementation than a fresh agent reading the same code, because the implementer's context is dense with reasoning that justifies their choices. Both failure modes argue for short-context, role-separated agents with externalized shared state.

Missions splits work across three roles with strictly enforced incentive separation. The Orchestrator plans, decomposes, and steers, but explicitly avoids accumulating granular implementation context. It delegates all investigation and implementation to workers, interacting with implementation details only through structured summaries. The Workers complete bounded, well-specified features with clear success criteria; each worker starts with fresh context containing only its feature specification, writes tests first, implements, iterates, and hands off. Critically, workers do not perform final correctness judgment — their job ends at "I believe this is done." The Validators evaluate completed work for correctness and completeness but never implement fixes. A validator that finds a bug files a report; the orchestrator creates a fix feature that a future worker implements. The architectural significance: a validator's trajectory is entirely composed of evaluation steps, not implementation reasoning, so it doesn't share the implementer's anchoring bias.

The shared state lives across four file artifacts. The validation-contract.md is created by the orchestrator before feature decomposition — a finite checklist of testable behavioral assertions that define completion. The format is precise: ### VAL-AUTH-001: Successful login — A user with valid credentials submits the login form and is redirected to the dashboard. Tool: agent-browser. Evidence: screenshot, network(POST /api/auth/login → 200). The orchestrator writes this from its understanding of requirements, before it has designed any features, ensuring the contract reflects intent rather than implementation. The features.json is the decomposition of work into bounded features, each claiming which validation-contract assertions it fulfills, organized into milestones. Two more files capture intermediate state and progress. No single agent holds the complete project state in its window; the project state lives in the files. Agents are assigned narrowly-scoped slices of it and contribute back changes to specific artifacts.

production durations · Factory Missions · April 2026

16.5-hour Slack clone · 33.8-hour COBOL-to-Java migration · 22.3-hour Rust HTTP benchmarking tool · 24.2-hour production memory-leak investigation. 185 agent runs · 38.8K lines · 89.25% statement coverage on the Slack clone. What is happening here is not "one agent thinking very hard for a long time" — it is many bounded agents picking up specific, well-specified tasks against a shared filesystem state and a contractual specification. The orchestrator does not write code. It does not read code, except through structured summaries. The orchestrator runs on a tiny fraction of the total token budget, by design.

Crab: durable execution as the OS layer

Even with externalized state and a thin scheduling orchestrator, real production systems crash. Containers die. Models time out. Networks partition. An agent that has been running for thirty hours when its sandbox crashes is not interesting unless it can be restored. Until 2026, the durable-execution layer of agent stacks was either expensive (full per-turn checkpointing of the entire sandbox state) or wrong (chat-history replay that misses OS-level side effects — installed packages, spawned processes, files created by shell commands). Crab (arXiv 2604.28138, HKUST, April 30 2026) is the first system to do this correctly, and it deserves attention as the missing piece of the orchestrator-as-scheduler picture.

Crab's central observation is that there is an agent–OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees filesystem and process changes but lacks turn-level context to judge which changes are recovery-relevant. This gap has produced two bad solutions: chat-only recovery (cheap, ~6% recovery success on Terminal-Bench) and full per-turn OS checkpointing (correct, ~1.55–1.81× slowdown that scales badly under sandbox density). Crab closes the gap with a runtime that sits between the agent and its sandbox.

The Crab Coordinator is an HTTP reverse proxy on the agent–LLM API path. Every time the agent sends an outbound request to the LLM, it marks the completion of a turn. The Coordinator records the request-response pair, queries the Crab Inspector for the turn's OS-visible state changes, issues a checkpoint request to the C/R Engine if needed, and buffers the LLM's response until the checkpoint is durable. The Inspector uses eBPF hooks on syscall raw tracepoints to capture filesystem-affecting operations (create, delete, rename, write) at per-file granularity, and tracks process and memory state via cgroups V2 and soft-dirty bit tracking. Turns are classified as SKIP (no state changes), FILESYSTEM_ONLY, PROCESS_ONLY, or FULL. The empirical observation that drives the cost story: >75% of agent turns produce no recovery-relevant state changes — most turns are read-only file reads, web fetches, or pure reasoning. Treating every turn as equally stateful is the source of full-checkpointing's prohibitive cost. Crab's turn classification skips 87% of turns on Terminal-Bench (Claude Code), checkpoints filesystem-only on 5%, and does full checkpoints on 8%.

100%

recovery correctness

on Terminal-Bench
and SWE-bench

1.9%

overhead vs no-fault

at 96 sandboxes/host
under 1 injected crash

87%

turns skipped (no I/O)

claude code on
Terminal-Bench

6–13%

chat-only recovery

misses OS state
(the alternative)

The C/R Engine is the data plane: OpenZFS snapshots for filesystem state (copy-on-write, 20–100ms), CRIU via runc for process state (100ms–1s), with a Manager that maintains versioned manifests pairing filesystem and process artifacts. Restore composes the latest valid filesystem snapshot with the latest valid process snapshot to produce a recoverable state. The end-to-end overhead under one injected crash per task is within 1.9% of no-fault execution at up to 96 co-located sandboxes per host. Recovery correctness is 100% on both Terminal-Bench and SWE-bench. By comparison, chat-only recovery achieves 6–13% on Terminal-Bench, and full per-turn checkpointing adds 3.06–3.78× slowdown at the same sandbox density due to host I/O contention.

The architectural significance for orchestration is direct. An orchestrator that schedules multi-day tasks across many agents needs durable execution at the agent-sandbox level — without it, the orchestrator's careful state management is undone every time a container restarts. Crab makes durable execution cheap enough to default-on, which means it is no longer a heroic piece of infrastructure that only the most sophisticated teams build. It is the standard substrate. LangGraph's PostgreSQL-backed checkpoints provide the same property at the application layer; Inngest's step.ai and Temporal's Durable AI Agents Bundle provide it at the workflow layer. Crab provides it at the OS layer. By 2026 the production-default expectation is that all three layers exist, with Crab-style turn-aligned semantic checkpointing as the OS substrate.

Harness/runtime separation as the central architectural commitment

The thread that runs through Symphony, Missions, Crab, and the deep-agents architecture LangChain has documented in its April 2026 engineering essay is the same: separation of harness from runtime. The harness defines what the agent does — prompts, tools, skills, the agent's behavioral specification, the per-task instructions. The runtime provides how — durability, memory, multi-tenancy, observability, sandboxing, identity, policy enforcement. The orchestrator sits at the boundary, dispatching harness invocations into runtime instances and integrating results.

Record No. 012 · harness/runtime separation

harness/runtime split · the agent stack analog of the userspace/kernel split · every serious 2026 framework converges on this separation

The pattern is general. LangGraph's harness is the graph code — nodes, edges, state types, prompts; its runtime is the LangGraph Platform — checkpoints to Postgres, interrupt and resume primitives, middleware hooks, sandbox routing. OpenAI Agents SDK v2's harness is the agent definition — instructions, tools, handoffs; its runtime is the snapshot-and-rehydrate harness execution layer with model-native primitives. Mastra's harness is the TypeScript agent + workflow definitions; its runtime is the AgentNetwork / Memory Gateway / Cloud serving layer. Bedrock AgentCore's harness is whatever you bring; its runtime is the seven managed services (Runtime, Memory, Observability, Identity, Gateway, Browser, Code Interpreter). The split is now nearly canonical across every serious framework.

The harness/runtime split matters because it makes both layers independently improvable. A harness can move between runtimes — laptop, Modal, on-prem — without changing the agent's behavior. A runtime can be upgraded — Crab-style turn-aligned checkpointing, KV-cache sharing, semantic caching, identity hardening — and every harness running on it gets the benefit. This is the agent-stack analog of the userspace/kernel split. Operating systems work because that split exists; the agent stack will only mature in the same way to the extent it embraces the split.

the central architectural commitment

The single most important architectural commitment a team can make in 2026 is to honor harness/runtime separation. Concretely: do not put durability logic in the agent prompts. Do not put memory logic in the agent's tool calls. Do not put policy enforcement in the agent's chain-of-thought. Push all of those into the runtime layer. The agent should be a relatively dumb component that does a focused thing well; the runtime should be where the engineering rigor lives.

What the orchestrator should and should not do

To make the orchestrator-as-scheduler concrete, here is what the orchestrator should do in a 2026 production system:

It should dispatch work to workers based on task structure and current state. It should manage retries and stall detection, deciding when an agent has gone wrong and should be killed and restarted. It should enforce policy by consulting the policy decision point before high-blast-radius actions are dispatched. It should route to runtime — choosing which sandbox, which model, which memory tier each worker gets. It should integrate verifier reports by dispatching new work in response to failed verifications. It should hand off to humans at clearly-defined gates where human judgment is required.

What the orchestrator should not do: it should not participate in the agents' chat, ever. It should not accumulate the full execution history in its window. It should not write artifacts itself; that is a worker's job. It should not perform code analysis, document summarization, or any other domain-specific reasoning that a worker is better suited for. It should not hold the entire task state in its own window; it should hold pointers into externalized state.

the operational test

The orchestrator's context window should grow at most logarithmically with task size. If your orchestrator's window is growing linearly with the number of completed sub-tasks, you have made it a participant rather than a scheduler, and the architecture will not survive scale.

The Magentic-One design — with the dual ledger of task and progress — is the right shape if the ledgers are bounded summaries that don't grow with task length. The Anthropic Research design — with the orchestrator saving plans to external memory and re-spawning sub-agents with clean context — is explicit about this constraint. The Symphony and Factory.ai designs go further and externalize even the plan into structured artifacts.

This is not a small change in mindset. The 2024-vintage instinct is to make the orchestrator the smart, knowing center of the system. The 2026 reality is that a smart, knowing center scales like a centralized planner — that is, badly — and the systems that work in production make the orchestrator dumb on purpose, pushing intelligence out to workers and policy out to the runtime layer.

Now, with context managed (Part II), write authority and verification properly structured (Part III), and the orchestrator demoted to a scheduler (Part IV), we have a coherent shape for the application layer. The next part addresses the protocol layer — what travels between agents, what travels between agents and tools, and what the lock-in implications of the 2026 protocol stack are.

PART

§ 05 · what travels between agents

The protocol stack and what it locks in

MCP · A2A · AP2 · the L9 gap

Every architectural decision discussed so far — how context is managed, who writes to shared state, how the orchestrator schedules — implies a wire format. When an orchestrator dispatches a worker, something has to travel between them: a request payload, an identity assertion, a capability claim, a return artifact. When a worker invokes a tool, something has to travel between the worker and the tool: a method call, a structured result, an error condition. When two agents from different organizations cooperate, something has to travel across an organizational boundary: an authentication, an authorization, a contract. By 2024 these wires were ad-hoc — mostly bespoke HTTP APIs, with most teams reinventing the same patterns badly. By 2026, three protocols have settled into the production stack, with one cross-cutting concern (identity) unresolved and one entire layer (semantic interoperability) still open.

This part examines those protocols not in survey form but in the spirit of what they lock in. A protocol is not just a wire format; it is a set of architectural commitments that propagate forward into every system built on top. Choosing MCP over a bespoke API is not just a pragmatic engineering decision; it commits you to a particular trust model, a particular capability-negotiation flow, a particular set of OAuth assumptions, a particular discovery surface. Choosing A2A commits you to a particular notion of what an agent identity is and what counts as task completion. Choosing AP2 commits you to a particular shape of agent-initiated commerce. The protocol stack is, in a real sense, a constitution for the agent ecosystem; it is worth understanding clause by clause.

MCP: tools as a typed, sandboxed surface

The Model Context Protocol won the tool layer between mid-2024 and mid-2025. By the November 2025 spec revision, it had crossed roughly 97 million SDK downloads and counted OpenAI, Google, Microsoft, AWS, GitHub, and Stripe among its production adopters. The core wire format is JSON-RPC 2.0 over either stdio (subprocess for local tooling) or Streamable HTTP (a single endpoint that handles both POST and GET, with SSE for streaming). The transport choice is not incidental: stdio is the right default for local development tooling because it requires no network stack and the client can fully control the server lifecycle; Streamable HTTP is the right default for hosted tools because it allows an MCP server to live behind a load balancer, scale horizontally, and route requests across multiple instances.

What matters for systems builders is what MCP primitives lock in. Three are central.

Tools are model-controlled functions: the LLM decides when and with what arguments to invoke them, given the JSON Schema description the server provides. The trust model here is sharp and worth quoting from the spec: "tool annotations MUST be treated as untrusted unless obtained from a trusted server, and hosts MUST obtain explicit user consent before invoking any tool." Tool descriptions are LLM-readable text — which means tool descriptions are an attack surface. Anything that can write a tool description can attempt to inject prompts into the model. This is why the OWASP MCP Top 10 names "tool poisoning" as risk #1 and why the OX Security disclosure of April 2026 — in which 9 of 11 MCP registries were demonstrably poisonable — was the largest concrete agent-security incident of the year. Building on MCP commits you to managing tool provenance with the same seriousness you would manage software supply chain provenance.

Sampling inverts the normal control flow: the server sends a sampling/createMessage request to the client asking the client to perform an LLM inference on the server's behalf. This is the mechanism enabling "agentic" MCP servers that need to reason or plan without owning their own LLM. It is also a serious trust hazard, because a malicious server can use the client's compute quota and, more dangerously, exfiltrate user data through the prompt content of the sampling request. The spec mandates explicit user consent for sampling and limits server visibility into prompts; it does not eliminate the attack surface, only manages it. Building on MCP and using sampling commits you to a per-sampling-request approval flow that most production systems handle poorly.

Tasks (introduced in the 2025-11-25 spec) are the durable async primitive. A tasks/create request returns a task ID; the client polls or subscribes for status. This was the missing primitive for long-running agentic operations and is the right substrate for the kind of multi-day tasks Factory.ai's Missions and OpenAI's Symphony coordinate. Without tasks, every long-running operation needs custom polling logic; with tasks, MCP servers expose long-running operations as first-class entities with status, progress, and cancellation built in.

The authorization story locks in OAuth 2.1 with PKCE, plus RFC 9728 (Protected Resource Metadata) discovery. The discovery flow is precise: the client receives a 401, parses the WWW-Authenticate header for a resource_metadata URL, fetches a metadata document listing the authorization servers, and proceeds with PKCE-protected authorization. The mandatory resource parameter (RFC 8707 Resource Indicators) ensures tokens are audience-bound to a specific MCP server; the server MUST validate this and MUST NOT pass through tokens to upstream APIs (a confused-deputy attack). The trust architecture is "server untrusted by default": the host application is the trust boundary, and every tool call goes through user consent. If you build on MCP, you inherit this model. It is well-designed for hostile-server scenarios but adds friction to deployments where the servers are first-party and the consent flow is overhead.

A2A: tasks as the agent-to-agent unit

Where MCP is the tool layer, A2A is the agent-to-agent layer. The Linux Foundation's v1.0 release in April 2026 settled the long-running absorption of IBM's ACP into Google-led A2A, and the 150+ organizations now shipping A2A endpoints establish it as the de facto standard for cross-organizational agent communication. The architectural commitment is to a small, opinionated set of primitives — Agent Card, Task, Message, Artifact — defined normatively in a spec/a2a.proto Protocol Buffer file from which all bindings (JSON-RPC, gRPC, REST+JSON) are generated. The proto-as-source-of-truth choice is non-trivial: it ensures that breaking changes propagate consistently across all bindings, and it makes the protocol amenable to typed code generation in any language that has a Protobuf compiler.

The Agent Card is the discovery primitive: a JSON document served at /.well-known/agent-card.json describing the agent's supportedInterfaces (URLs and protocols), securitySchemes, defaultInputModes and defaultOutputModes (MIME types), and skills (the named units of work the agent advertises). The signed-card mechanism in v1.0 is precise: the card is canonicalized via RFC 8785 (JSON Canonicalization Scheme), the signatures field is excluded from the signed content to avoid circular reference, and a JWS signature binds the card content to a specific signing key. Building on A2A commits you to card-based discovery — agents are introduced to each other by their published cards, and signed cards are how identity and capability are bound together cryptographically.

The Task is the fundamental stateful unit. A2A's task lifecycle is a strict state machine: SUBMITTED → WORKING → {COMPLETED, FAILED, CANCELED, REJECTED, INPUT_REQUIRED, AUTH_REQUIRED}, with the last two being interruption states. The AUTH_REQUIRED state implements a delegation pattern: when an agent mid-task discovers it needs an additional authorization (an OAuth token for a downstream API, a human approval before a destructive action), it transitions to AUTH_REQUIRED rather than failing outright. The client receives this state and may delegate the credential request upward to its own orchestrator. Chains of tasks blocked on AUTH_REQUIRED propagate credential requests across organizational boundaries to wherever they can be fulfilled. This is the formalization of "I need permission" into the protocol; it is more elegant than ad-hoc OAuth bouncing and it is a real architectural commitment that downstream systems should plan for.

Multi-tenancy in A2A is implemented via tenant-scoped task IDs. Every operation on a specific task accepts an optional tenant parameter, and the server-side LIST /tasks endpoint MUST scope results to the authenticated caller's authorization boundaries regardless of whether a tenant filter is specified. The Anbiaee threat model (arXiv 2602.11327) calls out task ID hijacking as a high-severity risk that the multi-tenancy mechanism specifically addresses: an attacker who knows a task ID in a multi-tenant deployment can intercept status or artifacts unless tenant-scoped access control is enforced.

The Part / Artifact distinction is small but architecturally meaningful. A Part is the atomic content unit (text, raw bytes, URL reference, or arbitrary JSON data); an Artifact is a task output containing parts; a Message is a communication turn containing parts. The protocol explicitly distinguishes "Messages are for communication and clarification; Artifacts are for delivering task results." Mixing them creates semantic debt downstream. This distinction matters because it constrains how you model your application: clarification dialogs should be Messages; deliverables should be Artifacts; and conflating them produces systems where it's unclear which content is for the user and which is for the next stage of the pipeline.

AP2: agent-initiated commerce, properly bounded

AP2 was donated to the FIDO Alliance in April 2026 after a year of incubation at Google. It exists to solve a problem that grows acute as agent autonomy increases: how does an agent spend money on behalf of a human user with bounded authority and a non-repudiable audit trail?

The mechanism is a two-phase commit using Verifiable Digital Credentials (VDCs). A VDC is a JWS structure with a header (algorithm, key reference), a payload (credential claims including agentId, paymentMandateId, amount, currency, merchantId, expiry, nonce), and a signature over the header.payload concatenation. VDCs have two lifecycle states: Open (issued but not yet redeemed; can be presented multiple times for authorization checks) and Closed (redeemed; nonce burned; replay impossible).

The two phases work like this. Phase one: a human or orchestrating agent issues a Checkout Mandate — a VDC encoding a spending envelope — maximum amount, permitted merchants, validity window, allowed product categories. The mandate is in Open state and bound to a specific agent's identity. Phase two: when a sub-agent needs to make an actual purchase within that envelope, it generates a Payment Mandate VDC bound to the parent Checkout Mandate's ID, transitions the parent to Closed (or partially consumed, depending on the mandate type), and presents both VDCs to the payment processor. The payment processor verifies that the Payment Mandate amount is within the Checkout Mandate's remaining envelope, that the merchant is permitted, that the validity window is current, and that the agent's signature chain is valid. Double-spending is prevented because the parent Closed locks out duplicate mandates; over-spending is prevented because the Payment Mandate amount must be ≤ the Checkout Mandate remainder.

The FIDO Alliance integration provides the hardware-rooted identity substrate: FIDO2 passkeys serve as the signing keys for VDCs, binding payment authorization to hardware-backed credentials. An agent's payment authority derives from a cryptographic assertion rooted in the human user's FIDO authenticator, not a server-side secret that could be exfiltrated through prompt injection. This is the right shape for agent commerce because it pushes the bottom of the trust chain to a hardware token that the agent fundamentally cannot access. An agent that has been compromised by a prompt injection attack still cannot exceed the spending envelope its principal authorized through their authenticator.

Record No. 013 · AP2 · two-phase commit · bounded authority delegation

two-phase commit · the VDC chain is rooted in the human's FIDO authenticator · agents cannot exceed the spending envelope even if compromised by prompt injection · the pattern generalizes to any high-blast-radius action

Building on AP2 locks in this two-phase, hardware-rooted structure. It is the right structure, and it should generalize beyond payments to any high-blast-radius agent action: the same Checkout Mandate / Payment Mandate pattern works for "delete database records" (envelope: which records, how many), "send communications" (envelope: which recipients, what topic, what tone), or "modify infrastructure" (envelope: which resources, what kinds of changes). The pattern AP2 encodes is bounded authority delegation with audit. That pattern is going to be the standard shape of agent action governance, and AP2 is its first widely-deployed instance.

The L9 gap: where transport ends and meaning begins

The Cisco-authored "Layered Protocol Architecture for the Internet of Agents" (arXiv 2511.19699) makes a useful structural argument: just as the OSI stack distinguishes physical, network, and application layers, the agent stack should distinguish what the paper calls L7 (tool / context — MCP), L8 (agent-to-agent communication — A2A), and L9 (semantic / pragmatic — open). The L9 layer — what concepts agents agree on — is where the actual hard problem lives, and as of 2026 it is unsolved.

The most rigorous treatment of the L9 gap is Yuan et al.'s "Beyond Message Passing: A Semantic View of Agent Communication Protocols" (arXiv 2604.02369), which compares 18 agent communication protocols across nine assessment dimensions in three layers: communication, syntactic, semantic. Their finding, stated bluntly: communication and syntactic layers are comparatively mature; semantic support remains sparse and uneven. Of the 18 protocols, only A2A, Agora, and PXP provide any explicit support for clarification, context alignment, or verification. The others leave these responsibilities to application-level logic, which means each application redoes the work, badly.

The four forms of "hidden technical debt" Yuan et al. identify are worth absorbing because they are the open research problems the next eighteen months will address. Ecosystem fragmentation debt is the cost of multiple overlapping protocols with incompatible transport assumptions, discovery mechanisms, identity models, and lifecycle semantics — every additional protocol multiplies integration surface area. Incomplete session management debt is the failure of current protocol lifecycles to model multi-hour async dialogues with intermittent participants and speculative parallel negotiation. Semantic ambiguity debt is the introduction of stochasticity into the act of signaling intent: classical protocols used formal performatives (KQML/FIPA-ACL) with deterministic semantics, while modern systems generate natural-language or structured-JSON responses that can misclassify intent under ambiguous prompts, producing cascading state-machine errors. Ontological grounding debt is the absence of shared knowledge frameworks, so agents apply different conceptual models to identical facts and produce mis-coordination across long-horizon tasks.

The two academic proposals to close the L9 gap are worth knowing. Agora (arXiv 2410.11905, Oxford) is a meta-protocol where agents agree on standardized routines for frequent interactions, natural language for rare ones, and LLM-written routines for everything in between — the ratio shifts toward standardized routines as interaction frequency stabilizes. The mechanism is a Protocol Document (PD) that captures the negotiated structured exchange. ANP (arXiv 2508.00007) puts JSON-LD semantic graphs at the foundation, with @context and @type annotations that reference globally resolvable vocabularies (schema.org, domain ontologies). A schema:SearchAction in one agent's capabilities means the same thing as in another's, regardless of implementation. Both proposals are research-stage; neither has yet displaced the prompt-buried clarification logic that production systems use today. The bet — and this is one of the bets I'll defend in Part VIII — is that something with these properties (formal performatives, JSON-LD-style ontology grounding, dynamic protocol negotiation) becomes the L9 standard within twelve months.

Identity: six approaches, one stack about to win

If L9 is the open layer above transport, identity is the unresolved cross-cutting concern. Six approaches compete for the role of "what is this agent and what is it allowed to do":

A2A's signed Agent Cards bind capabilities cryptographically to an issuer organization. AGNTCY's Decentralized Identifier (DID) plus Sigstore approach is more open and standards-rooted, with capability schemas in OASF and a Kademlia-based directory service. NANDA's AgentFacts use CRDTs for sub-second revocation and privacy-preserving discovery. Microsoft Entra Agent ID treats agents as a new class of service principal in Azure AD, with consent flows and lifecycle governance built into existing IAM. MCP's OAuth 2.1 + RFC 9728 model is the de facto authorization story for tool access. The IETF WIMSE working group is standardizing SPIFFE-style workload identity for non-human entities.

The bet — and it is now a strong one — is that the production stack settles on A2A signed Agent Cards for capability declaration + AGNTCY DIDs for cross-organizational identity + AP2 mandates for action authorization, with Microsoft Entra Agent ID as the enterprise IAM bridge that makes the whole thing operate inside corporate networks. NIST's AI Agent Standards Initiative, launched February 2026, is the venue where this synthesis becomes formal, and the political signs are good for a NIST blessing in late 2026. The stack is not yet locked in; teams building today should choose the options most likely to compose with the eventual settlement, which means: prefer A2A signed cards for capability advertisement, prefer DID-rooted identity for cross-organizational scenarios, prefer AP2-style mandate chains for any action with non-trivial blast radius. Avoid one-off identity schemes that don't compose with these.

the new operating reality

The 82:1 machine-to-human identity ratio cited at RSA Conference 2026 is not a fluke statistic; it is the new operating reality. Every agent invocation creates a new identity that must be authenticated, authorized, and audited. The systems that handle this scale gracefully are the ones with first-class agent-identity primitives; the ones that try to retrofit it onto user-identity systems will fail. Identity is now infrastructure, not configuration.

Communication efficiency: the protocol substrate beneath the protocol

There is a final layer of the protocol stack that gets less attention because it is below the application protocols, but it dominates production cost: the efficiency of inter-agent and inter-tool communication. By 2026 several ideas have matured into real components.

The clearest formal result is Parakhin's Token Coherence (arXiv 2603.15183, March 2026), which adapts CPU MESI cache-coherence semantics to multi-agent state synchronization. The core observation: naive multi-agent orchestration suffers a triply-multiplicative token overhead. Under full-state rebroadcast, when any shared artifact is modified, the orchestrator injects the complete artifact into the next prompt of every subscribing agent. Total token cost is C_broadcast = n × S × Σ_j |D_j| where n is agent count, S is step count, and |D_j| is artifact size. For 5 agents, 50 steps, 3 artifacts of 8K tokens each, that's 6.14 million tokens of broadcast cost. The MESI adaptation treats artifacts as cache lines with four states (Modified, Exclusive, Shared, Invalid), enforces the SWMR invariant via TLA+-verified state transitions, and reduces total token cost to C_coherent ≤ n × |D| + W × n × |D| where W is total write operations. Empirically, the savings range from 81% at high churn (write volatility 0.9) to 95% at planning workloads (volatility 0.05).

Two complementary KV-cache mechanisms address the same overhead at the inference layer. TokenDance (arXiv 2604.03143) recognizes that multi-agent frameworks like GenerativeAgents organize execution in synchronized All-Gather rounds where the same shared output blocks appear in every agent's prompt at different absolute positions. Existing prefix caching fails because the offsets differ; per-request Position-Independent Caching pays the full RoPE rotation cost separately for each of n agents on the same shared content. TokenDance's KV Collector groups compatible requests, applies one shared batched RoPE call across all agents, performs one batched key-difference pass on the configured check layer, and refreshes only the divergent positions per request. Diff-Aware Storage stores agent caches as block-sparse diffs against a Master cache. Empirically: 11–17× compression of secondary agent caches, 2.7× more concurrent agents than vLLM with prefix caching at the same SLO. KVComm (arXiv 2510.12872, NeurIPS 2025) handles the sequential pipeline case where the prefix diverges between agents: rather than transferring natural-language messages or full hidden states, it transfers KV-cache deltas with estimated position offsets, achieving 7.8× speedup over standard prefill in five-agent fully-connected settings.

These mechanisms — Token Coherence, TokenDance, KVComm, KVFlow's workflow-aware prefix caching — together form a middleware layer between the application protocols (MCP, A2A) and the inference engines. They are the agent-stack equivalent of TCP optimizations underneath HTTP. They are not yet a unified product category in 2026, but they will be by mid-2027, because the cost savings are too large for any production team running multi-agent workloads at scale to forgo.

What the protocol stack locks in

Pulling the protocol thread together, the production-default 2026 stack looks like this:

Record No. 014 · the production-default 2026 protocol stack

the 2026 settlement at L7 (MCP) and L8 (A2A v1.0) · L9 is open · identity stack convergent on A2A+AGNTCY+AP2 · efficiency middleware forming as a product category

What this stack locks in:

Tools are typed, sandboxed, and governed by user consent. If you build on MCP, every tool call goes through a consent boundary. This adds friction, but it is the right friction; teams that try to bypass it produce systems that fail in surprising ways.

Agent-to-agent communication is task-shaped and card-discovered. If you build on A2A, your agents are entities with lifecycle, capability, and identity, and tasks are first-class state machines. Naive request-response patterns do not survive in this model.

Action authority is bounded by mandate chains. AP2's pattern of issuer-mandate-with-redemption generalizes to any high-blast-radius operation. Plan for it now, because the regulatory pressure (especially in finance and healthcare) will require it within 12-24 months.

Identity is first-class infrastructure, not configuration. The 82:1 machine-to-human ratio is not going down. Treat agent identity as a primary engineering concern.

Semantic interoperability is still your problem. The protocol stack does not solve L9. If you cooperate with agents from other organizations, you will need to handle ontology mismatch, performative ambiguity, and context-alignment debt at the application layer until L9 standards mature. Plan for it.

Communication efficiency requires middleware. The naive cost of multi-agent traffic is unsustainable; the middleware layer (Token Coherence, KV-cache sharing) is where the savings live.

The protocol stack is not yet finished, but it is enough to build on. The L9 gap is the only place where production systems should expect substantial protocol churn over the next 12 months. For everything else, the right move is to choose the existing standards even where they are imperfect, because the cost of bespoke alternatives compounds quickly.

The next layer above protocols is the control plane — the set of components that govern, observe, and steer agent behavior at runtime. That is the subject of Part VI.

PART

§ 06 · governance, runtime, steering

The control plane is now a real layer

defense-in-depth · capability propagation · activation monitoring

Through the first five parts of this essay, we have built up the application architecture (Parts II–IV) and the protocol stack (Part V). What remains is the question of governance — how the system enforces policy, observes behavior, and intervenes when something goes wrong. By 2024 the answer was usually "guardrails": a single LLM-based filter sitting between user and model, classifying inputs and outputs against a fixed taxonomy of harm. By 2026 this answer is recognized as inadequate, and the production-default architecture is something much closer to a real control plane — a stack of components, each operating at a different layer, each with distinct mechanism and trust model, composing into a defense-in-depth posture. This part walks through the control plane layer by layer, with attention to mechanism rather than vendor.

CaMeL: the security guarantee that comes from architecture, not training

The most consequential single result on agent security in the past eighteen months is CaMeL (arXiv 2503.18813), from a team at Google DeepMind, ETH Zurich, and Google. The paper's title is "Defeating Prompt Injections by Design" and the title is precise: where every prior prompt-injection defense was probabilistic — train the model to resist, prompt the model to be careful, classify outputs to catch — CaMeL is structural. The security property follows from program semantics, not from model alignment.

The mechanism is a dual-LLM architecture with capability-based information flow control. A Privileged LLM (P-LLM) receives only the user's original query — verbatim, never any external data — and produces a pseudo-Python control flow program that specifies the sequence of tool calls needed to accomplish the task. Because the P-LLM never sees any retrieved external content (no web pages, no email bodies, no file contents), its output cannot be corrupted by injected text in those sources. Control-flow hijacking — the canonical scenario where a document tells the agent "ignore previous instructions, do X instead" — is structurally impossible because the document is never in the P-LLM's context.

A second model, the Quarantined LLM (Q-LLM), handles all interaction with untrusted external data. It extracts structured values from unstructured text: parsing email bodies for recipient addresses, extracting dollar figures from documents, classifying intent from raw user input. The Q-LLM has no tool-calling capability. It cannot act. It can only produce structured outputs that flow back into the interpreter as data values. Even if the Q-LLM were fully compromised by adversarial content, all it could produce is structured data — and that data is then constrained by the capability model.

The capability model is the load-bearing piece. Every value in the CaMeL execution environment carries capabilities: metadata tags indicating the value's origin and the set of permissible uses. A phone number extracted from the trusted user query carries full capabilities — usable as an argument to any tool. A phone number extracted by the Q-LLM from an untrusted webpage carries restricted capabilities: it cannot be used as a recipient for outgoing communications unless an explicit security policy permits it. A custom Python interpreter executes the P-LLM's control-flow program step-by-step; before each tool call, it performs a capability check against the security policies. If a tool call would send an email to an address extracted from an untrusted source — a textbook data exfiltration pattern — the interpreter blocks the call.

Record No. 015 · CaMeL · provable noninterference by design

CaMeL dual-LLM architecture · the P-LLM never sees untrusted data, so injection cannot affect control flow · capability tags propagate through derivations and are checked before every tool call · security from program semantics, not training

This is information flow control at the value level — taint tracking at the application layer. The security property is noninterference: the behavior of the control flow is independent of the content of untrusted data channels. On the AgentDojo benchmark — realistic agentic tasks with embedded prompt injection attacks — an undefended GPT-4-based agent completes 84% of tasks; CaMeL completes 77% with provable security. The seven-percentage-point gap represents tasks where CaMeL's policies correctly prevented an action that the agent would have taken in the undefended case, but where the blocked action was the intended one. The policy was too conservative for those specific tasks; tuning is required. The crucial property is the "provable" qualifier: for the 77% of tasks CaMeL completes, the system provides a formal guarantee that no prompt injection embedded in any retrieved data could have caused the agent to deviate from the user's original intent. This is not a stochastic defense; it is a proof by construction.

the architectural lesson

Security properties that follow from program semantics survive against attacks the designers did not anticipate; security properties that follow from training do not. A model trained to resist injection attacks resists attacks similar to its training distribution; against novel attacks, it offers only the resistance that the underlying capability allows. A system structured so that injection attacks cannot affect control flow — because untrusted data is never in the control-flow LLM's context — resists every injection attack, including ones that haven't been invented yet.

Instruction hierarchy: trained, not prompted

Where CaMeL solves the prompt-injection problem structurally, OpenAI's IH-Challenge (arXiv 2603.10521) solves the related but distinct problem of instruction hierarchy: when multiple principals provide instructions at different trust levels — system prompt, developer instructions, user instructions, tool/environment outputs — how does the model resolve conflicts? IH failure is the mechanism underlying most agentic prompt injection attacks: a malicious tool output instructs the model to ignore its system prompt, and the model complies.

The training challenge is harder than it looks because of three confounds. A model that correctly refuses a malicious instruction embedded in a document might be penalized if that refusal is indistinguishable from failing a legitimate instruction. Many real instruction conflicts are subtle, requiring careful reasoning about priority rather than simple overriding. And models can learn to "refuse everything from tool outputs" as a shortcut for IH robustness, which destroys helpfulness. The IH-Challenge dataset addresses these confounds by online adversarial generation: rather than constructing a static dataset of conflict scenarios, the system generates adversarial examples dynamically during training against the current model checkpoint. As the model improves, harder examples are generated, maintaining a challenging signal throughout. The dataset includes both "hard positives" (legitimate lower-trust instructions that should be followed because there is no actual conflict — testing that the model isn't just refusing everything below system level) and "hard negatives" (plausible-looking lower-trust instructions that should be overridden).

94.1%

IH robustness (post-train)

up from 84.1% baseline
across 16 benchmarks

0.7%

unsafe behavior rate

down from 6.6%
10× reduction

∅

helpfulness regression

model learned discrimination
not refusal

Fine-tuning GPT-4o-mini on IH-Challenge improved IH robustness from 84.1% to 94.1% averaged across 16 benchmarks spanning in-distribution scenarios, out-of-distribution scenarios, and human red-teaming. Unsafe behavior — cases where the model followed a malicious instruction from a lower-trust source — dropped from 6.6% to 0.7%, a roughly tenfold reduction. Crucially, helpfulness on standard safety evaluations did not degrade: the model learned to discriminate genuine conflicts from false ones rather than defaulting to refusal. The system also "saturates" an internal static agentic prompt injection evaluation — all test cases are correctly handled — while maintaining capability regression within acceptable bounds.

The architectural implication: instruction hierarchy is a model property that should be trained in, not bolted on through prompts. Models without trained IH robustness will be susceptible to indirect injection in agentic settings; models with trained IH robustness can resist a wide range of attacks at the model layer, freeing the runtime layer to focus on the harder cases where the structural separation is needed.

Path-aware policy: governance as a function of trajectory, not action

Below the model layer sits the runtime layer — the set of components that mediate between the agent's intentions and the world. The 2026 papers on this layer converge on a structural principle that is not yet widely understood: governance should be a function of the agent's trajectory, not of individual actions in isolation.

Kaptein, Khan & Podstavnychy's "Runtime Governance for AI Agents: Policies on Paths" (arXiv 2603.16586, March 2026) makes the argument formally. A compliance policy, in their formalization, is a deterministic function π_i: (agent identity, partial path τ_t^{1:j-1}, proposed next action a_t, shared state s) → violation probability. The crucial argument is that this function structure subsumes both prompt instructions and static access control as degenerate special cases. Prompt instructions are not actually instances of the policy function at all — they shape the distribution over paths but cannot evaluate whether any specific path constitutes a violation. Static access control is the degenerate case where the policy ignores the partial path and shared state, using only agent identity and action type — which handles only violations detectable from a single action without context. The data-exfiltration scenario (read sensitive data, then send it externally — neither action alone is forbidden) is invisible to static access control because the sequence of actions is what makes it a violation, not any individual one.

The path-aware framing maps cleanly onto EU AI Act compliance obligations. Article 9 (risk management) requires continuous evaluation of accumulated risk against a violation budget — exactly the structure of V(τ_t) ≤ B. Article 12 (automatic logging) requires records that include not just what the agent did but what the governance system decided, including the policy version in force at the time. Article 14 (human oversight) requires the ability to intervene meaningfully — pausing execution, injecting a compliance hint, requesting human approval, and resuming when resolved. Article 15 (robustness against adversarial manipulation) requires runtime measures that catch attacks that design-time measures cannot anticipate. The path-aware policy framework is essentially the architectural blueprint for AI Act compliance.

The operational pattern is a Policy Decision Point (PDP) consulted by the orchestrator before every high-blast-radius action. The PDP consults the policy library, evaluates against the partial path and shared state, and returns one of: allow, deny, steer (require human approval, inject a compliance hint, or modify the action). The orchestrator routes accordingly. AgentSpec (arXiv 2503.18666, ICSE 2026) provides the practical DSL for this pattern: rules are three-tuples of (Trigger, Predicates, Enforcement), with triggers being event types (state_change, action, agent_finish), predicates being Boolean-valued contextual conditions (is_destructive_cmd, is_fragile_object, obstacle_distance_leq), and enforcement being a sequence of actions (block, require confirmation, trigger self-reflection, modify parameters, log). On code agents, AgentSpec rules prevent over 90% of unsafe executions; on embodied agents, 100% hazardous-action elimination; on autonomous vehicles, 100% law compliance. Computational overhead is milliseconds per action — orders of magnitude below LLM-based guards, enabling fine-grained per-action enforcement that would be impractical with model-as-judge approaches.

MI9 (arXiv 2508.03858, Barclays) generalizes this into an integrated framework with six coordinated components: an Agency-Risk Index that assigns each agent a quantitative governance tier; an Agentic Telemetry Schema that extends OpenTelemetry with governance-semantic abstractions; Continuous Authorization Monitoring that re-evaluates permissions as the agent's behavior evolves; a Conformance Engine that enforces temporal patterns over multi-step sequences via finite-state-machine matching; goal-conditioned drift detection that compares behavior against goal-specific baselines rather than aggregate statistics; and graduated containment with four levels (monitoring, planning restriction, tool restriction, isolation) calibrated to preserve operational value while preventing harm. The MI9 architecture is the most operationally complete description in the 2026 literature of what enterprise agent governance looks like. Notably, MI9 was developed at Barclays, which is exactly the kind of regulated environment where ad-hoc agent deployment will not survive compliance review.

Activation-level monitoring and steering: the model internals as a control surface

Above the runtime layer and below the model layer is a domain that 2025–26 made unexpectedly real: activation-level steering and monitoring. The landmark publication is Beaglehole, Radhakrishnan, Boix-Adserà & Belkin's "Toward Universal Steering and Monitoring of AI Models" in Science, February 2026 (arXiv 2502.03708). The team uses Recursive Feature Machines — kernel ridge regression with feature-learning kernels — to extract linear concept representations from model activations. For each transformer block, they compute an Average Gradient Outer Product matrix from the trained RFM; the top eigenvector of this matrix is the concept direction at that block. This works on language models, vision-language models, and reasoning models, with concept representations that are transferable across human languages — a concept extracted from English prompts steers behavior on Hindi, French, and Japanese inputs identically. This is strong evidence that the representations are genuinely semantic rather than surface-level.

Steering is then additive perturbation: h_l ← h_l + α · v_l v_l^⊤ h_l, where α is a control coefficient. Positive α steers toward the concept; negative away. Multi-concept steering is a linear combination of concept vectors. The most operationally significant claim is the monitoring application: linear classifiers trained on concept-vector projections of internal activations consistently outperform LLM judges across hallucination and toxicity benchmarks. A small Llama-3.1-8B activation probe outperforms GPT-4o-as-judge for detecting hallucinations on FAVABENCH, HaluEval, PubMedQA, and RAGTruth. The mechanistic explanation: by the time a model has committed to generating a hallucinated token, the "hallucination intent" is already present in internal activation space — potentially detectable before it is expressed in the output. An output judge sees the expressed hallucination; an activation probe sees the incipient one.

CREST (arXiv 2512.24574, December 2025) applies the same family of techniques specifically to reasoning-model agent loops. The paper identifies cognitive heads — attention heads whose activations correlate with non-linear cognitive behaviors (verification, backtracking, overthinking, underthinking) — by training a linear probe per head on a few hundred labeled reasoning traces. A head-specific steering vector is constructed as the mean activation across non-linear steps, denoised by projection onto the top-100 principal components of the activation covariance. At inference, the activation at each cognitive head is rotated to suppress the cognitive direction. The intervention is adaptive: if the current step is already linear, the edit is negligible; if the step shows strong non-linear patterns (excessive backtracking), the edit suppresses them. Empirically, +17.5% accuracy improvement and −37.6% token reduction across diverse reasoning benchmarks. Many non-linear reasoning steps are redundant — the model is revisiting reasoning it has already completed, or talking itself out of correct intermediate conclusions through second-guessing. Suppressing these patterns is both faster and more accurate.

The architectural consequence: activation-level monitoring and steering is a real production-grade control surface in 2026. It is not a curiosity. Practical deployments — particularly for high-stakes pre-execution detection (catching agentic misalignment before harmful actions), CoT-faithfulness improvement, sycophancy mitigation, and reasoning-budget control — should treat it as a deployable lever for narrowly-scoped problems. The honest qualifier is that activation steering is narrow: it reliably changes refusal, sentiment, formality, and selected reasoning behaviors; it does not reliably alter factual recall or complex reasoning quality. The 2026 practitioner field guides draw the boundary: use activation steering for the things it does well, and don't expect it to solve general capability problems.

The cross-agent attack surface and what still doesn't work

The hardest open problem on the security side of the control plane is cross-agent prompt injection — attacks that exploit the interaction between agents rather than any individual agent's vulnerability. The canonical paper is "Conjunctive Prompt Attacks in Multi-Agent LLM Systems" (arXiv 2604.16543, ACL 2026 Main Conference). The attack composes two separately-benign components: a trigger key in the user query and a hidden adversarial template in a compromised remote agent. Neither component is malicious in isolation. When routing brings them together — the user query is routed through the orchestrator to the compromised remote agent, which then injects the template into the orchestrator's context on its return — the conjunction activates harmful behavior. Existing defenses (PromptGuard, Llama Guard, tool restrictions) examine one component at a time and fail, because no single component is malicious.

OMNI-LEAK (arXiv 2602.13477) demonstrates the orchestrator pattern itself as an attack vector: a single indirect prompt injection compromises several sub-agents to leak sensitive data even with data access controls in place, by exploiting the trust hierarchy and context propagation in the orchestrator pattern. The Gray Swan / AISI public competition on indirect prompt injection (arXiv 2603.15714, February 2026) recruited 464 participants who submitted 272K attack attempts against 13 frontier models; they produced 8,648 successful attacks across 41 scenarios. All 13 frontier models proved vulnerable, with success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). Universal attacks transferred across 21 of 41 behaviors. The competition introduced a concealment objective — attacks must leave no trace in the user-facing response — shifting the bar from "did it work" to "did the user notice." The fact that the bar can now be raised to "concealment" tells you that ordinary detection has been substantially solved; the remaining work is harder.

X-Teaming (arXiv 2504.13203, COLM 2025) demonstrates that multi-agent coordination defeats safety mechanisms designed against single-turn attacks. The X-Teaming attack uses a multi-agent architecture itself — a planner, an attacker, and a verifier — to construct multi-turn jailbreaks that achieve 98.1% success rate across open-weight and closed-source models, including 96.2% against Claude 3.7 Sonnet, which was previously near-immune to single-turn attacks. The lesson: every safety property that relies on bounded interaction count is vulnerable to multi-turn extensions, and the multi-turn extension is easy to automate.

the open frontier

The defensive responses — CaMeL, IH-Challenge, AgentSpec, MI9 — are all model- or runtime-level. None of them yet target the cross-agent class directly. The genealogy-graph governance layer in the "From Spark to Fire" cascading-failure paper (arXiv 2603.04474) gestures at the right shape — message-layer plugin tracking provenance of every claim through the agent graph — but is not yet a deployable system. The bet for the next 12 months is that someone (probably extending CaMeL with cross-agent flow tracking) ships a working defense for the cross-agent class. Until that happens, any multi-agent system processing untrusted external content has open exposure that single-agent defenses cannot close.

The defense-in-depth picture

Stack the layers we've discussed:

Record No. 016 · the eight-layer control plane · defense-in-depth

eight layers · each addresses a different threat class with a different mechanism · the orchestrator is the integration point that calls into every layer · skip any layer and you fail at the corresponding threat class

Each layer addresses a different threat class with a different mechanism. Layer 4 (CaMeL) provides provable security against prompt injection in single-agent settings; Layer 3 (path-aware policy) catches multi-step violations that are invisible to per-action checks; Layer 5 (IH-Challenge) provides model-level robustness; Layer 6 (activation monitoring) provides pre-execution detection of misalignment intent; Layer 7 (sandbox + Crab) provides isolation and durable recovery; Layer 1 (identity) provides accountability and bounded authority; Layer 2 (guardrails) provides the catchall for content-level filtering. None of these layers is sufficient alone; together they form the defense-in-depth posture that production systems in 2026 actually need.

The key 2026 insight is that the orchestrator is the integration point for the control plane. It evaluates policy on the partial path before dispatching, calls into guardrails for input/output checks, dispatches to sandboxed sub-agents, monitors activation probes when available, surfaces high-blast-radius decisions to humans, and emits OTel spans throughout. The control plane is articulated; the open work is making it interoperable — which is what AARM (Autonomous Action Runtime Management, arXiv 2602.09433), the open vendor-neutral spec for action-interception runtime, and MI9, the integrated Barclays framework, are racing to provide.

If the protocol stack is the constitution of the agent ecosystem

…the control plane is its operating government. The 2026 production-default expectation is that all eight layers exist in some form. Systems that skip layers will fail at the corresponding threat class; systems that skip the orchestrator-as-integration-point will fail at the composition.

We have now built the whole stack in pieces: information flow (Part II), write authority and verification (Part III), orchestration (Part IV), protocols (Part V), control plane (Part VI). The next part puts it all together as a reference architecture you should build today.

VII

PART

§ 07 · the system you should build today

A reference architecture, defended

seven components · seven defended choices

Putting the previous five parts together: here is the system you should build today, with defended choices for each component. The point of this section is not to advertise a particular framework. It is to argue that the design space has narrowed considerably, and that a team building from scratch in May 2026 with reasonable engineering judgment will end up at something close to the architecture described below, regardless of which framework they pick. If your design materially deviates from this, you should have a strong argument for why your workload is unusual.

The seven components of a working multi-agent system

The reference architecture has seven components, each with a defended choice and a defended boundary:

Record No. 017 · the reference architecture · May 2026

seven components · single writer per task slot · externalized state · asymmetric verifier · HITL gate at high-blast-radius decisions · cross-cutting identity, observability, activation monitoring

Live · information flow through the stack

TRACE A REQUEST THROUGH THE SEVEN COMPONENTS

① ENTRYissue created → durable record

② ORCHESTRATORparse → decompose → dispatch workers

③ POLICYevaluate path → allow / deny / steer

④ SANDBOXisolate → checkpoint → single writer

⑤ WORKERMCP tools · code exec · write artifacts

⑥ MEMORYT2 working · T3 scratchpad · T4 persist

⑦ VERIFIERrefute-or-promote · compile · test

HITL GATEhuman review if high-blast-radius

RESPONSEartifact delivered

①

Entry point

Issue tracker · cron · API · webhook · NOT chat. The trigger should create a durable record of what was requested, separate from the agent's execution.

②

Orchestrator

Thin scheduling layer over external state. LangGraph (graph-as-code) · Symphony (issue-tracker-as-control-plane) · OpenAI Agents SDK v2 (model-native harness). Not AutoGen, not Swarm.

③

Policy Decision Point

Path-aware, deterministic, version-controlled. AgentSpec for breadth · CaMeL for high-stakes provable security · MI9 for regulated domains. Prompts shape distributions; they do not enforce.

④

Per-task sandbox

Managed service for most teams (Daytona, E2B, Modal, Runloop, Northflank). Crab-style turn-aligned semantic checkpointing at the OS layer. App-layer durability + OS-layer durability — both, not either.

⑤

Worker LLM

Right-sized per role: cheap planner + frontier coder + cheap-different verifier. Prefer code execution over JSON tool calls. MCP for tools, A2A for agent-to-agent, AP2 for commerce-shaped actions.

⑥

Memory layer

Four tiers, structured retrieval, MemAgent for long context. Mem0/Letta for managed persistent. MAGMA-family for structured causal reasoning. Memori for token economics. Not full-context dump.

⑦

Independent verifier

Refute-or-Promote with empirical-validation gate. Adversarial framing, not balanced evaluation. Context asymmetry essential. Cross-model critic for the most important cases. Skipping this is the most damaging shortcut.

＊

Cross-cutting

Identity at every boundary (A2A signed cards · AP2 mandates · Microsoft Entra Agent ID). OTel-compatible structured spans through every layer. Activation-level pre-execution monitoring when available.

What this architecture is not

It is worth saying explicitly what the reference architecture does not include, and why:

No "swarm of equals" topology. Per Part III, this pattern is structurally fragile under correlated errors and concurrent writes. It is absent from every successful production system I have surveyed. If your design has it, replace it with map-reduce-and-manage.

No multi-agent debate as the primary mechanism. Debate works only with anti-conformity prompting, trajectory scoring, and memory masking — all three together. Most production debate implementations have none of them and are fragile. Use a Refute-or-Promote pipeline instead.

No orchestrator-as-chat-participant. The orchestrator is a scheduler. If your orchestrator's context is growing linearly with task size, you have made it a participant, and the architecture will not scale.

No bespoke protocols for tool access, agent communication, or agent payments. Use MCP, A2A, AP2. The cost of bespoke compounds quickly and the standards are good enough.

No reliance on prompt-level governance for high-stakes decisions. Path-aware policy + capability-based execution + non-LLM ground truth. Prompts shape distributions; they do not enforce constraints.

No skipping of the verifier. This is the single most common shortcut and the most damaging.

When to use single agent · multi-agent · neither

A final clarifying point on architecture selection:

SINGLE

when

dense reasoning coupling · multi-hop · verification overhead would dominate · 15× premium intolerable

MULTI

when

decomposable into low-coupling sub-problems · centralized verification · agents heterogeneous · explicit protocols enforced

NEITHER

when

deterministic workflow with one LLM call at the right point · most "AI applications" should be this · over-MAS'ing is endemic

Default cascade · escalation ladder

DEFAULT

Deterministic workflow

One careful LLM call at the right point. No agents needed.

most "AI applications" should be this

↓ escalate when reasoning is needed

ESCALATION 1

Single agent

One agent with managed context, tools, and memory.

dense reasoning · multi-hop · tight coupling

↓ escalate only when task structure demands it

ESCALATION 2

Multi-agent system

Single writer + asymmetric verifiers. Map-reduce-manage.

decomposable · heterogeneous · centralized verification

most teams are over-MAS'ing — the right baseline is simpler than current fashion suggests

VIII

PART

§ 08 · forecasts · may 2026 → may 2027

Bets on the next twelve months

six staked forecasts · with reasoning and probabilities of being wrong

Reading two hundred papers and engineering writeups on multi-agent LLM systems is a useful exercise but only an exercise unless it leads to predictions you can be wrong about. Here are six concrete forecasts about where the field will go between May 2026 and May 2027, with reasoning. Each is staked seriously enough that I would notice being wrong.

BET 01

Cross-agent prompt injection becomes a defended problem within twelve months

The current state, established in Part VI, is that no production defense yet targets the cross-agent class directly. CaMeL solves the single-agent case structurally. IH-Challenge solves it at the model layer. AgentSpec catches it at the runtime layer when the path can be enumerated. None of these compose cleanly to handle Conjunctive Prompt Attacks or OMNI-LEAK-style multi-subagent compromises.

The bet is that this gets solved within twelve months, and the shape of the solution is CaMeL extended with cross-agent flow tracking. The mechanism is that capability tags propagate across agent boundaries: when Agent A receives data with capability C and forwards a function of that data to Agent B, the forwarded value carries a derived capability that is at most as permissive as C. The control-flow LLMs of all agents in a pipeline operate in a federated trust model where the cumulative capability of a value is tracked through the entire graph of agent invocations. Building this requires standardizing the capability propagation rules across frameworks — likely as an extension to A2A or as a new layer in the protocol stack — and producing a cross-agent variant of AgentDojo as the benchmark.

30%

Probability I am wrong: 30%. The technical pieces are in place; the missing element is integration work plus a benchmark to measure against. If a benchmark like CrossAgentDojo lands by Q3 2026 and a working defense ships against it by Q1 2027, this bet wins.

BET 02

Agent identity standardizes on A2A + AGNTCY + AP2 with NIST blessing

Six approaches compete today. The political and technical signs strongly favor a settlement on A2A signed Agent Cards for capability declaration, AGNTCY DIDs (or did:wba) for cross-organizational identity, and AP2 mandate chains for action authorization, with Microsoft Entra Agent ID as the enterprise IAM bridge. NIST's AI Agent Standards Initiative, launched February 2026 with three pillars (industry-led standards, open-source protocol development, agent security research), is the venue where this becomes formal. The FIDO Alliance's adoption of AP2 in April 2026 is the precedent: Google's protocol moves into a formal standards body for productization.

The bet is that NIST publishes a recommendation by late 2026 or early 2027 that endorses some configuration of this stack as the federal default for high-risk agentic systems. Once NIST blesses, regulated industries (finance, healthcare, government) move quickly because the alternative is bespoke compliance arguments. The downstream effect: by Q2 2027, identity in production multi-agent systems is no longer a research question; it is configuration.

35%

Probability I am wrong: 35%. The political risk is that NIST takes longer than expected, or that one of the smaller standards efforts (Microsoft's AAIF, IETF WIMSE) wins more share than expected. The bet is on the dominant configuration, not on the timing being exactly twelve months.

BET 03

Constitutional governance for agent economies emerges in regulated domains

The "AgentCity" thread in the latest 2026 literature — separation of powers, distributional safety constraints, agent-economy governance — is currently academic. By mid-2027, the bet is that at least one regulated domain (most likely finance, possibly healthcare) deploys a serious constitutional framework for agent governance, with separation between agents that can decide, agents that can audit, and agents that can intervene; with audit chains that satisfy regulatory requirements (EU AI Act, FINRA, HIPAA); and with explicit distributional safety constraints that go beyond per-agent compliance.

The mechanism: Barclays-style frameworks (MI9), extended with the AgentCity-style separation-of-powers patterns and Soft-Label Governance for distributional constraints, become the de facto standard for regulated agent deployment. This is the pattern the EU AI Act is forcing into existence; Article 9, 12, 14, and 15 obligations are most cleanly satisfied by exactly this kind of architecture. The bet is that the obvious solution gets deployed because regulatory pressure is strong enough to overcome the implementation cost.

40%

Probability I am wrong: 40%. Regulated industries move slowly; the implementation cost is non-trivial; and the EU AI Act's high-risk obligations only become enforceable in August 2026, so 12 months may be too aggressive. But the architectural shape is right and someone will be first.

BET 04

Single-writer + asymmetric verifier becomes received wisdom

This one is closer to "this has already happened in production but the academic literature hasn't fully caught up." Cognition's two essays, Anthropic's research-feature post-mortem, Factory.ai's Missions architecture, and OpenAI Symphony all converge on the same architectural commitment: one writer per task slot, multiple verifiers with asymmetric context. AgentAuditor's ACPO and the Refute-or-Promote pipeline both formalize the verification half. The MAST taxonomy classifies the failure modes that come from violating this principle.

The bet is that by mid-2027, the "swarm of equals" pattern stops appearing in serious frameworks except as a marketing term. The frameworks that survive are the ones that make the single-writer architecture the default; the ones that retain "GroupChat with multi-agent debate" as a primary primitive lose share to the ones that don't. AutoGen's sunset is the precedent — the first major framework to be deliberately deprecated for being on the wrong side of architectural evolution. By 2027, more will follow, and the academic literature will have caught up: papers that propose new "society of mind" architectures will be received the way papers proposing new sorting algorithms are received in 2026 — politely, but without much enthusiasm.

20%

Probability I am wrong: 20%. This bet is safer than the others because the empirical case is overwhelming; the only risk is the inertia of frameworks that have already been built around the wrong abstractions.

BET 05

Self-improving agents plateau because the verifier is the bottleneck

The Darwin Gödel Machine line of work is impressive — DGM lifts SWE-bench from 20% to 50% by self-modification; Hyperagents adds metacognitive self-modification that transfers across domains. The architectural elegance of self-referential improvement is real. But the bet is that practical self-improvement plateaus in the next twelve months, and the reason is the verifier bottleneck.

DGM works because SWE-bench has a deterministic ground-truth verifier (test execution). Hyperagents works because the meta-level improvements (persistent memory, performance tracking) are themselves verifiable. The moment self-improvement is applied to a domain without strong verification — open-ended research, creative work, ethical decision-making — the improvement loop breaks because the system cannot tell whether its modifications are improvements or regressions. The verifier becomes the bottleneck, and verifier capability does not improve at the same rate as task capability.

The corollary: practical self-improvement in 2026–27 remains "prompt + skill library + memory," not weight modification. AgentFactory's executable Python sub-agent accumulation, SAGE's curriculum-driven self-evolution, A-MEM's memory evolution, MemAgent-style Multi-Conv RL on agent loops — these mechanisms produce systems that get better over time within their domains, with less of the safety risk of weight modification. The DGM/Hyperagents line continues as research but does not become production-default.

35%

Probability I am wrong: 35%. There is real possibility that someone solves the verifier-bottleneck problem in a clever way — perhaps by training task-specific verifiers as part of the same self-improvement loop, perhaps by composing weak verifiers into strong ones via the AgentAuditor pattern. If that happens, weight-level self-improvement could become production-viable faster than I expect.

BET 06

An "agent middleware" layer becomes a distinct product category

Token Coherence, TokenDance, KVComm, KVFlow, PASTE, Speculative Actions, Hive — the efficiency mechanisms are individually impressive but currently disconnected. By mid-2027, the bet is that a distinct product category emerges, sitting between application protocols (MCP, A2A) and inference engines, providing the agent-specific middleware: cache coherence, KV sharing, speculative execution, workflow-aware caching, batch optimization across agents in a workflow.

The economic case is overwhelming. Anthropic's 15× token premium is recoverable, but only with deliberate engineering — Token Coherence saves 81–95%, TokenDance enables 2.7× more concurrent agents, KVComm gives 7.8× speedup on sequential pipelines, Hive's logits cache adds another 1.5×. Stacked, these mechanisms could plausibly bring multi-agent token economics within striking distance of single-agent — at which point the architectural choice becomes free of the cost penalty that currently constrains it.

The middleware layer will look like the agent stack analog of TCP optimizations underneath HTTP, or the database connection pooling that sits underneath every web framework. It will be invisible to most application developers but dominate the cost structure of any serious deployment. The companies that build the middleware layer (and there will be a small handful of them) capture meaningful share of the agent infrastructure stack.

25%

Probability I am wrong: 25%. This bet is safer than several of the others because the technical components are already published, the economic case is decisive, and the development time is short. The main risk is that the inference engines themselves (vLLM, SGLang, TensorRT-LLM) absorb the optimizations directly, in which case there is no separate middleware layer, just an upgraded inference engine. Either way the result for users is the same; the question is just where the value accrues.

Scorecard · six bets at a glance

Cross-agent injection defense

70%

Agent identity standardizes

65%

Constitutional governance in regulated domains

60%

Single-writer + verifier becomes received wisdom

80%

Self-improvement plateaus at verifier bottleneck

65%

Agent middleware becomes a product category

75%

CONFIDENCE = 100% − P(WRONG) · MEAN CONFIDENCE: 69%

A meta-bet: 2027 looks more like infrastructure work than capability work

The bets above share a structural commonality. They are not bets about smarter models, novel agent architectures, or emergent behaviors. They are bets about infrastructure, standards, and architectural maturity. The exciting work in the next twelve months is not at the application layer; it is at the substrate. Protocol standardization, identity federation, defense-in-depth integration, efficiency middleware, durable execution, structured memory.

This is the right shape for a maturing field. The Linux kernel's most important year was not the year it added a new feature; it was the year it consolidated process scheduling, memory management, and the VFS layer into something other developers could rely on. The agent stack is having that year now. The 2024–25 period was the era of bold application architectures and impressive demos; 2026–27 is the era of the substrate solidifying.

The implication for an engineer building today: spend less time on novel agent topologies and more time on the boring stack underneath. The framework choice matters less than whether you have separated harness from runtime. The model choice matters less than whether you have managed memory as a tiered resource. The orchestrator pattern matters less than whether you have a path-aware policy decision point. The bet is that, twelve months from now, the systems that won will have been the ones that took the substrate seriously and the application architecture as a consequence of it, not the other way around.

∎ closing · what to do on monday

What to do on Monday

The operational summary fits on one card:

Context is the master variable. Four tiers, ALARA, structured retrieval. Not "fill the window."

Single writer per task slot. Map-reduce-manage. Demote the orchestrator to a scheduler.

Always verify, always asymmetrically. Refute-or-Promote + empirical gate. Skipping this is the most damaging shortcut.

Harness ≠ runtime. Durability, memory, identity, policy live in the runtime — not in prompts.

Use the protocol stack. MCP for tools, A2A for agents, AP2 for commerce. Do not invent bespoke.

Build defense-in-depth. Identity at every boundary, path-aware policy, sandbox + durable checkpointing, OTel spans, activation monitoring.

Default to less. Deterministic workflow → single agent → multi-agent. Most teams are over-MAS'ing.

The architectural design space has narrowed.

The right choices are increasingly clear. The work that matters in 2026 is not "should I build a multi-agent system" but "given that I am building an information system, what is the right shape of information flow, write authority, verification, and governance?" — and the answers to those questions are the substance of what gets built.

The agents will continue to get smarter. That is mostly somebody else's problem. Your job is to build the system around them well enough that smart agents and dumb agents both produce reliable, observable, governable behavior. That is the engineering that pays in 2026.

The Shape of a WorkingMulti-Agent System

The reframing: a category error

The framing

Context is the master variable

The four tiers, made concrete

MemAgent, and why memory is a learnable agent skill

A-MEM, MAGMA, and the structured-memory family

The synthesis: memory as a tiered, learned, structured resource

Tier 4 in detail: the enterprise context graph

Write authority and the verification trick

Why concurrent writers fail

Why majority voting is an echo chamber

The context-asymmetry trick

Debate, when it works

The synthesis

If you remember one sentence from Part III

The orchestrator as scheduler, not chatter

The latency math (D3X)

Symphony: the issue tracker as control plane

Factory.ai Missions: externalized state at scale

Crab: durable execution as the OS layer

Harness/runtime separation as the central architectural commitment

What the orchestrator should and should not do

The protocol stack and what it locks in

MCP: tools as a typed, sandboxed surface

A2A: tasks as the agent-to-agent unit

AP2: agent-initiated commerce, properly bounded

The L9 gap: where transport ends and meaning begins

Identity: six approaches, one stack about to win

Communication efficiency: the protocol substrate beneath the protocol

What the protocol stack locks in

The control plane is now a real layer

CaMeL: the security guarantee that comes from architecture, not training

Instruction hierarchy: trained, not prompted

Path-aware policy: governance as a function of trajectory, not action

Activation-level monitoring and steering: the model internals as a control surface

The cross-agent attack surface and what still doesn't work

The defense-in-depth picture

If the protocol stack is the constitution of the agent ecosystem

A reference architecture, defended

The seven components of a working multi-agent system

What this architecture is not

When to use single agent · multi-agent · neither

Bets on the next twelve months

Cross-agent prompt injection becomes a defended problem within twelve months

Agent identity standardizes on A2A + AGNTCY + AP2 with NIST blessing

Constitutional governance for agent economies emerges in regulated domains

Single-writer + asymmetric verifier becomes received wisdom

Self-improving agents plateau because the verifier is the bottleneck

An "agent middleware" layer becomes a distinct product category

A meta-bet: 2027 looks more like infrastructure work than capability work

What to do on Monday

The architectural design space has narrowed.

The Shape of a Working
Multi-Agent System