← SOHAM SHAH
You’re reading a previous revision. The latest is v2 · 05.31.
2026-04-12  ·  Blas Labs  ·  research

Specialized Subagent Architecture
for Freyja

From generic delegation to context-aware multi-agent coordination — how Freyja decomposes, delegates, persists, and trains.

architecture multi-agent training april 2026
[Interactive figure: orchestrator → subagent coordination]

Why generic subagents fail

The default multi-agent pattern is seductively simple: spawn N identical children, collect their results, continue. Every production system starts here. Every one discovers the same six failure modes.

Context pollution

Parent window fills with noise

The parent's 1M token context window becomes saturated with raw research output from children. By the time the orchestrator needs to act, its own reasoning is buried under megabytes of subagent artifacts.

Truncation data loss

268k chars in, 2k chars out

Eight subagents produce 268k characters of findings. The truncator sees one giant blob and preserves only the first 2k characters — which means only sub_1's output survives. Seven subagents' work is silently discarded.

No specialization

Same model, same tools, same prompt

All children use the same frontier model, the same tool set, the same system prompt. An exploration task gets the same 40-tool context as a code-writing task. No tool-set specialization, no thinking-effort tuning.

No communication

Siblings are isolated

Subagents cannot share intermediate findings. If sub_2 discovers an API endpoint that sub_5 needs, the information must route through the parent — creating an orchestrator bottleneck and adding turns of latency.

No learning

Delegation decisions evaporate

The orchestrator's choice to delegate a task as "explore" rather than "code" is a structured signal about task decomposition. But it is never captured in training data. Delegation strategy cannot improve over time.

Coordination overhead

O(N) wait, O(N) context

The parent blocks on wait_all, accumulating results linearly. Eight parallel subagents that each take 30 seconds still require the parent to process all eight results serially, creating a serialization bottleneck.

[Figure: The truncation crisis: 8 subagents, 1 survivor. Eight subagents (sub_1 through sub_8, 33.5k chars each) feed wait_all, which concatenates them linearly into one 268,000-char blob. The truncator cuts the blob to 2,000 chars (0.75% retained), so only sub_1 at position 0 survives and sub_2 through sub_8 are silently lost. The truncator does not know about subagent boundaries; it sees one continuous string.]
wait_all concatenates all results into a single blob. the truncator preserves only the first 2k characters. positional bias determines which subagent survives.

The fundamental insight

Generic delegation is a lossy compression scheme disguised as parallelism. You pay for 8 subagents' compute but retain only 1 subagent's output. The fix is not better truncation — it is structural: persist each subagent's artifacts independently, return structured indices instead of raw text, and specialize agents so each one does less but does it better.
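
To make the arithmetic concrete, here is a minimal reproduction of the blob-and-truncate failure. The helper names (`make_outputs`, `blob_and_truncate`, `survivors`) are hypothetical and only mirror the numbers quoted above; this is not Freyja's truncator.

```python
# Illustrative sketch only: helper names are hypothetical, sizes come from the text.
TRUNCATE_LIMIT = 2_000  # characters the truncator keeps


def make_outputs(n: int = 8, size: int = 33_500) -> dict[str, str]:
    """Build n fake subagent outputs, each exactly `size` characters long."""
    outputs = {}
    for i in range(1, n + 1):
        header = f"[sub_{i}] "
        outputs[f"sub_{i}"] = header + "x" * (size - len(header))
    return outputs


def blob_and_truncate(outputs: dict[str, str], limit: int = TRUNCATE_LIMIT) -> str:
    """Old flow: concatenate every result, then keep only the first `limit` chars."""
    return "".join(outputs.values())[:limit]


def survivors(truncated: str, outputs: dict[str, str]) -> list[str]:
    """Names of subagents that still have any text in the truncated blob."""
    return [name for name in outputs if f"[{name}]" in truncated]


outputs = make_outputs()
total = sum(len(text) for text in outputs.values())
kept = blob_and_truncate(outputs)
print(total, len(kept), survivors(kept, outputs))
print(f"{len(kept) / total:.2%} retained")
```

Because the blob has no internal boundaries, which subagent survives depends only on concatenation order.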


What informed the design

Four sources shaped our architecture. Three are Anthropic's own research on multi-agent coordination. The fourth is the Harbor framework for agent training. Together they provide both the coordination theory and the training infrastructure.

Anthropic: "Building Multi-Agent Systems"[A1]

The foundational blog post that established context-centric decomposition as the primary design principle. The key insight: you decompose tasks not by domain or by complexity, but by context requirements. Two subtasks that need different tool sets should be different agents, even if they are in the same domain.

Specialization by tool set is the strongest lever. An agent with 5 focused tools outperforms an agent with 40 general tools, even on tasks that could technically be solved with either configuration. Tool-set specialization reduces attention dilution, shrinks the system prompt, and makes the agent's behavior more predictable.

The blog introduces three decomposition heuristics: (1) Does the subtask need different tools? → new agent. (2) Does it need a different context window? → new agent. (3) Does it need a different trust boundary? → new agent. We use all three.
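
The three heuristics can be read as a single predicate. The sketch below is one possible encoding, not something taken from the blog post: the `ContextNeeds` type, its field names, and the "tools not a subset of the parent's" reading of "different tools" are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextNeeds:
    """Hypothetical description of what a task needs in order to run."""
    tools: frozenset[str]
    context_window: str   # label for the working context, e.g. "repo-survey"
    trust_boundary: str   # e.g. "local", "network"


def needs_new_agent(parent: ContextNeeds, subtask: ContextNeeds) -> bool:
    """True if any of the three heuristics says the subtask should be delegated."""
    different_tools = not subtask.tools <= parent.tools          # heuristic 1
    different_context = subtask.context_window != parent.context_window  # heuristic 2
    different_trust = subtask.trust_boundary != parent.trust_boundary    # heuristic 3
    return different_tools or different_context or different_trust


parent = ContextNeeds(frozenset({"read_file", "grep"}), "repo-survey", "local")
follow_up = ContextNeeds(frozenset({"grep"}), "repo-survey", "local")
writer = ContextNeeds(frozenset({"write_file", "bash"}), "repo-survey", "local")
print(needs_new_agent(parent, follow_up), needs_new_agent(parent, writer))
```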

Anthropic: "Multi-Agent Coordination Patterns"[A2]

Catalogues five coordination patterns with decision criteria for when to use each:

| Pattern | When to use | Freyja mapping |
| --- | --- | --- |
| Generator-Verifier | One agent produces, another validates. High-stakes outputs. | code + verify agent types |
| Orchestrator-Subagent | Dynamic decomposition. Parent decides task split at runtime. | Our primary pattern. Parent delegates via sub_agent tool. |
| Agent Teams | Persistent specialists with ongoing roles. | Phase 2: warm agent pools. |
| Message Bus | Loosely coupled agents reacting to events. | SessionMessageBus (current: append-only log). |
| Shared State | All agents read/write to a common workspace. | Artifact directory: ~/.freyja/sessions/{id}/artifacts/ |

The critical guidance: start with orchestrator-subagent, add other patterns only when needed. Over-engineering coordination is the most common multi-agent failure mode. We followed this advice.

Anthropic: "Seeing Like an Agent"[A3]

The UX research that established two design constraints for our agent type system:

Progressive disclosure. Don't front-load the full tool catalog into every agent. Start with a minimal set and expand only when the agent demonstrates it needs more. This directly motivates our tool_include / tool_exclude mechanism in the AgentType dataclass — each type gets exactly the tools it needs, no more.

The ~20 tool sweet spot. Anthropic's research found that agent performance degrades noticeably above ~20 tools in the context. Below 20, adding tools helps. Above 20, each additional tool dilutes attention across all tools, reducing the likelihood of the agent choosing the right one. In the registry shown later in this post, explore whitelists 7 tools, code 7, and verify 6. None exceeds 20.

Harbor Framework: Agent Training Infrastructure[H1]

Harbor provides the training pipeline that makes our delegation decisions improvable over time. Three components matter:

ATIF v1.6 trajectory format — A structured JSON format for recording agent trajectories. Each trajectory includes actions, observations, and metadata. We extend this with subagent_trajectory_ref.extra to record the agent_type used for each delegation, creating a structured signal for training.

RewardKit — A framework for defining per-task reward functions. We use it to define per-type rewards: explore agents are rewarded for coverage (% of relevant files found), code agents for test pass rate, verify agents for bugs detected per false positive.

22+ agent adapters — Harbor supports adapters for different agent frameworks. Freyja as a Harbor adapter means our subagent trajectories can be directly consumed by Harbor's training pipeline, enabling RL fine-tuning of delegation strategy.

Key research finding

Tool-set specialization > prompt specialization > domain specialization. Giving an agent different tools has a stronger effect than giving it a different system prompt, which in turn has a stronger effect than restricting it to a domain. This ordering determined our AgentType design: tool_include/exclude is the primary lever, system_prompt is secondary, and domain is not encoded at all.


Declarative specialization

Each agent type is a dataclass that fully describes the model configuration, tool constraints, and behavioral prompt for a specialization. The registry is a simple dictionary. Adding a new type is a single dict entry — the system prompt, tool injection, and training pipeline all auto-update.

The AgentType dataclass

Six fields define a complete agent specialization. The design is deliberately minimal — every field earns its place by affecting runtime behavior or training signals.

MODEL

model + thinking_effort

Which model to use and how much extended thinking to allocate. explore-fast uses Haiku with no extended thinking. verify uses Sonnet with high thinking. Cost per token varies 20x across types.

TOOL SET

tool_include / tool_exclude

Whitelist or blacklist of tools. explore gets read_file, search, web_search. code gets write_file, bash, run_tests. The strongest specialization lever we have.

BEHAVIOR

system_prompt + max_iterations

A focused prompt that defines the agent's role and constraints. verify: "Find bugs, do not fix them." explore: "Survey broadly, summarize findings." Max iterations prevent runaway agents.

from dataclasses import dataclass

# EXPLORE_PROMPT, EXPLORE_FAST_PROMPT, CODE_PROMPT, and VERIFY_PROMPT are
# role-specific prompt strings defined elsewhere in the codebase.

@dataclass
class AgentType:
    model: str                          # "claude-sonnet-4-20250514"
    thinking_effort: str | None         # "low" | "medium" | "high" | None
    tool_include: list[str] | None      # whitelist (None = all tools)
    tool_exclude: list[str] | None      # blacklist (None = no exclusions)
    system_prompt: str                  # role-specific instructions
    max_iterations: int                 # hard stop on agent turns

# Registry: one entry per specialization, keyed by the name passed to sub_agent(type=...).
AGENT_TYPES: dict[str, AgentType] = {
    "general":      AgentType("claude-sonnet-4-20250514", "medium", None, None,
                              "You are a general-purpose assistant.", 25),
    "explore":      AgentType("claude-sonnet-4-20250514", "low",
                              ["read_file", "search", "grep", "glob", "web_search",
                               "list_directory", "publish_finding"], None,
                              EXPLORE_PROMPT, 30),
    "explore-fast": AgentType("claude-haiku-3-5", None,
                              ["read_file", "search", "grep", "glob"], None,
                              EXPLORE_FAST_PROMPT, 15),
    "code":         AgentType("claude-sonnet-4-20250514", "high",
                              ["read_file", "write_file", "edit_file", "bash",
                               "run_tests", "search", "grep"], None,
                              CODE_PROMPT, 40),
    "verify":       AgentType("claude-sonnet-4-20250514", "high",
                              ["read_file", "search", "grep", "bash", "run_tests",
                               "publish_finding"], None,
                              VERIFY_PROMPT, 20),
}

| Type | Model | Thinking | Tools | Philosophy | Max iter |
| --- | --- | --- | --- | --- | --- |
| general | Sonnet | medium | all (unrestricted) | Fallback. When no specialization fits. | 25 |
| explore | Sonnet | low | read, search, grep, glob, web, publish | Survey broadly. Cover ground. Never write files. | 30 |
| explore-fast | Haiku | none | read, search, grep, glob | Cheap, fast reconnaissance. No web, no publish. | 15 |
| code | Sonnet | high | read, write, edit, bash, tests, search, grep | Implement. Test. Iterate. High thinking budget. | 40 |
| verify | Sonnet | high | read, search, grep, bash, tests, publish | Find bugs. Don't fix them. Report findings. | 20 |

1. Why one tool with a type parameter, not five separate tools

We considered exposing sub_agent_explore, sub_agent_code, etc. as separate tools. But this violates progressive disclosure: five delegation tools in the system prompt create choice paralysis. A single sub_agent(type="explore") tool with a type parameter is cleaner: the model makes one delegation decision, not two (whether to delegate + how to delegate).

2. Zero-touch extensibility

Adding a new agent type requires exactly one change: a new entry in the AGENT_TYPES dictionary. The system prompt generator auto-discovers available types and renders them into the parent's prompt. The training pipeline auto-captures the new type in ATIF exports. The tool's schema auto-updates its enum values. No plumbing, no coordination, no deployment.
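
One way the "single dict entry" property could work is to derive both the tool schema's enum and the parent's prompt section from the registry keys. The sketch below uses a simplified stand-in registry (name to one-line description); `build_sub_agent_tool`, `render_type_catalog`, and the schema layout are hypothetical, not Freyja's actual generator.

```python
def build_sub_agent_tool(agent_types: dict[str, str]) -> dict:
    """Build a JSON-Schema-style tool definition whose enum tracks the registry."""
    return {
        "name": "sub_agent",
        "description": "Delegate a task to a specialized child agent.",
        "input_schema": {
            "type": "object",
            "properties": {
                "type": {"type": "string", "enum": list(agent_types)},
                "task": {"type": "string"},
            },
            "required": ["type", "task"],
        },
    }


def render_type_catalog(agent_types: dict[str, str]) -> str:
    """Render the available types for inclusion in the parent's system prompt."""
    return "\n".join(f"- {name}: {summary}" for name, summary in agent_types.items())


# Stand-in registry: in this sketch each type maps to a one-line description.
registry = {
    "explore": "Survey broadly, summarize findings.",
    "verify": "Find bugs, do not fix them.",
}
registry["db-audit"] = "Hypothetical new type added with one dict entry."

tool = build_sub_agent_tool(registry)
print(tool["input_schema"]["properties"]["type"]["enum"])
print(render_type_catalog(registry))
```

Because both outputs are computed from the same dictionary, a new entry appears in the enum and the prompt without further changes.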


Solving the truncation crisis

The root cause of data loss is structural, not algorithmic: wait_all concatenates results into one blob, and the truncator sees one giant string with no internal boundaries. The fix is equally structural: each subagent writes its own artifact to disk, and wait_all returns a JSON index instead of raw text.

[Figure: Old flow vs. new flow. Old flow (blob + truncate + lose): 8 subagents at 33.5k chars each are concatenated into one 268k-char blob and truncated to 2k chars; only sub_1 survives, and no recovery is possible because the data was never persisted. New flow (persist + index + read on demand): each subagent writes a .md artifact on completion (sub_1.md through sub_8.md, 33.5k chars each) under ~/.freyja/sessions/{id}/artifacts/; wait_all returns a JSON index of about 800 chars, e.g. {sub_1: {summary: "...", path: "artifacts/sub_1.md"}, sub_2: {summary: "...", path: "artifacts/sub_2.md"}, ...}; the parent reads the index and calls read_file(sub_3.md) on demand, per agent.]
old: 268k chars in context, 2k survives compaction. new: 800 chars in context, 268k recoverable on disk.
proactive persistence inverts the truncation problem: instead of losing 99.25% of output, the parent holds a lightweight index and reads individual artifacts on demand.

Each subagent writes its complete output to ~/.freyja/sessions/{session_id}/artifacts/{sub_id}.md before returning. The wait_all collator then constructs a JSON index with per-agent entries: a 2-3 sentence summary, the artifact file path, the agent type used, and completion status.

The parent's context now contains ~800 characters of structured index rather than ~268,000 characters of raw text. When the parent needs details from a specific subagent, it calls read_file on the artifact path. This is compaction-resilient: even after aggressive context compaction, the SUMMARY_PROMPT preserves artifact paths in its mandatory "files referenced" section, and the footer extraction logic specifically looks for path patterns.
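
The persist-then-index flow can be sketched in a few lines. Everything here is illustrative: `persist_artifact`, `finish_subagent`, and `build_index` are hypothetical names, the index keys follow the four fields listed above, and a temporary directory stands in for the session's artifact directory.

```python
import json
import tempfile
from pathlib import Path


def persist_artifact(artifact_dir: Path, sub_id: str, output: str) -> Path:
    """Write a subagent's complete output to {artifact_dir}/{sub_id}.md."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    path = artifact_dir / f"{sub_id}.md"
    path.write_text(output, encoding="utf-8")
    return path


def finish_subagent(artifact_dir: Path, sub_id: str, agent_type: str,
                    summary: str, output: str) -> dict:
    """Persist first, then return only lightweight metadata to the collator."""
    path = persist_artifact(artifact_dir, sub_id, output)
    return {"sub_id": sub_id, "summary": summary, "path": str(path),
            "agent_type": agent_type, "status": "done"}


def build_index(records: list[dict]) -> str:
    """Collate per-agent records into the JSON index handed to the parent."""
    keys = ("summary", "path", "agent_type", "status")
    return json.dumps({r["sub_id"]: {k: r[k] for k in keys} for r in records})


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp) / "artifacts"
    records = [
        finish_subagent(artifacts, f"sub_{i}", "explore",
                        f"Findings of sub_{i}.", "x" * 33_500)
        for i in range(1, 9)
    ]
    index_json = build_index(records)
    index = json.loads(index_json)
    # The parent reads one artifact on demand instead of holding all eight.
    detail = Path(index["sub_3"]["path"]).read_text(encoding="utf-8")
    print(len(index_json), "chars of index;", len(detail), "chars read on demand")
```

The index size in this demo depends on the temporary path length, so it will not match the ~800-character figure exactly.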

Compaction resilience

The JSON index format is deliberately designed to survive context compaction. Artifact paths are absolute (~/.freyja/sessions/abc123/artifacts/sub_3.md) and self-describing. Even if the compactor summarizes the index, it will preserve the paths because they look like actionable file references, not stale data. The compactor's mandatory "files referenced" section ensures paths survive even aggressive summarization.


From isolation to collaboration

The pure orchestrator-subagent pattern has a fundamental bottleneck: all information flows through the parent. If sub_2 discovers a critical API endpoint that sub_5 needs, the finding must: (1) travel up from sub_2 to parent, (2) be processed by parent, (3) travel down from parent to sub_5 in a new delegation. This adds turns of latency and consumes parent context window capacity.

The SessionMessageBus solves this with an append-only shared log. Any subagent can publish a finding; any sibling can read all published findings. The parent is no longer the sole information relay.

[Interactive figure: SessionMessageBus architecture]

Two tools are injected into every child subagent: publish_finding and read_findings. Publishing appends to the log; reading accepts an optional since_index parameter for cursor-based pagination and a topic filter for targeted reads.

import time
from typing import Literal

# `bus` (the session's SessionMessageBus instance) and `Message` are defined elsewhere.

# publish_finding tool
def publish_finding(
    topic: Literal["findings", "errors", "progress"],
    content: str,
    agent_id: str    # auto-injected, not user-provided
) -> dict:
    """Publish a finding to the shared message bus."""
    return bus.append(Message(
        topic=topic,
        content=content,
        agent_id=agent_id,
        timestamp=time.time(),
        index=bus.next_index()   # position in the append-only log
    ))

# read_findings tool
def read_findings(
    topic: str | None = None,    # filter by topic, or None for all
    since_index: int = 0         # cursor-based pagination
) -> list[Message]:
    """Read findings from siblings. Use since_index to avoid re-reading."""
    return bus.read(topic=topic, since_index=since_index)

Architecture decision: append-only log vs. per-subscriber queues

We considered two designs. asyncio.Queue per subscriber: each subagent gets its own queue, messages are fanned out at publish time. Clean async semantics, but O(N) memory per message and complex lifecycle management when agents die mid-task. Append-only log with cursors: one shared list, each reader tracks its own position. Simpler, O(1) memory per message, naturally handles agent death (no orphaned queues).

We chose the log. The 500-message cap prevents unbounded growth. Cursor-based reads mean agents only process new messages, not the full history. Topic filtering reduces noise without separate channels. The design is essentially a minimal Kafka — append-only, consumer offsets, topic partitioning — but in 80 lines of Python.

Future: system message injection

The current bus requires agents to actively poll with read_findings. The next evolution is passive notification: inject recent findings into the subagent's system message at each turn start. The agent sees "SIBLING FINDINGS SINCE LAST TURN: [...]" without needing to call a tool. This reduces the turns-to-discovery from 2-3 (agent decides to read, calls read_findings, processes results) to 0 (findings appear in context automatically).
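
The injection step could be as simple as rendering unseen findings into a text block before each turn. The function below is a hypothetical sketch: findings are plain dicts with `index`, `agent_id`, and `content` keys, and only the "SIBLING FINDINGS SINCE LAST TURN" header comes from the description above.

```python
def render_sibling_findings(findings: list[dict], cursor: int) -> tuple[str, int]:
    """Return (text to prepend to the system message, updated cursor)."""
    unseen = [f for f in findings if f["index"] >= cursor]
    if not unseen:
        return "", cursor  # nothing new: inject nothing
    lines = [f"- [{f['agent_id']}] {f['content']}" for f in unseen]
    block = "SIBLING FINDINGS SINCE LAST TURN:\n" + "\n".join(lines)
    return block, unseen[-1]["index"] + 1


log = [
    {"index": 0, "agent_id": "sub_2", "content": "GET /api/users exists"},
    {"index": 1, "agent_id": "sub_5", "content": "auth uses JWT"},
]
block, cursor = render_sibling_findings(log, cursor=0)
print(block)
```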


Capturing delegation decisions in ATIF

Every time the orchestrator calls sub_agent(type="explore"), it makes a structured decision about task decomposition. That decision — which type for which task — is a training signal. If we capture it in the right format, we can use it to improve delegation strategy over time via reinforcement learning.

[Figure: Training data flow, delegation to ATIF. sub_agent(type="explore") → session_spawned event emitted (agent_type: explore) → SessionSnapshot captures full trajectory → v3 export (JSON format, ATIF-ready) → ATIF v1.6 trajectory, whose subagent_trajectory_ref.extra carries agent_type: "explore", artifact_path: "sub_1.md", model: "sonnet".]
delegation decisions are captured as structured metadata in ATIF trajectories, enabling per-type reward optimization.

The subagent_trajectory_ref.extra field in ATIF v1.6 now includes three pieces of metadata: the agent_type used, the artifact_path where results were persisted, and the model that was used. This creates a structured signal that links task descriptions to delegation strategies to outcomes.

Per-type reward functions

Different agent types have different success criteria. Harbor's RewardKit lets us define per-type reward functions that capture these distinctions:

| Agent type | Primary reward | Measurement | Anti-reward |
| --- | --- | --- | --- |
| explore | Coverage | % of relevant files/APIs discovered vs. ground truth | Redundant file reads (same file read 3+ times) |
| explore-fast | Speed-weighted coverage | Coverage / wall-clock-seconds | Exceeding 15-iteration budget |
| code | Test pass rate | % of tests passing after changes | Introducing regressions (tests that passed before, fail after) |
| verify | Bug detection precision | Real bugs found / total issues reported | False positives (reported issues that aren't bugs) |
| general | Task completion | Binary: did the parent's overall task succeed? | Using more iterations than a specialized type would |

The training loop

Harbor + RewardKit closes the loop from delegation to improvement. ATIF trajectories with agent_type metadata flow into Harbor's training pipeline. RewardKit scores each trajectory with the appropriate per-type reward function. RL fine-tuning updates the orchestrator's delegation policy: which type to use for which task pattern. Over time, the orchestrator learns that "find all REST endpoints" should be explore, "implement the auth middleware" should be code, and "check for SQL injection" should be verify.


Where we are heading

The architecture evolves through five phases, each building on the previous. We are currently in Phase 1. Each subsequent phase is motivated by a concrete limitation of the current design.

Phase 1 — current

Orchestrator-subagent with specialized types

The parent orchestrator delegates via sub_agent(type=...). Each child runs in an isolated context with type-specific tools and prompts. Artifacts persist to disk. The message bus enables sibling communication. ATIF captures delegation decisions. This is the foundation — functional, shipping, collecting training data.

Phase 2 — next

Warm agent pools

Currently, every sub_agent call spawns a fresh agent that starts from scratch. Warm pools maintain a set of persistent workers across turns. An explore agent that mapped the codebase in turn 3 retains that understanding in turn 7 when a follow-up exploration is needed. This eliminates the cold-start cost of re-reading files that a previous instance already processed.

Implementation: a WorkerPool keyed by agent type, with LRU eviction when the pool exceeds a configurable size. Each worker retains its context window and artifact directory across invocations. The parent addresses workers by type, not by ID — "give me an explore worker" rather than "resume sub_3."
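
Since Phase 2 is not built yet, the following is only a sketch of what a type-keyed pool with LRU eviction might look like. The `Worker` fields and the `acquire` method are hypothetical, and a list of strings stands in for a retained context window.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Worker:
    agent_type: str
    context: list[str] = field(default_factory=list)  # stand-in for retained context


class WorkerPool:
    """At most one warm worker per agent type; least recently used is evicted."""

    def __init__(self, max_size: int = 3) -> None:
        self.max_size = max_size
        self._workers: OrderedDict[str, Worker] = OrderedDict()

    def acquire(self, agent_type: str) -> Worker:
        if agent_type in self._workers:
            self._workers.move_to_end(agent_type)  # mark as most recently used
            return self._workers[agent_type]
        worker = Worker(agent_type)
        self._workers[agent_type] = worker
        if len(self._workers) > self.max_size:
            self._workers.popitem(last=False)      # evict least recently used
        return worker

    def warm_types(self) -> list[str]:
        return list(self._workers)


pool = WorkerPool(max_size=2)
pool.acquire("explore").context.append("mapped src/auth in turn 3")
pool.acquire("code")
explorer = pool.acquire("explore")   # same worker, context retained
pool.acquire("verify")               # pool is full: "code" is evicted
print(explorer.context, pool.warm_types())
```

The parent asks for a worker by type, which matches the "give me an explore worker" addressing described above.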

Phase 3 — future

Event-driven message bus

The current message bus is pull-based: agents must actively call read_findings. Phase 3 makes it push-based: findings are injected into the agent's system message at each turn start. Additionally, the bus gains routing rules: "route all findings with topic 'api_endpoint' to agents of type 'code'" — declarative wiring that eliminates the orchestrator bottleneck for information flow.
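
Declarative routing could be expressed as data plus a small resolver. This is a speculative sketch of a Phase 3 feature: `RoutingRule`, `route`, and the agent table are hypothetical names, and the rule shown is the example from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    topic: str
    target_agent_type: str


def route(topic: str, rules: list[RoutingRule], agents: dict[str, str]) -> list[str]:
    """Return the ids of agents whose type is targeted by a rule for this topic.

    `agents` maps agent id to agent type, e.g. {"sub_3": "code"}.
    """
    targets = {r.target_agent_type for r in rules if r.topic == topic}
    return [agent_id for agent_id, agent_type in agents.items() if agent_type in targets]


rules = [RoutingRule(topic="api_endpoint", target_agent_type="code")]
agents = {"sub_1": "explore", "sub_3": "code", "sub_4": "verify", "sub_6": "code"}
print(route("api_endpoint", rules, agents))
```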

Phase 4 — vision

Dynamic type creation

The orchestrator can create new agent types at runtime by specifying a tool set and prompt. "I need an agent that only has access to the database tools and knows our schema conventions" — a new type is born for this session, used for delegation, and the type definition is captured in ATIF for potential promotion to a permanent type. The registry becomes a living, evolving catalog.

Phase 5 — long-term

Harbor integration: Freyja as a Harbor agent adapter

Freyja becomes one of Harbor's 22+ agent adapters. Freyja's subagent trajectories flow directly into Harbor's training pipeline without conversion. Harbor's cross-agent insights (patterns learned from other adapters' trajectories) flow back into Freyja's delegation policy. The training loop crosses framework boundaries.

[Figure: Evolution in five phases. Phase 1: specialized types, artifact persistence (done). Phase 2: warm pools, persistent workers (next). Phase 3: push-based bus, event routing. Phase 4: dynamic types, runtime creation. Phase 5: Harbor adapter, cross-framework. Each phase builds on the previous: no rewrites, only extensions.]
the architecture is designed for incremental evolution. phase 1 is complete. each subsequent phase adds capability without modifying existing code.

Decision framework

Knowing what to delegate is harder than knowing how to delegate. The following decision framework captures the patterns we've found effective and the anti-patterns we've learned to avoid.

When to use each agent type

explore

Survey, map, understand

"What does the auth module look like?" "Find all REST endpoints in the codebase." "Research how the payment flow works." Use when the task is about understanding, not changing. The agent reads broadly but writes nothing.

explore-fast

Quick reconnaissance

"Is there a Dockerfile in this repo?" "What test framework does this project use?" Narrow, factual questions where speed matters more than depth. Uses Haiku for cost-efficiency. No web search, no publishing — just fast local reads.

code

Implement, test, iterate

"Implement the JWT validation middleware." "Add unit tests for the user service." The agent has write access and test-running capability. High thinking effort because implementation requires careful reasoning. The most expensive type per invocation.

verify

Check, validate, report

"Run the test suite and report failures." "Check the auth flow for security vulnerabilities." The agent reads and runs tests but does not fix what it finds. This separation is deliberate — the generator-verifier pattern requires the verifier to be independent of the generator.

Foreground vs. background

| Dimension | Foreground (sync) | Background (async) |
| --- | --- | --- |
| Blocking | Parent waits for result before continuing | Parent continues working; checks results later |
| Use when | Result is needed for the next decision. Exploration that determines the plan. | Result is supplementary. Parallel work that improves quality but isn't blocking. |
| Context cost | Result enters parent context immediately | Result stays in artifact file until explicitly read |
| Example | explore("What auth mechanism does this repo use?") → determines which code type to spawn | verify("Run full test suite") → parent continues coding; reads results before PR |

Message bus vs. parent mediation

| Dimension | Message bus | Parent mediation |
| --- | --- | --- |
| Latency | Immediate. Sibling reads finding on next tool call. | 2+ turns. Up to parent, processed, down to sibling. |
| Use when | Siblings working in parallel on related tasks. Findings from one improve another. | Parent needs to make a judgment call. "Sub_2 found an issue; should I redirect sub_3?" |
| Context cost | Zero parent context. Findings stay in bus. | Full parent context. Findings transit through parent window. |
| Control | Decentralized. Siblings self-coordinate. | Centralized. Parent decides what to relay. |

Anti-patterns

1. Over-delegation

Delegating a task that takes the parent 3 tool calls to a subagent that takes 10 tool calls (cold start + re-read + actual work). The overhead of spawning exceeds the benefit. Rule of thumb: only delegate if the task would take the parent 5+ turns or requires a different tool set.

2. Tool-set bloat

Giving a specialized agent too many tools defeats the purpose of specialization. An explore agent with write_file access will occasionally write files, polluting the read-only invariant. Rule: start with the minimum tool set and add tools only when you have evidence the agent needs them.

3. Coordination overhead exceeding benefit

Three subagents communicating via message bus to solve a task that one agent could handle alone. The bus adds complexity without parallelism benefit. Rule: message bus is for independent parallel work with shared context needs. If agents are sequential (each waiting on the previous), use foreground delegation instead.

4. Type mismatch

Using explore for a task that requires writing files (explore has no write tools), or using code for pure research (wasting the high thinking budget). The type table above is the decision matrix — match the task to the type, don't force a type onto a task.

5. Ignoring artifact indices

The parent receives a JSON index from wait_all but then asks for a full summary of all subagent work, effectively re-creating the blob problem. Rule: read specific artifacts on demand, not all artifacts in sequence. The index tells you which artifact is relevant — use that signal.

The meta-principle

Delegation is not parallelism. Delegation is context isolation. The primary benefit of subagents is not running things in parallel (though that helps). It is preventing one task's context from polluting another task's context. Eight subagents each working in a focused 50K context outperform one agent working in a noisy 400K context. Specialization reduces noise; isolation prevents pollution; persistence prevents loss.


Production lessons from Claude Code, OpenCode, and the research frontier

After shipping the agent type registry, artifact persistence, message bus, and ATIF training signal, we studied how other production systems solved the same problems. Seven sources — two blogs, five papers — revealed patterns we missed and validated choices we made.

Dual state machines: the maturity gap

Our SubAgentState is a flat 4-value enum: RUNNING → DONE | FAILED | CANCELLED. Both Claude Code and OpenCode use two independent state machines per agent:

Member Status (coarse)

5 states for lifecycle

ready → busy → ready / error / shutdown_requested → shutdown. Governs whether the agent can accept work, needs recovery, or should be cleaned up.

Execution Status (fine-grained)

6+ states for prompt loop

idle → starting → running → completing → completed → idle. The UI uses execution status for spinners; crash recovery uses member status for decisions.

Why it matters: we can't distinguish "agent initializing" from "agent actively running" from "agent wrapping up." The UI shows a spinner for all three. More critically, crash recovery needs to know whether an agent was genuinely busy or in a stale state. Splitting into two levels is on our immediate roadmap.
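
The split could be modeled as two enums with explicit transition tables. The sketch below is our reading of the state lists above, not code from Claude Code or OpenCode: the transition sets (in particular, which states may follow `busy`, and that `error` has no outgoing transitions listed) are assumptions.

```python
from enum import Enum


class MemberStatus(Enum):       # coarse lifecycle, used for recovery decisions
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTDOWN_REQUESTED = "shutdown_requested"
    SHUTDOWN = "shutdown"


class ExecutionStatus(Enum):    # fine-grained prompt loop, used for UI spinners
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    COMPLETING = "completing"
    COMPLETED = "completed"


# Assumed reading of "ready → busy → ready / error / shutdown_requested → shutdown".
MEMBER_TRANSITIONS = {
    MemberStatus.READY: {MemberStatus.BUSY},
    MemberStatus.BUSY: {MemberStatus.READY, MemberStatus.ERROR,
                        MemberStatus.SHUTDOWN_REQUESTED},
    MemberStatus.SHUTDOWN_REQUESTED: {MemberStatus.SHUTDOWN},
}

EXECUTION_TRANSITIONS = {
    ExecutionStatus.IDLE: {ExecutionStatus.STARTING},
    ExecutionStatus.STARTING: {ExecutionStatus.RUNNING},
    ExecutionStatus.RUNNING: {ExecutionStatus.COMPLETING},
    ExecutionStatus.COMPLETING: {ExecutionStatus.COMPLETED},
    ExecutionStatus.COMPLETED: {ExecutionStatus.IDLE},
}


def transition(current, target, table):
    """Return `target` if the move is allowed by `table`, else raise ValueError."""
    if target not in table.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


member = transition(MemberStatus.READY, MemberStatus.BUSY, MEMBER_TRANSITIONS)
execution = transition(ExecutionStatus.IDLE, ExecutionStatus.STARTING, EXECUTION_TRANSITIONS)
print(member.value, execution.value)
```

With both values available, "busy + starting" and "busy + completing" become distinguishable, which the flat 4-value enum cannot express.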

[Interactive figure: dual state machine transitions]

Crash recovery: the gap that bites you at 2am

We have zero recovery for subagent state. If the Electron app crashes while 4 subagents are running, those agents are simply lost — their state is purely in-memory, their results never persisted unless they already wrote artifacts. OpenCode's recovery sequence is deliberate:

Step 1: Register handlers

Permission restoration handler registered before recovery begins. Recovery may trigger cleanup, which may need to restore delegate-mode permissions.

Step 2: Force-transition stale agents

All agents marked "busy" are force-transitioned to "ready." A system message is injected: [System]: Server restarted. Teammates interrupted.

Step 3: Subscribe to cleanup after

Event subscriptions registered after recovery completes. Avoids spurious cleanup from the force-transitions in step 2.

Critical design decision

No auto-restart after crash. Interrupted agents are marked ready but idle — the human must re-engage them. This prevents runaway agents burning API credits overnight. OpenCode quote: "You lose convenience, but you don't wake up to find four agents burning API credits all night." We should adopt this principle.
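
The ordering of the three recovery steps, plus the no-auto-restart rule, can be sketched as follows. This is not OpenCode's code: `AgentRecord`, `bootstrap_recovery`, and the string log of bootstrap actions are hypothetical stand-ins; only the step order and the injected system message come from the description above.

```python
from dataclasses import dataclass, field

RESTART_NOTICE = "[System]: Server restarted. Teammates interrupted."


@dataclass
class AgentRecord:
    agent_id: str
    member_status: str                      # "ready" | "busy" | ...
    messages: list[str] = field(default_factory=list)


def bootstrap_recovery(records: list[AgentRecord], actions: list[str]) -> list[str]:
    """Run the recovery sequence and return the ids of interrupted agents."""
    # Step 1: handlers are registered before any state is touched.
    actions.append("register_permission_handler")

    # Step 2: force-transition stale agents and tell them what happened.
    interrupted = []
    for record in records:
        if record.member_status == "busy":
            record.member_status = "ready"
            record.messages.append(RESTART_NOTICE)
            interrupted.append(record.agent_id)

    # Step 3: subscribe to cleanup events only after the forced transitions,
    # so they do not trigger spurious cleanup.
    actions.append("subscribe_cleanup_events")

    # Deliberately no restart here: interrupted agents stay idle until re-engaged.
    return interrupted


records = [AgentRecord("sub_1", "busy"), AgentRecord("sub_2", "ready"),
           AgentRecord("sub_3", "busy")]
actions: list[str] = []
interrupted = bootstrap_recovery(records, actions)
print(interrupted, [r.member_status for r in records], actions)
```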

[Interactive figure: bootstrap recovery sequence]

Token duplication: the hidden cost

Measured across real production frameworks (Galileo research):

| Framework | Token duplication rate | Cost multiplier |
| --- | --- | --- |
| CAMEL | 86% | 7.1× |
| MetaGPT | 72% | 3.6× |
| AgentVerse | 53% | 2.1× |
| Freyja (estimated) | <15% | ~1.2× |

Systems consume 1.5–7× more tokens than necessary due to redundant context sharing. Our design already addresses this: the message bus shares concise findings (not full context), artifact persistence avoids re-fetching, and agent type specialization keeps tool definitions lean. But we should instrument and measure our actual duplication rate.
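
The multipliers in the table are numerically consistent with 1 / (1 - duplication rate): if a fraction d of all tokens is redundant, the useful fraction is 1 - d, so total spend is 1 / (1 - d) times the necessary spend. Whether the cited research defines the multiplier this way is an assumption; the sketch below only checks the arithmetic and shows how a measured token ratio could be converted back into a duplication rate.

```python
def cost_multiplier(duplication_rate: float) -> float:
    """Total tokens spent per necessary token, given the redundant fraction."""
    if not 0.0 <= duplication_rate < 1.0:
        raise ValueError("duplication_rate must be in [0, 1)")
    return 1.0 / (1.0 - duplication_rate)


def duplication_rate(total_tokens: int, single_agent_tokens: int) -> float:
    """Inverse direction: redundant fraction implied by a measured token ratio."""
    return 1.0 - single_agent_tokens / total_tokens


for name, rate in [("CAMEL", 0.86), ("MetaGPT", 0.72), ("AgentVerse", 0.53)]:
    print(f"{name}: {cost_multiplier(rate):.1f}x")

# Hypothetical measurement: 120k tokens across all agents vs. 100k for one agent.
print(f"{duplication_rate(120_000, 100_000):.1%} duplicated")
```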

Peer-to-peer validates our bus

Claude Code uses leader-centric routing (teammate → leader → teammate). OpenCode moved to full peer-to-peer mesh and found the lead could focus on orchestration instead of being a message router. Our publish_finding / read_findings bus is already peer-to-peer — any sibling can see any other's findings directly. This validates our approach over Claude Code's original design.

What goes on the roadmap

Immediate

Dual state machines

Split SubAgentState into MemberStatus + ExecutionStatus. UI uses execution status for spinners; recovery uses member status for decisions.

Immediate

Crash recovery bootstrap

Persist subagent records to disk alongside artifacts. On bridge restart: force-transition stale agents, inject system notification, no auto-restart.

Near-term

Token duplication metrics

Instrument total_tokens_across_all_agents / hypothetical_single_agent_tokens. Surface in session export and ATIF trajectories.

Near-term

Auto-wake for warm pools

Event-driven restart of idle agents on message bus delivery. The missing piece for persistent agent teams.

Future

Message bus TTL + namespacing

Auto-expire old messages for long-running sessions. Namespace isolation per agent role.

Future

Permission delegation

Orchestrator voluntarily restricts own tools during coordination-heavy phases, forcing focus on delegation.

What we got right

The field research validated every major design choice: declarative agent types with tool filtering (vs Claude Code's undifferentiated workers), peer-to-peer message bus (vs leader-centric relay), artifact persistence to files (vs in-memory only), and context-centric decomposition over role-based decomposition. The gaps — dual state machines and crash recovery — are maturity improvements, not architectural pivots.