From generic delegation to context-aware multi-agent coordination — how Freyja decomposes, delegates, persists, and trains.
The default multi-agent pattern is seductively simple: spawn N identical children, collect their results, continue. Every production system starts here. Every one discovers the same six failure modes.
The parent's 1M token context window becomes saturated with raw research output from children. By the time the orchestrator needs to act, its own reasoning is buried under megabytes of subagent artifacts.
Eight subagents produce 268k characters of findings. The truncator sees one giant blob and preserves only the first 2k characters — which means only sub_1's output survives. Seven subagents' work is silently discarded.
All children use the same frontier model, the same tool set, the same system prompt. An exploration task gets the same 40-tool context as a code-writing task. No tool-set specialization, no thinking-effort tuning.
Subagents cannot share intermediate findings. If sub_2 discovers an API endpoint that sub_5 needs, the information must route through the parent — creating an orchestrator bottleneck and adding turns of latency.
The orchestrator's choice to delegate a task as "explore" rather than "code" is a structured signal about task decomposition. But it is never captured in training data. Delegation strategy cannot improve over time.
The parent blocks on wait_all, accumulating results linearly. Eight parallel subagents that each take 30 seconds still require the parent to process all eight results serially, creating a serialization bottleneck.
Generic delegation is a lossy compression scheme disguised as parallelism. You pay for 8 subagents' compute but retain only 1 subagent's output. The fix is not better truncation — it is structural: persist each subagent's artifacts independently, return structured indices instead of raw text, and specialize agents so each one does less but does it better.
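To make the loss concrete, here is a minimal sketch of the blob-truncation failure. The 2,000-character limit and the eight subagents come from the description above; `truncate_blob`, the per-agent output size, and the `[sub_N]` markers are illustrative assumptions, not Freyja's actual truncator.

```python
# Illustrative only: a naive truncator applied to concatenated subagent output.
TRUNCATE_LIMIT = 2_000  # characters kept, as described above


def truncate_blob(blob: str, limit: int = TRUNCATE_LIMIT) -> str:
    """Keep only the first `limit` characters of one undifferentiated string."""
    return blob[:limit]


# Eight subagents, each producing 33,500 characters (268k in total).
outputs = {f"sub_{i}": f"[sub_{i}] " + "x" * 33_492 for i in range(1, 9)}
blob = "\n".join(outputs.values())

kept = truncate_blob(blob)
# Which subagents still have any trace in what the parent actually sees?
survivors = [name for name in outputs if f"[{name}]" in kept]
print(len(blob), survivors)
```

Because the truncator has no notion of where one subagent's output ends and the next begins, everything after the first 2,000 characters of sub_1 is dropped.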
Four sources shaped our architecture. Three are Anthropic's own research on multi-agent coordination. The fourth is the Harbor framework for agent training. Together they provide both the coordination theory and the training infrastructure.
The foundational blog post that established context-centric decomposition as the primary design principle. The key insight: you decompose tasks not by domain or by complexity, but by context requirements. Two subtasks that need different tool sets should be different agents, even if they are in the same domain.
Specialization by tool set is the strongest lever. An agent with 5 focused tools outperforms an agent with 40 general tools, even on tasks that could technically be solved with either configuration. Tool-set specialization reduces attention dilution, shrinks the system prompt, and makes the agent's behavior more predictable.
The blog introduces three decomposition heuristics: (1) Does the subtask need different tools? → new agent. (2) Does it need a different context window? → new agent. (3) Does it need a different trust boundary? → new agent. We use all three.
Catalogues five coordination patterns with decision criteria for when to use each:
| Pattern | When to use | Freyja mapping |
|---|---|---|
| Generator-Verifier | One agent produces, another validates. High-stakes outputs. | code + verify agent types |
| Orchestrator-Subagent | Dynamic decomposition. Parent decides task split at runtime. | Our primary pattern. Parent delegates via sub_agent tool. |
| Agent Teams | Persistent specialists with ongoing roles. | Phase 2: warm agent pools. |
| Message Bus | Loosely coupled agents reacting to events. | SessionMessageBus (current: append-only log). |
| Shared State | All agents read/write to a common workspace. | Artifact directory: ~/.freyja/sessions/{id}/artifacts/ |
The critical guidance: start with orchestrator-subagent, add other patterns only when needed. Over-engineering coordination is the most common multi-agent failure mode. We followed this advice.
The UX research that established two design constraints for our agent type system:
Progressive disclosure. Don't front-load the full tool catalog into every agent. Start with a minimal set and expand only when the agent demonstrates it needs more. This directly motivates our tool_include / tool_exclude mechanism in the AgentType dataclass — each type gets exactly the tools it needs, no more.
The ~20 tool sweet spot. Anthropic's research found that agent performance degrades noticeably above ~20 tools in the context. Below 20, adding tools helps. Above 20, each additional tool dilutes attention across all tools, reducing the likelihood of the agent choosing the right one. In the registry, explore lists 7 tools, code lists 7, and verify lists 6. No specialized type exceeds 20.
Harbor provides the training pipeline that makes our delegation decisions improvable over time. Three components matter:
ATIF v1.6 trajectory format: a structured JSON format for recording agent trajectories. Each trajectory includes actions, observations, and metadata. We extend this with subagent_trajectory_ref.extra to record the agent_type used for each delegation, creating a structured signal for training.
RewardKit: a framework for defining per-task reward functions. We use it to define per-type rewards: explore agents are rewarded for coverage (% of relevant files found), code agents for test pass rate, and verify agents for detection precision (real bugs found as a share of all issues reported).
22+ agent adapters: Harbor supports adapters for different agent frameworks. Running Freyja as a Harbor adapter means our subagent trajectories can be directly consumed by Harbor's training pipeline, enabling RL fine-tuning of delegation strategy.
Tool-set specialization > prompt specialization > domain specialization. Giving an agent different tools has a stronger effect than giving it a different system prompt, which in turn has a stronger effect than restricting it to a domain. This ordering determined our AgentType design: tool_include/exclude is the primary lever, system_prompt is secondary, and domain is not encoded at all.
Each agent type is a dataclass that fully describes the model configuration, tool constraints, and behavioral prompt for a specialization. The registry is a simple dictionary. Adding a new type is a single dict entry — the system prompt, tool injection, and training pipeline all auto-update.
Six fields define a complete agent specialization. The design is deliberately minimal — every field earns its place by affecting runtime behavior or training signals.
MODEL
Which model to use and how much extended thinking to allocate. explore-fast uses Haiku with no extended thinking. verify uses Sonnet with high thinking. Cost per token varies 20x across types.
TOOL SET
Whitelist or blacklist of tools. explore gets read_file, search, web_search. code gets write_file, bash, run_tests. The strongest specialization lever we have.
BEHAVIOR
A focused prompt that defines the agent's role and constraints. verify: "Find bugs, do not fix them." explore: "Survey broadly, summarize findings." Max iterations prevent runaway agents.
```python
from dataclasses import dataclass

# EXPLORE_PROMPT, EXPLORE_FAST_PROMPT, CODE_PROMPT, and VERIFY_PROMPT are
# role-specific prompt strings defined elsewhere in the module.


@dataclass
class AgentType:
    model: str                       # e.g. "claude-sonnet-4-20250514"
    thinking_effort: str | None      # "low" | "medium" | "high" | None
    tool_include: list[str] | None   # whitelist (None = all tools)
    tool_exclude: list[str] | None   # blacklist (None = no exclusions)
    system_prompt: str               # role-specific instructions
    max_iterations: int              # hard stop on agent turns


AGENT_TYPES: dict[str, AgentType] = {
    "general": AgentType("claude-sonnet-4-20250514", "medium", None, None,
                         "You are a general-purpose assistant.", 25),
    "explore": AgentType("claude-sonnet-4-20250514", "low",
                         ["read_file", "search", "grep", "glob", "web_search",
                          "list_directory", "publish_finding"], None,
                         EXPLORE_PROMPT, 30),
    "explore-fast": AgentType("claude-haiku-3-5", None,
                              ["read_file", "search", "grep", "glob"], None,
                              EXPLORE_FAST_PROMPT, 15),
    "code": AgentType("claude-sonnet-4-20250514", "high",
                      ["read_file", "write_file", "edit_file", "bash",
                       "run_tests", "search", "grep"], None,
                      CODE_PROMPT, 40),
    "verify": AgentType("claude-sonnet-4-20250514", "high",
                        ["read_file", "search", "grep", "bash", "run_tests",
                         "publish_finding"], None,
                        VERIFY_PROMPT, 20),
}
```
| Type | Model | Thinking | Tools | Philosophy | Max iter |
|---|---|---|---|---|---|
| general | Sonnet | medium | all (unrestricted) | Fallback. When no specialization fits. | 25 |
| explore | Sonnet | low | read, search, grep, glob, web, list, publish | Survey broadly. Cover ground. Never write files. | 30 |
| explore-fast | Haiku | none | read, search, grep, glob | Cheap, fast reconnaissance. No web, no publish. | 15 |
| code | Sonnet | high | read, write, edit, bash, tests, search, grep | Implement. Test. Iterate. High thinking budget. | 40 |
| verify | Sonnet | high | read, search, grep, bash, tests, publish | Find bugs. Don't fix them. Report findings. | 20 |
We considered exposing sub_agent_explore, sub_agent_code, etc. as separate tools. But this violates progressive disclosure: five delegation tools in the system prompt create choice paralysis. A single sub_agent(type="explore") tool with a type parameter is cleaner: the model makes one delegation decision, not two (whether to delegate, then how to delegate).
Adding a new agent type requires exactly one change: a new entry in the AGENT_TYPES dictionary. The system prompt generator auto-discovers available types and renders them into the parent's prompt. The training pipeline auto-captures the new type in ATIF exports. The tool's schema auto-updates its enum values. No plumbing, no coordination, no deployment.
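One way the schema could track the registry is to derive the `type` enum from the dictionary keys each time the tool definition is built. The sketch below assumes a tool definition shaped like Anthropic's tool-use format (`name`, `description`, `input_schema`); `build_sub_agent_tool`, the `task` parameter, and the placeholder registry are hypothetical.

```python
def build_sub_agent_tool(agent_types: dict[str, object]) -> dict:
    """Build a tool definition whose `type` enum mirrors the registry keys."""
    return {
        "name": "sub_agent",
        "description": "Delegate a task to a specialized subagent.",
        "input_schema": {
            "type": "object",
            "properties": {
                # The enum is derived, never hand-maintained.
                "type": {"type": "string", "enum": sorted(agent_types)},
                "task": {"type": "string",
                         "description": "What the subagent should do."},
            },
            "required": ["type", "task"],
        },
    }


# Placeholder registry: the values would be AgentType instances.
registry: dict[str, object] = {"general": None, "explore": None, "code": None}
before = build_sub_agent_tool(registry)["input_schema"]["properties"]["type"]["enum"]

registry["verify"] = None  # the single change: one new dict entry
after = build_sub_agent_tool(registry)["input_schema"]["properties"]["type"]["enum"]
print(before, after)
```

Since the schema is rebuilt from the registry, a new entry shows up as a new enum value without touching the tool code.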
The root cause of data loss is structural, not algorithmic: wait_all concatenates results into one blob, and the truncator sees one giant string with no internal boundaries. The fix is equally structural: each subagent writes its own artifact to disk, and wait_all returns a JSON index instead of raw text.
Each subagent writes its complete output to ~/.freyja/sessions/{session_id}/artifacts/{sub_id}.md before returning. The wait_all collator then constructs a JSON index with per-agent entries: a 2-3 sentence summary, the artifact file path, the agent type used, and completion status.
The parent's context now contains ~800 characters of structured index rather than ~268,000 characters of raw text. When the parent needs details from a specific subagent, it calls read_file on the artifact path. This is compaction-resilient: even after aggressive context compaction, the SUMMARY_PROMPT preserves artifact paths in its mandatory "files referenced" section, and the footer extraction logic specifically looks for path patterns.
The JSON index format is deliberately designed to survive context compaction. Artifact paths are written out in full from the home directory (~/.freyja/sessions/abc123/artifacts/sub_3.md) and are self-describing. Even if the compactor summarizes the index, it tends to preserve the paths because they look like actionable file references, not stale data. The compactor's mandatory "files referenced" section is the backstop under aggressive summarization.
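The persist-then-index flow can be sketched in a few lines. This is a minimal illustration under stated assumptions: `finish_subagent`, `collate_index`, and the index field names are hypothetical, and a temporary directory stands in for ~/.freyja/sessions/{session_id}/artifacts/.

```python
import json
import tempfile
from pathlib import Path


def finish_subagent(artifacts_dir: Path, sub_id: str, agent_type: str,
                    output: str, summary: str) -> dict:
    """Write the full output to its own file; return only a small record."""
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    path = artifacts_dir / f"{sub_id}.md"
    path.write_text(output, encoding="utf-8")
    return {
        "sub_id": sub_id,
        "agent_type": agent_type,
        "status": "done",
        "summary": summary,           # 2-3 sentences in practice
        "artifact_path": str(path),   # the parent reads this on demand
    }


def collate_index(records: list[dict]) -> str:
    """What wait_all would hand back: an index, not the raw text."""
    return json.dumps(records, indent=2)


# Stand-in for the session's artifact directory.
artifacts_dir = Path(tempfile.mkdtemp()) / "artifacts"
records = [
    finish_subagent(artifacts_dir, f"sub_{i}", "explore",
                    output="x" * 33_500, summary=f"Mapped module {i}.")
    for i in range(1, 9)
]
index = collate_index(records)
raw_size = 8 * 33_500
print(len(index), "index characters instead of", raw_size)
```

The exact index size depends on path and summary length, but it stays bounded per agent, so no subagent can crowd out another.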
The pure orchestrator-subagent pattern has a fundamental bottleneck: all information flows through the parent. If sub_2 discovers a critical API endpoint that sub_5 needs, the finding must: (1) travel up from sub_2 to parent, (2) be processed by parent, (3) travel down from parent to sub_5 in a new delegation. This adds turns of latency and consumes parent context window capacity.
The SessionMessageBus solves this with an append-only shared log. Any subagent can publish a finding; any sibling can read all published findings. The parent is no longer the sole information relay.
Two tools are injected into every child subagent: publish_finding and read_findings. Publishing appends to the log; reading accepts an optional since_index parameter for cursor-based pagination and a topic filter for targeted reads.
```python
import time
from typing import Literal

# `bus` (the session's SessionMessageBus instance) and `Message` are defined
# elsewhere in the module.


# publish_finding tool
def publish_finding(
    topic: Literal["findings", "errors", "progress"],
    content: str,
    agent_id: str,  # auto-injected, not user-provided
) -> dict:
    """Publish a finding to the shared message bus."""
    return bus.append(Message(
        topic=topic,
        content=content,
        agent_id=agent_id,
        timestamp=time.time(),
        index=bus.next_index(),
    ))


# read_findings tool
def read_findings(
    topic: str | None = None,  # filter by topic, or None for all
    since_index: int = 0,      # cursor-based pagination
) -> list[Message]:
    """Read findings from siblings. Use since_index to avoid re-reading."""
    return bus.read(topic=topic, since_index=since_index)
```
We considered two designs. asyncio.Queue per subscriber: each subagent gets its own queue, messages are fanned out at publish time. Clean async semantics, but O(N) memory per message and complex lifecycle management when agents die mid-task. Append-only log with cursors: one shared list, each reader tracks its own position. Simpler, O(1) memory per message, naturally handles agent death (no orphaned queues).
We chose the log. The 500-message cap prevents unbounded growth. Cursor-based reads mean agents only process new messages, not the full history. Topic filtering reduces noise without separate channels. The design is essentially a minimal Kafka (an append-only log, per-consumer offsets, and topics, used here as a read filter rather than as partitions) in 80 lines of Python.
The current bus requires agents to actively poll with read_findings. The next evolution is passive notification: inject recent findings into the subagent's system message at each turn start. The agent sees "SIBLING FINDINGS SINCE LAST TURN: [...]" without needing to call a tool. This reduces the turns-to-discovery from 2-3 (agent decides to read, calls read_findings, processes results) to 0 (findings appear in context automatically).
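Passive notification could be sketched as a rendering step that runs before each turn. The header string comes from the text above; `render_sibling_findings`, the message dictionaries, and the decision to hide an agent's own findings are assumptions for illustration.

```python
def render_sibling_findings(messages: list[dict], cursor: int,
                            self_id: str) -> tuple[str, int]:
    """Return (block to inject, new cursor) for one agent's next turn."""
    fresh = [m for m in messages
             if m["index"] >= cursor and m["agent_id"] != self_id]
    # Advance past everything seen so far, including the agent's own posts.
    new_cursor = max([cursor] + [m["index"] + 1 for m in messages])
    if not fresh:
        return "", new_cursor
    lines = [f"- [{m['topic']}] {m['agent_id']}: {m['content']}" for m in fresh]
    return "SIBLING FINDINGS SINCE LAST TURN:\n" + "\n".join(lines), new_cursor


log = [
    {"index": 0, "topic": "findings", "agent_id": "sub_2",
     "content": "POST /v1/login issues the session token"},
    {"index": 1, "topic": "progress", "agent_id": "sub_5",
     "content": "starting middleware"},
]
block, cursor = render_sibling_findings(log, cursor=0, self_id="sub_5")
print(block)
```

Calling the function again with the returned cursor yields an empty block until a sibling publishes something new, so the injected text does not grow turn over turn.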
Every time the orchestrator calls sub_agent(type="explore"), it makes a structured decision about task decomposition. That decision — which type for which task — is a training signal. If we capture it in the right format, we can use it to improve delegation strategy over time via reinforcement learning.
The subagent_trajectory_ref.extra field in ATIF v1.6 now includes three pieces of metadata: the agent_type used, the artifact_path where results were persisted, and the model that was used. This creates a structured signal that links task descriptions to delegation strategies to outcomes.
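As a rough illustration, the three metadata fields might be attached like this. Only the `subagent_trajectory_ref.extra` nesting and the three field names come from the text; the helper name and the omission of every other ATIF field are assumptions, and this is not a complete or validated ATIF record.

```python
import json


def delegation_extra(agent_type: str, artifact_path: str, model: str) -> dict:
    """The three delegation fields recorded per subagent call."""
    return {
        "agent_type": agent_type,
        "artifact_path": artifact_path,
        "model": model,
    }


# Fragment only: all other trajectory fields are left out.
step_fragment = {
    "subagent_trajectory_ref": {
        "extra": delegation_extra(
            agent_type="explore",
            artifact_path="~/.freyja/sessions/abc123/artifacts/sub_3.md",
            model="claude-sonnet-4-20250514",
        ),
    },
}
print(json.dumps(step_fragment, indent=2))
```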
Different agent types have different success criteria. Harbor's RewardKit lets us define per-type reward functions that capture these distinctions:
| Agent type | Primary reward | Measurement | Anti-reward |
|---|---|---|---|
| explore | Coverage | % of relevant files/APIs discovered vs. ground truth | Redundant file reads (same file read 3+ times) |
| explore-fast | Speed-weighted coverage | Coverage / wall-clock-seconds | Exceeding 15-iteration budget |
| code | Test pass rate | % of tests passing after changes | Introducing regressions (tests that passed before, fail after) |
| verify | Bug detection precision | Real bugs found / total issues reported | False positives (reported issues that aren't bugs) |
| general | Task completion | Binary: did the parent's overall task succeed? | Using more iterations than a specialized type would |
Harbor + RewardKit closes the loop from delegation to improvement. ATIF trajectories with agent_type metadata flow into Harbor's training pipeline. RewardKit scores each trajectory with the appropriate per-type reward function. RL fine-tuning updates the orchestrator's delegation policy: which type to use for which task pattern. Over time, the orchestrator learns that "find all REST endpoints" should be explore, "implement the auth middleware" should be code, and "check for SQL injection" should be verify.
The architecture evolves through five phases, each building on the previous. We are currently in Phase 1. Each subsequent phase is motivated by a concrete limitation of the current design.
The parent orchestrator delegates via sub_agent(type=...). Each child runs in an isolated context with type-specific tools and prompts. Artifacts persist to disk. The message bus enables sibling communication. ATIF captures delegation decisions. This is the foundation — functional, shipping, collecting training data.
Currently, every sub_agent call spawns a fresh agent that starts from scratch. Warm pools maintain a set of persistent workers across turns. An explore agent that mapped the codebase in turn 3 retains that understanding in turn 7 when a follow-up exploration is needed. This eliminates the cold-start cost of re-reading files that a previous instance already processed.
Implementation: a WorkerPool keyed by agent type, with LRU eviction when the pool exceeds a configurable size. Each worker retains its context window and artifact directory across invocations. The parent addresses workers by type, not by ID — "give me an explore worker" rather than "resume sub_3."
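A minimal sketch of such a pool, assuming one warm worker per agent type and a plain list standing in for the retained context window. `Worker`, `acquire`, and the default pool size are hypothetical; Phase 2 is not implemented yet, so this only illustrates the LRU mechanics.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Worker:
    agent_type: str
    # Stand-in for the retained context window.
    history: list[str] = field(default_factory=list)


class WorkerPool:
    """Warm workers keyed by agent type, evicted least-recently-used first."""

    def __init__(self, max_size: int = 4) -> None:
        self.max_size = max_size
        self._workers: OrderedDict[str, Worker] = OrderedDict()

    def acquire(self, agent_type: str) -> Worker:
        if agent_type in self._workers:
            self._workers.move_to_end(agent_type)  # mark as recently used
            return self._workers[agent_type]
        worker = Worker(agent_type)
        self._workers[agent_type] = worker
        if len(self._workers) > self.max_size:
            self._workers.popitem(last=False)      # drop the LRU worker
        return worker

    def types(self) -> list[str]:
        return list(self._workers)


pool = WorkerPool(max_size=2)
pool.acquire("explore").history.append("mapped src/auth in turn 3")
pool.acquire("code")
warm = pool.acquire("explore")   # same worker, context retained
pool.acquire("verify")           # evicts "code", the least recently used
print(warm.history, pool.types())
```

Addressing by type rather than by ID is what lets the pool decide which instance to reuse or evict without the parent tracking worker identities.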
The current message bus is pull-based: agents must actively call read_findings. Phase 3 makes it push-based: findings are injected into the agent's system message at each turn start. Additionally, the bus gains routing rules: "route all findings with topic 'api_endpoint' to agents of type 'code'" — declarative wiring that eliminates the orchestrator bottleneck for information flow.
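Declarative routing could be as small as a list of rules matched at publish time. This is a sketch of the idea only: `RoutingRule`, `route`, and the agent-id-to-type mapping are hypothetical, and Phase 3 is a planned design rather than shipped code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    topic: str
    target_type: str


def route(topic: str, rules: list[RoutingRule],
          agents: dict[str, str]) -> list[str]:
    """Return ids of agents whose type matches a rule for this topic."""
    target_types = {r.target_type for r in rules if r.topic == topic}
    return sorted(aid for aid, atype in agents.items()
                  if atype in target_types)


rules = [RoutingRule(topic="api_endpoint", target_type="code")]
agents = {"sub_1": "explore", "sub_2": "code", "sub_3": "code", "sub_4": "verify"}

recipients = route("api_endpoint", rules, agents)
print(recipients)
```

Findings on topics with no matching rule reach nobody by push, so such agents would still fall back to reading the log.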
The orchestrator can create new agent types at runtime by specifying a tool set and prompt. "I need an agent that only has access to the database tools and knows our schema conventions" — a new type is born for this session, used for delegation, and the type definition is captured in ATIF for potential promotion to a permanent type. The registry becomes a living, evolving catalog.
Freyja becomes one of Harbor's 22+ agent adapters. Freyja's subagent trajectories flow directly into Harbor's training pipeline without conversion. Harbor's cross-agent insights (patterns learned from other adapters' trajectories) flow back into Freyja's delegation policy. The training loop crosses framework boundaries.
Knowing what to delegate is harder than knowing how to delegate. The following decision framework captures the patterns we've found effective and the anti-patterns we've learned to avoid.
"What does the auth module look like?" "Find all REST endpoints in the codebase." "Research how the payment flow works." Use when the task is about understanding, not changing. The agent reads broadly but writes nothing.
"Is there a Dockerfile in this repo?" "What test framework does this project use?" Narrow, factual questions where speed matters more than depth. Uses Haiku for cost-efficiency. No web search, no publishing — just fast local reads.
"Implement the JWT validation middleware." "Add unit tests for the user service." The agent has write access and test-running capability. High thinking effort because implementation requires careful reasoning. The most expensive type per invocation.
"Run the test suite and report failures." "Check the auth flow for security vulnerabilities." The agent reads and runs tests but does not fix what it finds. This separation is deliberate — the generator-verifier pattern requires the verifier to be independent of the generator.
| Dimension | Foreground (sync) | Background (async) |
|---|---|---|
| Blocking | Parent waits for result before continuing | Parent continues working; checks results later |
| Use when | Result is needed for the next decision. Exploration that determines the plan. | Result is supplementary. Parallel work that improves quality but isn't blocking. |
| Context cost | Result enters parent context immediately | Result stays in artifact file until explicitly read |
| Example | explore("What auth mechanism does this repo use?") → determines which code type to spawn | verify("Run full test suite") → parent continues coding; reads results before PR |
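The foreground/background distinction maps naturally onto awaiting a coroutine versus scheduling a task. The sketch below uses asyncio with a stub `run_subagent` that sleeps instead of calling a model; the function and its return strings are assumptions for illustration.

```python
import asyncio


async def run_subagent(agent_type: str, task: str, delay: float = 0.01) -> str:
    """Stub: a real implementation would run an agent loop here."""
    await asyncio.sleep(delay)
    return f"{agent_type}: {task}"


async def orchestrate() -> list[str]:
    # Foreground: the answer determines the next step, so wait for it.
    auth = await run_subagent("explore", "What auth mechanism does this repo use?")

    # Background: start verification, then keep working.
    verify_task = asyncio.create_task(
        run_subagent("verify", "Run full test suite", delay=0.02))
    code = await run_subagent("code", "Implement the JWT validation middleware")

    # Collect the background result only when it is needed (before the PR).
    verify = await verify_task
    return [auth, code, verify]


results = asyncio.run(orchestrate())
print(results)
```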
| Dimension | Message bus | Parent mediation |
|---|---|---|
| Latency | Immediate. Sibling reads finding on next tool call. | 2+ turns. Up to parent, processed, down to sibling. |
| Use when | Siblings working in parallel on related tasks. Findings from one improve another. | Parent needs to make a judgment call. "Sub_2 found an issue — should I redirect sub_3?" |
| Context cost | Zero parent context. Findings stay in bus. | Full parent context. Findings transit through parent window. |
| Control | Decentralized. Siblings self-coordinate. | Centralized. Parent decides what to relay. |
Delegating a task that takes the parent 3 tool calls to a subagent that takes 10 tool calls (cold start + re-read + actual work). The overhead of spawning exceeds the benefit. Rule of thumb: only delegate if the task would take the parent 5+ turns or requires a different tool set.
Giving a specialized agent too many tools defeats the purpose of specialization. An explore agent with write_file access will occasionally write files, violating the read-only invariant. Rule: start with the minimum tool set and add tools only when you have evidence the agent needs them.
Three subagents communicating via message bus to solve a task that one agent could handle alone. The bus adds complexity without parallelism benefit. Rule: message bus is for independent parallel work with shared context needs. If agents are sequential (each waiting on the previous), use foreground delegation instead.
Using explore for a task that requires writing files (explore has no write tools), or using code for pure research (wasting the high thinking budget). The type table above is the decision matrix — match the task to the type, don't force a type onto a task.
The parent receives a JSON index from wait_all but then asks for a full summary of all subagent work, effectively re-creating the blob problem. Rule: read specific artifacts on demand, not all artifacts in sequence. The index tells you which artifact is relevant — use that signal.
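The first rule of thumb above (delegate only if the task would take the parent 5+ turns or requires a different tool set) can be written as a predicate. `should_delegate` is a hypothetical helper, and reading "a different tool set" as "tools the parent does not have" is an interpretation, not something the rule states.

```python
def should_delegate(estimated_parent_turns: int,
                    required_tools: set[str],
                    parent_tools: set[str],
                    min_turns: int = 5) -> bool:
    """Delegate when the task is long enough or needs tools the parent lacks."""
    needs_other_tools = not required_tools <= parent_tools
    return estimated_parent_turns >= min_turns or needs_other_tools


parent_tools = {"read_file", "search", "grep"}

quick_lookup = should_delegate(3, {"read_file"}, parent_tools)
long_survey = should_delegate(8, {"read_file", "search"}, parent_tools)
needs_tests = should_delegate(2, {"run_tests"}, parent_tools)
print(quick_lookup, long_survey, needs_tests)
```

The turn estimate is necessarily a guess, so the threshold is best treated as a tunable default rather than a hard rule.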
Delegation is not parallelism. Delegation is context isolation. The primary benefit of subagents is not running things in parallel (though that helps). It is preventing one task's context from polluting another task's context. Eight subagents each working in a focused 50K context outperform one agent working in a noisy 400K context. Specialization reduces noise; isolation prevents pollution; persistence prevents loss.
After shipping the agent type registry, artifact persistence, message bus, and ATIF training signal, we studied how other production systems solved the same problems. Seven sources — two blogs, five papers — revealed patterns we missed and validated choices we made.
Our SubAgentState is a flat 4-value enum: RUNNING → DONE | FAILED | CANCELLED. Both Claude Code and OpenCode use two independent state machines per agent:
Member status: ready → busy → ready / error / shutdown_requested → shutdown. Governs whether the agent can accept work, needs recovery, or should be cleaned up.
Execution status: idle → starting → running → completing → completed → idle. Tracks where the agent is within a single unit of work. The UI uses execution status for spinners; crash recovery uses member status for decisions.
Why it matters: we can't distinguish "agent initializing" from "agent actively running" from "agent wrapping up." The UI shows a spinner for all three. More critically, crash recovery needs to know whether an agent was genuinely busy or in a stale state. Splitting into two levels is on our immediate roadmap.
We have zero recovery for subagent state. If the Electron app crashes while 4 subagents are running, those agents are simply lost — their state is purely in-memory, their results never persisted unless they already wrote artifacts. OpenCode's recovery sequence is deliberate:
Step 1. The permission restoration handler is registered before recovery begins. Recovery may trigger cleanup, which may need to restore delegate-mode permissions.
Step 2. All agents marked "busy" are force-transitioned to "ready." A system message is injected: [System]: Server restarted. Teammates interrupted.
Step 3. Event subscriptions are registered after recovery completes. This avoids spurious cleanup from the force-transitions in step 2.
No auto-restart after crash. Interrupted agents are marked ready but idle — the human must re-engage them. This prevents runaway agents burning API credits overnight. OpenCode quote: "You lose convenience, but you don't wake up to find four agents burning API credits all night." We should adopt this principle.
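The ordering of the recovery sequence can be sketched as follows. This is an illustration of the three steps and the no-auto-restart principle as described here, not OpenCode's code: `recover_after_restart`, the record layout, and the step labels are assumptions.

```python
RESTART_NOTICE = "[System]: Server restarted. Teammates interrupted."


def recover_after_restart(records: dict[str, dict]) -> dict:
    """Force stale agents to ready; never restart them automatically."""
    steps = []

    # Step 1: register the permission handler before any cleanup can run.
    steps.append("permission_handler_registered")

    # Step 2: anything persisted as busy was interrupted by the crash.
    interrupted = sorted(aid for aid, rec in records.items()
                         if rec["member_status"] == "busy")
    for aid in interrupted:
        records[aid]["member_status"] = "ready"
    notices = [RESTART_NOTICE] if interrupted else []
    steps.append("stale_agents_force_transitioned")

    # Step 3: subscribe to events only now, so step 2 triggers no cleanup.
    steps.append("event_subscriptions_registered")

    # No auto-restart: a human must re-engage interrupted agents.
    return {"steps": steps, "interrupted": interrupted,
            "notices": notices, "restarted": []}


persisted = {
    "sub_1": {"member_status": "busy"},
    "sub_2": {"member_status": "ready"},
    "sub_3": {"member_status": "busy"},
}
report = recover_after_restart(persisted)
print(report["interrupted"], report["notices"])
```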
Measured across real production frameworks (Galileo research):
| Framework | Token Duplication Rate | Cost Multiplier |
|---|---|---|
| CAMEL | 86% | 7.1× |
| MetaGPT | 72% | 3.6× |
| AgentVerse | 53% | 2.1× |
| Freyja (estimated) | <15% | ~1.2× |
Systems consume 1.5–7× more tokens than necessary due to redundant context sharing. Our design already addresses this: the message bus shares concise findings (not full context), artifact persistence avoids re-fetching, and agent type specialization keeps tool definitions lean. But we should instrument and measure our actual duplication rate.
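The figures in the table are consistent with a simple relationship: cost multiplier = 1 / (1 − duplication rate). The sketch below assumes that relationship holds as a definition (the source research may define the rate differently) and uses the two token counts we plan to instrument; the function names are hypothetical.

```python
def cost_multiplier(total_tokens_across_all_agents: int,
                    hypothetical_single_agent_tokens: int) -> float:
    """How many times more tokens the multi-agent run consumed."""
    if hypothetical_single_agent_tokens <= 0:
        raise ValueError("single-agent token estimate must be positive")
    return total_tokens_across_all_agents / hypothetical_single_agent_tokens


def duplication_rate(multiplier: float) -> float:
    """Share of all tokens that were redundant (assumed definition)."""
    return 1.0 - 1.0 / multiplier


def multiplier_from_rate(rate: float) -> float:
    return 1.0 / (1.0 - rate)


# Cross-check against the table rows (CAMEL, MetaGPT, AgentVerse).
table_check = [round(multiplier_from_rate(r), 1) for r in (0.86, 0.72, 0.53)]

# Hypothetical session: 120k tokens used where one agent might need 100k.
m = cost_multiplier(120_000, 100_000)
print(table_check, m, round(duplication_rate(m), 3))
```

The hard part in practice is the denominator: the single-agent token count is a counterfactual and has to be estimated, which is why the Freyja row is marked as an estimate.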
Claude Code uses leader-centric routing (teammate → leader → teammate). OpenCode moved to full peer-to-peer mesh and found the lead could focus on orchestration instead of being a message router. Our publish_finding / read_findings bus is already peer-to-peer — any sibling can see any other's findings directly. This validates our approach over Claude Code's original design.
Split SubAgentState into MemberStatus + ExecutionStatus. UI uses execution status for spinners; recovery uses member status for decisions.
Persist subagent records to disk alongside artifacts. On bridge restart: force-transition stale agents, inject system notification, no auto-restart.
Instrument total_tokens_across_all_agents / hypothetical_single_agent_tokens. Surface in session export and ATIF trajectories.
Event-driven restart of idle agents on message bus delivery. The missing piece for persistent agent teams.
Auto-expire old messages for long-running sessions. Namespace isolation per agent role.
Orchestrator voluntarily restricts own tools during coordination-heavy phases, forcing focus on delegation.
The field research validated every major design choice: declarative agent types with tool filtering (vs Claude Code's undifferentiated workers), peer-to-peer message bus (vs leader-centric relay), artifact persistence to files (vs in-memory only), and context-centric decomposition over role-based decomposition. The gaps — dual state machines and crash recovery — are maturity improvements, not architectural pivots.