How Freyja decomposes work, delegates it to context-scoped subagents, persists results as artifacts, coordinates the agents that run in parallel, and trains its own delegation policy on the trajectories they produce.
The default multi-agent pattern is the smallest thing that runs: spawn N children, return their results to the parent, continue. The trouble is the return step. Whatever a child produces has to come back through the parent's context window, and that window is a fixed, shared budget. As you add children or ask for longer outputs, their combined results grow without bound while the channel they return through does not. That mismatch is not a bug in any one implementation. It is a property of returning work as text through a finite shared channel, and it forces a set of choices that the simple pattern has no good answer for.
The parent's 1M token context window becomes saturated with raw research output from children. By the time the orchestrator needs to act, its own reasoning is buried under megabytes of subagent artifacts.
When several children finish, their combined output can exceed what the parent can hold. Now you have to choose: spend the budget on raw results and starve the orchestrator's own reasoning, compress each result and lose detail the parent might have needed, or truncate and lose whole results outright. Every option at the text-return layer trades away something real, because the work was always larger than the channel.
All children use the same frontier model, the same tool set, the same system prompt. An exploration task gets the same 40-tool context as a code-writing task. No tool-set specialization, no thinking-effort tuning.
Subagents cannot share intermediate findings. If sub_2 discovers an API endpoint that sub_5 needs, the information must route through the parent, creating an orchestrator bottleneck and adding turns of latency.
The orchestrator's choice to delegate a task as "explore" rather than "code" is a structured signal about task decomposition. But it is never captured in training data. Delegation strategy cannot improve over time.
The parent blocks on wait_all, accumulating results linearly. Parallel subagents run concurrently, but the parent still has to read and integrate every result in sequence, so the wall-clock saving from parallel execution erodes against a serial integration step that grows with the number of children.
Generic delegation is a lossy compression scheme disguised as parallelism. You pay for every subagent's compute but retain only the first one's output. The fix is not better truncation. It is structural: persist each subagent's artifacts independently, return structured indices instead of raw text, and specialize agents so each one does less but does it better.
Each agent type is a dataclass that fully describes the model policy, tool constraints, and behavioral prompt for a specialization. The registry is a dictionary keyed by type name. Adding a new type is a single dict entry, and the system prompt, tool injection, and (eventually) the training pipeline all read from the same record.
The original version of this post described a six-field dataclass and a five-entry registry. Both grew. The dataclass now carries twelve fields, and the registry holds fifteen agent types. The growth is concentrated in two places: model selection moved from a single hard-coded string to a policy with fallbacks, and a family of read-only review/judge types appeared to support the coordination work in §5.
Twelve fields define a complete agent specialization. The design stays minimal in spirit, every field earning its place by affecting runtime behavior or the training signal, but model selection alone now needs three of them: a policy, a fallback chain, and an inheritance flag.
MODEL
No longer a single string. model_policy is a resolution strategy (inherit:parent, first_available:<chain>, or random_available:<chain>) paired with model_fallbacks. The resolver walks the chain at spawn time, so a type degrades gracefully when its preferred model is unavailable instead of hard-failing.
TOOL SET
Whitelist or blacklist of tools. explore gets read_file, search, web_search. code gets write_file, bash, run_tests. The strongest specialization lever we have.
BEHAVIOR
A focused prompt that defines the agent's role and constraints. verify: "Find bugs, do not fix them." explore: "Survey broadly, summarize findings." Max iterations prevent runaway agents.
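The resolution step described under MODEL can be sketched roughly as follows. This is a minimal illustration, not Freyja's resolver: the function name `resolve_model`, the comma-separated chain inside the policy string, the `available` set, and the final fall-through to the parent's model are all assumptions made for the example.

```python
import random
from typing import Optional, Sequence, Set


def resolve_model(
    policy: str,
    fallbacks: Sequence[str],
    parent_model: str,
    available: Set[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a model at spawn time (hypothetical helper, assumed string format)."""
    strategy, _, chain_text = policy.partition(":")
    if strategy == "inherit":
        return parent_model
    if strategy not in ("first_available", "random_available"):
        raise ValueError(f"unknown model policy: {policy!r}")

    # Keep only the models in the preferred chain that are reachable right now.
    chain = [m.strip() for m in chain_text.split(",") if m.strip() in available]
    if chain:
        if strategy == "first_available":
            return chain[0]
        return (rng or random).choice(chain)

    # Preferred chain exhausted: walk the fallback list in order.
    for model in fallbacks:
        if model in available:
            return model

    # Assumption: degrade to the parent's model rather than hard-failing.
    return parent_model
```

The model names passed in are placeholders. The point of the shape is that an unavailable preferred model changes which branch returns, never whether the spawn succeeds.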
| Type | Model policy | Thinking | Tool surface | Role | Max iter |
|---|---|---|---|---|---|
| general | inherit:parent | auto | inherited safe tools | Fallback when no specialization fits. | 100 |
| explore | first_available (Sonnet→GPT→Kimi→DeepSeek) | medium | read/grep/glob/web + publish | Deep research. Never writes source. | 160 |
| explore-fast | random_available (Kimi/MiniMax/GLM)→Haiku | off | read/list/web | Cheap parallel reconnaissance. | 60 |
| code | inherit:parent | high | read/write/edit/bash/grep | Isolated code changes. | 120 |
| verify | first_available (GPT→Sonnet→DeepSeek→GLM) | high | read/grep/bash (read-only) | QA from a different model. Finds, doesn't fix. | 100 |
| plan | inherit:parent | medium | read/grep/skills | Read-only implementation planning. | 80 |
| review | first_available (GPT→Sonnet→DeepSeek→GLM) | high | read/grep (read-only) | Pre-merge bug and regression hunt. | 100 |
| test | inherit:parent | medium | read/grep/bash | Run and diagnose build/test validation. | 100 |
| browser-qa | inherit:parent | medium | browser_execute_js / screenshot | Exercises a running UI for real behavior. | 100 |
| performance | inherit:parent | high | browser + read/grep | Profiles hot paths, proposes low-risk wins. | 140 |
| docs | inherit:parent | medium | read/edit/grep | Writes design docs and guides. | 100 |
| memory-curator | inherit:parent | medium | read/grep/skills/memory | Prunes stale memory, flags gaps. | 80 |
| specifier | inherit:parent | low | read/grep/kanban | Expands triage cards into ready specs. | 30 |
| judge-calibrator | prefer_parent→Opus/GPT/Sonnet | high | none (pure reasoning) | One-shot judge config for a goal loop. | 1 |
| judge-deep | prefer_parent→Sonnet/GPT/DeepSeek | high | read-only verification | Skeptical goal adjudication, JSON verdict. | 3 |
Values current as of 2026-05-31. auto thinking defers to the resolved model's default; inherit:parent reuses the parent's model rather than naming one, so these rows track whatever the parent is running.
We considered having sub_agent_explore, sub_agent_code, etc. as separate tools. But this violates progressive disclosure: five delegation tools in the system prompt create choice paralysis. A single sub_agent(type="explore") tool with a type parameter is cleaner, because the model makes one delegation decision, not two (whether to delegate, then how to delegate).
Adding a new agent type requires one change: a new entry in the AGENT_TYPES dictionary. The system-prompt generator discovers available types and renders them into the parent's prompt, and the tool's agent_type enum reads its values from the same registry, so a new type is selectable the moment it is added. The growth from five to fifteen types over seven weeks is the evidence this holds: every type that appears in §5's coordination work was added as a dict entry, not a code path. The trajectory exporter in §6 reads the same registry, so a new type is captured in exports for free the moment it lands.
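A small sketch of the "one dict entry" property, under stated assumptions: `AgentType` here carries only five illustrative fields (the real record has twelve), and the field names, policy-string format, tool names, and the helpers `agent_type_enum` and `render_types_for_prompt` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentType:
    """Illustrative subset of the registry record, not the full dataclass."""
    model_policy: str
    thinking: str
    tools: Tuple[str, ...]
    prompt: str
    max_iterations: int


AGENT_TYPES: Dict[str, AgentType] = {
    "explore": AgentType(
        model_policy="first_available:model-a,model-b",  # assumed string format
        thinking="medium",
        tools=("read_file", "search", "web_search"),
        prompt="Survey broadly, summarize findings.",
        max_iterations=160,
    ),
    "verify": AgentType(
        model_policy="first_available:model-b,model-a",
        thinking="high",
        tools=("read_file", "grep", "bash"),
        prompt="Find bugs, do not fix them.",
        max_iterations=100,
    ),
}


def agent_type_enum(registry: Dict[str, AgentType] = AGENT_TYPES) -> List[str]:
    """Values for the sub_agent tool's type parameter, read from the registry."""
    return sorted(registry)


def render_types_for_prompt(registry: Dict[str, AgentType] = AGENT_TYPES) -> str:
    """One line per type for the parent's system prompt."""
    return "\n".join(
        f"- {name}: {spec.prompt}" for name, spec in sorted(registry.items())
    )
```

Because both helpers read the same dictionary, adding an entry to `AGENT_TYPES` is enough for the new type to show up in the enum and in the rendered prompt.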
The root cause of data loss is structural, not algorithmic: wait_all concatenates results into one blob, and the truncator sees one giant string with no internal boundaries. The fix is equally structural: each subagent writes its own artifact to disk, and the parent reads back a manifest of paths instead of a wall of concatenated text.
Each subagent writes its complete output under the session's project directory, ~/.freyja/projects/{session_id}/, before returning. A manifest records the resolved absolute path, the creating session, the operation, size, and timestamp for every file produced; the parent reads it back through the artifacts tool rather than guessing at filesystem paths. The wait_all collator returns the full final output of every completed child, so the parent gets both the summary and the durable artifact.
The parent's context now holds a compact structured index rather than the full concatenated text of every subagent. When the parent needs details from a specific subagent, it calls read_file on the artifact path. This is compaction-resilient: even after aggressive context compaction, the SUMMARY_PROMPT preserves artifact paths in its mandatory "files referenced" section, and the footer extraction logic specifically looks for path patterns. Compaction can drop a reference; it can never drop the work, because the work is on disk.
The manifest is what makes artifacts survive context compaction. Paths are stored in resolved, absolute form (~/.freyja/projects/abc123/sub_3.md with the home directory expanded) and live outside the transcript, so a summarization pass cannot lose them: the file is still on disk and the manifest still points at it. When the compactor runs, the work itself is unaffected; at most the parent re-reads the manifest through the artifacts tool to recover a path that was dropped from its context. This is the difference between compaction losing a reference and compaction losing work: only the reference is ever at risk.
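To make the write-then-index flow concrete, here is a minimal sketch that assumes a JSON-lines manifest file. The file name `manifest.jsonl`, the key names, and the helpers `write_artifact` and `read_manifest` are illustrative; the text above only specifies which facts the manifest records.

```python
import json
import time
from pathlib import Path
from typing import Dict, List


def write_artifact(project_dir: Path, session_id: str, name: str, content: str) -> Dict:
    """Persist one subagent's output and append a manifest entry for it."""
    project_dir.mkdir(parents=True, exist_ok=True)
    path = (project_dir / name).resolve()  # store the resolved absolute path
    path.write_text(content, encoding="utf-8")
    entry = {
        "path": str(path),
        "session": session_id,
        "operation": "write",
        "size": path.stat().st_size,
        "timestamp": time.time(),
    }
    # Assumption: the manifest is an append-only JSON-lines file.
    with (project_dir / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_manifest(project_dir: Path) -> List[Dict]:
    """Return every manifest entry; the parent reads this instead of raw text."""
    manifest = project_dir / "manifest.jsonl"
    if not manifest.exists():
        return []
    lines = manifest.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line]
```

The parent only ever holds the small entries returned by `read_manifest`; the full text stays on disk until a specific path is read.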
The pure orchestrator-subagent pattern has a fundamental bottleneck: all information flows through the parent. If sub_2 discovers a critical API endpoint that sub_5 needs, the finding must: (1) travel up from sub_2 to parent, (2) be processed by parent, (3) travel down from parent to sub_5 in a new delegation. This adds turns of latency and consumes parent context window capacity.
The SessionMessageBus solves this with an append-only shared log. Any subagent can publish a finding; any sibling can read all published findings. The parent is no longer the sole information relay.
Two tools are injected into every child subagent: publish_finding and read_findings. Publishing appends to the log; reading accepts an optional since_index parameter for cursor-based pagination and a topic filter for targeted reads.
```python
import time
from typing import Literal

# `bus` (the session's message bus) and `Message` are provided by the runtime;
# they are referenced here, not defined.


# publish_finding tool
def publish_finding(
    topic: Literal["findings", "errors", "progress"],
    content: str,
    agent_id: str,  # auto-injected, not user-provided
) -> dict:
    """Publish a finding to the shared message bus."""
    return bus.append(Message(
        topic=topic,
        content=content,
        agent_id=agent_id,
        timestamp=time.time(),
        index=bus.next_index(),
    ))


# read_findings tool
def read_findings(
    topic: str | None = None,  # filter by topic, or None for all
    since_index: int = 0,      # cursor-based pagination
) -> list[Message]:
    """Read findings from siblings. Use since_index to avoid re-reading."""
    return bus.read(topic=topic, since_index=since_index)
```
We considered two designs. The first was an asyncio.Queue per subscriber: each subagent gets its own queue, and messages are fanned out at publish time. That gives clean async semantics, but it costs O(N) memory per message and needs complex lifecycle management when agents die mid-task. The second was an append-only log with cursors: one shared list, with each reader tracking its own position. It is simpler, costs O(1) memory per message, and handles agent death naturally because there are no orphaned queues.
We chose the log. The 500-message cap prevents unbounded growth. Cursor-based reads mean agents only process new messages, not the full history. Topic filtering reduces noise without separate channels. The design is essentially a minimal Kafka (append-only, consumer offsets, topic partitioning) in 80 lines of Python.
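A minimal sketch of the log-with-cursors idea follows. It is a stand-in, not the actual SessionMessageBus: the class name `AppendOnlyBus` is hypothetical, and dropping the oldest entries once the cap is exceeded is an assumption about how the 500-message cap behaves.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    topic: str
    content: str
    agent_id: str
    timestamp: float
    index: int


class AppendOnlyBus:
    """One shared list; every reader keeps its own cursor (since_index)."""

    def __init__(self, cap: int = 500) -> None:
        self._cap = cap
        self._log: List[Message] = []
        self._next = 0

    def next_index(self) -> int:
        return self._next

    def append(self, message: Message) -> Message:
        self._log.append(message)
        self._next = message.index + 1
        # Assumption: enforce the cap by discarding the oldest messages.
        # Indices stay monotonic, so existing cursors remain valid.
        overflow = len(self._log) - self._cap
        if overflow > 0:
            del self._log[:overflow]
        return message

    def read(self, topic: Optional[str] = None, since_index: int = 0) -> List[Message]:
        return [
            m for m in self._log
            if m.index >= since_index and (topic is None or m.topic == topic)
        ]


def publish(bus: AppendOnlyBus, topic: str, content: str, agent_id: str) -> Message:
    """Convenience wrapper that stamps the index and timestamp."""
    return bus.append(Message(topic, content, agent_id, time.time(), bus.next_index()))
```

A reader that stores `last.index + 1` after each read and passes it back as `since_index` only ever sees messages it has not processed, which is the cursor behavior described above.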
The message bus is one coordination mode among four. A session runs under a coordination strategy that decides how the parent dispatches work and how children report back:
| Strategy | Mechanism | When it fits |
|---|---|---|
| BUS | Parent delegates to profile workers; children publish_finding so siblings build on each other; parent synthesizes and resolves conflicts. | Overlapping research or review where discoveries help siblings. |
| kanban | Work is cards on a board (triage → ready → in-progress → done). A specifier promotes triage cards to ready; autopilot can auto-dispatch them. | Many loosely-related units of work with a backlog shape. |
| goal | A judge-evaluated loop. A judge-calibrator arms the rules, then judge-deep adjudicates each iteration with a strict-JSON verdict until the goal is met or stalls. | Qualitative objectives where "done" needs an independent verdict, not just produced output. |
| isolated | Children run in their own context with the full task ledger; the parent doesn't babysit. The worker drives its task to completion. | Self-contained tasks that don't benefit from cross-talk. |
The strategies are not exclusive in spirit, since a goal loop can spawn BUS-coordinated researchers and a kanban card can be executed by an isolated worker, but a session has one active strategy that sets the defaults for how its children are wired.
The bus is broadcast; talk() is addressed. A parent, sibling, or child can send a message to any session it can name, whether by id, by label, or by alias (parent, operator). Two flags change the semantics:
```python
talk(
    to: str | list[str],           # id, label, alias, or a list (multi-cast)
    content: str,
    force: bool = False,           # interrupt the recipient mid-operation:
                                   # cancels its current stream/tool call,
                                   # gives it one compliance turn to react
    wait_for_reply: bool = False,  # block the sender's turn until the
                                   # recipient replies (tagged to this msg)
    reply_timeout_s: int = 60,
)
```
The behavior that matters most for coordination is re-wake: a message sent to a non-running subagent revives it. The recipient picks up where it left off, with the new message prepended. This is what lets the "agent teams" pattern from §2 work without keeping idle agents resident. A specialist terminates when its task is done and resumes when a sibling needs it again, paying context cost only while it runs. force=True is the stop/redirect path: it interrupts a child mid-tool-call, which is how a parent kills a runaway exploration without waiting for it to finish.
The goal loop introduced a third kind of agent that is neither orchestrator nor worker: a verifier whose only job is to decide whether work is done. judge-deep is skeptical by default, read-only (it may run grep/cat but never writes, installs, or mutates git), and returns a strict-JSON verdict ({done, confidence, reason, criteria, open_questions}), capped at three iterations so it issues a verdict instead of drifting into doing the work itself. judge-calibrator runs once at goal-arming time to propose the rules the judge will apply. Separating the verifier from the producer is the generator-verifier pattern from §2, made structural: the model that decides "done" is a different model from the one that did the work, which is the cheapest available defense against an agent grading its own homework.
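One way to enforce the strict-JSON contract on the receiving side is sketched below. The five key names come from the verdict shape above; the value types, the 0 to 1 confidence range, and the helper names `parse_verdict` and `is_valid_verdict` are assumptions for illustration.

```python
import json
from typing import Any, Dict

# Assumed value types for the five verdict keys.
VERDICT_FIELDS = {
    "done": bool,
    "confidence": float,
    "reason": str,
    "criteria": list,
    "open_questions": list,
}


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Parse a judge reply, rejecting anything that is not exactly the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(VERDICT_FIELDS):
        raise ValueError("verdict must be an object with exactly the expected keys")
    confidence = data["confidence"]
    # Accept JSON integers such as 1, but never booleans, as a confidence.
    if isinstance(confidence, int) and not isinstance(confidence, bool):
        data["confidence"] = float(confidence)
    for key, expected in VERDICT_FIELDS.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data


def is_valid_verdict(raw: str) -> bool:
    try:
        parse_verdict(raw)
    except ValueError:
        return False
    return True
```

Rejecting a malformed reply outright, instead of repairing it, keeps the judge's output usable as a programmatic gate for the goal loop.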
Passive notification means injecting recent findings into a subagent's context instead of making it poll read_findings. The runtime does this at each turn boundary: <system-reminder> blocks, context-pressure advisories, and sibling findings published since the agent's last cursor position are spliced into the agent's stream as side-channel cues. The plumbing (the bus, cursors, per-agent read positions) carries the findings, and the injection point sits at the start of every agent turn, so a discovery a sibling published two seconds ago is in front of this agent before it picks its next action.
The trajectory export (docs/TRAJECTORY-TRAINING.md, a v3 JSON schema) feeds a reward layer that scores delegations per agent type, and a GRPO-style loop fine-tunes the routing policy on those scores. The pipeline has a name in the codebase, ATIF, with the reward functions in RewardKit and the rollout orchestration in Harbor. The mechanism below is grounded in the spring-2026 agent-RL literature, and the seams to the export format are called out where they matter.
Every time the orchestrator calls sub_agent(type="explore"), it makes a structured decision about task decomposition: which specialization for which slice of work. That decision is a training signal. A delegation that led to a verified, used result is a positive example; one that produced a discarded artifact or a redundant re-exploration is a negative one. Because those decisions are captured in a learnable format, the routing policy (which type to spawn, with what budget, against what slice of the task) is improved with reinforcement learning instead of hand-tuned in a prompt.
The thing being trained is worth naming precisely, because it is easy to conflate two different policies. Selection is which agent type to spawn for a sub-task. Decomposition is how to cut the task into sub-tasks in the first place. The capture format below records both, but they have different reward structure: selection has a relatively clean counterfactual (would a different type have done better on the same slice?), while decomposition does not (a different cut produces different slices, so there is no like-for-like comparison). The loop trains selection first for that reason, and treats decomposition as a slower second-order signal layered on once the routing policy has stabilized.
[Figure: sub_agent call to an ATIF trajectory. Each delegation emits a session event, the snapshot captures the full trajectory, and the v3 export carries it into the reward layer that scores it.]

The artifact at the center of this is the v3 export format in docs/TRAJECTORY-TRAINING.md: a per-session JSON schema that records the message sequence, tool calls, and a subagent_trajectory_ref for each delegation. That ref's extra field carries three pieces of metadata that make a delegation learnable: the agent_type chosen, the artifact_path the child wrote, and the resolved model. That triple is what links a task description to a delegation decision to an outcome, and it is what the reward layer reads when it scores the delegation after the fact.
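As an illustration of how a consumer might pull that triple out of an export, the sketch below walks a parsed JSON document and collects every subagent_trajectory_ref it finds. The overall layout of the export and the key name `model` for the resolved model are assumptions; `subagent_trajectory_ref`, `extra`, `agent_type`, and `artifact_path` are the names used above.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_delegation_refs(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every subagent_trajectory_ref object found anywhere in the export."""
    if isinstance(node, dict):
        ref = node.get("subagent_trajectory_ref")
        if isinstance(ref, dict):
            yield ref
        for value in node.values():
            yield from iter_delegation_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_delegation_refs(item)


def delegation_triples(export: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Collect (agent_type, artifact_path, model) for each complete delegation."""
    triples = []
    for ref in iter_delegation_refs(export):
        extra = ref.get("extra") or {}
        try:
            triples.append((extra["agent_type"], extra["artifact_path"], extra["model"]))
        except KeyError:
            # Incomplete metadata cannot be scored, so it is skipped here.
            continue
    return triples
```

Walking the whole document avoids depending on exactly where the schema nests each ref, which the sketch does not know.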
The clearest signal in the spring-2026 agent-RL work is that verifiable rewards (outcomes you can check programmatically) train better policies than learned preference scores, because they don't drift and can't be gamed by producing confident-sounding output.[R4] Freyja's agent types do not all have verifiable rewards, and that difference is the most important thing about this table. The reward layer sorts its functions by how checkable they are and weights the checkable ones most heavily, so the routing policy learns fastest on the types whose outcomes are scripts.
| Agent type | Reward class | Primary reward | Measurement | Anti-reward |
|---|---|---|---|---|
| code | verifiable | Test pass rate | % of tests passing after changes (programmatic) | Regressions: tests green before, red after |
| test / browser-qa | verifiable | Reproduction fidelity | Does the reported pass/fail match an independent re-run | Flaky or unreproducible verdicts |
| verify / review | semi-verifiable | Bug-detection precision | Reported issues that reproduce / total reported | False positives that cost reviewer time |
| judge-deep | semi-verifiable | Verdict calibration | Agreement with held-out human/consensus verdicts | Confident verdicts later overturned |
| explore / explore-fast | soft | Downstream usefulness | Was retrieved context actually used by a sibling/parent (citation in a later artifact) | Redundant re-reads; coverage that nothing consumed |
| plan / general | soft | Task completion | Did the parent's overall task succeed | Burning more iterations than a specialized type would |
The asymmetry is the design constraint. code and test can be trained with verifiable rewards almost immediately, because the reward is a script. The explore family is the hard case: "coverage" is easy to measure and easy to game (read every file, score high, help nobody), so the design rewards used context instead. A retrieval counts only if a downstream agent cited the artifact it produced. That ties an explorer's reward to a sibling's behavior, which is a credit-assignment problem across the delegation tree, not a single agent's episode.
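The contrast between a verifiable reward and a used-context reward can be sketched as two small functions. Both are simplified assumptions: `code_reward` subtracts a regression penalty from the pass rate with an arbitrary weight, and `explore_reward` detects a citation by looking for the artifact path in downstream text, which is only one possible way to measure "used".

```python
from typing import Dict, Iterable


def code_reward(
    before: Dict[str, bool],
    after: Dict[str, bool],
    regression_penalty: float = 1.0,
) -> float:
    """Test pass rate after the change, minus a penalty for green-to-red tests."""
    if not after:
        return 0.0
    pass_rate = sum(after.values()) / len(after)
    regressions = sum(
        1 for test, passed in before.items() if passed and not after.get(test, False)
    )
    return pass_rate - regression_penalty * regressions / len(after)


def explore_reward(artifact_path: str, downstream_texts: Iterable[str]) -> float:
    """1.0 only if some later artifact cites this explorer's artifact."""
    cited = any(artifact_path in text for text in downstream_texts)
    return 1.0 if cited else 0.0
```

Note that `explore_reward` cannot be computed from the explorer's own episode: its input is what siblings and the parent wrote afterwards.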
A single-agent RL setup assigns credit along a chain: state, action, reward, repeat. A delegation tree is not a chain. The parent spawns children, children may spawn grandchildren, several siblings may contribute to one synthesized result, and some produce artifacts that are never used. Assigning one scalar reward to the whole episode and spreading it evenly across every node rewards the explorer whose work was discarded exactly as much as the one whose work was cited. That is the central unsolved problem in training a delegation policy, and it is where the recent process-reward work is most relevant.[R5]
The design's answer is to score the tree as a DAG, not an episode. Each node (a subagent run) gets a local reward from its per-type function in the table above. Each edge (a delegation decision) gets credit proportional to how much its child's output propagated upward: concretely, whether the child's artifact was cited in the parent's synthesis, and whether the parent's task ultimately succeeded. An explorer whose artifact no parent cited gets near-zero edge credit even if its local "coverage" reward was high. This is the structural fix for the gaming failure mode. Local reward measures whether the work was good, edge credit measures whether the delegation was worth making, and the policy being trained is the edge policy.
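A toy version of the node-versus-edge split is sketched below. It models the delegation structure as a tree for brevity, and it combines citation and task success by multiplication down the path from the root; both choices, along with the names `Run` and `edge_credits`, are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One subagent run: a node with a local reward and a citation flag."""
    run_id: str
    local_reward: float
    cited_by_parent: bool = False
    children: List["Run"] = field(default_factory=list)


def edge_credits(root: Run, task_succeeded: bool) -> Dict[str, float]:
    """Credit for each delegation edge, keyed by the child run it spawned."""
    credits: Dict[str, float] = {}

    def walk(node: Run, upstream: float) -> None:
        for child in node.children:
            # An uncited artifact earns no edge credit, whatever its local reward.
            credit = upstream * (1.0 if child.cited_by_parent else 0.0)
            credits[child.run_id] = credit
            walk(child, credit)

    walk(root, 1.0 if task_succeeded else 0.0)
    return credits
```

In this sketch an explorer with a high `local_reward` and `cited_by_parent=False` still gets zero edge credit, which is the separation between "the work was good" and "the delegation was worth making".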
Selection has a clean counterfactual that decomposition lacks: for a given task slice, you can ask whether a different agent type would have done better. The design exploits this with the group-relative approach from the GRPO line of work. For a sampled task slice, spawn the same slice against several candidate types (or the same type at several budgets), and optimize the routing policy relative to the group's outcomes rather than against an absolute value estimate.[R6] This sidesteps the hardest part of agent RL, which is estimating the absolute value of a delegation; ranking delegations within a batch is far more robust than scoring one in isolation. The cost is sample efficiency, since you pay for the counterfactual spawns, so the design runs counterfactual rollouts offline on logged trajectories where possible, replaying a recorded task slice against an alternate type rather than re-executing live.
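The group-relative step can be written down compactly: each candidate's reward is normalized against the mean and standard deviation of its own group. The sketch below uses the population standard deviation and a small epsilon for stability; the function name and the dictionary keyed by agent type are illustrative choices, not the pipeline's interface.

```python
from statistics import mean, pstdev
from typing import Dict


def group_relative_advantages(
    rewards: Dict[str, float], eps: float = 1e-8
) -> Dict[str, float]:
    """Advantage of each candidate relative to the group it was sampled with."""
    if not rewards:
        return {}
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)
    # When every candidate scored the same, all advantages collapse to zero.
    return {name: (r - mu) / (sigma + eps) for name, r in rewards.items()}
```

For example, replaying one task slice against `explore`, `explore-fast`, and `general` and feeding the three rewards in yields a ranking signal for that slice without any absolute value estimate.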
Not everything should be learned. The design draws the line at routing: the policy that maps a task slice to (agent type, budget) is worth training because it has a measurable outcome and a large decision space. The per-type system prompts, the tool whitelists, and the coordination strategy stay hand-authored, because they are low-dimensional, interpretable, and cheap to edit; training them would trade away the auditability that makes the registry a single dict entry. The endpoint is modest and specific: a routing policy that has learned, from logged delegation outcomes, that "enumerate the REST endpoints" is an explore job at a low budget, "implement the auth middleware" is a code job at a high one, and "decide whether this goal is actually met" is a judge-deep job, with the budgets learned from data instead of guessed.
The reward functions, the DAG credit assignment, the counterfactual rollouts, and the RL fine-tuning all sit downstream of one thing: the trajectory capture has to be lossless and right. If the trajectories don't record agent type, artifact lineage, and downstream citation from the start, none of the training above is recoverable later, because the signal was never written. That is why the v3 schema and the extra-field enrichment came first and are the most conservative part of the system: everything else can be retrained, reweighted, or swapped, but a delegation outcome that wasn't logged is gone. The rest of the loop evolves against a fixed, complete record.
The architecture evolves through five phases, each building on the previous. We are currently in Phase 1. Each subsequent phase is motivated by a concrete limitation of the current design.
The parent orchestrator delegates via sub_agent(type=...). Each child runs in an isolated context with type-specific tools and prompts, and artifacts persist to disk through the manifest. This foundation shipped, and then development overtook the roadmap below on a different axis. The growth from five to fifteen agent types, the four coordination strategies, the talk() channel, re-wakeable subagents, and the judge family in §5 were never on this timeline; they emerged from coordination needs rather than from the plan. The trajectory capture format and the training loop it feeds (§6) also arrived during this phase. The phases below are the axis the roadmap planned along; the coordination work happened orthogonally to all of them.
Currently, every sub_agent call spawns a fresh agent that starts from scratch. Warm pools maintain a set of persistent workers across turns. An explore agent that mapped the codebase in turn 3 retains that understanding in turn 7 when a follow-up exploration is needed. This eliminates the cold-start cost of re-reading files that a previous instance already processed.
Implementation: a WorkerPool keyed by agent type, with LRU eviction when the pool exceeds a configurable size. Each worker retains its context window and artifact directory across invocations. The parent addresses workers by type, not by ID, asking for "an explore worker" rather than "resume sub_3."
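A compact sketch of a type-keyed pool with LRU eviction, built on `collections.OrderedDict`. Holding exactly one worker per agent type, the `factory` callback, and the default size are assumptions; the planned WorkerPool may differ.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

W = TypeVar("W")


class WorkerPool(Generic[W]):
    """Keeps at most max_size workers, one per agent type, evicting the LRU."""

    def __init__(self, factory: Callable[[str], W], max_size: int = 4) -> None:
        self._factory = factory
        self._max_size = max_size
        self._workers: "OrderedDict[str, W]" = OrderedDict()

    def acquire(self, agent_type: str) -> W:
        if agent_type in self._workers:
            # Reuse the warm worker and mark it most recently used.
            self._workers.move_to_end(agent_type)
            return self._workers[agent_type]
        worker = self._factory(agent_type)
        self._workers[agent_type] = worker
        if len(self._workers) > self._max_size:
            # Evict the least recently used worker (front of the ordering).
            self._workers.popitem(last=False)
        return worker

    def __contains__(self, agent_type: str) -> bool:
        return agent_type in self._workers
```

The caller asks for `pool.acquire("explore")` and never names a specific worker, matching the address-by-type rule above.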
The bus already pushes: the runtime injects <system-reminder> cues, context-pressure advisories, and sibling findings published since an agent's last cursor position into the agent's stream at each turn start. What remains is the declarative routing layer on top of it, topic subscriptions that say "route findings tagged api_endpoint to code agents" rather than delivering every sibling finding to every agent. The bus, cursors, per-agent read positions, and the turn-start injection point all exist; the topic-routing policy is the open piece.
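Since the topic-routing policy is still open, the following is only one possible shape for it: a table from finding tag to subscribed agent types, consulted at delivery time. The function name, the tag vocabulary, and the id-to-type mapping are all hypothetical.

```python
from typing import Dict, List, Set


def route_finding(
    tag: str,
    subscriptions: Dict[str, Set[str]],
    agents: Dict[str, str],
) -> List[str]:
    """Return the ids of agents whose type subscribes to this tag.

    subscriptions maps a finding tag to the agent types that want it;
    agents maps an agent id to its agent type.
    """
    wanted = subscriptions.get(tag, set())
    return sorted(agent_id for agent_id, agent_type in agents.items() if agent_type in wanted)
```

With a table such as `{"api_endpoint": {"code"}}`, a finding tagged `api_endpoint` reaches only the code agents instead of every sibling.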
The orchestrator can create new agent types at runtime by specifying a tool set and prompt. A request like "I need an agent that only has the database tools and knows our schema conventions" mints a new type for this session, uses it for delegation, and captures the type definition for potential promotion to a permanent entry in the registry. The registry becomes a living, evolving catalog.
Freyja becomes one of Harbor's 22+ agent adapters. Freyja's subagent trajectories flow directly into Harbor's training pipeline without conversion. Harbor's cross-agent insights (patterns learned from other adapters' trajectories) flow back into Freyja's delegation policy. The training loop crosses framework boundaries.
Every phase in the roadmap above keeps one assumption fixed: a profile is a stateless configuration. The registry entry for explore names a model policy, a tool surface, a thinking budget, and a prompt, and the runtime instantiates that configuration fresh for every task and discards it when the task returns. Two explore children spawned a minute apart share their definition and nothing else. Whatever the first one learned about the codebase, the second one re-derives from zero.
That is a strange thing to throw away. A profile is the one structure in the system that recurs: the same explore configuration runs hundreds of times against the same project, the same verify configuration checks the same classes of task over and over. The work is different each time, but the kind of work is stable, and stable kinds of work accumulate reusable structure. The next axis of evolution is not new profiles or new coordination strategies. It is giving each profile a durable, type-level store that every instance of that profile reads from and writes back to, so the configuration stops being the only thing the instances share.
This is a different mechanism from the warm pools in Phase 2. A warm pool keeps one worker's context alive across turns inside a single session, so a long-lived reviewer does not pay re-instantiation cost mid-conversation. A stateful profile is cross-session and type-level: the store outlives any individual worker and any individual session, and every future instance of the profile inherits it. Warm pools amortize startup within a session; stateful profiles amortize learning across the profile's entire history.
A profile becomes a pair: the configuration that defines how an instance behaves, and an accumulating store that defines what every instance already knows. The registry stops being a table of stateless configs and becomes a set of long-lived, specialized systems that happen to be invoked one task at a time.
The substrate for this already exists in the architecture. The artifact and manifest system from §4 is a persistent, manifest-indexed store that outlives the session that wrote it: every subagent already persists its full output to disk and registers it in a manifest the parent can read on demand. A stateful profile is what you get when that store is indexed by profile and queried before a task runs, not only read after one finishes. The two profiles the rest of this section develops, explore and verify, are the two where the payoff is largest and the prior art is clearest.
An explore instance spends most of its budget re-discovering things the profile has already seen. It searches the web, reads files, greps the codebase, and assembles a picture, and almost all of that picture overlaps with what some earlier explore instance assembled for an adjacent question. The wasted motion is not the reasoning, which is genuinely per-task. It is the retrieval: the same documents fetched, the same files read, the same searches issued, because the second instance has no way to ask whether the first one already found the answer.
A stateful explore profile closes that gap with a persistent store of prior retrievals, indexed so a new instance can ask “have I already found this?” before it issues a single search. The organizing question is how to index it. The most directly applicable design is ByteRover's Context Tree (2026), which organizes accumulated findings as a hierarchy of Domain → Topic → Subtopic → Entry, stores each entry as a markdown note with provenance back to its source, and retrieves through a progressive five-tier walk that returns relevant context in well under a second. That maps almost exactly onto the grouping a research profile wants: by topic, by goal, and by session. A new explore task resolves to a path in the tree, pulls the entries already hanging off that path, and only searches outward for what is missing.
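The "have I already found this?" gate over a hierarchy can be sketched with a plain tree keyed by path segments. This is a generic illustration of hierarchical lookup, not ByteRover's implementation or API; the class names, the example paths, and the choice to return everything under a path are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Entry:
    note: str
    source: str  # provenance back to where the finding came from


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)


class FindingsTree:
    """Domain / topic / subtopic path segments leading to stored entries."""

    def __init__(self) -> None:
        self.root = Node()

    def add(self, path: Sequence[str], entry: Entry) -> None:
        node = self.root
        for part in path:
            node = node.children.setdefault(part, Node())
        node.entries.append(entry)

    def lookup(self, path: Sequence[str]) -> List[Entry]:
        """Return every entry at or below the path; empty means nothing known."""
        node = self.root
        for part in path:
            if part not in node.children:
                return []
            node = node.children[part]
        found: List[Entry] = []
        stack = [node]
        while stack:
            current = stack.pop()
            found.extend(current.entries)
            stack.extend(current.children.values())
        return found
```

A new explore task would call `lookup` first and search outward only when the result is empty or does not cover the question.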
The idea that an agent should consult its own history before acting is older than the recent memory systems, and the lineage matters because it tells you which parts are load-bearing. MemGPT (2023) introduced virtual context management, paging information between a small working context and an external store the way an operating system pages memory, which is the mechanism that lets a store outgrow any single context window. Generative Agents (2023) added a memory stream with a reflection step that periodically distills raw observations into higher-level summaries, scored by recency, importance, and relevance at retrieval time. Reflection is the part a research profile needs most: without it the store is a transcript that grows without bound, and with it each session leaves behind a compact digest rather than a log. ExpeL (2023) made the reuse explicit, keeping an experience pool keyed by task and retrieving relevant prior experience before attempting a new one. CoALA (2023) gave the vocabulary that keeps these pieces distinct, separating episodic memory (what happened in past sessions) from semantic memory (facts about the domain) from procedural memory (how to do the task). These are three different stores with three different update rules, even though a naive design collapses them into one.
The more recent work sharpens the retrieval side. A-MEM (2025) links memory notes into a Zettelkasten-style graph so related findings connect across sessions instead of sitting in separate buckets, which is what lets a topic accumulate rather than fragment. Mem0 (2025) runs an extract-and-consolidate loop over conversational history into a combined vector and graph store and reports large token savings at retrieval time, because the agent pulls a few consolidated facts instead of replaying a transcript. General Agentic Memory (2025) is the cleanest statement of the explore use case: it separates a cheap always-on page-store that records everything from a just-in-time researcher that answers, at query time, whether the needed information has already been found, which is precisely the “have I already explored this?” gate. The 2026 memory-OS line (MemoryOS, EverMemOS with its thematic MemScene clusters, and MemForest with a hierarchical index keyed jointly by session and topic) converges on the same hierarchical, theme-clustered organization that the Context Tree describes. That is some evidence that this is the shape the problem wants rather than one team's idiosyncrasy.
The verify profile has the same redundancy in a different shape. Today a verify instance receives a task, works out what correct would mean, and synthesizes the checks from scratch: it writes the assertions, reasons out a rubric, decides what to run. The next verify instance handed a structurally identical task does all of that again. The reasoning about this output is per-task and worth redoing. The machinery of how to check this class of task, the rubric, the executable checker, the replay harness, is not, and rebuilding it every time is both wasteful and a source of inconsistency, because two instances asked to verify the same kind of work can invent two different standards for it.
A stateful verify profile keeps a growing library of verification artifacts indexed by task type, and retrieves a pre-built verifier instead of synthesizing one. The closest prior art is the tool-induction line. LATM (2023) splits the work into a tool maker that writes a reusable tool once and caches it by problem description, and a cheaper tool user that retrieves and applies it, which is the exact division a verify profile wants between building a checker and running it. CRAFT (2023) organizes induced tools into toolsets by domain and retrieves them by embedding similarity, which is the “indexed by task type” requirement directly. TroVE (2024) is the one that takes the lifecycle seriously: it grows a toolbox and then trims it, so the library does not bloat into a pile of near-duplicate verifiers that make retrieval worse, and any honest version of this profile needs that trim step as much as it needs the grow step.
The admission rule is what separates a verifier library from a junk drawer. Voyager (2023) is the template: it builds an ever-growing skill library in which a skill is admitted only after it executes successfully against the environment, so the library is a set of verified capabilities rather than plausible-looking code. A verify profile should gate the same way, admitting a checker only once it has actually run and discriminated a known-good output from a known-bad one. LEGO-Prover (2023) shows the same admission discipline in formal proof, growing a library of lemmas each gated by an Isabelle proof check, and the discipline transfers: a checker earns its place in the library by passing an executable gate, not by looking reasonable.
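A minimal sketch of that admission gate, assuming a checker is just a function from an output string to pass/fail. `VerifierLibrary`, `admit`, and `retrieve` are hypothetical names for illustration, not Freyja's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a checker maps a candidate output to True (passes) or False.
Checker = Callable[[str], bool]


@dataclass
class VerifierLibrary:
    """Hypothetical store of checkers keyed by task type."""
    by_task_type: Dict[str, List[Checker]] = field(default_factory=dict)

    def admit(self, task_type: str, checker: Checker,
              known_good: str, known_bad: str) -> bool:
        """Admit a checker only if it runs and separates good from bad."""
        try:
            discriminates = checker(known_good) and not checker(known_bad)
        except Exception:
            # A checker that crashes on the reference pair never gets in.
            return False
        if discriminates:
            self.by_task_type.setdefault(task_type, []).append(checker)
        return discriminates

    def retrieve(self, task_type: str) -> List[Checker]:
        """Return admitted checkers for a task type (empty list if none)."""
        return list(self.by_task_type.get(task_type, []))


def has_balanced_braces(text: str) -> bool:
    return text.count("{") == text.count("}")


def always_passes(text: str) -> bool:
    return True


library = VerifierLibrary()
admitted = library.admit("json-output", has_balanced_braces, '{"a": 1}', '{"a": 1')
rejected = library.admit("json-output", always_passes, '{"a": 1}', '{"a": 1')
```

The second checker looks reasonable and never fails, which is exactly why the gate rejects it: it cannot tell the known-bad output from the known-good one.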
The 2026 work targets the verification case specifically. RewardHarness (2026) maintains an evolving library of evaluation tools and skills per domain, with an orchestrator that refines them over time, which is almost a direct description of this profile. Prompt-Level Reward Specifications (2026) is the sharpest match to what a verifier should store: per task it produces an offline, task-adaptive rubric paired with an executable hard-constraint checker, caches the pair, and reuses it, which is exactly the rubric-plus-checker artifact a verify profile would retrieve by task type. AutoHarness (2026) synthesizes a code harness per environment type, the replay-harness analogue. DeepVerifier (2026) builds a failure taxonomy and indexes rubrics by failure type, which is a more discriminating key than task type alone: the profile can retrieve checks aimed at the specific ways this kind of task tends to break. VPR (2026) and AgentV-RL (2026) round out the picture from the reward-modeling side, with oracle verifiers organized by reasoning category and an agentic tool-using verifier respectively, both consistent with a library keyed by problem type.
Both explore and verify do work that is heavily redundant across instances and cheap to verify when reused: a retrieved document is either still the right source or it is not, and a retrieved checker either passes its executable gate or it does not. That makes them the two profiles where an accumulating store pays off soonest and fails most visibly, which is exactly where you want to start.
Once two profiles carry their own stores, the registry itself changes character. Every entry becomes a pair of a configuration and an accumulating store, and the stores differ by profile in a way that follows the kind of experience each one generates. A useful frame here is the experience-compression spectrum (2026), which separates episodic experience (raw past sessions, compressible perhaps five to twenty times), procedural experience (distilled reusable skills, fifty to five hundred times), and declarative experience (general rules, a thousand times and up). An explore profile lives mostly in the episodic and semantic registers: what it found, about what. A verify profile lives in the procedural register: how to check a kind of thing. A code profile would accumulate a library of project-specific fixes and patterns; a plan profile would accumulate prior decompositions of similar goals. The store is not one mechanism bolted onto every profile, but a per-profile choice of what kind of experience is worth keeping and at what compression.
The trajectory loop in §6 is what makes this more than a cache. The training exports already capture which profile ran each step, with what tools, to what outcome, tagged by agent_type. That signal is exactly what a profile's store needs to learn from: it tells the verify profile which retrieved checkers actually caught real failures and which waved bad output through, and it tells the explore profile which cached findings were reused versus re-searched anyway. The store and the training loop reinforce each other: the store gives the profile something to learn over, and the loop tells the store what was worth keeping.
This is the most speculative direction in this writeup, and the prior art includes its own strongest counterargument. The most important caution is that library learning is harder than it looks. A 2025 study of LLM library learning found that in a LEGO-Prover-style system there was little evidence of genuine reuse, and the apparent gains largely vanished once compute was held constant, which means a profile store can look like it is helping while really just buying more attempts. The discipline that follows is non-negotiable: measure the reuse rate directly, control for compute when claiming a benefit, and treat “the store grew” as worthless until “the store was reused and the reuse helped” is demonstrated.
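The two measurements named above can be sketched in a few lines. The `Run` record and its fields are hypothetical log columns, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """One task attempt; every field is an assumed log column."""
    store_enabled: bool  # was the profile store available to this run?
    retrieved: bool      # did the run actually pull something from it?
    tokens: int          # compute spent on the attempt
    solved: bool


def reuse_rate(runs: List[Run]) -> float:
    """Fraction of store-enabled runs that actually retrieved something."""
    enabled = [r for r in runs if r.store_enabled]
    return sum(r.retrieved for r in enabled) / len(enabled) if enabled else 0.0


def solved_per_million_tokens(runs: List[Run]) -> float:
    """Success normalized by compute, so extra attempts do not count as gains."""
    total = sum(r.tokens for r in runs)
    return 1_000_000 * sum(r.solved for r in runs) / total if total else 0.0


with_store = [Run(True, True, 200_000, True), Run(True, False, 300_000, True)]
baseline = [Run(False, False, 250_000, True), Run(False, False, 250_000, False)]

rate = reuse_rate(with_store)
with_store_yield = solved_per_million_tokens(with_store)
baseline_yield = solved_per_million_tokens(baseline)
```

Both arms here spend the same 500,000 tokens, so the comparison of yields is compute-matched; a benefit claimed from arms with unequal budgets would not be.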
Three more failure modes are specific to this design. Capability erosion: work on self-evolving agents (2026) documents a phenomenon where specializing on a narrow distribution degrades general capability, so a profile that over-fits its store to the tasks it has seen can get worse at the tasks it has not. Staleness: an explore store accumulates findings that go out of date as URLs rot and code changes underneath them, so cached retrievals need provenance and an invalidation policy, which is exactly why the Context Tree carries provenance and a maturity decay on its entries. Cold start: a store beats re-derivation only after enough history has accumulated, so the early sessions pay the full cost of building the store and get none of the benefit, and a profile that is rarely invoked may never reach the crossover. None of these is fatal, but each one is a place where a careless version of this idea quietly makes the system slower and worse while appearing to learn, which is the failure the compute-controlled study warns about, in three new disguises.
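For the staleness case, one possible shape for provenance plus decay is sketched below. The exponential half-life, the threshold, and all field names are assumptions; the source does not specify how the Context Tree computes maturity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFinding:
    text: str
    source: str            # provenance: where the finding came from
    fetched_at_day: float  # when it was recorded, in days


def maturity(finding: CachedFinding, now_day: float,
             half_life_days: float = 30.0) -> float:
    """Assumed decay: the score halves every half_life_days."""
    age = max(0.0, now_day - finding.fetched_at_day)
    return 0.5 ** (age / half_life_days)


def is_fresh(finding: CachedFinding, now_day: float,
             threshold: float = 0.25) -> bool:
    """Below the threshold the finding should be re-fetched, not reused."""
    return maturity(finding, now_day) >= threshold


finding = CachedFinding("auth uses JWT", "src/auth/middleware", fetched_at_day=0.0)
score_after_month = maturity(finding, now_day=30.0)
fresh_after_quarter = is_fresh(finding, now_day=90.0)
```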
Knowing what to delegate is harder than knowing how to delegate. The following decision framework captures the patterns we've found effective and the anti-patterns we've learned to avoid.
"What does the auth module look like?" "Find all REST endpoints in the codebase." "Research how the payment flow works." Use when the task is about understanding, not changing. The agent reads broadly but writes nothing.
"Is there a Dockerfile in this repo?" "What test framework does this project use?" Narrow, factual questions where speed matters more than depth. Runs on a rotation of fast models (Kimi / MiniMax / GLM, falling back to Haiku) with thinking off. Spawn three to five in background for breadth. The other twelve types (plan, review, test, browser-qa, performance, docs, memory-curator, specifier, and the two judges) follow the same shape: a narrow tool surface and a model policy matched to the job.
"Implement the JWT validation middleware." "Add unit tests for the user service." The agent has write access and test-running capability. High thinking effort because implementation requires careful reasoning. The most expensive type per invocation.
"Run the test suite and report failures." "Check the auth flow for security vulnerabilities." The agent reads and runs tests but does not fix what it finds. The separation is deliberate: the generator-verifier pattern requires the verifier to be independent of the generator.
| Dimension | Foreground (sync) | Background (async) |
|---|---|---|
| Blocking | Parent waits for result before continuing | Parent continues working; checks results later |
| Use when | Result is needed for the next decision. Exploration that determines the plan. | Result is supplementary. Parallel work that improves quality but isn't blocking. |
| Context cost | Result enters parent context immediately | Result stays in artifact file until explicitly read |
| Example | explore("What auth mechanism does this repo use?") → determines which code type to spawn | verify("Run full test suite") → parent continues coding; reads results before PR |
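The foreground and background columns map onto awaiting a result immediately versus starting a task and reading it later. The sketch below uses `asyncio` with a stand-in `run_subagent`; it is not Freyja's spawn API, only an illustration of the two control flows.

```python
import asyncio
from typing import List


async def run_subagent(agent_type: str, task: str) -> str:
    """Stand-in for a real subagent: waits briefly, returns a result string."""
    await asyncio.sleep(0.01)
    return f"[{agent_type}] done: {task}"


async def orchestrate() -> List[str]:
    log: List[str] = []

    # Foreground: the next decision depends on the answer, so the parent waits.
    auth = await run_subagent("explore", "What auth mechanism does this repo use?")
    log.append(auth)

    # Background: start the test run, keep working, read the result later.
    tests = asyncio.create_task(run_subagent("verify", "Run full test suite"))
    log.append("parent keeps coding")
    log.append(await tests)  # read only when needed, e.g. before the PR
    return log


events = asyncio.run(orchestrate())
```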
| Dimension | Message bus | Parent mediation |
|---|---|---|
| Latency | Immediate. Sibling reads finding on next tool call. | 2+ turns. Up to parent, processed, down to sibling. |
| Use when | Siblings working in parallel on related tasks. Findings from one improve another. | Parent needs to make a judgment call. "Sub_2 found an issue — should I redirect sub_3?" |
| Context cost | Zero parent context. Findings stay in bus. | Full parent context. Findings transit through parent window. |
| Control | Decentralized. Siblings self-coordinate. | Centralized. Parent decides what to relay. |
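A minimal in-memory sketch of the message-bus column. The method names follow the publish_finding / read_findings pair mentioned in this writeup, but the signatures, the per-reader cursor, and the example finding are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    author: str
    text: str


class MessageBus:
    """Stand-in bus: siblings exchange findings without the parent in the path."""

    def __init__(self) -> None:
        self._findings: List[Finding] = []
        self._cursor: Dict[str, int] = defaultdict(int)

    def publish_finding(self, author: str, text: str) -> None:
        self._findings.append(Finding(author, text))

    def read_findings(self, reader: str) -> List[Finding]:
        """Return findings the reader has not seen yet, excluding its own."""
        start = self._cursor[reader]
        self._cursor[reader] = len(self._findings)
        return [f for f in self._findings[start:] if f.author != reader]


bus = MessageBus()
bus.publish_finding("sub_2", "example finding: endpoint located")
first_read = bus.read_findings("sub_5")
second_read = bus.read_findings("sub_5")
```

Nothing in this exchange touches the parent's context, which is the zero-cost entry in the table; the trade is that the parent also has no say in what gets relayed.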
Delegating a task that takes the parent 3 tool calls to a subagent that takes 10 tool calls (cold start + re-read + actual work). The overhead of spawning exceeds the benefit. Rule of thumb: only delegate if the task would take the parent 5+ turns or requires a different tool set.
Giving a specialized agent too many tools defeats the purpose of specialization. An explore agent with write_file access will occasionally write files, polluting the read-only invariant. Rule: start with the minimum tool set and add tools only when you have evidence the agent needs them.
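One way to make the minimum-tool-set rule enforceable is a deny-by-default whitelist per type. The tool names other than write_file, and the thinking levels for explore and verify, are placeholders rather than Freyja's actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentType:
    tools: FrozenSet[str]  # whitelist; anything absent is denied
    thinking: str          # effort level; explore/verify values are assumed


AGENT_TYPES: Dict[str, AgentType] = {
    "explore": AgentType(frozenset({"read_file", "search"}), "medium"),
    "code": AgentType(
        frozenset({"read_file", "search", "write_file", "run_tests"}), "high"
    ),
    "verify": AgentType(frozenset({"read_file", "search", "run_tests"}), "medium"),
}


def is_allowed(agent_type: str, tool: str) -> bool:
    """Deny by default: unknown types and unlisted tools are both rejected."""
    spec = AGENT_TYPES.get(agent_type)
    return spec is not None and tool in spec.tools


explore_can_write = is_allowed("explore", "write_file")
code_can_write = is_allowed("code", "write_file")
```

With the whitelist checked at dispatch time, the read-only invariant for explore holds by construction instead of depending on the model's restraint.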
Three subagents communicating via message bus to solve a task that one agent could handle alone. The bus adds complexity without parallelism benefit. Rule: message bus is for independent parallel work with shared context needs. If agents are sequential (each waiting on the previous), use foreground delegation instead.
Using explore for a task that requires writing files (explore has no write tools), or using code for pure research (wasting the high thinking budget). The type table above is the decision matrix: match the task to the type, don't force a type onto a task.
The parent gets a manifest from wait_all but then asks for a full summary of all subagent work, effectively re-creating the blob problem. Rule: read specific artifacts on demand through the artifacts tool, not all artifacts in sequence. The manifest tells you which artifact is relevant: use that signal.
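The manifest-first rule can be sketched as follows, assuming each manifest entry carries a one-line summary and a rough size. `ManifestEntry` and `select_artifacts` are hypothetical; the real manifest returned by wait_all may carry different fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ManifestEntry:
    artifact_id: str
    agent: str
    summary: str      # one-line description written by the child
    size_tokens: int  # rough size of the full artifact


def select_artifacts(manifest: List[ManifestEntry], keyword: str) -> List[str]:
    """Pick artifact ids whose summary mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [e.artifact_id for e in manifest if needle in e.summary.lower()]


manifest = [
    ManifestEntry("a1", "sub_1", "Auth module layout and JWT handling", 12_000),
    ManifestEntry("a2", "sub_2", "REST endpoint inventory", 30_000),
    ManifestEntry("a3", "sub_3", "Payment flow walkthrough", 25_000),
]

wanted = select_artifacts(manifest, "auth")
loaded_tokens = sum(e.size_tokens for e in manifest if e.artifact_id in wanted)
total_tokens = sum(e.size_tokens for e in manifest)
```

In this made-up example the parent loads 12,000 tokens instead of 67,000; reading every artifact in sequence would bring back the blob the manifest was meant to avoid.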
Delegation is not parallelism. Delegation is context isolation. The primary benefit of subagents is not running things in parallel, though that helps. It is preventing one task's context from polluting another's. A subagent working in a focused context reasons better than one agent working in a context cluttered with every prior task's tool output, because attention is finite and every irrelevant token competes for it. Specialization reduces noise; isolation prevents pollution; persistence prevents loss.
Other production systems have solved the same coordination and state problems, and reading how they did it surfaces both patterns Freyja is missing and choices it got right. Two engineering writeups (Claude Code, OpenCode) and the spring-2026 research frontier are the comparison set. The training signal (§6) is not in this list, because it isn't shipped: the lessons here are about coordination and state, not training.
Our SubAgentState is a flat 4-value enum: RUNNING → DONE | FAILED | CANCELLED. Both Claude Code and OpenCode use two independent state machines per agent:
Member status: ready → busy → ready / error / shutdown_requested → shutdown. Governs whether the agent can accept work, needs recovery, or should be cleaned up.
Execution status: idle → starting → running → completing → completed → idle. The UI uses execution status for spinners; crash recovery uses member status for decisions.
Why it matters: we can't distinguish "agent initializing" from "agent actively running" from "agent wrapping up." The UI shows a spinner for all three. More critically, crash recovery needs to know whether an agent was genuinely busy or in a stale state. Splitting into two levels is on our immediate roadmap.
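A sketch of what the two-level split could look like as enums with explicit transition tables. The transition sets are one reading of the arrows above (for example, no recovery edge out of error is shown, so none is modeled), and the spinner labels are assumptions.

```python
from enum import Enum
from typing import Dict, Optional, Set


class MemberStatus(Enum):
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTDOWN_REQUESTED = "shutdown_requested"
    SHUTDOWN = "shutdown"


class ExecutionStatus(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    COMPLETING = "completing"
    COMPLETED = "completed"


MEMBER_TRANSITIONS: Dict[MemberStatus, Set[MemberStatus]] = {
    MemberStatus.READY: {MemberStatus.BUSY},
    MemberStatus.BUSY: {
        MemberStatus.READY, MemberStatus.ERROR, MemberStatus.SHUTDOWN_REQUESTED
    },
    MemberStatus.SHUTDOWN_REQUESTED: {MemberStatus.SHUTDOWN},
}

EXECUTION_TRANSITIONS: Dict[ExecutionStatus, Set[ExecutionStatus]] = {
    ExecutionStatus.IDLE: {ExecutionStatus.STARTING},
    ExecutionStatus.STARTING: {ExecutionStatus.RUNNING},
    ExecutionStatus.RUNNING: {ExecutionStatus.COMPLETING},
    ExecutionStatus.COMPLETING: {ExecutionStatus.COMPLETED},
    ExecutionStatus.COMPLETED: {ExecutionStatus.IDLE},
}


def can_transition(table: dict, current: Enum, target: Enum) -> bool:
    """A move is legal only if the table lists it; missing keys allow nothing."""
    return target in table.get(current, set())


def spinner_label(status: ExecutionStatus) -> Optional[str]:
    """UI reads execution status only; idle and completed show no spinner."""
    return {
        ExecutionStatus.STARTING: "initializing",
        ExecutionStatus.RUNNING: "running",
        ExecutionStatus.COMPLETING: "wrapping up",
    }.get(status)
```

The three spinner labels are the three cases a flat RUNNING value cannot tell apart.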
Freyja has no recovery for subagent state. If the Electron app crashes while subagents are running, those agents are lost: their state is in-memory, and their results are gone unless they had already written artifacts. The mitigation that exists is the artifact manifest from §4. Anything a child persisted before the crash survives, because it is on disk and recorded in the manifest. What does not survive is in-flight work. OpenCode's recovery sequence is the reference design for closing this gap:
Step 1. Permission restoration handler registered before recovery begins. Recovery may trigger cleanup, which may need to restore delegate-mode permissions.
Step 2. All agents marked "busy" are force-transitioned to "ready." A system message is injected: [System]: Server restarted. Teammates interrupted.
Step 3. Event subscriptions registered after recovery completes. Avoids spurious cleanup from the force-transitions in step 2.
Interrupted agents come back idle, not running. After a restart they are marked ready but do not resume on their own; a human re-engages them. The reason is cost containment: an auto-restart on crash can put several agents back into long-running tool loops unattended, and that failure mode (spawned agents consuming API budget with no one watching) is worse than the inconvenience of restarting them by hand. Freyja's re-wake mechanism (§5) is the right primitive to build a supervised resume on later: a parent could talk() each interrupted child back to life deliberately, rather than the runtime reviving all of them automatically.
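The ordering constraint in that sequence is easier to see in code. This is a sketch of the three steps as described here, not OpenCode's implementation: `AgentRecord`, the per-agent inbox, and the step names in the log are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

RESTART_NOTICE = "[System]: Server restarted. Teammates interrupted."


@dataclass
class AgentRecord:
    name: str
    status: str  # "ready", "busy", ...
    inbox: List[str] = field(default_factory=list)


def recover(agents: List[AgentRecord], step_log: List[str]) -> List[str]:
    """Run the three recovery steps in order; return the interrupted agents."""
    # Step 1: permission handler first, since cleanup may need it.
    step_log.append("register_permission_handler")

    # Step 2: force busy -> ready and notify. Nothing is restarted here.
    interrupted = []
    for agent in agents:
        if agent.status == "busy":
            agent.status = "ready"
            agent.inbox.append(RESTART_NOTICE)
            interrupted.append(agent.name)
    step_log.append("force_transition_stale_agents")

    # Step 3: subscribe to events last, so step 2 triggers no cleanup.
    step_log.append("register_event_subscriptions")
    return interrupted


agents = [
    AgentRecord("sub_1", "busy"),
    AgentRecord("sub_2", "ready"),
    AgentRecord("sub_3", "busy"),
]
steps: List[str] = []
interrupted = recover(agents, steps)
```

The function returns the interrupted names rather than resuming anything, which leaves the decision to re-engage each agent with whoever is supervising.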
Multi-agent systems pay a tax that single-agent systems do not: the same context gets re-sent to multiple agents. Published measurements across multi-agent frameworks put the redundancy high, on the order of half to most of the tokens consumed, depending on how aggressively context is shared.[R7]
| Framework | Reported token duplication | Cost multiplier |
|---|---|---|
| CAMEL | 86% | 7.1× |
| MetaGPT | 72% | 3.6× |
| AgentVerse | 53% | 2.1× |
External figures, cited as reported, not measured by this project.
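The two columns are consistent with a simple relationship, offered here as a reading of the numbers rather than as the cited source's method: if a fraction $d$ of all tokens is duplicated, only $1 - d$ of the total is unique work, so the total is $1/(1-d)$ times the unique tokens.

```latex
\text{cost multiplier} \approx \frac{1}{1 - d}
\qquad
\frac{1}{1 - 0.86} \approx 7.1, \quad
\frac{1}{1 - 0.72} \approx 3.6, \quad
\frac{1}{1 - 0.53} \approx 2.1
```

The practical consequence is that the multiplier is steep at the high end: moving from 53% to 86% duplication more than triples the cost.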
Freyja's design keeps this low for structural reasons: the message bus shares concise findings rather than full context, artifact persistence avoids re-fetching, and per-type tool whitelists keep tool definitions lean. The duplication rate the runtime tracks sits well under the debate-style numbers above, because the bus carries summaries rather than replaying transcripts. This post puts no single figure on it because it moves with workload shape (a four-way parallel research fan-out duplicates more than a linear refactor), and a single headline number would imply a precision the measurement does not have. The structural argument is what generalizes; the instrumented rate confirms the direction rather than pinning a constant.
Claude Code uses leader-centric routing (teammate to leader to teammate). OpenCode moved to a full peer-to-peer mesh and reported that the lead could then focus on orchestration instead of relaying messages. Freyja's publish_finding / read_findings bus is already peer-to-peer, so any sibling can read any other's findings directly without the parent in the path. The honest comparison: this converges on the same answer OpenCode reached, which is reassuring but not novel. The bus solves the routing-bottleneck problem and nothing more. It does not address token duplication or crash recovery, which are the two places (below) where Freyja is still behind.
Split SubAgentState into MemberStatus + ExecutionStatus. UI uses execution status for spinners; recovery uses member status for decisions.
Persist subagent records to disk alongside artifacts. On bridge restart: force-transition stale agents, inject system notification, no auto-restart.
Instrument total_tokens_across_all_agents / hypothetical_single_agent_tokens. Surface in session export and in the trajectory records that §6's training loop would consume.
Event-driven restart of idle agents on message bus delivery. The missing piece for persistent agent teams.
Auto-expire old messages for long-running sessions. Namespace isolation per agent role.
Orchestrator voluntarily restricts own tools during coordination-heavy phases, forcing focus on delegation.
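The token-instrumentation item above reduces to a single ratio. A minimal sketch, assuming per-agent token counts are already collected and that an estimate of the single-agent cost exists; how that estimate is produced is left open, and the numbers are made up.

```python
from typing import Dict


def duplication_ratio(tokens_by_agent: Dict[str, int],
                      hypothetical_single_agent_tokens: int) -> float:
    """total_tokens_across_all_agents / hypothetical_single_agent_tokens."""
    if hypothetical_single_agent_tokens <= 0:
        raise ValueError("single-agent estimate must be positive")
    return sum(tokens_by_agent.values()) / hypothetical_single_agent_tokens


session_tokens = {
    "parent": 40_000,
    "explore_1": 30_000,
    "explore_2": 30_000,
    "code_1": 50_000,
}
ratio = duplication_ratio(session_tokens, hypothetical_single_agent_tokens=100_000)
```

A ratio of 1.0 would mean delegation cost nothing extra; the value is only as trustworthy as the single-agent estimate in the denominator, which is the hard part to measure.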
The field research lines up with Freyja's design: declarative agent types with tool filtering rather than undifferentiated workers, a peer-to-peer bus rather than a leader-centric relay, artifact persistence to files rather than in-memory only, and context-centric decomposition over role-based decomposition. Two of these (peer-to-peer routing, file persistence) are places other systems arrived at the same answer independently, so the agreement is convergence, not a lead. The open gaps are dual state machines and crash recovery, which are maturity work rather than architectural pivots. The training loop in §6 is the design choice the comparison set says least about, because none of these systems trains on its own delegation traces; Freyja's routing policy learns from a signal the rest of the field is still throwing away.