2026-06-05 · Blas Labs · research · rev 3

Specialized Subagent Architecture for Freyja

How Freyja decomposes work, delegates it to context-scoped subagents, persists results as artifacts, coordinates the agents that run in parallel, and trains its own delegation policy on the trajectories they produce. This revision follows what shipped since 05.31: a skill-learning loop, an always-on scheduler, and a grounded-memory spine.

architecture multi-agent coordination training self-improvement

live · orchestrator → subagent coordination

1 the problem

Why generic subagents fail

The default multi-agent pattern is the smallest thing that runs: spawn N children, return their results to the parent, continue. The trouble is the return step. Whatever a child produces has to come back through the parent's context window, and that window is a fixed, shared budget. As you add children or ask for longer outputs, their combined results grow without bound while the channel they return through does not. That mismatch is not a bug in any one implementation. It is a property of returning work as text through a finite shared channel, and it forces a set of choices that the simple pattern has no good answer for.

Context pollution

Parent window fills with noise

The parent's 1M-token context window saturates with raw research output from children. By the time the orchestrator needs to act, its own reasoning is buried under megabytes of subagent output.

Lossy return

The return channel is too small

When several children finish, their combined output can exceed what the parent can hold. Now you have to choose: spend the budget on raw results and starve the orchestrator's own reasoning, compress each result and lose detail the parent might have needed, or cut and lose whole results outright. Every option at the text-return layer trades away something real, because the work was always larger than the channel.

No specialization

Same model, same tools, same prompt

All children use the same frontier model, the same tool set, the same system prompt. An exploration task gets the same 40-tool context as a code-writing task. No tool-set specialization, no thinking-effort tuning.

No communication

Siblings are isolated

Subagents cannot share intermediate findings. If sub_2 discovers an API endpoint that sub_5 needs, the information must route through the parent, creating an orchestrator bottleneck and adding turns of latency.

No learning

Delegation decisions evaporate

The orchestrator's choice to delegate a task as "explore" rather than "code" is a structured signal about task decomposition. But it is never captured in training data. Delegation strategy cannot improve over time.

Coordination overhead

O(N) wait, O(N) context

The parent blocks on wait_all, accumulating results linearly. Parallel subagents run concurrently, but the parent still has to read and integrate every result in sequence, so the wall-clock saving from parallel execution erodes against a serial integration step that grows with the number of children.

Work grows; the return channel does not

the parent's context is a fixed budget. as parallelism grows, the combined output of the children exceeds it, and returning that work as text forces a lossy choice no matter how it is made.

The fundamental insight

Generic delegation is a lossy compression scheme disguised as parallelism. You pay for every subagent's compute but retain only the first one's output. The fix is not better truncation. It is structural: persist each subagent's artifacts independently, return structured indices instead of raw text, and specialize agents so each one does less but does it better.

3 the agent type registry

Declarative specialization

Each agent type is a dataclass that fully describes the model policy, tool constraints, and behavioral prompt for a specialization. The registry is a dictionary keyed by type name. Adding a new type is a single dict entry, and the system prompt, tool injection, and (eventually) the training pipeline all read from the same record.

The original version of this post described a six-field dataclass and a five-entry registry. Both grew. The dataclass now carries twelve fields, and the registry now holds sixteen agent types. The growth from five to fifteen concentrated in two places: model selection moved from a single hard-coded string to a policy with fallbacks, and a family of read-only review/judge types appeared to support the coordination work in §5. The sixteenth type, skill-drafter, landed in this revision and is different in kind from the rest: it is the first agent whose job is to write back into the system that spawned it, turning a finished session into a reusable skill. §9 follows it end to end.

The shape of an agent type

Twelve fields define a complete agent specialization. The design stays minimal in spirit, every field earning its place by affecting runtime behavior or the training signal, but model selection alone now needs three of them: a policy, a fallback chain, and an inheritance flag.

MODEL

model + thinking_effort

No longer a single string. model_policy is a resolution strategy (inherit:parent, first_available:<chain>, or random_available:<chain>) paired with model_fallbacks. The resolver walks the chain at spawn time, so a type degrades gracefully when its preferred model is unavailable instead of hard-failing.

TOOL SET

tool_include / tool_exclude

Whitelist or blacklist of tools. explore gets read_file, search, web_search. code gets write_file, bash, run_tests. The strongest specialization lever we have.

BEHAVIOR

system_prompt + max_iterations

A focused prompt that defines the agent's role and constraints. verify: "Find bugs, do not fix them." explore: "Survey broadly, summarize findings." Max iterations prevent runaway agents.
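To make the three levers above concrete, here is a minimal sketch of an agent type record and of how tool_include / tool_exclude might narrow a tool surface. It shows only the field names this post mentions and omits the rest of the twelve. The defaults, the precedence rule (whitelist first, then blacklist), and the allowed_tools helper are assumptions for illustration, not Freyja's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentType:
    """Illustrative subset of a registry record (not the real twelve fields)."""
    name: str
    model_policy: str = "inherit:parent"
    model_fallbacks: tuple = ()
    thinking_effort: str = "auto"
    tool_include: Optional[frozenset] = None   # None means "no whitelist"
    tool_exclude: frozenset = frozenset()
    system_prompt: str = ""
    max_iterations: int = 100


def allowed_tools(agent_type: AgentType, available: set) -> set:
    """Hypothetical helper: apply the whitelist (if any), then the blacklist."""
    tools = set(available)
    if agent_type.tool_include is not None:
        tools &= agent_type.tool_include
    return tools - agent_type.tool_exclude


ALL_TOOLS = {"read_file", "search", "web_search", "write_file", "bash", "run_tests"}

explore = AgentType(
    name="explore",
    thinking_effort="medium",
    tool_include=frozenset({"read_file", "search", "web_search"}),
    system_prompt="Survey broadly, summarize findings.",
    max_iterations=160,
)
verify = AgentType(
    name="verify",
    thinking_effort="high",
    tool_exclude=frozenset({"write_file"}),
    system_prompt="Find bugs, do not fix them.",
)

explore_tools = allowed_tools(explore, ALL_TOOLS)
verify_tools = allowed_tools(verify, ALL_TOOLS)
```

The record is frozen so that a type cannot be mutated after the prompt generator and tool injector have read it, which is one way to keep "one record, many readers" safe.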

Type	Model policy	Thinking	Tool surface	Role	Max iter
general	inherit:parent	auto	inherited safe tools	Fallback when no specialization fits.	100
explore	first_available (Sonnet→GPT→Kimi→DeepSeek)	medium	read/grep/glob/web + publish	Deep research. Never writes source.	160
explore-fast	random_available (Kimi/MiniMax/GLM)→Haiku	off	read/list/web	Cheap parallel reconnaissance.	60
code	inherit:parent	high	read/write/edit/bash/grep	Isolated code changes.	120
verify	first_available (GPT→Sonnet→DeepSeek→GLM)	high	read/grep/bash (read-only)	QA from a different model. Finds, doesn't fix.	100
plan	inherit:parent	medium	read/grep/skills	Read-only implementation planning.	80
review	first_available (GPT→Sonnet→DeepSeek→GLM)	high	read/grep (read-only)	Pre-merge bug and regression hunt.	100
test	inherit:parent	medium	read/grep/bash	Run and diagnose build/test validation.	100
browser-qa	inherit:parent	medium	browser_execute_js / screenshot	Exercises a running UI for real behavior.	100
performance	inherit:parent	high	browser + read/grep	Profiles hot paths, proposes low-risk wins.	140
docs	inherit:parent	medium	read/edit/grep	Writes design docs and guides.	100
memory-curator	inherit:parent	medium	read/grep/skills/memory	Prunes stale memory, flags gaps.	80
specifier	inherit:parent	low	read/grep/kanban	Expands triage cards into ready specs.	30
judge-calibrator	prefer_parent→Opus/GPT/Sonnet	high	none (pure reasoning)	One-shot judge config for a goal loop.	1
judge-deep	prefer_parent→Sonnet/GPT/DeepSeek	high	read-only verification	Skeptical goal adjudication, JSON verdict.	3
skill-drafter	first_available (Opus 4.8→4.7→Sonnet)	high	read/grep/skills + publish	Drafts a reusable skill from the session. Proposes, never writes.	15

Values current as of 2026-06-05. The newest row, skill-drafter, is the subject of §9; it is pinned to a quality tier rather than inheriting the parent's model, because a drafter triggered from a Slack DM would otherwise run on whatever cheap model answered the DM. auto thinking defers to the resolved model's default; inherit:parent reuses the parent's model rather than naming one, so these rows track whatever the parent is running.
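The resolution step described above can be sketched as a small function. This is a hedged illustration: the policy strings follow the post, but the resolve_model name, the is_available callback, the placeholder model names, and the choice to degrade to the parent's model when the whole chain is unavailable are all assumptions. The prefer_parent policy that appears in the table is not modeled here.

```python
import random
from typing import Callable, Optional, Sequence


def resolve_model(
    policy: str,
    chain: Sequence[str],
    parent_model: str,
    is_available: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a model at spawn time instead of hard-failing on an outage."""
    kind = policy.split(":", 1)[0]
    if kind == "inherit":
        return parent_model
    if kind not in ("first_available", "random_available"):
        raise ValueError(f"unknown model policy: {policy!r}")
    candidates = [m for m in chain if is_available(m)]
    if not candidates:
        return parent_model  # assumption: last-resort fallback
    if kind == "first_available":
        return candidates[0]  # walk the chain in preference order
    return (rng or random).choice(candidates)


# Placeholder model names; "model-a" is treated as unavailable.
up = {"model-b", "model-c"}
picked = resolve_model(
    "first_available", ["model-a", "model-b", "model-c"], "parent-model", up.__contains__
)
```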

Why one tool with a type parameter, not five separate tools

We considered having sub_agent_explore, sub_agent_code, etc. as separate tools. But this violates progressive disclosure: five delegation tools in the system prompt create choice paralysis. A single sub_agent(type="explore") tool with a type parameter is cleaner, because the model makes one delegation decision, not two (whether to delegate, then how to delegate).

Zero-touch extensibility

Adding a new agent type requires one change: a new entry in the AGENT_TYPES dictionary. The system-prompt generator discovers available types and renders them into the parent's prompt, and the tool's agent_type enum reads its values from the same registry, so a new type is selectable the moment it is added. The growth from five to sixteen types is the evidence this holds: every type that appears in §5's coordination work was added as a dict entry, not a code path, and skill-drafter (§9) was the latest to land the same way. The trajectory exporter in §6 reads the same registry, so a new type is captured in exports for free the moment it lands.
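A rough sketch of why one dict entry is enough: both consumers derive their view from the registry at call time. The registry is simplified to plain dicts here, and render_types_for_prompt and agent_type_enum are hypothetical names standing in for the system-prompt generator and the tool schema builder.

```python
# Simplified registry: the real entries are dataclasses with more fields.
AGENT_TYPES = {
    "explore": {"role": "Deep research. Never writes source.", "max_iterations": 160},
    "code": {"role": "Isolated code changes.", "max_iterations": 120},
}


def render_types_for_prompt(registry: dict) -> str:
    """Render the available types into a block for the parent's system prompt."""
    lines = [f"- {name}: {spec['role']}" for name, spec in sorted(registry.items())]
    return "Available agent types:\n" + "\n".join(lines)


def agent_type_enum(registry: dict) -> list:
    """Values for the sub_agent tool's type parameter, read from the same record."""
    return sorted(registry)


# Adding a type is one dict entry; both consumers pick it up with no other change.
AGENT_TYPES["verify"] = {
    "role": "QA from a different model. Finds, doesn't fix.",
    "max_iterations": 100,
}

prompt_block = render_types_for_prompt(AGENT_TYPES)
enum_values = agent_type_enum(AGENT_TYPES)
```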

4 artifact persistence

Solving the truncation crisis

The root cause of data loss is structural, not algorithmic: wait_all concatenates results into one blob, and the truncator sees one giant string with no internal boundaries. The fix is equally structural: each subagent writes its own artifact to disk, and the parent reads back a manifest of paths instead of a wall of concatenated text.

Old flow vs. new flow

proactive persistence inverts the truncation problem: instead of losing most of the output at the cut, the parent holds a lightweight manifest and reads individual artifacts on demand.

Each subagent writes its complete output under the session's project directory, ~/.freyja/projects/{session_id}/, before returning. A manifest records the resolved absolute path, the creating session, the operation, size, and timestamp for every file produced; the parent reads it back through the artifacts tool rather than guessing at filesystem paths. The wait_all collator returns the full final output of every completed child, so the parent gets both the summary and the durable artifact.

The parent's context now holds a compact structured index rather than the full concatenated text of every subagent. When the parent needs details from a specific subagent, it calls read_file on the artifact path. This is compaction-resilient: even after aggressive context compaction, the SUMMARY_PROMPT preserves artifact paths in its mandatory "files referenced" section, and the footer extraction logic specifically looks for path patterns. Compaction can drop a reference; it can never drop the work, because the work is on disk.
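The write-then-index flow can be sketched in a few lines. The manifest fields mirror the ones listed above (path, creating session, operation, size, timestamp), but the manifest.jsonl file name, the JSON-lines layout, and both helper names are assumptions. The example writes to a temporary directory rather than ~/.freyja.

```python
import json
import tempfile
import time
from pathlib import Path


def write_artifact(project_dir: Path, session_id: str, name: str, text: str) -> dict:
    """Write a child's full output to disk and append one manifest record."""
    project_dir.mkdir(parents=True, exist_ok=True)
    path = (project_dir / name).resolve()
    path.write_text(text, encoding="utf-8")
    entry = {
        "path": str(path),            # resolved absolute path
        "session": session_id,        # creating session
        "operation": "write",
        "size": path.stat().st_size,  # bytes on disk
        "timestamp": time.time(),
    }
    with (project_dir / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_manifest(project_dir: Path) -> list:
    """Return the compact index the parent holds instead of the raw text."""
    manifest = project_dir / "manifest.jsonl"
    if not manifest.exists():
        return []
    lines = manifest.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line]


root = Path(tempfile.mkdtemp()) / "projects" / "abc123"
write_artifact(root, "sub_3", "sub_3.md", "# Findings\n" + "detail " * 1000)
index = read_manifest(root)
```

The parent's context cost is one small record per artifact, and the several-kilobyte body stays on disk until something reads it by path.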

Compaction resilience

The manifest is what makes artifacts survive context compaction. Paths are absolute (~/.freyja/projects/abc123/sub_3.md) and live outside the transcript, so a summarization pass cannot lose them: the file is still on disk and the manifest still points at it. When the compactor runs, the work itself is unaffected; at most the parent re-reads the manifest through the artifacts tool to recover a path it had paged out. This is the difference between compaction losing a reference and compaction losing work: only the reference is ever at risk.

5 communication

From isolation to collaboration

The pure orchestrator-subagent pattern has a fundamental bottleneck: all information flows through the parent. If sub_2 discovers a critical API endpoint that sub_5 needs, the finding must: (1) travel up from sub_2 to parent, (2) be processed by parent, (3) travel down from parent to sub_5 in a new delegation. This adds turns of latency and consumes parent context window capacity.

The SessionMessageBus solves this with an append-only shared log. Any subagent can publish a finding; any sibling can read all published findings. The parent is no longer the sole information relay.

live · SessionMessageBus architecture

Two tools are injected into every child subagent: publish_finding and read_findings. Publishing appends to the log; reading accepts an optional since_index parameter for cursor-based pagination and a topic filter for targeted reads.

# Both tools close over `bus` (the session's SessionMessageBus) and the
# `Message` record, which are assumed to be defined elsewhere in the runtime.
import time
from typing import Literal

# publish_finding tool
def publish_finding(
    topic: Literal["findings", "errors", "progress"],
    content: str,
    agent_id: str,    # auto-injected, not user-provided
) -> dict:
    """Publish a finding to the shared message bus."""
    return bus.append(Message(
        topic=topic,
        content=content,
        agent_id=agent_id,
        timestamp=time.time(),
        index=bus.next_index(),
    ))

# read_findings tool
def read_findings(
    topic: str | None = None,    # filter by topic, or None for all
    since_index: int = 0,        # cursor-based pagination
) -> list[Message]:
    """Read findings from siblings. Use since_index to avoid re-reading."""
    return bus.read(topic=topic, since_index=since_index)

Architecture decision: append-only log vs. per-subscriber queues

We considered two designs. asyncio.Queue per subscriber: each subagent gets its own queue, messages are fanned out at publish time. Clean async semantics, but O(N) memory per message and complex lifecycle management when agents die mid-task. Append-only log with cursors: one shared list, each reader tracks its own position. Simpler, O(1) memory per message, naturally handles agent death (no orphaned queues).

We chose the log. The 500-message cap prevents unbounded growth. Cursor-based reads mean agents only process new messages, not the full history. Topic filtering reduces noise without separate channels. The design is essentially a minimal Kafka (append-only, consumer offsets, topic partitioning) in 80 lines of Python.
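A minimal sketch of the chosen design, assuming a single-threaded event loop (so no locking) and assuming the 500-message cap is enforced by dropping the oldest entry. The append / next_index / read method names follow the tool definitions above; the Message layout and the eviction rule are illustrative guesses, not the production class.

```python
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    topic: str
    content: str
    agent_id: str
    timestamp: float
    index: int


class SessionMessageBus:
    """Append-only log; each reader keeps its own cursor (a message index)."""

    def __init__(self, cap: int = 500) -> None:
        self._cap = cap
        self._log = []
        self._next = 0

    def next_index(self) -> int:
        return self._next

    def append(self, msg: Message) -> dict:
        self._log.append(msg)
        self._next = msg.index + 1
        if len(self._log) > self._cap:
            del self._log[0]  # assumption: evict oldest; indices never shift
        return asdict(msg)

    def read(self, topic: Optional[str] = None, since_index: int = 0) -> list:
        return [
            m for m in self._log
            if m.index >= since_index and (topic is None or m.topic == topic)
        ]


def publish(bus: SessionMessageBus, topic: str, content: str, agent_id: str) -> dict:
    return bus.append(Message(topic, content, agent_id, time.time(), bus.next_index()))


bus = SessionMessageBus(cap=3)
for i in range(5):
    publish(bus, "findings", f"f{i}", "sub_2")
publish(bus, "errors", "boom", "sub_5")

everything = bus.read()
recent_findings = bus.read(topic="findings", since_index=4)
cursor = everything[-1].index + 1  # a reader stores this for its next poll
```

Because indices are monotonic and never reused, a cursor stays valid even after old messages are evicted. A reader that falls behind the cap simply misses the evicted entries.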

Coordination strategies: the bus is one of four

The message bus is one coordination mode among four. A session runs under a coordination strategy that decides how the parent dispatches work and how children report back:

Strategy	Mechanism	When it fits
BUS	Parent delegates to profile workers; children `publish_finding` so siblings build on each other; parent synthesizes and resolves conflicts.	Overlapping research or review where discoveries help siblings.
kanban	Work is cards on a board (triage → ready → in-progress → done). A `specifier` promotes triage cards to ready; autopilot can auto-dispatch them.	Many loosely-related units of work with a backlog shape.
goal	A judge-evaluated loop. A `judge-calibrator` arms the rules, then `judge-deep` adjudicates each iteration with a strict-JSON verdict until the goal is met or stalls.	Qualitative objectives where "done" needs an independent verdict, not just produced output.
isolated	Children run in their own context with the full task ledger; the parent doesn't babysit. The worker drives its task to completion.	Self-contained tasks that don't benefit from cross-talk.

The strategies are not exclusive in spirit, since a goal loop can spawn BUS-coordinated researchers and a kanban card can be executed by an isolated worker, but a session has one active strategy that sets the defaults for how its children are wired.

Direct messaging and re-wake

The bus is broadcast; talk() is addressed. A parent, sibling, or child can send a message to any session it can name, whether by id, by label, or by alias (parent, operator). Two flags change the semantics:

talk(
    to: str | list[str],          # id, label, alias, or a list (multi-cast)
    content: str,
    force: bool = False,          # interrupt the recipient mid-operation:
                                  #   cancels its current stream/tool call,
                                  #   gives it one compliance turn to react
    wait_for_reply: bool = False, # block the sender's turn until the
                                  #   recipient replies (tagged to this msg)
    reply_timeout_s: int = 60,
)

The behavior that matters most for coordination is re-wake: a message sent to a non-running subagent revives it. The recipient picks up where it left off, with the new message prepended. This is what lets the "agent teams" pattern from §2 work without keeping idle agents resident. A specialist terminates when its task is done and resumes when a sibling needs it again, paying context cost only while it runs. force=True is the stop/redirect path: it interrupts a child mid-tool-call, which is how a parent kills a runaway exploration without waiting for it to finish.

Judges as a coordination primitive

The goal loop introduced a third kind of agent that is neither orchestrator nor worker: a verifier whose only job is to decide whether work is done. judge-deep is skeptical by default, read-only (it may run grep/cat but never writes, installs, or mutates git), and returns a strict-JSON verdict ({done, confidence, reason, criteria, open_questions}), capped at three iterations so it issues a verdict instead of drifting into doing the work itself. judge-calibrator runs once at goal-arming time to propose the rules the judge will apply. Separating the verifier from the producer is the generator-verifier pattern from §2, made structural: the model that decides "done" is a different model from the one that did the work, which is the cheapest available defense against an agent grading its own homework.
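One way a goal loop might enforce the "strict-JSON" part is to reject anything that is not exactly the five-key object. The key names come from the verdict shape above; the value types (a boolean done, a confidence in [0, 1]) and the parse_verdict helper are assumptions.

```python
import json

VERDICT_KEYS = {"done", "confidence", "reason", "criteria", "open_questions"}


def parse_verdict(raw: str) -> dict:
    """Parse a judge reply, raising ValueError on anything but a strict verdict."""
    verdict = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(verdict, dict) or set(verdict) != VERDICT_KEYS:
        raise ValueError("verdict must be an object with exactly the expected keys")
    if not isinstance(verdict["done"], bool):
        raise ValueError("done must be a boolean")
    conf = verdict["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return verdict


good = parse_verdict(
    '{"done": false, "confidence": 0.7, "reason": "tests missing", '
    '"criteria": ["tests pass"], "open_questions": ["is CI green?"]}'
)
```

Failing closed matters here: a reply that drifts into prose is treated as "no verdict" rather than being coerced into one.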

Passive finding injection

Passive notification means injecting recent findings into a subagent's context instead of making it poll read_findings. The runtime does this at each turn boundary: <system-reminder> blocks, context-pressure advisories, and sibling findings published since the agent's last cursor position are spliced into the agent's stream as side-channel cues. The plumbing (the bus, cursors, per-agent read positions) carries the findings, and the injection point sits at the start of every agent turn, so a discovery a sibling published two seconds ago is in front of this agent before it picks its next action.
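The turn-start injection can be sketched as a pure function from (log, cursor) to (reminder block, new cursor). The message dict shape and the render_turn_injection name are hypothetical, and skipping the agent's own findings is an assumption rather than documented behavior.

```python
def render_turn_injection(log: list, cursor: int, self_id: str) -> tuple:
    """Build the side-channel block for one agent turn and advance its cursor."""
    fresh = [m for m in log if m["index"] >= cursor and m["agent_id"] != self_id]
    new_cursor = max([cursor] + [m["index"] + 1 for m in log])
    if not fresh:
        return "", new_cursor
    body = "\n".join(f"[{m['agent_id']}/{m['topic']}] {m['content']}" for m in fresh)
    return f"<system-reminder>\n{body}\n</system-reminder>", new_cursor


log = [
    {"index": 0, "topic": "findings", "content": "GET /v1/users exists", "agent_id": "sub_2"},
    {"index": 1, "topic": "progress", "content": "half done", "agent_id": "sub_5"},
]

block, cursor = render_turn_injection(log, 0, "sub_5")
again, cursor_after = render_turn_injection(log, cursor, "sub_5")
```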

6 training data

Capturing delegation decisions as training data

What this section describes is partly running. The trajectory export format (docs/TRAJECTORY-TRAINING.md, a v3 JSON schema) feeds a reward layer that scores delegations per agent type, and a GRPO-style loop is designed to fine-tune the routing policy on those scores. The pipeline has a name in the codebase, ATIF, with the reward functions in RewardKit and the rollout orchestration in Harbor. The mechanism below is grounded in the spring-2026 agent-RL literature, and the seams to the export format are called out where they matter.

v3 status. The capture half has shipped as production scripts: the v3 export (a version: 3 JSON schema) plus an ATIF-v1.6 converter and a ShareGPT converter. The reward half now has its first running instance in §9, where an outcome classifier turns each skill's use into a weighted, V-scored signal, the same per-type reward idea applied to one concrete object. The RL fine-tuning of the routing policy described below is still the part that has not shipped.

Every time the orchestrator calls sub_agent(type="explore"), it makes a structured decision about task decomposition: which specialization for which slice of work. That decision is a training signal. A delegation that led to a verified, used result is a positive example; one that produced a discarded artifact or a redundant re-exploration is a negative one. Because those decisions are captured in a learnable format, the routing policy (which type to spawn, with what budget, against what slice of the task) can be improved with reinforcement learning instead of being hand-tuned in a prompt.

The thing being trained is worth naming precisely, because it is easy to conflate two different policies. Selection is which agent type to spawn for a sub-task. Decomposition is how to cut the task into sub-tasks in the first place. The capture format below records both, but they have different reward structure: selection has a relatively clean counterfactual (would a different type have done better on the same slice?), while decomposition does not (a different cut produces different slices, so there is no like-for-like comparison). The loop trains selection first for that reason, and treats decomposition as a slower second-order signal layered on once the routing policy has stabilized.

Training data flow: delegation to ATIF

The data flow from a live sub_agent call to an ATIF trajectory. Each delegation emits a session event, the snapshot captures the full trajectory, and the v3 export carries it into the reward layer that scores it.

The artifact at the center of this is the v3 export format in docs/TRAJECTORY-TRAINING.md: a per-session JSON schema that records the message sequence, tool calls, and a subagent_trajectory_ref for each delegation. That ref's extra field carries three pieces of metadata that make a delegation learnable: the agent_type chosen, the artifact_path the child wrote, and the resolved model. That triple is what links a task description to a delegation decision to an outcome, and it is what the reward layer reads when it scores the delegation after the fact.

Per-type reward functions, sorted by how checkable they are

The clearest signal in the spring-2026 agent-RL work is that verifiable rewards (outcomes you can check programmatically) train better policies than learned preference scores, because they don't drift and can't be gamed by producing confident-sounding output.[R4] Freyja's agent types do not all have verifiable rewards, and that difference is the most important thing about this table. The reward layer sorts its functions by how checkable they are and weights the checkable ones most heavily, so the routing policy learns fastest on the types whose outcomes are scripts.

Agent type	Reward class	Primary reward	Measurement	Anti-reward
code	verifiable	Test pass rate	% of tests passing after changes (programmatic)	Regressions: tests green before, red after
test / browser-qa	verifiable	Reproduction fidelity	Does the reported pass/fail match an independent re-run	Flaky or unreproducible verdicts
verify / review	semi-verifiable	Bug-detection precision	Reported issues that reproduce / total reported	False positives that cost reviewer time
judge-deep	semi-verifiable	Verdict calibration	Agreement with held-out human/consensus verdicts	Confident verdicts later overturned
explore / explore-fast	soft	Downstream usefulness	Was retrieved context actually used by a sibling/parent (citation in a later artifact)	Redundant re-reads; coverage that nothing consumed
plan / general	soft	Task completion	Did the parent's overall task succeed	Burning more iterations than a specialized type would

The asymmetry is the design constraint. code and test can be trained with verifiable rewards almost immediately, because the reward is a script. The explore family is the hard case: "coverage" is easy to measure and easy to game (read every file, score high, help nobody), so the design rewards used context instead. A retrieval counts only if a downstream agent cited the artifact it produced. That ties an explorer's reward to a sibling's behavior, which is a credit-assignment problem across the delegation tree, not a single agent's episode.

Credit assignment across a delegation DAG

A single-agent RL setup assigns credit along a chain: state, action, reward, repeat. A delegation tree is not a chain. The parent spawns children, children may spawn grandchildren, several siblings may contribute to one synthesized result, and some produce artifacts that are never used. Assigning one scalar reward to the whole episode and propagating it back equally would reward the explorer whose work was discarded exactly as much as the one whose work was cited. That is the central unsolved problem in training a delegation policy, and it is where the recent process-reward work is most relevant.[R5]

The design's answer is to score the tree as a DAG, not an episode. Each node (a subagent run) gets a local reward from its per-type function in the table above. Each edge (a delegation decision) gets credit proportional to how much its child's output propagated upward: concretely, whether the child's artifact was cited in the parent's synthesis, and whether the parent's task ultimately succeeded. An explorer whose artifact no parent cited gets near-zero edge credit even if its local "coverage" reward was high. This is the structural fix for the gaming failure mode. Local reward measures whether the work was good, edge credit measures whether the delegation was worth making, and the policy being trained is the edge policy.
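The split between local reward and edge credit can be illustrated with a toy scorer. The multiplicative form, the uncited_floor and failure_discount constants, and the Node layout are all assumptions chosen to show the shape of the idea. The post does not specify the actual weighting.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One subagent run in the delegation tree."""
    agent_type: str
    local_reward: float = 0.0        # from the per-type reward function
    cited_by_parent: bool = False    # did the parent's synthesis cite its artifact?
    children: list = field(default_factory=list)


def edge_credits(root, task_succeeded, uncited_floor=0.05, failure_discount=0.5):
    """Credit each delegation edge by how far its child's output propagated upward."""
    credits = {}

    def visit(parent, path, upstream):
        for i, child in enumerate(parent.children):
            key = f"{path}/{child.agent_type}#{i}"
            use = 1.0 if child.cited_by_parent else uncited_floor
            credits[key] = child.local_reward * use * upstream
            visit(child, key, upstream * use)  # uncited parents mute their subtree

    visit(root, root.agent_type, 1.0 if task_succeeded else failure_discount)
    return credits


tree = Node("general", children=[
    Node("explore", local_reward=0.9, cited_by_parent=False),  # thorough, unused
    Node("explore", local_reward=0.6, cited_by_parent=True),   # modest, cited
    Node("code", local_reward=1.0, cited_by_parent=True),
])
credits = edge_credits(tree, task_succeeded=True)
```

In this toy run the uncited explorer has the higher local reward but ends up with far less edge credit than the cited one, which is the ordering the routing policy should learn from.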

Counterfactual delegation and group-relative optimization

Selection has a clean counterfactual that decomposition lacks: for a given task slice, you can ask whether a different agent type would have done better. The design exploits this with the group-relative approach from the GRPO line of work. For a sampled task slice, spawn the same slice against several candidate types (or the same type at several budgets), and optimize the routing policy relative to the group's outcomes rather than against an absolute value estimate.[R6] This sidesteps the hardest part of agent RL, which is estimating the absolute value of a delegation; ranking delegations within a batch is far more robust than scoring one in isolation. The cost is sample efficiency, since you pay for the counterfactual spawns, so the design runs counterfactual rollouts offline on logged trajectories where possible, replaying a recorded task slice against an alternate type rather than re-executing live.
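The group-relative step itself is small: score each candidate in the group, then normalize against the group's own mean and spread, so no absolute value estimate is needed. The sketch below uses the common mean / standard-deviation normalization; the reward numbers are made up, and whether Freyja's loop uses this exact normalization is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: dict, eps: float = 1e-8) -> dict:
    """Advantage of each candidate relative to its own rollout group."""
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std over the group
    return {name: (r - mu) / (sigma + eps) for name, r in rewards.items()}


# One task slice replayed against three candidate agent types (illustrative rewards).
slice_rewards = {"explore": 0.8, "explore-fast": 0.5, "general": 0.2}
advantages = group_relative_advantages(slice_rewards)
```

Only the ranking within the group drives the update: adding a constant to every reward in the group leaves the advantages unchanged.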

What gets trained, and what stays a prompt

Not everything should be learned. The design draws the line at routing: the policy that maps a task slice to (agent type, budget) is worth training because it has a measurable outcome and a large decision space. The per-type system prompts, the tool whitelists, and the coordination strategy stay hand-authored, because they are low-dimensional, interpretable, and cheap to edit; training them would trade away the auditability that makes the registry a single dict entry. The endpoint is modest and specific: a routing policy that has learned, from logged delegation outcomes, that "enumerate the REST endpoints" is an explore job at a low budget, "implement the auth middleware" is a code job at a high one, and "decide whether this goal is actually met" is a judge-deep job, with the budgets learned from data instead of guessed.

Why the capture format is the load-bearing piece

The reward functions, the DAG credit assignment, the counterfactual rollouts, and the RL fine-tuning all sit downstream of one thing: the trajectory capture has to be lossless and right. If the trajectories don't record agent type, artifact lineage, and downstream citation from the start, none of the training above is recoverable later, because the signal was never written. That is why the v3 schema and the extra-field enrichment came first and are the most conservative part of the system: everything else can be retrained, reweighted, or swapped, but a delegation outcome that wasn't logged is gone. The rest of the loop evolves against a fixed, complete record.

7 architecture evolution

Where we are heading

The architecture evolves through five phases, each building on the previous. Phase 1 has shipped, and parts of later phases have landed out of order. Each subsequent phase is motivated by a concrete limitation of the current design.

Phase 1 — shipped, and overtaken

Orchestrator-subagent with specialized types

The parent orchestrator delegates via sub_agent(type=...). Each child runs in an isolated context with type-specific tools and prompts, and artifacts persist to disk through the manifest. This foundation shipped, and then development overtook the roadmap below on a different axis. The growth from five to fifteen agent types, the four coordination strategies, the talk() channel, re-wakeable subagents, and the judge family in §5 were never on this timeline; they emerged from coordination needs rather than from the plan. The trajectory capture format and the training loop it feeds (§6) both came in on this phase. The phases below are the axis the roadmap planned along; the coordination work happened orthogonally to all of them. The same has held since v2: the skill-learning loop (§9), the durable scheduler (§10), and the grounded-memory spine (§8) all shipped off this roadmap, on a learning-and-durability axis rather than the one planned below.

Phase 2 — next

Warm agent pools

Currently, every sub_agent call spawns a fresh agent that starts from scratch. Warm pools maintain a set of persistent workers across turns. An explore agent that mapped the codebase in turn 3 retains that understanding in turn 7 when a follow-up exploration is needed. This eliminates the cold-start cost of re-reading files that a previous instance already processed.

Implementation: a WorkerPool keyed by agent type, with LRU eviction when the pool exceeds a configurable size. Each worker retains its context window and artifact directory across invocations. The parent addresses workers by type, not by ID, asking for "an explore worker" rather than "resume sub_3."
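Since this phase is a plan rather than shipped code, the following is only a sketch of the described shape: a pool keyed by agent type with LRU eviction. The factory callback, the one-worker-per-type simplification, and the dict standing in for a worker's retained context are all assumptions.

```python
from collections import OrderedDict
from typing import Callable


class WorkerPool:
    """Keeps at most max_size workers alive, one per agent type, evicting LRU."""

    def __init__(self, factory: Callable[[str], dict], max_size: int = 4) -> None:
        self._factory = factory
        self._max_size = max_size
        self._workers = OrderedDict()

    def acquire(self, agent_type: str) -> dict:
        """Ask for "an explore worker": reuse a warm one or start a cold one."""
        worker = self._workers.get(agent_type)
        if worker is None:
            worker = self._factory(agent_type)
            self._workers[agent_type] = worker
        self._workers.move_to_end(agent_type)      # mark as most recently used
        while len(self._workers) > self._max_size:
            self._workers.popitem(last=False)      # evict least recently used
        return worker


pool = WorkerPool(lambda t: {"type": t, "seen_files": []}, max_size=2)

first = pool.acquire("explore")
first["seen_files"].append("src/app.py")   # context retained across invocations
pool.acquire("code")
second = pool.acquire("explore")           # warm hit: same worker, same memory
pool.acquire("verify")                     # pool is full: "code" is evicted
fresh_code = pool.acquire("code")          # cold start again
```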

Update (06.05): typed WorkerPools are still unbuilt, but the always-on half of this phase arrived on a different axis. The durable scheduler (§10) makes a delegation persist across restarts and carry working memory between fires, which is the cross-invocation persistence this phase wanted, reached by making jobs durable rather than by keeping workers warm.

Phase 3 — partly shipped

Event-driven message bus

The bus already pushes: the runtime injects <system-reminder> cues, context-pressure advisories, and sibling findings published since an agent's last cursor position into the agent's stream at each turn start. What remains is the declarative routing layer on top of it, topic subscriptions that say "route findings tagged api_endpoint to code agents" rather than delivering every sibling finding to every agent. The bus, cursors, per-agent read positions, and the turn-start injection point all exist; the topic-routing policy is the open piece.

Phase 4 — vision

Dynamic type creation

The orchestrator can create new agent types at runtime by specifying a tool set and prompt. A request like "I need an agent that only has the database tools and knows our schema conventions" mints a new type for this session, uses it for delegation, and captures the type definition for potential promotion to a permanent entry in the registry. The registry becomes a living, evolving catalog.

Phase 5 — long-term

Harbor integration: Freyja as a Harbor agent adapter

Freyja becomes one of Harbor's 22+ agent adapters. Freyja's subagent trajectories flow directly into Harbor's training pipeline without conversion. Harbor's cross-agent insights (patterns learned from other adapters' trajectories) flow back into Freyja's delegation policy. The training loop crosses framework boundaries.

Evolution: what shipped, what's next

the architecture is designed for incremental evolution. phase 1 is complete. each subsequent phase adds capability without modifying existing code.

8 stateful profiles

Profiles that remember

Every phase in the roadmap above keeps one assumption fixed: a profile is a stateless configuration. The registry entry for explore names a model policy, a tool surface, a thinking budget, and a prompt, and the runtime instantiates that configuration fresh for every task and discards it when the task returns. Two explore children spawned a minute apart share their definition and nothing else. Whatever the first one learned about the codebase, the second one re-derives from zero.

That is a strange thing to throw away. A profile is the one structure in the system that recurs: the same explore configuration runs hundreds of times against the same project, the same verify configuration checks the same classes of task over and over. The work is different each time, but the kind of work is stable, and stable kinds of work accumulate reusable structure. The next axis of evolution is not new profiles or new coordination strategies. It is giving each profile a durable, type-level store that every instance of that profile reads from and writes back to, so the configuration stops being the only thing the instances share.

This is a different mechanism from the warm pools in Phase 2. A warm pool keeps one worker's context alive across turns inside a single session, so a long-lived reviewer does not pay re-instantiation cost mid-conversation. A stateful profile is cross-session and type-level: the store outlives any individual worker and any individual session, and every future instance of the profile inherits it. Warm pools amortize startup within a session; stateful profiles amortize learning across the profile's entire history.

The shift

A profile becomes a pair: the configuration that defines how an instance behaves, and an accumulating store that defines what every instance already knows. The registry stops being a table of stateless configs and becomes a set of long-lived, specialized systems that happen to be invoked one task at a time.

The substrate for this already exists in the architecture. The artifact and manifest system from §4 is a content-addressed store that outlives the session that wrote it: every subagent already persists its full output to disk and registers it in a manifest the parent can read on demand. A stateful profile is what you get when that store is indexed by profile and queried before a task runs, not only read after one finishes. The two profiles the rest of this section develops, explore and verify, are the two where the payoff is largest and the prior art is clearest.

Explore that remembers what it explored

An explore instance spends most of its budget re-discovering things the profile has already seen. It searches the web, reads files, greps the codebase, and assembles a picture, and almost all of that picture overlaps with what some earlier explore instance assembled for an adjacent question. The wasted motion is not the reasoning, which is genuinely per-task. It is the retrieval: the same documents fetched, the same files read, the same searches issued, because the second instance has no way to ask whether the first one already found the answer.

A stateful explore profile closes that gap with a persistent store of prior retrievals, indexed so a new instance can ask “have I already found this?” before it issues a single search. The organizing question is how to index it. The most directly applicable design is ByteRover's Context Tree (2026), which organizes accumulated findings as a hierarchy of Domain → Topic → Subtopic → Entry, stores each entry as a markdown note with provenance back to its source, and retrieves through a progressive five-tier walk that returns relevant context in well under a second. That maps almost exactly onto the grouping a research profile wants: by topic, by goal, and by session. A new explore task resolves to a path in the tree, pulls the entries already hanging off that path, and only searches outward for what is missing.
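A minimal sketch of the "have I already found this?" gate, assuming a plain in-memory tree keyed by path. The names ExploreStore, Entry, and explore are hypothetical, this is not ByteRover's API, and the five-tier retrieval walk is reduced to a single subtree lookup. It also simplifies "search only for what is missing" to "search only when nothing is cached".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entry:
    note: str    # markdown body of the finding
    source: str  # provenance: the file or URL the finding came from


@dataclass
class Node:
    children: dict[str, Node] = field(default_factory=dict)
    entries: list[Entry] = field(default_factory=list)


class ExploreStore:
    """Hypothetical hierarchical store: domain -> topic -> subtopic -> entries."""

    def __init__(self) -> None:
        self.root = Node()

    def add(self, path: tuple[str, ...], entry: Entry) -> None:
        node = self.root
        for part in path:
            node = node.children.setdefault(part, Node())
        node.entries.append(entry)

    def lookup(self, path: tuple[str, ...]) -> list[Entry]:
        """Return every entry at `path` and beneath it, or [] if the path is unknown."""
        node = self.root
        for part in path:
            if part not in node.children:
                return []
            node = node.children[part]
        found: list[Entry] = []
        stack = [node]
        while stack:
            current = stack.pop()
            found.extend(current.entries)
            stack.extend(current.children.values())
        return found


def explore(
    store: ExploreStore,
    path: tuple[str, ...],
    search: Callable[[tuple[str, ...]], list[Entry]],
) -> list[Entry]:
    """Consult the store first; search outward only when nothing is cached."""
    cached = store.lookup(path)
    if cached:
        return cached
    fresh = search(path)
    for entry in fresh:
        store.add(path, entry)
    return fresh
```

The second instance asking about the same path returns from the store without calling search at all, which is the saving the paragraph above describes. Staleness handling is deliberately left out here.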

The idea that an agent should consult its own history before acting is older than the recent memory systems, and the lineage matters because it tells you which parts are load-bearing. MemGPT (2023) introduced virtual context management, paging information between a small working context and an external store the way an operating system pages memory, which is the mechanism that lets a store outgrow any single context window. Generative Agents (2023) added a memory stream with a reflection step that periodically distills raw observations into higher-level summaries, scored by recency, importance, and relevance at retrieval time. Reflection is the part a research profile needs most: without it the store is a transcript that grows without bound, and with it each session leaves behind a compact digest rather than a log. ExpeL (2023) made the reuse explicit, keeping an experience pool keyed by task and retrieving relevant prior experience before attempting a new one, and CoALA (2023) gave the vocabulary that keeps these pieces distinct, separating episodic memory (what happened in past sessions) from semantic memory (facts about the domain) from procedural memory (how to do the task), which are three different stores with three different update rules even though a naive design collapses them into one.

The more recent work sharpens the retrieval side. A-MEM (2025) links memory notes into a Zettelkasten-style graph so related findings connect across sessions instead of sitting in separate buckets, which is what lets a topic accumulate rather than fragment. Mem0 (2025) runs an extract-and-consolidate loop over conversational history into a combined vector and graph store and reports large token savings at retrieval time, because the agent pulls a few consolidated facts instead of replaying a transcript. General Agentic Memory (2025) is the cleanest statement of the explore use case: it separates a cheap always-on page-store that records everything from a just-in-time researcher that answers, at query time, whether the needed information has already been found, which is precisely the “have I already explored this?” gate. And the 2026 memory-OS line, MemoryOS, EverMemOS with its thematic MemScene clusters, and MemForest with a hierarchical index keyed jointly by session and topic, converges on the same hierarchical, theme-clustered organization that the Context Tree describes, which is some evidence that this is the shape the problem wants rather than one team's idiosyncrasy.

Stateless explore vs. stateful explore

a stateful explore profile consults the shared store before searching. the second instance retrieves what the first found and spends its budget only on what is genuinely new.

Verify that accumulates its own verifiers

The verify profile has the same redundancy in a different shape. Today a verify instance receives a task, works out what correct would mean, and synthesizes the checks from scratch: it writes the assertions, reasons out a rubric, decides what to run. The next verify instance handed a structurally identical task does all of that again. The reasoning about this output is per-task and worth redoing. The machinery of how to check this class of task, the rubric, the executable checker, the replay harness, is not, and rebuilding it every time is both wasteful and a source of inconsistency, because two instances asked to verify the same kind of work can invent two different standards for it.

A stateful verify profile keeps a growing library of verification artifacts indexed by task type, and retrieves a pre-built verifier instead of synthesizing one. The closest prior art is the tool-induction line. LATM (2023) splits the work into a tool maker that writes a reusable tool once and caches it by problem description, and a cheaper tool user that retrieves and applies it, which is the exact division a verify profile wants between building a checker and running it. CRAFT (2023) organizes induced tools into toolsets by domain and retrieves them by embedding similarity, which is the “indexed by task type” requirement directly. TroVE (2024) is the one that takes the lifecycle seriously: it grows a toolbox and then trims it, so the library does not bloat into a pile of near-duplicate verifiers that make retrieval worse, and any honest version of this profile needs that trim step as much as it needs the grow step.

The admission rule is what separates a verifier library from a junk drawer. Voyager (2023) is the template: it builds an ever-growing skill library in which a skill is only admitted after it executes successfully against the environment, so the library is a set of verified capabilities rather than plausible-looking code. A verify profile should gate the same way, admitting a checker only once it has actually run and discriminated a known-good output from a known-bad one. LEGO-Prover (2023) shows the same admission discipline in formal proof, growing a library of lemmas each gated by a Lean type-check, and the discipline transfers: a checker earns its place in the library by passing an executable gate, not by looking reasonable.
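The gate can be sketched in a few lines, assuming a checker is just a callable that returns True when an output passes. VerifierLibrary and its method names are hypothetical, and a real gate would likely run more than one good and one bad example.

```python
from typing import Callable

Checker = Callable[[str], bool]  # True means the output passes the check


class VerifierLibrary:
    """Hypothetical library of checkers, indexed by task type."""

    def __init__(self) -> None:
        self._by_task_type: dict[str, list[Checker]] = {}

    def admit(
        self, task_type: str, checker: Checker, known_good: str, known_bad: str
    ) -> bool:
        """Admit a checker only if it runs and tells good output from bad."""
        try:
            discriminates = bool(checker(known_good)) and not bool(checker(known_bad))
        except Exception:
            # A checker that crashes has not passed an executable gate.
            discriminates = False
        if discriminates:
            self._by_task_type.setdefault(task_type, []).append(checker)
        return discriminates

    def retrieve(self, task_type: str) -> list[Checker]:
        return list(self._by_task_type.get(task_type, []))
```

A checker that accepts everything fails this gate because it passes the known-bad output too, which is the "looks reasonable but does nothing" case the admission rule exists to keep out.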

The 2026 work targets the verification case specifically. RewardHarness (2026) maintains an evolving library of evaluation tools and skills per domain, with an orchestrator that refines them over time, which is almost a direct description of this profile. Prompt-Level Reward Specifications (2026) is the sharpest match to what a verifier should store: per task it produces an offline, task-adaptive rubric paired with an executable hard-constraint checker, caches the pair, and reuses it, which is exactly the rubric-plus-checker artifact a verify profile would retrieve by task type. AutoHarness (2026) synthesizes a code harness per environment type, the replay-harness analogue. DeepVerifier (2026) builds a failure taxonomy and indexes rubrics by failure type, which is a more discriminating key than task type alone: the profile can retrieve checks aimed at the specific ways this kind of task tends to break. VPR (2026) and AgentV-RL (2026) round out the picture from the reward-modeling side, with oracle verifiers organized by reasoning category and an agentic tool-using verifier respectively, both consistent with a library keyed by problem type.

Why these two profiles first

Both explore and verify do work that is heavily redundant across instances and cheap to verify when reused: a retrieved document is either still the right source or it is not, and a retrieved checker either passes its executable gate or it does not. That makes them the two profiles where an accumulating store pays off soonest and fails most visibly, which is exactly where you want to start.

The general form

Once two profiles carry their own stores, the registry itself changes character. Every entry becomes a pair of a configuration and an accumulating store, and the stores differ by profile in a way that follows the kind of experience each one generates. A useful frame here is the experience-compression spectrum (2026), which separates episodic experience (raw past sessions, compressible perhaps five to twenty times), procedural experience (distilled reusable skills, fifty to five hundred times), and declarative experience (general rules, a thousand times and up). An explore profile lives mostly in the episodic and semantic registers: what it found, about what. A verify profile lives in the procedural register: how to check a kind of thing. A code profile would accumulate a library of project-specific fixes and patterns; a plan profile would accumulate prior decompositions of similar goals. The store is not one mechanism bolted onto every profile, but a per-profile choice of what kind of experience is worth keeping and at what compression.

The trajectory loop in §6 is what makes this more than a cache. The training exports already capture which profile ran each step, with what tools, to what outcome, tagged by agent_type. That signal is exactly what a profile's store needs to learn from: it tells the verify profile which retrieved checkers actually caught real failures and which waved bad output through, and it tells the explore profile which cached findings were reused versus re-searched anyway. The store and the training loop reinforce each other: the store gives the profile something to learn over, and the loop tells the store what was worth keeping.

Limitations

This is the most speculative direction in this writeup, and the prior art includes its own strongest counterargument. The most important caution is that library learning is harder than it looks. A 2025 study of LLM library learning found that in a LEGO-Prover-style system there was little evidence of genuine reuse, and the apparent gains largely vanished once compute was held constant, which means a profile store can look like it is helping while really just buying more attempts. The discipline that follows is non-negotiable: measure the reuse rate directly, control for compute when claiming a benefit, and treat “the store grew” as worthless until “the store was reused and the reuse helped” is demonstrated.

Three more failure modes are specific to this design. Capability erosion: work on self-evolving agents (2026) documents a phenomenon where specializing on a narrow distribution degrades general capability, so a profile that over-fits its store to the tasks it has seen can get worse at the tasks it has not. Staleness: an explore store accumulates findings that go out of date as URLs rot and code changes underneath them, so cached retrievals need provenance and an invalidation policy, which is exactly why the Context Tree carries provenance and a maturity decay on its entries. Cold start: a store beats re-derivation only after enough history has accumulated, so the early sessions pay the full cost of building the store and get none of the benefit, and a profile that is rarely invoked may never reach the crossover. None of these is fatal, but each one is a place where a careless version of this idea quietly makes the system slower and worse while appearing to learn, which is the failure the compute-controlled study warns about, in three new disguises.

Update · 06.05 · the spine shipped

What stopped being speculative

Everything above is the design as it stood on 05.31. In the days since, the part of it that was least speculative in principle and most speculative in practice, a durable store an instance reads before it acts, shipped. It landed on a smaller and more urgent object than a research profile: the agent's own record of what it just did.

The forcing case was a failure that has nothing to do with retrieval efficiency and everything to do with truth. An agent did real work across a long session, writing a 680-line tool file and editing a second project. Its context was compacted as the session grew, and afterward it told the user, in plain language, that it had no recollection of the work and had only been exploring read-only. It had not been exploring. The summary that replaced the raw turns dropped the edits, and the model, with nothing left in context to contradict it, asserted the compression as fact. This is the §6 gaming failure turned inward: not an agent grading its own homework, but an agent forgetting it did the homework at all.

The fix is the stateful-profile mechanism in miniature, and it is shipped and unit-tested. A per-session effect ledger (SessionLedger, an append-only action_ledger.jsonl beside the artifact manifest) records every effect the agent has, classified at the tool boundary: file writes and edits, image generation, and mutating shell commands, the last attributed to exact files by diffing git status --porcelain before and after the call. The ledger survives compaction because it was never in the transcript to begin with. Each turn, a debounced <system-reminder> renders the ledger back as a standing, first-person note framed as ground truth: this is the durable record of what you have done; trust it over your in-context memory if they disagree. A reconcile-before-asserting rule tells the agent to consult it before claiming it did or did not do something, and a forgetting detector watches each turn for the exact failure above, a claim of “no recollection” or “read-only” while the ledger shows effects, and injects a one-shot correction while emitting telemetry when it fires.
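The shape of that mechanism, reduced to a sketch: an append-only JSONL file, a reminder rendered from it, and a detector that compares a claim against the record. The class name and file name follow the text; the record fields, the method names, and the regular expression are assumptions, and debouncing and the git status diffing are omitted.

```python
import json
import re
from pathlib import Path

# Assumed phrasing for the forgetting detector; the shipped patterns are not given.
FORGETTING = re.compile(r"no recollection|read-only", re.IGNORECASE)


class SessionLedger:
    """Append-only effect ledger that lives outside the transcript."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, kind: str, target: str) -> None:
        # One JSON object per line; existing lines are never rewritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"kind": kind, "target": target}) + "\n")

    def effects(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def render_reminder(self) -> str:
        lines = [f"- {e['kind']}: {e['target']}" for e in self.effects()]
        body = "\n".join(lines) or "- (no effects recorded)"
        return (
            "<system-reminder>\n"
            "This is the durable record of what you have done; "
            "trust it over your in-context memory if they disagree.\n"
            f"{body}\n"
            "</system-reminder>"
        )

    def contradicts(self, assistant_text: str) -> bool:
        """True when the text claims no effects while the ledger shows some."""
        return bool(self.effects()) and bool(FORGETTING.search(assistant_text))
```

Because the ledger is read from disk on every call, nothing the summarizer does to the transcript can change what render_reminder returns.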

The transcript is a viewport; the ledger is the document

the runtime records effects to a durable ledger the summarizer never touches, then renders that ledger back into every turn as ground truth. an agent can no longer truthfully claim it did nothing while the ledger shows what it did.

A second layer, also shipped, gives the agent a place to record the meaning of the work rather than only the fact of it: a structured working memory (WorkingMemory, one JSON document per session) with a small entity vocabulary, workstream, decision, finding, open_thread, and artifact_note, maintained through a working_memory tool and used to seed the summarizer so a compaction cannot silently drop a decision. The framing the design settled on is that the context window is a viewport, not the document: make the transcript a render over durable backing stores, and compaction stops being loss.

Two things should be said honestly about how far this gets toward the vision above. The closest realization of the “accumulate verified capability” half of §8 did not come from a profile store at all. It came from the skill-learning loop, where each admitted skill is a durable, outcome-ranked, type-level capability the system keeps across sessions, and that loop is the subject of §9. It is the strongest evidence so far that the shape proposed here is real rather than merely appealing. The part that has not shipped is the per-type research store, the explore profile that consults a Context Tree before it searches: the ledger and working memory are per-session, not cross-session and per-profile, and the provenance graph, the diff-aware artifact memory, and the sleep-time curation a mature profile store would need remain designed and unbuilt. The speculation moved. It has not arrived.

9 the skill-learning loop

Sessions that leave a skill behind

This section is new in v3, and what it describes is running in production. §6 captures delegation as training data; §8 wants profiles that accumulate. The skill-learning loop is the first place both ideas run together, on a smaller object than either: a single reusable skill. A sub-agent drafts it, a human admits it, and an outcome classifier decides whether it earns its place.

A skill is a SKILL.md file: a name, a trigger set, a type (build, guard, reference, or workflow), and a body of guidance that loads into context when its triggers match. The library is the system's accumulated, hand-auditable procedural memory. The question this loop answers is where new skills come from, and how the library stays good as it grows, and the answer is a closed loop with a sub-agent at one end and a reward signal at the other.

live · the skill-learning loop

The loop has six stages, and the design choice at each one is the same one the rest of this post keeps making: push the work to a specialized, context-scoped agent, and keep a human on the one decision that is expensive to get wrong.

The six stages

A signal trips a counter, a sub-agent drafts, a human admits, the library serves, an outcome scores, and a value score re-ranks. Only the third stage is human; the rest run unattended.

⊕ SIGNAL

Cadence or /learn-this

A counter ticks every turn and trips at five. The count lives in a workspace-global file, so it survives a restart and is shared across surfaces. An explicit /learn-this trips it on demand.

⊕ DRAFT

skill-drafter sub-agent

A real sub-session on Opus 4.8 reads the conversation and the existing library, then calls propose_skill once, or finishes with a plain explanation that nothing here is worth keeping.

⊕ ADMIT

Human confirm

The candidate surfaces as a full SKILL.md, diffed against what it would replace. A destructive overwrite is gated. The drafter proposes; it never writes to the library.

⊕ SERVE

Trigger-matched load

An admitted skill loads into later sessions when its triggers match, the same progressive-disclosure discipline the tool surface uses in §3.

⊕ SCORE

Outcome classifier

A second Opus 4.8 sub-agent reads the turns after each load and labels what the skill did, on the twelve-category scale below. The label carries a weight.

⊕ RANK

V-score re-orders

Labels roll up into a value score over the last thirty loads. The library lists by V, so a skill that stops earning its place sinks below the fold.

The drafter is a sub-agent, not a function call

The earliest version of this loop was a single structured-output model call: hand over the conversation, get back a candidate. It was replaced by skill-drafter, a registered agent type (§3) that runs as a real sub-session. It is pinned to Opus 4.8 rather than inheriting the parent's model, because a draft triggered from a Slack DM would otherwise run on whatever cheap model answered the DM, and the drafter is the high-precision arbiter of what is worth keeping. Its tool surface is read-only plus one write-shaped tool, propose_skill: it reads the existing library with load_skill before deciding, so a candidate that overwrites a known skill amends it rather than rewriting it from memory and silently dropping half its body. Because it is a real session, it streams, it shows up in the subagents panel, and a human can re-engage it to refine a candidate instead of accepting or rejecting it whole.

A human admits the skill

The drafter proposes; it never writes to the library. propose_skill writes the candidate to a holding area and raises it for review as a full SKILL.md, frontmatter and body, diffed against whatever it would replace. An overwrite that removes more than a hundred lines or more than half of an existing skill is gated behind an explicit destructive-promote confirmation. This is the generator-verifier split from §2 turned on the system's own memory: the model that proposes a capability is never the one that admits it. It is also the admission discipline §8 borrowed from Voyager[L4], made literal: a skill earns its place by passing a gate rather than by looking reasonable.
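One way the destructive-overwrite threshold could be computed, as a sketch. It assumes "removes" is measured in lines of the existing skill that do not survive into the candidate, counted with difflib from the standard library; the function names and the exact measurement are assumptions.

```python
import difflib


def removed_line_count(old: str, new: str) -> int:
    """Count lines of `old` that are deleted or replaced in `new`."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    removed = 0
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed += i2 - i1
    return removed


def needs_destructive_confirm(
    old: str, new: str, max_lines: int = 100, max_fraction: float = 0.5
) -> bool:
    """True when an overwrite drops more than 100 lines or more than half the skill."""
    old_total = len(old.splitlines())
    if old_total == 0:
        return False  # nothing to destroy: a brand-new skill
    removed = removed_line_count(old, new)
    return removed > max_lines or removed > max_fraction * old_total
```

An amendment that only appends guidance removes nothing and passes straight to ordinary review; a rewrite from memory that drops most of the body trips the gate.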

An outcome classifier turns use into a reward

Admission is the start of the signal, not the end of it. Once a skill is in the library and begins loading into real sessions, an outcome classifier (also Opus 4.8) reads the few turns after each load and labels what the skill actually did, on a twelve-category scale sorted by polarity. This is the per-type reward idea from §6 made concrete for skills: not a single scalar but a discrete label with a known weight, chosen because “the skill loaded” is trivially gameable while “the skill was used, and using it helped” is not. The weights are the only place the loop decides good from bad; every other module reads them.

Outcome	Polarity	Weight	What it means
user_endorsed	positive	+2.0	The user explicitly affirmed the response the skill influenced.
cited	positive	+1.5	The agent's response text referenced the skill outright (“per X…”).
compounded	positive	+1.2	The skill unblocked another skill or a downstream workflow.
partial	neutral	+0.3	Some guidance followed, some not, with no correction.
clean	neutral	0.0	Task done; no skill-attributable signal either way. The healthy default.
ignored	neutral	−0.2	Loaded, but the agent's later behavior shows no influence.
redundant	neutral	−0.3	Another skill loaded this turn covered the same ground.
false_trigger	negative	−0.6	Triggers matched lexically but the content was semantically irrelevant.
correction	negative	−1.0	The user corrected behavior the skill was supposed to govern.
superseded	negative	−1.2	The agent abandoned the skill's approach mid-task for one that worked.
error_loop	negative	−1.5	Three or more repeated tool errors the skill should have prevented.
outdated	decay	−1.5	The skill's advice was followed and produced an environment-attributable failure.

The twelve outcome categories and their weights, in the order the operator UI shows them. Polarity groupings are deliberately legible so a skill's record reads at a glance.

Those labels roll up into a single value score per skill: a windowed average over the most recent thirty outcomes, with a confidence term that saturates over those first thirty so a skill with three lucky loads does not outrank one with a long record. It renders in the library as a one-line headline, V=+1.34 · 23 loads · 18 cited, 4 clean, 1 correction, and the library lists by V. That ordering is the decay mechanism. A skill that keeps triggering on the wrong task (false_trigger) or sends the agent into a repeated-error loop (error_loop) sinks until it stops being surfaced. Nothing deletes it; it is out-competed. This is the grow-and-trim discipline §8 said a capability library needs[L3], with the trim implemented as ranking rather than removal.
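A sketch of the roll-up, using the weights from the table. The exact form of the confidence term is not given, so this assumes the simplest one that saturates at thirty, a linear n/30 multiplier. Treat it as an illustration of the shape and not as the shipped formula; it is not expected to reproduce the headline numbers above.

```python
# Weights copied from the outcome table.
WEIGHTS = {
    "user_endorsed": 2.0,
    "cited": 1.5,
    "compounded": 1.2,
    "partial": 0.3,
    "clean": 0.0,
    "ignored": -0.2,
    "redundant": -0.3,
    "false_trigger": -0.6,
    "correction": -1.0,
    "superseded": -1.2,
    "error_loop": -1.5,
    "outdated": -1.5,
}
WINDOW = 30


def v_score(outcomes: list[str], window: int = WINDOW) -> float:
    """Windowed mean of outcome weights, scaled by an assumed linear confidence term."""
    recent = outcomes[-window:]  # only the most recent `window` loads count
    if not recent:
        return 0.0
    mean = sum(WEIGHTS[label] for label in recent) / len(recent)
    confidence = len(recent) / window  # reaches 1.0 at `window` loads
    return mean * confidence


def rank(library: dict[str, list[str]]) -> list[str]:
    """Skill names ordered by V, highest first."""
    return sorted(library, key=lambda name: v_score(library[name]), reverse=True)
```

Under this assumption a skill with three user_endorsed loads scores 2.0 × 3/30 = 0.2, while one with thirty cited loads scores 1.5, so the long record outranks the lucky start even though its per-load average is lower.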

What this is, and what it is not

This is the §6 training loop closed end to end on one object: a decision (draft this skill, shaped this way), an outcome (cited, ignored, corrected), and a reward (the V-score) that feeds back into which skills are available next. It is also the strongest shipped evidence for §8, an accumulating, admission-gated, outcome-ranked, type-level store. What it is not is the full stateful-profile vision: the store is keyed by skill and workspace-global, not a per-agent-type Context Tree, and the harder pieces are deferred on purpose, with no statistical-significance gate on V before a skill is trusted, no automatic skill-patching (a human still edits), and no consolidation of near-duplicate skills beyond a render-time cap. The 2025 library-learning caution from §8[L11] is the reason the V-score exists at all: “the store grew” is worthless until “the store was reused, and the reuse helped” is measured, and the V-score is that measurement, built to sink a skill rather than celebrate it.

10 always-on

The durable scheduler

Every section so far assumed a session begins when a human starts it and ends when the work returns. The scheduler, entirely new since v2, removes that assumption. It makes a delegation durable: the same agent can fire on a clock, or on a cadence it sets itself, across the desktop app and Slack, and it keeps firing when the app is closed.

A job is a record on disk (~/.freyja/schedules/jobs/{id}.json, written atomically), and the service that runs them holds no job state in memory at all. Every read hits disk, a per-job file lock coordinates mutations across processes, and a single owner-lock keeps exactly one tick loop alive. That reads like an implementation detail, but it closed a real class of bug, where a job created from Slack lived in the gateway process and stayed invisible to the desktop bridge and the dashboard until a restart. Statelessness is the feature here: there is one source of truth, and it is the disk.
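The write-atomically, read-from-disk discipline can be sketched with the standard library alone. This assumes the common temp-file-then-rename approach, which the text does not confirm; write_job and read_jobs are hypothetical names, and the per-job file lock and the owner-lock are left out.

```python
import json
import os
import tempfile
from pathlib import Path


def write_job(jobs_dir: Path, job: dict) -> Path:
    """Write {id}.json so readers see either the old file or the new one, never a partial."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    final = jobs_dir / f"{job['id']}.json"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=jobs_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(job, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, final)  # atomic rename over any existing file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final


def read_jobs(jobs_dir: Path) -> list[dict]:
    """Stateless read: every call goes back to disk, nothing is cached in memory."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(jobs_dir.glob("*.json"))
    ]
```

Because read_jobs holds no state, a job written by one process is visible to every other process on its next read, which is the property that closed the Slack-versus-desktop visibility bug.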

A job composes along three independent axes, which is what lets a one-off reminder and an autonomous nightly loop be the same kind of object:

Axis	Choices	What it controls
Schedule	`once` · `interval` · `cron` · `self-paced`	When it fires. Natural language (“every weekday at 9am”) parses down to one of these; cron runs on a built-in interpreter with no extra dependency.
Execution	new session · persistent session · existing session	What it fires as. A persistent session is allocated on the first fire and reused, so the job keeps a context across runs.
Delivery	slack · desktop · filesystem · session · webhook · noop	Where the result goes. A morning brief can land in a Slack DM; a build watcher can write a file and notify nothing.

Memory across fires

This is where the scheduler touches §8. A job carries a memory spec: working notes on disk that persist between fires, with the last few deltas auto-injected into the next fire's prompt, and an artifact reference (a path, a repo, a doc URL) injected so the agent always knows its canonical output location. A scheduled morning-briefing job remembers what it said yesterday. This is stateful recurrence, and while it is not the per-type Context Tree of §8, it is a shipped instance of the same idea: an agent that reads its own durable store before it acts. A persistent-session execution keeps the whole context across fires; a memory spec keeps the distilled notes even when each fire is a fresh session.

Loops the agent paces itself

A self-paced schedule hands the cadence to the agent. The runtime registers two tools during a fire, continue_loop and complete_loop; the agent calls one before its turn ends, and the runtime moves the next fire time accordingly, bounded by a minimum and maximum delay and a stop condition. This is the autonomy story made literal, an agent that decides when it next needs to run. It is also where the cost-containment instinct from §13 reappears as policy rather than prose: a job snapshots its budget (tokens and cost per run, per day, a wall-clock cap), a retry policy, and the bridge's permission tier at create time, and restores that permission tier at fire time, so a scheduled run cannot quietly inherit a more permissive interactive session and burn budget unattended.
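The bounding step is small enough to show whole. This sketch assumes continue_loop carries a requested delay and complete_loop simply ends the loop; the argument shapes, PacingBounds, and next_fire are illustrative, not the runtime's real signatures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PacingBounds:
    min_delay: timedelta
    max_delay: timedelta


def next_fire(
    now: datetime,
    requested_delay: timedelta,
    bounds: PacingBounds,
    completed: bool,
) -> Optional[datetime]:
    """Clamp the agent's requested delay into the job's bounds; None ends the loop."""
    if completed:  # the agent called complete_loop (assumed semantics)
        return None
    delay = min(max(requested_delay, bounds.min_delay), bounds.max_delay)
    return now + delay
```

The agent proposes a cadence, but the job's bounds decide: a request to run again in ten seconds is raised to the minimum, and a request to sleep for a week is cut to the maximum.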

The always-on substrate

§7's Phase 2 named always-on agents as future, on the warm-pools axis. They arrived on a different one. A macOS LaunchAgent installs on the first durable job and runs the scheduler headless (--headless --scheduler-only), so jobs fire whether or not the desktop app is open. This is the substrate the roadmap wanted, reached by making jobs durable rather than by keeping workers warm. The honest scope: the scheduler queues into the existing bridge rather than a separate engine pool, and the broader always-on platform, an orchestrator/engine split and a general notification fabric, is designed and not yet shipped. What runs today is enough to put an agent on a clock and trust it to stay on its leash.

11 multimodal & recovery

Subagents that can see, and a runtime that bends

Two pieces of plumbing landed since v2 that do not change the architecture but change what a subagent can do and how the runtime behaves under pressure. They belong together because the second is the price of the first.

Seeing

view_image

Images into context

Loads an image from an artifact ref, a local path, or a URL, capped at six sources and fifty megabytes so one call cannot blow the budget. An agent can re-view a screenshot it took earlier.

browser_screenshot

Pixels, not a path

Returns the PNG bytes inline as an image block, so a browser-qa subagent sees the page it just exercised on its next turn instead of inferring it from the DOM.

cross-provider

Survives the model boundary

The same image tool-result serializes to Anthropic, to OpenAI's Responses API, and to Gemini's inline-data parts, so a verify on GPT and a browser-qa on Sonnet both see pixels.

scoped by type

Only who needs it

This is the §3 tool-surface argument in a new modality: browser-qa and performance carry the browser tools and inherit view_image; types with no business looking at images do not.

Not breaking

Images make requests large, and a large request can exceed the provider's body limit, a 413 that is distinct from running out of context: the message count is fine, the payload is too big. The runtime recovers in tiers, and the order is the whole point.

TIER 1 · elide

Prune the oldest images

Drop the oldest image bytes first, up to a few megabytes, and retry the request. A stale screenshot is the cheapest thing in the payload to lose.

TIER 2 · summarize

Only if pruning freed nothing

The summarizer is invoked only when there were no images to drop. Compacting prose is more lossy than dropping a picture, so it runs second, not first.

GOTCHA · thinking off

Summarize without reasoning

The summarizer runs with thinking disabled. A summarizer that spends its output budget on reasoning blocks emits an empty summary, and the compaction silently produces nothing.

This is maturity work, the kind §13 catalogs, but it is also the first crash-recovery-shaped mechanism in the runtime. The subagent state machine is still flat (§13), yet the context layer beneath it learned to survive a request that would previously have failed outright. Images are elided before the summarizer is ever called, which is the same instinct as the effect ledger in §8: when something has to be dropped, drop what is cheapest to lose and protect what is hardest to reconstruct.
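The tier order can be sketched as follows, assuming the context is a list of blocks ordered oldest first and that the summarizer is passed in as a callable. Block, the byte budget, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Block:
    kind: str            # "text" or "image"
    size: int            # payload bytes still carried by this block
    elided: bool = False


def elide_oldest_images(blocks: list[Block], byte_budget: int) -> int:
    """Tier 1: drop image bytes oldest-first until roughly `byte_budget` is freed."""
    freed = 0
    for block in blocks:  # blocks are assumed to be ordered oldest first
        if freed >= byte_budget:
            break
        if block.kind == "image" and not block.elided:
            freed += block.size
            block.size = 0
            block.elided = True
    return freed


def recover_from_413(
    blocks: list[Block],
    byte_budget: int,
    summarize: Callable[[list[Block]], None],
) -> str:
    """Try the cheap tier first; summarize only when there was no image to drop."""
    if elide_oldest_images(blocks, byte_budget) > 0:
        return "elided"
    summarize(blocks)  # Tier 2: run with thinking disabled, per the gotcha above
    return "summarized"
```

The caller retries the request after either tier. The summarizer is unreachable while any image remains to be dropped, so the more lossy step cannot run by accident.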

12 coordination patterns

Decision framework

Knowing what to delegate is harder than knowing how to delegate. The following decision framework captures the patterns we've found effective and the anti-patterns we've learned to avoid.

When to use each agent type

explore

Survey, map, understand

"What does the auth module look like?" "Find all REST endpoints in the codebase." "Research how the payment flow works." Use when the task is about understanding, not changing. The agent reads broadly but writes nothing.

explore-fast

Quick reconnaissance

"Is there a Dockerfile in this repo?" "What test framework does this project use?" Narrow, factual questions where speed matters more than depth. Runs on a rotation of fast models (Kimi / MiniMax / GLM, falling back to Haiku) with thinking off. Spawn three to five in background for breadth. The other twelve types (plan, review, test, browser-qa, performance, docs, memory-curator, specifier, and the two judges) follow the same shape: a narrow tool surface and a model policy matched to the job.

code

Implement, test, iterate

"Implement the JWT validation middleware." "Add unit tests for the user service." The agent has write access and test-running capability. High thinking effort because implementation requires careful reasoning. The most expensive type per invocation.

verify

Check, validate, report

"Run the test suite and report failures." "Check the auth flow for security vulnerabilities." The agent reads and runs tests but does not fix what it finds. The separation is deliberate: the generator-verifier pattern requires the verifier to be independent of the generator.

Foreground vs. background

Dimension	Foreground (sync)	Background (async)
Blocking	Parent waits for result before continuing	Parent continues working; checks results later
Use when	Result is needed for the next decision. Exploration that determines the plan.	Result is supplementary. Parallel work that improves quality but isn't blocking.
Context cost	Result enters parent context immediately	Result stays in artifact file until explicitly read
Example	explore("What auth mechanism does this repo use?") → determines which code type to spawn	verify("Run full test suite") → parent continues coding; reads results before PR
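The blocking difference in the table can be sketched with asyncio. run_subagent below is a hypothetical stand-in that only sleeps and returns a label; the real spawn call and its artifact handling are not shown here.

```python
import asyncio
from typing import List


async def run_subagent(agent_type: str, task: str, log: List[str]) -> str:
    """Stand-in for a real subagent; it only sleeps and returns a label."""
    await asyncio.sleep(0.01)
    log.append(f"{agent_type} done")
    return f"{agent_type}: {task}"


async def orchestrate() -> List[str]:
    log: List[str] = []
    # Foreground: the answer decides the next step, so the parent blocks on it.
    await run_subagent("explore", "What auth mechanism does this repo use?", log)
    # Background: start the work, keep going, and collect the result later.
    pending = asyncio.create_task(run_subagent("verify", "Run full test suite", log))
    log.append("parent keeps coding")
    await pending  # e.g. just before opening the PR
    return log


if __name__ == "__main__":
    print(asyncio.run(orchestrate()))
```

In the real system the background result would sit in an artifact file until the parent reads it; the sketch models only who waits for whom.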

Message bus vs. parent mediation

Dimension	Message bus	Parent mediation
Latency	Immediate. Sibling reads finding on next tool call.	2+ turns. Up to parent, processed, down to sibling.
Use when	Siblings working in parallel on related tasks. Findings from one improve another.	Parent needs to make a judgment call. "Sub_2 found an issue — should I redirect sub_3?"
Context cost	Zero parent context. Findings stay in bus.	Full parent context. Findings transit through parent window.
Control	Decentralized. Siblings self-coordinate.	Centralized. Parent decides what to relay.
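A minimal sketch of the bus side of this table, assuming an in-memory append-only log with one read cursor per sibling. publish_finding and read_findings are the names used in this post; the signatures, the Finding record, and the FindingBus class are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    author: str
    text: str


class FindingBus:
    """Append-only log; each reader keeps its own cursor into it."""

    def __init__(self) -> None:
        self._log: List[Finding] = []
        self._cursor: Dict[str, int] = {}

    def publish_finding(self, author: str, text: str) -> None:
        self._log.append(Finding(author, text))

    def read_findings(self, reader: str) -> List[Finding]:
        """Return findings published since this reader's last call, skipping its own."""
        start = self._cursor.get(reader, 0)
        self._cursor[reader] = len(self._log)
        return [f for f in self._log[start:] if f.author != reader]


bus = FindingBus()
bus.publish_finding("sub_2", "auth uses JWT")
print(bus.read_findings("sub_3"))  # sub_3 sees sub_2's finding on its next call
```

Nothing here passes through the parent, which is where the "zero parent context" entry in the table comes from. Parent mediation would instead return the finding to the orchestrator and have it decide what to relay.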

Anti-patterns

Over-delegation

Delegating a task that takes the parent 3 tool calls to a subagent that takes 10 tool calls (cold start + re-read + actual work). The overhead of spawning exceeds the benefit. Rule of thumb: only delegate if the task would take the parent 5+ turns or requires a different tool set.
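The rule of thumb reduces to a two-input check. should_delegate and its parameters are hypothetical names; the threshold of five turns is the one stated above.

```python
def should_delegate(
    estimated_parent_turns: int,
    needs_different_tools: bool,
    min_turns: int = 5,
) -> bool:
    """Delegate only when the work is long enough or needs another tool set."""
    return needs_different_tools or estimated_parent_turns >= min_turns


print(should_delegate(3, needs_different_tools=False))  # short task, same tools
print(should_delegate(3, needs_different_tools=True))   # short task, other tools
```

The turn estimate is a judgment the orchestrator makes, so the check is only as good as that estimate.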

Tool-set bloat

Giving a specialized agent too many tools defeats the purpose of specialization. An explore agent with write_file access will occasionally write files, polluting the read-only invariant. Rule: start with the minimum tool set and add tools only when you have evidence the agent needs them.
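One way to make the minimum-tool-set rule mechanical is to filter the tool list at spawn time. The sketch below assumes tool_include acts as an allow-list and tool_exclude as a deny-list that wins on conflict; that precedence, the AgentType class, and the read_file and grep tool names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class AgentType:
    name: str
    tool_include: Optional[FrozenSet[str]] = None  # None means "no allow-list"
    tool_exclude: FrozenSet[str] = frozenset()

    def allowed_tools(self, available: Iterable[str]) -> List[str]:
        """Keep tools that pass the allow-list and are not explicitly excluded."""
        tools = []
        for tool in available:
            if self.tool_include is not None and tool not in self.tool_include:
                continue
            if tool in self.tool_exclude:
                continue
            tools.append(tool)
        return tools


# An explore type that never sees write_file, so it cannot break read-only.
EXPLORE = AgentType(name="explore", tool_include=frozenset({"read_file", "grep"}))
print(EXPLORE.allowed_tools(["read_file", "write_file", "grep"]))
```

Because the filter runs before the agent starts, the read-only invariant does not depend on the prompt telling the agent not to write.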

Coordination overhead exceeding benefit

Three subagents communicating via message bus to solve a task that one agent could handle alone. The bus adds complexity without parallelism benefit. Rule: message bus is for independent parallel work with shared context needs. If agents are sequential (each waiting on the previous), use foreground delegation instead.

Type mismatch

Using explore for a task that requires writing files (explore has no write tools), or using code for pure research (wasting the high thinking budget). The type table above is the decision matrix: match the task to the type, don't force a type onto a task.

Ignoring artifact indices

The parent gets a manifest from wait_all but then asks for a full summary of all subagent work, effectively re-creating the blob problem. Rule: read specific artifacts on demand through the artifacts tool, not all artifacts in sequence. The manifest tells you which artifact is relevant: use that signal.

The meta-principle

Delegation is not parallelism. Delegation is context isolation. The primary benefit of subagents is not running things in parallel, though that helps. It is preventing one task's context from polluting another's. A subagent working in a focused context reasons better than one agent working in a context cluttered with every prior task's tool output, because attention is finite and every irrelevant token competes for it. Specialization reduces noise; isolation prevents pollution; persistence prevents loss.

13 what we learned from the field

Production lessons from Claude Code, OpenCode, and the research frontier

Other production systems have solved the same coordination and state problems, and reading how they did it surfaces both patterns Freyja is missing and choices it got right. Two engineering writeups (Claude Code, OpenCode) and the spring-2026 research frontier are the comparison set. The training signal (§6) is not in this list, because it isn't shipped: the lessons here are about coordination and state, not training.

Dual state machines: the maturity gap

Our SubAgentState is a flat 4-value enum: RUNNING → DONE | FAILED | CANCELLED. Both Claude Code and OpenCode use two independent state machines per agent:

Member Status (coarse)

5 states for lifecycle

ready → busy → ready / error / shutdown_requested → shutdown. Governs whether the agent can accept work, needs recovery, or should be cleaned up.

Execution Status (fine-grained)

6+ states for prompt loop

idle → starting → running → completing → completed → idle. The UI uses execution status for spinners; crash recovery uses member status for decisions.

Why it matters: we can't distinguish "agent initializing" from "agent actively running" from "agent wrapping up." The UI shows a spinner for all three. More critically, crash recovery needs to know whether an agent was genuinely busy or in a stale state. Splitting into two levels is on our immediate roadmap.
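A sketch of what the split could look like. MemberStatus and ExecutionStatus are the names from the roadmap below; the transition tables contain only the edges spelled out in the two cards above, and only the five execution states named there, so they are a reading of this post rather than either system's actual definition.

```python
from enum import Enum
from typing import Dict, Set, TypeVar

S = TypeVar("S", bound=Enum)


class MemberStatus(Enum):
    READY = "ready"
    BUSY = "busy"
    ERROR = "error"
    SHUTDOWN_REQUESTED = "shutdown_requested"
    SHUTDOWN = "shutdown"


class ExecutionStatus(Enum):
    IDLE = "idle"
    STARTING = "starting"
    RUNNING = "running"
    COMPLETING = "completing"
    COMPLETED = "completed"


M, E = MemberStatus, ExecutionStatus

# Only the edges written out in the text; a real table would likely have more.
MEMBER_TRANSITIONS: Dict[MemberStatus, Set[MemberStatus]] = {
    M.READY: {M.BUSY},
    M.BUSY: {M.READY, M.ERROR, M.SHUTDOWN_REQUESTED},
    M.SHUTDOWN_REQUESTED: {M.SHUTDOWN},
}

EXECUTION_TRANSITIONS: Dict[ExecutionStatus, Set[ExecutionStatus]] = {
    E.IDLE: {E.STARTING},
    E.STARTING: {E.RUNNING},
    E.RUNNING: {E.COMPLETING},
    E.COMPLETING: {E.COMPLETED},
    E.COMPLETED: {E.IDLE},
}


def transition(table: Dict[S, Set[S]], current: S, target: S) -> S:
    """Return target if the edge is allowed, otherwise raise ValueError."""
    if target not in table.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

With both values tracked, (busy, starting), (busy, running), and (busy, completing) become three distinct states instead of the single RUNNING the flat enum reports today.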

live · dual state machine transitions

Crash recovery: a narrower gap than v2 reported

This is the one claim from v2 that has already shifted. v2 reported no recovery for subagent state, and the automatic version is still unbuilt, but a narrow, supervised form shipped in the interim. A terminated subagent is archived to a disk sidecar, and a talk() aimed at it revives the session from that sidecar with the new message queued, falling back to a dead-inbox push when no sidecar is found. This is §5's re-wake extended to agents that have already exited, so a parent can deliberately bring an interrupted child back rather than losing it. What still does not survive a hard crash is in-flight work, and there is still no automatic bootstrap that force-transitions stale agents on restart. The artifact manifest from §4 remains the backstop: anything a child persisted before the crash survives because it is on disk. OpenCode's recovery sequence is the reference design for closing the remaining gap:

⊕ STEP 1

Register handlers

Permission restoration handler registered before recovery begins. Recovery may trigger cleanup, which may need to restore delegate-mode permissions.

⊕ STEP 2

Force-transition stale agents

All agents marked "busy" are force-transitioned to "ready." A system message is injected: [System]: Server restarted. Teammates interrupted.

⊕ STEP 3

Subscribe to cleanup after

Event subscriptions registered after recovery completes. Avoids spurious cleanup from the force-transitions in step 2.

Design decision: no auto-restart

Interrupted agents come back idle, not running. After a restart they are marked ready but do not resume on their own; a human re-engages them. The reason is cost containment: an auto-restart on crash can put several agents back into long-running tool loops unattended, and that failure mode (spawned agents consuming API budget with no one watching) is worse than the inconvenience of restarting them by hand. Freyja's re-wake mechanism (§5) is the right primitive to build a supervised resume on later: a parent could talk() each interrupted child back to life deliberately, rather than the runtime reviving all of them automatically.
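The three steps and the no-auto-restart decision fit in one function. This is a sketch of the ordering as described above, not OpenCode's code: AgentRecord, the per-agent inbox, and the two callables are assumed for illustration. Only the system message text is taken from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

RESTART_NOTICE = "[System]: Server restarted. Teammates interrupted."


@dataclass
class AgentRecord:
    agent_id: str
    status: str  # "ready", "busy", ...
    inbox: List[str] = field(default_factory=list)


def bootstrap_recovery(
    agents: Dict[str, AgentRecord],
    register_permission_handler: Callable[[], None],
    subscribe_cleanup: Callable[[], None],
) -> List[str]:
    """Run the three steps in order; return the ids that were force-transitioned."""
    register_permission_handler()       # step 1: before recovery begins
    interrupted = []
    for record in agents.values():      # step 2: busy -> ready, with a notice
        if record.status == "busy":
            record.status = "ready"
            record.inbox.append(RESTART_NOTICE)
            interrupted.append(record.agent_id)
    subscribe_cleanup()                 # step 3: only after recovery completes
    return interrupted                  # nothing is restarted automatically
```

The function returns the interrupted ids instead of resuming them, which leaves the decision to a human or, later, to a parent that re-wakes each child deliberately with talk().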

live · bootstrap recovery sequence

Token duplication: the hidden cost

Multi-agent systems pay a tax that single-agent systems do not: the same context gets re-sent to multiple agents. Published measurements across multi-agent frameworks put the redundancy high, on the order of half to most of the tokens consumed, depending on how aggressively context is shared.[R7]

Framework	Reported token duplication	Cost multiplier
CAMEL	86%	7.1×
MetaGPT	72%	3.6×
AgentVerse	53%	2.1×

External figures, cited as reported, not measured by this project.

Freyja's design keeps this low for structural reasons: the message bus shares concise findings rather than full context, artifact persistence avoids re-fetching, and per-type tool whitelists keep tool definitions lean. The duplication rate the runtime tracks sits well under the debate-style numbers above, because the bus carries summaries rather than replaying transcripts. This post puts no single figure on it because it moves with workload shape (a four-way parallel research fan-out duplicates more than a linear refactor), and a single headline number would imply a precision the measurement does not have. The structural argument is what generalizes; the instrumented rate confirms the direction rather than pinning a constant.
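The reported multipliers in the table are arithmetically consistent with reading the duplication rate d as the share of consumed tokens that are redundant, which gives a multiplier of 1 / (1 - d). That reading is our inference from the numbers, not a definition taken from the cited source. The second function is the ratio named in the roadmap below; both function names are ours.

```python
def cost_multiplier(duplication_rate: float) -> float:
    """Multiplier implied if duplication_rate is the redundant share of all tokens."""
    if not 0.0 <= duplication_rate < 1.0:
        raise ValueError("duplication_rate must be in [0, 1)")
    return 1.0 / (1.0 - duplication_rate)


def duplication_ratio(
    total_tokens_across_all_agents: int,
    hypothetical_single_agent_tokens: int,
) -> float:
    """Tokens actually spent, relative to a single-agent baseline."""
    if hypothetical_single_agent_tokens <= 0:
        raise ValueError("baseline must be positive")
    return total_tokens_across_all_agents / hypothetical_single_agent_tokens


for name, rate in [("CAMEL", 0.86), ("MetaGPT", 0.72), ("AgentVerse", 0.53)]:
    print(name, round(cost_multiplier(rate), 1))
```

The hard part of the ratio is the denominator: the single-agent token count is hypothetical, so any instrumented value depends on how that baseline is estimated.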

Peer-to-peer routing

Claude Code uses leader-centric routing (teammate to leader to teammate). OpenCode moved to a full peer-to-peer mesh and reported that the lead could then focus on orchestration instead of relaying messages. Freyja's publish_finding / read_findings bus is already peer-to-peer, so any sibling can read any other's findings directly without the parent in the path. The honest comparison: this converges on the same answer OpenCode reached, which is reassuring but not novel. The bus solves the routing-bottleneck problem and nothing more. It does not address state tracking or crash recovery, which are the two places (above) where Freyja is still behind.

What goes on the roadmap

Immediate

Dual state machines

Split SubAgentState into MemberStatus + ExecutionStatus. UI uses execution status for spinners; recovery uses member status for decisions.

Immediate

Crash recovery bootstrap

Persist subagent records to disk alongside artifacts. On bridge restart: force-transition stale agents, inject system notification, no auto-restart.

Near-term

Token duplication metrics

Instrument total_tokens_across_all_agents / hypothetical_single_agent_tokens. Surface in session export and in the trajectory records that §6's training loop would consume.

Near-term

Auto-wake for warm pools

Event-driven restart of idle agents on message bus delivery. The missing piece for persistent agent teams.

Future

Message bus TTL + namespacing

Auto-expire old messages for long-running sessions. Namespace isolation per agent role.

Future

Permission delegation

Orchestrator voluntarily restricts own tools during coordination-heavy phases, forcing focus on delegation.

What the comparison confirms

The field research lines up with Freyja's design: declarative agent types with tool filtering rather than undifferentiated workers, a peer-to-peer bus rather than a leader-centric relay, artifact persistence to files rather than in-memory only, and context-centric decomposition over role-based decomposition. Two of these (peer-to-peer routing, file persistence) are places other systems arrived at the same answer independently, so the agreement is convergence, not a lead. The open gaps are dual state machines and crash recovery, which are maturity work rather than architectural pivots. The training loop in §6 is the design choice the comparison set says least about, because none of these systems trains on its own delegation traces; Freyja's routing policy learns from a signal the rest of the field is still throwing away.
