Projects that compound
A runtime substrate for enterprise AGI, designed to ship in three groupings from the Champ June MVP onward.
Most enterprise AI today is a single-shot assistant with a fresh context every call. It doesn't know two teammates asked the same question yesterday, that the artifact it's about to draft exists in three revisions in a folder it can't see, or that the project has been running for six weeks and the next thing to do isn't the thing the user just asked for. The waste (duplicated drafts, repeated discovery, lost decisions, re-learned context) is the part of knowledge work nobody measures because measuring it gets uncomfortable.
The argument: durability moves from the model to the project. The model stays fungible; the project (decisions, artifacts with provenance, in-flight work, accumulated working knowledge) is the thing that compounds. Three user-visible behaviors fall out: continuity on entry, anti-duplication before work, skills that accumulate under light governance. All three are projections of the same shared state.
This inverts where the major platforms have landed. The enterprise-agent products shipped through 2026 differ in style and converge on one architecture: the model is the durable asset, the agent loop is the surface, and the business context is whatever gets loaded into the prompt each session. In the design below, the project is the durable asset, the agent loop is the substrate's read-write surface, and the model is a parameter. To compose with the rest of the field, the design uses MCP for tool integration and pluggable adapters for cross-vendor agent handoff; re-inventing the connector layer would be the wrong move.
The bet is doable in three groupings, in dependency order. The first ships a minimal-but-real event-sourced substrate plus the chokepoints needed for the June 15 KPMG Champ MVP; existing harness subsystems keep working and migrate to read from substrate projections over later phases. The second closes the smart loop. The third makes the substrate enterprise-deployable. Nothing requires throwing out what the harness has today, but the substrate goes into production from day one — partial in scope, complete in shape.
Three behaviors users see
1. Briefing on entry, not chat scrollback
Opening the project, the agent's first move is not to wait for a prompt. It briefs you. Not chat scrollback, not an LLM summary of past conversation. A structured projection from project events: current state, decisions and why, canonical artifacts, what is in flight and on whom, what is blocked, what to pick up next. Thirty seconds to read, every claim clickable.
The briefing isn't stored. It's projected from events emitted by the agent and your colleagues as they work. Nobody updates it. It's fresh by construction.
2. Anti-duplication before work
You type "draft the launch email." The agent doesn't draft. It responds: "v3 exists from Thursday; I'll iterate on that. Alex is editing a parallel draft right now. Coordinate or take over?" You choose. The agent works from the right artifact or routes you to the next item.
The user sees the check happen. An agent that silently picks the right artifact is indistinguishable from one that silently picks the wrong one. Visibility is how trust is earned.
3. Skills accumulate, uncertainty routes
The agent notices patterns and says so: "This project has run a stakeholder-impact check before exec comms three times. Make it default?" You accept, edit, or decline. Accepted suggestions become project-scoped skills applied by default on every future session for every member. The same loop runs at profile and org scope with progressively stiffer governance. Skills retire when usage drops or success rate decays. Skills are alive, not archived.
The same epistemic loop runs in the other direction at runtime. When the agent's confidence on a turn falls below threshold, the project doesn't bluff: the turn routes to a clarifying question, a stronger executor model, a paired reviewer that challenges the proposal before commit, or an HITL bubble-up. Skill accumulation and runtime uncertainty are two faces of the same project state — what the project knows over time, what it doesn't know right now, and how to act on either.
The structural shift
All three behaviors require the same move: the project becomes a first-class durable object that agent sessions and humans both read and write. The agent is one reactive participant in a shared, event-sourced project state, not an actor with private memory talking to one user.
Long-horizon work is the regime where today's AI is least useful and most expensive. A model that drafts an email in five seconds is impressive once; a team using it for six months learns the cost of getting context into the model exceeds the cost of the email. What breaks isn't the model. What breaks is everything around it: context loading, conflict resolution, trust that the agent knows what the team knows. Make the project the durable thing and the model becomes substitutable. Better models drop in (skills may need re-tuning against a new model's capabilities, but the substrate, the audit, the project state all survive); cheaper models handle cheap tasks; multiple models cooperate. The compounding asset is the project state.
What specifically compounds
"The project compounds" is the headline; the six items below are what that means in production. Each one is a substrate primitive whose quality measurably climbs as a project runs longer. Each one credits against a concrete signal that already lives in the event log.
| What compounds | Signal it credits against | Visible where |
|---|---|---|
| Suggested-next ranking | Accepted vs ignored vs dismissed across the suggested-next list, per suggestion-kind. Briefing prose improves as a downstream consequence: richer state to render from, better-ranked list to surface. | Surface 2; the ranking reflects the project's actual priorities after two to four weeks, and time-to-first-action shortens with it. |
| Skill library | Skill-applied-successfully and skill-applied-unsuccessfully events; per-scope maturation thresholds (Part 3). | Surface 5; verified skills graduate from in-context to retrieval-only, shrinking prompt overhead while raising hit rate. |
| Knowledge-gap closures | Low-confidence outputs and explicit don't-know responses, classified by source (missing skill, tool, memory, data). | Surface 8; an operator closes the gap, the next agent doesn't repeat it. |
| Cost ceilings | Per-Job and per-Profile cost rollups; budget alerts and reroute events. | Surface 8; budgets calibrate against actual work cost instead of guesses. |
| Reviewer accuracy | Reviewer-to-human agreement on the work-product checks that bubbled despite reviewer opinion. | Surfaces 5, 8; more triage handed off as accuracy climbs, less HITL load without quality dropping. |
| Access denial alternatives | Whether the agent's next attempt after a structured denial succeeded with the suggested alternative. | Surface 6; the agent learns which path works for this team's permissions, this project's policy. |
No single row is impressive on its own. The compounding claim is that every row sits on the same event log, credits against the same shape of signal, and improves whether or not anyone notices. The substrate is what makes "the project gets smarter" instrumented rather than aspirational.
Five durable things, one shape
Five things in the system want durability. Each is currently scattered across the harness, the runtime design, or implicit nowhere-state. The lift is naming them as one shape: an append-only event log plus materialized views.
| Thing | What it carries · where it lives today · where it lands |
|---|---|
| Configuration | What kind of agent, what tools, what skills, what model, what policy. Today: constructor arguments scattered across call sites. Lands as orchestrator AIE (per 07-mvp-and-roadmap.md §28). |
| Work containers | Bodies of work that span sessions, members, days. Today: the harness has Sessions; anything longer-lived lives in the Frontier App or the user's head. Lands as Project + Job. |
| Procedural knowledge | How to do things in this context, accumulated from usage. Today: the skill data model exists; the loop that fills it doesn't. Lands as Skills. |
| World state | What exists right now, who owns it, what is in flight, what was decided and why. Today: nothing above the session boundary. Lands as Memory + the project-state fields. |
| The trace | Every read, write, decision, tool call, denial, approval, sub-agent spawn, with causal lineage. Today: Prometheus and OpenTelemetry for ops visibility; no event log for the work itself. Lands as Audit. |
The structural claim: these five are the same underlying shape. A stream of immutable events, plus views projected on read. Audit is the storage by construction. Reversibility falls out of replay. Concurrency becomes a projection-time decision instead of a write-time lock. Memory and skills stop being two storage stories. The agent loop is one of many event emitters; humans, sub-agents, integrations, and approvers emit the same envelope shape.
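A minimal sketch of that shape, assuming an in-memory log. `Event`, `EventLog`, `project_in_flight`, and the `in_flight.completed` event type are illustrative names, not the harness's API; `in_flight.added` is the only type taken from this document. The sketch shows a view computed on read, and "state as of event N" obtained by replaying a prefix.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Event:
    """Immutable record; a simplified stand-in for the full envelope."""
    event_id: str
    type: str
    payload: dict[str, Any]


@dataclass
class EventLog:
    """Append-only: there is deliberately no update or delete method."""
    _events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self._events.append(event)

    def read(self, upto: Optional[int] = None) -> tuple[Event, ...]:
        # Reading a prefix is how replay and "state as of event N" work.
        return tuple(self._events[:upto])


def project_in_flight(events: tuple[Event, ...]) -> dict[str, str]:
    """Materialized view: intent -> actor for work started but not finished."""
    view: dict[str, str] = {}
    for ev in events:
        if ev.type == "in_flight.added":
            view[ev.payload["intent"]] = ev.payload["actor"]
        elif ev.type == "in_flight.completed":  # hypothetical event type
            view.pop(ev.payload["intent"], None)
    return view


log = EventLog()
log.append(Event("evt_1", "in_flight.added",
                 {"intent": "launch email v4", "actor": "alex"}))
log.append(Event("evt_2", "in_flight.completed",
                 {"intent": "launch email v4"}))
```

Calling `project_in_flight(log.read())` gives the current view, and `project_in_flight(log.read(upto=1))` gives the view as it stood after the first event. Nothing is stored except the events themselves.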
The project state object
One shared object sits at the center, organized in five buckets. The briefing, the duplicate check, and the skill loop are all projections of it. The decision-flow bucket is the one that isn't in the harness today (decisions, open_questions, in_flight, blockers); it lands in the Frontier App's document store alongside the existing plan, program_memory, and generated_docs (per 07-mvp-and-roadmap.md §28.7). Cheap to add; the discipline is writing to it on every relevant event.
| Bucket | Fields | What it carries |
|---|---|---|
| Identity | members | Who is on the project with roles. Lets the agent attribute actions, scope notifications, resolve names. Lets the briefing tell the next person who else has been working. |
| Artifacts | artifacts | Versioned, with a canonical pointer per artifact, provenance (which session, user, agent turn produced this version), and a semantic index. Canonical pointer answers "fetch latest"; the index answers "is there already a draft of X?" |
| Decision-flow state (new) | decisions, open_questions, in_flight, blockers | Decisions made (with alternatives, who decided, when, rationale); the things the project knows it doesn't know; work that's started-but-not-finished (intent + actor + start + expected outputs, feeds the duplicate check); what's waiting on what or whom. The briefing reads from all four. |
| Knowledge bindings | skills, memory | Project-scoped procedural knowledge (skills with confidence, success/failure counters, version) plus project-scoped state (calibrations, preferences, settled facts). Loaded into the system prompt at session start; shared writes go through a review queue. |
| Trace | events | Append-only log of every change to any of the above. The briefing is generated from this. The audit trail is this. Replay reads this. Source of truth for the four buckets above; everything else is a materialized projection. |
The first-turn reframe
The most consequential change to the agent loop sounds boring. The first turn of every session is not a response to the user's prompt. It is a structured read of project state, a generated briefing, and a check against in-flight work. The substantive response comes second.
The harness already has the chokepoint: every tool call passes through one execute path, every session has one initialization hook. The lift is two things in the hook — load project state into the prompt, and require the first model turn to emit a briefing before any tool call.
Both lifts are orchestrator-level policy (prompt construction and tool gating), not model changes. Anthropic's context engineering writeup covers the mechanism this rides on.
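The two lifts can be sketched as a small gate around the session hook and the tool-execute path. This is an illustration under assumptions, not the harness's code: `SessionGate`, `BriefingRequired`, and the `kind="briefing"` tag on a model turn are hypothetical names.

```python
from typing import Any, Callable


class BriefingRequired(RuntimeError):
    """Raised when a tool call is attempted before the briefing turn."""


class SessionGate:
    """Hypothetical wrapper around the init hook and the tool-execute path."""

    def __init__(self, project_state: dict[str, Any]) -> None:
        # Lift 1: project state is loaded into the prompt at session start.
        self.system_prompt = "PROJECT STATE:\n" + repr(project_state)
        self.briefing_emitted = False

    def record_model_turn(self, turn: dict[str, Any]) -> None:
        # Assumes the orchestrator tags the briefing turn with kind="briefing".
        if turn.get("kind") == "briefing":
            self.briefing_emitted = True

    def execute_tool(self, tool: Callable[..., Any], **args: Any) -> Any:
        # Lift 2: tool gating. No tool runs until a briefing has been emitted.
        if not self.briefing_emitted:
            raise BriefingRequired("first model turn must emit a briefing")
        return tool(**args)
```

Because the gate lives in the orchestrator, swapping the model leaves the first-turn behavior unchanged.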
How the briefing is generated
The briefing isn't an LLM summary of chat history. It's a deterministic projection of structured state, plus an LLM pass that translates structure to prose and ranks the suggested-next list. The structured part is verifiable; the prose part is constrained to operate only on facts the structured part provided.
Three steps. A deterministic query over project state yields the slice (recent decisions, open questions, in-flight items by actor, canonical artifacts with timestamps, blockers, recent event tail). The slice renders into a JSON intermediate. The model produces prose whose every claim is justifiable against the JSON, plus a ranked suggested-next list. The split exists to prevent confident hallucination of decisions that weren't made and artifacts that don't exist (see Chroma's context rot work; grounding on a provided JSON is one of the few mitigations that scales).
```json
// briefing intermediate (excerpt)
{
  "project": "GitHub NA rollout",
  "members_active_this_week": ["alex", "priya", "you"],
  "recent_decisions": [
    {"id": "d_412", "summary": "Use Q4-budget framing for exec comms",
     "decided_by": "alex", "at": "2026-05-18", "rationale_ref": "evt_8821"},
    ...
  ],
  "in_flight": [
    {"intent": "launch email v4", "actor": "alex", "started": "2026-05-21T09:14Z",
     "expected_outputs": ["artifact:launch_email"], "session_id": "sess_91"}
  ],
  "open_questions": [
    {"id": "q_18", "text": "Do we send dev-experience angle to leadership?", "raised_by": "priya"}
  ],
  "canonical_artifacts": [
    {"id": "launch_email", "version": 3, "last_edited": "2026-05-20T17:02Z",
     "last_edited_by": "agent", "external_state": "synced"},
    ...
  ],
  "blockers": [
    {"on": "engagement_dashboard", "waiting_for": "ops:provision-data", "since": "2026-05-19"}
  ],
  "suggested_next": [
    {"kind": "answer_open_question", "ref": "q_18", "rationale": "blocks email approval"},
    {"kind": "unblock", "ref": "engagement_dashboard"}
  ]
}
```
The user reads the prose. The structure is what lets the model write it without inventing.
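One way to enforce "constrained to operate only on facts the structured part provided" is a post-check on references. The sketch below assumes the prose pass returns the ids it cites alongside the text; that return shape, and the helper names `known_refs` and `ungrounded_refs`, are assumptions for illustration. The field names follow the excerpt above.

```python
from typing import Any


def known_refs(briefing: dict[str, Any]) -> set[str]:
    """Collect every id the structured intermediate actually contains."""
    refs: set[str] = set()
    for decision in briefing.get("recent_decisions", []):
        refs.add(decision["id"])
    for question in briefing.get("open_questions", []):
        refs.add(question["id"])
    for artifact in briefing.get("canonical_artifacts", []):
        refs.add(artifact["id"])
    for blocker in briefing.get("blockers", []):
        refs.add(blocker["on"])  # blockers are keyed by what they block
    return refs


def ungrounded_refs(briefing: dict[str, Any], cited: list[str]) -> list[str]:
    """Return cited ids that do not exist in the intermediate, in order."""
    known = known_refs(briefing)
    return [ref for ref in cited if ref not in known]
```

A non-empty result means the prose mentions a decision, question, artifact, or blocker the deterministic query never produced, so the briefing can be rejected or regenerated before the user sees it.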
The duplicate check
Two parts on different primitives.
- Concurrent. When a session starts substantive work it claims an in_flight slot with intent, expected outputs, timeframe, and a heartbeat. The agent's pre-work step checks for overlap with active slots. Sessions subscribe to in_flight.added events; the runtime fans new claims out. Heartbeats refresh on every meaningful turn; claims expire after a configurable TTL (30 minutes default) without renewal, so a crashed session doesn't permanently block parallel work.
- Retrospective. A semantic index over project artifacts, embedded on create/update. Every artifact-creating tool's system prompt requires a pre-call lookup against the index; matches above threshold surface as candidates.
Retrospective is only as good as the index; concurrent is only as good as sessions claiming slots and renewing heartbeats. The runtime auto-claims when a creation tool fires (carrying most of the discipline) and the heartbeat ticks per turn. The in_flight registry ships in Grouping 1 because the cost is trivial and the payoff is immediate. The semantic index ships in late Grouping 1 or early Grouping 2 — embedding model choice, refresh cadence, and per-intent similarity thresholds are non-trivial calibration work, and the first Champ sessions start with zero artifacts to index against anyway. The intervention shown in Surface 3 lights up progressively as the index fills in.
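The concurrent half can be sketched as a registry of claims with heartbeats and a TTL. This is a minimal in-memory version under assumptions: `InFlightRegistry` and its method names are hypothetical, overlap is defined here as sharing at least one expected output, and the caller passes `now` in seconds explicitly so the expiry logic is easy to test.

```python
from dataclasses import dataclass

DEFAULT_TTL_SECONDS = 30 * 60  # the 30-minute default named in the text


@dataclass
class Claim:
    session_id: str
    actor: str
    intent: str
    expected_outputs: frozenset[str]
    last_heartbeat: float


class InFlightRegistry:
    def __init__(self, ttl_seconds: float = DEFAULT_TTL_SECONDS) -> None:
        self.ttl = ttl_seconds
        self._claims: dict[str, Claim] = {}

    def claim(self, session_id: str, actor: str, intent: str,
              expected_outputs: list[str], now: float) -> None:
        self._claims[session_id] = Claim(
            session_id, actor, intent, frozenset(expected_outputs), now)

    def heartbeat(self, session_id: str, now: float) -> None:
        # Called on every meaningful turn to renew the claim.
        if session_id in self._claims:
            self._claims[session_id].last_heartbeat = now

    def active(self, now: float) -> list[Claim]:
        # A claim without renewal inside the TTL no longer blocks anyone.
        return [c for c in self._claims.values()
                if now - c.last_heartbeat < self.ttl]

    def overlaps(self, session_id: str, expected_outputs: list[str],
                 now: float) -> list[Claim]:
        """Other sessions' live claims that share an expected output."""
        wanted = set(expected_outputs)
        return [c for c in self.active(now)
                if c.session_id != session_id and c.expected_outputs & wanted]
```

The pre-work step would call `overlaps` before drafting and surface any hits to the user, which is the "Alex is editing a parallel draft" prompt from the second behavior.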
AIE invocation, in mechanism
The orchestrator AIE calls other AIEs as tools. Each invocation rides on the same chokepoint and emits the same envelope shape as everything else. From 04-execution-paths.md §13: the NominatedAIETool wrapper creates or reuses an AIE conversation, sends the input payload, streams intermediate events via the existing SSE bridge, polls for artifact creation events via the existing poller, and returns a structured ToolResult with payload + artifact references. The orchestrator perceives every AIE invocation as a clean tool call.
What's new at the substrate layer: every invocation step is an event. aie.invoked with target AIE id + input payload + prompt hash; aie.streaming for each intermediate chunk; aie.completed with the final ToolResult; artifact.created for any artifacts produced; all linked via caused_by back to the orchestrator's LLM turn that requested the invocation. The audit chain walks across orchestrator and child AIE events in one graph. HITL events raised inside the child AIE bubble up via parent_handler_ref on the envelope (per 03-runtime-mechanics.md §15) and surface in the parent Job's HITL queue (Surface 4) with dedup by integration auth target.
For HR Suite's parallel dispatch (Maya's pay-equity scenario), the orchestrator emits N concurrent aie.invoked events; the runtime uses the existing execute_parallel machinery; each child streams independently; the merge is one more event in the parent's log. Background jobs (Maya's quarterly aggregations, attrition-risk monitoring) are AIE invocations with mode=background — they return immediately, the orchestrator continues other work, results auto-deliver on the next turn or via wait_all.
How skills evolve, in mechanism
The skill loop balances accumulation against pollution. Too little and the project never gets smarter; too much and the library bloats with low-signal proposals that drown the useful ones and burn prompt budget.
Three lifecycle states with a continuous confidence score, not discrete maturation buckets. Proposed: a candidate skill with name, procedural description, source events as evidence, proposed scope; user-scope auto-commits, shared-scope lands in the review queue. Live: loaded into the prompt at session start; usage signal (success/failure ratio plus firing recency) updates the confidence score continuously; the score gates whether the skill stays in-context or moves to retrieval-only. Retired: confidence below threshold or no firings in some window; auto-flag for deprecation review, evidence retained. The diagram below traces the lifecycle from observation through maturation; the three runtime states above are how implementations track each skill at any moment.
The maturation gate prevents both pollution and over-conservatism. Firings that help credit the skill; firings the user rejects debit it. Past a threshold of negative signal or no firings in some window, the skill auto-flags for deprecation. Thresholds are per-scope: org-scope skills retire slowly because they are high-investment, project-scope skills retire fast because projects move through phases. AgentRR-style record-and-replay is the telemetry shape — every firing is an event with enough metadata to compute the success signal post-hoc.
Rejected proposals retain their evidence. Without that, the same pattern observed across sessions regenerates the same proposal repeatedly and burns reviewer attention. With it, the system recognizes a pattern has been proposed and rejected and only re-proposes if conditions have meaningfully changed. Small in the prototype, indispensable at org scale.
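The credit/debit loop and per-scope retirement can be sketched as follows. The document does not specify the scoring formula, so everything numeric here is an assumption: the smoothed success ratio, the thresholds in `POLICIES`, and the idle windows are placeholder values chosen only to show project-scope skills retiring faster than org-scope ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetirePolicy:
    min_confidence: float  # below this, flag for deprecation review
    idle_days: int         # no firings for longer than this, flag as well


# Hypothetical values: project scope retires fast, org scope slowly.
POLICIES = {
    "project": RetirePolicy(min_confidence=0.5, idle_days=30),
    "org": RetirePolicy(min_confidence=0.3, idle_days=180),
}


@dataclass
class SkillStats:
    scope: str
    successes: int = 0
    failures: int = 0
    last_fired_day: int = 0

    def record(self, helped: bool, day: int) -> None:
        # Firings that help credit the skill; rejected firings debit it.
        if helped:
            self.successes += 1
        else:
            self.failures += 1
        self.last_fired_day = day

    @property
    def confidence(self) -> float:
        # Smoothed success ratio, so one early failure does not zero a skill.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def should_flag_for_deprecation(skill: SkillStats, today: int) -> bool:
    policy = POLICIES[skill.scope]
    idle = today - skill.last_fired_day
    return skill.confidence < policy.min_confidence or idle > policy.idle_days
```

With one helpful firing and three rejected ones, the same counters flag a project-scope skill but leave an org-scope skill live under these placeholder thresholds, which is the asymmetry the maturation gate describes.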
Skills as the project's working knowledge
Briefing and duplicate check make project state legible. The skill loop makes the project's accumulated way of working durable. A new person on a six-month project inherits not just artifacts and decisions but how the project has learned to work. The three behaviors are facets of one structural choice: continuity = project state made legible; anti-duplication = project state checked before work; skill evolution = the project's working knowledge and its acknowledged unknowns maintained as part of that state.
The reviewer pattern, generalized
The orchestrator AIE generates; a reviewer-configured orchestrator AIE evaluates. Same type=orchestrator AIE (per 02-architecture-overview.md §8), with a mode: "reviewer" configuration variant: different system prompt tuned for evaluation rather than execution, smaller-default model, read-only tool roster plus a small action set (advance / send-back / return). Not a new AIE type; not a new permission model; not separate infrastructure. The reviewer's accuracy against human ground truth is tracked the same way executor accuracy is.
One primitive, three modes. All three reuse the same AIE infrastructure; the difference is when the reviewer runs relative to the executor.
- Triage reviewer (post-execution). Reads a completed unit of work, advances if it meets the bar, sends back with structured feedback otherwise. A 94% reviewer-to-human agreement rate means the reviewer autonomously triages 94% of work-product checks; the 6% disagreements bubble to humans, and irreversible or shared-scope actions bubble unconditionally regardless of reviewer opinion. This is the default mode, and the one that ships first in Grouping 2.
- Challenger reviewer (parallel-to-execution). Runs alongside the executor on high-stakes work, challenging proposed decisions before commit. The signature is a paired model with a different prompt and a different prior, surfacing dissent the executor wouldn't see in itself. Used for HR Suite pay-equity recommendations, Finance approval-gate decisions, anything where a wrong commit is expensive to roll back. The pattern reads as constitutional AI at the orchestration layer rather than the model layer; the constitution is in the reviewer's skill folder, not baked into weights.
- Curator reviewer (post-hoc, scheduled). Reads completed sessions on a cadence (overnight, weekly), mines them for skill candidates and guardrail proposals, files them into the review queue (Surface 5). This is the same flywheel Anthropic ships as "Dreaming" in Managed Agents; the curator-config orchestrator runs as a background AIE on schedule, no new primitive needed.
Where the reviewer sits relative to HITL is the important part. The reviewer is a triage / challenger / curator layer, not a substitute for human approval on the things humans must approve. Shared-scope memory updates, plan changes, and side-effecting actions still bubble to a human per 03-runtime-mechanics.md §15. The reviewer handles the within-execution quality bar ("did this generated doc meet the project's conventions") and emits structured feedback that the observation pipeline mines for skill candidates.
The reviewer is itself event-sourced, accuracy-tracked, and forkable. Every reviewer decision lands as an event with caused_by lineage back to the executor emission it judged; the reviewer's own profile evolves through the same skill-loop machinery the executor does. Reviewer accuracy is one of the six things that compound (Part 1): more triage handed off as it climbs, less HITL load without quality dropping. The triage mode ships in Grouping 2; challenger and curator modes ship as configuration variants of the same primitive, not new code paths.
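The triage rules reduce to a small routing function plus an accuracy measure. A sketch under assumptions: the verdict strings, the return labels, and the function names are illustrative, and agreement is computed only over checks that reached a human, matching the "bubbled despite reviewer opinion" signal in Part 1.

```python
def route_check(reviewer_verdict: str, irreversible: bool,
                shared_scope: bool) -> str:
    """Decide what happens to one completed unit of work."""
    # Irreversible or shared-scope actions bubble regardless of the reviewer.
    if irreversible or shared_scope:
        return "bubble_to_human"
    return "advance" if reviewer_verdict == "pass" else "send_back"


def agreement_rate(bubbled: list[tuple[str, str]]) -> float:
    """Reviewer-to-human agreement over (reviewer, human) verdict pairs."""
    if not bubbled:
        raise ValueError("no human-reviewed checks to score against")
    agreed = sum(1 for reviewer, human in bubbled if reviewer == human)
    return agreed / len(bubbled)
```

The unconditional bubbles do double duty: they keep humans on the approvals they must own, and they supply the ground-truth pairs that `agreement_rate` needs.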
Buildout on a timeline
Each grouping depends on substrate primitives the previous one introduced. Inside a grouping, items can ship in any order; across groupings, the sequence is fixed. The buildout maps to the existing phasing in 07-mvp-and-roadmap.md §27: Grouping 1 ≈ Phase 1 (June 15); Grouping 2 ≈ Phase 2; Grouping 3 ≈ Phases 3 and 4. The architecture beneath all three groupings is the same three-layer stack (substrate / runtime / surfaces) shown later in this part.
The shape, layered
Three layers, in stack order from durable to surface. The substrate at the bottom: append-only event log plus a projector library plus a small set of named subsystems (profiles, projects, skills, memory, audit, access). The runtime in the middle: the existing harness, with its turn loop, tool-execute chokepoint, prompt builder, sub-agent surface, HITL handler, now reading from substrate projections instead of in-memory state. The surfaces at the top: Frontier Apps, Autopilot chat, the API, workflow nodes, each composing the runtime with a different presentation and a different profile.
Memory in four scopes (canonical)
Memory is the most-conflated word in agent-system design. It means at least four different things attached to four different units with four different write policies. Name them separately or they collapse into "the agent has a memory thing."
- Session memory carries what was learned inside one conversation: the working knowledge that survives compaction even when the raw transcript doesn't. Writes auto-commit.
- Project memory carries decisions, calibrations, conventions, in-flight claims, blockers, open questions, and canonical artifact pointers for one body of work. Shared-scope writes go through propose-approve.
- Profile memory carries cross-project calibrations, templates, and graduated skill candidates for a way of working (the Champ profile's accumulated knowledge across every change-management program KPMG has ever run). Writes need profile-admin approval.
- Org memory carries brand voice, compliance constraints, audit retention rules, and regulatory regime: the things that apply firm-wide. Writes need tenant-admin approval.
Org memory is intentionally narrow. Most of what people reach for as "org-wide" lives more accurately at profile scope — the unit of competence is one way of working, not the whole firm. A user is the actor in every event but is not itself a memory scope. The harness's older eight-scope listing in 03-runtime-mechanics.md §10 adds Application, AIE, and Integration as advanced opt-ins (the AIE scope for an AIE that accumulates cross-program knowledge, like HR Suite's Readiness Assessor; the Integration scope for canonical-version pointers in Google Docs), but they don't show up in the briefing or any first-class UI.
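The four scopes and their write policies fit in one lookup table. The sketch below is illustrative: `Scope`, `WRITE_POLICY`, and `MemoryStore` are hypothetical names, and the policy labels are shorthand for the rules stated above.

```python
from enum import Enum
from typing import Any


class Scope(Enum):
    SESSION = "session"
    PROJECT = "project"
    PROFILE = "profile"
    ORG = "org"


# One write policy per scope, as described in the text.
WRITE_POLICY = {
    Scope.SESSION: "auto_commit",
    Scope.PROJECT: "propose_approve",
    Scope.PROFILE: "profile_admin_approval",
    Scope.ORG: "tenant_admin_approval",
}


class MemoryStore:
    def __init__(self) -> None:
        self.committed: dict[tuple[Scope, str], Any] = {}
        self.review_queue: list[dict[str, Any]] = []

    def write(self, scope: Scope, key: str, value: Any, actor: str) -> str:
        policy = WRITE_POLICY[scope]
        if policy == "auto_commit":
            self.committed[(scope, key)] = value
            return "committed"
        # Every other scope lands in a queue until the right approver acts.
        self.review_queue.append(
            {"scope": scope, "key": key, "value": value,
             "actor": actor, "needs": policy})
        return "queued"
```

Keeping the policy in data rather than in branching code is what keeps the scopes from collapsing into one undifferentiated "memory thing".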
Four design positions
Four choices about substrate shape carry the rest of the design. The first three define the event log itself; the fourth is the engineering move that unlocks everything downstream.
1. Events have a common envelope, not a schema. Every event carries the same outer fields: id, type, actor, project-id, profile-id, session-id, caused-by, timestamp, payload, content-hash, optional signature. Type is a string the producer chooses; payload is opaque JSON. Audit and trace surfaces work against the envelope; subsystems work against their own payload shapes. New event types don't require central registration. Profiles, projects, skills, memory, artifacts — all the same shape, all event-sourced. Surfaces (Frontier Apps, Autopilot, API, workflow-engine nodes) compose the runtime with a profile via this same uniformity; adding a surface is one config row plus a UI shell, not a fork.
2. Projections owned by named subsystems. The substrate provides ordered reads, append writes, and a projector library. It does not define "the canonical view of a project" or "the current state of skills." Each subsystem (profiles, projects, skills, memory, audit, access) owns its projections, its read API, its invalidation rules. Adding a new subsystem doesn't touch the substrate.
3. Causal lineage in the envelope, not inferred. Every event names the event that caused it. A tool call names the LLM message that requested it. A skill proposal names the tool calls that triggered the observation. A patch names the proposal. Missing causes are an error on write, not a warning later. "Why did this decision happen" becomes a graph walk, not a database join across timestamps.
4. Prompts are content-hashed. Every prompt sent to the model lands as an event whose payload includes the prompt hash; the prompt text is stored once keyed by hash. Replay against the same hash produces deterministic inputs (the model's output may still sample differently, but the input is reproducible). Fork-and-diff against a different hash separates prompt changes from output changes. This is what makes regression testing, model-swap evaluation, and root-cause replay tractable.
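Position 4 can be sketched with the standard library. `PromptStore`, `prompt_event`, and the `llm.prompt_sent` event type are hypothetical names; the `sha256:` prefix follows the example envelope, and the hash is placed in the payload as the text describes.

```python
import hashlib
from typing import Any


class PromptStore:
    """Stores each distinct prompt text once, keyed by its content hash."""

    def __init__(self) -> None:
        self._texts: dict[str, str] = {}

    def put(self, text: str) -> str:
        digest = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._texts.setdefault(digest, text)  # identical text is not re-stored
        return digest

    def get(self, prompt_hash: str) -> str:
        return self._texts[prompt_hash]

    def __len__(self) -> int:
        return len(self._texts)


def prompt_event(store: PromptStore, event_id: str, text: str,
                 caused_by: str) -> dict[str, Any]:
    """Build the event for one prompt; only the hash travels in the log."""
    return {
        "event_id": event_id,
        "type": "llm.prompt_sent",  # hypothetical type name
        "caused_by": caused_by,
        "payload": {"prompt_hash": store.put(text)},
    }
```

Two events that carry the same hash were given byte-identical input, so any difference in what followed comes from sampling, the model, or policy rather than from the prompt.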
The event envelope
The shape every event takes:
```json
{
  "event_id": "evt_a7d2e9b1",
  "type": "decision.recorded",
  "actor": { "id": "usr_7f3b", "name": "Sam", "role": "compliance_analyst" },
  "project_id": "proj_q3cr",
  "profile_id": "prof_champ",
  "session_id": "sess_91",
  "caused_by": "evt_4b9e1d0a",
  "timestamp": "2026-05-29T10:19:26.341Z",
  "prompt_hash": "sha256:6f3a...",
  "payload": { ... subsystem-specific ... },
  "signature": "ed25519:..."
}
```
The outer fields are uniform; the payload is the subsystem's. Audit, replay, fork, diff, and lineage walk all work on the envelope. Subsystems read their own payload shapes for their projections. The signature is optional for most types, mandatory for the ones regulated surfaces care about (Grouping 3).
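The lineage walk, and the rule that a missing cause is an error on write, can be sketched against the envelope alone. Assumptions are loud here: `LineageLog` and `MissingCause` are hypothetical names, and the sketch treats `user.message` as a root type that may legitimately have no cause, which the document does not specify.

```python
from typing import Any, Optional

ROOT_TYPES = {"user.message"}  # assumption: events that may start a chain


class MissingCause(ValueError):
    """Raised at write time when caused_by names no known event."""


class LineageLog:
    def __init__(self) -> None:
        self._by_id: dict[str, dict[str, Any]] = {}

    def append(self, event: dict[str, Any]) -> None:
        cause = event.get("caused_by")
        if event["type"] not in ROOT_TYPES and cause not in self._by_id:
            # An error on write, not a warning discovered during an audit.
            raise MissingCause(f"{event['event_id']} has unknown cause {cause!r}")
        self._by_id[event["event_id"]] = event

    def why(self, event_id: str) -> list[str]:
        """Walk caused_by links from an event back to the root of its chain."""
        chain: list[str] = []
        current: Optional[dict[str, Any]] = self._by_id.get(event_id)
        while current is not None:
            chain.append(current["event_id"])
            current = self._by_id.get(current.get("caused_by"))
        return chain
```

Because a cause must already exist when an event is written, the graph cannot contain cycles and `why` always terminates.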
Capabilities that fall out of this shape
The envelope, the content-hashed prompts, the projector library, the causal lineage in every event — these are the engineering. The capabilities below are what the engineering buys. Each one is a projection over primitives the substrate already has, not a feature on the side.
Deterministic replay, four flows
Content-hashed prompts plus event-sourced everything else add up to a regression-and-experimentation surface today's hyperscaler platforms don't expose. Maxim, Galileo, and Holistic Agent Leaderboard run agent evals against simulated scenarios pre-deployment; none ship a production replay primitive that re-executes a real customer interaction byte-for-byte from durable logs against a new model, prompt, or policy. Four workflows fall out for free once the substrate is in production:
- Model swap evaluation. Take a real Champ session that ran last week. Re-execute its event sequence against a different model (Claude 4.6 → Sonnet 4.7, GPT-4o → Haiku 4.5) with the same prompt hashes. Diff the resulting decision.recorded and tool.requested streams. Cost / latency / output deltas without re-instrumenting anything.
- Prompt A/B. Fork a project at any event; in the fork, change a system-prompt skill folder; let it replay forward. Two event logs diverge from a known shared prefix. The diff between divergence points is the prompt change's effect.
- Root-cause replay. Production session went sideways at 14:32Z. Open the audit chain (Surface 6); walk back to the divergence point; replay forward with debug-level verbosity; observe what tool result the model misread or which skill misfired. Same UI surface, different operator mode.
- Deploy-gate regression. Before promoting a new orchestrator version or a new model, replay last quarter's prod sessions against the candidate. The deploy gate is "diff stays below threshold." This is what Sarbanes-Oxley auditors and EU AI Act reviewers actually want — not a static eval suite, but a reproducible test that the system being shipped behaves like the system that was certified. The same primitive answers the data-sovereignty question for any hosted agent product: a customer can export their event log, run it on their own substrate, prove what happened.
None of this is four different systems. It's one event log, one replayer, one fork primitive, four operator modes. One caveat: model outputs are stochastic, so replay reproduces inputs deterministically but outputs may sample differently. For input-determinism this is enough; for output-determinism you also pin temperature and seed (and accept that frontier models don't always honor seed; replay-with-tolerance is the production stance).
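The diff step shared by all four modes can be sketched as a comparison of two event streams. This is a minimal illustration: `first_divergence` and `deploy_gate` are hypothetical names, events are plain dicts with `type` and `payload`, and "tolerance" is simplified here to the fraction of replayed sessions that diverge at all.

```python
import json
from typing import Any, Optional

Events = list[dict[str, Any]]
COMPARED_TYPES = ("decision.recorded", "tool.requested")


def _stream(events: Events) -> list[tuple[str, str]]:
    # Canonical JSON so key order in the payload does not count as a diff.
    return [(e["type"], json.dumps(e["payload"], sort_keys=True))
            for e in events if e["type"] in COMPARED_TYPES]


def first_divergence(baseline: Events, candidate: Events) -> Optional[int]:
    """Index in the compared stream where the two runs first differ."""
    a, b = _stream(baseline), _stream(candidate)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Identical up to the shorter stream: equal lengths means no divergence.
    return None if len(a) == len(b) else min(len(a), len(b))


def deploy_gate(pairs: list[tuple[Events, Events]],
                max_diverged_fraction: float) -> bool:
    """Pass when the share of diverging sessions stays within the threshold."""
    if not pairs:
        return True
    diverged = sum(1 for base, cand in pairs
                   if first_divergence(base, cand) is not None)
    return diverged / len(pairs) <= max_diverged_fraction
```

A production gate would likely score how far and how badly runs diverge rather than counting them, but the shape is the same: one replayer output, one diff, one threshold.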
Chained delegated identity through sub-agents
Every event in the envelope names an actor. The orchestrator runs on behalf of a user; a sub-agent invoked by the orchestrator runs on behalf of the same user but through the orchestrator's profile-allowed tool roster; a cross-vendor handoff (over A2A, an MCP server-server call, ANP, or a direct API) extends the chain across the vendor boundary. The substrate writes every link into the log, and the audit walk reconstructs "which principal authorized which sub-agent under what scope to produce which load-bearing claim in the final output."
Gartner publicly predicts more than half of agentic AI initiatives will stall on identity by 2028. CoSAI and NIST are starting work on delegated principal auth through sub-agent chains; no shipped product solves it. A2A signs agent cards (the sender's domain identity); ANP signs via decentralized identifiers; none of the candidate cross-vendor protocols publicly sign the chain that produced the call. This is the substrate's most direct enterprise-security claim: confused-deputy attacks across multi-agent orchestrations become inspectable, not theoretical. Ships partially in Grouping 1 (actor on every event, caused_by lineage), formalized in Grouping 3 (signed chain at every handoff boundary, role-checked at each substrate write).
Knowledge gaps as a first-class telemetry signal
Every vendor instruments tokens and latency. Nobody publicly instruments what the agent didn't know. The observation pipeline (Grouping 2) already watches tool sequences for skill-candidate patterns; extending it to watch low-confidence outputs and explicit don't-know responses turns them into a tracked signal. The operator console (Surface 8) ranks gaps by frequency, classifies by source (missing skill, missing tool, missing memory, missing data source), and links each one to an authoring affordance. Closing a gap is the operator-loop ingredient that makes the project measurably smarter week-over-week. Without instrumenting it, "the project compounds" stays a hope.
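As a rough sketch of the ranking step, assume the observation pipeline hands over `(question, confidence, source)` tuples for each low-confidence or don't-know response. The tuple shape, the `rank_gaps` name, and the 0.6 cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def rank_gaps(observations, threshold=0.6):
    """Group low-confidence answers by question and rank them by frequency."""
    freq = Counter()
    best_conf = {}
    sources = defaultdict(Counter)
    for question, confidence, source in observations:
        if confidence >= threshold:
            continue  # confident answers are not knowledge gaps
        freq[question] += 1
        best_conf[question] = max(best_conf.get(question, 0.0), confidence)
        sources[question][source] += 1
    return [
        {
            "question": question,
            "freq": count,
            "best_conf": best_conf[question],
            # Most frequently assigned source class for this question.
            "likely_source": sources[question].most_common(1)[0][0],
        }
        for question, count in freq.most_common()
    ]
```

Each returned row corresponds to one line of the operator console's gap panel: frequency, best confidence seen, and the likely source that decides which authoring affordance to offer.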
Hard budget enforcement at the loop
Every FinOps vendor in 2026 ships cost dashboards. The substrate has the chokepoint where enforcement actually lives: the tool-execute path and the per-turn LLM call. Three enforcement modes, all in Grouping 3:
- Kill at ceiling. Per-Job, per-tenant, or per-Profile budget cap. On overrun the orchestrator session enters terminating state, finishes its current turn, emits a budget.exceeded event, and suspends. Resume requires a budget extension event.
- Reroute under pressure. Above N% of ceiling, the runtime swaps the executor to a cheaper model tier; the reviewer profile keeps its quality model. A budget signal in the prompt is a suggestion; this is enforcement.
- Pause for approval. The orchestrator emits a HITL envelope with type=budget_exceeded; the user or designated approver confirms continue or abort. Same HITL pipeline as everything else, dedup-by-key applies (a runaway parallel dispatch surfaces one approval, not N).
OpenAI's Workspace Agents session billing and Anthropic's $0.08/session-hour-plus-tokens are both opaque and unbounded by default. The substrate gives enterprises the runtime gate.
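The three enforcement modes can be sketched as a single check that runs before each LLM call or tool execution. This is an illustration under stated assumptions: the `Budget` dataclass, the `enforce` function, the 80% reroute point, and the returned dictionaries are hypothetical; only the budget.exceeded event name and the type=budget_exceeded HITL type come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    ceiling_usd: float
    spent_usd: float = 0.0
    reroute_at: float = 0.8  # hypothetical "N% of ceiling"


def enforce(budget: Budget, next_call_cost: float, mode: str = "kill") -> dict:
    """Decide what the loop does before spending next_call_cost."""
    projected = budget.spent_usd + next_call_cost
    if projected > budget.ceiling_usd:
        if mode == "pause":
            # Pause for approval: surface a HITL envelope instead of suspending.
            return {"action": "hitl", "type": "budget_exceeded"}
        # Kill at ceiling: finish the turn, emit the event, suspend.
        return {"action": "suspend", "event": "budget.exceeded"}
    if projected >= budget.reroute_at * budget.ceiling_usd:
        # Reroute under pressure: executor moves to a cheaper tier.
        return {"action": "reroute", "executor_tier": "cheaper"}
    return {"action": "proceed"}
```

The point of the sketch is placement: the decision is made in the execution path, so the agent cannot ignore it the way it can ignore a budget hint in the prompt.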
Bring-your-own-model, bring-your-own-deployment
Replay treats the model as a parameter. The audit chain doesn't care which provider produced a turn. Effective-access doesn't care where the deployment runs. A customer can replace the executor with their own fine-tune, their own hosted Claude in their VPC, OpenAI Frontier behind their account, a local Llama variant — replay still works, audit still walks, skills still apply, the project's accumulated state survives the swap. This is the substrate's structural answer to model lock-in, the standing concern with any hosted agent product. The orchestrator AIE config has a model_pin field; per-tool model overrides land on the same event log; the reviewer's model is configured independently so quality and cost can be tuned separately.
Confidence-gated execution
The orchestrator can route a turn through a reviewer profile when its confidence drops below threshold, surface a HITL clarification, or escalate to a stronger executor model. The confidence signal already exists at the LLM-call boundary (logprobs blended with self-reported confidence per the harness's existing process_completion). Wiring it into the runtime as a routing decision is one chokepoint addition: tool-execute inspects pre-call confidence and chooses among proceed / escalate-model / paired-reviewer / HITL. The hosted-agent products in the field don't carry a first-class confidence-gating primitive; uncertainty handling is left to the model. Ships in Grouping 2 alongside the reviewer pattern (Part 3); the reviewer is the natural target when confidence is low and the work is reversible.
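A small sketch of the routing decision, assuming a simple linear blend of mean token probability and self-reported confidence. The blend formula, the weights, and both thresholds are hypothetical; the harness's actual process_completion logic is not shown in this document.

```python
import math


def blended_confidence(token_logprobs, self_reported, weight=0.5):
    """Blend the geometric-mean token probability with self-reported confidence."""
    if token_logprobs:
        mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    else:
        mean_prob = 0.0
    return weight * mean_prob + (1 - weight) * self_reported


def route(confidence, reversible, threshold=0.7, floor=0.4):
    """Choose among proceed / paired-reviewer / escalate-model / HITL."""
    if confidence >= threshold:
        return "proceed"
    if reversible:
        # Low confidence on reversible work goes to the reviewer.
        return "paired-reviewer"
    if confidence >= floor:
        return "escalate-model"
    return "HITL"
```

For example, `route(0.5, reversible=True)` picks the paired reviewer, while the same confidence on irreversible work escalates to a stronger model first.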
Interop — the primitives the wire doesn't own
A complete agent-interop layer has to answer a fixed set of questions: capability discovery, wire transport, principal-of-chain identity, authorization, audit, cost, HITL bubble-up across boundaries, long-running durability, replay against versioned state, and observability. The candidate wire formats cover the first two well and the rest unevenly or not at all. The substrate's posture: own the primitives that don't belong on any wire; let the customer's wire of choice carry the call.
The split, primitive by primitive:
- Discovery and binding. Auto-nomination from the tenant's AIE roster and registered MCP servers, through one runtime chokepoint. The tool roster is per-session and resolved live, not a static manifest.
- Wire transport. Adapter pattern. MCP in and out covers most cross-agent traffic; A2A, ANP, and others land as adapters when a customer needs them. The substrate is wire-opaque.
- Delegated principal chain. Every event carries an actor; every cross-vendor call carries the full chain (user → orchestrator profile → sub-agent → target), signed end-to-end. Wire formats sign the sender at best; the chain that produced the call is what regulated reviewers and confused-deputy defenses actually need.
- Authorization. The five-filter effective-access pipeline (registered capability ∩ profile allowlist ∩ project gate ∩ user permissions ∩ mode policy) runs at every call, uniform across wires. Denials carry a reason axis and a suggested_alternative, not an opaque 403.
- Provenance and audit. caused_by lineage on every event, forward- and backward-traversable, crossing vendor boundaries through the signed chain. The audit walk is a graph traversal, not a join.
- Cost attribution and enforcement. Token and cost metadata on every tool and AIE invocation, rolled up to per-Job / per-Profile / per-tenant. Kill / reroute / pause at the loop, not in a dashboard after the fact (Part 4, hard budget enforcement).
- HITL bubble-up across boundaries. Sub-agent and cross-vendor HITL events surface in the parent Job's queue via parent_handler_ref, with dedup by integration auth target so N parallel children produce one approval.
- Long-running durability. The event log is the rehydration substrate. A session in awaiting-human-long tears down its runner; resume replays from the most recent committed boundary on any pod.
- Replay and versioning. Content-hashed prompts plus versioned event types let any call replay against a new model, a new orchestrator version, or a new prompt later (Part 4, replay flows).
- Observability. Events are the trace. Per-event token, cost, latency, and outcome metadata feed Surface 8 directly, no separate tracing layer.
What the substrate doesn't try to be: a connector marketplace, a wire protocol, or a skill format. The leverage move is owning the primitives the wire doesn't and letting the field churn around them.
Substrate evolution
Event-sourced systems are hard to evolve. Once an event type ships, its payload shape is permanent in the historical record, and replay must keep working against old envelopes. Three disciplines mitigate this from day one:
Versioned event types. Type names carry an explicit version (decision.recorded.v1, decision.recorded.v2). Adding a field is a new type version, not a mutation. Old events stay readable by old handlers; new events get the new handler. Subsystems own their migration windows.
Content-hashed payload schemas. Each payload version's JSON schema lands as a substrate-internal event itself. Replay tooling resolves the schema for an event by walking back to the most recent schema event of that type. Old replay tools work on old events; new replay tools work on both.
Migration as projector composition. When a subsystem changes its projected view (the briefing's JSON intermediate format, say), it ships a new projector alongside the old one. Both materialize from the same events. Surfaces read whichever projector they're pinned to. The migration window closes when no surface reads the old projector anymore.
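The first discipline can be sketched as a handler registry keyed by the full versioned type name, so old and new payload shapes are read side by side. The registry, the decorator, and the assumption that v2 added a `rationale` field are illustrative only.

```python
from typing import Callable, Dict

# Registry keyed by versioned type name, e.g. "decision.recorded.v1".
HANDLERS: Dict[str, Callable[[dict], dict]] = {}


def handles(event_type: str):
    """Register a handler for exactly one event type version."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@handles("decision.recorded.v1")
def _decision_v1(payload: dict) -> dict:
    # Hypothetical v1 shape: no rationale field existed yet.
    return {"title": payload["title"], "rationale": None}


@handles("decision.recorded.v2")
def _decision_v2(payload: dict) -> dict:
    # Hypothetical v2 shape: adding a field produced a new type version.
    return {"title": payload["title"], "rationale": payload["rationale"]}


def project_decisions(events):
    """Materialize one view from a log that mixes old and new versions."""
    return [HANDLERS[e["type"]](e["payload"]) for e in events if e["type"] in HANDLERS]
```

A second projector for a changed view would register its own handlers against the same events, which is what lets old and new projections coexist during a migration window.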
Pre-Grouping-1: don't event-source the existing harness state in one go. New events flow through the substrate; existing tables remain authoritative for their current concerns and get migrated table-by-table as their owners are ready. Source-of-truth for the project subsystem (the new code path) is the event log from day one. Source-of-truth for old subsystems (artifacts, sessions, jobs) remains their tables until each subsystem opts into projection-based reads.
The three groupings, in detail
Each grouping depends on substrate primitives the previous one introduced. Inside a grouping, items can ship in any order; across groupings, the sequence is fixed.
Grouping 1 — substrate basics + the user-visible MVP behaviors (Champ June 15)
What this unlocks. Continuity on entry, anti-duplication (via in_flight registry; semantic index lights up progressively), and a docs-folder memory model — the Champ MVP behaviors from Part 1. The event log is in production from day one for the new project subsystem; existing harness tables remain authoritative for the things they already own and migrate over later phases.
What we build. The event log (append-only Postgres-backed, common envelope, content-hashed prompts), the projector library that registers per-event-type handlers and materializes named views, a Redis read-side cache. Strict-mode replay for divergence detection (deterministic-input, sampled-output caveat acknowledged). Fork from any event into a new run id; SQL-level diff between two runs.
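An in-memory sketch of the three primitives named above: append-only log, projector fold, and fork from any event into a new run id. It stands in for the Postgres-backed log only conceptually; the class, the method names, and the reduced envelope fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    events: list = field(default_factory=list)
    parents: dict = field(default_factory=dict)  # run_id -> (parent_run_id, fork_seq)

    def append(self, run_id, event_type, payload):
        event = {"seq": len(self.events), "run_id": run_id,
                 "type": event_type, "payload": payload}
        self.events.append(event)
        return event

    def fork(self, run_id, at_seq, new_run_id):
        """Start a new run that shares history up to and including at_seq."""
        self.parents[new_run_id] = (run_id, at_seq)

    def history(self, run_id, upto=None):
        """Events visible to a run: inherited prefix, then the run's own events."""
        inherited = []
        if run_id in self.parents:
            parent, fork_seq = self.parents[run_id]
            limit = fork_seq if upto is None else min(fork_seq, upto)
            inherited = self.history(parent, limit)
        own = [e for e in self.events
               if e["run_id"] == run_id and (upto is None or e["seq"] <= upto)]
        return inherited + own


def project(events, handlers, initial):
    """Fold events into a named view using per-event-type handlers."""
    state = initial
    for event in events:
        handler = handlers.get(event["type"])
        if handler is not None:
            state = handler(state, event["payload"])
    return state
```

Projecting the same view over two run histories and comparing the results is the in-memory analogue of the SQL-level diff between runs.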
On top of the substrate: the profile subsystem (orchestrator AIE of type=orchestrator per 07-mvp-and-roadmap.md §28.4, named-input contract, inherited AIE infra). The project subsystem (project-id resolves to a projected state covering members, decisions with rationale and lineage, in-flight claims with heartbeats, blockers, open questions, suggested next, canonical artifacts). The memory subsystem (four scopes wired through the prompt builder). The artifact subsystem (versioned objects with canonical pointer; external roundtrip is just another event). Per-tenant cost tracking from day one (every aie.invoked and tool.requested event carries token + cost metadata; aggregations roll up to per-Job + per-tenant projections, even though outcome-priced billing is Grouping 3). The three chokepoint integrations: session create loads project state and runs a briefing turn before any tool call; tool execute runs the duplicate-check projection (in_flight registry; semantic index when ready) before any artifact-creating tool fires; prompt build assembles scopes-skills-state into the system prompt.
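The duplicate-check at the tool-execute chokepoint can be pictured as a registry of in-flight claims kept alive by heartbeats. This is a sketch under assumptions: the `InFlightRegistry` class, its method names, keying claims by artifact id, and the TTL-based expiry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    actor: str
    artifact_id: str
    intent: str
    last_heartbeat: float


class InFlightRegistry:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds  # assumed expiry for silent claims
        self.claims = {}

    def live_claim(self, artifact_id, now):
        """Return the current claim if its heartbeat is still fresh."""
        claim = self.claims.get(artifact_id)
        if claim is not None and now - claim.last_heartbeat <= self.ttl_seconds:
            return claim
        return None

    def try_claim(self, actor, artifact_id, intent, now):
        """Claim the slot, or return the blocking claim so the agent can intervene."""
        existing = self.live_claim(artifact_id, now)
        if existing is not None and existing.actor != actor:
            return False, existing
        claim = Claim(actor, artifact_id, intent, now)
        self.claims[artifact_id] = claim
        return True, claim

    def heartbeat(self, actor, artifact_id, now):
        claim = self.claims.get(artifact_id)
        if claim is not None and claim.actor == actor:
            claim.last_heartbeat = now
```

When `try_claim` returns `False`, the blocking claim carries what the intervention needs to show: who is working, on what, and with which intent.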
Source-of-truth resolution. For the new project subsystem (decision-flow state, briefings, in_flight): event log is canonical, Postgres tables are projections invalidated on new events. For existing harness subsystems (artifacts, sessions, jobs, memory_entries): tables remain authoritative; events emit alongside writes for audit and replay. Each pre-existing subsystem migrates to projection-based reads when its owner is ready, with a strict-mode replay test guarding the cutover. No big-bang migration.
Multi-tenant isolation. The event log partitions by project-id with tenant-id as the isolation boundary. Cross-tenant joins are impossible at the storage layer; the projector library enforces tenant scope on every read; row-level security is the floor. Projections are per-tenant.
Background jobs. Treated as AIE invocations with mode=background (per Maya's HR Suite scenario). They're long-running orchestrator AIE sessions that emit periodic aie.streaming events; the user's task drawer (Surface 4) lists them alongside foreground Jobs with a background status pill. The substrate doesn't need a special "background job" primitive — they're just sessions that stay open.
Deliberately not in. The skill observation pipeline. The reviewer-config AIE. RBAC beyond user pass-through. Long-running durability beyond what the harness already has. Parallel sub-agent dispatch with structured merge. The semantic index in its production form (ships partial in this grouping; full calibration in Grouping 2). Outcome-priced billing (Grouping 3; only the cost-tracking events ship here).
The interesting design tension. How much projection state lives in materialized Postgres tables versus computed on read from the event log. Hot reads (briefing, duplicate-check, current canonical artifact) materialize into tables and invalidate on new events; everything else computes from the log. What's hot is a measurement question that production answers; the materialization layer is built to allow shifting the line.
Grouping 2 — the smart loop closes
What this unlocks. The project gets smarter over time without anyone editing a config. Skills the project has learned apply by default to every new session. Builders run a profile in training mode to surface gaps and distill traces into skill proposals. New profiles fork from base profiles with a sane inheritance story.
What we build. The full skill subsystem lifecycle. Entry pathways are explicit: human-authored skills land at experimental; skills extracted from session traces land at unvalidated; skills mined from existing tool and AIE schemas land at experimental because the underlying capability is already trusted; skills evolved from observed failure patterns land at unvalidated. Maturation gates fire on skill-applied-successfully and skill-applied-unsuccessfully events; thresholds per scope (project in weeks, profile in months, org in quarters). Retirement is the inverse: drops in usage or success rate auto-flag for deprecation review.
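A maturation gate of this kind reduces to a decision over counts of skill-applied-successfully and skill-applied-unsuccessfully events. The sketch below is illustrative: the function name, the minimum sample size, and both rates are hypothetical placeholders for the per-scope thresholds.

```python
def maturation_gate(successes: int, failures: int,
                    min_applications: int = 10,
                    promote_rate: float = 0.8,
                    retire_rate: float = 0.5) -> str:
    """Decide whether a skill advances, holds, or is flagged for deprecation review."""
    total = successes + failures
    if total < min_applications:
        return "hold"  # not enough evidence in either direction
    rate = successes / total
    if rate >= promote_rate:
        return "promote"
    if rate < retire_rate:
        return "deprecation-review"
    return "hold"
```

Per-scope behavior would come from passing different thresholds for project, profile, and org scope.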
The observation pipeline runs as a background process subscribed to recent tool-requested and tool-responded event streams. Pattern signatures (tool sequence + outcome + intent class) get clustered. Repeating successful patterns become skill proposals; repeating failure patterns become guardrail proposals ("before doing X, check Y"). Each proposal carries its evidence event ids by reference, so rejecting a proposal retains the evidence and prevents the same pattern from regenerating a duplicate.
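A minimal sketch of the clustering step, assuming exact-match signatures built from tool sequence, outcome, and intent class. Real clustering would likely be fuzzier; the trace shape, the `propose` name, and the minimum support of three are hypothetical.

```python
from collections import defaultdict


def propose(traces, rejected_signatures=frozenset(), min_support=3):
    """Turn repeating patterns into proposals that carry evidence event ids."""
    clusters = defaultdict(list)
    for trace in traces:
        signature = (tuple(trace["tools"]), trace["outcome"], trace["intent"])
        clusters[signature].append(trace["event_id"])

    proposals = []
    for signature, evidence in clusters.items():
        if len(evidence) < min_support:
            continue
        if signature in rejected_signatures:
            continue  # retained rejection blocks a duplicate proposal
        kind = "skill" if signature[1] == "success" else "guardrail"
        proposals.append({"kind": kind, "signature": signature, "evidence": evidence})
    return proposals
```

Keeping rejected signatures around is what stops a declined pattern from reappearing in the review queue the next time the pipeline runs.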
The reviewer AIE type ships here. Same AIE infrastructure as the orchestrator AIE; different cognitive shape (evaluation, not generation); often a cheaper model. The reviewer reads a completed unit of work, its event log, its caused-by chain, and emits advance / send-back / return-to-backlog. Profile training mode is a profile-level flag putting a session into explore-heavy behavior. Profile forking and diff are substrate-level operations; the parent's audit trail is intact in the child, and rollbacks replay to an earlier event. Plan-as-skills lands here, not in Grouping 1: the FA still owns the canonical editable plan artifact; on session open, the plan loads as a set of project-scope skills.
Deliberately not in. Cross-profile skill transfer (noticing a date-conflict check fires across three rollouts and proposing graduation to profile scope). Compositional skill hierarchies. Adversarial self-play. Automated probe-set construction for skill evaluation. RBAC beyond user pass-through.
The interesting design tension. How aggressive the observation pipeline should be. Too aggressive and the review queue floods (published research on adaptive skill libraries reports bloat factors of 4× to 9× when proposals are unmanaged). Too conservative and the loop never closes. The opening answer is conservative: rate-limited per session, batched reviews, automatic clustering of near-duplicate proposals at the experimental stage, retained rejection evidence. The production answer depends on review-fatigue telemetry only real usage produces.
Grouping 3 — governance, durability, parallel execution
What this unlocks. The substrate becomes enterprise-deployable in the senses that matter: RBAC across the stack, multi-tenancy hardening, budget controls, long-running Jobs that survive pod restarts and integration-token rotations, parallel sub-agent dispatch with structured merge, signed audit, outcome-priced billing rollups.
What we build. The effective-access pipeline at the tool-execute chokepoint (registered tools ∩ profile allowlist ∩ project gate ∩ user permissions ∩ mode policy). Denials are structured (reason axis, missing element, suggested alternative) so the agent can adjust rather than fail opaquely. RBAC across the stack: user identity flows through every event as the actor; profile-level role gates separate admin / contributor / viewer; project-level membership controls who reads project state; tool-level required roles enforce at execute; the skill approver matrix per scope chooses who can approve writes; HITL approvers are picked by role.
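The five-filter intersection with a structured denial can be sketched as follows. The filter keys, the result shape, and the way an alternative is chosen (the first tool that passes every filter) are assumptions for illustration.

```python
FILTERS = ("registered", "profile_allowlist", "project_gate",
           "user_permissions", "mode_policy")


def effective_access(tool: str, filters: dict) -> dict:
    """Check a tool against each filter in order; explain the first denial."""
    for name in FILTERS:
        if tool not in filters[name]:
            # Tools that pass all five filters are candidate alternatives.
            allowed = set.intersection(*(set(filters[n]) for n in FILTERS))
            return {
                "allowed": False,
                "reason": name,
                "missing": tool,
                "suggested_alternative": sorted(allowed)[0] if allowed else None,
            }
    return {"allowed": True}
```

The denial names which filter failed, so the agent can adjust its plan instead of retrying blindly against an opaque refusal.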
The HITL envelope formalized. One shape across the runtime, carrying type, required/user-facing flags, timeout class, dedupe key, parent-handler reference, approver role, and the evidence event ids that justify the prompt. Bubble-up to parent Job handlers is mandatory for nested sub-agents. Dedupe by key means parallel sub-agents that all need the same auth surface one prompt, not N. Timeout class drives durability: short = in-process futures; medium = pod-graceful-shutdown with reload; long = full session rehydrate-on-resume-webhook with event-log replay.
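Dedupe by key can be sketched as a pending-approval map where later requesters attach to the existing envelope. The `HitlQueue` class and the reduced envelope fields are hypothetical; the full envelope carries more than is shown here.

```python
class HitlQueue:
    def __init__(self):
        self.pending = {}  # dedupe_key -> envelope

    def request(self, dedupe_key, hitl_type, requester):
        """Attach the requester to an existing prompt, or create one prompt."""
        envelope = self.pending.get(dedupe_key)
        if envelope is None:
            envelope = {"type": hitl_type, "dedupe_key": dedupe_key, "waiters": []}
            self.pending[dedupe_key] = envelope
        envelope["waiters"].append(requester)
        return envelope

    def resolve(self, dedupe_key, decision):
        """One human answer resolves every waiter that shared the key."""
        envelope = self.pending.pop(dedupe_key)
        return {waiter: decision for waiter in envelope["waiters"]}
```

Two sub-agents requesting the same auth target under the same key leave a single entry in `pending`, and one `resolve` call releases both.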
Long-running durability falls out of the event-sourced substrate. A session entering awaiting-human-long tears down its runner coroutine and sandbox; state lives in the event log; resume rehydrates on any pod. The proof that the substrate is real is that this works without bespoke per-session serialization.
Parallel sub-agent dispatch with structured merge: named strategies (dedupe_and_rank, summarize_and_reconcile, compare_and_vote, union_with_provenance) chosen per dispatch. Sub-agent traces stream into the parent Job's view in real time; sub-agent HITLs bubble.
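Two of the named strategies are sketched below to show what choosing a strategy per dispatch could look like. The semantics are an interpretation, not a specification: here union_with_provenance records which sub-agents returned each item, and dedupe_and_rank orders unique items by how many sub-agents agreed on them.

```python
def union_with_provenance(results: dict) -> dict:
    """Map each distinct item to the sub-agents that produced it."""
    merged = {}
    for agent, items in results.items():
        for item in items:
            merged.setdefault(item, []).append(agent)
    return merged


def dedupe_and_rank(results: dict) -> list:
    """Unique items, most-agreed-upon first, ties broken alphabetically."""
    provenance = union_with_provenance(results)
    return sorted(provenance, key=lambda item: (-len(provenance[item]), item))


# Strategy table so a dispatch can pick its merge by name.
STRATEGIES = {
    "union_with_provenance": union_with_provenance,
    "dedupe_and_rank": dedupe_and_rank,
}


def merge(strategy: str, results: dict):
    return STRATEGIES[strategy](results)
```

The remaining strategies (summarize_and_reconcile, compare_and_vote) would register in the same table.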
Quotas, cost, outcome pricing. Per-tenant token / concurrency / cost caps. Per-Job cost / latency / token rollups, queryable for billing. Budget caps surface in the model's context so the agent can choose cheaper paths under budget pressure.
Signed audit. Specific event types in regulated surfaces (HR Suite pay-equity decisions, Finance approval gates) carry cryptographic signatures at write time. The substrate doesn't change shape; the signature is a wrapper that verifiable-replay tooling validates. Unified audit query surface: "show me every memory write by profile X this week" is one event-log query; "why was this decision made" is a causal walk via caused_by. Workflow generation and promotion: the orchestrator can propose saving a useful sequence of skill firings as a generated workflow; promotion to first-class AIE is an admin gate.
The interesting design tension. How the substrate scales past a thousand concurrent active projects. The event log partitions by project-id with tenant-id as the isolation boundary (one tenant's events never join another's). Cold projects sleep; warm projects keep a Redis cache; hot projects pin to a sticky pod. Replay-on-resume is bounded because projections snapshot at known boundaries (every K events or every N hours, whichever fires first). A speculative direction once the loop closes: tickets-as-substrate-objects with autonomous orchestrator decomposition and reviewer-profile evaluation (sometimes called the "kanban inversion") becomes a natural composition. Composable, not load-bearing. Out of scope for the three groupings.
What stays out, deliberately
Three things are not in any grouping.
Voice surfaces are a separate code path. The latency floor for spoken interaction is below what the per-turn LLM loop with tool execution can deliver. The substrate exports a read API a voice surface consumes; the voice agent loop lives in its own runtime with its own constraints. Siblings, not children.
Cross-tenant projects are not in. A consulting tenant advising a client where the client sees the project but the tenant's other clients do not is a real shape, but the access pipeline above and the four-scope memory model don't encode it. Adding it is design work, not just engineering: the tenant boundary today is a hard isolation boundary and softening it requires explicit access design at the project level.
Compositional skill hierarchies are not in. Strategic skills composing functional skills composing atomic skills, with credit assignment flowing back up. The single-pass skill loop ships in Grouping 2. The compositional version requires credit-assignment research still in flight; building ahead of the research is more risk than the gain warrants.
The substrate is the work; the surfaces are how that work shows up in front of people. Eight surfaces, mapped to the moments that matter: starting a program with an approved plan, opening a project, asking the agent for work, managing the queue that's waiting on you, governing what the agent learned, investigating why a decision was made, setting up an orchestrator (for builders), and closing the operator loop on knowledge gaps, cost, regression, and principal chains.
1. Starting a program — plan generation and approval
Champ is plan-gated: the orchestrator session can't start until the program has an approved plan. The admin clicks Generate plan; a separate plan-generation AIE in the Frontier App produces a structured plan doc; the admin reviews and either edits or approves. Only then does the chat UI unblock. The state machine lives in the app, not the orchestrator (per 07-mvp-and-roadmap.md §28.6); the orchestrator receives the approved plan as a named runtime input.
| # | AIE | Step | side-effects |
|---|---|---|---|
| 1 | narrative_builder@v3 | Draft the change narrative; voice: dev-experience | none |
| 2 | stakeholder_email@v2 | Draft email to leadership + eng leads + ICs | none |
| 3 | adoption_faq@v1 | Generate FAQ from narrative + known objections | none |
| 4 | engagement_tracker@v2 | Schedule: launch+2w, set up dashboards | writes |
| 5 | executive_reporter@v3 | Conditional: launch+6w if engagement > threshold | none |
plan.approved event → orchestrator AIE becomes invokable
2. Opening a project — the briefing
Once the plan is approved, the first thing a user sees on every session is not a chat input. It is a briefing — generated deterministically from project events, structured as decisions made, items in flight, open questions, blockers, and a ranked list of what to pick up next, each with a Start session on this affordance. Every claim is clickable; every number references an event. The agent has not done anything yet. The substrate has projected what the project knows.
Decisions made
| # | recorded | decision | ref |
|---|---|---|---|
| 1 | May 19 14:32 | Adopt phased rollout for Enterprise Cloud orgs | [RAT-124] |
| 2 | May 18 11:07 | Use training-led adoption vs comms-first | [RAT-118] |
| 3 | May 16 09:41 | Deprioritize custom provisioning for now | [RAT-112] |

In flight
| id | item | owner | status |
|---|---|---|---|
| CR-317 | Update change impact assessment for Engineering | Alex Morgan | in_flight |
| CR-289 | Finalize enablement plan for ITSM pilot | Jamie Lee | in_flight |

Open questions
| id | question | raised by |
|---|---|---|
| Q-142 | How should we sequence org-level vs team-level rollouts? | Jamie Lee |
| Q-137 | What's the minimum viable success metric for Phase 1? | Priya Nair |

Blockers
| id | blocker | owner |
|---|---|---|
| B-067 | Pending legal review of customer data handling addendum | Taylor Kim |

Suggested next
| # | id | item | reason | impact | |
|---|---|---|---|---|---|
| 1 | CR-317 | Update change impact assessment for Engineering | Unblocks rollout comms and stakeholder planning | High impact | Start session → |
| 2 | B-067 | Follow up on legal review of data handling addendum | Legal approval is gating customer comms | High impact | Start session → |
| 3 | Q-142 | Decide on sequencing: org-level vs team-level rollouts | Needed to finalize rollout plan and timelines | Medium impact | Start session → |
| 4 | CR-289 | Finalize enablement plan for ITSM pilot | Pilot plan due this week | Medium impact | Start session → |
3. Asking the agent — the intervention
The interesting moment isn't when the agent succeeds. It's when the agent refuses to start. A user types "draft the launch email"; the agent sees that launch_email v3 exists, sees that another member has a parallel in-flight claim, and responds with a structured intervention before any tool fires. The user picks. The agent claims the in_flight slot and gets to work.
| artifact_id | version | last edited | similarity | |
|---|---|---|---|---|
| launch_email | v3 | 2 days ago by Alex | 0.91 | view |
| actor | intent | started | session ref |
|---|---|---|---|
| Jamie Lee | launch email v4 | 09:14Z (12m ago) | sess_87 |
launch_email v3…
in_flight.claimed · intent: "iterate on v3" · expected_outputs: [launch_email v4]
get_latest_artifact(launch_email) → v3 (canonical, 2d ago)
4. Your work — task drawer plus HITL inbox
One unified surface for "what's waiting on you." Across every project the user belongs to: in-flight Jobs (your own and ones you're a member of), pending HITL approvals routed to your role, blocked items where you're the unblocker, and recently completed work. Same substrate emission, two materialized projections (jobs view + HITL queue), one inbox. Click any row to open its full context.
| Job | Project | Last activity | status | |
|---|---|---|---|---|
| Draft Q4 launch email (iterate on v3) | Q3 Change Rollout | just now · you | running | |
| Pay-equity dispatch (3 AIEs in parallel) | Pay-equity 2026 | 12m ago · Maya | awaiting_clarification | |
| Comms plan for Phase 2 rollout | Q3 Change Rollout | 2h ago · Jamie | awaiting_review | |
| Stakeholder mapping (parallel branch) | Pricing Model Refresh | 1d ago · you | paused | |
| Quarterly pay-equity drift scan | Pay-equity 2026 | running · background | background |

Recently completed
| | Job | outcome | Project | completed |
|---|---|---|---|---|
| ✓ | FAQ generation for Q3 launch | published | Q3 Change Rollout | 17:42Z |
| ✓ | Vendor analysis report | exec review complete | Pricing Model Refresh | 14:08Z |
| ✓ | Org chart import + manager mapping | | Pay-equity 2026 | 10:31Z |
session.*, job.*, or hitl.* event. Dedup-by-key means two parallel sub-agents that both need SharePoint auth surface as one approval; both resolve when you authorize.
5. Reviewing what the project has learned
The skill review queue is where the project's accumulated way of working gets governed. The observation pipeline emits proposals with evidence; humans accept, edit, decline, or defer. Each proposal carries its evidence by reference — you click into the events that generated it, see the actual tool calls, and judge whether the pattern is durably true or just three users coincidentally doing the same thing.
| event | actor | timestamp | summary |
|---|---|---|---|
| evt_8f3a7c1d | Alex | 2026-05-19 14:32:11Z | Alex reviewed stakeholder draft v3 → emitted check_finance_alignment tool call |
| evt_3b9e2a44 | Alex | 2026-05-18 09:17:03Z | Alex compared forecast vs budget → emitted check_finance_alignment tool call |
| evt_6c1d0e91 | Maya | 2026-05-17 16:45:27Z | Maya validated Q2 plan → emitted check_finance_alignment tool call |
| evt_a7d2e9b1 | Maya | 2026-05-19 11:08:54Z | Maya documented decision on vendor → emitted summarize_decision_rationale |
| evt_2d4f6c77 | Alex | 2026-05-18 10:22:31Z | Alex finalized design choice → emitted summarize_decision_rationale |
| evt_9e4b1f53 | Sam | 2026-05-17 13:55:42Z | Sam captured trade-offs → emitted summarize_decision_rationale |
6. Investigating why a decision was made
Audit isn't a feature in this system; it is the storage. Every event names the event that caused it, and the question "why did this happen" is a graph walk. The user picks a decision (or an artifact, or any committed event), and the surface renders the causal chain backward through time — session start, tool requests, patches proposed, patches applied, the decision recorded — each row with actor, timestamp, and a one-line summary, the selected event highlighted, its full payload in JSON on the right.
{
  "event_id": "e1d9f2a7-5c72-4d2b-9d15-2b6e4f0e6c91",
  "event_type": "decision.recorded",
  "timestamp": "2026-05-29T10:19:26.341Z",
  "actor": {
    "actor_id": "usr_7f3b9c2d",
    "name": "Sam Rivera",
    "email": "sam.rivera@acme.com",
    "role": "compliance_analyst"
  },
  "project": {
    "project_id": "proj_q3cr",
    "name": "Q3 Change Rollout"
  },
  "decision": {
    "decision_id": "dec_91baf6f1",
    "title": "deprioritize custom provisioning",
    "outcome": "approved",
    "rationale": "Reduces maintenance overhead and risk. Custom workflows will continue to be supported",
    "confidence": 0.78
  }
}
| artifact | type | reference |
|---|---|---|
| RAT-112 | Decision | /CR-GH-NA/RAT-112 ↗ |
| dec_91baf6f1 | Decision Record | /decisions/dec_91baf6f1 ↗ |
| run_83a7c1d4 | Agent Run | /runs/run_83a7c1d4 ↗ |
| policy_v3.4.1 | Policy Version | /policies/v3.4.1 ↗ |
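The backward walk itself is short. The sketch below assumes events are available as a dictionary keyed by event id and that each event names at most one cause; if an event can name several causes, the loop becomes a breadth-first traversal. The function name and event shape are hypothetical.

```python
def causal_chain(events: dict, event_id: str) -> list:
    """Follow caused_by links backward from the selected event to the root."""
    chain = []
    seen = set()
    current = event_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against a malformed cycle in the log
        event = events[current]
        chain.append(event)
        current = event.get("caused_by")
    return chain
```

The returned list is already in the order the surface renders: the selected event first, then each cause back to session start.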
7. Setting up an orchestrator — the builder wizard
A one-time builder flow rather than a 9-tab settings panel. The orchestrator is an AI Employee of type=orchestrator (per 02-architecture-overview.md §8), and most of what a Phase 1 builder needs to configure is the same four things every AIE has: identity, capabilities, knowledge, governance. Four steps with sane defaults; the full event-sourced config edit lives behind a separate "advanced settings" view (also rendered, also event-sourced, intentionally not shown for the common path).
Inputs
| input | description |
|---|---|
| plan | approved plan doc from the Frontier App's plan store; orchestrator reads but doesn't write |
| program_memory | project-scope memory; orchestrator reads and proposes updates via review queue |
| docs_folder | all artifacts in scope; orchestrator reads via get_latest_artifact |

Skill folders
| folder | contents |
|---|---|
| champ/skills/orchestration/ | 12 procedural skills (load order, AIE selection rules, comms tone) |
| kpmg/firm_voice/ | 3 firm-wide voice and compliance skills |

Outputs
| output | description |
|---|---|
| text | free-form response shown in chat |
| generated_doc[] | typed docs: title + type enum (narrative, comms, faq, dashboard, report) + content/link · routed to the docs folder |
8. Closing the loop — the operator console
Most of the seven surfaces above serve people doing the work: the consultant briefed on entry (Surface 2), the analyst caught by an anti-duplication intervention (Surface 3), the manager approving a HITL bubble-up (Surface 4), the compliance reviewer walking back a decision (Surface 6). A few serve people who configure the system: the admin generating a plan (Surface 1), the curator triaging the skill review queue (Surface 5), the builder publishing an orchestrator (Surface 7). The operator console is for the person whose job is the loop itself: the orchestrator owner, the platform engineer, the FinOps lead who lives inside the system every day. Four panels over one substrate, one for each signal the question "is this thing actually compounding?" needs: knowledge gaps, cost against budget, replay regression, and principal chains.
Each panel is a projection over the same event log; the console is four queries with their own UI, not four systems. Knowledge gaps come from the observation pipeline subscribing to low-confidence outputs and explicit don't-knows. Cost vs budget rolls up the per-event token and cost metadata that ships in Grouping 1. Replay regression results come from the deploy-gate flow (Part 4, fourth replay mode). The principal chain comes from walking caused_by across every decision the operator filters as high-risk.
Without this surface, every claim in Part 1 ("projects compound") is something the operator has to trust. With it, every one of the six compounding things has a place where its trend is visible, its cost is bounded, its regressions are caught before deploy, and its decisions are walkable. The console is where "the substrate is the moat" stops being a closing line and becomes an instrumented surface.
| freq | Question the agent couldn't confidently answer | best conf. | likely source | authoring affordance |
|---|---|---|---|---|
| 8 | "What's our standard SLA for legal sign-off on customer comms?" | 0.42 | missing memory | Author memory → |
| 5 | "How do we handle a region-specific data-residency exception?" | 0.31 | missing skill | Propose skill → |
| 3 | "Is the GitHub Enterprise SKU stable for >5k seats this quarter?" | 0.55 | missing data source | Request connector → |
| 3 | "Which AIE handles localized FAQ generation for non-NA regions?" | 0.20 | missing tool | Add AIE → |
| 2 | "What's the recommended deprecation timeline for legacy AzureAD groups?" | 0.38 | missing memory | Defer |
| Top spenders this period | jobs | cost | % of envelope | enforcement |
|---|---|---|---|---|
| KPMG / Champ | 142 | $3,210 | 64% of $5,000 | 1 job auto-rerouted |
| KPMG / PPM | 38 | $980 | 49% of $2,000 | — |
| KPMG / Risk Advisory | 12 | $430 | 21% of $2,000 | — |
Replay regression results

| run | candidate vs base | sessions | divergence | deploy gate |
|---|---|---|---|---|
| r_8821 | Champ orchestrator @v1.5 vs @v1.4 | 312 prod | 0.7% decisions differ | pass |
| r_8807 | model: Sonnet 4.7 vs Sonnet 4.6 | 312 prod | 4.2% decisions differ (cost −18%) | review |
| r_8791 | skill folder rev vs base | 312 prod | 0.0% — no semantic change | pass |
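The divergence column and gate verdict can be sketched as a comparison of decisions replayed under the base and candidate configurations. The 1% pass threshold below is an assumption chosen for illustration; the design does not state where the gate sits, and the sample decisions are hypothetical.

```python
def divergence(base, candidate):
    """Fraction of replayed decisions that differ between base and candidate."""
    if len(base) != len(candidate):
        raise ValueError("replays must cover the same decisions")
    if not base:
        return 0.0
    differing = sum(b != c for b, c in zip(base, candidate))
    return differing / len(base)


def deploy_gate(rate, pass_below=0.01):
    """Assumed policy: small divergence passes, anything else goes to review."""
    return "pass" if rate < pass_below else "review"


# Hypothetical replay of 200 decisions.
base = ["approve"] * 200
small_change = ["approve"] * 199 + ["escalate"]
large_change = ["approve"] * 190 + ["escalate"] * 10

small_rate = divergence(base, small_change)
large_rate = divergence(base, large_change)
print(deploy_gate(small_rate), deploy_gate(large_rate))
```

A gate like this only counts differing decisions; it says nothing about whether a difference is an improvement, which is why the non-passing outcome is "review" rather than "fail".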
The principal chain walks caused_by across the orchestrator, its sub-AIEs, and any A2A handoff. Together the four panels are the operator's instrument for the six compounding things in Part 1.

Things that are harder than they look
Ten places where the architecture has a clean answer on paper and an empirical question in production.
Semantic index quality. False negatives break the trust contract; false positives annoy users. Threshold tuning is project-specific and probably has to learn from user feedback. Acknowledged in Grouping 1 by shipping the in_flight registry first and the index second once there are artifacts to index against. Chroma's context-1 work on self-editing search agents applies at the artifact-index level.
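One way threshold tuning could learn from user feedback is sketched below. Because false negatives are the costlier error here, it picks the highest similarity threshold that still keeps recall above a floor. The feedback format, the recall floor, and the sample scores are all assumptions.

```python
def pick_threshold(feedback, min_recall=0.95):
    """feedback: (similarity_score, user_confirmed_duplicate) pairs.

    Returns the highest threshold whose recall on confirmed duplicates
    stays at or above min_recall, which minimizes false positives
    subject to that floor.
    """
    positives = [score for score, is_dup in feedback if is_dup]
    if not positives:
        raise ValueError("need at least one confirmed duplicate")
    best = None
    for t in sorted({score for score, _ in feedback}):
        caught = sum(score >= t for score in positives)
        if caught / len(positives) >= min_recall:
            best = t  # recall only falls as t rises, so the last hit is highest
    return best


# Hypothetical feedback for one project.
feedback = [
    (0.91, True), (0.84, True), (0.78, True),
    (0.80, False), (0.55, False), (0.40, False),
]
strict = pick_threshold(feedback, min_recall=1.0)
relaxed = pick_threshold(feedback, min_recall=0.6)
```

With a recall floor of 1.0 the sample keeps one false positive (the 0.80 non-duplicate); relaxing the floor trades a missed duplicate for a quieter intervention, which is the project-specific call the paragraph describes.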
Briefing quality is hard to evaluate. The whole synthesis hinges on the briefing being useful. Grounding on a JSON intermediate (Part 3) prevents hallucination of decisions and artifacts that don't exist, but doesn't measure whether the briefing surfaces the right things, in the right order, with the right emphasis. The "Suggested next" list is a recommender with an accepted/ignored/dismissed reward signal; the briefing prose itself needs an eval surface. Open: a per-session "was this briefing useful?" microsurvey, or measuring downstream action latency post-briefing as the proxy.
Tool roster scaling ceiling. Auto-nominating every tenant AIE works at tens of AIEs; at hundreds or thousands (Autopilot scale, a mature HR Suite tenant) even WARM-tier prompt summaries blow out budget. Three options: pay the prompt growth and let caching amortize; embed AIE descriptors and retrieve-on-demand; hot/warm/cold tier by telemetry. Choice is a measurement question that production answers. The substrate's read API handles the implementation; the policy is open.
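The third option, tiering by telemetry, might look like the sketch below. The cutoffs, the invocation-count signal, and the reading of the tiers (hot as full descriptor in the prompt, warm as summary only, cold as retrieve-on-demand) are assumptions for illustration; the AIE names are hypothetical.

```python
def tier_roster(invocations, hot_min=50, warm_min=5):
    """Bucket AIEs by recent invocation count (assumed telemetry signal)."""
    tiers = {"hot": [], "warm": [], "cold": []}
    for aie in sorted(invocations):
        count = invocations[aie]
        if count >= hot_min:
            tier = "hot"
        elif count >= warm_min:
            tier = "warm"
        else:
            tier = "cold"
        tiers[tier].append(aie)
    return tiers


# Hypothetical 30-day invocation counts.
usage = {"faq-gen": 120, "date-check": 12, "legacy-export": 0, "summarizer": 50}
tiers = tier_roster(usage)
```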
Service identity for AIE invocations. MVP assumes the invoking user's identity flows through every sub-agent and external call. Real world: an HR workflow needs admin-level read access to data the invoking user can't see, filtered down to user-appropriate output. The pattern is service-identity-with-row-level-filters; design work, not yet built. Real capability, real escalation risk. Solving this and the chained-principal story (Part 4) together is Grouping 3.
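Since the pattern is not yet built, the following is only a sketch of its shape: the service identity reads every row, and a row-level policy keyed on the invoking user decides what leaves. The `Row` fields, the policy rule, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    employee: str
    manager: str
    salary_band: str


def visible_to(user: str, row: Row) -> bool:
    """Hypothetical policy: users see their own row and their direct reports."""
    return user in (row.employee, row.manager)


def read_as_service(rows, invoking_user, policy=visible_to):
    """The service identity can read all rows; only approved rows are returned."""
    return [row for row in rows if policy(invoking_user, row)]


# Hypothetical HR table.
hr_rows = [
    Row("ana", "lee", "B2"),
    Row("raj", "lee", "B3"),
    Row("lee", "kim", "C1"),
]
for_lee = read_as_service(hr_rows, "lee")
for_ana = read_as_service(hr_rows, "ana")
```

The escalation risk sits in the gap between the two steps: anything that uses the unfiltered rows before the policy runs has effectively been granted admin read access.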
HITL preemption false-positive rate. Selective bubble-up means the orchestrator can preempt an optional HITL it thinks it has context for. The failure mode is silent wrong behavior: the orchestrator skips an approval the human would have rejected and nothing flags it. The eval shape is F1 against ground-truth human decisions on a held-out set; we don't have a production-grade held-out set yet. Conservative default for the first month per profile: bubble unconditionally, learn what's safely preemptable from disagreements.
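The eval shape can be made concrete with a small F1 computation. The sketch assumes each held-out case is a pair: whether the orchestrator would have preempted, and whether the human actually approved. A false positive here is exactly the silent failure described above. The sample counts are hypothetical.

```python
def preemption_f1(cases):
    """cases: (would_preempt, human_approved) pairs from a held-out set."""
    tp = sum(1 for pre, ok in cases if pre and ok)       # safe preemption
    fp = sum(1 for pre, ok in cases if pre and not ok)   # skipped a rejection
    fn = sum(1 for pre, ok in cases if not pre and ok)   # bubbled needlessly
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out set: 3 safe preemptions, 1 silent miss,
# 2 needless bubble-ups, 4 correct bubble-ups.
held_out = (
    [(True, True)] * 3
    + [(True, False)] * 1
    + [(False, True)] * 2
    + [(False, False)] * 4
)
score = preemption_f1(held_out)
```

F1 weights the two error types equally; given that a false positive is silent and a false negative is merely an extra approval, a precision-weighted variant may be the better gate in practice.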
Skill distillation stability. The bootstrapping loop depends on converting traces into skills without hand-curation. Bad distillation produces unstable feedback. The Ctx2Skill / adversarial-self-play research is promising but not yet mature enough to justify dropping HITL-gating on auto-distilled skills. Until then, all distilled skills land at unvalidated and require human promotion. Phase 3 unblocks autonomous distillation if the research lands.
Named merge strategies for parallel dispatch. The candidates (dedupe_and_rank, summarize_and_reconcile, compare_and_vote, union_with_provenance) are sketches; the right names emerge from two or three real merge cases in production. Parallel fan-out beyond N=5–7 may require structured merges whose quality degradation curve hasn't been measured. Single-orchestrator-with-named-merge ships in Grouping 3; the primitives crystallize after.
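To show what a named-merge registry could look like, here is a sketch of two of the candidates. The strategy names come from the list above; their behavior is a guess, since the design explicitly treats them as sketches, and the agent names and findings are hypothetical.

```python
def union_with_provenance(results):
    """results: {agent: [items]} -> {item: sorted list of agents that produced it}."""
    merged = {}
    for agent, items in results.items():
        for item in items:
            merged.setdefault(item, set()).add(agent)
    return {item: sorted(agents) for item, agents in merged.items()}


def dedupe_and_rank(results):
    """Unique items, ranked by how many agents independently produced them."""
    provenance = union_with_provenance(results)
    return sorted(provenance, key=lambda item: (-len(provenance[item]), item))


# A registry keeps the strategy a named, swappable parameter of the dispatch.
MERGE_STRATEGIES = {
    "union_with_provenance": union_with_provenance,
    "dedupe_and_rank": dedupe_and_rank,
}

# Hypothetical fan-out results from three sub-agents.
fan_out = {
    "agent-a": ["risk: vendor delay", "risk: budget"],
    "agent-b": ["risk: budget"],
    "agent-c": ["risk: budget", "risk: staffing"],
}
ranked = MERGE_STRATEGIES["dedupe_and_rank"](fan_out)
sources = MERGE_STRATEGIES["union_with_provenance"](fan_out)
```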
Multi-project memory bleed and graduation. Some skills learned in one project should propagate to others (the date-conflict check applies to every rollout the team runs). The graduation path from project to profile to org is propose-approve, but something has to notice that the pattern fires across projects before graduation can be proposed. The cross-project observation pipeline is a Grouping 2+ piece; the cross-project privacy story (one tenant's pattern doesn't leak into another's) is the harder half.
Substrate evolution at scale. Event log schema migrations are a known industry pain. The versioning + projector-composition discipline (Part 4) is the right architectural answer; it doesn't make it cheap. Every new event type added in Groupings 2 and 3 has to be replay-safe against the Grouping 1 history. Test discipline matters: strict-mode replay against last week's events on every substrate change.
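The replay-safety requirement can be illustrated with versioned events, upcasters, and a strict flag. Everything named here (`decision_recorded`, the `rationale` field, the upcaster table) is hypothetical; the sketch only shows why an old event must still fold cleanly after the schema moves.

```python
# Hypothetical upcaster: v1 decision events had no "rationale" field.
UPCASTERS = {
    ("decision_recorded", 1): lambda e: {**e, "version": 2, "rationale": ""},
}


def upcast(event):
    """Apply upcasters until the event reaches the current schema version."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event


def on_decision(state, event):
    decisions = state.get("decisions", []) + [(event["id"], event["rationale"])]
    return {**state, "decisions": decisions}


HANDLERS = {"decision_recorded": on_decision}


def replay(events, handlers, strict=True):
    """Fold events into state; strict mode refuses events it cannot project."""
    state = {}
    for raw in events:
        event = upcast(raw)
        handler = handlers.get(event["type"])
        if handler is None:
            if strict:
                raise ValueError(f"no projector for {event['type']!r}")
            continue
        state = handler(state, event)
    return state


# Hypothetical history mixing an old-schema and a new-schema event.
history = [
    {"type": "decision_recorded", "version": 1, "id": "d1"},
    {"type": "decision_recorded", "version": 2, "id": "d2", "rationale": "budget cap"},
]
state = replay(history, HANDLERS)
```

Run against last week's real events on every substrate change, a strict replay like this turns a silent projection gap into a failing test.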
The "Champ MVP IS Grouping 1" framing has caveats. Grouping 1 is real and shippable for June 15; "real" means a minimal event-sourced project subsystem in production, the briefing chokepoint live, the in_flight registry deployed, the four decision-flow fields in the docs folder. Not "every existing harness subsystem migrated to projections by then." The substrate is partial in scope at June 15, complete in shape. Old subsystems migrate in waves over Phases 2-3 as their owners are ready, with strict-mode replay guarding each cutover.
What we're not doing that others are, and why
Five things the field is building that this design deliberately doesn't.
A competing connector marketplace. MCP's ecosystem and Anthropic's open SKILL.md format already cover the connector and skill layers; the marketplaces around them are noisy enough that adding another is a tax, not a service. The substrate participates in both and adds maturity gates and provenance fields as extension metadata (Part 4). Leverage move: spend engineering on the substrate, not the marketplace.
A competing cross-vendor handoff protocol. The wire layer has more candidates than the field can absorb; adding another would be noise. The substrate adapts through wire-transport adapters (Part 4) and contributes a chained-principal signing extension upstream where it fits.
A visual no-code agent canvas. Visual flow editors are a standing market category. The orchestrator wizard (Surface 7) is for builders configuring profiles, not authoring DAGs; profile training mode and the skill review queue (Surface 5) are for skill curation. The bet is that orchestration is too dynamic to express in a DAG editor and that the few users who want a DAG are better served by saving generated workflows (Grouping 3) and promoting them to first-class AIEs.
Cross-tenant projects. A consulting tenant advising a client, where the client sees the project but the tenant's other clients do not, is a real shape. The access pipeline and the four-scope memory model don't encode it. Adding it is design work, not just engineering: the tenant boundary today is a hard isolation boundary, and softening it requires explicit project-level access design. Deferred until the pattern has more than one customer asking.
Tying to one model provider. Hosted agent products tie customers to their model provider. The substrate treats the model as a parameter (see "Bring-your-own-model" in Part 4). Lock-in is to the operational layer — the substrate, the skills, the project state — not to whose tokens the executor consumed last quarter. Structural, not aspirational: replay only works because the model is a parameter.
What "done" looks like
For the June 15 KPMG Champ ship: the orchestrator AIE runs every session; the FA's docs folder holds plan + program_memory + generated_docs plus the four new fields (decisions, open_questions, in_flight, blockers); session-entry briefing renders; the duplicate-check intervention fires before any artifact-creating tool; every event flows through the substrate; the audit trail walks. Grouping 1 is real. Groupings 2 and 3 are reachable from the same substrate without re-architecture.
For the longer arc: a new person joining a six-month-old project experiences it as a coherent, current, accumulating thing, not as chat scrollback plus a folder of artifacts plus a colleague to interrupt for context. The agent is one of several participants in a durable, governed, event-sourced shared object. Skills accumulate. Knowledge gaps close. Replay catches regressions before deploy. Every cross-vendor handoff is signed back to the originating principal. Costs are bounded at the loop, not after the fact. The six things that compound (Part 1) all sit on the same event log and improve whether or not anyone notices.
The agent loop is fungible. The substrate is the moat.