on agent runtimes, projects, and the substrate underneath

Projects that compound

A runtime substrate for enterprise agent systems, designed to ship in three phases from the Champ June MVP onward.

~20 min read / agent harness / runtime substrate / institutional memory

Most AI agents today are loops with a memory layer bolted on. The shape is the same whether you point one at an enterprise project, a code repo, or your personal account: a narrow input surface (the prompt, the files, the tool output) and a memory layer trying to widen it. Even the best-case version (curated tools, hand-written skills, careful prompts, and project-scoped memory) still remembers the wrong things: user preferences and a flat doc corpus, not the project's decisions, the in-flight work, or what teammates are doing. The asset that compounds is the team's working knowledge; the system's stays flat. The agent doesn't know that two teammates asked the same question yesterday, that the artifact it's about to draft exists in three revisions in a folder it can't see, or that the project has been running for six weeks and the next thing to do isn't the thing the user just asked for. The waste (duplicated drafts, repeated discovery, lost decisions, re-learned context) is the part of knowledge work nobody measures because measuring it gets uncomfortable.

The waste sits on the wrong unit. IDEs put durability on the repo. Retrieval products put it on the doc corpus. Hosted agents put it on the chat thread. None of those is the project: the decisions, the artifacts with provenance, the in-flight work, the accumulated knowledge. That is what compounds, and no one ships it as a first-class object. Three behaviors fall out of building it: continuity on entry, anti-duplication before work, skills that accumulate under governance.

Built that way, the project is event-sourced and multi-participant by construction, the agent loop is its read-write surface, and the model is a parameter against state that lives elsewhere. Nothing here requires throwing out the harness an enterprise already runs; the substrate slides in underneath it. The runtime (today's harness) keeps its turn loop and tool gates and starts reading from substrate projections instead of in-memory state. From day one the substrate is partial in scope but complete in shape.

Part 1 — the user-visible claim

Three behaviors users see

Opening the project, the agent's first move is not to wait for a prompt. It briefs you. Not chat scrollback, not an LLM summary of past conversation. A structured projection from project events: current state, decisions and why, canonical artifacts, what is in flight and on whom, what is blocked, what to pick up next. Thirty seconds to read, every claim clickable.

The briefing isn't stored. It's projected from events emitted by the agent and your colleagues as they work. Nobody updates it. It's fresh by construction.

See Surface 2 · Opening a project →

You type "draft the launch email." The agent doesn't draft. It responds: v3 exists from Thursday; I'll iterate on that. Alex is editing a parallel draft right now. Coordinate or take over? You choose. The agent works from the right artifact or routes you to the next item.

The user sees the check happen. An agent that silently picks the right artifact is indistinguishable from one that silently picks the wrong one. Visibility is how trust is earned.

See Surface 3 · Asking the agent →

The agent notices patterns. This project has run a stakeholder-impact check before exec comms three times. Make it default? You accept, edit, or decline. Accepted suggestions become project-scoped skills applied by default on every future session for every member. The same loop runs at profile and org scope with progressively stiffer governance. Skills retire when usage drops or success rate decays. Skills are alive, not archived.

The same epistemic loop runs in the other direction at runtime. When the agent's confidence on a turn falls below threshold, the project doesn't bluff: the turn routes to a clarifying question, a stronger executor model, a paired reviewer that challenges the proposal before commit, or an HITL (human-in-the-loop) bubble-up. Skill accumulation and runtime uncertainty are two faces of the same project state: what the project knows over time, what it doesn't know right now, and how to act on either.

See Surface 5 · Reviewing what the project has learned →

All three behaviors require the same move: the project becomes a first-class durable object that agent sessions and humans both read and write. The agent is one reactive participant in a shared, event-sourced project state, not an actor with private memory talking to one user.

Long-horizon work is the regime where today's AI is least useful and most expensive. A model that drafts an email in five seconds is impressive once; a team using it for six months learns the cost of getting context into the model exceeds the cost of the email. What breaks isn't the model. What breaks is everything around it: context loading, conflict resolution, trust that the agent knows what the team knows. Make the project the durable thing and the model becomes substitutable. Better models drop in (skills may need re-tuning against a new model's capabilities, but the substrate, the audit, the project state all survive); cheaper models handle cheap tasks; multiple models cooperate.

What specifically compounds

Each row is a substrate primitive whose quality measurably climbs as the project runs, crediting against a concrete signal that already lives in the event log.

What compounds	Signal it credits against	Visible where
Suggested-next ranking: order of the "pick up next" items the briefing surfaces.	Accepted vs ignored vs dismissed across the suggested-next list, per suggestion-kind. Briefing prose improves as a downstream consequence: richer state to render from, better-ranked list to surface.	Surface 2; the ranking reflects the project's actual priorities after two to four weeks, and time-to-first-action shortens with it.
Skill library: patterns the project applies by default once the team accepts them.	Skill-applied-successfully and skill-applied-unsuccessfully events; per-scope maturation thresholds (Part 3).	Surface 5; verified skills graduate from in-context to retrieval-only, shrinking prompt overhead while raising hit rate.
Knowledge-gap closures: what the agent didn't know that an operator has since filled in.	Low-confidence outputs and explicit don't-know responses, classified by source (missing skill, tool, memory, data).	Surface 8; an operator closes the gap, the next agent doesn't repeat it.
Cost ceilings: per-Job and per-Profile budgets the substrate enforces and recalibrates.	Per-Job and per-Profile cost rollups; budget alerts and reroute events.	Surface 8; budgets calibrate against actual work cost instead of guesses.
Reviewer accuracy: how often the in-loop reviewer agrees with human ground truth.	Reviewer-to-human agreement on the work-product checks that bubbled despite reviewer opinion.	Surfaces 5, 8; more triage handed off as accuracy climbs, less HITL load without quality dropping.
Access denial alternatives: fallback paths the agent learns when permissions block the primary one.	Whether the agent's next attempt after a structured denial succeeded with the suggested alternative.	Surface 6; the agent learns which path works for this team's permissions, this project's policy.

No single row is impressive on its own. The compounding claim is that every row sits on the same event log, credits against the same shape of signal, and improves whether or not anyone notices. The substrate is what makes "the project gets smarter" instrumented rather than aspirational. We'll refer back to these six rows as the six compounding things throughout the piece — every later section that talks about the substrate "getting smarter" is talking about one of these six rows getting better on its own log.

Part 2 — the durable substrate

Those behaviors all run against the same substrate. What does it have to durably hold?

Five durable things, one shape

Five things in the system want durability. Each is currently scattered across the runtime (today's harness), the surfaces above it, or implicit nowhere-state. The work is naming them as one shape: an append-only event log plus materialized views.

Thing	What it carries · where it lives today · where it lands
Configuration	What kind of agent, what tools, what skills, what model, what policy. Today: constructor arguments scattered across call sites. Lands as orchestrator AIE (AI Employee — the configurable agent primitive: identity, capabilities, knowledge, governance).
Work containers	Bodies of work that span sessions, members, days. Today: the harness has Sessions; anything longer-lived lives in the Frontier App or the user's head. Lands as Project + Job.
Procedural knowledge	How to do things in this context, accumulated from usage. Today: the skill data model exists; the loop that fills it doesn't. Lands as Skills.
World state	What exists right now, who owns it, what is in flight, what was decided and why. Today: nothing above the session boundary. Lands as Memory + the project-state fields.
The trace	Every read, write, decision, tool call, denial, approval, sub-agent spawn, with causal lineage. Today: Prometheus and OpenTelemetry for ops visibility; no event log for the work itself. Lands as Audit.

One naming convention applies throughout: a Project is a durable body of work (a six-month change-management engagement, a research program); a Job is a unit of work inside a Project with a defined start, expected outputs, and an owner ("draft the launch email," "run the stakeholder-impact check"); a Session is one continuous conversation inside a Job, typically one user at one sitting. Project + Job is the durable pair the substrate carries; Session is the runtime envelope around each LLM-driven turn series.

The structural claim: these five are the same underlying shape. A stream of immutable events, plus views projected on read. Audit is the storage by construction. Reversibility falls out of replay. Concurrency becomes a projection-time decision instead of a write-time lock. Memory and skills stop being two storage stories. The agent loop is one of many event emitters; humans, sub-agents, integrations, and approvers all emit events of the same outer shape.
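The "one shape" claim can be sketched in a few lines. This is a minimal illustration, not the substrate's actual API: the `EventLog` class, the `project_decisions` projector, and the `decision.reverted` event type are all hypothetical names chosen for the example (`decision.recorded` is the only type taken from this piece).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    type: str
    payload: dict


class EventLog:
    """Append-only: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: list = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # position of the event in the log

    def read(self, upto: Optional[int] = None) -> list:
        # upto=None reads the whole log; upto=n reads the first n events.
        return list(self._events[:upto])


def project_decisions(events: list) -> dict:
    """Fold events into the current set of decisions, computed on read."""
    view: dict = {}
    for e in events:
        if e.type == "decision.recorded":
            view[e.payload["id"]] = e.payload
        elif e.type == "decision.reverted":  # hypothetical event type
            view.pop(e.payload["id"], None)
    return view


log = EventLog()
log.append(Event("decision.recorded", {"id": "d_1", "summary": "Use Q4-budget framing"}))
log.append(Event("decision.recorded", {"id": "d_2", "summary": "Send Thursday"}))
log.append(Event("decision.reverted", {"id": "d_2"}))

current = project_decisions(log.read())                 # view over the full log
as_of_second_event = project_decisions(log.read(upto=2))  # replay of a prefix
```

The view is never stored as the source of truth. Reading a prefix of the log gives the state as it stood at that point, which is the sense in which reversibility falls out of replay.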

The project state object

One shared object sits at the center, organized in five buckets. The briefing, the duplicate check, and the skill loop are all projections of it. The decision-flow bucket is the one that isn't in the harness today (decisions, open_questions, in_flight, blockers); it lands in the Frontier App's document store alongside the existing plan, program_memory, and generated_docs. Cheap to add; the discipline is writing to it on every relevant event.

Bucket	Fields	What it carries
Identity	`members`	Who is on the project with roles. Lets the agent attribute actions, scope notifications, resolve names. Lets the briefing tell the next person who else has been working.
Artifacts	`artifacts`	Versioned, with a canonical pointer per artifact, provenance (which session, user, agent turn produced this version), and a semantic index. Canonical pointer answers "fetch latest"; the index answers "is there already a draft of X?"
Decision-flow state (new)	`decisions` `open_questions` `in_flight` `blockers`	Decisions made (with alternatives, who decided, when, rationale); the things the project knows it doesn't know; work that's started-but-not-finished (intent + actor + start + expected outputs, feeds the duplicate check); what's waiting on what or whom. The briefing reads from all four.
Knowledge bindings	`skills` `memory`	Project-scoped procedural knowledge (skills with confidence, success/failure counters, version) plus project-scoped state (calibrations, preferences, settled facts). Loaded into the system prompt at session start; shared writes go through a review queue.
Trace	`events`	Append-only log of every change to any of the above. The briefing is generated from this. The audit trail is this. Replay reads this. Source of truth for the four buckets above; everything else is a materialized projection.

Every participant — users, agent sessions, sub-agents — reads and writes the same project state. Private-scope writes commit directly; shared-scope writes propose through a review queue. The briefing on session entry is a projection of the state, generated fresh from events rather than stored separately.
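The two write paths can be sketched as follows. The `ProjectState` container, the `write` and `approve` functions, and the field names are assumptions made for illustration; the only behavior taken from the text is that private-scope writes commit directly and shared-scope writes wait in a review queue.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectState:
    private: dict = field(default_factory=dict)   # keyed by (actor, key)
    shared: dict = field(default_factory=dict)    # visible to every member
    review_queue: list = field(default_factory=list)


def write(state: ProjectState, actor: str, scope: str, key: str, value: str) -> str:
    """Private writes commit immediately; shared writes become proposals."""
    if scope == "private":
        state.private[(actor, key)] = value
        return "committed"
    if scope == "shared":
        state.review_queue.append({"key": key, "value": value, "proposed_by": actor})
        return "proposed"
    raise ValueError(f"unknown scope: {scope}")


def approve(state: ProjectState, index: int, approver: str) -> dict:
    """Move one proposal out of the queue and into shared state."""
    proposal = state.review_queue.pop(index)
    state.shared[proposal["key"]] = {
        "value": proposal["value"],
        "proposed_by": proposal["proposed_by"],
        "approved_by": approver,
    }
    return proposal


state = ProjectState()
r1 = write(state, "agent:sess_91", "private", "tone_note", "keep it brief")
r2 = write(state, "agent:sess_91", "shared", "exec_comms_framing", "Q4-budget")
pending_before = len(state.review_queue)
approve(state, 0, approver="alex")
```

In an event-sourced version each of these calls would append an event and the dictionaries would be projections; the sketch keeps them as plain dictionaries to isolate the policy.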

Part 3 — how the runtime experiences the substrate

Storage shape is half of it. The other half is what changes about the agent loop when this substrate exists to read from.

The first-turn reframe

The most consequential change to the agent loop sounds boring. The first turn of every session is not a response to the user's prompt. It is a structured read of project state, a generated briefing, and a check against in-flight work. The substantive response comes second.

The harness already has the gate: every tool call passes through one execute path, every session has one initialization hook. The work is two things in that hook: load project state into the prompt, and require the first model turn to emit a briefing before any tool call.

A normal session opens with a briefing turn and a duplicate check before any drafting happens. The agent claims an in_flight slot before doing work, so a parallel session by another user sees the slot and routes around it.

Both lifts are orchestrator-level policy (prompt construction and tool gating), not model changes. Anthropic's context engineering writeup covers the mechanism this rides on.
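A minimal sketch of that gate, assuming a single execute path per session. The `Session` class, `emit_briefing`, `execute_tool`, and the `BriefingRequired` error are hypothetical names; a real harness would wire the same check into its existing initialization hook and execute path.

```python
class BriefingRequired(RuntimeError):
    """Raised when a tool call is attempted before the briefing turn."""


class Session:
    def __init__(self, project_state: dict) -> None:
        # Initialization hook: project state is loaded before any turn runs.
        self.prompt_context = {"project_state": project_state}
        self.briefing = None
        self.tool_calls = []

    def emit_briefing(self, text: str) -> None:
        self.briefing = text

    def execute_tool(self, name: str, args: dict) -> dict:
        # The single execute path: every tool call passes through this check.
        if self.briefing is None:
            raise BriefingRequired(f"{name!r} blocked: no briefing emitted yet")
        self.tool_calls.append((name, args))
        return {"tool": name, "status": "ok"}


session = Session({"in_flight": [], "decisions": []})

try:
    session.execute_tool("draft_email", {"subject": "launch"})
    blocked = False
except BriefingRequired:
    blocked = True

session.emit_briefing("No work in flight; no open decisions.")
result = session.execute_tool("draft_email", {"subject": "launch"})
```

The model is untouched. The policy lives entirely in what the orchestrator loads into the prompt and what it refuses to execute.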

How the briefing is generated

The briefing isn't an LLM summary of chat history. It's a deterministic projection of structured state, plus an LLM pass that translates structure to prose and ranks the suggested-next list. The structured part is verifiable; the prose part is constrained to operate only on facts the structured part provided.

Three steps. A deterministic query over project state yields the slice (recent decisions, open questions, in-flight items by actor, canonical artifacts with timestamps, blockers, recent event tail). The slice renders into a JSON intermediate. The model produces prose whose every claim is justifiable against the JSON, plus a ranked suggested-next list. The split exists to prevent confident hallucination of decisions that weren't made and artifacts that don't exist (see Chroma's context rot work; grounding on a provided JSON is one of the few mitigations that scales).

// briefing intermediate (excerpt)
{
  "project": "GitHub NA rollout",
  "members_active_this_week": ["alex", "priya", "you"],
  "recent_decisions": [
    {"id": "d_412", "summary": "Use Q4-budget framing for exec comms",
     "decided_by": "alex", "at": "2026-05-18", "rationale_ref": "evt_8821"},
    ...
  ],
  "in_flight": [
    {"intent": "launch email v4", "actor": "alex", "started": "2026-05-21T09:14Z",
     "expected_outputs": ["artifact:launch_email"], "session_id": "sess_91"}
  ],
  "open_questions": [
    {"id": "q_18", "text": "Do we send dev-experience angle to leadership?", "raised_by": "priya"}
  ],
  "canonical_artifacts": [
    {"id": "launch_email", "version": 3, "last_edited": "2026-05-20T17:02Z",
     "last_edited_by": "agent", "external_state": "synced"},
    ...
  ],
  "blockers": [
    {"on": "engagement_dashboard", "waiting_for": "ops:provision-data", "since": "2026-05-19"}
  ],
  "suggested_next": [
    {"kind": "answer_open_question", "ref": "q_18", "rationale": "blocks email approval"},
    {"kind": "unblock", "ref": "engagement_dashboard"}
  ]
}

The user reads the prose. The structure is what lets the model write it without inventing.
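The split between the deterministic slice and the constrained prose pass might look like the sketch below. It assumes the model returns its claims as a list of objects that each carry a `ref` into the intermediate; that output format, and the function names `build_slice`, `known_refs`, and `ungrounded`, are illustrative assumptions rather than anything the harness defines.

```python
import json


def build_slice(state: dict, max_decisions: int = 5) -> dict:
    """Deterministic query: the same state always yields the same slice."""
    decisions = sorted(state["decisions"], key=lambda d: d["at"], reverse=True)
    return {
        "recent_decisions": decisions[:max_decisions],
        "open_questions": list(state["open_questions"]),
        "in_flight": list(state["in_flight"]),
        "blockers": list(state["blockers"]),
    }


def known_refs(slice_: dict) -> set:
    """Ids the prose pass is allowed to cite."""
    refs = {d["id"] for d in slice_["recent_decisions"]}
    refs |= {q["id"] for q in slice_["open_questions"]}
    return refs


def ungrounded(claims: list, slice_: dict) -> list:
    """Claims whose ref does not resolve against the intermediate."""
    refs = known_refs(slice_)
    return [c for c in claims if c.get("ref") not in refs]


state = {
    "decisions": [{"id": "d_412", "summary": "Use Q4-budget framing for exec comms",
                   "at": "2026-05-18"}],
    "open_questions": [{"id": "q_18",
                        "text": "Do we send dev-experience angle to leadership?"}],
    "in_flight": [],
    "blockers": [],
}
slice_ = build_slice(state)
intermediate = json.dumps(slice_, sort_keys=True)  # what the model is shown

# Stand-in for model output: one grounded claim, one invented decision.
claims = [
    {"text": "Alex chose Q4-budget framing.", "ref": "d_412"},
    {"text": "Legal approved the email.", "ref": "d_999"},
]
rejected = ungrounded(claims, slice_)
```

A claim that cites an id missing from the intermediate is rejected before the user sees it, which is the verifiable half of the split.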

The duplicate check

Two parts on different primitives.

Concurrent: when a session starts substantive work it claims an in_flight slot with intent, expected outputs, timeframe, and a heartbeat. The agent's pre-work step checks for overlap with active slots. Sessions subscribe to in_flight.added events; the runtime fans new claims out. Heartbeats refresh on every meaningful turn; claims expire after a configurable TTL (30 minutes default) without renewal, so a crashed session doesn't permanently block parallel work.

Retrospective: a semantic index over project artifacts, embedded on create/update. Every artifact-creating tool's system prompt requires a pre-call lookup against the index; matches above threshold surface as candidates.
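The concurrent half can be sketched as a small registry with heartbeat-based expiry. The 30-minute TTL is the default named above; the class and method names, and the rule that two claims overlap when they share an expected output, are assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass

TTL_SECONDS = 30 * 60  # default TTL from the text: 30 minutes


@dataclass
class Claim:
    session_id: str
    actor: str
    intent: str
    expected_outputs: frozenset
    last_heartbeat: float


class InFlightRegistry:
    def __init__(self, ttl: float = TTL_SECONDS) -> None:
        self.ttl = ttl
        self._claims = {}

    def claim(self, session_id, actor, intent, expected_outputs, now):
        c = Claim(session_id, actor, intent, frozenset(expected_outputs), now)
        self._claims[session_id] = c
        return c

    def heartbeat(self, session_id, now):
        # Called on every meaningful turn; a no-op if the claim already expired.
        if session_id in self._claims:
            self._claims[session_id].last_heartbeat = now

    def active(self, now):
        # Drop claims whose heartbeat is older than the TTL.
        self._claims = {k: c for k, c in self._claims.items()
                        if now - c.last_heartbeat < self.ttl}
        return list(self._claims.values())

    def overlapping(self, expected_outputs, now, exclude_session=None):
        # Assumed overlap rule: any shared expected output.
        wanted = set(expected_outputs)
        return [c for c in self.active(now)
                if c.session_id != exclude_session and wanted & c.expected_outputs]


registry = InFlightRegistry()
registry.claim("sess_91", "alex", "launch email v4", ["artifact:launch_email"], now=0)

seen = registry.overlapping(["artifact:launch_email"], now=60, exclude_session="sess_92")
registry.heartbeat("sess_91", now=1000)
still = registry.overlapping(["artifact:launch_email"], now=1801, exclude_session="sess_92")
expired = registry.overlapping(["artifact:launch_email"], now=2800, exclude_session="sess_92")
```

Alex's claim survives past the original 30-minute mark because of the heartbeat at t=1000, then lapses once 1800 seconds pass without another one.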

("Champ" recurs in two linked senses throughout this piece: KPMG's change-management engagement that ships June 15, and the orchestrator AIE profile configured to run that program — every reference below resolves back to that pair.)

Retrospective is only as good as the index; concurrent is only as good as sessions claiming slots and renewing heartbeats. The runtime auto-claims when a creation tool fires (carrying most of the discipline) and the heartbeat ticks per turn. The in_flight registry ships in Phase 1 because the cost is trivial and the payoff is immediate. The semantic index ships in late Phase 1 or early Phase 2; embedding model choice, refresh cadence, and per-intent similarity thresholds are non-trivial calibration work, and the first Champ sessions start with zero artifacts to index against anyway. The intervention shown in Surface 3 lights up progressively as the index fills in.
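The retrospective lookup reduces to a similarity search with a threshold. The sketch below uses hand-written three-dimensional vectors in place of a real embedding model, and a single global threshold in place of per-intent thresholds; both are simplifications, and `ArtifactIndex` is a hypothetical name.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ArtifactIndex:
    def __init__(self) -> None:
        self._vectors = {}

    def upsert(self, artifact_id: str, vector) -> None:
        # Called on artifact create/update with a fresh embedding.
        self._vectors[artifact_id] = list(vector)

    def candidates(self, query_vector, threshold: float):
        """Artifacts at or above the threshold, best match first."""
        scored = [(cosine(query_vector, v), aid) for aid, v in self._vectors.items()]
        return [aid for score, aid in sorted(scored, reverse=True) if score >= threshold]


index = ArtifactIndex()
index.upsert("launch_email", [1.0, 0.0, 0.0])          # toy vectors, not real embeddings
index.upsert("engagement_dashboard", [0.0, 1.0, 0.0])

# Stand-in for embedding the request "draft the launch email".
query = [0.9, 0.1, 0.0]
matches = index.candidates(query, threshold=0.8)
```

The calibration work named above (embedding model, refresh cadence, per-intent thresholds) is exactly what this sketch leaves out.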

AIE invocation, in mechanism

The orchestrator AIE calls other AIEs as tools. Each invocation rides on the same gate and emits the same outer event shape as everything else. The wrapper creates or reuses an AIE conversation, sends the input payload, streams intermediate events over the existing SSE (server-sent events) bridge, polls for artifact-creation events via the existing poller, and returns a structured ToolResult with payload and artifact references. The orchestrator perceives every AIE invocation as a clean tool call.

What's new at the substrate layer: every invocation step is an event. aie.invoked with target AIE id + input payload + prompt hash; aie.streaming for each intermediate chunk; aie.completed with the final ToolResult; artifact.created for any artifacts produced; all linked via caused_by back to the orchestrator's LLM turn that requested the invocation. The audit chain walks across orchestrator and child AIE events in one graph. HITL events raised inside the child AIE bubble up via parent_handler_ref on the event and surface in the parent Job's HITL queue (Surface 4) with dedup by integration auth target.

For HR Suite's parallel dispatch (Maya's pay-equity scenario), the orchestrator emits N concurrent aie.invoked events; the runtime uses the existing execute_parallel machinery; each child streams independently; the merge is one more event in the parent's log. Background jobs (Maya's quarterly aggregations, attrition-risk monitoring) are AIE invocations with mode=background: they return immediately, the orchestrator continues other work, results auto-deliver on the next turn or via wait_all.
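The dispatch pattern can be illustrated with `asyncio` standing in for the harness's own parallel machinery. This is not how `execute_parallel` or `wait_all` are implemented; `invoke_aie`, the target names, and the `aie.merged` and `orchestrator.continued` event types are hypothetical, and the child call is simulated.

```python
import asyncio


async def invoke_aie(target: str, payload: dict, log: list) -> dict:
    log.append({"type": "aie.invoked", "target": target})
    await asyncio.sleep(0)  # stand-in for the real child AIE call
    log.append({"type": "aie.completed", "target": target})
    return {"target": target, "payload": payload, "ok": True}


async def orchestrate(log: list):
    # Parallel dispatch: N concurrent invocations, one merge event in the parent log.
    results = await asyncio.gather(
        *(invoke_aie(f"analyst_{i}", {"slice": i}, log) for i in range(3))
    )
    log.append({"type": "aie.merged", "count": len(results)})  # hypothetical type

    # Background mode: the call returns a handle immediately; work continues.
    background = asyncio.create_task(invoke_aie("quarterly_aggregation", {}, log))
    log.append({"type": "orchestrator.continued"})  # hypothetical type
    background_result = await background  # result delivered when next needed
    return results, background_result


log: list = []
results, background_result = asyncio.run(orchestrate(log))
types = [e["type"] for e in log]
```

The parent log records the merge only after all three children complete, and records that the orchestrator continued before the background invocation finished.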

How skills evolve, in mechanism

The skill loop balances accumulation against pollution. Too little and the project never gets smarter; too much and the library bloats with low-signal proposals that drown the useful ones and burn prompt budget.

Four lifecycle states with a continuous confidence score underneath, not discrete maturation buckets.

Unvalidated. A candidate skill with name, procedural description, source events as evidence, proposed scope; user-scope auto-commits, shared-scope lands in the review queue with no firings yet credited.
Experimental. Loaded into the prompt at session start; usage signal (success/failure ratio plus firing recency) updates the confidence score continuously; the score gates whether the skill stays in-context or moves to retrieval-only.
Verified. Enough positive signal that the skill is frozen except for major revisions and is trusted by default in new sessions.
Deprecated. Confidence below threshold or no firings in some window; excluded from new sessions, evidence retained.

The diagram below traces the lifecycle from observation through maturation; the four runtime states above are how implementations track each skill at any moment.

The skill loop, scoped, across the four lifecycle states (unvalidated → experimental → verified, with deprecated as the retirement terminus). User-scope writes commit directly into experimental; shared-scope writes land as unvalidated in the review queue. Skills mature into verified through positive usage signal and fall to deprecated through low usage or decay. Rejected proposals retain their evidence so the same pattern doesn't regenerate a duplicate next week.

The maturation gate prevents both pollution and over-conservatism. Firings that help credit the skill; firings the user rejects debit it. Past a threshold of negative signal or no firings in some window, the skill auto-flags for deprecation. Thresholds are per-scope: org-scope skills retire slowly because they are high-investment, project-scope skills retire fast because projects move through phases. AgentRR-style record-and-replay is the telemetry shape: every firing is an event with enough metadata to compute the success signal post-hoc.
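A minimal sketch of the maturation gate follows. The state names come from the lifecycle above; everything numeric is an assumption for illustration, including the smoothed success ratio used as the confidence score and the per-scope threshold values. The text only commits to thresholds differing by scope, with org scope retiring more slowly than project scope.

```python
from dataclasses import dataclass

# Hypothetical values: the source says thresholds are per-scope, not what they are.
THRESHOLDS = {
    "project": {"verify": 0.8, "deprecate": 0.3, "idle_days": 14},
    "org": {"verify": 0.9, "deprecate": 0.2, "idle_days": 90},
}


@dataclass
class Skill:
    name: str
    scope: str
    state: str = "experimental"
    successes: int = 0
    failures: int = 0
    last_fired_day: int = 0

    @property
    def confidence(self) -> float:
        # Assumed scoring rule: success ratio smoothed so new skills start at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def reassess(skill: Skill, day: int) -> None:
    if skill.state == "deprecated":
        return  # retired skills stay retired in this sketch
    limits = THRESHOLDS[skill.scope]
    idle = day - skill.last_fired_day
    if skill.confidence < limits["deprecate"] or idle > limits["idle_days"]:
        skill.state = "deprecated"
    elif skill.confidence >= limits["verify"]:
        skill.state = "verified"


def record_firing(skill: Skill, helped: bool, day: int) -> None:
    """Credit a firing that helped, debit one the user rejected."""
    if helped:
        skill.successes += 1
    else:
        skill.failures += 1
    skill.last_fired_day = day
    reassess(skill, day)


check = Skill("stakeholder-impact check", scope="project")
for day in (1, 2, 3, 4):
    record_firing(check, helped=True, day=day)
state_after_use = check.state
reassess(check, day=20)  # 16 idle days exceeds the 14-day project window
state_after_idle = check.state

noisy = Skill("auto-cc legal", scope="project")
record_firing(noisy, helped=False, day=1)
record_firing(noisy, helped=False, day=2)
```

Each firing is what would land in the log as an event; the counters here are the projection over those events.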

Rejected proposals retain their evidence. Without that, the same pattern observed across sessions regenerates the same proposal repeatedly and burns reviewer attention. With it, the system recognizes a pattern has been proposed and rejected and only re-proposes if conditions have meaningfully changed. Small in the prototype, indispensable at org scale.

Skills as the project's working knowledge

Briefing and duplicate check make project state legible. The skill loop makes the project's accumulated way of working durable. A new person on a six-month project inherits not just artifacts and decisions but how the project has learned to work. The three behaviors are facets of one structural choice: continuity = project state made legible; anti-duplication = project state checked before work; skill evolution = the project's working knowledge and its acknowledged unknowns maintained as part of that state.

The reviewer pattern, generalized

The orchestrator AIE generates; a reviewer-configured orchestrator AIE evaluates. Same type=orchestrator AIE, with a mode: "reviewer" configuration variant: different system prompt tuned for evaluation rather than execution, smaller-default model, read-only tool roster plus a small action set (advance / send-back / return). Not a new AIE type; not a new permission model; not separate infrastructure. The reviewer's accuracy against human ground truth is tracked the same way executor accuracy is.

One primitive, three modes. All three reuse the same AIE infrastructure; the difference is when the reviewer runs relative to the executor.

Triage reviewer (post-execution). Reads a completed unit of work, advances if it meets the bar, sends back with structured feedback otherwise. A 94% reviewer-to-human agreement rate means the reviewer autonomously triages 94% of work-product checks; the 6% disagreements bubble to humans, and irreversible or shared-scope actions bubble unconditionally regardless of reviewer opinion. This is the default mode, and the one that ships first in Phase 2.
Challenger reviewer (parallel-to-execution). Runs alongside the executor on high-stakes work, challenging proposed decisions before commit. The signature is a paired model with a different prompt and a different prior, surfacing dissent the executor wouldn't see in itself. Used for HR Suite pay-equity recommendations, Finance approval-gate decisions, anything where a wrong commit is expensive to roll back. The pattern reads as constitutional AI at the orchestration layer rather than the model layer; the constitution is in the reviewer's skill folder, not baked into weights.
Curator reviewer (post-hoc, scheduled). Reads completed sessions on a cadence (overnight, weekly), mines them for skill candidates and guardrail proposals, files them into the review queue (Surface 5). This is the same flywheel Anthropic ships as "Dreaming" in Managed Agents; the curator-config orchestrator runs as a background AIE on schedule, no new primitive needed.

Where the reviewer sits relative to HITL is the important part. The reviewer is a triage / challenger / curator layer, not a substitute for human approval on the things humans must approve. Shared-scope memory updates, plan changes, and side-effecting actions still bubble to a human. The reviewer handles the within-execution quality bar ("did this generated doc meet the project's conventions") and emits structured feedback that the observation pipeline mines for skill candidates.
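The boundary between reviewer triage and human approval can be written down as a routing rule. The verdict labels and the shape of `work_item` are hypothetical; the behavior taken from the text is that irreversible and shared-scope actions go to a human regardless of what the reviewer thinks, and that reviewer accuracy is measured as agreement with human verdicts.

```python
def route(work_item: dict, reviewer_verdict: str) -> str:
    """Return 'human', 'advance', or 'send_back' for one unit of work."""
    # Unconditional bubble-up: the reviewer's opinion is not consulted here.
    if work_item.get("irreversible") or work_item.get("scope") == "shared":
        return "human"
    return "advance" if reviewer_verdict == "meets_bar" else "send_back"


def agreement_rate(pairs):
    """pairs: (reviewer_verdict, human_verdict) for items a human also judged."""
    if not pairs:
        return None
    agreed = sum(1 for reviewer, human in pairs if reviewer == human)
    return agreed / len(pairs)


doc = {"kind": "generated_doc", "scope": "project"}
memory_update = {"kind": "memory_update", "scope": "shared"}
sent_email = {"kind": "send_email", "scope": "project", "irreversible": True}

routes = [
    route(doc, "meets_bar"),
    route(doc, "below_bar"),
    route(memory_update, "meets_bar"),
    route(sent_email, "meets_bar"),
]

rate = agreement_rate([
    ("meets_bar", "meets_bar"),
    ("meets_bar", "below_bar"),
    ("below_bar", "below_bar"),
    ("meets_bar", "meets_bar"),
])
```

Only items that reach a human produce a (reviewer, human) pair, so the agreement rate is computed on exactly the checks that bubbled.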

The reviewer is itself event-sourced, accuracy-tracked, and forkable. Every reviewer decision lands as an event with caused_by lineage back to the executor emission it judged; the reviewer's own profile evolves through the same skill-loop machinery the executor's does. Reviewer accuracy is one of the six things that compound (Part 1): more triage handed off as it climbs, less HITL load without quality dropping. The triage mode ships in Phase 2; challenger and curator modes ship as configuration variants of the same primitive, not new code paths.

Part 4 — the architecture

The buildout in Part 5 lands on a specific shape. Before walking the phases, here is the steady-state architecture they converge on: three layers, four memory scopes, one event shape, and the capabilities that fall out of it.

Each phase's work is concentrated. Phase 1 ships an event-sourced project state plus the briefing and duplicate-check gates; existing subsystems keep working and migrate to read from projections over later phases. Phase 2 closes the learning loop. Phase 3 makes the substrate enterprise-deployable.

The shape, layered

Three layers, in stack order from durable to surface. The substrate at the bottom: append-only event log plus a projector library plus a small set of named subsystems (profiles, projects, skills, memory, audit, access). The runtime in the middle — the per-turn execution layer, which today is the existing harness: turn loop, tool-execute gate, prompt builder, sub-agent surface, HITL handler, now reading from substrate projections instead of in-memory state. The surfaces at the top: Frontier Apps, Autopilot chat, the API, workflow nodes, each composing the runtime with a different presentation and a different profile.

Three layers. The substrate owns durability and the trace. The runtime owns per-turn execution, with three named gates that read from substrate projections instead of from in-memory state. The surfaces compose the runtime with a profile and a presentation.

Memory in four scopes (canonical)

Memory is the most-conflated word in agent-system design. It means at least four different things attached to four different units with four different write policies. Name them separately or they collapse into "the agent has a memory thing."

Session memory carries what was learned inside one conversation; the working knowledge that survives compaction even when the raw transcript doesn't. Writes auto-commit.
Project memory carries decisions, calibrations, conventions, in-flight claims, blockers, open questions, canonical artifact pointers for one body of work. Shared-scope writes go through propose-approve.
Profile memory carries cross-project calibrations, templates, and graduated skill candidates for a way of working (the Champ profile's accumulated knowledge across every change-management program KPMG has ever run). Writes need profile-admin approval.
Org memory carries brand voice, compliance constraints, audit retention rules, regulatory regime: the things that apply firm-wide. Writes need tenant-admin approval.

Org memory is intentionally narrow. Most of what people reach for as "org-wide" lives more accurately at profile scope; the unit of competence is one way of working, not the whole firm. A user is the actor in every event but is not itself a memory scope. Three more scopes are available as advanced opt-ins: Application, AIE, and Integration (the AIE scope for an AIE that accumulates cross-program knowledge, like HR Suite's Readiness Assessor; the Integration scope for canonical-version pointers in Google Docs). None of them shows up in the briefing or any first-class UI.

Four scopes, four different units, four different write policies. At session start, the runtime composes the agent's working context by reading project, profile, and org scopes in order, then layering session-local state as the conversation accumulates. Promotion across scopes is explicit.
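The composition order and the explicit-promotion rule can be sketched as below. The read order and the write policies are taken from the text; the data layout, the function names, and the choice to keep each scope as its own labeled section (instead of merging keys and picking a winner) are assumptions.

```python
READ_ORDER = ["project", "profile", "org"]  # order named in the text

WRITE_POLICY = {
    "session": "auto-commit",
    "project": "propose-approve",
    "profile": "profile-admin approval",
    "org": "tenant-admin approval",
}


def compose_context(memory: dict, session_notes: list) -> list:
    """Durable scopes first, in order, then session-local state on top."""
    sections = [(scope, list(memory.get(scope, []))) for scope in READ_ORDER]
    sections.append(("session", list(session_notes)))
    return sections


def promote(memory: dict, entry: str, from_scope: str, to_scope: str, approved_by=None):
    """Promotion is explicit: nothing moves up a scope without an approver."""
    if entry not in memory.get(from_scope, []):
        raise ValueError(f"{entry!r} not found in {from_scope} memory")
    if approved_by is None:
        raise PermissionError(f"{to_scope} writes need {WRITE_POLICY[to_scope]}")
    memory.setdefault(to_scope, []).append(entry)


memory = {
    "project": ["Exec comms use Q4-budget framing"],
    "profile": ["Run stakeholder-impact check before exec comms"],
    "org": ["Brand voice: plain, direct"],
}
context = compose_context(memory, ["User prefers short drafts today"])

try:
    promote(memory, "Exec comms use Q4-budget framing", "project", "profile")
    unapproved_blocked = False
except PermissionError:
    unapproved_blocked = True

promote(memory, "Exec comms use Q4-budget framing", "project", "profile",
        approved_by="profile_admin:dana")
```

Keeping the scopes as separate sections preserves where each fact came from, which matters once the briefing needs to make every claim clickable.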

Four design positions

The first three choices below define the event log itself. The fourth is the engineering move that unlocks everything downstream.

1. Events share a common outer shape, not a schema. Every event carries the same outer fields: id, type, actor, project-id, profile-id, session-id, caused-by, timestamp, payload, content-hash, optional signature. Type is a string the producer chooses; payload is opaque JSON. Audit and trace surfaces work against the outer fields; subsystems work against their own payload shapes. New event types don't require central registration. Profiles, projects, skills, memory, artifacts: all the same shape, all event-sourced. Surfaces (Frontier Apps, Autopilot, API, workflow-engine nodes) compose the runtime with a profile via this same uniformity; adding a surface is one config row plus a UI shell, not a fork.

2. Projections owned by named subsystems. The substrate provides ordered reads, append writes, and a projector library. It does not define "the canonical view of a project" or "the current state of skills." Each subsystem (profiles, projects, skills, memory, audit, access) owns its projections, its read API, its invalidation rules. Adding a new subsystem doesn't touch the substrate.

3. Causal lineage in the outer fields, not inferred. Every event names the event that caused it. A tool call names the LLM message that requested it. A skill proposal names the tool calls that triggered the observation. A patch names the proposal. Missing causes are an error on write, not a warning later. "Why did this decision happen" becomes a graph walk, not a database join across timestamps.

4. Prompts are content-hashed. Every prompt sent to the model lands as an event whose payload includes the prompt hash; the prompt text is stored once keyed by hash. Replay against the same hash produces deterministic inputs (the model's output may still sample differently, but the input is reproducible). Fork-and-diff against a different hash separates prompt changes from output changes. Regression testing, model-swap evaluation, and root-cause replay become tractable as a result.
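Position 3 above is the one that most changes day-to-day debugging, so it is worth a sketch. The `CausalLog` class is hypothetical, as are all event type names except `tool.requested`. One assumption is needed to make the rule workable: a small set of root event types (here, a user message) may be written without a cause.

```python
class MissingCause(ValueError):
    """Raised on write when an event does not name a known cause."""


class CausalLog:
    ROOT_TYPES = {"user.message"}  # assumption: only these may lack a cause

    def __init__(self) -> None:
        self._by_id = {}

    def append(self, event_id: str, type_: str, caused_by=None) -> None:
        if caused_by is None and type_ not in self.ROOT_TYPES:
            raise MissingCause(f"{type_} event {event_id} names no cause")
        if caused_by is not None and caused_by not in self._by_id:
            raise MissingCause(f"{event_id} names unknown cause {caused_by}")
        self._by_id[event_id] = {"type": type_, "caused_by": caused_by}

    def why(self, event_id: str) -> list:
        """Walk caused_by links back to the root: a graph walk, not a join."""
        chain = []
        current = event_id
        while current is not None:
            event = self._by_id[current]
            chain.append(event["type"])
            current = event["caused_by"]
        return chain


log = CausalLog()
log.append("evt_1", "user.message")
log.append("evt_2", "llm.message", caused_by="evt_1")
log.append("evt_3", "tool.requested", caused_by="evt_2")
log.append("evt_4", "skill.proposed", caused_by="evt_3")

lineage = log.why("evt_4")

try:
    log.append("evt_5", "tool.requested")  # no cause given
    orphan_rejected = False
except MissingCause:
    orphan_rejected = True
```

Because the check happens on write, every event already in the log is guaranteed to have a walkable lineage.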

The event shape

The shape every event takes:

{
  "event_id":   "evt_a7d2e9b1",
  "type":       "decision.recorded",
  "actor":      { "id": "usr_7f3b", "name": "Sam", "role": "compliance_analyst" },
  "project_id": "proj_q3cr",
  "profile_id": "prof_champ",
  "session_id": "sess_91",
  "caused_by":  "evt_4b9e1d0a",
  "timestamp":  "2026-05-29T10:19:26.341Z",
  "prompt_hash": "sha256:6f3a...",
  "payload":    { ... subsystem-specific ... },
  "signature":  "ed25519:..."
}

The outer fields are uniform; the payload is the subsystem's. Audit, replay, fork, diff, and lineage walk all work on the outer fields. Subsystems read their own payload shapes for their projections. The signature is optional for most types, mandatory for the ones regulated surfaces care about (Phase 3).
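A small sketch of how the prompt hash in that envelope could be produced, assuming SHA-256 over the UTF-8 prompt text (which matches the `sha256:` prefix in the example but is otherwise an assumption). `PromptStore` and `make_event` are hypothetical helpers, and the signature field is omitted.

```python
import hashlib


class PromptStore:
    """Stores each distinct prompt text once, keyed by its content hash."""

    def __init__(self) -> None:
        self._texts = {}

    def put(self, text: str) -> str:
        digest = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._texts.setdefault(digest, text)  # identical text is not stored twice
        return digest

    def get(self, digest: str) -> str:
        return self._texts[digest]

    def __len__(self) -> int:
        return len(self._texts)


def make_event(event_id, type_, actor, project_id, profile_id, session_id,
               caused_by, timestamp, payload, prompt_hash=None) -> dict:
    """Build the uniform outer envelope; the payload stays opaque."""
    return {
        "event_id": event_id, "type": type_, "actor": actor,
        "project_id": project_id, "profile_id": profile_id,
        "session_id": session_id, "caused_by": caused_by,
        "timestamp": timestamp, "prompt_hash": prompt_hash, "payload": payload,
    }


store = PromptStore()
prompt = "You are the Champ orchestrator. Brief the user before any tool call."
h1 = store.put(prompt)
h2 = store.put(prompt)  # same text sent on a later turn

event = make_event(
    "evt_a7d2e9b1", "decision.recorded", {"id": "usr_7f3b", "role": "compliance_analyst"},
    "proj_q3cr", "prof_champ", "sess_91", "evt_4b9e1d0a",
    "2026-05-29T10:19:26.341Z", {"summary": "example payload"}, prompt_hash=h1,
)
```

Two events that carry the same hash were produced from byte-identical prompts, so any difference between their outputs cannot be blamed on the input.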

Capabilities that fall out of this shape

The shared event shape, the content-hashed prompts, the projector library, the causal lineage in every event: these are the engineering. Each capability below is a projection over primitives the substrate already has, not a feature on the side.

Deterministic replay, four flows

Content-hashed prompts plus event-sourced everything else add up to a regression-and-experimentation surface today's hyperscaler platforms don't expose. Maxim, Galileo, and Holistic Agent Leaderboard run agent evals against simulated scenarios pre-deployment; none ship a production replay primitive that re-executes a real customer interaction byte-for-byte from durable logs against a new model, prompt, or policy. Four workflows fall out for free once the substrate is in production:

Model swap evaluation. Take a real Champ session that ran last week. Re-execute its event sequence against a different model (Claude 4.6 → Sonnet 4.7, GPT-4o → Haiku 4.5) with the same prompt hashes. Diff the resulting decision.recorded and tool.requested streams. Cost / latency / output deltas without re-instrumenting anything.
Prompt A/B. Fork a project at any event; in the fork, change a system-prompt skill folder; let it replay forward. Two event logs diverge from a known shared prefix. The diff between divergence points is the prompt change's effect.
Root-cause replay. Production session went sideways at 14:32Z. Open the audit chain (Surface 6); walk back to the divergence point; replay forward with debug-level verbosity; observe what tool result the model misread or which skill misfired. Same UI surface, different operator mode.
Deploy-gate regression. Before promoting a new orchestrator version or a new model, replay last quarter's prod sessions against the candidate. The deploy gate is "diff stays below threshold." This is what Sarbanes-Oxley auditors and EU AI Act reviewers actually want — not a static eval suite, but a reproducible test that the system being shipped behaves like the system that was certified. The same primitive answers the data-sovereignty question for any hosted agent product: a customer can export their event log, run it on their own substrate, prove what happened.

None of this is four different systems. It's one event log, one replayer, one fork primitive, four operator modes. One caveat: model outputs are stochastic, so replay reproduces inputs deterministically but outputs may sample differently. For input-determinism this is enough; for output-determinism you also pin temperature and seed (and accept that frontier models don't always honor seed; replay-with-tolerance is the production stance).
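The fork primitive and the diff can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions: runs are plain lists of event dicts, and `fork` and `first_divergence` are hypothetical helper names, not the substrate's replayer.

```python
def fork(log: list, at_event_id: str) -> list:
    """Return the shared prefix of a run, up to and including at_event_id."""
    for i, event in enumerate(log):
        if event["event_id"] == at_event_id:
            return list(log[: i + 1])
    raise KeyError(at_event_id)


def first_divergence(run_a, run_b, types=("decision.recorded", "tool.requested")):
    """Index of the first differing event among the compared types, or None."""
    a = [(e["type"], e["payload"]) for e in run_a if e["type"] in types]
    b = [(e["type"], e["payload"]) for e in run_b if e["type"] in types]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Identical so far: the runs diverge only if one stream is longer.
    return None if len(a) == len(b) else min(len(a), len(b))


baseline = [
    {"event_id": "evt_1", "type": "tool.requested", "payload": {"tool": "search_artifacts"}},
    {"event_id": "evt_2", "type": "decision.recorded", "payload": {"choice": "phased rollout"}},
]
# The fork replays forward under a changed model or prompt and decides differently.
candidate = fork(baseline, "evt_1") + [
    {"event_id": "evt_2b", "type": "decision.recorded", "payload": {"choice": "single cutover"}},
]
divergence_index = first_divergence(baseline, candidate)
```

Because outputs are sampled, a production comparison would likely apply a tolerance rather than the exact equality used here.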

Chained delegated identity through sub-agents

Every event in the log names an actor. The orchestrator runs on behalf of a user; a sub-agent invoked by the orchestrator runs on behalf of the same user but through the orchestrator's profile-allowed tool roster; a cross-vendor handoff extends the chain across the vendor boundary, whether it travels over A2A (Google's Agent-to-Agent protocol), an MCP (Model Context Protocol) server-to-server call, ANP (Agent Network Protocol), or a direct API. The substrate writes every link into the log, and the audit walk reconstructs "which principal authorized which sub-agent under what scope to produce which load-bearing claim in the final output."

Gartner publicly predicts more than half of agentic AI initiatives will stall on identity by 2028. CoSAI and NIST are starting work on delegated principal auth through sub-agent chains; no shipped product solves it. A2A signs agent cards (the sender's domain identity); ANP signs via decentralized identifiers; none of the candidate cross-vendor protocols publicly sign the chain that produced the call. This is the substrate's most direct enterprise-security claim: confused-deputy attacks across multi-agent orchestrations become inspectable, not theoretical. Ships partially in Phase 1 (actor on every event, caused_by lineage), formalized in Phase 3 (signed chain at every handoff boundary, role-checked at each substrate write).
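One way to picture a signed chain is a list of links in which each link signs its principal, its scope, and the previous link's signature. The sketch below uses HMAC-SHA256 from the Python standard library purely as a stand-in; the event shape in this document shows an ed25519 signature, and a real deployment would use asymmetric keys. The link layout, key table, and scope strings are all assumptions.

```python
import hashlib
import hmac
import json


def _sign(key: bytes, principal: str, scope: str, prev_sig: str) -> str:
    body = json.dumps({"principal": principal, "scope": scope, "prev": prev_sig},
                      sort_keys=True)
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def extend_chain(chain: list, principal: str, scope: str, keys: dict) -> list:
    """Append a link that signs over the previous link's signature."""
    prev_sig = chain[-1]["sig"] if chain else ""
    sig = _sign(keys[principal], principal, scope, prev_sig)
    return chain + [{"principal": principal, "scope": scope, "prev": prev_sig, "sig": sig}]


def verify_chain(chain: list, keys: dict) -> bool:
    """Recompute every signature in order; any edited link breaks the chain."""
    prev_sig = ""
    for link in chain:
        expected = _sign(keys[link["principal"]], link["principal"], link["scope"], prev_sig)
        if link["prev"] != prev_sig or not hmac.compare_digest(expected, link["sig"]):
            return False
        prev_sig = link["sig"]
    return True


# Hypothetical principals and keys for illustration only.
keys = {"usr_7f3b": b"user-key", "prof_champ": b"profile-key",
        "narrative_builder@v3": b"agent-key"}
chain = extend_chain([], "usr_7f3b", "project:proj_q3cr", keys)
chain = extend_chain(chain, "prof_champ", "tools:profile-roster", keys)
chain = extend_chain(chain, "narrative_builder@v3", "artifact:draft", keys)

tampered = [dict(link) for link in chain]
tampered[1]["scope"] = "tools:*"  # a widened scope should fail verification
```

Verification succeeds for `chain` and fails for `tampered`, which is the property an audit walk relies on when it asks which principal authorized which sub-agent under what scope.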

Knowledge gaps as a first-class telemetry signal

Every vendor instruments tokens and latency. Nobody publicly instruments what the agent didn't know. The observation pipeline (Phase 2) already watches tool sequences for skill-candidate patterns; extending it to watch low-confidence outputs and explicit don't-know responses turns them into a tracked signal. The operator console (Surface 8) ranks gaps by frequency, classifies by source (missing skill, missing tool, missing memory, missing data source), and links each one to an authoring affordance. Closing a gap is the operator-loop ingredient that makes the project measurably smarter week-over-week. Without instrumenting it, "the project compounds" stays a hope.

Hard budget enforcement at the loop

Every FinOps vendor in 2026 ships cost dashboards. The substrate has the gate where enforcement actually lives: the tool-execute path and the per-turn LLM call. Three enforcement modes, all in Phase 3:

Kill at ceiling. Per-Job, per-tenant, or per-Profile budget cap. On overrun the orchestrator session enters terminating state, finishes its current turn, emits a budget.exceeded event, suspends. Resume requires a budget extension event.
Reroute under pressure. Above N% of ceiling, the runtime swaps the executor to a cheaper model tier; the reviewer profile keeps its quality model. A budget signal in the prompt is a suggestion; this is enforcement.
Pause for approval. The orchestrator emits a HITL event with type=budget_exceeded; the user or designated approver confirms continue or abort. Same HITL pipeline as everything else, dedup-by-key applies (a runaway parallel dispatch surfaces one approval, not N).

OpenAI's Workspace Agents session billing and Anthropic's $0.08/session-hour-plus-tokens are both opaque and unbounded by default. The substrate gives enterprises the runtime gate.
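The three enforcement modes reduce to a decision made before each tool call or LLM turn. The following is a rough sketch, assuming a single numeric budget; `reroute_pct` stands in for the "N% of ceiling" in the text, and its default of 0.8 is an arbitrary illustrative value.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    REROUTE = "reroute_to_cheaper_model"
    PAUSE = "pause_for_approval"
    KILL = "kill_at_ceiling"


def budget_gate(spent: float, ceiling: float, reroute_pct: float = 0.8,
                on_overrun: Action = Action.KILL) -> Action:
    """Decide what the loop does before the next tool call or LLM turn."""
    if on_overrun not in (Action.KILL, Action.PAUSE):
        raise ValueError("on_overrun must be KILL or PAUSE")
    if spent >= ceiling:
        return on_overrun          # hard stop, or hand off to a HITL approval
    if spent >= reroute_pct * ceiling:
        return Action.REROUTE      # executor drops to a cheaper model tier
    return Action.PROCEED
```

For example, `budget_gate(85, 100)` returns `Action.REROUTE`, and `budget_gate(100, 100, on_overrun=Action.PAUSE)` returns `Action.PAUSE`. Placing the check in the loop rather than in the prompt is what makes it enforcement rather than a suggestion.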

Bring-your-own-model, bring-your-own-deployment

Replay treats the model as a parameter. The audit chain doesn't care which provider produced a turn. Effective-access doesn't care where the deployment runs. A customer can replace the executor with their own fine-tune, their own hosted Claude in their VPC, OpenAI Frontier behind their account, or a local Llama variant. Replay still works, audit still walks, skills still apply, the project's accumulated state survives the swap. This is the substrate's structural answer to model lock-in, the standing concern with any hosted agent product. The orchestrator AIE config has a model_pin field; per-tool model overrides land on the same event log; the reviewer's model is configured independently so quality and cost can be tuned separately.

Confidence-gated execution

The orchestrator can route a turn through a reviewer profile when its confidence drops below threshold, surface a HITL clarification, or escalate to a stronger executor model. The confidence signal already exists at the LLM-call boundary (logprobs blended with self-reported confidence per the harness's existing process_completion). Wiring it into the runtime as a routing decision is one gate addition: tool-execute inspects pre-call confidence and chooses among proceed / escalate-model / paired-reviewer / HITL. The hosted-agent products in the field don't carry a first-class confidence-gating primitive; uncertainty handling is left to the model. Ships in Phase 2 alongside the reviewer pattern (Part 3); the reviewer is the natural target when confidence is low and the work is reversible.
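As a sketch of the gate addition, the routing choice can be written as a pure function of the pre-call confidence and whether the work is reversible. The two thresholds and the ordering of the fallbacks are assumptions made for illustration; the source only names the four outcomes and says the reviewer is the natural target for low-confidence, reversible work.

```python
def confidence_gate(confidence: float, reversible: bool,
                    threshold: float = 0.7, floor: float = 0.4) -> str:
    """Pick one of the four routes named in the text (thresholds are assumed)."""
    if confidence >= threshold:
        return "proceed"
    if reversible:
        return "paired-reviewer"   # low confidence, but a reviewer can send it back
    if confidence >= floor:
        return "escalate-model"    # irreversible work: try a stronger executor first
    return "hitl"                  # irreversible and very uncertain: ask a human
```

With these defaults, `confidence_gate(0.5, reversible=True)` routes to the reviewer, while `confidence_gate(0.2, reversible=False)` surfaces a HITL clarification.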

Interop — the primitives the wire doesn't own

A complete agent-interop layer has to answer a fixed set of questions: capability discovery, wire transport, principal-of-chain identity, authorization, audit, cost, HITL bubble-up across boundaries, long-running durability, replay against versioned state, and observability. The candidate wire formats cover the first two well and the rest unevenly or not at all. The substrate's posture: own the primitives that don't belong on any wire; let the customer's wire of choice carry the call.

The split, primitive by primitive:

Discovery and binding. Auto-nomination from the tenant's AIE roster and registered MCP servers, through one runtime gate. The tool roster is per-session and resolved live, not a static manifest.
Wire transport. Adapter pattern. MCP in and out covers most cross-agent traffic; A2A, ANP, and others land as adapters when a customer needs them. The substrate is wire-opaque.
Delegated principal chain. Every event carries an actor; every cross-vendor call carries the full chain (user → orchestrator profile → sub-agent → target), signed end-to-end. Wire formats sign the sender at best; the chain that produced the call is what regulated reviewers and confused-deputy defenses actually need.
Authorization. The five-filter effective-access pipeline (registered capability ∩ profile allowlist ∩ project gate ∩ user permissions ∩ mode policy) runs at every call, uniform across wires. Denials carry a reason axis and a suggested_alternative, not an opaque 403.
Provenance and audit. caused_by lineage on every event, forward- and backward-traversable, crossing vendor boundaries through the signed chain. The audit walk is a graph traversal, not a join.
Cost attribution and enforcement. Token and cost metadata on every tool and AIE invocation, rolled up to per-Job / per-Profile / per-tenant. Kill / reroute / pause at the loop, not in a dashboard after the fact (see Hard budget enforcement, above).
HITL bubble-up across boundaries. Sub-agent and cross-vendor HITL events surface in the parent Job's queue via parent_handler_ref, with dedup by integration auth target so N parallel children produce one approval.
Long-running durability. The event log is the rehydration substrate. A session in awaiting-human-long tears down its runner; resume replays from the most recent committed boundary on any pod.
Replay and versioning. Content-hashed prompts plus versioned event types let any call replay against a new model, a new orchestrator version, or a new prompt later (see Deterministic replay, above).
Observability. Events are the trace. Per-event token, cost, latency, and outcome metadata feed Surface 8 directly, no separate tracing layer.
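The authorization item in the list above describes an intersection of five filters with a reasoned denial. A minimal sketch follows, assuming each filter is a set of allowed tool names. The filter keys, the `publish_sharepoint` tool name, and the rule for choosing a suggested alternative are hypothetical.

```python
FILTERS = ["registered_capability", "profile_allowlist", "project_gate",
           "user_permissions", "mode_policy"]


def effective_access(tool: str, filters: dict) -> dict:
    """Allow a tool only if all five filters admit it; otherwise explain why."""
    for name in FILTERS:
        if tool not in filters[name]:
            # Tools that pass every filter are candidates for the alternative.
            allowed = set.intersection(*(set(filters[n]) for n in FILTERS))
            return {"allowed": False, "reason": name,
                    "suggested_alternative": min(allowed, default=None)}
    return {"allowed": True}


all_tools = {"read_artifact", "search_artifacts", "publish_sharepoint"}
filters = {
    "registered_capability": all_tools,
    "profile_allowlist": all_tools,
    "project_gate": all_tools,
    "user_permissions": {"read_artifact", "search_artifacts"},
    "mode_policy": all_tools,
}
denial = effective_access("publish_sharepoint", filters)
```

The denial names the filter that failed (`user_permissions`) and offers a tool the caller can actually use, which is more useful to an agent than an opaque 403. Picking the alphabetically first allowed tool is a placeholder; a real suggestion would presumably be chosen by relevance.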

What the substrate doesn't try to be: a connector marketplace, a wire protocol, or a skill format. The right move is owning the primitives the wire doesn't and letting the rest of the field shift around them.

Substrate evolution

Event-sourced systems are hard to evolve. Once an event type ships, its payload shape is permanent in the historical record, and replay must keep working against old events. Three disciplines mitigate this from day one:

Versioned event types. Type names carry an explicit version (decision.recorded.v1, decision.recorded.v2). Adding a field is a new type version, not a mutation. Old events stay readable by old handlers; new events get the new handler. Subsystems own their migration windows.
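A small sketch of how versioned type names keep old events readable: handlers register under the full versioned name, so a v1 event never reaches v2 code. The registry, the decorator, and the payload fields (`summary`, `rationale`) are assumptions for illustration.

```python
HANDLERS = {}


def handles(event_type: str):
    """Register a handler under an explicit, versioned event type name."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@handles("decision.recorded.v1")
def decision_v1(payload: dict) -> dict:
    # v1 payloads predate the rationale field, so the view fills in None.
    return {"summary": payload["summary"], "rationale": None}


@handles("decision.recorded.v2")
def decision_v2(payload: dict) -> dict:
    return {"summary": payload["summary"], "rationale": payload["rationale"]}


def project(event: dict) -> dict:
    """Dispatch on the versioned type; old events keep their old handler."""
    return HANDLERS[event["type"]](event["payload"])


old = {"type": "decision.recorded.v1", "payload": {"summary": "Adopt phased rollout"}}
new = {"type": "decision.recorded.v2",
       "payload": {"summary": "Adopt phased rollout", "rationale": "RAT-124"}}
```

Both events project into the same view shape, and removing the v1 handler is a deliberate act that marks the end of a migration window.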

Content-hashed payload schemas. Each payload version's JSON schema lands as a substrate-internal event itself. Replay tooling resolves the schema for an event by walking back to the most recent schema event of that type. Old replay tools work on old events; new replay tools work on both.

Migration as projector composition. When a subsystem changes its projected view (the briefing's JSON intermediate format, say), it ships a new projector alongside the old one. Both materialize from the same events. Surfaces read whichever projector they're pinned to. The migration window closes when no surface reads the old projector anymore.

Pre-Phase-1: don't event-source the existing harness state in one go. New events flow through the substrate; existing tables remain authoritative for their current concerns and get migrated table-by-table as their owners are ready. Source-of-truth for the project subsystem (the new code path) is the event log from day one. Source-of-truth for old subsystems (artifacts, sessions, jobs) remains their tables until each subsystem opts into projection-based reads.

Part 5 — the buildout, in three phases

The architecture above is the steady state. Getting there is phased, not a single drop.

Buildout on a timeline

Each phase depends on substrate primitives the previous one introduced. Inside a phase, items can ship in any order; across phases, the sequence is fixed. Phase 1 ships against the June 15 Champ MVP date; Phases 2 and 3 follow on subsequent quarterly releases. The architecture beneath all three phases is the same three-layer stack (substrate / runtime / surfaces) shown in Part 4.

The three phases, in detail

Phase 1 — substrate basics + the user-visible MVP behaviors (Champ June 15)

What this unlocks. Continuity on entry, anti-duplication (via in_flight registry; semantic index lights up progressively), and a docs-folder memory model: the Champ MVP behaviors from Part 1. The event log is in production from day one for the new project subsystem; existing harness tables remain authoritative for the things they already own and migrate over later phases.

What we build. The event log (append-only Postgres-backed, common outer shape, content-hashed prompts), the projector library that registers per-event-type handlers and materializes named views, a Redis read-side cache. Strict-mode replay for divergence detection (deterministic-input, sampled-output caveat acknowledged). Fork from any event into a new run id; SQL-level diff between two runs.

On top of the substrate: the profile subsystem (orchestrator AIE of type=orchestrator, named-input contract, inherited AIE infrastructure). The project subsystem (project-id resolves to a projected state covering members, decisions with rationale and lineage, in-flight claims with heartbeats, blockers, open questions, suggested next, canonical artifacts). The memory subsystem (four scopes wired through the prompt builder). The artifact subsystem (versioned objects with canonical pointer; external roundtrip is just another event). Per-tenant cost tracking from day one (every aie.invoked and tool.requested event carries token + cost metadata; aggregations roll up to per-Job + per-tenant projections, even though outcome-priced billing is Phase 3). The three gate integrations: session create loads project state and runs a briefing turn before any tool call; tool execute runs the duplicate-check projection (in_flight registry; semantic index when ready) before any artifact-creating tool fires; prompt build assembles scopes-skills-state into the system prompt.

Source-of-truth resolution. For the new project subsystem (decision-flow state, briefings, in_flight): event log is canonical, Postgres tables are projections invalidated on new events. For existing harness subsystems (artifacts, sessions, jobs, memory_entries): tables remain authoritative; events emit alongside writes for audit and replay. Each pre-existing subsystem migrates to projection-based reads when its owner is ready, with a strict-mode replay test guarding the cutover. No big-bang migration.

Multi-tenant isolation. The event log partitions by project-id with tenant-id as the isolation boundary. Cross-tenant joins are impossible at the storage layer; the projector library enforces tenant scope on every read; row-level security is the floor. Projections are per-tenant.

Background jobs. Treated as AIE invocations with mode=background (per Maya's HR Suite scenario). They're long-running orchestrator AIE sessions that emit periodic aie.streaming events; the user's task drawer (Surface 4) lists them alongside foreground Jobs with a background status pill. The substrate doesn't need a special "background job" primitive; they're just sessions that stay open.

Deliberately not in. The skill observation pipeline. The reviewer-config AIE. RBAC beyond user pass-through. Long-running durability beyond what the harness already has. Parallel sub-agent dispatch with structured merge. The semantic index in its production form (ships partial in this phase; full calibration in Phase 2). Outcome-priced billing (Phase 3; only the cost-tracking events ship here).

The interesting design tension. How much projection state lives in materialized Postgres tables versus computed on read from the event log. Hot reads (briefing, duplicate-check, current canonical artifact) materialize into tables and invalidate on new events; everything else computes from the log. What's hot is a measurement question that production answers; the materialization layer is built to allow shifting the line.

Phase 2 — the learning loop closes

What this unlocks. The project gets smarter over time without anyone editing a config. Skills the project has learned apply by default to every new session. Builders run a profile in training mode to surface gaps and distill traces into skill proposals. New profiles fork from base profiles with a sane inheritance story.

What we build. The full skill subsystem lifecycle across the four canonical states (unvalidated → experimental → verified, with deprecated as the retirement terminus). Entry pathways are explicit: human-authored skills land at experimental; skills extracted from session traces land at unvalidated; skills mined from existing tool and AIE schemas land at experimental because the underlying capability is already trusted; skills evolved from observed failure patterns land at unvalidated. Maturation gates fire on skill-applied-successfully and skill-applied-unsuccessfully events and promote skills from experimental to verified; thresholds per scope (project in weeks, profile in months, org in quarters). Retirement is the inverse: drops in usage or success rate auto-flag for the deprecated bucket via a deprecation review.
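The entry pathways and the experimental-to-verified gate can be sketched as a lookup table plus one function. The pathway keys and the numeric thresholds (`min_uses`, `min_rate`) are illustrative assumptions. The text does not specify how a skill moves from unvalidated to experimental, so this sketch leaves that transition out.

```python
# Entry state per pathway, as described in the text (key names are assumed).
ENTRY_STATE = {
    "human_authored": "experimental",
    "trace_extracted": "unvalidated",
    "schema_mined": "experimental",
    "failure_evolved": "unvalidated",
}


def maturation_gate(state: str, successes: int, failures: int,
                    min_uses: int = 10, min_rate: float = 0.9) -> str:
    """Promote experimental skills to verified once the evidence clears the bar."""
    if state != "experimental":
        return state  # this gate only handles experimental -> verified
    uses = successes + failures
    if uses >= min_uses and successes / uses >= min_rate:
        return "verified"
    return state
```

For instance, `maturation_gate("experimental", 19, 1)` returns `"verified"`, while five successes and five failures leave the skill at experimental. Per-scope thresholds would be passed in as different `min_uses` and `min_rate` values.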

The observation pipeline runs as a background process subscribed to recent tool-requested and tool-responded event streams. Pattern signatures (tool sequence + outcome + intent class) get clustered. Repeating successful patterns become skill proposals; repeating failure patterns become guardrail proposals ("before doing X, check Y"). Each proposal carries its evidence event ids by reference, so rejecting a proposal retains the evidence and prevents the same pattern from regenerating a duplicate.
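A simplified sketch of the proposal step, assuming exact-match grouping on the signature rather than true clustering. The observation fields, the `min_repeats` threshold, and the shape of the rejection memory are hypothetical.

```python
from collections import defaultdict


def propose(observations: list, rejected_signatures: set, min_repeats: int = 3) -> list:
    """Group observations by signature and turn repeated patterns into proposals."""
    clusters = defaultdict(list)
    for obs in observations:
        signature = (tuple(obs["tools"]), obs["outcome"], obs["intent"])
        clusters[signature].append(obs["event_id"])  # evidence is kept by reference
    proposals = []
    for signature, evidence in clusters.items():
        if len(evidence) < min_repeats or signature in rejected_signatures:
            continue  # too rare, or already rejected once
        kind = "skill" if signature[1] == "success" else "guardrail"
        proposals.append({"kind": kind, "signature": signature, "evidence": evidence})
    return proposals


def obs(event_id, tools, outcome, intent="draft_email"):
    return {"event_id": event_id, "tools": tools, "outcome": outcome, "intent": intent}


observations = (
    [obs(f"evt_s{i}", ["search_artifacts", "read_artifact"], "success") for i in range(3)]
    + [obs(f"evt_f{i}", ["read_artifact"], "failure") for i in range(3)]
    + [obs("evt_x", ["semantic_search"], "success")]
)
proposals = propose(observations, rejected_signatures=set())
rejected = {(("read_artifact",), "failure", "draft_email")}
after_rejection = propose(observations, rejected_signatures=rejected)
```

The first call yields one skill proposal and one guardrail proposal. Once the failure signature is recorded as rejected, the same evidence no longer regenerates the guardrail.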

The reviewer AIE type ships here. Same AIE infrastructure as the orchestrator AIE; different cognitive shape (evaluation, not generation); often a cheaper model. The reviewer reads a completed unit of work, its event log, its caused-by chain, and emits advance / send-back / return-to-backlog. Profile training mode is a profile-level flag putting a session into explore-heavy behavior. Profile forking and diff are substrate-level operations; the parent's audit trail is intact in the child, and rollbacks replay to an earlier event. Plan-as-skills lands here, not in Phase 1: the FA (Frontier App, per Part 2) still owns the canonical editable plan artifact; on session open, the plan loads as a set of project-scope skills.

Deliberately not in. Cross-profile skill transfer (noticing a date-conflict check fires across three rollouts and proposing graduation to profile scope). Adversarial self-play. Automated probe-set construction for skill evaluation. RBAC beyond user pass-through. (Compositional skill hierarchies are also out — covered in the section-closer below.)

The interesting design tension. How aggressive the observation pipeline should be. Too aggressive and the review queue floods (published research on adaptive skill libraries reports bloat factors of 4× to 9× when proposals are unmanaged). Too conservative and the loop never closes. The opening answer is conservative: rate-limited per session, batched reviews, automatic clustering of near-duplicate proposals at the experimental stage, retained rejection evidence. The production answer depends on review-fatigue telemetry only real usage produces.

Phase 3 — governance, durability, parallel execution

What this unlocks. The substrate becomes enterprise-deployable in the senses that matter: RBAC across the stack, multi-tenancy hardening, budget controls, long-running Jobs that survive pod restarts and integration-token rotations, parallel sub-agent dispatch with structured merge, signed audit, outcome-priced billing rollups.

What we build. The effective-access pipeline at the tool-execute gate, as specified in the Interop section above — landing at every tool call rather than only on cross-vendor wires. RBAC across the stack: user identity flows through every event as the actor; profile-level role gates separate admin / contributor / viewer; project-level membership controls who reads project state; tool-level required roles enforce at execute; the skill approver matrix per scope chooses who can approve writes; HITL approvers are picked by role.

The HITL event shape formalized. One shape across the runtime, carrying type, required/user-facing flags, timeout class, dedupe key, parent-handler reference, approver role, and the evidence event ids that justify the prompt. Bubble-up to parent Job handlers is mandatory for nested sub-agents. Dedupe by key means parallel sub-agents that all need the same auth surface one prompt, not N. Timeout class drives durability: short = in-process futures; medium = pod-graceful-shutdown with reload; long = full session rehydrate-on-resume-webhook with event-log replay.

Long-running durability falls out of the event-sourced substrate. A session entering awaiting-human-long tears down its runner coroutine and sandbox; state lives in the event log; resume rehydrates on any pod. The proof that the substrate is real is that this works without bespoke per-session serialization.

Parallel sub-agent dispatch with structured merge: named strategies (dedupe_and_rank, summarize_and_reconcile, compare_and_vote, union_with_provenance) chosen per dispatch. Sub-agent traces stream into the parent Job's view in real time; sub-agent HITLs bubble.

Quotas, cost, outcome pricing. Per-tenant token / concurrency / cost caps. Per-Job cost / latency / token rollups, queryable for billing. Budget caps surface in the model's context so the agent can choose cheaper paths under budget pressure.

Signed audit. Specific event types in regulated surfaces (HR Suite pay-equity decisions, Finance approval gates) carry cryptographic signatures at write time. The substrate doesn't change shape; the signature is a wrapper that verifiable-replay tooling validates. Unified audit query surface: "show me every memory write by profile X this week" is one event-log query; "why was this decision made" is a causal walk via caused_by. Workflow generation and promotion: the orchestrator can propose saving a useful sequence of skill firings as a generated workflow; promotion to first-class AIE is an admin gate.

The interesting design tension. How the substrate scales past a thousand concurrent active projects. The event log partitions by project-id with tenant-id as the isolation boundary (one tenant's events never join another's). Cold projects sleep; warm projects keep a Redis cache; hot projects pin to a sticky pod. Replay-on-resume is bounded because projections snapshot at known boundaries (every K events or every N hours, whichever fires first). A speculative direction once the loop closes: tickets-as-substrate-objects with autonomous orchestrator decomposition and reviewer-profile evaluation (sometimes called the "kanban inversion") becomes a natural composition. Composable, not load-bearing. Out of scope for the three phases.
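The snapshot rule that bounds replay-on-resume (every K events or every N hours, whichever fires first) is a one-line predicate. The default values for K and N below are placeholders, not figures from the design.

```python
from datetime import datetime, timedelta


def should_snapshot(events_since: int, last_snapshot_at: datetime, now: datetime,
                    k_events: int = 500, n_hours: int = 6) -> bool:
    """Snapshot when either bound is reached, whichever fires first."""
    too_many_events = events_since >= k_events
    too_old = now - last_snapshot_at >= timedelta(hours=n_hours)
    return too_many_events or too_old


t0 = datetime(2026, 5, 29, 9, 0, 0)
```

A quiet project still snapshots on the time bound, and a busy one snapshots on the event bound, so the number of events replayed on resume stays below K either way.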

What stays out, deliberately

Three things are not in any phase.

Voice surfaces are a separate code path. The latency floor for spoken interaction is below what the per-turn LLM loop with tool execution can deliver. The substrate exports a read API a voice surface consumes; the voice agent loop lives in its own runtime with its own constraints. Siblings, not children.

Cross-tenant projects are not in. A consulting tenant advising a client where the client sees the project but the tenant's other clients do not is a real shape, but the access pipeline above and the four-scope memory model don't encode it. Adding it is design work, not just engineering: the tenant boundary today is a hard isolation boundary and softening it requires explicit access design at the project level.

Compositional skill hierarchies are not in. Strategic skills composing functional skills composing atomic skills, with credit assignment flowing back up. The single-pass skill loop ships in Phase 2. The compositional version requires credit-assignment research still in flight; building ahead of the research is more risk than the gain warrants.

Part 6 — what the user sees

The substrate is the work; the surfaces are how that work shows up in front of people. Eight surfaces, mapped to the moments that matter: starting a program with an approved plan, opening a project, asking the agent for work, managing the queue that's waiting on you, governing what the agent learned, investigating why a decision was made, setting up an orchestrator (for builders), and closing the operator loop on knowledge gaps, cost, regression, and principal chains.

1. Starting a program — plan generation and approval

Champ is plan-gated: the orchestrator session can't start until the program has an approved plan. The admin clicks Generate plan; a separate plan-generation AIE in the Frontier App produces a structured plan doc; the admin reviews and either edits or approves. Only then does the chat UI unblock. The state machine lives in the app, not the orchestrator; the orchestrator receives the approved plan as a named runtime input.

Plan lifecycle

not_generated → generating (2 sec ago) → generated (now) → approved → execution_enabled

Generated by: plan_generator@v1.4 (AIE) · prompt_hash: sha256:9a3f… · elapsed: 14.2s · 8.7k tokens

Permissions: Generate: R1 only · Approve: R1 only · Edit: R1, R2

Proposed plan — v3 (draft)

5 steps · 1 parallel block · est. 6 weeks

Plan-generation AIE inferred the structure from the program description ("change-management rollout of GitHub Enterprise across NA orgs, ~3,200 developers") and the Champ orchestrator's nominated AIE roster.

#	AIE	Step	Side effects
1	narrative_builder@v3	Draft the change narrative; voice: dev-experience	none
2	stakeholder_email@v2	Draft email to leadership + eng leads + ICs	none
3	adoption_faq@v1	Generate FAQ from narrative + known objections	none
4	engagement_tracker@v2	Schedule at launch+2w: set up dashboards	writes
5	executive_reporter@v3	Conditional: launch+6w if engagement > threshold	none

Validation against current orchestrator tool registry: 5/5 AIEs accessible · all version pins resolve · dependency graph acyclic

Actions: Regenerate · Edit plan · Approve (unblocks chat). Approving emits a plan.approved event, after which the orchestrator AIE becomes invokable.

Plan-generation is its own AIE invocation in the Frontier App — a separate event in the project's log, with its own prompt_hash, cost, and provenance. The orchestrator never runs against an unapproved plan because the FA simply doesn't invoke it. Plan-as-state-machine in the app; plan-as-named-input doc to the orchestrator. The two representations stay decoupled.
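The app-side state machine can be sketched as a transition table with role checks. The state names and the R1/R2 permissions come from the mockup above. The action names, the assumption that regenerating requires the same role as generating, and the `step` helper are illustrative.

```python
TRANSITIONS = {
    ("not_generated", "generate"): "generating",
    ("generating", "complete"): "generated",
    ("generated", "regenerate"): "generating",
    ("generated", "edit"): "generated",
    ("generated", "approve"): "approved",
    ("approved", "enable"): "execution_enabled",
}
# User actions and the roles allowed to take them; system actions are unrestricted.
ALLOWED_ROLES = {"generate": {"R1"}, "regenerate": {"R1"},
                 "approve": {"R1"}, "edit": {"R1", "R2"}}


def step(state: str, action: str, role: str) -> str:
    """Apply one action to the plan state, enforcing role and transition rules."""
    if role not in ALLOWED_ROLES.get(action, {role}):
        raise PermissionError(f"{role} may not {action}")
    if (state, action) not in TRANSITIONS:
        raise ValueError(f"cannot {action} from {state}")
    return TRANSITIONS[(state, action)]


def orchestrator_invokable(state: str) -> bool:
    # The app simply never invokes the orchestrator before this state.
    return state == "execution_enabled"


state = "not_generated"
for action, role in [("generate", "R1"), ("complete", "system"), ("edit", "R2"),
                     ("approve", "R1"), ("enable", "system")]:
    state = step(state, action, role)
```

After the loop, `state` is `execution_enabled`. An R2 user attempting `approve` raises `PermissionError`, and the orchestrator never sees any of this logic because it only receives the approved plan as a named input.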

2. Opening a project — the briefing

Once the plan is approved, the first thing a user sees on every session is not a chat input. It is a briefing, generated deterministically from project events, structured as decisions made, items in flight, open questions, blockers, and a ranked list of what to pick up next, each with a Start session on this affordance. Every claim is clickable; every number references an event. The agent has not done anything yet. The substrate has projected what the project knows.

Projects

Q3 Change Rollout — GitHub NA

ERP Carve-Out — Region EMEA

Pricing Model Refresh — Global

GenAI Adoption — Marketing

Data Platform — Finance

Org Design — Product & Eng

ITSM Modernization — Global

M&A Integration — Project Nova

Archive · 14

Today · Briefing regenerated 09:31:14 · 23 events since last session

Q3 Change Rollout — GitHub NA

Project CR-GH-NA | Phase: Execute | Owner: Alex Morgan

Decisions made this week (3)

View all →

1.	May 19 14:32	Adopt phased rollout for Enterprise Cloud orgs	[RAT-124]
2.	May 18 11:07	Use training-led adoption vs comms-first	[RAT-118]
3.	May 16 09:41	Deprioritize custom provisioning for now	[RAT-112]

In-flight items (2)

View all →

	CR-317	Update change impact assessment for Engineering	Alex Morgan	in_flight
	CR-289	Finalize enablement plan for ITSM pilot	Jamie Lee	in_flight

Open questions (2)

View all →

Q-142	How should we sequence org-level vs team-level rollouts?	Raised by Jamie Lee
Q-137	What's the minimum viable success metric for Phase 1?	Raised by Priya Nair

Blockers (1)

View all →

B-067

Pending legal review of customer data handling addendum

Owner: Taylor Kim

Suggested next

ranked by impact & readiness · click to start a focused session

1	CR-317	Update change impact assessment for Engineering Reason: Unblocks rollout comms and stakeholder planning	High impact	Start session →
2	B-067	Follow up on legal review of data handling addendum Reason: Legal approval is gating customer comms	High impact	Start session →
3	Q-142	Decide on sequencing: org-level vs team-level rollouts Reason: Needed to finalize rollout plan and timelines	Medium impact	Start session →
4	CR-289	Finalize enablement plan for ITSM pilot Reason: Pilot plan due this week	Medium impact	Start session →

Members (6)

Alex Morgan · Owner

Jamie Lee · Consultant

Priya Nair · Manager

Taylor Kim · Legal

Samir Patel · Analyst

Activity tail

09:15 · Jamie Lee updated CR-317 · Added impact summary for Eng

08:42 · Alex Morgan commented on Q-142 · Agree on sequencing question

Yesterday 16:55 · Priya Nair added RAT-124 · Phased rollout rationale

Yesterday 14:21 · Taylor Kim updated B-067 · Legal review in progress

The briefing is a projection over the project's event log. Nobody writes it. It is computed at session start by walking decisions, in-flight claims, open questions, blockers, and recent activity, then rendered as prose by the agent's first turn. The suggested-next list is ranked by impact and readiness from the same projection.
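A minimal sketch of the briefing as a fold over events. `decision.recorded` and `in_flight.claimed` appear elsewhere in this document. The other event type names (`in_flight.released`, `question.raised`, `question.resolved`, `blocker.raised`, `blocker.cleared`) and the payload fields are assumptions, and ranking and prose rendering are left out.

```python
def project_briefing(events: list) -> dict:
    """Fold the event log into the structured state the briefing renders from."""
    decisions, in_flight, questions, blockers = [], {}, {}, {}
    for e in events:
        kind, p = e["type"], e["payload"]
        if kind == "decision.recorded":
            decisions.append(p["title"])
        elif kind == "in_flight.claimed":
            in_flight[p["item"]] = e["actor"]
        elif kind == "in_flight.released":
            in_flight.pop(p["item"], None)
        elif kind == "question.raised":
            questions[p["id"]] = p["text"]
        elif kind == "question.resolved":
            questions.pop(p["id"], None)
        elif kind == "blocker.raised":
            blockers[p["id"]] = p["text"]
        elif kind == "blocker.cleared":
            blockers.pop(p["id"], None)
    return {"decisions": decisions, "in_flight": in_flight,
            "open_questions": questions, "blockers": blockers}


events = [
    {"type": "decision.recorded", "actor": "Priya Nair",
     "payload": {"title": "Adopt phased rollout for Enterprise Cloud orgs"}},
    {"type": "in_flight.claimed", "actor": "Alex Morgan", "payload": {"item": "CR-317"}},
    {"type": "in_flight.claimed", "actor": "Jamie Lee", "payload": {"item": "CR-289"}},
    {"type": "question.raised", "actor": "Jamie Lee",
     "payload": {"id": "Q-142", "text": "Org-level vs team-level sequencing?"}},
    {"type": "question.raised", "actor": "Priya Nair",
     "payload": {"id": "Q-100", "text": "Pilot cohort size?"}},
    {"type": "question.resolved", "actor": "Alex Morgan", "payload": {"id": "Q-100"}},
    {"type": "blocker.raised", "actor": "Taylor Kim",
     "payload": {"id": "B-067", "text": "Pending legal review"}},
]
briefing = project_briefing(events)
```

Running the fold twice over the same events gives the same structure, which is the sense in which the structured part of the briefing is deterministic; only the prose rendering in the agent's first turn involves the model.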

3. Asking the agent — the intervention

The interesting moment isn't when the agent succeeds. It's when the agent refuses to start. A user types "draft the launch email"; the agent sees that launch_email v3 exists, sees that another member has a parallel in-flight claim, and responds with a structured intervention before any tool fires. The user picks. The agent claims the in_flight slot and gets to work.

User · Alex Morgan

draft the launch email

09:26:11Z

Agent

BEFORE_I_DRAFT

Matching artifacts

artifact_id	version	last edited	similarity
launch_email	v3	2 days ago by Alex	0.91	view

Active in-flight

actor	intent	started	session ref
Jamie Lee	launch email v4	09:14Z (12m ago)	sess_87

What I'd recommend

09:26:12Z

User · Alex Morgan

iterate on v3

09:26:25Z

Agent

Claiming in_flight slot, loading launch_email v3…

✓ in_flight.claimed intent: "iterate on v3" · expected_outputs: [launch_email v4]

✓ get_latest_artifact(launch_email) → v3 (canonical, 2d ago)

✓ context loaded: 5 session claims + project memory + plan step 2

Proposing edits below. Open side-by-side diff →

09:26:27Z

Live to other members: Jamie Lee's session (sess_87) now sees "Alex claimed launch_email v4 — coordinate?" intervention if Jamie's next tool call would conflict.

Session context

Claims this session (5)

Target launch window Oct 6–12, 2026 · 09:05Z

Primary audience is internal employees (Global) · 09:07Z

Key message: simplified approvals + faster cycle times · 09:08Z

Tone: clear, confident, concise · 09:08Z

CTA: Read more in the Change Rollout hub · 09:09Z

Tools used

search_artifacts: 3

list_sessions: 1

read_artifact: 1

semantic_search: 1

Session

session_id: sess_91

started: 09:04:31Z

tokens: 2,772

The agent's first response to an artifact-creating request is a structured intervention, not a draft. Matching artifacts come from the semantic index over the project's canonical store; active in-flight rows come from the registry every session writes to when it starts substantive work. The recommendations are ranked, not enumerated arbitrarily — option 1 is the model's best guess at intent.
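The check that runs before an artifact-creating tool fires can be sketched as follows. Similarity scores are assumed to be precomputed by the semantic index, the 0.85 cutoff is arbitrary, and the record shapes are hypothetical; the sample values mirror the mockup above.

```python
def duplicate_check(artifact_id: str, actor: str, artifacts: list, in_flight: list,
                    min_similarity: float = 0.85) -> dict:
    """Decide whether to intervene before an artifact-creating tool runs."""
    matches = [a for a in artifacts
               if a["artifact_id"] == artifact_id and a["similarity"] >= min_similarity]
    # Another member's claim on the same artifact counts; the requester's own does not.
    claims = [c for c in in_flight
              if c["artifact_id"] == artifact_id and c["actor"] != actor]
    if matches or claims:
        return {"action": "intervene", "matching_artifacts": matches,
                "active_in_flight": claims}
    return {"action": "proceed"}


artifacts = [{"artifact_id": "launch_email", "version": "v3", "similarity": 0.91}]
in_flight = [{"artifact_id": "launch_email", "actor": "Jamie Lee", "session": "sess_87"}]
result = duplicate_check("launch_email", "Alex Morgan", artifacts, in_flight)
```

Here the result is an intervention carrying both the v3 match and Jamie Lee's claim, so the user chooses a path before any draft is produced. A request for an artifact with no matches and no competing claims proceeds directly.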

4. Your work — task drawer plus HITL inbox

Compounds (Part 1): reviewer accuracy — each HITL bubble-up the human resolves is a labeled disagreement-or-agreement event that tunes when the reviewer can triage on its own next time.

One unified surface for "what's waiting on you." Across every project the user belongs to: in-flight Jobs (your own and ones you're a member of), pending HITL approvals routed to your role, blocked items where you're the unblocker, and recently completed work. Same substrate emission, two materialized projections (jobs view + HITL queue), one inbox. Click any row to open its full context.

Filters

Pending HITL 3

Jobs you initiated 5

Jobs you're on 12

Background jobs 4

Completed last 7d 23

Project

Q3 Change Rollout 8

Pay-equity 2026 5

Pricing Model Refresh 3

ERP Carve-Out (paused) 1

NEEDS YOUR ATTENTION

HITL · approval Approve memory update: "Readiness threshold for NA-region rollouts = 78"

Raised by agent (sess_91) · Q3 Change Rollout · scope: project · evidence: 3 successful rollouts at this threshold

Approve Edit rationale Reject View evidence ↗ raised 8m ago

HITL · external auth Authorize SharePoint write to publish narrative_v4

Bubbled up from narrative_builder@v3 invocation · Q3 Change Rollout · dedup_key: auth::sharepoint — also requested by 1 other session, will resolve both

Authorize Deny raised 23m ago

HITL · clarification Engineering org only, or include adjacent functions?

Pay-equity 2026 · agent's first turn on a new dispatch · ambiguous scope

Send

IN-FLIGHT JOBS

Job	Project	Last activity	Status
Draft Q4 launch email (iterate on v3)	Q3 Change Rollout	just now · you	running
Pay-equity dispatch (3 AIEs in parallel)	Pay-equity 2026	12m ago · Maya	awaiting_clarification
Comms plan for Phase 2 rollout	Q3 Change Rollout	2h ago · Jamie	awaiting_review
Stakeholder mapping (parallel branch)	Pricing Model Refresh	1d ago · you	paused
Quarterly pay-equity drift scan	Pay-equity 2026	running · background	background

COMPLETED YESTERDAY

✓	FAQ generation for Q3 launch — published	Q3 Change Rollout	17:42Z
✓	Vendor analysis report — exec review complete	Pricing Model Refresh	14:08Z
✓	Org chart import + manager mapping	Pay-equity 2026	10:31Z

This week

⊕

jobs completed

⏱

median HITL response

⊟

duplicates avoided

HITL backlog

approvals 1

auth/integration 1

clarifications 1

deduped 4 → 3 (1 auth request shared across sessions)

Two materialized projections from one event log: jobs (sessions grouped into work containers) and HITL queue (bubble-ups routed by approver role with dedupe). Both projections re-render on every session.*, job.*, or hitl.* event. Dedup-by-key means two parallel sub-agents that both need SharePoint auth surface as one approval; both resolve when you authorize.
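The dedup-by-key behavior can be sketched as a fold over the event log. This is a minimal sketch under stated assumptions, not the shipped projector: the event type names `hitl.raised` and `hitl.resolved`, the dict shape, and the function name `project_hitl_queue` are hypothetical; only the `dedup_key` field and the `auth::sharepoint` value come from the mock above.

```python
def project_hitl_queue(events):
    """Fold hitl.* events into pending rows, one row per dedup_key."""
    rows = {}  # dicts preserve insertion order, so rows stay in raise order
    for event in events:
        key = event["dedup_key"]
        if event["type"] == "hitl.raised":  # hypothetical event name
            row = rows.setdefault(key, {"dedup_key": key, "sessions": []})
            row["sessions"].append(event["session_id"])
        elif event["type"] == "hitl.resolved":  # hypothetical event name
            rows.pop(key, None)  # one resolution clears every waiting session
    return list(rows.values())


events = [
    {"type": "hitl.raised", "dedup_key": "auth::sharepoint", "session_id": "sess_91"},
    {"type": "hitl.raised", "dedup_key": "auth::sharepoint", "session_id": "sess_87"},
]
pending = project_hitl_queue(events)
resolved = project_hitl_queue(
    events + [{"type": "hitl.resolved", "dedup_key": "auth::sharepoint"}]
)
```

Two raises with the same key collapse into one pending row that remembers both sessions, and a single resolution empties the row for both.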

5. Reviewing what the project has learned

Compounds (Part 1): skill library — accepted, edited, declined, and deferred proposals are the loop that grows the project's procedural knowledge under governance.

The skill review queue is where the project's accumulated way of working gets governed. The observation pipeline emits proposals with evidence; humans accept, edit, decline, or defer. Each proposal carries its evidence by reference: click into the events that generated it, see the actual tool calls, and judge whether the pattern is durably true or just three users coincidentally doing the same thing.

scope: project profile tenant

status: unvalidated experimental verified deprecated

sort: most evidence most recent least reviewed

Proposals (12)

check_finance_alignment

Verify proposed changes align with finance guidelines and budget guardrails.

project observed in 4 sessions

Evidence (3)

evt_8f3a7c1d	Alex	2026-05-19 14:32:11Z	Alex reviewed stakeholder draft v3 → emitted check_finance_alignment tool call	›
evt_3b9e2a44	Alex	2026-05-18 09:17:03Z	Alex compared forecast vs budget → emitted check_finance_alignment tool call	›
evt_6c1d0e91	Maya	2026-05-17 16:45:27Z	Maya validated Q2 plan → emitted check_finance_alignment tool call	›

Accept (A) · Edit (E) · Decline (D) · Defer (F) · Confidence: unvalidated experimental verified

summarize_decision_rationale

Summarize the key rationale and trade-offs behind a decision.

project extracted from trace

Evidence (3)

evt_a7d2e9b1	Maya	2026-05-19 11:08:54Z	Maya documented decision on vendor → emitted summarize_decision_rationale	›
evt_2d4f6c77	Alex	2026-05-18 10:22:31Z	Alex finalized design choice → emitted summarize_decision_rationale	›
evt_9e4b1f53	Sam	2026-05-17 13:55:42Z	Sam captured trade-offs → emitted summarize_decision_rationale	›

Accept (A) · Edit (E) · Decline (D) · Defer (F) · Confidence: unvalidated experimental verified

identify_stakeholders

Identify and list relevant stakeholders for the initiative.

project observed in 3 sessions

Queue stats

≡

pending

✓

reviewed today

⊕

auto-clustered

⏱

median review time

Recent decisions

prioritize_risks · verified

Accepted by Alex · 18m ago

draft_comms_update · experimental

Edited by Maya · 42m ago

assess_compliance_impact · verified

Accepted by Sam · 1h ago

build_rollout_plan · deferred

Deferred by Alex · 2h ago

Auto-clustering

8 proposals grouped into 3 clusters

Review clusters →

Each proposal opens with its scope, evidence count, and origin (observed vs extracted vs mined). The confidence ladder shows where the proposal sits in the maturation arc. Accept commits at the current confidence; Edit lets you adjust the procedural description before commit; Decline retains the evidence so the same pattern can't regenerate a duplicate next week; Defer parks it for later review.
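One way to picture why Decline keeps its evidence: a declined pattern stays on record, so a later proposal for the same pattern is absorbed instead of re-queued. The sketch below is illustrative only. The `ReviewQueue` class, its method names, and matching patterns by skill name are all assumptions; the source does not say how duplicate patterns are recognized.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    pending: dict = field(default_factory=dict)   # skill name -> evidence event ids
    declined: dict = field(default_factory=dict)  # evidence retained after a decline
    accepted: dict = field(default_factory=dict)  # skill name -> confidence at commit

    def propose(self, name, evidence):
        """Queue a proposal unless the same pattern was already declined."""
        if name in self.declined:
            kept = self.declined[name]
            kept.extend(e for e in evidence if e not in kept)
            return False  # absorbed, not re-queued
        self.pending.setdefault(name, []).extend(evidence)
        return True

    def accept(self, name, confidence="unvalidated"):
        """Commit the proposal at its current confidence."""
        self.pending.pop(name)
        self.accepted[name] = confidence

    def decline(self, name):
        """Remove from the queue but keep the evidence on record."""
        self.declined[name] = self.pending.pop(name)


queue = ReviewQueue()
first = queue.propose("check_finance_alignment", ["evt_8f3a7c1d", "evt_3b9e2a44"])
queue.decline("check_finance_alignment")
second = queue.propose("check_finance_alignment", ["evt_6c1d0e91"])
```

The second `propose` call returns `False`: the new evidence is attached to the declined record and nothing reappears in the pending queue.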

6. Investigating why a decision was made

Compounds (Part 1): access denial alternatives — the same causal-chain primitive that renders this audit walk is what lets the agent learn which fallback path worked the last time permissions blocked a primary one.

Audit isn't a feature in this system; it is the storage. Every event names the event that caused it, and the question "why did this happen" is a graph walk. The user picks a decision (or an artifact, or any committed event), and the surface renders the causal chain backward through time (session start, tool requests, patches proposed, patches applied, the decision recorded), each row with actor, timestamp, and a one-line summary, the selected event highlighted, its full payload in JSON on the right.

Filters

Actor

Emma Chen 3

Arjun Kumar 1

Sam Rivera 1

Event type

session.started 1

tool.requested 1

patch.proposed 1

patch.applied 1

decision.recorded 1

5 events in this chain

1 session · 5m 24s duration

2026-05-29 10:14:02 → 10:19:26

View full trace graph →

Causal chain

View: chronological

10:14:02.123

05/29/2026 UTC

session.started EC Emma Chen

Session started: Q3 Change Rollout investigation

⌄

caused_by: — (root)

10:14:15.487

05/29/2026 UTC

tool.requested EC Emma Chen

Requested tool: policy_context.fetch

⌄

caused_by: a3c7f6b2-8d1a-4b70-9e6e-1f8b2c9e7a11

10:15:02.731

05/29/2026 UTC

patch.proposed AK Arjun Kumar

Proposed patch: deprioritize custom provisioning

⌄

caused_by: 4b9e1d0a-b6c3-4c70-9f3d-2a6c1e5d7b22

10:16:48.905

05/29/2026 UTC

patch.applied AK Arjun Kumar

Applied patch: deprioritize custom provisioning

⌄

caused_by: 7d2e4f9c-3a5b-4e69-8d8d-9b1f2d7a6c33

10:19:26.341

05/29/2026 UTC

decision.recorded SR Sam Rivera

Recorded decision: deprioritize custom provisioning

⌃

caused_by (parent → child) currently selected

Event details

event_id: e1d9f2a7…6c91

Payload (JSON) View raw </>

{
  "event_id": "e1d9f2a7-5c72-4d2b-9d15-2b6e4f0e6c91",
  "event_type": "decision.recorded",
  "timestamp": "2026-05-29T10:19:26.341Z",
  "actor": {
    "actor_id": "usr_7f3b9c2d",
    "name": "Sam Rivera",
    "email": "sam.rivera@acme.com",
    "role": "compliance_analyst"
  },
  "project": {
    "project_id": "proj_q3cr",
    "name": "Q3 Change Rollout"
  },
  "decision": {
    "decision_id": "dec_91baf6f1",
    "title": "deprioritize custom provisioning",
    "outcome": "approved",
    "rationale": "Reduces maintenance overhead and risk. Custom workflows will continue to be supported",
    "confidence": 0.78
  }
}

Appears in

artifact	type	reference
RAT-112	Decision	/CR-GH-NA/RAT-112 ↗
dec_91baf6f1	Decision Record	/decisions/dec_91baf6f1 ↗
run_83a7c1d4	Agent Run	/runs/run_83a7c1d4 ↗
policy_v3.4.1	Policy Version	/policies/v3.4.1 ↗

Actions

⑂ Fork from here

Create new branch at this event

↻ Replay from here

Re-execute forward from this event

Forking or replaying will create a new trace branch.

Every event names its cause. Walking the chain backward from a decision shows exactly which session, which tool calls, which proposed patches led to it. Fork creates a counterfactual branch from any point; replay re-executes forward against the same event sequence to detect non-determinism. The same surface serves compliance auditors, debugging agent engineers, and curious power users.
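The backward walk is a plain linked-list traversal over `caused_by`. The sketch below assumes events are held in a dict keyed by event id and that a root event has `caused_by` set to `None`; the function name `walk_causal_chain` and the short ids `e1` to `e5` are hypothetical stand-ins for the UUIDs in the mock.

```python
def walk_causal_chain(events_by_id, event_id):
    """Follow caused_by links back to the root, then return the chain root-first."""
    chain = []
    seen = set()
    current = event_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in caused_by at {current}")
        seen.add(current)
        event = events_by_id[current]
        chain.append(event)
        current = event.get("caused_by")
    chain.reverse()
    return chain


events_by_id = {
    "e1": {"id": "e1", "type": "session.started", "caused_by": None},
    "e2": {"id": "e2", "type": "tool.requested", "caused_by": "e1"},
    "e3": {"id": "e3", "type": "patch.proposed", "caused_by": "e2"},
    "e4": {"id": "e4", "type": "patch.applied", "caused_by": "e3"},
    "e5": {"id": "e5", "type": "decision.recorded", "caused_by": "e4"},
}
chain_types = [e["type"] for e in walk_causal_chain(events_by_id, "e5")]
```

Starting from the decision, the walk yields the same five rows the surface renders, in chronological order.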

7. Setting up an orchestrator — the builder wizard

A one-time builder flow rather than a 9-tab settings panel. The orchestrator is an AI Employee of type=orchestrator, and most of what a Phase 1 builder needs to configure is the same four things every AIE has: identity, capabilities, knowledge, governance. Four steps with sane defaults; the full event-sourced config edit lives behind a separate "advanced settings" view (also rendered, also event-sourced, intentionally not shown for the common path).

✓ Identity → ✓ Capabilities → 3 Knowledge & named I/O → 4 Governance

~5 minutes

Step 3 — Knowledge & named I/O

What does this orchestrator read at session start, and what does it return as structured output? Default values are derived from the Champ Frontier App's document-store schema.

Named runtime inputs — pulled into the prompt every session

required plan approved plan doc from the Frontier App's plan store; orchestrator reads but doesn't write remove

optional program_memory project-scope memory; orchestrator reads and proposes updates via review queue remove

optional docs_folder all artifacts in scope; orchestrator reads via get_latest_artifact remove

+ Add named input Frontier App's doc-store schema parses automatically

Skill folders — loaded into the system prompt by scope

project champ/skills/orchestration/ 12 procedural skills (load order, AIE selection rules, comms tone) edit

tenant kpmg/firm_voice/ 3 firm-wide voice and compliance skills edit

+ Add skill folder Project-scope skills auto-populate from the observation pipeline (post-Phase 2)

Named outputs — structured returns the FA expects

text free-form response shown in chat

generated_doc[] typed docs: title + type enum (narrative, comms, faq, dashboard, report) + content/link · routed to the docs folder

← Back: Capabilities Advanced settings (event log) ↗ Next: Governance →

What this becomes

// emitted as AIE config events
type: "orchestrator"
name: "Champ"
named_inputs: [plan*, program_memory, docs_folder]
named_outputs: [text, generated_doc[]]
skill_folders: [project, tenant]
tools: auto-nominate + 3 standard
version: v0.1 (draft)

Every change is an event in this AIE's log

12 events so far in this session · diff against base@v1.18

After publishing

Every session that loads this orchestrator sees v0.1

Audit / metrics / permissions free via existing AIE infra

Roll back via "replay to event N"

Four steps, sane defaults, an "advanced settings" door for the rare power user. The wizard's right rail shows what the chosen settings will compile to as AIE configuration events — every change is event-sourced, so diffing, forking, and rolling back this orchestrator is the same substrate primitive used everywhere else. A Champ-shaped AIE is published in under five minutes; further refinement happens in advanced settings post-launch.
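Because the config is a fold over events, "replay to event N" is just the same fold stopped early. A minimal sketch, assuming a hypothetical event shape of `{"op", "key", "value"}` with `set` and `unset` operations; the real AIE config event schema is not shown in this document.

```python
from typing import Optional


def replay_config(events, upto: Optional[int] = None):
    """Fold config events into a config dict; `upto` keeps only the first N events."""
    config = {}
    for event in events[:upto]:
        if event["op"] == "set":
            config[event["key"]] = event["value"]
        elif event["op"] == "unset":
            config.pop(event["key"], None)
    return config


config_events = [
    {"op": "set", "key": "type", "value": "orchestrator"},
    {"op": "set", "key": "name", "value": "Champ"},
    {"op": "set", "key": "named_inputs", "value": ["plan", "program_memory", "docs_folder"]},
    {"op": "set", "key": "named_inputs", "value": ["plan"]},  # a later, unwanted edit
]
current = replay_config(config_events)
rolled_back = replay_config(config_events, upto=3)
```

Rolling back does not delete the fourth event; it only rebuilds the state as of event 3, which is why diffing and forking can reuse the same primitive.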

8. Closing the loop — the operator console

Compounds (Part 1): cost ceilings, knowledge-gap closures, and reviewer accuracy — three of the six rows have their trend, their budget, and their regression catch rendered as panels of this single console.

Most of the seven surfaces above serve people doing the work: the consultant briefed on entry (Surface 2), the analyst caught by an anti-duplication intervention (Surface 3), the manager approving an HITL bubble-up (Surface 4), the compliance reviewer walking back a decision (Surface 6). A few serve people who configure the system: the admin generating a plan (Surface 1), the curator triaging the skill review queue (Surface 5), the builder publishing an orchestrator (Surface 7). The operator console is for the person whose job is the loop itself: the orchestrator owner, the platform engineer, the FinOps lead who lives inside the system every day. Four panels over one substrate, one for each signal the question "is this thing actually compounding?" needs.

Each panel is a projection over the same event log; the console is four queries with their own UI, not four systems. Knowledge gaps come from the observation pipeline subscribing to low-confidence outputs and explicit don't-knows. Cost vs budget rolls up the per-event token and cost metadata that ships in Phase 1. Replay regression results come from the deploy-gate flow (Part 4, fourth replay mode). The principal chain comes from walking caused_by across every decision the operator filters as high-risk.

Without this surface, every claim in Part 1 ("projects compound") is something the operator has to trust. With it, every one of the six compounding things has a place where its trend is visible, its cost is bounded, its regressions are caught before deploy, and its decisions are walkable. The console is where the compounding claim stops being a promise and becomes a measurable surface.

Scope

project

profile (Champ)

tenant (KPMG)

Time range

7 days

30 days

90 days

custom…

Jump to

Knowledge gaps (19)

Cost vs budget

Replay regression (3)

Principal chain (12)

Action this period

Gaps closed: 7 / 19

Deploys gated: 3 (2 pass)

Chains reviewed: 12

Knowledge gaps — last 30 days (19, sorted by frequency)

View all →

freq	Question the agent couldn't confidently answer	best conf.	likely source	authoring affordance
8	"What's our standard SLA for legal sign-off on customer comms?"	0.42	missing memory	Author memory →
5	"How do we handle a region-specific data-residency exception?"	0.31	missing skill	Propose skill →
3	"Is the GitHub Enterprise SKU stable for >5k seats this quarter?"	0.55	missing data source	Request connector →
3	"Which AIE handles localized FAQ generation for non-NA regions?"	0.20	missing tool	Add AIE →
2	"What's the recommended deprecation timeline for legacy AzureAD groups?"	0.38	missing memory	Defer

Cost vs budget — this period

$4,620 / $7,200 spent · 64% · 11 days remaining

Top spenders this period	jobs	cost	% of budget	enforcement
KPMG / Champ	142	$3,210	64% of $5,000	1 job auto-rerouted
KPMG / PPM	38	$980	49% of $2,000	—
KPMG / Risk Advisory	12	$430	21% of $2,000	—

Replay regression — recent runs (3 candidates queued)

View all →

run	candidate vs base	sessions	divergence	deploy gate
r_8821	Champ orchestrator @v1.5 vs @v1.4	312 prod	0.7% decisions differ	pass
r_8807	model: Sonnet 4.7 vs Sonnet 4.6	312 prod	4.2% decisions differ (cost −18%)	review
r_8791	skill folder rev vs base	312 prod	0.0% — no semantic change	pass

The 4.2% divergence on r_8807 is the operator's call: the replay surface (fork + replay from Surface 6) renders side-by-side decision diffs, and the deploy is held until accept or reject. That hold is the reproducible-test posture the deploy-gate replay is for.
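The divergence figure can be read as the share of replayed sessions whose decision changes under the candidate. The sketch below is an assumption-laden illustration: the 1% review threshold, the function names, and the representation of a decision as a string are all hypothetical, since the document does not state how the gate decides between pass and review.

```python
def divergence(base, candidate):
    """Fraction of replayed sessions whose decision differs between base and candidate."""
    if len(base) != len(candidate):
        raise ValueError("replay must cover the same sessions on both sides")
    if not base:
        return 0.0
    differing = sum(1 for b, c in zip(base, candidate) if b != c)
    return differing / len(base)


def deploy_gate(rate, review_threshold=0.01):
    """Hypothetical policy: small divergence passes, anything larger waits for a human."""
    return "pass" if rate < review_threshold else "review"


base_decisions = ["approve"] * 312
candidate_decisions = ["reject"] * 13 + ["approve"] * 299
rate = divergence(base_decisions, candidate_decisions)
verdict = deploy_gate(rate)
```

With 13 of 312 sessions differing, the rate rounds to 4.2% and the hypothetical gate returns `review`, leaving the accept or reject call with the operator.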

Principal chain — recent high-risk decisions (filtered: irreversible · external write · regulated)

View all →

Published narrative_v4 → SharePoint May 25 14:18Z · sess_91 · CR-GH-NA

A Alex Morgan → Champ orchestrator → narrative_builder@v3 → A2A: SharePoint chain signed ✓ inspect →

Approved shared-scope memory update: readiness threshold = 78 May 24 11:07Z · sess_88 · CR-GH-NA

A Alex Morgan → Champ orchestrator → propose_shared_update → project-admin HITL chain signed ✓ inspect →

Dispatched pay-equity audit (3 sub-AIEs in parallel) May 23 09:14Z · sess_72 · Pay-equity 2026

M Maya Patel → HR Suite orchestrator → pay_equity_audit + 3 sub-AIEs → Workday read · admin-scoped chain signed ✓ inspect →

This period (30d)

gaps surfaced

✓

gaps closed (37%)

64%

of $7.2k budget

↻

deploys gated

⊙

chains walked

Gap close rate, by week

Each authored fix shrinks next week's gap surface. The slope is what compounding looks like, instrumented.

Auto-action this week

1 job killed at ceiling

3 jobs rerouted to cheaper tier

0 sessions paused for approval

12 HITL preemptions audited

Four projections, one event log. Knowledge gaps come from low-confidence outputs the observation pipeline already watches. Cost vs budget is the per-event metadata rolled up by Profile, with enforcement events (kill / reroute / pause) overlaid on the burn line. Replay regression runs the deploy-gate flow against last quarter's prod sessions and surfaces the diff. The principal chain on each high-risk decision walks caused_by across the orchestrator, its sub-AIEs, and any A2A handoff. Together the four panels are the operator's instrument for the six compounding things in Part 1.
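The cost panel is the simplest of the four projections to sketch: sum per-event cost metadata by Profile and compare against a budget. The field names `profile` and `cost_usd`, the function name, and the budget dict are assumptions for illustration; the figures reuse two rows from the mock table above.

```python
from collections import defaultdict


def roll_up_cost(events, budgets):
    """Sum per-event cost by profile and report spend as a share of that profile's budget."""
    spent = defaultdict(float)
    for event in events:
        spent[event["profile"]] += event.get("cost_usd", 0.0)
    return {
        profile: {
            "spent": round(total, 2),
            "pct_of_budget": round(100 * total / budgets[profile]),
        }
        for profile, total in spent.items()
    }


cost_events = [
    {"profile": "KPMG / Champ", "cost_usd": 3000.0},
    {"profile": "KPMG / Champ", "cost_usd": 210.0},
    {"profile": "KPMG / PPM", "cost_usd": 980.0},
]
report = roll_up_cost(cost_events, {"KPMG / Champ": 5000.0, "KPMG / PPM": 2000.0})
```

Enforcement events (kill, reroute, pause) would be a second pass over the same log, overlaid on this rollup rather than stored separately.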

Part 7 — what we built: the runtime console, in code

The eight surfaces above are forward-looking — what the system will look like to the people who interact with it. This section is the present-tense companion: what shipped on the implementation branch behind PR #429, what's clickable today on localhost:8420, and how every claim in Parts 1–5 became a test the operator can drive against a live substrate. The pieces are real but the surface is intentionally engineering-grade: a single-page operator console at /runtime_console that exposes every projection, hook, and tool from the prior sections without polishing them into product UI yet. The point isn't to ship the same console to consultants — the point is to make every substrate primitive a clickable test.

A note on the phase labels you'll see below. Part 5 named the buildout in three phases. What actually landed is Phase 1B — Part 5's Phase 1 surface as-built, with five spec gaps closed — plus two Phase 2 primitives pulled forward (artifact lifecycle, shared-update review queue) and a Tier 2/3 verification layer (causal-chain LLM provenance splice, hermetic LLM recording/replay, session fork + structural diff, HITL resolution loop) that landed in the same branch because the cost of deferring them was higher than the cost of finishing them. "Phase 1B" in this section is shorthand for that combined surface; the implementation-status table at the end accounts for every primitive Part 5 listed against the current reality.

From mocks to merged

Across the Phase 1B surface, the pulled-forward Phase 2 primitives, and the Tier 2/3 verification work, what's live in the branch:

An event-sourced log carrying 26 event types, each emitted at exactly one site: session lifecycle, per-turn, briefing, tool-execute, in-flight, artifact lifecycle, decision, HITL ask + resolution, shared-update propose-approve.
Two projectors: project_state (the briefing-and-slice surface) and in_flight (the duplicate-check registry).
Profiles — versioned, project-scoped, with required named_inputs and a declared tool roster.
The docs folder — read at session-entry and snapshot-attached to the runtime context.
The briefing turn — slice → LLM → prose injected into the agent's system prompt before the user's first turn.
The tool-execute hook with three layered gates: briefing-first (refuse operational tools until the briefing lands), duplicate-check (artifact-creating tools route through the in-flight registry), then run.
Six built-in tools that let the agent both introspect and mutate the substrate from inside a turn: one read-side (read_project_state) and five write-side (record_decision, emit_artifact, review_artifact, publish_artifact, propose_shared_update).
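The layered ordering of the tool-execute hook can be sketched as early returns. This is a simplified illustration, not the branch's code: the function signature, the session dict, the return shapes, and the set of artifact-creating tools are assumptions, and the sketch treats every tool as operational for the briefing-first check.

```python
ARTIFACT_CREATING = {"emit_artifact"}  # assumption: which tools count as artifact-creating


def tool_execute_hook(tool, args, session, in_flight):
    """Apply the layers in order: briefing-first, duplicate-check, then run."""
    if not session["briefing_landed"]:
        return {"status": "refused", "reason": "briefing-first"}
    if tool in ARTIFACT_CREATING:
        holder = in_flight.get(args["name"])
        if holder is not None and holder != session["session_id"]:
            return {"status": "intervention", "reason": "duplicate-check", "held_by": holder}
        in_flight[args["name"]] = session["session_id"]  # claim before running
    return {"status": "ran", "tool": tool}


in_flight = {}
alex = {"session_id": "sess_91", "briefing_landed": False}
jamie = {"session_id": "sess_87", "briefing_landed": True}

too_early = tool_execute_hook("emit_artifact", {"name": "launch_email"}, alex, in_flight)
alex["briefing_landed"] = True
claimed = tool_execute_hook("emit_artifact", {"name": "launch_email"}, alex, in_flight)
blocked = tool_execute_hook("emit_artifact", {"name": "launch_email"}, jamie, in_flight)
```

The first call is refused because no briefing has landed, the second claims the artifact and runs, and the third surfaces an intervention naming the session that holds the claim.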

The branch ships with two seeded profiles (Champ at kpmg, designing-llm-router at ema) and a project hydration script that lands 21 events plus 2 canonical docs into proj_llm_router so the slice already feels like an in-flight engineering program from session 1.

Where Part 6's mocks render the FA-facing surfaces (start a program, get briefed, investigate a decision, set up an orchestrator), the runtime console is the engineering surface underneath them. It's where you see the briefing actually land, the duplicate-check actually fire, the slice actually update mid-turn — all running against the live event log in Postgres, not against the simulated screenshots above. It's also the surface that makes the qualitative tests of the substrate's value props clickable.

The runtime console at `/runtime_console`

One page. Four tabs (Profiles · Sessions · Project state · Events) over a three-column workbench. The whole console deliberately reuses the operator-mock aesthetic — paper background, warm-orange accent, square borders, Geist Mono throughout — because the substrate is what it visualizes, not Material Design. The chat surface is a typed block model; every SSE event the runner emits has a block variant, so a refresh re-paints the conversation exactly as streaming did. The right rail's KPIs increment from SSE deltas during a turn (no waiting for tab reloads), and the project-state tab auto-refreshes (debounced 500ms) on every state-affecting tool event. A bearer-token field at the top, a color-keyed user-switcher pill next to it, a project picker populated by scanning the event log for distinct project ids in the tenant. Below: seed buttons, scenario chips, the briefing's suggested_next rendered as one-click starter prompts.

Profiles

Sessions

Project state

Events

auto-refresh: on (3s)

Quick actions

Seed Champ profile (kpmg)

Seed router profile (ema)

Seed router project

New session

Recent sessions

3058b5cc0a completed

19b93199b2 completed

aa4f45f5e1 awaiting_human

▸ System prompt (live) chars: 2,847 briefing inserted ✓

briefing

Router project sits between cascade v2 landing and v3 design. Taxonomy axis enums have been ratified (intent × complexity × domain) per dec_seed_001 and dec_seed_002. Cascade v2 is implemented in router/cascade_v2.py and passed offline replay (dec_seed_003). Next question on the table is the live grader model and sampling rate for cascade-v3 rollout. Router Lead has been the only active member this week. The single open question is evt_seed_q1 (grader sampling rate).

suggested next: iterate → dec_seed_003 — land cascade v2 · answer_open_question → evt_seed_q1 — decide on grader model

14:23:04

you

Summarize where the project stands. Cite specific recent decisions and any open questions by id.

14:23:08

tool

→ read_project_state ok

fields: ["recent_decisions", "recent_questions"]

{"recent_decisions": [6 entries], "recent_questions": [{"request_id": "req_seed_q1", "question": "Should the live grader sample at 5% or 10% during the cascade-v3 rollout?"}]}

14:23:10

decision

Cascade v3 rollout: grader sample at 10% for first week, drop to 5%

by Soham Shah usr_router_lead just now

10% sampling tightens the CI on the first 1k production traces enough to catch regressions early; cost overhead is ~$1.5k for that week. Drop to standard 5% after.

alternatives considered: 5% throughout (slower CI) · 20% throughout (cost prohibitive)

14:23:14

agent

The project is mid-stride on cascade v3. Six decisions ratified across taxonomy, cascade lineage, synthetic prompts, and grader choice — most recent is dec_seed_006 (cascade v3 design proceed). The single open question (evt_seed_q1) is grader sampling rate; I've just landed a decision on that. Active claim on router/cascade_v3.py is yours.

3 iter · 12,840 in · 320 out · 1,800 cached · 4.2s

14:23:18

Activity (24h)

This scope

27 events recorded

2 briefings landed

1 claim active

7 decisions

The Sessions tab mid-turn. Above the chat, a collapsed system-prompt panel surfaces a green briefing inserted ✓ tag — the substrate's headline check that the briefing actually landed in the agent's context, not just in the audit log. Below, the chat is a typed block model: a green briefing card lands at the top (discovered post-turn from the latest briefing.generated event), then user input, then a tool-call card for read_project_state with structured args + result, then a green decision card (rendered above the generic tool block because the decision is the headline; the tool call is the audit detail), then the agent's prose reply with a per-turn cost/latency footer pulled from llm.completed events. The right rail's sparkline is the project's 24-hour activity profile (1-hour bins); the KPIs increment live from SSE events without waiting for a tab refresh. Auto-refresh of the Project state tab fires on every state-changing tool call (debounced 500ms) so the slice updates while you watch.

The chat is the substrate, typed

Every block variant on the chat surface corresponds to an event type. The block model means a refresh re-paints the conversation exactly as streaming did, because both flow from the same source: the events emitted by the runner are matched against block constructors keyed by the SSE event name. There are ten block variants today.

Block	Trigger	What it shows
`blk-user`	user-message send	The operator's typed input, monospace, accent label.
`blk-agent`	`text_delta` SSE	Streaming agent text painted char-by-char via a 3-char-per-16ms typewriter buffer; finalizes to a rendered markdown block on `tool_executing` or `done`.
`blk-thinking`	`thinking_delta` SSE	Extended-thinking output, collapsed details, mono italic.
`blk-tool`	`tool_executing` SSE	Tool name + args JSON + pulsing "running" pill; result fills in on `tool_result`, truncates at 1200 chars with show-more toggle.
`blk-decision`	`tool_result` for `record_decision`	Green-bordered card with title + actor + rationale + alternatives list. Rendered above the generic tool block — decision is the headline; tool is the audit detail.
`blk-briefing`	Discovered post-turn	Green-bordered card with the briefing's prose preview + suggested_next refs (each ref is a click-to-jump link into the Events tab).
`blk-hitl`	`approval_required` / `question` SSE	Orange-bordered approval card with Approve / Reject buttons or an answer input; POSTs to `/api/sessions/{id}/respond`; marks resolved with a tag after response. Recoverable across refresh via `/turn-state`.
`blk-error`	SSE `error` with data	Red-bordered banner.
`blk-system`	Substrate annotations (Redis warnings, infra checks)	Monospace dim, system-level notes.
`blk-subagent`	`subagent_*` SSE	Sub-agent activity in its own accent-soft border.
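The table above amounts to a lookup from SSE event name to block variant, which is why streaming and refresh can share one code path. A minimal sketch in Python (the console itself is a web page, so this is a model of the idea, not its code): the `paint` function, the event dict shape, and the `subagent_started` event name are assumptions, and variants that key off tool results (`blk-decision`) or post-turn discovery (`blk-briefing`) are left out.

```python
BLOCK_FOR_EVENT = {
    "text_delta": "blk-agent",
    "thinking_delta": "blk-thinking",
    "tool_executing": "blk-tool",
    "approval_required": "blk-hitl",
    "question": "blk-hitl",
    "error": "blk-error",
}
STREAMING_KINDS = ("blk-agent", "blk-thinking")


def paint(events):
    """Build the block list from events; consecutive deltas of one kind merge into one block."""
    blocks = []
    for event in events:
        name = event["event"]
        kind = "blk-subagent" if name.startswith("subagent_") else BLOCK_FOR_EVENT.get(name)
        if kind is None:
            continue  # event names this sketch does not model are skipped
        if kind in STREAMING_KINDS and blocks and blocks[-1]["kind"] == kind:
            blocks[-1]["text"] += event["data"]
        else:
            blocks.append({"kind": kind, "text": event["data"]})
    return blocks


sse_events = [
    {"event": "thinking_delta", "data": "check slice"},
    {"event": "text_delta", "data": "The project "},
    {"event": "text_delta", "data": "is mid-stride."},
    {"event": "tool_executing", "data": "read_project_state"},
    {"event": "subagent_started", "data": "grader"},
]
blocks = paint(sse_events)
```

Feeding the stored events through `paint` after a refresh produces the same blocks as feeding them in while streaming, since the function depends only on the event sequence.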

Above the composer, two strips of one-click chips: the latest briefing's suggested_next rendered as accent-colored chips (clicking drafts the operator's prompt to act on a substrate-recommended move), plus per-profile scenario templates covering the headline tests for that profile (six for designing-llm-router: status report, propose cascade v4, record a grader decision, audit recent file activity, test duplicate-check, identify blockers). The cross-session memory loop closes here: the briefing recommends a move; the operator clicks once; the agent executes. The substrate's recommendation becomes operator action without anyone typing.

read_project_state — the substrate, queryable

Until this work the substrate flowed into the agent only one-way: the briefing turn read the slice, the LLM wrote prose, the prose got appended to the system prompt. The agent never queried the substrate during a turn. That was the brittle bit — the agent had to trust the briefing's summary and couldn't verify a claim like "did we actually settle on opus as the grader?" without making it up. read_project_state is a built-in tool registered automatically on every substrate-enabled session that returns the live project-state slice JSON — same shape the briefing reads, materialized fresh on every call.

→ Tool signature auto-registered on every substrate-enabled session

read_project_state(
  fields:              list[str]  # optional — restrict to a subset:
                                  #   recent_decisions, recent_questions,
                                  #   recent_file_reads, recent_file_writes,
                                  #   recent_briefings, recent_interventions,
                                  #   in_flight_claim_event_ids, members_active_this_week,
                                  #   canonical_artifacts, open_questions,
                                  #   blockers, suggested_next
  max_items_per_list:  int        # cap each list, most-recent kept (default 20)
)
→ JSON {
    project_id, slice_hash,
    recent_decisions: [
      { decision_id, title, rationale_ref, decided_by_actor_id, decided_at, alternatives },
      ...
    ],
    recent_file_reads:  [{ path, by_actor_id, at, op }, ...],
    recent_questions:   [{ request_id, question, asked_by_actor_id, asked_at }, ...],
    ...
  }

The tool's schema. fields lets the agent scope the response to what it actually needs (saves tokens — most calls only want recent_decisions or recent_questions); max_items_per_list caps each list field, most-recent kept. The profile's base_instructions tells the agent to call this tool before any non-trivial routing/grading/taxonomy decision — verify what the briefing said, cite the actual decision_id in the response, look up prior conflicting decisions before recording a new one. The substrate becomes alive in the agent's reasoning, not a passive context appendage.
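The effect of `fields` and `max_items_per_list` can be sketched as a filter over an already materialized slice. The helper name `scope_slice`, the choice to always return `project_id` and `slice_hash`, and the oldest-first list ordering are assumptions; the source only says the most recent items are kept.

```python
from typing import Optional

ALWAYS_RETURNED = ("project_id", "slice_hash")


def scope_slice(slice_, fields: Optional[list] = None, max_items_per_list: int = 20):
    """Restrict a project-state slice to the requested fields and cap each list."""
    wanted = set(fields) if fields else set(slice_)
    out = {key: slice_[key] for key in ALWAYS_RETURNED if key in slice_}
    for key, value in slice_.items():
        if key in ALWAYS_RETURNED or key not in wanted:
            continue
        if isinstance(value, list):
            # Assumption: lists are stored oldest-first, so the tail is most recent.
            value = value[-max_items_per_list:] if max_items_per_list > 0 else []
        out[key] = value
    return out


full_slice = {
    "project_id": "proj_llm_router",
    "slice_hash": "abc123",  # placeholder value
    "recent_decisions": [f"dec_seed_00{i}" for i in range(1, 7)],
    "recent_questions": ["req_seed_q1"],
}
scoped = scope_slice(full_slice, fields=["recent_decisions"], max_items_per_list=2)
```

Asking only for `recent_decisions` with a cap of two returns the two latest decision ids and drops `recent_questions` entirely, which is where the token saving comes from.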

The combination of briefing-injection + read_project_state is what makes cross-session memory bidirectional. Briefing flows the project's history into the agent at session-start; the tool lets the agent flow questions back into the substrate during the turn. Both depend on the same projector; both materialize the same slice. There's no second source of truth to drift.

Four write-side tools — the substrate as a destination

Reading the slice is half the loop. The other half is letting the agent commit work-products into the substrate in a shape the next session will load. record_decision covered the lightweight case from day one — title, rationale, alternatives, a single event. The Phase 2 work added four more write-side tools that take a heavier-weight artifact through a state machine without inventing a separate documents subsystem.

Tool	Event(s) emitted	What it does
`emit_artifact`	`artifact.created`	Writes a versioned doc into the project's docs folder. Doc-type allowlist: `narrative` / `comms` / `faq` / `dashboard` / `report` / `html` / `other`. Status starts at `draft`; subsequent calls with the same `name` bump the version, never overwrite.
`review_artifact`	`artifact.reviewed`	Operations: `open` (draft → under_review), `approve` (under_review → approved), `reject` (under_review → draft, with reviewer note). Refuses transitions the state machine doesn't allow.
`publish_artifact`	`artifact.published` + auto-emit `artifact.deprecated` on the prior published version	Approved → published. The partial-unique-index on `(project_id, name, status='published')` guarantees one published version per name; the auto-demote keeps the invariant when a newer version graduates.
`propose_shared_update`	`shared_update.proposed`	Project-local agent surfaces a candidate tenant-wide policy or learning (e.g., "use opus as default grader across all router projects"). Lands as a proposal in the right-rail review queue; tenant-memory write happens only after operator `approve`.

The four tools share a common shape: every one is auto-registered on every substrate-enabled session, every one emits exactly one event per call (plus the auto-deprecate on publish), every one validates state machine transitions before writing, and every one threads caused_by back to the calling iteration's prompt.built. The audit chain walks across tool boundaries without special-casing.
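The transition check the lifecycle tools perform can be sketched as a lookup table. The statuses and the `open`, `approve`, and `reject` operations come from the table above; the `publish` operation label, the `transition` function, and the use of `ValueError` for refusals are assumptions made for this sketch.

```python
ALLOWED = {
    ("draft", "open"): "under_review",
    ("under_review", "approve"): "approved",
    ("under_review", "reject"): "draft",
    ("approved", "publish"): "published",
}


def transition(status, operation):
    """Return the next status, or raise if the state machine does not allow the move."""
    try:
        return ALLOWED[(status, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation} an artifact in status {status!r}") from None


status = "draft"
for operation in ("open", "approve", "publish"):
    status = transition(status, operation)
```

After the three operations the status is `published`; trying to publish a draft directly raises instead of writing an event.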

Artifact lifecycle, walked

The lifecycle isn't a separate microservice — it's three event types and a partial unique index on the docs table. The runtime console exposes it as status pills on every artifact in the Project state tab, click-through to a detail card with the version history, and a compare picker that renders colorized unified-text diff plus a JSON-level added/removed/changed key delta. A typical run: agent calls emit_artifact, then review_artifact(open), then review_artifact(approve), then publish_artifact. Four tool calls, four events on the chain, four status pills in the UI.

Docs · cascade-v3 rollout one-pager · 3 versions

v1 · draft (emit_artifact, Soham · 14:08) → v1 · under_review (review_artifact(open), Maya · 14:12) → v1 · approved (review_artifact(approve), Maya · 14:14) → v1 · published (publish_artifact, Soham · 14:17)

Compare v1 ↔ v2 — text diff

@@ -3,4 +3,5 @@ cascade-v3 one-pager · §2 rollout

Cascade v3 escalates from cheap to expensive routes per-token.

- Live grader: claude-opus, sample at 5% throughout rollout.

+ Live grader: claude-opus, sample at 10% for week 1, then 5%.

+ Rationale: 10% tightens week-1 CI enough to catch regressions early; cost ~$1.5k.

Cost ceiling: 1.4× of v2's average per-request cost over the rollout window.

Rollback path: feature flag cascade_v3_enabled can revert to v2 in <30s.

JSON delta: changed: grader_sampling.week_1 (5% → 10%) · added: grader_sampling.rationale · status: approved → approved

Status walks left to right as the agent advances the artifact through the lifecycle. Each step is one event; each event names the actor; the published version inherits the accent color the operator console uses everywhere to mean "canonical right now." The compare card below renders the text-level diff (additions green, removals red) and a JSON-key delta beside it — both fall out of one route (/api/runtime/artifacts/{id}/diff?from=N&to=M). The substrate doesn't have a documents subsystem; it has a docs table with three lifecycle events and an index. The console renders both axes.
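The two axes of the compare card can be approximated with the standard library alone. The sketch below is an assumption about how such a route might compute its response, not the route's actual implementation; `text_diff`, `flatten`, and `json_delta` are hypothetical helper names, and the dotted-key convention mirrors the `grader_sampling.week_1` style shown in the JSON delta above.

```python
import difflib


def text_diff(old, new, from_label="v1", to_label="v2"):
    """Unified text diff between two versions, as a list of lines."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=from_label,
            tofile=to_label,
            lineterm="",
        )
    )


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. grader_sampling.week_1."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def json_delta(old, new):
    """Key-level delta: which dotted keys were added, removed, or changed."""
    a, b = flatten(old), flatten(new)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Colorizing is then a rendering concern: lines starting with `+` or `-` map to green or red, and lines starting with `@@` are the hunk headers.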

Shared-update review queue — projects writing back to tenant memory

Some learnings stay local to a project (cascade v3's grader-sampling decision is router-project-specific). Others belong tenant-wide (the conclusion that opus is the right default grader across every router project). The substrate handles the latter without a separate workflow: propose_shared_update emits a shared_update.proposed event, the right-rail review queue surfaces it to operators with the proposer's actor + rationale + diff against the current tenant_memory state, and the queue's Approve button writes the change through to the tenant_memory artifact (and emits shared_update.approved); Reject closes the proposal with a reason (and emits shared_update.rejected). No proposal mutates tenant scope without an operator click. The whole loop is one tool and three events: the same shape as the artifact lifecycle, scoped one level up.

Naming note. Part 4's four-scope memory model distinguished Profile memory (cross-project calibrations for a way of working) from Org memory (firm-wide constraints). The substrate's tenant_memory partition currently holds both — the conceptual scope distinction lives in the shared-update's payload, not the storage layer. Splitting the partition along Part 4's lines is a Phase 3 cleanup, not a Phase 1B bug.

Shared-update review queue · tenant: ema · 2 pending

proposed Default grader: claude-opus across all router projects proj_llm_router · 14:21

by Maya Patel (usr_grader_ic)

The cascade v2 and v3 efforts in proj_llm_router both settled on claude-opus as the live grader after measuring >94% agreement-with-human across 800 traces. Promote to tenant-default so the next router project starts from this baseline instead of relitigating.

refs: dec_seed_005 (proj_llm_router, v3) · dec_v2_grader_pick (proj_llm_router, v2)

Approve → write to tenant_memory Reject…

proposed Synthetic prompts: rare-intent buckets only (don't broaden) proj_llm_router · 14:18

by Alex Chen (usr_ml_eval)

Broadening synthetic prompts to all intent buckets produced net regression in proj_llm_router (dec_seed_003). Recommend tenant-wide policy: synthetic for rare-only.

Approve Reject…

The review queue lives in the right rail of the console; this view zooms into two pending proposals on the ema tenant. Each one names the proposer, cites the load-bearing decision events from the originating project, and shows the operator the actual tenant-memory diff that approval would write. Approve emits shared_update.approved and writes through to the tenant_memory artifact; Reject prompts for a reason and emits shared_update.rejected. The substrate's promotion path from project-local learning to tenant-wide policy is one tool, one queue, one click.
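The propose, approve, and reject loop described above is small enough to sketch end to end. This is an illustrative model only: the `Tenant` container, the function signatures, and the dict-shaped events are assumptions, while the three event type names and the rule that tenant memory changes only on approval come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    memory: dict = field(default_factory=dict)   # stands in for the tenant_memory artifact
    events: list = field(default_factory=list)   # append-only event log
    pending: dict = field(default_factory=dict)  # review queue, keyed by proposal id


def propose_shared_update(tenant, proposal_id, actor, key, value, rationale):
    """Project-local agent surfaces a candidate; tenant memory is untouched."""
    tenant.pending[proposal_id] = {"key": key, "value": value}
    tenant.events.append(
        {"type": "shared_update.proposed", "id": proposal_id,
         "actor": actor, "rationale": rationale}
    )


def approve(tenant, proposal_id, operator):
    """Operator click: write through to tenant memory and emit the event."""
    change = tenant.pending.pop(proposal_id)
    tenant.memory[change["key"]] = change["value"]
    tenant.events.append(
        {"type": "shared_update.approved", "id": proposal_id, "actor": operator}
    )


def reject(tenant, proposal_id, operator, reason):
    """Operator click: close the proposal with a reason; memory is unchanged."""
    tenant.pending.pop(proposal_id)
    tenant.events.append(
        {"type": "shared_update.rejected", "id": proposal_id,
         "actor": operator, "reason": reason}
    )
```

The only function that touches `memory` is `approve`, which is the code-level form of "no proposal mutates tenant scope without an operator click."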

Multi-actor identity, color-keyed

Every event carries an Actor(kind, id, name). The actor is derived from the session's user identity at create-time and propagates through every event the session emits. The project-state slice rolls per-actor signals up: members_active_this_week, recent_decisions[].decided_by_actor_id, recent_file_reads[].by_actor_id, recent_questions[].asked_by_actor_id, recent_interventions[].matches[].actor. Identity threads through the log without anyone curating it.

The runtime console exposes this via a topbar user switcher pill backed by a curated roster of eight identities (five for the ema tenant covering the router-design program, three for kpmg covering the Champ change-management program). Each actor gets a deterministic color (HSL hash on actor.id) so collaboration becomes visible in the slice rendering — every decision, file touch, question, and intervention carries the color of whoever produced it.
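One way to get a deterministic color per actor is to hash the id into a hue and hold saturation and lightness fixed. The sketch below assumes SHA-256 and particular saturation and lightness values; the source only says "HSL hash on actor.id", so the specific hash, the constants, and the function names are illustrative, and the hues it produces will not match the colors shown in the console.

```python
import hashlib


def actor_hue(actor_id):
    """Map an actor id to a stable hue in [0, 360)."""
    # hashlib is used because Python's built-in hash() is salted per process
    # for strings, so it would not be stable across restarts.
    digest = hashlib.sha256(actor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 360


def actor_color(actor_id, saturation=0.55, lightness=0.45):
    """CSS hsl() string with fixed saturation and lightness."""
    return f"hsl({actor_hue(actor_id)}, {saturation:.0%}, {lightness:.0%})"
```

Because the color is a pure function of `actor.id`, no palette table has to be stored or curated; any client that sees the id can derive the same color.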

Members active this week

Soham Shah (usr_router_lead) · Maya Patel (usr_grader_ic) · Alex Chen (usr_ml_eval)

Recent decisions (6, last 14 days)

Cascade v3 spec: cost-aware escalation — proceed to design

by Soham Shah (usr_router_lead) · 2026-05-26 14:08

alternatives: Stay on v2 (rejected — leaves cost-aware gains) · Skip to v4 RL-trained policy (rejected — premature, no data)

Live grader = claude-opus, sample 5% (initial)

by Maya Patel (usr_grader_ic) · 2026-05-25 09:42

alternatives: Sonnet at 10% (rejected — agreement too low) · Ensemble (rejected — complexity not warranted)

Synthetic prompts only for rare-intent buckets

by Alex Chen (usr_ml_eval) · 2026-05-23 16:30

alternatives: All buckets (rejected — net regression) · No synthetic at all (rejected — rare buckets stay sparse)

The project-state slice rendered in the Project state tab. Each actor's id hashes to a deterministic color (HSL with fixed saturation/lightness so colors stay readable against the paper background). Scanning who-did-what is a glance, not a string match: Router Lead's blue threads through cascade-v3 work; Grader IC's red marks the grader-choice decision; ML Eval's green covers the synthetic-prompts call. The substrate doesn't have an opinion about "users" — every event is just an Actor, the projector rolls per-actor signal up by id, and the UI maps id to color. Switching identity in the topbar changes which id every subsequent event carries.

Causal-chain walker — every decision is auditable

The substrate's audit primitive is the caused_by field on every event. The runtime console makes this clickable: every event with a parent gets an inspect chain → link that opens a modal walking backward to root. Each step renders type + actor + timestamp + collapsible payload, so the operator can ask "why did this decision get recorded?" and trace it back through the agent's tool calls, the iteration's prompt build, all the way to session.opened (or the boundary where an ancestor falls out of the loaded scope, marked as such). Every action the agent takes is a walkable record back to root — the audit posture Part 4 promised, in code.

Causal chain · evt_d180f087

decision.recorded Soham Shah 14:23:14

event_id=evt_d180f087 · caused_by=evt_22c21f76

↑

tool.requested record_decision(...) 14:23:13

event_id=evt_22c21f76 · caused_by=evt_a8fa72a6

↑

prompt.built 14:23:11

event_id=evt_a8fa72a6 · caused_by=evt_5fc72371 · prompt_hash=sha256:6f8a…

↑

tool.responded read_project_state · ok 14:23:10

event_id=evt_5fc72371 · caused_by=evt_8fea42c4

↑

session.opened 14:23:02

event_id=evt_8fea42c4 · root

Walking caused_by backward from a recorded decision. Five steps from the headline event to root, each with the actor, the timestamp, and a collapsible payload. The decision was made because the agent called record_decision, which was prompted by a turn whose context was assembled from the prior read_project_state call, which was triggered by the agent's initial reasoning on session-open. Every step is in the same event log; every chain is walkable. When an ancestor falls out of the loaded window the modal notes the boundary instead of silently truncating.
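The walk itself is a short loop over `caused_by`. The sketch below assumes events are available as a dict keyed by event id and represented as plain dicts; `walk_chain` and the boundary-marker shape are hypothetical, and the sample ids are the ones from the chain shown above.

```python
def walk_chain(event_id, events_by_id):
    """Follow caused_by from an event back to root.

    Returns the steps newest first. If an ancestor is not in the loaded
    scope, the walk ends with a boundary marker instead of stopping silently.
    """
    steps, seen = [], set()
    current = event_id
    while current is not None:
        if current in seen:  # defensive: a malformed log must not loop forever
            steps.append({"boundary": "cycle", "event_id": current})
            break
        seen.add(current)
        event = events_by_id.get(current)
        if event is None:
            steps.append({"boundary": "out_of_scope", "event_id": current})
            break
        steps.append(event)
        current = event.get("caused_by")
    return steps


# Sample data mirroring the chain in the figure above.
EVENTS = {
    "evt_d180f087": {"type": "decision.recorded", "caused_by": "evt_22c21f76"},
    "evt_22c21f76": {"type": "tool.requested", "caused_by": "evt_a8fa72a6"},
    "evt_a8fa72a6": {"type": "prompt.built", "caused_by": "evt_5fc72371"},
    "evt_5fc72371": {"type": "tool.responded", "caused_by": "evt_8fea42c4"},
    "evt_8fea42c4": {"type": "session.opened", "caused_by": None},
}

chain = walk_chain("evt_d180f087", EVENTS)
```

Returning an explicit boundary step, rather than a shorter list, is what lets the modal distinguish "this is root" from "the rest of the chain was not loaded."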

The substrate's audit primitive has one more property the Part 4 design promised but the Phase 1B branch didn't initially deliver: the cause chain has to splice in the originating LLM round-trip, not jump over it. The Tier 2 work closed this gap. Every decision, briefing, and intervention now stamps llm_request_event_id into its payload at emit time (the RuntimeContext tracks last_llm_event_id on every llm.completed); the /api/runtime/events/{id}/chain route resolves the field and inserts the LLM call's prompt-hash, model, latency, and token counts under an llm_provenance key in the chain response. The audit walk no longer skips "the part where the model decided"; it shows it.

Hermetic replay and fork & diff — the substrate as a regression surface

Part 4 framed replay as the deploy-gate primitive: re-execute last quarter's prod sessions against a candidate model or prompt and gate the promotion on diff-under-threshold. The deploy-gate flow itself is still out of scope for Phase 2, but the substrate now ships every primitive that gate would consume. Three pieces, all in the branch:

LLM recording and replay. Every llm.completed event already carries the full response blob — prompt-hash, content, tool-calls, logprobs, model, latency, tokens. The llm_recording module wraps any provider with a RecordingProvider that captures round-trips to a JSON fixture, and llm_replay serves them back through a CachingProvider (cache-hit on prompt-hash) or a RecordedProvider (strict fixture-match). The briefing pipeline runs end-to-end in CI with no API key set, against a 12-event fixture, in tests/test_briefing_hermetic.py. Replay is no longer a future flow; it's the test surface that proves the briefing pipeline is deterministic.
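A stripped-down version of the record-then-replay idea looks like the sketch below. The class names `RecordingProvider` and `RecordedProvider` come from the text, but everything else is assumed for illustration: the single `complete(prompt)` method, the fixture layout, the `prompt_hash` helper, and the `EchoProvider` stand-in are not the harness's actual interfaces.

```python
import hashlib
import json


def prompt_hash(prompt):
    """Stable key for a prompt (assumed format: 'sha256:' plus hex digest)."""
    return "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class RecordingProvider:
    """Wraps a provider and captures every round-trip keyed by prompt hash."""

    def __init__(self, inner):
        self.inner = inner
        self.fixture = {}

    def complete(self, prompt):
        response = self.inner.complete(prompt)
        self.fixture[prompt_hash(prompt)] = response
        return response

    def dump(self):
        """Serialize the captured round-trips as a JSON fixture."""
        return json.dumps(self.fixture, sort_keys=True)


class RecordedProvider:
    """Strict replay: serves recorded responses, fails on anything unrecorded."""

    def __init__(self, fixture_json):
        self.fixture = json.loads(fixture_json)

    def complete(self, prompt):
        key = prompt_hash(prompt)
        if key not in self.fixture:
            raise KeyError(f"no recorded response for {key}")
        return self.fixture[key]


class EchoProvider:
    """Stand-in for a live model so the sketch runs without network access."""

    def complete(self, prompt):
        return {"content": prompt.upper(), "model": "echo"}
```

The strict variant fails loudly on an unrecorded prompt, which is the property a hermetic test wants: any drift in prompt construction shows up as a missing fixture entry rather than a silent live call.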

Session fork. fork_session takes a session and a cut-event, copies the event prefix into a fresh project, and returns the new session id. The forked session has its own actor stamping and runs in isolation; the original is untouched. (Part 5 framed this primitive as profile forking; the shipped version forks at the session-level event stream instead, which is the more general case — profile binding is preserved in the child, so a profile-only fork is just a session fork that swaps the profile pointer before resuming.) Pair with a different prompt or a different model and you have a clean A/B for "what would the second turn have looked like if…"
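The core of a fork is a prefix copy. The sketch below models events as dicts in a flat list and returns the copied prefix under a new session id; creating the fresh project, re-stamping actors, and preserving the profile binding are omitted. The signature and field names are assumptions, not the shipped `fork_session`.

```python
import copy
import uuid


def fork_session(events, session_id, cut_event_id):
    """Copy a session's events up to and including the cut event.

    Returns (new_session_id, forked_events). The input list is not modified.
    """
    source = [e for e in events if e["session_id"] == session_id]
    ids = [e["event_id"] for e in source]
    if cut_event_id not in ids:
        raise ValueError(f"{cut_event_id} is not in session {session_id}")
    prefix = source[: ids.index(cut_event_id) + 1]
    new_session_id = f"sess_{uuid.uuid4().hex[:8]}"
    forked = []
    for event in prefix:
        clone = copy.deepcopy(event)  # isolation: no shared mutable payloads
        clone["session_id"] = new_session_id
        forked.append(clone)
    return new_session_id, forked
```

The deep copy matters more than it looks: if the fork shared payload objects with the original, a mutation in the child would leak into the parent and the A/B comparison would no longer be clean.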

Structural diff. diff_sessions(a, b) compares two sessions across decisions, file ops, briefings, event-type counts, and emits a diverged: bool flag plus a per-axis delta. Two HTTP routes expose this: POST /api/runtime/fork creates the fork; GET /api/runtime/diff?a=X&b=Y returns the diff. The console's right rail has a "Fork & diff" card that wires both into a two-click flow: fork the active session, run a different prompt in the new one, paste both ids, see the delta.
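A reduced version of the structural diff, covering two of the four axes named above (decisions and event-type counts), might look like this. The event shape, the use of a `title` field to identify decisions, and the output layout are assumptions made for the sketch; only the `diverged` flag and the per-axis delta idea come from the text.

```python
from collections import Counter


def diff_sessions(a, b):
    """Structural diff of two event lists: per-axis deltas plus a diverged flag."""

    def decisions(events):
        return {
            e["payload"]["title"] for e in events if e["type"] == "decision.recorded"
        }

    def counts(events):
        return Counter(e["type"] for e in events)

    dec_a, dec_b = decisions(a), decisions(b)
    cnt_a, cnt_b = counts(a), counts(b)
    delta = {
        "decisions": {
            "only_a": sorted(dec_a - dec_b),
            "only_b": sorted(dec_b - dec_a),
        },
        # Positive numbers mean session b has more events of that type.
        "event_type_counts": {
            t: cnt_b[t] - cnt_a[t]
            for t in sorted(set(cnt_a) | set(cnt_b))
            if cnt_a[t] != cnt_b[t]
        },
    }
    diverged = bool(
        delta["decisions"]["only_a"]
        or delta["decisions"]["only_b"]
        or delta["event_type_counts"]
    )
    return {"diverged": diverged, "delta": delta}
```

File operations and briefings would be added as further keys under `delta` in the same style; the `diverged` flag is simply "any axis is non-empty."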

The deploy-gate flow is the policy on top of these primitives — pick a session set, pick a candidate, run replay, threshold the diff, gate the promotion. It's a Phase 3 piece. What the substrate ships today is the underlying machinery that makes the gate cheap to build: every prod call is replayable by construction, every replay is byte-comparable, every comparison runs against a forked-isolated copy of state.

The qualitative test matrix

The substrate's value props are concrete and individually testable. Every claim Parts 1–5 made gets a clickable counterpart in the console. The seeded router project (21 events, 2 canonical docs, 3 simulated sessions, 6 decisions across the program's work-streams, 1 active in-flight claim, 1 unresolved HITL question, 1 peer-session intervention) gives the operator a project that already feels in-flight, not a blank slate.

Value prop	Hard test	Where the answer lives
Cross-session memory survives restart	Record a decision, stop the server, restart, create a new session.	The new session's briefing references the decision. Event lands in the `events` table on Postgres; survives by definition.
The agent actually receives the briefing	Open the collapsed System prompt (live) panel above the chat.	Green briefing inserted ✓ tag; expand to see the `## Session briefing` markdown block appended to `base_instructions`.
Briefings summarize the slice, not boilerplate	Open Project state tab. Note specific decision ids + the unanswered question. Open a new session.	The briefing block at the top of the chat must reference at least one decision id and the open question. The system prompt requires this; failures show as generic "the project is going well" prose.
Duplicate-check prevents concurrent writes	As Router Lead start a session, ask the agent to write `router/cascade_v3.py`. While in flight, as Maya in a second window, ask the same.	Maya's `write_file` returns an `intervention` payload. A new `intervention.surfaced` event lands in the log. The slice's `recent_interventions` ticks.
Decisions are auditable end-to-end	Click any `decision.recorded` event → inspect chain →.	The modal walks backward: decision ← tool.requested (record_decision) ← prompt.built (the iteration) ← prior tool.responded or session.opened. Every step has actor + timestamp + payload.
Agent queries the substrate unprompted	Ask an ambiguous project question without telling the agent to use the tool.	A good agent calls `read_project_state` with `fields=["recent_decisions"]` and quotes the actual rationale from the slice. If it invents an answer, the profile's prompt needs sharpening.
Identity flows into the log	Switch user pill → record a decision → switch user → record another.	Project state shows both decisions in their distinct colors. `members_active_this_week` shows both ids.
Mid-turn disconnect survives	Send a long-running prompt; close the browser tab; reopen and reactivate.	Chat surface paints partial history; if mid-HITL the prompt re-renders via `/turn-state`; if completed, final state is intact.
suggested_next becomes operator action	Open a session, wait for the briefing block, click any orange "→" chip.	Composer fills with the rationale + ref; send → agent acts. Loop closed: briefing recommends, operator agrees, agent executes.
Project pulse reacts to live activity	Open Project state tab in a second window. Send a tool-heavy turn in the first.	The slice auto-refreshes (debounced 500ms) on every state-affecting SSE event. `recent_file_reads`, `recent_decisions`, `in_flight_claim_event_ids` all tick live.
Artifact lifecycle is a state machine, not free-text	Ask the agent to `emit_artifact` a memo, then `review_artifact`(open), then `review_artifact`(approve), then `publish_artifact`.	Status pill walks draft → under_review → approved → published in the Docs panel. Each step emits exactly one event; the partial unique index demotes any prior published version automatically. Invalid transitions (publish from draft, approve from approved) are refused with a state-machine error.
Artifact diff renders both axes	Run `emit_artifact` twice with the same name and different content. Open the artifact, pick v1 ↔ v2 in the compare dropdowns.	Colorized unified-text diff (additions green, removals red, hunk headers grey) plus a JSON-key delta listing added / removed / changed keys. Both come from one server route: `/api/runtime/artifacts/{id}/diff?from=N&to=M`.
Project-local learning promotes to tenant memory	Ask the agent to `propose_shared_update` for a tenant-wide policy. Switch to the right rail's "Shared-update review queue." Click Approve.	Pending card appears with proposer + rationale + refs to load-bearing decisions. Approve writes through to the tenant_memory artifact and emits `shared_update.approved`; the queue clears. Reject prompts for a reason and emits `shared_update.rejected`; tenant_memory is unchanged.
The cause chain shows the model call	Click any `decision.recorded` event → inspect chain →.	The chain modal splices the originating `llm.completed` under a `llm_provenance` key — model id, prompt-hash, latency, input/output tokens. The audit no longer skips "where the model decided"; it's a labelled node on the same chain.
Briefing pipeline is replayable without an API key	Run `uv run pytest tests/test_briefing_hermetic.py -v` with `ANTHROPIC_API_KEY` unset.	The test passes against a 12-event fixture played back through `RecordedProvider`. The briefing turn, the tool-forcing path, and the prose-injection step all run end-to-end without network.

Each row is a one-click test. None of them require staging data beyond the seeded router project (which covers the prerequisites for thirteen of the fifteen) and a one-line pytest invocation (for the hermetic-replay row). The cross-session-memory-across-restart row is a five-second test: record, restart, open. The whole matrix runs on localhost:8420 against a real Postgres, not against a mock — the Part 6 instrument made clickable a phase early.

What ships today, what doesn't

Phase 1B landed every substrate primitive Part 4's Phase 1 spec named — the event log + projector library, the profile / project / memory / artifact subsystems, the three gate integrations — with one exception (external collaboration roundtrip; see the deferred row below). The Phase 2 trio and the Tier 2/3 verification work added more primitives the original Phase 1 spec didn't enumerate. The full inventory below is sorted by status. The cuts are sharp on purpose — shipped means clickable in the console and exercised by the test suite, scaffolded means emit sites exist but no read path consumes them, deferred means design work, not just engineering.

Primitive	Status	Notes
Event log + projector library	shipped	26 event types across session lifecycle, per-turn, briefing, tool-execute, in-flight, artifact lifecycle, decision, HITL ask + resolution, shared-update propose-approve. Two projectors (`project_state`, `in_flight`). In-memory + Postgres backed.
Profile (orchestrator config)	shipped	Versioned, project-scoped; create + publish + archive lifecycle; two seeded profiles (Champ at kpmg, designing-llm-router at ema).
Docs folder (plan, program_memory, generated_doc, tenant_memory)	shipped	Read at session-entry; snapshot attached to `RuntimeContext`. Tenant-memory kind added in Phase 2 as the destination for approved shared-updates.
Session-entry hooks (bind, validate, snapshot)	shipped	Live in `SessionManager._attach_runtime`; missing-required-input returns HTTP 400 with the missing keys; briefing-emitted state rehydrates from event log post-restart.
Briefing turn + insertion	shipped	Slice → LLM (with forced `record_briefing` tool) → prose → markdown injected into both `managed.system_prompt` (persistent via `db.update_system_prompt`) and the live `harness_session.system_prompt`. Failure paths covered: `briefing.failed` with reason fields for provider_error / schema_error / no_tool_call / no_project_state_projector.
Tool-execute hook (briefing-first + duplicate-check)	shipped	Wraps `ToolRegistry.execute` with three layered gates: briefing-first (refuse operational tools until `briefing.generated` lands; `read_project_state` allowlisted), duplicate-check (artifact-creating tools route through the in-flight projector; matches return an `intervention` payload), then run.
Six built-in tools (read + write)	shipped	Two read-side (`record_decision`, `read_project_state`) plus four write-side (`emit_artifact`, `review_artifact`, `publish_artifact`, `propose_shared_update`). Auto-registered on every substrate-enabled session.
Multi-actor identity + console user-switcher	shipped	8-identity roster; `actor` threads through every event; deterministic-color rendering throughout the slice.
Artifact lifecycle (draft → review → approved → published → deprecated)	shipped	Three lifecycle events plus the partial unique index on `(project_id, name, status='published')`. Console renders status pills + click-into version table. Auto-demotes prior published when a new version graduates.
Per-version artifact diff	shipped	Routes: `/api/runtime/artifacts/{id}/versions` + `/diff?from=N&to=M`. Console renders colorized unified-text diff + JSON-key added/removed/changed delta.
Shared-update review queue	shipped	Project-local `propose_shared_update` tool emits `shared_update.proposed`; right-rail queue with Approve / Reject; approve writes through to tenant_memory artifact and emits `shared_update.approved`; reject emits `shared_update.rejected`.
Causal-chain LLM provenance splice	shipped	`RuntimeContext.last_llm_event_id` tracked on every `llm.completed`; decisions / briefings / interventions stamp `llm_request_event_id` in payload. `/events/{id}/chain` route splices the LLM round-trip under `llm_provenance`.
Hermetic LLM recording + replay	shipped	`RecordingProvider` / `RecordedProvider` wrap any provider; `LLMReplayCache.from_events` + `CachingProvider` serve prompt-hash hits from `llm.completed` blobs. Briefing pipeline runs end-to-end in CI with no API key via `tests/test_briefing_hermetic.py`.
Session fork + structural diff	shipped	`fork_session` copies a session prefix into an isolated project; `diff_sessions` emits a structural diff across decisions / files / briefings / event-type counts with a `diverged` flag. Routes: `POST /api/runtime/fork` + `GET /api/runtime/diff`. Console right-rail "Fork & diff" card.
HITL resolution loop + rehydration	shipped	`question.asked` / `question.answered` / `permission.granted` / `permission.denied` events. Pending-set derived from event log (no separate index); cross-restart durable. Slice rolls the answer back into `recent_questions`.
Skill loading from profile `skill_folders`	scaffolded	Profiles declare skill folders; runner doesn't yet load them into the prompt. ~half-day of work to land.
Reviewer pattern (profile `reviewer_model`)	scaffolded	Field exists; no review hook fires per-turn yet.
External collaboration roundtrip (UC6 P0)	deferred	Out of scope. Push to SharePoint / external system, signal-done, fetch-canonical, diff capture. Heaviest scope; needs an external integration target.
Deploy-gate replay flow	deferred	The replay infrastructure ships; the policy on top (pick session set, run candidate, threshold the diff, gate promotion) is a Phase 3 piece.
Knowledge-gap signal + observation pipeline	deferred	`briefing.failed` + low-confidence tool results give the signal substrate; the consumer isn't built.
Cost budget enforcement / auto-reroute	deferred	`llm.completed` events carry the per-iteration token + cost numbers; the policy that acts on them isn't.
Sub-agent chained identity	deferred	Sub-agents work but inherit identity wholesale; the principal-chain story is a Phase 3 design piece.

What remains genuinely deferred is the external-roundtrip integration, the deploy-gate policy that consumes replay, the observation pipeline that consumes the knowledge-gap signal, the policy layer that consumes cost telemetry, and the chained-principal identity work — each one a dedicated build cycle, not a retrofit. The Phase 2 row from the previous draft of this writeup is otherwise gone.
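The tool-execute hook in the inventory above layers briefing-first, duplicate-check, and run. A minimal sketch of that ordering follows; the function name, the keyword arguments, the result shapes, and the membership of the two tool sets are assumptions for illustration (the text confirms only that `read_project_state` is allowlisted and that a duplicate match returns an `intervention` payload).

```python
# Assumed sets for the sketch; the real rosters live in the harness.
BRIEFING_ALLOWLIST = {"read_project_state"}
ARTIFACT_CREATING = {"write_file", "emit_artifact"}


def gated_execute(tool, args, *, briefing_generated, in_flight, run):
    """Briefing-first, then duplicate-check, then run the tool.

    in_flight maps a target (path or artifact name) to the claim that holds it.
    run is the underlying executor, called only if both gates pass.
    """
    # Gate 1: refuse operational tools until the briefing has landed.
    if not briefing_generated and tool not in BRIEFING_ALLOWLIST:
        return {"status": "refused", "reason": "briefing_required"}
    # Gate 2: artifact-creating tools consult the in-flight projection.
    if tool in ARTIFACT_CREATING:
        target = args.get("path") or args.get("name")
        claim = in_flight.get(target)
        if claim is not None:
            return {"status": "intervention", "matches": [claim]}
    # Gate 3: run the tool.
    return {"status": "ok", "result": run(tool, args)}
```

Ordering the gates this way means an intervention can only ever be returned to an agent that has already received its briefing, so the agent has the context to act on it.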

The substrate ships a write-side and a verification surface, not just a read-side. Phase 1B made the project compound (decisions outlive sessions), sessions coordinate (duplicate-check + briefings), every action auditable (caused-by chains walk to root, now including the LLM round-trip), agents introspect (read_project_state), and identity flow (multi-actor rendered in the slice). The Phase 2 trio adds the loop that lets a project promote work-products into canonical state (artifact lifecycle) and lets a project promote findings into tenant-wide policy (shared-update queue), with every artifact version diff-able and every session forkable for A/B replay. The substrate that "improves from its own telemetry" is no longer aspirational — it's two tool calls and an Approve button away from running every day.

The detailed engineering writeup of every commit, every test scenario, and every troubleshooting path lives in the agent-harness repo at docs/runtime-orchestration/phase-1b-runtime-console-guide.md — a 1,049-line operator's manual for the console, with a step-by-step smoke test, the hard use cases (multi-window race-condition tests, restart-survival tests, identity-flow tests, artifact lifecycle walks, shared-update approval), and a troubleshooting section for every error mode encountered during build. PR #429 carries the full diff against main. The targeted test suite runs ~130 tests across 22 substrate-relevant files; test_briefing_hermetic, test_llm_recording, and test_llm_replay all run without a live API key.

Part 8 — honest limits and what "done" means

Things that are harder than they look

Ten places where the architecture has a clean answer on paper and an empirical question in production.

Semantic index quality. False negatives break the trust contract; false positives annoy users. Threshold tuning is project-specific and probably has to learn from user feedback. Acknowledged in Phase 1 by shipping the in_flight registry first and the index second once there are artifacts to index against. Chroma's context-1 work on self-editing search agents applies at the artifact-index level.

Briefing quality is hard to evaluate. The whole synthesis hinges on the briefing being useful. Grounding on a JSON intermediate (Part 3) prevents hallucination of decisions and artifacts that don't exist, but doesn't measure whether the briefing surfaces the right things, in the right order, with the right emphasis. The "Suggested next" list is a recommender with an accepted/ignored/dismissed reward signal; the briefing prose itself needs an eval surface. Open: a per-session "was this briefing useful?" microsurvey, or measuring downstream action latency post-briefing as the proxy.

Tool roster scaling ceiling. Auto-nominating every tenant AIE works at tens of AIEs; at hundreds or thousands (Autopilot scale, a mature HR Suite tenant) even WARM-tier prompt summaries blow out budget. Three options: pay the prompt growth and let caching amortize; embed AIE descriptors and retrieve-on-demand; hot/warm/cold tier by telemetry. Choice is a measurement question that production answers. The substrate's read API handles the implementation; the policy is open.

Service identity for AIE invocations. MVP assumes the invoking user's identity flows through every sub-agent and external call. Real world: an HR workflow needs admin-level read access to data the invoking user can't see, filtered down to user-appropriate output. The pattern is service-identity-with-row-level-filters; design work, not yet built. Real capability, real escalation risk. Solving this and the chained-principal story (Part 4) together is Phase 3.

HITL preemption false-positive rate. Selective bubble-up means the orchestrator can preempt an optional HITL it thinks it has context for. The failure mode is silent wrong behavior: the orchestrator skips an approval the human would have rejected and nothing flags it. The eval shape is F1 against ground-truth human decisions on a held-out set; we don't have that production-grade yet. Conservative default for the first month per profile: bubble unconditionally, learn what's safely preemptable from disagreements.

Skill distillation stability. The bootstrapping loop depends on converting traces into skills without hand-curation. Bad distillation produces unstable feedback. The Ctx2Skill / adversarial-self-play research is promising; not yet at the maturity to drop HITL-gating on auto-distilled skills. Until then, all distilled skills land at unvalidated and require human promotion. Phase 3 unblocks autonomous distillation if the research lands.

Named merge strategies for parallel dispatch. The candidates (dedupe_and_rank, summarize_and_reconcile, compare_and_vote, union_with_provenance) are sketches; the right names emerge from two or three real merge cases in production. Parallel fan-out beyond N=5–7 may require structured merges whose quality degradation curve hasn't been measured. Single-orchestrator-with-named-merge ships in Phase 3; the primitives crystallize after.

Multi-project memory bleed and graduation. Some skills learned in one project should propagate to others (the date-conflict check applies to every rollout the team runs). The graduation path from project to profile to org is propose-approve; something has to notice the pattern fires across projects to propose graduation. The cross-project observation pipeline is a Phase 2+ piece; the cross-project privacy story (one tenant's pattern doesn't leak into another's) is the harder half.

Substrate evolution at scale. Event log schema migrations are a known industry pain. The versioning + projector-composition discipline (Part 4) is the right architectural answer; it doesn't make it cheap. Every new event type added in Phases 2 and 3 has to be replay-safe against the Phase 1 history. Test discipline matters: strict-mode replay against last week's events on every substrate change.

The "Champ MVP IS Phase 1" framing has caveats. Phase 1 is real and shippable for June 15; "real" means a minimal event-sourced project subsystem in production, the briefing gate live, the in_flight registry deployed, the four decision-flow fields in the docs folder. Not "every existing harness subsystem migrated to projections by then." The substrate is partial in scope at June 15, complete in shape. Old subsystems migrate in waves over Phases 2-3 as their owners are ready, with strict-mode replay guarding each cutover.
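For the HITL preemption item above, the eval shape named there (F1 against ground-truth human decisions) is straightforward to compute once a held-out set exists. The sketch below assumes each example is a pair of booleans: whether the orchestrator predicted "safe to preempt" and whether the human would in fact have approved. The function name and input format are hypothetical.

```python
def preemption_f1(predicted, truth):
    """F1 of 'safe to preempt' predictions against human ground truth.

    predicted[i] is True when the orchestrator would skip the HITL;
    truth[i] is True when the human would have approved anyway.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The false positives are the silent-wrong-behavior cases described above (a skipped approval the human would have rejected), so in practice precision is likely the number to watch alongside the combined score.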

What we're not doing that others are, and why

Four things the field is building that this design deliberately doesn't.

A competing connector marketplace. MCP's ecosystem and Anthropic's open SKILL.md format already cover the connector and skill layers; the marketplaces around them are noisy enough that adding another is a tax, not a service. The substrate participates in both and adds maturity gates and provenance fields as extension metadata (Part 4). Leverage move: spend engineering on the substrate, not the marketplace.

A competing cross-vendor handoff protocol. The wire layer has more candidates than the field can absorb; adding another would be noise. The substrate adapts through wire-transport adapters (Part 4) and contributes a chained-principal signing extension upstream where it fits.

A visual no-code agent canvas. Visual flow editors are a standing market category. The orchestrator wizard (Surface 7) is for builders configuring profiles, not authoring DAGs; profile training mode and the skill review queue (Surface 5) are for skill curation. The bet is that orchestration is too dynamic to express in a DAG editor and that the few users who want a DAG are better served by saving generated workflows (Phase 3) and promoting them to first-class AIEs.

Tying to one model provider. Hosted agent products tie you to their model. The substrate doesn't. Lock-in is to the operational layer: the substrate, the skills, the project state. Replay only works because the project, not the model, is the asset.

What "done" looks like

For the June 15 KPMG Champ ship: the orchestrator AIE runs every session; the FA's docs folder holds plan + program_memory + generated_docs plus the four new fields (decisions, open_questions, in_flight, blockers); session-entry briefing renders; the duplicate-check intervention fires before any artifact-creating tool; every event flows through the substrate; the audit trail walks. Phase 1 is real. Phase 2 and 3 are reachable from the same substrate without re-architecture.

For the longer arc: a new person joining a six-month-old project experiences it as a coherent, current, accumulating thing, not as chat scrollback plus a folder of artifacts plus a colleague to interrupt for context. The agent is one of several participants in a durable, governed, event-sourced shared object. Skills accumulate. Knowledge gaps close. Replay catches regressions before deploy. Every cross-vendor handoff is signed back to the originating principal. Costs are bounded at the loop, not after the fact. The six things that compound (Part 1) all sit on the same event log and improve whether or not anyone notices.

The agent loop is fungible. The substrate is the moat.

Companion notes: this design builds on existing agent-harness and runtime-orchestration work in flight, with the June 15 Champ MVP pinned to Phase 1. The substrate takes inspiration from event-sourced reactive runtimes without depending on them; the integration is too deep to be a library.
