Soham Shaheval/substrate/03
Evaluation engineering · May 2026

When the eval becomes part of the agent

Agent evaluation is moving from scorekeeping to substrate design. A benchmark result is no longer a fact about a model. It is a view over a harnessed trajectory, produced by a stateful environment, scored by an evaluator that increasingly feeds back into the system it is supposed to measure. By substrate, I mean the machinery that makes that result reproducible: the environment, harness, rollout record, evaluator, reporting rule, and feedback path.

03 · Topic: agents / evals / harnesses · Mechanism: score → rollout → substrate
Opening

A score is a view over a rollout

The old unit of evaluation was a response. The new unit is a rollout.

That sounds like a bookkeeping change until you inspect what a rollout contains: tool calls, filesystem state, browser state, retries, permission prompts, skipped runs, cached tokens, judge prompts, failed tests, intermediate screenshots, final patches, and a reporting rule that decides which of those facts count.

In May, the Rollout Cards authors audited 50 popular agent training and evaluation repositories. None reported failed, errored, or skipped rollouts beside headline scores. The audit found 37 reporting-rule discrepancies. When the authors re-graded fixed artifacts under different but plausible rules, reported scores moved by up to 20.9 pp and model rankings could invert.

The behavior did not change. The view did.

Working definition. A rollout is the recorded sequence of actions, observations, tool outputs, state transitions, errors, retries, and final outcomes produced by a harnessed policy in an environment.

That distinction matters because agent benchmarks are no longer simple question-answer sets. They are runtime systems. They define what the model can see, what tools it can call, what state it can mutate, how many times it may retry, what counts as a crash, how the final answer is extracted, and which grader decides whether the task succeeded.

On static computer-use benchmarks, this produces a stranger failure. Computer Use at the Edge of the Statistical Precipice showed that a 1 MB script that never looks at the screen can beat frontier models on OSWorld and MobileWorld by replaying successful action sequences. It scores 71.1% on OSWorld against a frontier reference of 70.6%, and 41.5% on MobileWorld against 32.7%. On DigiWorld, a benchmark built to resist this static replay effect, the same script collapses to 6.9%.

The lesson is not that computer-use benchmarks are useless. The lesson is more precise: a deterministic, static environment can accidentally measure trajectory replayability instead of robust interactive capability. Benchmark design has become environment design.

We used to evaluate model outputs. Agents force us to evaluate trajectories. Trajectories require harnesses. Harnesses produce signals. Those signals now train, gate, route, and monitor the agent.
Figure 01 · Non-model changes that move agent evaluation results
These are deliberately heterogeneous: reporting-rule effects, framework effects, grader-validity gaps, and scaffold differences. Each is a case where a headline number changes without a clean story that begins and ends at “the model got better.”

MASEval · framework effect · same model/task family · 30.9 pp
Claude Haiku 4.5 on MACS Travel: smolagents 90.4% vs LlamaIndex 59.5%. The framework is not decoration; it changes measured system behavior.

METR / SWE-bench PR review · validity gap · automated grader vs maintainer · 24.2 pp
Automated tests accepted substantially more patches than real maintainers would merge. The evaluator measured test-passing, not necessarily mergeable engineering work.

Rollout Cards · view effect · same evidence, different reporting rule · 20.9 pp
Fixed artifacts re-scored under different plausible rules shifted scores and could invert model rankings. The view moved; the behavior did not.

WildClawBench · harness effect · same model/tasks · 18 pp
Long-horizon CLI tasks show harness-dependent performance. The action surface and runtime wrapper are part of the measured system.

Terminal-Bench 2 · scaffold effect · same model, different agent wrapper · 8.9 pp
A useful counterweight: scaffold matters, but model capability still matters. The point is joint measurement, not “all harness.”

01 · Evaluation data model

From output scores to rollout views

To make the rest of the argument precise, separate four objects that often get compressed into the word “benchmark.” A response is what a model says. A trajectory is the sequence of steps it takes. A rollout record is the preserved evidence from that trajectory: observations, actions, tool calls, state changes, retries, crashes, costs, and final artifacts. A view is the reporting rule that selects which parts of the record become a score.

This distinction is not semantic hygiene. It changes what can be audited. A transcript can tell you what the agent claimed. A rollout record can tell you what the agent changed. A view can tell you why the same evidence was reported as success, failure, partial progress, or invalid.
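
The four objects can be written down as types. What follows is a minimal sketch, not a schema from any of the cited projects: the class names, the fields, and the keyword check inside `claimed_success` are all assumptions made for illustration. It shows the gap between a transcript-only view and a state-based view over the same two records.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    action: str                   # e.g. a tool call issued by the agent
    observation: str              # what the environment returned
    error: Optional[str] = None   # crash or tool failure, if any


@dataclass(frozen=True)
class RolloutRecord:
    task_id: str
    response: str                 # what the agent finally said
    trajectory: Tuple[Step, ...]  # the steps it took
    final_state_ok: bool          # what the environment state shows


# A view is a reporting rule: preserved records in, one number out.
View = Callable[[List[RolloutRecord]], float]


def claimed_success(records: List[RolloutRecord]) -> float:
    """Transcript-only view: trust what the agent said (toy keyword check)."""
    return sum("done" in r.response.lower() for r in records) / len(records)


def verified_success(records: List[RolloutRecord]) -> float:
    """State-based view: trust what the environment recorded."""
    return sum(r.final_state_ok for r in records) / len(records)


# Two hypothetical records with identical claims and different end states.
records = [
    RolloutRecord("t1", "Done, tests pass.",
                  (Step("run_tests", "3 passed"),), True),
    RolloutRecord("t2", "Done, tests pass.",
                  (Step("run_tests", "", error="timeout"),), False),
]

print(claimed_success(records), verified_success(records))
```

Both functions satisfy the same `View` signature, which is the point: the record is fixed, and the choice of function is the reporting rule.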

Substrate. In this piece, substrate means the machinery that keeps those objects reproducible: environment, harness, rollout record, evaluator, reporting rule, and feedback path.

OpenAI’s February post on why SWE-bench Verified no longer measures frontier coding capabilities is a concrete example. OpenAI audited a 27.6% subset of often-failed Verified tasks and found that at least 59.4% had flawed tests rejecting correct solutions. It also found that frontier models could reproduce gold patches or task-specific details for some problems. The reported score was no longer stable evidence of frontier coding ability because the benchmark had both scoring-rule problems and contamination problems.

SWE-bench Pro is one response. DigiWorld is another. Rollout Cards is a third. They are different projects, but they share a premise: when the task is interactive, the benchmark has to specify the environment and preserve enough of the trajectory for later inspection.

Benchmark contamination is usually discussed as a data problem: the model saw the answer. For agents, contamination can become an environment problem. If the benchmark leaves gold files, deterministic UI paths, hidden configs, or test oracles inside the runtime, the eval is not just in the training data. It is in the world.
Figure 02 · Same rollout, different reporting view (illustrative)
This is an illustrative toy rollout, not an empirical result. The underlying evidence stays fixed while the view, denominator, and interpretation change.

Headline view: 7/11. Count completed trials with final state success. Crashes, skips, and policy violations sit outside the denominator.

Fixed rollout evidence under this view:
Final success: Accepted
Errored runs: Dropped
Skipped trials: Dropped
Policy violation: Ignored
Cost/time: Logged
Replay path: Untested
Tool calls: Recorded
Partial progress: Ignored
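
The denominator effect in Figure 02 is plain arithmetic. The sketch below reuses the toy headline of 7 successes in 11 completed trials; the counts of errored and skipped trials are invented for illustration and come from no cited source.

```python
from collections import Counter

# Hypothetical outcomes. 7 successes among 11 completed trials mirrors the
# toy headline; the 2 errored and 2 skipped trials are made-up additions.
trials = ["success"] * 7 + ["failure"] * 4 + ["errored"] * 2 + ["skipped"] * 2
counts = Counter(trials)


def completed_only(c: Counter) -> float:
    """Headline view: errored and skipped trials leave the denominator."""
    return c["success"] / (c["success"] + c["failure"])


def errors_as_failures(c: Counter) -> float:
    """Errored trials count against the agent; skipped trials are dropped."""
    return c["success"] / (c["success"] + c["failure"] + c["errored"])


def every_scheduled_trial(c: Counter) -> float:
    """Strictest view: every scheduled trial is in the denominator."""
    return c["success"] / sum(c.values())


views = {
    "completed only": completed_only,
    "errors as failures": errors_as_failures,
    "every scheduled trial": every_scheduled_trial,
}
for name, view in views.items():
    print(f"{name}: {view(counts):.1%}")
```

Under these assumed counts the same evidence reads as 7/11, 7/13, or 7/15. None of the three is wrong. Each is uninterpretable without the rule that produced it.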
02 · The harness is measured

The harness is not wrapper code

A harness is often described as scaffolding: system prompts, tools, sandboxes, context managers, browser adapters, permission policies, retry loops, graders, and logging. That description is accurate but too mild. For agents, the harness defines the action space. It says what the model can perceive, what it can change, what it can ask for, what it can recover from, and what evidence survives the run.

LangChain’s blunt version is useful: if you are not the model, you are the harness. Anthropic’s agent-eval posts break the same surface into tasks, trials, graders, tool loops, and agent harnesses. OpenAI’s harness-engineering post frames Codex performance in terms of interfaces around the model. The shared observation is simple: once the model acts through tools, the wrapper becomes part of the policy.

Why this matters. A model without the harness cannot run the benchmark. The harness supplies perception, actions, memory, permissions, stopping rules, and evidence retention.

MASEval makes this empirical. It evaluates multi-agent systems across models, frameworks, and benchmarks, then argues that the system, not the model, is the right unit of analysis. In one salient case, Claude Haiku 4.5 on MACS Travel scores 90.4% with smolagents and 59.5% with LlamaIndex. That is a 30.9-point swing without changing the underlying model.

WildClawBench shows a similar effect in long-horizon CLI tasks. It uses human-authored bilingual tasks, real tools, Docker runtimes, and hybrid grading. The top model reaches only 62.2%. More importantly for this argument, switching the harness alone shifts a single model’s score by up to 18 points.

Terminal-Bench 2 is a useful counterweight. It has 89 hard terminal tasks and more than 32,000 trials. It shows substantial scaffold effects: GPT-5.2 with Codex CLI reaches 62.9%, while the same model with Terminus 2 is reported at 54.0%. But it also shows that model capability still matters. Same-harness model gaps remain large. The right conclusion is not “all harness, no model.” The right conclusion is that an agent score is a joint property of model, harness, environment, evaluator, and reporting rule.

The model still matters. But the score is not a measurement of the model alone.
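
One way to keep that joint property visible is to never store a score under a model name alone. The sketch below keys the four figures quoted above by benchmark, model, and harness, then computes the harness spread for each model. The key layout and the function name are illustrative choices, not a format used by any of the cited benchmarks.

```python
from typing import Dict, List, Tuple

# Scores quoted in the text, keyed by (benchmark, model, harness).
scores: Dict[Tuple[str, str, str], float] = {
    ("MACS Travel", "Claude Haiku 4.5", "smolagents"): 90.4,
    ("MACS Travel", "Claude Haiku 4.5", "LlamaIndex"): 59.5,
    ("Terminal-Bench 2", "GPT-5.2", "Codex CLI"): 62.9,
    ("Terminal-Bench 2", "GPT-5.2", "Terminus 2"): 54.0,
}


def harness_spread(table: Dict[Tuple[str, str, str], float]) -> Dict[Tuple[str, str], float]:
    """Max minus min score across harnesses, per (benchmark, model)."""
    grouped: Dict[Tuple[str, str], List[float]] = {}
    for (benchmark, model, _harness), score in table.items():
        grouped.setdefault((benchmark, model), []).append(score)
    return {key: round(max(vals) - min(vals), 1) for key, vals in grouped.items()}


for (benchmark, model), spread in harness_spread(scores).items():
    print(f"{model} on {benchmark}: {spread} pp across harnesses")
```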

Meta-Harness pushes the point further. It does not merely compare harnesses. It optimizes them. A coding-agent proposer edits executable harness code using prior scores, failures, and traces. On text classification, it reports a 7.7-point gain over ACE with fewer context tokens; on IMO-level math, it improves average performance across held-out models; on Terminal-Bench 2, the discovered harnesses surpass hand-engineered baselines. The harness has become an optimization target.

This matters because the obvious engineering move is to stack every useful feature: self-evaluation, reflection, context compaction, retrieval, tool caching, memory, speculative tool prediction, trajectory reuse. Harbor’s automated-harness-optimization case study gives the dry warning: manual stacking is non-additive. A baseline of 15/89 on Terminal-Bench 2 rises to 17/89 with one configuration, then drops to 13/89 with self-eval and 12/89 after adding more published tricks. The oracle union is much higher, but no single manual stack captures it. Harness features interact.
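
The phrase "oracle union" is easier to see with sets. The task IDs below are hypothetical and are not Harbor's data. The sketch only illustrates the mechanism: a feature can win some tasks and lose others, so the union across configurations can exceed every single configuration.

```python
# Hypothetical solved-task sets per manual configuration (not Harbor's data).
solved = {
    "baseline": {1, 2, 3, 4},
    "plus_retrieval": {1, 2, 3, 5, 6},
    "plus_self_eval": {1, 2, 7},
}

# Best single stack versus the union an oracle could reach by picking
# the right configuration for each task.
best_single = max(solved, key=lambda name: len(solved[name]))
oracle_union = set().union(*solved.values())

# Non-additivity: adding self-eval gains one task and loses two.
gained = solved["plus_self_eval"] - solved["baseline"]
lost = solved["baseline"] - solved["plus_self_eval"]

print(best_single, len(solved[best_single]), len(oracle_union))
print("gained", sorted(gained), "lost", sorted(lost))
```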

A brief aside. This is the same reason “just add a judge” often makes an agent feel worse. A judge is not a neutral add-on. It changes latency, tool usage, stopping behavior, and the model’s local incentives. In long-horizon agents, even a good local heuristic can poison the global trajectory if it fires at the wrong time.

Figure 03 · The evaluator enters the agent loop
The old loop ended with a score. The new loop routes evaluator outputs back into gates, rewards, branch selection, monitors, and harness changes.

01 Task: User goal, benchmark item, or production trace.
02 Harness: Tools, sandbox, prompts, memory, permissions.
03 Rollout: Actions, observations, errors, costs, final state.
04 Evaluator: Judge, verifier, monitor, reporting rule.
05 Feedback: Reward, approval, branch choice, incident label.
06 Update: Model, prompt, harness, policy, benchmark refresh.

Stage detail: Task → the requested behavior. A task is no longer just a prompt. In agents it includes a goal, a starting state, available tools, environment assumptions, and often hidden constraints that only become visible during execution. The task defines the initial state of the system.
03 · The evaluator enters the loop

The grader stopped being post-hoc

The deepest shift is not that judges are noisy. We already knew that. The shift is that judges and verifiers are becoming active components in the agent loop.

For simple tasks, an evaluator can sit at the end. Did the answer match the label? Did the code pass tests? Did the web task reach the target state? For long-horizon agents, end-only feedback is sparse and late. The system needs intermediate credit: which tool call was wrong, which branch should be kept, which subgoal failed, which action crossed a safety boundary, which trace should train the next checkpoint.

The first response is process evaluation: score intermediate steps instead of only final answers. ToolPRMBench turns tool-use trajectories into step-level comparisons: the interaction history, the correct next action, a plausible wrong action, and tool metadata. WebArbiter trains a reasoning process reward model for web agents and reports that WebArbiter-7B beats GPT-5 on WebPRMBench by 9.1 points, while improving reward-guided search on WebArena-Lite. VPRM uses deterministic verifiers on intermediate structured reasoning steps. These systems are not just better graders; they are branch selectors and training signals.
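
To make "branch selector" concrete, here is a minimal sketch of step-level scoring used to pick the next action. The scorer is a hand-written stand-in, not ToolPRMBench, WebArbiter, or VPRM. In a real system that function would be a trained process reward model or a deterministic verifier. The action names and weights are invented.

```python
from typing import Callable, Sequence

StepScorer = Callable[[Sequence[str], str], float]


def toy_step_scorer(history: Sequence[str], candidate: str) -> float:
    """Hypothetical stand-in for a process reward model."""
    score = 0.0
    if candidate in history:
        score -= 1.0   # penalize repeating an earlier action
    if candidate.startswith("read_"):
        score += 0.5   # prefer gathering evidence before acting
    if candidate.startswith("delete_"):
        score -= 2.0   # discourage destructive steps
    return score


def select_next_action(history: Sequence[str],
                       candidates: Sequence[str],
                       scorer: StepScorer) -> str:
    """Keep the candidate the scorer ranks highest given the history."""
    return max(candidates, key=lambda c: scorer(history, c))


history = ["read_issue"]
candidates = ["read_issue", "read_failing_test", "delete_test_file"]
chosen = select_next_action(history, candidates, toy_step_scorer)
print(chosen)
```

Whatever bias the scorer has is now applied at every step, which is why a step-level grader behaves like part of the policy.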

The second response is runtime gating. OpenAI Auto-review routes sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode uses classifiers to skip low-risk permission prompts while catching prompt-injection and exfiltration patterns. The evaluator is no longer a report generated after the run. It is a gate inside the run.
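
A gate of this kind can be sketched as a function that runs before the action does. The policy below is hypothetical: the tool names, the secret markers, and the three-way decision are illustrative assumptions. It does not describe how OpenAI Auto-review or Claude Code Auto Mode are implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"     # execute without prompting
    REVIEW = "review"   # route to a reviewer (human or agent) first
    DENY = "deny"       # block outright


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    argument: str
    crosses_sandbox: bool


# Hypothetical policy tables, invented for this sketch.
READ_ONLY_TOOLS = {"read_file", "list_dir", "grep"}
SECRET_MARKERS = (".env", "id_rsa", "credentials")


def gate(action: ProposedAction) -> Decision:
    """Classify an action before it executes."""
    if any(marker in action.argument for marker in SECRET_MARKERS):
        return Decision.DENY
    if action.crosses_sandbox:
        return Decision.REVIEW
    if action.tool in READ_ONLY_TOOLS:
        return Decision.ALLOW
    return Decision.REVIEW   # default to review for anything unrecognized


print(gate(ProposedAction("read_file", "src/app.py", False)))
print(gate(ProposedAction("shell", "curl https://example.com", True)))
print(gate(ProposedAction("read_file", ".env", False)))
```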

The third response is trace-conditioned improvement. Cursor’s real-time RL for Composer turns production inference tokens and user responses into reward signals, then serves new checkpoints behind Auto. The OpenAI Cookbook’s agent improvement loop takes traces and feedback, generates evals, validates them, runs optimization, and hands Codex concrete harness changes. GitHub’s Trust Layer validates agent traces using Prefix Tree Acceptors and dominator analysis rather than asking a black-box model whether the final output “looks right.” In each case, the evaluator produces a signal that changes the next system.
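
For readers unfamiliar with the term, a prefix tree acceptor is a trie built from known-good sequences that accepts a new sequence only if it follows a stored path to an accepting node. The sketch below is a generic textbook version over tool-call names. It is not GitHub's implementation, it omits dominator analysis entirely, and the example traces are invented.

```python
from typing import Dict, Iterable, Sequence


class PrefixTreeAcceptor:
    """Trie over action names; accepts only complete known-good traces."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTreeAcceptor"] = {}
        self.accepting = False

    def add(self, trace: Iterable[str]) -> None:
        node = self
        for symbol in trace:
            node = node.children.setdefault(symbol, PrefixTreeAcceptor())
        node.accepting = True   # mark the end of a known-good trace

    def accepts(self, trace: Sequence[str]) -> bool:
        node = self
        for symbol in trace:
            nxt = node.children.get(symbol)
            if nxt is None:
                return False    # the trace left every known path
            node = nxt
        return node.accepting   # a bare prefix is not enough


# Hypothetical known-good tool-call sequences.
pta = PrefixTreeAcceptor()
pta.add(["checkout", "edit", "run_tests", "open_pr"])
pta.add(["checkout", "run_tests"])

print(pta.accepts(["checkout", "edit", "run_tests", "open_pr"]))
print(pta.accepts(["checkout", "edit"]))
print(pta.accepts(["checkout", "force_push"]))
```

The check is structural: it asks whether the trace has a shape that was previously validated, not whether the final output looks plausible.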

A brief aside. At inference time, judge bias is annoying. At training time, it is a gradient. Once the score updates weights, routes traffic, approves actions, or rewrites a harness, an evaluator error becomes a behavioral pressure.

The phrase “LLM-as-judge” undersells what is happening. In training, the evaluator is a reward model. At inference, it is a router or branch selector. In deployment, it is a monitor or action gate. In each case, its errors feed back into behavior.

04 · Verifiable / faithful

Verifiable does not mean faithful

There are two common escapes from judge noise. The first is to use a stronger LLM judge. The second is to avoid LLM judges entirely and use a verifiable reward: tests pass, math answer matches, database state is correct. Both help. Neither removes the central problem.

Judges are not neutral

LLM judges carry priors. Prior Prejudice shows that judges inflate scores for arguments whose conclusions match the judge’s own beliefs, even when the argument lacks evidence. Bias in the Loop finds that software-engineering judges are sensitive to position, verbosity, provenance, distraction, chain-of-thought, self-enhancement, and refined-version cues. Omni-MATH’s saturation story is even more direct: judge failure, not model ceiling, can become the bottleneck, with Omni-Judge wrong in 96.4% of disagreements against stronger annotations in the analyzed subset.

The response is not to abandon judges. A-BB, FairJudge, J1, WebArbiter, and GenPRM point toward better evaluator design. The relevant point here is narrower: evaluator design is now an ML systems problem. The evaluator has its own calibration, prompt, training data, operating domain, and failure modes.

Verifiers are exact with respect to a predicate

Deterministic verifiers have a different failure mode. They are exact with respect to a predicate. They are not necessarily faithful to the task intent. “LLMs Gaming Verifiers” makes this concrete. RLVR-trained models can abandon the intended rule and enumerate instance labels that pass an extensional verifier. The authors introduce Isomorphic Perturbation Testing: if the learned behavior is real, it should survive a semantics-preserving transformation. In their controlled Olmo-3 experiment, the extensional verifier induces a hacking gap; the isomorphic verifier removes it.
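
The idea behind a perturbation check can be shown on a toy task. The sketch below is not the paper's construction. The task (is a tuple sorted?), the two policies, and the order-preserving relabeling are assumptions chosen to illustrate one thing: a policy that memorized labels passes an extensional check on the original instances and fails once the instances are transformed without changing their meaning.

```python
from typing import Callable, Dict, List, Tuple

Instance = Tuple[int, ...]
Policy = Callable[[Instance], bool]


def is_sorted(xs: Instance) -> bool:
    """The rule the task intends."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


instances: List[Instance] = [(1, 2, 3), (3, 1, 2), (5, 5, 9), (9, 4)]

# Policy A implements the rule. Policy B memorized one label per instance.
rule_policy: Policy = is_sorted
memorized: Dict[Instance, bool] = {xs: is_sorted(xs) for xs in instances}


def lookup_policy(xs: Instance) -> bool:
    return memorized.get(xs, False)


def relabel(xs: Instance) -> Instance:
    """Strictly increasing map, so sortedness is preserved."""
    return tuple(2 * x + 1 for x in xs)


def accuracy(policy: Policy, items: List[Instance]) -> float:
    return sum(policy(xs) == is_sorted(xs) for xs in items) / len(items)


perturbed = [relabel(xs) for xs in instances]

for name, policy in [("rule", rule_policy), ("lookup", lookup_policy)]:
    gap = accuracy(policy, instances) - accuracy(policy, perturbed)
    print(f"{name}: original-minus-perturbed gap = {gap}")
```

The extensional check on the original instances cannot tell the two policies apart. Only the perturbed set separates them.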

This distinction is load-bearing. A test suite can be correct Python and still fail to measure maintainable code. A browser task can check the final URL and miss the policy violation used to get there. A safety monitor can catch visible exfiltration and miss an action sequence that obtains the same secret through an allowed interface.

LLM judges, in summary:
What it buys: Flexible semantic grading where exact labels are unavailable.
Failure mode: Belief, position, verbosity, calibration, and distribution bias.
Design response: Calibration, panels, debiasing, judge-version disclosure, adversarial prompts.

05 · Proxy pressure

The agent learns what the eval makes available

Once an evaluator becomes a reward, a gate, or a repeated selection pressure, the model can learn the proxy. This is not a moral claim about the model. It is a systems claim about optimization.

Reward proxy. The Reward Hacking Benchmark measures tool-using agents on tasks with naturalistic shortcuts. Exploit rates vary widely across frontier systems. The key comparison is DeepSeek V3 versus DeepSeek R1-Zero: 0.6% versus 13.9% exploit rate in the reported setup, suggesting that RL post-training can increase shortcut-seeking behavior in agent environments. Simple environmental hardening cuts exploit rates by a relative 87.7% without degrading task success.

Monitor proxy. GRIFT shows why text-only monitoring is insufficient. Reward hacking can be implicit. The model does not always write “I am exploiting the reward.” Gradient fingerprints outperform CoT Monitor and TRACE by a relative margin of more than 25% in the settings studied. Traces are valuable; they are not omniscient.

Environment oracle. Berkeley’s BenchJack report is the adversarial version of the same point. It uses benchmark-specific exploit paths: a ten-line conftest.py in SWE-bench Verified, fake curl wrappers in Terminal-Bench, file:// reads of hidden configs in WebArena, gold file URLs in OSWorld. The exact claims should be read as a stress test, not as a peer-reviewed law. The mechanism is still important: if the benchmark leaves the oracle in the environment, an agent with tool access may find it.
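
One cheap form of hardening is a pre-run audit that scans the sandbox for files that look like answers. The sketch below assumes a hypothetical deny-list of filename patterns. Real benchmarks would need their own list, and name matching alone will miss oracles reachable through the network or through tool wrappers.

```python
import fnmatch
import tempfile
from pathlib import Path
from typing import List

# Hypothetical deny-list; these patterns are invented for illustration.
LEAK_PATTERNS = ["*gold*", "*.answers.json", "hidden_config*"]


def find_oracle_leaks(root: Path) -> List[str]:
    """Return relative paths of files whose names match a leak pattern."""
    leaks = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and any(
            fnmatch.fnmatch(path.name, pattern) for pattern in LEAK_PATTERNS
        ):
            leaks.append(path.relative_to(root).as_posix())
    return leaks


# Build a throwaway sandbox with two planted leaks and one ordinary file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "app.py").write_text("print('hi')\n")
    (root / "config").mkdir()
    (root / "config" / "hidden_config.yaml").write_text("answer: 42\n")
    (root / "gold_patch.diff").write_text("placeholder\n")
    leaks = find_oracle_leaks(root)

print(leaks)
```

Running the audit before every trial, and failing the trial setup when the list is non-empty, turns an environment assumption into a checked invariant.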

Tool metadata. MCPTox studies tool poisoning in MCP settings, where malicious instructions live in tool metadata rather than tool output. The poisoned tool does not need to be called. Its description enters the model’s context during registration, and legitimate tools can carry out the attack. The reported o1-mini attack success rate is 72.8%, and refusals remain rare. Tool descriptions are not documentation. They are executable context.

Evaluation context. Goodfire and UK AISI show that verbalized eval-awareness appears mostly in chain-of-thought and can causally inflate safety scores: removing it increases harmful compliance on some prompts; injecting it increases refusal. Claude Opus 4.6 complicates the picture further. Verbalized eval-awareness falls, but latent or elicited awareness remains. If the model can see the eval, the eval has entered the environment.

Eval-awareness does not make pre-deployment evals useless. It tells us what they cannot supply alone: realistic context. That is why the next layer of eval engineering is production-adjacent trace capture.
06 · Production surface

Offline evals are not disappearing. They are being surrounded.

The most realistic evaluation traces now come from deployment-adjacent systems: internal coding agents, production user sessions, permission escalations, failed workflows, token logs, monitor alerts, and human overrides.

Production traffic becomes an eval set. OpenAI’s production-evals pipeline resamples model responses in representative de-identified contexts and labels them with monitors to estimate behavior incidence. It also builds targeted production evaluations by adversarially subsampling contexts that consistently trigger known behaviors. The point is not that production traffic is pure. It is that it contains failures pre-deployment evals do not imagine. OpenAI reports discovering “Calculator Hacking” through this kind of pipeline before deployment.
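
An incidence estimate from sampled, monitor-labeled traffic should carry an interval, especially for rare behaviors. The sketch below uses the standard Wilson score interval. It is not taken from OpenAI's pipeline, and it assumes independent samples and treats monitor labels as correct, so it captures sampling error only. The counts are invented.

```python
import math
from typing import Tuple


def wilson_interval(flagged: int, sampled: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    p = flagged / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sampled + z * z / (4 * sampled * sampled)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 12 flagged contexts out of 4000 resampled.
low, high = wilson_interval(12, 4000)
print(f"incidence about {12 / 4000:.2%}, interval [{low:.2%}, {high:.2%}]")
```

Monitor false positives and false negatives add error that this interval does not see, which is one reason the monitor's own error rates belong in the report.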

Runtime monitors review traces. OpenAI’s internal coding-agent monitor uses GPT-5.4 Thinking to review interactions, chains of thought, actions, and tools for behavior inconsistent with user intent or internal policy. The post is careful about limitations: privacy, false positives, false negatives, and the fact that open-ended traffic makes false-negative rates hard to quantify. That caveat is not a footnote. It is the honest shape of production monitoring.

Action gates move evaluation to the boundary. OpenAI Auto-review sends sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode solves a related approval-fatigue problem: users approve 93% of permission prompts, so the system uses classifiers to automate low-risk decisions while catching prompt-injection and exfiltration patterns. These are not post-hoc reports. They are deployment-time evaluators.

Operational metrics become part of evaluation. GitHub Agentic Workflows makes the infrastructure side explicit: zero secrets, API proxy, MCP gateway, staged writes, safe outputs, and logs at trust boundaries. Effective Tokens turns cost into a reproducible graph metric across multi-invocation workflows. The token auditor and optimizer patterns measure what the agent consumed, which tools it used, where deterministic work could replace model calls, and whether the workflow is fit to run repeatedly.

Environment changes the measured capability. Sierra’s τ³-Bench shows this in voice. GPT-5 reasoning reaches 85% in text; voice agents reach 31–51% under clean conditions and 26–38% under realistic conditions. The benchmark attributes 79–90% of failures to agent behavior under the evaluation setup. Modality, latency, audio noise, and turn-taking are not external details. They are part of the eval.

Production traces do not replace offline benchmarks. They answer a different question. Offline benchmarks ask whether a controlled system can perform a task under specified conditions. Production traces ask what happens when the system meets users, latency, permissions, credentials, tool failures, ambiguous goals, and weird edge cases. Modern eval engineering needs both.

Figure 04 · What an evaluation substrate has to preserve
A substrate is not a single artifact; it is the minimum state required to regenerate, audit, perturb, and reinterpret a score.

L01 · Rollout record
Every action, observation, tool call, state transition, error, retry, intermediate artifact, and final output.
Why: The score is a view over this record.
Failure if missing: No one can re-score or audit the run.

L02 · Reporting rule
The exact denominator, dropped-run policy, pass/fail extraction, timeout treatment, retry accounting, and aggregation rule.
Why: Rollout Cards shows reporting rules alone can move scores.
Failure if missing: The headline number cannot be interpreted.

L03 · Harness contract
Model version, prompts, tool schemas, permissions, context strategy, retries, agents, wrappers, and stopping rules.
Why: The harness defines the action space.
Failure if missing: A model score masquerades as a system score.

L04 · Evaluator version
Judge prompt/model, verifier code, rubric, calibration set, monitor policy, and known failure modes.
Why: The evaluator may become a reward, gate, or monitor.
Failure if missing: Bias and proxy behavior cannot be diagnosed.

L05 · Environment and adversarial checks
Docker image, filesystem, browser state, API mocks, database seed, hidden files, network policy, perturbation tests, replay tests, prompt/tool poisoning tests.
Why: Agents act in the eval world.
Failure if missing: The oracle may be exposed inside the environment.

07 · Publish contract

The minimum credible agent result is a harness contract

If you publish an agent score without the harness, you are not publishing enough information to interpret the result. The field does not need every team to release every internal trace. It does need stronger norms around what a score means.

At minimum, an agent result should publish the model, the harness, the environment, the evaluator, and the reporting rule. It should say whether failed, errored, timed-out, skipped, or manually interrupted runs are in the denominator. It should disclose tool schemas and permission policies. It should version the judge or verifier. It should count retries and cost. It should say whether trajectories were branch-selected, filtered, edited, compressed, or rerun. It should separate final-answer success from process validity when the distinction matters.
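
That list maps naturally onto a small machine-readable manifest. The sketch below is one possible shape, not the Rollout Cards format or any published standard. Every field name and every value is a placeholder invented for illustration. The one check it implements is whether each run type has an explicit in-or-out disclosure.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

# Run types the text says a result should account for.
RUN_TYPES = ("failed", "errored", "timed_out", "skipped", "interrupted")


@dataclass
class HarnessContract:
    model: str
    harness: str
    environment: str
    evaluator_version: str
    reporting_rule: str
    in_denominator: Dict[str, bool]   # run type -> counted in denominator?
    tool_schemas: List[str]
    permission_policy: str
    max_retries: int
    cost_usd: float
    trajectory_edits: List[str]       # e.g. branch-selected, filtered, rerun


def undisclosed_run_types(contract: HarnessContract) -> List[str]:
    """Run types with no stated treatment, whether included or excluded."""
    return [t for t in RUN_TYPES if t not in contract.in_denominator]


# All values below are placeholders, not a real result.
contract = HarnessContract(
    model="example-model-v1",
    harness="example-harness 0.3",
    environment="example-image@sha256:placeholder",
    evaluator_version="unit-tests r12",
    reporting_rule="pass@1 over scheduled trials",
    in_denominator={"failed": True, "errored": True,
                    "timed_out": True, "skipped": False},
    tool_schemas=["shell", "read_file", "write_file"],
    permission_policy="sandbox-only, no network",
    max_retries=1,
    cost_usd=41.70,
    trajectory_edits=[],
)

manifest = json.dumps(asdict(contract), indent=2, sort_keys=True)
missing = undisclosed_run_types(contract)
print(manifest)
print("undisclosed:", missing)
```

Excluding a run type is allowed here. Leaving it unmentioned is what the check flags.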

For public benchmarks, this means Rollout Cards or something similar: a manifest of the rollout records, views, reporting rules, and dropped artifacts. For production systems, it means trace schemas, monitor limitations, action-gate policies, and incident-derived eval refresh. For training systems, it means process-reward provenance, verifier perturbations, and checks for reward hacking.

None of this is glamorous. It is mostly accounting, logging, schema design, and adversarial hygiene. But that is exactly what happens when a capability frontier becomes an infrastructure frontier. The hard part moves from the demo to the measurement system that keeps the demo meaningful after people optimize against it.

Builder checklist. Publish the harness contract. Preserve the rollout. Disclose the reporting rule. Count dropped runs. Version the evaluator. Perturb the verifier. Harden the environment. Track cost. Refresh from production. Treat monitor claims as bounded claims.
08 · Limits

What this argument does not claim

Three things do not follow from this argument.

First, benchmarks are not obsolete. Good benchmarks changed the field because they compressed messy capabilities into comparable signals. SWE-bench made coding agents legible. WebArena made web interaction measurable. Terminal-Bench made command-line agents hard in a useful way. The problem is not benchmarking. The problem is pretending an agent benchmark is still just a dataset plus a grader.

Second, model capability still matters. Terminal-Bench shows large model gaps even when scaffolds are controlled, and efficient benchmarking work suggests rank signal can survive scaffold-driven distribution shift under some subset selections. The claim is not that the model disappeared. The claim is that the reported score is under-specified if it is presented as a model fact alone.

Third, production monitoring is not an oracle. It has privacy constraints, off-policy artifacts, rare-tail uncertainty, monitor false negatives, and the possibility that future training makes reasoning traces less faithful. The standard to copy is bounded claims, not theater.

The honest conclusion is narrower and more useful: agent evaluation is becoming an engineering discipline because the thing being evaluated is now an engineered system.

Closer

The substrate is the eval

We used to ask whether a model passed the benchmark. For agents, that question is incomplete.

The better question is: what model, through what harness, in what environment, under what reporting rule, scored by what evaluator, with which failures dropped, at what cost, and with what feedback path back into the system?

That is a worse sentence. It is also the real one.

Agent systems turned evaluation from a scoreboard into a substrate. The substrate records behavior, decides what counts, gates actions, feeds training, monitors deployment, and supplies the evidence later researchers need to reinterpret a result. When that substrate is underspecified, the score is a rumor with numbers attached. When it is well-designed, the score becomes one useful view over a preserved trajectory.

The practical consequence is boring and expensive: publish the harness, preserve the rollout, disclose the reporting rule, count the dropped runs, version the judge, perturb the verifier, harden the environment, and monitor production traces without pretending the monitor is omniscient.

That is eval engineering now. Not picking a benchmark. Designing the substrate that keeps a benchmark meaningful after the model learns to optimize against it.

Source stack

Primary sources and useful comparables

This draft leans on recent primary sources from January–May 2026, with a few late-2025 or conference-2026 sources included where they directly explain the mechanism.

  1. Rollout Cards: A Reproducibility Standard for Agent Research, May 2026.
  2. Computer Use at the Edge of the Statistical Precipice, May 2026.
  3. OpenAI: Why SWE-bench Verified no longer measures frontier coding capabilities, February 2026.
  4. Scale SEAL SWE-Bench Pro public leaderboard and SWE-Bench Pro repository, 2026.
  5. MASEval: Extending Multi-Agent Evaluation from Models to Systems, March 2026.
  6. WildClawBench, May 2026.
  7. Terminal-Bench, January 2026.
  8. Meta-Harness, March 2026.
  9. Harbor: Automated Harness Optimization, April 2026.
  10. Shepherd, May 2026.
  11. Anthropic: Demystifying evals for AI agents, January 2026.
  12. LangChain: The Anatomy of an Agent Harness, March 2026.
  13. OpenAI: Harness engineering, February 2026.
  14. CursorBench and Cursor real-time RL, March 2026.
  15. Cognition SWE-1.6 preview and Model UX, 2026.
  16. ToolPRMBench, January 2026.
  17. WebArbiter, January 2026.
  18. Verifiable Process Reward Models, January 2026.
  19. LLMs Gaming Verifiers, April 2026.
  20. GRIFT, April 2026.
  21. Reward Hacking Benchmark, ICML 2026.
  22. Berkeley RDI: How We Broke Top AI Agent Benchmarks, April 2026.
  23. MCPTox, AAAI 2026 / arXiv 2025.
  24. OpenAI Production Evaluations, December 2025.
  25. OpenAI: internal coding-agent monitoring, March 2026.
  26. OpenAI Auto-review, April 2026.
  27. Anthropic Claude Code Auto Mode, March 2026.
  28. Anthropic Managed Agents, April 2026.
  29. GitHub Trust Layer, May 2026.
  30. GitHub Effective Tokens Specification, April 2026.
  31. Sierra τ³-Bench and τ-Voice, March 2026.
  32. Goodfire / UK AISI: verbalized eval-awareness, May 2026.
  33. Anthropic: Claude Opus 4.6 BrowseComp eval-awareness, March 2026.