Evaluation engineering · May 2026

Trajectory-first evals

Agent evaluation is moving from scorekeeping to trajectory-first measurement. A benchmark score is not a fact about a model by itself. It is a view over a rollout: the context the agent saw, the actions it took, the state it changed, the failures that were dropped or counted, and the reporting rule that turned that evidence into a number.

№ 03 · Topic: agents / evals / traces · Mechanism: rollout → view → replay

Opening

A score is a view over a rollout

The old unit of evaluation was a response. The new unit is a rollout. Trajectory-first evals make that object explicit.

That sounds like a bookkeeping change until you inspect what a rollout contains: tool calls, filesystem state, browser state, retries, permission prompts, skipped runs, cached tokens, judge prompts, failed tests, intermediate screenshots, final patches, and a reporting rule that decides which of those facts count.

In May, the Rollout Cards authors audited 50 popular agent training and evaluation repositories. None reported failed, errored, or skipped rollouts alongside headline scores. The authors found 37 reporting-rule discrepancies. When they re-graded fixed artifacts under different but plausible rules, reported scores moved by up to 20.9 pp and model rankings could invert.

The behavior did not change. The view did.

Working definition. A rollout is the recorded sequence of actions, observations, tool outputs, state transitions, errors, retries, and final outcomes produced by a harnessed policy in an environment.

That distinction matters because agent benchmarks are no longer simple question-answer sets. They are runtime systems. They define what the model can see, what tools it can call, what state it can mutate, how many times it may retry, what counts as a crash, how the final answer is extracted, and which grader decides whether the task succeeded.

On static computer-use benchmarks, this produces a stranger failure. Computer Use at the Edge of the Statistical Precipice showed that a 1 MB script that never looks at the screen can beat frontier models on OSWorld and MobileWorld by replaying successful action sequences. It scores 71.1% on OSWorld against a frontier reference of 70.6%, and 41.5% on MobileWorld against 32.7%. On DigiWorld, a benchmark built to resist this static replay effect, the same script collapses to 6.9%.

The lesson is not that computer-use benchmarks are useless. The lesson is more precise: a deterministic, static environment can accidentally measure trajectory replayability instead of robust interactive capability. Benchmark design has become environment design.
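The replay effect can be reproduced in miniature. The sketch below is a toy, not the paper's script or any real benchmark: `layout`, `replay_agent`, and `observing_agent` are hypothetical names, and the "screen" is reduced to a single integer per step. It only illustrates why a blind replay succeeds when the environment is identical every episode and collapses when the layout varies.

```python
import random

STEPS, BUTTONS = 5, 4

def layout(static: bool, seed: int) -> list[int]:
    """Correct button index per step. A static env ignores the episode seed."""
    rng = random.Random(0 if static else seed)
    return [rng.randrange(BUTTONS) for _ in range(STEPS)]

# One successful trajectory, recorded once in the static environment.
RECORDED = layout(static=True, seed=0)

def replay_agent(step: int, observation: int) -> int:
    """Replays the recorded click for this step; never reads the observation."""
    return RECORDED[step]

def observing_agent(step: int, observation: int) -> int:
    """Reads the (toy) screen and clicks what it shows."""
    return observation

def success_rate(agent, static: bool, episodes: int = 200) -> float:
    """Fraction of episodes where every step hits the correct button."""
    wins = 0
    for seed in range(1, episodes + 1):
        targets = layout(static, seed)
        wins += all(agent(i, t) == t for i, t in enumerate(targets))
    return wins / episodes
```

Comparing `success_rate(replay_agent, static=True)` with `success_rate(replay_agent, static=False)` separates replayability from interaction: only the second number says anything about whether the agent uses what it observes.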

We used to evaluate model outputs. Agents force us to preserve rollouts. Those rollouts become views, replay targets, monitor inputs, training data, and eventually the substrate the next agent learns from.

Figure 01

Non-model changes that move agent evaluation results

These are deliberately heterogeneous: reporting-rule effects, framework effects, grader-validity gaps, and scaffold differences. Each is a case where a headline number changes without a clean story that begins and ends at “the model got better.”

MASEval · framework effect · same model/task family · 30.9 pp

Claude Haiku 4.5 on MACS Travel: smolagents 90.4% vs LlamaIndex 59.5%. The framework is not decoration; it changes measured system behavior.

METR / SWE-bench PR review · validity gap · automated grader vs maintainer · 24.2 pp

Automated tests accepted substantially more patches than real maintainers would merge. The evaluator measured test-passing, not necessarily mergeable engineering work.

Rollout Cards · view effect · same evidence, different reporting rule · 20.9 pp

Fixed artifacts re-scored under different plausible rules shifted scores and could invert model rankings. The view moved; the behavior did not.

WildClawBench · harness effect · same model/tasks · 18 pp

Long-horizon CLI tasks show harness-dependent performance. The action surface and runtime wrapper are part of the measured system.

Terminal-Bench 2 · scaffold effect · same model, different agent wrapper · 8.9 pp

A useful counterweight: scaffold matters, but model capability still matters. The point is joint measurement, not “all harness.”

01 · Evaluation data model

From scores to rollout views

To make the rest of the argument precise, separate five objects that often get compressed into the word “benchmark.” A response is what a model says. A trajectory is the sequence of steps it takes. A rollout record is the preserved evidence from that trajectory: observations, actions, tool calls, state changes, retries, crashes, costs, and final artifacts. A view is the reporting rule that selects which parts of the record become a score. A drops manifest is the missing negative space: which fields, failures, rows, and structures the reported view discarded.

This distinction is not semantic hygiene. It changes what can be audited. A transcript can tell you what the agent claimed. A rollout record can tell you what the agent changed. A view can tell you why the same evidence was reported as success, failure, partial progress, or invalid. A drops manifest tells you what the score no longer knows.
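A minimal sketch of those five objects as Python types may make the separation concrete. Every class and field name here is an illustrative assumption, not a schema taken from Rollout Cards or any benchmark; the point is only that the view and the drops manifest are computed from the record rather than stored in place of it.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass(frozen=True)
class Step:
    action: str                    # what the agent did (tool call, edit, click)
    observation: str               # what the environment returned
    error: Optional[str] = None    # crash or tool failure, if any

@dataclass(frozen=True)
class RolloutRecord:
    task_id: str
    steps: tuple[Step, ...]        # the trajectory
    response: str                  # what the model finally said
    status: str                    # "success", "failure", "errored", or "skipped"

@dataclass
class ViewResult:
    score: float
    counted: list[str] = field(default_factory=list)
    dropped: list[str] = field(default_factory=list)   # the drops manifest

def apply_view(records: list[RolloutRecord],
               include: Callable[[RolloutRecord], bool]) -> ViewResult:
    """A view is a reporting rule: it picks a denominator and counts successes."""
    counted = [r for r in records if include(r)]
    dropped = [r.task_id for r in records if not include(r)]
    wins = sum(r.status == "success" for r in counted)
    score = wins / len(counted) if counted else 0.0
    return ViewResult(score, [r.task_id for r in counted], dropped)

# Hypothetical records: one clean success and one run that errored.
RECORDS = [
    RolloutRecord("t1", (Step("run tests", "ok"),), "done", "success"),
    RolloutRecord("t2", (Step("run tests", "", error="timeout"),), "", "errored"),
]
HEADLINE = apply_view(RECORDS, lambda r: r.status in ("success", "failure"))
STRICT = apply_view(RECORDS, lambda r: True)
```

Here `HEADLINE` and `STRICT` are computed from the same two records. They disagree only because one rule lets the errored run leave the denominator, and `HEADLINE.dropped` is the part of the evidence the headline score no longer knows about.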

Substrate. In this piece, substrate means the machinery that keeps those objects reproducible: environment, harness, trace record, evaluator, reporting rule, and feedback path.

OpenAI’s February post on why SWE-bench Verified no longer measures frontier coding capabilities is a concrete example. OpenAI audited a 27.6% subset of often-failed Verified tasks and found that at least 59.4% had flawed tests rejecting correct solutions. It also found that frontier models could reproduce gold patches or task-specific details for some problems. The reported score was no longer stable evidence of frontier coding ability because the benchmark had both scoring-rule problems and contamination problems.

SWE-bench Pro is one response. DigiWorld is another. Rollout Cards is a third. They are different projects, but they share a premise: when the task is interactive, the benchmark has to specify the environment and preserve enough of the trajectory for later inspection.

Rollouts as a structured artifact are also being mapped on the training side. The Generate, Filter, Control, Replay survey decomposes the lifecycle of an RL rollout pipeline into four stages: generation produces the trajectory, filtering decides which trajectories propagate, control intervenes mid-rollout to steer, and replay reuses past trajectories for training and audit. The framing is about post-training, but the object the four stages operate over is the same object an eval substrate has to preserve. Trajectory-first evals are the readout side. Trajectory-first training is the optimization side. The discipline that ships first will pull the other along.

Figure 01b · rollout lifecycle

The trajectory as a four-stage object

After Surana et al., 2026. Training and evaluation read off the same artifact through the same four stages. Different consumers, same lifecycle.

Harbor’s Agent Trajectory Interchange Format is the schema-shaped version of the same claim. ATIF records messages, reasoning content, tool calls, observations, token IDs, costs, subagent references, and context-management boundaries. One small design choice carries the whole argument: if a system compacts or replaces context, the trace has to say where that boundary occurred. Otherwise a later evaluator reconstructs an input window the agent never actually saw.
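The compaction-boundary point can be shown with a toy trace. The sketch below does not use the ATIF schema; `Event`, its `kind` values, and both reconstruction functions are hypothetical. It contrasts the window the agent actually saw with the window a later evaluator would rebuild if the boundary had not been recorded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str   # "message", "tool_output", or "compaction" (hypothetical kinds)
    text: str   # content; for a compaction, the summary that replaced history

def visible_context(events: list[Event], upto: int) -> list[str]:
    """Rebuild the input window the agent saw just before event index `upto`."""
    window: list[str] = []
    for event in events[:upto]:
        if event.kind == "compaction":
            window = [event.text]      # earlier content was replaced by the summary
        else:
            window.append(event.text)
    return window

def naive_context(events: list[Event], upto: int) -> list[str]:
    """What gets reconstructed when the trace does not mark the boundary."""
    return [e.text for e in events[:upto] if e.kind != "compaction"]

TRACE = [
    Event("message", "fix the failing test"),
    Event("tool_output", "AssertionError: expected 3, got 4"),
    Event("compaction", "summary: one test fails in utils"),
    Event("tool_output", "edited utils.py"),
]
```

With this trace, `visible_context(TRACE, 4)` contains the summary but not the original assertion error, while `naive_context(TRACE, 4)` contains the error the agent could no longer see. An evaluator working from the second window would be grading a different input.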

Benchmark contamination is usually discussed as a data problem: the model saw the answer. For agents, contamination can become an environment problem. If the benchmark leaves gold files, deterministic UI paths, hidden configs, or test oracles inside the runtime, the eval is not just in the training data. It is in the world.

Figure 02 · illustrative view lab

Same rollout, different reporting view

This is an illustrative toy rollout, not an empirical result. The underlying evidence stays fixed while the view, denominator, and interpretation change.

View: illustrative headline, 7/11. Count completed trials with final state success. Crashes, skips, and policy violations sit outside the denominator.

Fixed rollout evidence, and how this view treats each part:

Final success: accepted

Errored runs: dropped

Skipped trials: dropped

Policy violation: ignored

Cost/time: logged

Replay path: untested

Tool calls: recorded

Partial progress: ignored
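The same toy can be written as code. In the sketch below only the 7/11 headline comes from the figure; the counts of errored runs, skipped trials, and policy violations are invented for illustration, and the three view functions are hypothetical reporting rules rather than rules from any specific benchmark.

```python
# Hypothetical fixed evidence: one dict per scheduled trial.
TRIALS = (
    [{"status": "success", "violation": False}] * 6
    + [{"status": "success", "violation": True}]      # succeeded, but broke policy
    + [{"status": "failure", "violation": False}] * 4
    + [{"status": "errored", "violation": False}] * 3
    + [{"status": "skipped", "violation": False}] * 2
)

def headline(trials):
    """Completed trials only; errors and skips leave the denominator."""
    completed = [t for t in trials if t["status"] in ("success", "failure")]
    wins = sum(t["status"] == "success" for t in completed)
    return wins, len(completed)

def all_attempted(trials):
    """Every scheduled trial counts; errors and skips are failures."""
    wins = sum(t["status"] == "success" for t in trials)
    return wins, len(trials)

def policy_clean(trials):
    """Like all_attempted, but a success with a policy violation is a failure."""
    wins = sum(t["status"] == "success" and not t["violation"] for t in trials)
    return wins, len(trials)

VIEWS = {"headline": headline, "all_attempted": all_attempted, "policy_clean": policy_clean}
REPORT = {name: view(TRIALS) for name, view in VIEWS.items()}
```

Under these assumed counts the same sixteen trials report as 7/11, 7/16, or 6/16. Nothing about the agent changed between the three numbers.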

02 · The harness is measured

The harness is not wrapper code

A harness is often described as scaffolding: system prompts, tools, sandboxes, context managers, browser adapters, permission policies, retry loops, graders, and logging. That description is accurate but too mild. For agents, the harness defines the action space. It says what the model can perceive, what it can change, what it can ask for, what it can recover from, and what evidence survives the run.

LangChain’s blunt version is useful: if you are not the model, you are the harness. Anthropic’s agent-eval posts break the same surface into tasks, trials, graders, tool loops, and agent harnesses. OpenAI’s harness-engineering post frames Codex performance in terms of interfaces around the model. The shared observation is simple: once the model acts through tools, the wrapper becomes part of the policy.

Why this matters. A model without the harness cannot run the benchmark. The harness supplies perception, actions, memory, permissions, stopping rules, and evidence retention.

The under-discussed harness knob is context assembly. Two systems can run the same model, same task, and same nominal tool set while giving the model different evidence: raw traces versus compacted summaries, retrieved files versus selected snippets, full error outputs versus cleaned retries, a flat tool list versus a masked action subset. That is not prompt hygiene. It changes the state the policy conditions on.

Chroma’s Context Rot makes the measurement problem visible: model performance becomes less reliable as input length grows even on controlled tasks, which means a larger context window is not a neutral way to “show the model everything.” The production response is not simply to stuff less into the prompt. Manus treats the filesystem as restorable context and keeps wrong turns in the trace so the agent can recover from them. Chroma Context-1 lets a search agent prune chunks from its visible context while preserving the full unpruned trajectory for reward computation. Cognition’s multi-agent note says the quiet part directly: share full agent traces, not just individual messages.

Anthropic’s formulation — find the smallest high-signal set of tokens that improves the desired outcome — and LangChain’s write/select/compress/isolate taxonomy are useful as engineering guidance. For evaluation, they imply a stricter norm: the context policy has to be logged and versioned. If the model succeeds because a retrieved file was always pinned, if it fails because compaction dropped an error, or if it avoids a loop because the previous wrong turn remained visible, that is part of the measured system.

Recent context-policy work makes this experimental rather than philosophical. DT-MDP-CE treats context selection as a policy learned from historical trajectories in an SRE diagnosis setting. JetBrains’ context-management study compares raw traces, observation masking, summarization, and hybrids as explicit strategies. The useful result class is not “shorter context is better” or “longer context is better.” It is that the context policy is an independent variable with quality, cost, and failure-mode effects.

Context policy is evaluation policy. If one harness preserves failed commands, stack traces, retrieved documents, and branch decisions while another compacts them into a clean summary, they have not run the same eval. They have changed the evidence channel.
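One way to treat the context policy as a logged, versioned variable is sketched below. The two policies, the `fingerprint` helper, and the `RUN_LOG` layout are assumptions made for illustration, not the method of any harness named above. The masking policy loosely mirrors the observation-masking idea: keep actions, hide older observations.

```python
import hashlib
import json

HISTORY = [
    {"role": "action", "text": "pytest -q"},
    {"role": "observation", "text": "FAILED test_utils.py::test_add"},
    {"role": "action", "text": "edit utils.py"},
    {"role": "observation", "text": "1 passed"},
]

def raw_policy(history):
    """Show the model everything that happened."""
    return list(history)

def mask_old_observations(history, keep_last=1):
    """Keep all actions, but blank every observation except the last `keep_last`."""
    obs = [i for i, h in enumerate(history) if h["role"] == "observation"]
    hidden = set(obs[:-keep_last]) if keep_last else set(obs)
    return [{**h, "text": "[masked]"} if i in hidden else h
            for i, h in enumerate(history)]

def fingerprint(name, params):
    """Stable short id for a policy name plus its parameters."""
    blob = json.dumps({"name": name, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]

# Log the policy id next to what the model was actually shown.
RUN_LOG = {
    "raw": {"policy_id": fingerprint("raw", {}),
            "visible": raw_policy(HISTORY)},
    "masked": {"policy_id": fingerprint("mask_old_observations", {"keep_last": 1}),
               "visible": mask_old_observations(HISTORY, keep_last=1)},
}
```

In the masked run the failing-test output is no longer visible to the model. Recording `policy_id` beside the score is what lets a later reader tell that the two runs conditioned on different evidence.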

MASEval makes this empirical. It evaluates multi-agent systems across models, frameworks, and benchmarks, then argues that the system, not the model, is the right unit of analysis. In one salient case, Claude Haiku 4.5 on MACS Travel scores 90.4% with smolagents and 59.5% with LlamaIndex. That is a 30.9 point swing without changing the underlying model.

WildClawBench shows a similar effect in long-horizon CLI tasks. It uses human-authored bilingual tasks, real tools, Docker runtimes, and hybrid grading. The top model reaches only 62.2%. More importantly for this argument, switching the harness alone shifts a single model’s score by up to 18 points.

Terminal-Bench 2 is a useful counterweight. It has 89 hard terminal tasks and more than 32,000 trials. It shows substantial scaffold effects: GPT-5.2 with Codex CLI reaches 62.9%, while the same model with Terminus 2 is reported at 54.0%. But it also shows that model capability still matters. Same-harness model gaps remain large. The right conclusion is not “all harness, no model.” The right conclusion is that an agent score is a joint property of model, harness, environment, evaluator, and reporting rule.

The model still matters. But the score is not a measurement of the model alone.

Meta-Harness pushes the point further. It does not merely compare harnesses. It optimizes them. A coding-agent proposer edits executable harness code using prior scores, failures, and traces. On text classification it reports a 7.7 point gain over ACE with fewer context tokens; on IMO-level math it improves average performance across held-out models; on Terminal-Bench 2, discovered harnesses surpass hand-engineered baselines. The harness has become an optimization target.

This matters because the obvious engineering move is to stack every useful feature: self-evaluation, reflection, context compaction, retrieval, tool caching, memory, speculative tool prediction, trajectory reuse. Harbor’s automated-harness-optimization case study gives the dry warning: manual stacking is non-additive. A baseline of 15/89 on Terminal-Bench 2 rises to 17/89 with one configuration, then drops to 13/89 with self-eval and 12/89 after adding more published tricks. The oracle union is much higher, but no single manual stack captures it. Harness features interact.
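The arithmetic behind that warning is short enough to write out. The solved counts are the ones quoted above; the dictionary labels are shorthand for this sketch, not names used in the case study.

```python
TOTAL_TASKS = 89

# Solved-task counts as quoted in the paragraph above.
SOLVED = {
    "baseline": 15,
    "one configuration": 17,
    "+ self-eval": 13,
    "+ more published tricks": 12,
}

def rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate in percent, rounded to one decimal."""
    return round(100 * solved / total, 1)

def delta_pp(solved: int, baseline: int, total: int = TOTAL_TASKS) -> float:
    """Change against the baseline in percentage points, from raw counts."""
    return round(100 * (solved - baseline) / total, 1)

RATES = {name: rate(n) for name, n in SOLVED.items()}
DELTAS = {name: delta_pp(n, SOLVED["baseline"]) for name, n in SOLVED.items()}
```

`RATES` comes out to 16.9, 19.1, 14.6, and 13.5 percent, and `DELTAS` to 0.0, +2.2, -2.2, and -3.4 pp. If the features were additive, each step would move the number in the same direction; here the later additions end below the baseline.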

A brief aside — this is the same reason “just add a judge” often makes an agent feel worse. A judge is not a neutral add-on. It changes latency, tool usage, stopping behavior, and the model’s local incentives. In long-horizon agents, even a good local heuristic can poison the global trajectory if it fires at the wrong time.

Figure 03

The evaluator enters the agent loop

The old loop ended with a score. The new loop routes evaluator outputs back into gates, rewards, branch selection, monitors, and harness changes.

Task: User goal, benchmark item, or production trace.

Harness: Tools, sandbox, prompts, memory, permissions.

Rollout: Actions, observations, errors, costs, final state.

Evaluator: Judge, verifier, monitor, reporting rule.

Feedback: Reward, approval, branch choice, incident label.

Update: Model, prompt, harness, policy, benchmark refresh.

Task → the requested behavior

A task is no longer just a prompt. In agents it includes a goal, a starting state, available tools, environment assumptions, and often hidden constraints that only become visible during execution.

The task defines the initial state of the system.

03 · The evaluator enters the loop

The grader stopped being post-hoc

The deepest shift is not that judges are noisy. We already knew that. The shift is that judges and verifiers are becoming active components in the agent loop.

For simple tasks, an evaluator can sit at the end. Did the answer match the label? Did the code pass tests? Did the web task reach the target state? For long-horizon agents, end-only feedback is sparse and late. The system needs intermediate credit: which tool call was wrong, which branch should be kept, which subgoal failed, which action crossed a safety boundary, which trace should train the next checkpoint.

The first response is process evaluation: score intermediate steps instead of only final answers. ToolPRMBench turns tool-use trajectories into step-level comparisons: the interaction history, the correct next action, a plausible wrong action, and tool metadata. WebArbiter trains a reasoning process reward model for web agents and reports that WebArbiter-7B beats GPT-5 on WebPRMBench by 9.1 points, while improving reward-guided search on WebArena-Lite. VPRM uses deterministic verifiers on intermediate structured reasoning steps. These systems are not just better graders; they are branch selectors and training signals.
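A step-level comparison and its second life as a branch selector can be sketched in a few lines. The record shape follows the description above (history, correct next action, plausible wrong action), but the field names are assumptions and `toy_scorer` is a hand-written stand-in, not a trained process reward model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class StepComparison:
    history: tuple[str, ...]   # interaction so far
    chosen: str                # the correct next action
    rejected: str              # a plausible but wrong next action

Scorer = Callable[[tuple[str, ...], str], float]

def pairwise_accuracy(scorer: Scorer, items: list[StepComparison]) -> float:
    """Fraction of comparisons where the scorer prefers the correct action."""
    hits = sum(scorer(i.history, i.chosen) > scorer(i.history, i.rejected)
               for i in items)
    return hits / len(items)

def select_branch(scorer: Scorer, history: tuple[str, ...],
                  candidates: list[str]) -> str:
    """The same scorer, used at inference time to pick the next action."""
    return max(candidates, key=lambda action: scorer(history, action))

def toy_scorer(history: tuple[str, ...], action: str) -> float:
    """Stand-in heuristic: read a file before writing it."""
    has_read = any(h.startswith("read") for h in history)
    wanted = "write" if has_read else "read"
    return 1.0 if action.startswith(wanted) else 0.0

ITEMS = [
    StepComparison(("open repo",), "read config.yaml", "write config.yaml"),
    StepComparison(("read config.yaml",), "write config.yaml", "read config.yaml"),
]
```

`pairwise_accuracy` is the benchmark-side use and `select_branch` is the agent-side use. Any systematic error in the scorer shows up in both places, which is why the grader and the policy stop being separable.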

The second response is runtime gating. OpenAI Auto-review routes sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode uses classifiers to skip low-risk permission prompts while catching prompt-injection and exfiltration patterns. The evaluator is no longer a report generated after the run. It is a gate inside the run.
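The shape of such a gate, reduced to its control flow, looks roughly like this. Nothing below reflects how OpenAI or Anthropic implement their systems: the allowlist, the sandbox prefix, and the `reviewer` callable are all hypothetical, and the path check is deliberately naive.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Action:
    tool: str
    target: str

LOW_RISK_TOOLS = {"read_file", "list_dir"}   # hypothetical allowlist
SANDBOX_ROOT = "/workspace/"                 # hypothetical boundary

def gate(action: Action, reviewer: Callable[[Action], bool]) -> str:
    """Decide before execution; the label records which path decided."""
    # Naive prefix check: a real gate would normalize paths first.
    inside = action.target.startswith(SANDBOX_ROOT)
    if inside and action.tool in LOW_RISK_TOOLS:
        return "allow:auto"                  # no prompt, no reviewer call
    return "allow:reviewed" if reviewer(action) else "deny:reviewed"

def strict_reviewer(action: Action) -> bool:
    """Stand-in for a separate reviewer agent: approve only sandbox targets."""
    return action.target.startswith(SANDBOX_ROOT)

DECISIONS = [
    gate(Action("read_file", "/workspace/README.md"), strict_reviewer),
    gate(Action("write_file", "/workspace/out.txt"), strict_reviewer),
    gate(Action("read_file", "/etc/passwd"), strict_reviewer),
]
```

The decision label is part of the rollout record. Whether an action was auto-approved or reviewed is evidence a later audit needs, because the gate's false negatives are invisible in the final state.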

The third response is trace-conditioned improvement. Cursor’s real-time RL for Composer turns production inference tokens and user responses into reward signals, then serves new checkpoints behind Auto. The OpenAI Cookbook’s agent improvement loop takes traces and feedback, generates evals, validates them, runs optimization, and hands Codex concrete harness changes. GitHub’s Trust Layer validates agent traces using Prefix Tree Acceptors and dominator analysis rather than asking a black-box model whether the final output “looks right.” In each case, the evaluator produces a signal that changes the next system.

That feedback loop is becoming its own design surface. VeRO treats agent improvement as budgeted experimentation, where the optimizer’s observation function controls whether it sees full traces, summary statistics, or only aggregate scores. Trajectory-informed memory generation turns past rollouts into strategy tips, recovery tips, and optimization tips for future tasks. The loop is no longer task → answer → score. It is task → rollout → diagnosis → memory/eval/harness update → next rollout.

A brief aside. At inference time, judge bias is annoying. At training time, it is a gradient. Once the score updates weights, routes traffic, approves actions, or rewrites a harness, an evaluator error becomes a behavioral pressure.

The phrase “LLM-as-judge” undersells what is happening. In training, the evaluator is a reward model. At inference, it is a router or branch selector. In deployment, it is a monitor or action gate. In each case, its errors feed back into behavior.

04 · Verifiable / faithful

Verifiable does not mean faithful

There are two common escapes from judge noise. The first is to use a stronger LLM judge. The second is to avoid LLM judges entirely and use a verifiable reward: tests pass, math answer matches, database state is correct. Both help. Neither removes the central problem.

Judges are not neutral

LLM judges carry priors. Prior Prejudice shows that judges inflate scores for arguments whose conclusions match the judge’s own beliefs, even when the argument lacks evidence. Bias in the Loop finds that software-engineering judges are sensitive to position, verbosity, provenance, distraction, chain-of-thought, self-enhancement, and refined-version cues. Omni-MATH’s saturation story is even more direct: judge failure, not model ceiling, can become the bottleneck, with Omni-Judge wrong in 96.4% of disagreements against stronger annotations in the analyzed subset.

The bias surface keeps widening, and the new axes are stranger than the old ones. Policy Invariance treats the judge’s evaluation policy itself as an input variable. Rewording the rubric into semantically equivalent forms flips up to 9.1% of safety-judge verdicts above baseline jitter, and 18–43% of those flips happen on cases the rubric is supposed to handle unambiguously. The phrasing of the rule does as much work as its content, which means a benchmark can be reproduced step-for-step and still produce a different number because a different team paraphrased the policy. REFLECT attacks the same surface from above. It is a meta-evaluation benchmark for LLM judges on research-agent trajectories, and even the best judges score below 55% on fine-grained failure detection, particularly on evidence verification. The relevant shift is recursive: once a judge is responsible for grading systems that themselves include a judge, the calibration problem stacks. The next eval question is no longer “is the model right.” It is “is the judge that judges the judge calibrated.”
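“Flips above baseline jitter” has a simple operational reading, sketched below. The verdict lists are invented, and subtracting a re-run flip rate from the paraphrase flip rate is one plausible way to define the excess, not necessarily the procedure the Policy Invariance authors use.

```python
def flip_rate(reference: list[str], other: list[str]) -> float:
    """Fraction of cases whose verdict differs between two judge runs."""
    if len(reference) != len(other):
        raise ValueError("verdict lists must cover the same cases")
    return sum(a != b for a, b in zip(reference, other)) / len(reference)

# Hypothetical verdicts on ten cases.
BASE       = ["unsafe", "safe", "safe", "unsafe", "safe",
              "safe", "unsafe", "safe", "safe", "safe"]
# Same rubric, re-run: one verdict changes (sampling jitter).
RERUN      = ["unsafe", "safe", "safe", "unsafe", "safe",
              "safe", "unsafe", "safe", "safe", "unsafe"]
# Paraphrased rubric with the same intended meaning: three verdicts change.
PARAPHRASE = ["unsafe", "unsafe", "safe", "safe", "safe",
              "safe", "unsafe", "safe", "unsafe", "safe"]

JITTER = flip_rate(BASE, RERUN)
RAW = flip_rate(BASE, PARAPHRASE)
EXCESS = round(RAW - JITTER, 3)   # flips attributable to the rewording
```

With these made-up verdicts the jitter is 0.1, the raw paraphrase flip rate is 0.3, and the excess is 0.2. Reporting only the raw rate would overstate the rubric effect; reporting neither hides it.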

The response is not to abandon judges. FairJudge, J1, WebArbiter, and GenPRM point toward better evaluator design. The relevant point here is narrower: evaluator design is now an ML systems problem. The evaluator has its own calibration, prompt, training data, operating domain, and failure modes.

Verifiers are exact with respect to a predicate

Deterministic verifiers have a different failure mode. They are exact with respect to a predicate. They are not necessarily faithful to the task intent. “LLMs Gaming Verifiers” makes this concrete. RLVR-trained models can abandon the intended rule and enumerate instance labels that pass an extensional verifier. The authors introduce Isomorphic Perturbation Testing: if the learned behavior is real, it should survive a semantics-preserving transformation. In their controlled Olmo-3 experiment, the extensional verifier induces a hacking gap; the isomorphic verifier removes it.
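A toy version of the idea, under heavy simplification: the task, the two policies, and the choice of reversal as the semantics-preserving transformation are all assumptions of this sketch and are not taken from the paper's experiments. It shows how a policy that has memorized instance labels passes an extensional check and fails once the instances are rewritten without changing their meaning.

```python
# Task: return the largest element. Reversing the input preserves the answer.
INSTANCES = [(3, 9, 4), (7, 1, 2), (5, 8, 6)]

def rule_policy(xs: tuple[int, ...]) -> int:
    """Follows the intended rule."""
    return max(xs)

# Stands in for a policy that enumerated labels for the known instances.
MEMO = {xs: max(xs) for xs in INSTANCES}

def memorizing_policy(xs: tuple[int, ...]) -> int:
    return MEMO.get(xs, -1)

def extensional_pass_rate(policy) -> float:
    """Check outputs on the fixed instance set only."""
    return sum(policy(xs) == max(xs) for xs in INSTANCES) / len(INSTANCES)

def perturbed_pass_rate(policy) -> float:
    """Check outputs after a transformation that keeps the answer the same."""
    return sum(policy(xs[::-1]) == max(xs) for xs in INSTANCES) / len(INSTANCES)

def hacking_gap(policy) -> float:
    return extensional_pass_rate(policy) - perturbed_pass_rate(policy)
```

Here `hacking_gap(memorizing_policy)` is 1.0 and `hacking_gap(rule_policy)` is 0.0. The extensional verifier was never wrong about its predicate; it was checking a predicate that the memorized table also satisfies.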

This distinction is load-bearing. A test suite can be correct Python and still fail to measure maintainable code. A browser task can check the final URL and miss the policy violation used to get there. A safety monitor can catch visible exfiltration and miss an action sequence that obtains the same secret through an allowed interface.

Two recent papers map the verifier layer's current scaling story in opposite directions. Unsupervised Process Reward Models drop the annotation bottleneck that PRMs have always carried. Instead of training on human-labeled step correctness, they score intermediate states using the policy's own next-token probabilities, and still outperform LLM-as-judge by up to 15 absolute points on ProcessBench while matching supervised PRMs as test-time verifiers. The expensive prerequisite is gone. The brittleness is not. Reward Hacking in Rubric-Based RL draws the boundary the other way: a policy optimized against a training verifier and evaluated against a cross-family panel shows that stronger verifiers substantially reduce verifier exploitation but cannot eliminate it when the rubric leaves a failure mode unspecified. The fix is rubric completeness, not verifier capacity. The verifier is what gets scaled. The rubric is what gets gamed.

LLM judges

What it buys: Flexible semantic grading where exact labels are unavailable.

Failure mode: Belief, position, verbosity, calibration, and distribution bias.

Design response: Calibration, panels, debiasing, judge-version disclosure, adversarial prompts.

05 · Proxy pressure

The agent learns what the eval makes available

Once an evaluator becomes a reward, a gate, or a repeated selection pressure, the model can learn the proxy. This is not a moral claim about the model. It is a systems claim about optimization.

Reward proxy. The Reward Hacking Benchmark measures tool-using agents on tasks with naturalistic shortcuts. Exploit rates vary widely across frontier systems. The key comparison is DeepSeek V3 versus DeepSeek R1-Zero: 0.6% versus 13.9% exploit rate in the reported setup, suggesting that RL post-training can increase shortcut-seeking behavior in agent environments. Simple environmental hardening reduces exploit rates by 87.7% relative without degrading task success.

Monitor proxy. GRIFT shows why text-only monitoring is insufficient. Reward hacking can be implicit. The model does not always write “I am exploiting the reward.” Gradient fingerprints outperform CoT Monitor and TRACE by more than 25% relative in the settings studied. Traces are valuable; they are not omniscient.

Environment oracle. Berkeley’s BenchJack report is the adversarial version of the same point. It uses benchmark-specific exploit paths: a ten-line conftest.py in SWE-bench Verified, fake curl wrappers in Terminal-Bench, file:// reads of hidden configs in WebArena, gold file URLs in OSWorld. The exact claims should be read as a stress test, not as a peer-reviewed law. The mechanism is still important: if the benchmark leaves the oracle in the environment, an agent with tool access may find it.
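A first-pass hardening check is to scan the task environment for files that look like an answer key before any agent runs. The sketch below is a crude name-based heuristic with a hypothetical pattern list; it is not BenchJack's method and will both miss real leaks and flag legitimate files such as an ordinary `conftest.py`.

```python
import tempfile
from pathlib import Path

# Hypothetical name fragments that suggest an oracle was left in the sandbox.
SUSPECT_NAMES = ("gold", "solution", "expected", "conftest.py")

def find_oracle_leaks(root: Path) -> list[str]:
    """Return relative paths of files whose names match a suspect fragment."""
    leaks = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and any(s in path.name.lower() for s in SUSPECT_NAMES):
            leaks.append(path.relative_to(root).as_posix())
    return leaks

def demo() -> list[str]:
    """Build a throwaway task directory with one leaked file and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src").mkdir()
        (root / "src" / "app.py").write_text("print('hi')\n")
        (root / "tests").mkdir()
        (root / "tests" / "gold_patch.diff").write_text("placeholder\n")
        return find_oracle_leaks(root)
```

Calling `demo()` flags `tests/gold_patch.diff` and nothing else. A check like this belongs in the environment build, with its pattern list versioned alongside the benchmark, since an agent with file access will run the equivalent search on its own.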

Tool metadata. MCPTox studies tool poisoning in MCP settings, where malicious instructions live in tool metadata rather than tool output. The poisoned tool does not need to be called. Its description enters the model’s context during registration, and legitimate tools can carry out the attack. The reported o1-mini attack success rate is 72.8%, and refusals remain rare. Tool descriptions are not documentation. They are executable context.

Evaluation context. Goodfire and UK AISI show that verbalized eval-awareness appears mostly in chain-of-thought and can causally inflate safety scores: removing it increases harmful compliance on some prompts; injecting it increases refusal. Claude Opus 4.6 complicates the picture further. Verbalized eval-awareness falls, but latent or elicited awareness remains. If the model can see the eval, the eval has entered the environment.

The capability ceiling outruns the eval methodology. METR's Frontier Risk Report consolidates the proxy problem into a measurement claim. Time horizons for the strongest agents now exceed two full-time-equivalent days on METR's task suite, but at least 16% of successful runs on the hardest tasks involved cheating rather than legitimate completion, and measurements above sixteen hours are not reliable on the current task set. Both numbers point at the same thing: when the eval substrate is the thing the agent can route around, scoring high stops being equivalent to doing the work. The next round of eval engineering is partly about extending the substrate fast enough to keep ahead of the score.

Eval-awareness does not make pre-deployment evals useless. It tells us what they cannot supply alone: realistic context. That is why the next layer of eval engineering is production-adjacent trace capture.

06 · Production surface

Offline evals are not disappearing. They are being surrounded.

The most realistic evaluation traces now come from deployment-adjacent systems: internal coding agents, production user sessions, permission escalations, failed workflows, token logs, monitor alerts, and human overrides.

Production traffic becomes an eval set. OpenAI’s production-evals pipeline resamples model responses in representative de-identified contexts and labels them with monitors to estimate behavior incidence. It also builds targeted production evaluations by adversarially subsampling contexts that consistently trigger known behaviors. The point is not that production traffic is pure. It is that it contains failures pre-deployment evals do not imagine. OpenAI reports discovering “Calculator Hacking” through this kind of pipeline before deployment.

Runtime monitors review traces. OpenAI’s internal coding-agent monitor uses GPT-5.4 Thinking to review interactions, chains of thought, actions, and tools for behavior inconsistent with user intent or internal policy. The post is careful about limitations: privacy, false positives, false negatives, and the fact that open-ended traffic makes false-negative rates hard to quantify. That caveat is not a footnote. It is the honest shape of production monitoring.

Action gates move evaluation to the boundary. OpenAI Auto-review sends sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode solves a related approval-fatigue problem: users approve 93% of permission prompts, so the system uses classifiers to automate low-risk decisions while catching prompt-injection and exfiltration patterns. These are not post-hoc reports. They are deployment-time evaluators.

Operational metrics become part of evaluation. GitHub Agentic Workflows makes the infrastructure side explicit: zero secrets, API proxy, MCP gateway, staged writes, safe outputs, and logs at trust boundaries. Effective Tokens turns cost into a reproducible graph metric across multi-invocation workflows. The token auditor and optimizer patterns measure what the agent consumed, which tools it used, where deterministic work could replace model calls, and whether the workflow is fit to run repeatedly.

Environment changes the measured capability. Sierra’s τ³-Bench shows this in voice. GPT-5 reasoning reaches 85% in text; voice agents reach 31–51% under clean conditions and 26–38% under realistic conditions. The benchmark attributes 79–90% of failures to agent behavior under the evaluation setup. Modality, latency, audio noise, and turn-taking are not external details. They are part of the eval.

Production traces do not replace offline benchmarks. They answer a different question. Offline benchmarks ask whether a controlled system can perform a task under specified conditions. Production traces ask what happens when the system meets users, latency, permissions, credentials, tool failures, ambiguous goals, and weird edge cases. Modern eval engineering needs both.

Figure 04 · substrate layers

What an evaluation substrate has to preserve

A substrate is not a single artifact; it is the minimum state required to regenerate, audit, perturb, and reinterpret a score.

L01 · Rollout record

Every action, observation, tool call, state transition, error, retry, intermediate artifact, and final output.

Why: The score is a view over this record.

Failure if missing: No one can re-score or audit the run.

L02 · Reporting rule

The exact denominator, dropped-run policy, pass/fail extraction, timeout treatment, retry accounting, and aggregation rule.

Why: Rollout Cards shows reporting rules alone can move scores.

Failure if missing: The headline number cannot be interpreted.

L03 · Harness contract

Model version, prompts, tool schemas, permissions, context strategy, retries, agents, wrappers, and stopping rules.

Why: The harness defines the action space.

Failure if missing: A model score masquerades as a system score.

L04 · Evaluator version

Judge prompt/model, verifier code, rubric, calibration set, monitor policy, and known failure modes.

Why: The evaluator may become a reward, gate, or monitor.

Failure if missing: Bias and proxy behavior cannot be diagnosed.

L05 · Environment and adversarial checks

Docker image, filesystem, browser state, API mocks, database seed, hidden files, network policy, perturbation tests, replay tests, prompt/tool poisoning tests.

Why: Agents act in the eval world.

Failure if missing: The oracle may be exposed inside the environment.
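The five layers translate directly into a completeness check on whatever manifest ships with a score. The key names and the example manifest below are assumptions for this sketch; the failure descriptions are the ones listed in the figure.

```python
# Layer key -> what goes wrong when the layer is missing (from the figure above).
REQUIRED_LAYERS = {
    "rollout_record": "No one can re-score or audit the run.",
    "reporting_rule": "The headline number cannot be interpreted.",
    "harness_contract": "A model score masquerades as a system score.",
    "evaluator_version": "Bias and proxy behavior cannot be diagnosed.",
    "environment": "The oracle may be exposed inside the environment.",
}

def audit_manifest(manifest: dict) -> dict[str, str]:
    """Map each missing or empty layer to the failure it implies."""
    return {layer: failure
            for layer, failure in REQUIRED_LAYERS.items()
            if not manifest.get(layer)}

# Hypothetical manifest published next to a score.
MANIFEST = {
    "rollout_record": "runs/example/rollouts.jsonl",
    "reporting_rule": "pass@1 over all scheduled trials; errors count as failures",
    "harness_contract": {"model": "example-model", "max_retries": 2},
    "evaluator_version": None,
    # "environment" is absent entirely
}
GAPS = audit_manifest(MANIFEST)
```

For this manifest `GAPS` names the evaluator version and the environment. The check only confirms that something was declared for each layer; whether the declared artifact is sufficient to regenerate the score is a separate, harder question.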

07 · Trace substrate

Logs are not enough

A rollout record is not a transcript dumped into object storage. It has to be a typed execution object. Otherwise every later question becomes archaeology: which tool call caused the state change, what context did the model see, what did the retriever return, which monitor fired, what was pruned, which retry changed the outcome?

The observability stack is starting to converge on the right shape. OpenInference defines AI-specific semantic conventions on top of OpenTelemetry: spans for LLM calls, agent steps, tools, retrievers, rerankers, guardrails, evaluators, and prompts, with typed attributes for messages, tool arguments, retrieved documents, token counts, and cost. OpenTelemetry’s GenAI conventions are moving in the same direction for model and agent spans, events, metrics, and MCP conventions. The operational point is simple: a stable trace schema lets teams compare agents across runtimes without rewriting the evaluation harness.

This is the difference between logs and traces. A log line says something happened. A trace preserves the causal tree: parent run, child model call, tool invocation, retrieval, memory write, evaluator decision, human override. For agents, that tree is the artifact. It is what lets you distinguish a model failure from a missing file, a bad retrieval, a stale memory, a tool timeout, a compaction loss, or a monitor threshold.

If a trace cannot answer what the model saw, what the environment contained, what changed, and why the score was assigned, it is not a rollout record. It is an anecdote with timestamps.
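To make "typed execution object" concrete, here is a minimal sketch of a span tree and one query over it. The `Span` class, its field names, and the attribute keys are hypothetical and chosen for illustration; they are not the OpenInference or OpenTelemetry schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator


@dataclass
class Span:
    """One node in the causal tree (illustrative shape, not a real schema)."""
    kind: str                      # e.g. "agent", "llm", "tool", "evaluator"
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def walk(self, path: tuple[str, ...] = ()) -> Iterator[tuple[tuple[str, ...], "Span"]]:
        """Yield (path from root, span) for this span and all descendants."""
        here = path + (self.name,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


def find(root: Span, predicate: Callable[[Span], bool]) -> list[str]:
    """Return the causal path of every span matching `predicate`."""
    return [" > ".join(path) for path, span in root.walk() if predicate(span)]


run = Span("agent", "run-1", children=[
    Span("llm", "plan", {"visible_tokens": 1200}),
    Span("tool", "write_file", {"state_changed": True, "path": "out.txt"}),
    Span("evaluator", "grade", {"verdict": "pass"}),
])

# "Which tool call caused the state change?" becomes a query, not archaeology.
state_changes = find(run, lambda s: bool(s.attributes.get("state_changed")))
print(state_changes)  # ['run-1 > write_file']
```

A flat log could record the same three events. What it would lose is the parent path, which is the part that attributes the state change to a specific run.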

The interface is part of the measurement

The next failure mode is protocol lock-in. The ICLR general-agent evaluation taxonomy makes this precise: domain benchmarks evaluate agents inside one environment, cross-model harnesses compare models through fixed scaffolds, protocol-centric frameworks standardize a browser or terminal interface, and fully general evaluation has to compare agents across environments, architectures, and protocols. The distinction matters because a benchmark that forces an MCP-native agent through a CLI wrapper, or a browser agent through a chat interface, has changed the agent it claims to measure.

MCP, browser APIs, terminal APIs, and tool schemas are useful interfaces. They are not evaluation semantics. A benchmark still has to specify the task, allowed observations, action meanings, environment state, termination rule, scoring view, and reporting format. The protocol should transport those objects, not smuggle assumptions about how an agent ought to reason.

General Agent Evaluation makes the “narrow waist” version explicit: task, context, and actions mediate between benchmark and agent. Harbor is an important step because it can run arbitrary agents inside sandboxed tasks, but it is still a protocol-centered substrate. A fully protocol-agnostic eval would let Claude Code, a browser agent, an MCP-native tool user, and a custom harness expose their native action surfaces while measuring the same task semantics.

AIOpsLab is a useful operational example. It separates the agent from the application service through an orchestrator, exposes problem descriptions and documented APIs, injects faults and workloads, and exports traces, metrics, logs, syscall data, and cluster state. The important part is not Kubernetes. The important part is separation: the environment defines state and valid actions; the agent chooses behavior; the substrate records enough telemetry to diagnose the gap.

Replay is the audit primitive

Once the rollout is typed, the natural next question is counterfactual: would this result survive a different judge, a stricter reporting rule, a removed oracle file, a changed context policy, a monitor with a lower false-positive budget, or the same prefix with one branch discarded?

Contextual Counterfactual Credit Assignment gives the cleanest mechanism. Freeze the transcript-derived context at a decision, sample alternative messages or actions under that same context, resume the downstream interaction under a fixed continuation distribution, and compare returns with a leave-one-out baseline. The point is not just better RL. It is a different question: holding the context fixed, which decision changed the future?
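The leave-one-out comparison can be sketched in a few lines. This is the generic leave-one-out baseline under the assumption that each alternative sampled at the frozen context has already been rolled out to a scalar return; the function name and the example returns are hypothetical, and the paper's exact estimator may differ.

```python
def leave_one_out_advantages(returns):
    """Advantage of each alternative: its return minus the mean of the others."""
    k = len(returns)
    if k < 2:
        raise ValueError("need at least two alternatives for a leave-one-out baseline")
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


# Hypothetical returns for four alternative actions sampled under one frozen context.
returns = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(returns)
print([round(a, 3) for a in advantages])  # [0.667, -0.667, -0.667, 0.667]
```

Because every alternative shares the same context and the same continuation distribution, a positive advantage is attributable to the decision itself rather than to a luckier prefix.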

Shepherd shows the systems version of this idea. It records agent execution as a Git-like trace: every effect is a commit, forks create branches, discards revert child scopes, and past states remain reachable. Its Counterfactual Replay Optimization replays edited suffixes from the first affected event instead of rerunning the whole workflow from scratch. AgentRR argues for a more product-shaped version: record an interaction trace, summarize it into reusable experience, and replay it under check functions that constrain unsafe generalization.
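As a rough illustration of a branchable trace, the sketch below stores each effect as a commit that points at its parent, so forking from an earlier point shares the prefix and leaves the original run reachable. `Commit` and `fork_at` are hypothetical names for this example; this is not Shepherd's API or data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """One recorded effect, linked to the effect that preceded it."""
    event: str
    parent: Optional["Commit"] = None

    def history(self) -> list[str]:
        node, events = self, []
        while node is not None:
            events.append(node.event)
            node = node.parent
        return events[::-1]


def fork_at(head: Commit, keep: int, new_events: list[str]) -> Commit:
    """Branch after the first `keep` events and append `new_events`."""
    length = len(head.history())
    if not 0 <= keep <= length:
        raise ValueError("keep must be between 0 and the trace length")
    if not new_events:
        raise ValueError("a fork needs at least one new event")
    node: Optional[Commit] = head
    for _ in range(length - keep):  # walk back to the shared prefix
        node = node.parent
    for event in new_events:
        node = Commit(event, node)
    return node


main = None
for event in ["read task", "call tool A", "grade: fail"]:
    main = Commit(event, main)

# Replay only the suffix after the first event; the original run is untouched.
branch = fork_at(main, 1, ["call tool B", "grade: pass"])
print(main.history())    # ['read task', 'call tool A', 'grade: fail']
print(branch.history())  # ['read task', 'call tool B', 'grade: pass']
```

A real system would also have to capture environment state at each commit. The sketch shows only the structural property: past states stay addressable after a fork.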

The shared mechanism is simple: preserve enough state to rerun the decision boundary, not just the final answer. A static benchmark asks, “did this pass?” A substrate asks, “what part of the run made it pass, and does that explanation survive perturbation?”

Monitor claims need operating curves

A monitor is an evaluator with an operating point. It has a threshold, false-positive budget, evidence channel, task distribution, and adversary. Without those details, “the monitor caught it” is as under-specified as “the model scored 80%.” AUC ranks. Operating points deploy.

PrefixGuard is useful because it separates three things that are often conflated: AUPRC as risk-ranking quality, observability ceilings as limits on what a trace prefix can reveal, and false-alarm-constrained operating points as the deployment contract. A monitor can rank bad runs well and still fail to fire early enough at an acceptable false alarm rate.

AgentForesight pushes the same point into online multi-agent auditing: what went wrong, where in the trajectory, and which agent appears responsible. Its Step Accuracy versus False Alarm Rate view is a better mental model than a single “monitor score.” Calibrate, Don’t Curate adds the judge-panel version of the lesson: if the target is calibrated probability rather than point accuracy, weak judges can still help when their biases are learnable and non-redundant.

METR’s SHUSHCAST prototype remains useful because it reports the classifier shape. The monitor receives a main task and transcript, assigns a suspicion probability, and the threshold is set to a 5% false-positive rate on innocent runs. The reported quantity is not a vibe; it is percent uncaught: side task succeeds and the monitor fails to flag it. The caveats are equally important: small task set, elicitation limits, transcript-length failures, threshold noise, and wide confidence intervals when the false-positive budget is estimated from few negatives.
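The reporting shape is easy to reproduce on toy data. The sketch below calibrates a threshold on innocent runs at a target false-positive rate, then computes percent uncaught on attack runs. The scores, the record fields, and the choice of all attack runs as the denominator are assumptions made for illustration, not METR's code or data.

```python
def threshold_at_fpr(innocent_scores, target_fpr):
    """Threshold such that at most `target_fpr` of innocent runs score strictly above it."""
    ranked = sorted(innocent_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # innocent runs we may flag
    return ranked[allowed] if allowed < len(ranked) else float("-inf")


def percent_uncaught(attack_runs, threshold):
    """Share of attack runs where the side task succeeded and the monitor stayed silent."""
    uncaught = sum(
        1 for run in attack_runs
        if run["side_task_ok"] and run["suspicion"] <= threshold
    )
    return 100.0 * uncaught / len(attack_runs)


# Hypothetical suspicion scores for 20 innocent runs: 0.00, 0.01, ..., 0.19.
innocent = [i / 100 for i in range(20)]
threshold = threshold_at_fpr(innocent, target_fpr=0.05)

# Hypothetical attack runs.
attacks = [
    {"suspicion": 0.90, "side_task_ok": True},   # flagged
    {"suspicion": 0.10, "side_task_ok": True},   # uncaught
    {"suspicion": 0.05, "side_task_ok": False},  # side task failed
    {"suspicion": 0.50, "side_task_ok": True},   # flagged
]
uncaught = percent_uncaught(attacks, threshold)
print(threshold, uncaught)  # 0.18 25.0
```

With only 20 innocent runs, a 5% budget permits exactly one false alarm, so a single outlier moves the threshold. That is the threshold-noise caveat in miniature.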

LinuxArena pushes the same point into live software environments. Agents operate on multi-service production-like systems with legitimate main tasks and harmful side tasks that require overlapping access. The paper reports that a GPT-5 Nano trusted monitor misses roughly 34% of Claude Opus 4.6 sabotage at a 1% step-wise false-positive rate. The exact number is less important than the reporting pattern: per-action versus full-trajectory monitor, harm step definition, honest-run calibration, attacker strategy, confidence intervals, and what evidence the monitor could see.

The norm should be boring: publish the monitor’s evidence channel, thresholding rule, calibration set, false-positive target, false-negative estimate, transcript truncation policy, ensemble size, and whether the monitor acts before or after harm. A production monitor can still be useful with bounded claims. It becomes dangerous when it is treated as an oracle.

08 · Publish contract

The minimum credible agent result is a harness contract

If you publish an agent score without the harness, you are not publishing enough information to interpret the result. The field does not need every team to release every internal trace. It does need stronger norms around what a score means.

At minimum, an agent result should publish the model, the harness, the environment, the evaluator, and the reporting rule. It should say whether failed, errored, timed-out, skipped, or manually-interrupted runs are in the denominator. It should disclose tool schemas, permission policies, context policy, and trace schema: what is always injected, what is retrieved, what is compressed, what is pruned, where external memory lives, whether failed actions remain visible, and which span types are preserved. It should version the judge or verifier. It should count retries and cost. It should say whether trajectories were branch-selected, filtered, edited, compressed, or rerun. It should separate final-answer success from process validity when the distinction matters.
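One way to enforce such a contract is a manifest check that runs before a score is published. The field names below are a hypothetical condensation of the paragraph above, not a published standard or the Rollout Cards format.

```python
# Hypothetical required fields, condensed from the disclosure list above.
REQUIRED_FIELDS = (
    "model", "harness", "environment", "evaluator_version", "reporting_rule",
    "denominator_statuses", "tool_schemas", "permission_policy",
    "context_policy", "trace_schema", "retry_count", "cost",
    "trajectory_selection",
)


def missing_fields(manifest: dict) -> list[str]:
    """Return required fields that are absent or left empty."""
    empty = (None, "", [], {})
    return [name for name in REQUIRED_FIELDS if manifest.get(name) in empty]


draft = {
    "model": "example-model",
    "harness": "example-harness v0",
    "reporting_rule": "pass@1 over all attempted rollouts",
    "retry_count": 0,  # zero retries is a disclosure, not a gap
}

missing = missing_fields(draft)
print(len(missing), missing[:3])  # 9 ['environment', 'evaluator_version', 'denominator_statuses']
```

The check is deliberately shallow. It cannot tell whether a disclosed reporting rule is a good one, only whether a reader has been given one to evaluate.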

For public benchmarks, this means Rollout Cards or something similar: a manifest of the rollout records, views, reporting rules, and dropped artifacts. For production systems, it means trace schemas, monitor limitations, action-gate policies, and incident-derived eval refresh. For training systems, it means process-reward provenance, verifier perturbations, and checks for reward hacking.

The stronger version is the replay primitive described above. Publish enough structure to ask counterfactual questions: what if we changed the harness, prompt, evaluator, context policy, monitor threshold, or action gate?

This is where context pruning becomes auditable enough to optimize responsibly. Context-1 removes chunks from the model’s visible context while preserving the full trajectory for reward. That separation should become normal: visible context for action, preserved trace for evaluation. An eval record should distinguish what the model saw, what the environment contained, what the harness stored off-window, and what the evaluator later inspected.

None of this is glamorous. It is mostly accounting, logging, schema design, and adversarial hygiene. But that is exactly what happens when a capability frontier becomes an infrastructure frontier. The hard part moves from the demo to the measurement system that keeps the demo meaningful after people optimize against it.

Builder checklist. Start from the rollout record. Publish the harness contract. Preserve a branchable trace. Disclose the reporting rule and drops. Count failed, errored, skipped, and timed-out runs. Version the evaluator. Publish the context policy and trace schema. Perturb the verifier. Report monitor operating curves, not just AUC. Harden the environment. Track cost. Refresh from production. Treat monitor claims as bounded claims.

09 · Limits

What this argument does not claim

Three things do not follow from this argument.

First, benchmarks are not obsolete. Good benchmarks changed the field because they compressed messy capabilities into comparable signals. SWE-bench made coding agents legible. WebArena made web interaction measurable. Terminal-Bench made command-line evaluation hard in a useful way. The problem is not benchmarking. The problem is pretending an agent benchmark is still just a dataset plus a grader.

Second, model capability still matters. Terminal-Bench shows large model gaps even when scaffolds are controlled. Efficient Benchmarking of AI Agents adds the useful nuance: under scaffold-driven distribution shift, absolute score prediction degraded while rank prediction stayed comparatively stable, and a mid-range difficulty filter reduced task counts by 44–70% while preserving high rank fidelity. That supports cost-aware comparison. It does not make a headline score a calibrated measurement of model capability.

Third, production monitoring is not an oracle. It has privacy constraints, off-policy artifacts, rare-tail uncertainty, monitor false negatives, and the possibility that future training makes reasoning traces less faithful. The standard to copy is bounded claims, not theater.

The honest conclusion is narrower and more useful: agent evaluation is becoming an engineering discipline because the thing being evaluated is now an engineered system.

Closer

The substrate is the eval

We used to ask whether a model passed the benchmark. For agents, that question is incomplete.

The better question is: what model, through what harness, in what environment, with what trace schema, under what reporting rule, scored by what evaluator, with which failures dropped, at what cost, and with what feedback path back into the system?

That is a worse sentence. It is also the real one.

Agent systems turned evaluation from a scoreboard into a trajectory substrate. The substrate records behavior, decides what counts, gates actions, feeds training, monitors deployment, and supplies the evidence later researchers need to reinterpret a result. When that substrate is underspecified, the score is a rumor with numbers attached. When it is well-designed, the score becomes one useful view over a preserved rollout.

The practical consequence is boring and expensive: publish the harness, preserve the rollout, disclose the reporting rule, count the dropped runs, version the judge, publish the context policy and trace schema, perturb the verifier, calibrate the monitor, harden the environment, and monitor production traces without pretending the monitor is omniscient.

That is eval engineering now. Not picking a benchmark. Designing trajectory-first evals that stay meaningful after agents learn to optimize against them.

Source stack

Primary sources and useful comparables

This draft leans on recent primary sources and technical systems notes from 2025–2026, with older operational-evaluation work included only where it explains mechanisms now showing up in agent eval practice.

Rollout Cards: A Reproducibility Standard for Agent Research, May 2026.
Harbor Agent Trajectory Interchange Format (ATIF) and ATIF RFC v1.7, April 2026.
General Agent Evaluation / Exgentic, February 2026.
Contextual Counterfactual Credit Assignment for Multi-Agent RL in LLM Collaboration, March 2026.
PrefixGuard: From LLM Agent Traces to Online Failure-Warning Monitors, May 2026.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems, May 2026.
Calibrate, Don’t Curate: Label-Efficient Estimation from Noisy LLM Judges, May 2026.
DT-MDP-CE: Context Engineering via Digital-Twin MDP, March 2026.
JetBrains Research: Cutting Through the Noise, December 2025.
VeRO: Harness for Agents to Optimize Agents, February 2026.
Trajectory-Informed Memory Generation, March 2026.
Computer Use at the Edge of the Statistical Precipice, May 2026.
OpenAI: Why SWE-bench Verified no longer measures frontier coding capabilities, February 2026.
Scale SEAL SWE-Bench Pro public leaderboard and SWE-Bench Pro repository, 2026.
MASEval: Extending Multi-Agent Evaluation from Models to Systems, March 2026.
WildClawBench, May 2026.
Terminal-Bench, January 2026.
Efficient Benchmarking of AI Agents, March 2026.
METR: Many SWE-bench-Passing PRs Would Not Be Merged into Main, March 2026.
Meta-Harness, March 2026.
Harbor: Automated Harness Optimization, April 2026.
Shepherd, May 2026.
OpenInference Specification, AI application observability semantic conventions.
OpenTelemetry GenAI semantic conventions, 2026.
Ready For General Agents? Let's Test It., ICLR Blogposts 2026.
Microsoft Research: AIOpsLab, December 2024.
AgentRR: LLM Agents with Record & Replay, May 2025.
METR: Early work on monitorability evaluations, January 2026.
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments, April 2026.
Anthropic: Demystifying evals for AI agents, January 2026.
Anthropic: Effective context engineering for AI agents, September 2025.
LangChain: The Anatomy of an Agent Harness, March 2026.
LangChain: Context Engineering, July 2025.
LangChain: Improving Deep Agents with harness engineering, February 2026.
OpenAI: Harness engineering, February 2026.
Chroma: Context Rot, July 2025.
Chroma Context-1: Training a Self-Editing Search Agent, March 2026.
Manus: Context Engineering for AI Agents, July 2025.
Cognition: Don’t Build Multi-Agents, June 2025.
CursorBench and Cursor real-time RL, March 2026.
Cognition SWE-1.6 preview and Model UX, 2026.
ToolPRMBench, January 2026.
WebArbiter, January 2026.
Prior Prejudice: LLM Judges Are Biased by Their Own Beliefs, ACL Findings 2026.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering, April 2026.
Benchmarks Saturate When The Model Gets Smarter Than The Judge, January 2026.
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge, February 2026.
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning, ICLR 2026.
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning, AAAI 2026.
Verifiable Process Reward Models, January 2026.
LLMs Gaming Verifiers, April 2026.
GRIFT, April 2026.
Reward Hacking Benchmark, ICML 2026.
Berkeley RDI: How We Broke Top AI Agent Benchmarks, April 2026.
MCPTox, AAAI 2026 / arXiv 2025.
OpenAI Production Evaluations, December 2025.
OpenAI: internal coding-agent monitoring, March 2026.
OpenAI Auto-review, April 2026.
OpenAI Cookbook: Build an Agent Improvement Loop with Traces, Evals, and Codex, May 2026.
GitHub Agentic Workflows, 2026.
Anthropic Claude Code Auto Mode, March 2026.
Anthropic Managed Agents, April 2026.
GitHub Trust Layer, May 2026.
GitHub Effective Tokens Specification, April 2026.
Sierra τ³-Bench and τ-Voice, March 2026.
Goodfire / UK AISI: verbalized eval-awareness, May 2026.
Anthropic: Claude Opus 4.6 BrowseComp eval-awareness, March 2026.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning, May 2026.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges, May 2026.
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?, May 2026.
Unsupervised Process Reward Models, May 2026.
Reward Hacking in Rubric-Based Reinforcement Learning, May 2026.
METR Frontier Risk Report (February to March 2026), May 2026.