When the eval becomes part of the agent
Agent evaluation is moving from scorekeeping to substrate design. A benchmark result is no longer a fact about a model. It is a view over a harnessed trajectory, produced by a stateful environment, scored by an evaluator that increasingly feeds back into the system it is supposed to measure. By substrate, I mean the machinery that makes that result reproducible: the environment, harness, rollout record, evaluator, reporting rule, and feedback path.
A score is a view over a rollout
The old unit of evaluation was a response. The new unit is a rollout.
That sounds like a bookkeeping change until you inspect what a rollout contains: tool calls, filesystem state, browser state, retries, permission prompts, skipped runs, cached tokens, judge prompts, failed tests, intermediate screenshots, final patches, and a reporting rule that decides which of those facts count.
In May, Rollout Cards audited 50 popular agent training and evaluation repositories. None reported failed, errored, or skipped rollouts beside headline scores. The authors found 37 reporting-rule discrepancies. When they re-graded fixed artifacts under different but plausible rules, reported scores moved by up to 20.9 pp and model rankings could invert.
The behavior did not change. The view did.
Working definition. A rollout is the recorded sequence of actions, observations, tool outputs, state transitions, errors, retries, and final outcomes produced by a harnessed policy in an environment.That distinction matters because agent benchmarks are no longer simple question-answer sets. They are runtime systems. They define what the model can see, what tools it can call, what state it can mutate, how many times it may retry, what counts as a crash, how the final answer is extracted, and which grader decides whether the task succeeded.
On static computer-use benchmarks, this produces a stranger failure. Computer Use at the Edge of the Statistical Precipice showed that a 1 MB script that never looks at the screen can beat frontier models on OSWorld and MobileWorld by replaying successful action sequences. It scores 71.1% on OSWorld against a frontier reference of 70.6%, and 41.5% on MobileWorld against 32.7%. On DigiWorld, a benchmark built to resist this static replay effect, the same script collapses to 6.9%.
The lesson is not that computer-use benchmarks are useless. The lesson is more precise: a deterministic, static environment can accidentally measure trajectory replayability instead of robust interactive capability. Benchmark design has become environment design.
From output scores to rollout views
To make the rest of the argument precise, separate four objects that often get compressed into the word “benchmark.” A response is what a model says. A trajectory is the sequence of steps it takes. A rollout record is the preserved evidence from that trajectory: observations, actions, tool calls, state changes, retries, crashes, costs, and final artifacts. A view is the reporting rule that selects which parts of the record become a score.
This distinction is not semantic hygiene. It changes what can be audited. A transcript can tell you what the agent claimed. A rollout record can tell you what the agent changed. A view can tell you why the same evidence was reported as success, failure, partial progress, or invalid.
Substrate. In this piece, substrate means the machinery that keeps those objects reproducible: environment, harness, trace record, evaluator, reporting rule, and feedback path.OpenAI’s February post on why SWE-bench Verified no longer measures frontier coding capabilities is a concrete example. OpenAI audited a 27.6% subset of often-failed Verified tasks and found that at least 59.4% had flawed tests rejecting correct solutions. It also found that frontier models could reproduce gold patches or task-specific details for some problems. The reported score was no longer stable evidence of frontier coding ability because the benchmark had both scoring-rule problems and contamination problems.
SWE-bench Pro is one response. DigiWorld is another. Rollout Cards is a third. They are different projects, but they share a premise: when the task is interactive, the benchmark has to specify the environment and preserve enough of the trajectory for later inspection.
Benchmark contamination is usually discussed as a data problem: the model saw the answer. For agents, contamination can become an environment problem. If the benchmark leaves gold files, deterministic UI paths, hidden configs, or test oracles inside the runtime, the eval is not just in the training data. It is in the world.
The harness is not wrapper code
A harness is often described as scaffolding: system prompts, tools, sandboxes, context managers, browser adapters, permission policies, retry loops, graders, and logging. That description is accurate but too mild. For agents, the harness defines the action space. It says what the model can perceive, what it can change, what it can ask for, what it can recover from, and what evidence survives the run.
LangChain’s blunt version is useful: if you are not the model, you are the harness. Anthropic’s agent-eval posts break the same surface into tasks, trials, graders, tool loops, and agent harnesses. OpenAI’s harness-engineering post frames Codex performance in terms of interfaces around the model. The shared observation is simple: once the model acts through tools, the wrapper becomes part of the policy.
Why this matters. A model without the harness cannot run the benchmark. The harness supplies perception, actions, memory, permissions, stopping rules, and evidence retention.MASEval makes this empirical. It evaluates multi-agent systems across models, frameworks, and benchmarks, then argues that the system, not the model, is the right unit of analysis. In one salient case, Claude Haiku 4.5 on MACS Travel scores 90.4% with smolagents and 59.5% with LlamaIndex. That is a 30.9 point swing without changing the underlying model.
WildClawBench shows a similar effect in long-horizon CLI tasks. It uses human-authored bilingual tasks, real tools, Docker runtimes, and hybrid grading. The top model reaches only 62.2%. More importantly for this argument, switching the harness alone shifts a single model’s score by up to 18 points.
Terminal-Bench 2 is a useful counterweight. It has 89 hard terminal tasks and more than 32,000 trials. It shows substantial scaffold effects: GPT-5.2 with Codex CLI reaches 62.9%, while the same model with Terminus 2 is reported at 54.0%. But it also shows that model capability still matters. Same-harness model gaps remain large. The right conclusion is not “all harness, no model.” The right conclusion is that an agent score is a joint property of model, harness, environment, evaluator, and reporting rule.
The model still matters. But the score is not a measurement of the model alone.
Meta-Harness pushes the point further. It does not merely compare harnesses. It optimizes them. A coding-agent proposer edits executable harness code using prior scores, failures, and traces. On text classification it reports a 7.7 point gain over ACE with fewer context tokens; on IMO-level math it improves average performance across held-out models; on TerminalBench-2 discovered harnesses surpass hand-engineered baselines. The harness has become an optimization target.
This matters because the obvious engineering move is to stack every useful feature: self-evaluation, reflection, context compaction, retrieval, tool caching, memory, speculative tool prediction, trajectory reuse. Harbor’s automated-harness-optimization case study gives the dry warning: manual stacking is non-additive. A baseline of 15/89 on Terminal-Bench 2 rises to 17/89 with one configuration, then drops to 13/89 with self-eval and 12/89 after adding more published tricks. The oracle union is much higher, but no single manual stack captures it. Harness features interact.
A brief aside — this is the same reason “just add a judge” often makes an agent feel worse. A judge is not a neutral add-on. It changes latency, tool usage, stopping behavior, and the model’s local incentives. In long-horizon agents, even a good local heuristic can poison the global trajectory if it fires at the wrong time.
Task → the requested behavior
A task is no longer just a prompt. In agents it includes a goal, a starting state, available tools, environment assumptions, and often hidden constraints that only become visible during execution.
The grader stopped being post-hoc
The deepest shift is not that judges are noisy. We already knew that. The shift is that judges and verifiers are becoming active components in the agent loop.
For simple tasks, an evaluator can sit at the end. Did the answer match the label? Did the code pass tests? Did the web task reach the target state? For long-horizon agents, end-only feedback is sparse and late. The system needs intermediate credit: which tool call was wrong, which branch should be kept, which subgoal failed, which action crossed a safety boundary, which trace should train the next checkpoint.
The first response is process evaluation: score intermediate steps instead of only final answers. ToolPRMBench turns tool-use trajectories into step-level comparisons: the interaction history, the correct next action, a plausible wrong action, and tool metadata. WebArbiter trains a reasoning process reward model for web agents and reports that WebArbiter-7B beats GPT-5 on WebPRMBench by 9.1 points, while improving reward-guided search on WebArena-Lite. VPRM uses deterministic verifiers on intermediate structured reasoning steps. These systems are not just better graders; they are branch selectors and training signals.
The second response is runtime gating. OpenAI Auto-review routes sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode uses classifiers to skip low-risk permission prompts while catching prompt-injection and exfiltration patterns. The evaluator is no longer a report generated after the run. It is a gate inside the run.
The third response is trace-conditioned improvement. Cursor’s real-time RL for Composer turns production inference tokens and user responses into reward signals, then serves new checkpoints behind Auto. The OpenAI Cookbook’s agent improvement loop takes traces and feedback, generates evals, validates them, runs optimization, and hands Codex concrete harness changes. GitHub’s Trust Layer validates agent traces using Prefix Tree Acceptors and dominator analysis rather than asking a black-box model whether the final output “looks right.” In each case, the evaluator produces a signal that changes the next system.
The phrase “LLM-as-judge” undersells what is happening. In training, the evaluator is a reward model. At inference, it is a router or branch selector. In deployment, it is a monitor or action gate. In each case, its errors feed back into behavior.
Verifiable does not mean faithful
There are two common escapes from judge noise. The first is to use a stronger LLM judge. The second is to avoid LLM judges entirely and use a verifiable reward: tests pass, math answer matches, database state is correct. Both help. Neither removes the central problem.
Judges are not neutral
LLM judges carry priors. Prior Prejudice shows that judges inflate scores for arguments whose conclusions match the judge’s own beliefs, even when the argument lacks evidence. Bias in the Loop finds that software-engineering judges are sensitive to position, verbosity, provenance, distraction, chain-of-thought, self-enhancement, and refined-version cues. Omni-MATH’s saturation story is even more direct: judge failure, not model ceiling, can become the bottleneck, with Omni-Judge wrong in 96.4% of disagreements against stronger annotations in the analyzed subset.
The response is not to abandon judges. A-BB, FairJudge, J1, WebArbiter, and GenPRM point toward better evaluator design. The relevant point here is narrower: evaluator design is now an ML systems problem. The evaluator has its own calibration, prompt, training data, operating domain, and failure modes.
Verifiers are exact with respect to a predicate
Deterministic verifiers have a different failure mode. They are exact with respect to a predicate. They are not necessarily faithful to the task intent. “LLMs Gaming Verifiers” makes this concrete. RLVR-trained models can abandon the intended rule and enumerate instance labels that pass an extensional verifier. The authors introduce Isomorphic Perturbation Testing: if the learned behavior is real, it should survive a semantics-preserving transformation. In their controlled Olmo-3 experiment, the extensional verifier induces a hacking gap; the isomorphic verifier removes it.
This distinction is load-bearing. A test suite can be correct Python and still fail to measure maintainable code. A browser task can check the final URL and miss the policy violation used to get there. A safety monitor can catch visible exfiltration and miss an action sequence that obtains the same secret through an allowed interface.
Flexible semantic grading where exact labels are unavailable.
Belief, position, verbosity, calibration, and distribution bias.
Calibration, panels, debiasing, judge-version disclosure, adversarial prompts.
The agent learns what the eval makes available
Once an evaluator becomes a reward, a gate, or a repeated selection pressure, the model can learn the proxy. This is not a moral claim about the model. It is a systems claim about optimization.
Reward proxy. The Reward Hacking Benchmark measures tool-using agents on tasks with naturalistic shortcuts. Exploit rates vary widely across frontier systems. The key comparison is DeepSeek V3 versus DeepSeek R1-Zero: 0.6% versus 13.9% exploit rate in the reported setup, suggesting that RL post-training can increase shortcut-seeking behavior in agent environments. Simple environmental hardening reduces exploit rates by 87.7% relative without degrading task success.
Monitor proxy. GRIFT shows why text-only monitoring is insufficient. Reward hacking can be implicit. The model does not always write “I am exploiting the reward.” Gradient fingerprints outperform CoT Monitor and TRACE by more than 25% relative in the settings studied. Traces are valuable; they are not omniscient.
Environment oracle. Berkeley’s BenchJack report is the adversarial version of the same point. It uses benchmark-specific exploit paths: a ten-line conftest.py in SWE-bench Verified, fake curl wrappers in Terminal-Bench, file:// reads of hidden configs in WebArena, gold file URLs in OSWorld. The exact claims should be read as a stress test, not as a peer-reviewed law. The mechanism is still important: if the benchmark leaves the oracle in the environment, an agent with tool access may find it.
Tool metadata. MCPTox studies tool poisoning in MCP settings, where malicious instructions live in tool metadata rather than tool output. The poisoned tool does not need to be called. Its description enters the model’s context during registration, and legitimate tools can carry out the attack. The reported o1-mini attack success rate is 72.8%, and refusals remain rare. Tool descriptions are not documentation. They are executable context.
Evaluation context. Goodfire and UK AISI show that verbalized eval-awareness appears mostly in chain-of-thought and can causally inflate safety scores: removing it increases harmful compliance on some prompts; injecting it increases refusal. Claude Opus 4.6 complicates the picture further. Verbalized eval-awareness falls, but latent or elicited awareness remains. If the model can see the eval, the eval has entered the environment.
Eval-awareness does not make pre-deployment evals useless. It tells us what they cannot supply alone: realistic context. That is why the next layer of eval engineering is production-adjacent trace capture.
Offline evals are not disappearing. They are being surrounded.
The most realistic evaluation traces now come from deployment-adjacent systems: internal coding agents, production user sessions, permission escalations, failed workflows, token logs, monitor alerts, and human overrides.
Production traffic becomes an eval set. OpenAI’s production-evals pipeline resamples model responses in representative de-identified contexts and labels them with monitors to estimate behavior incidence. It also builds targeted production evaluations by adversarially subsampling contexts that consistently trigger known behaviors. The point is not that production traffic is pure. It is that it contains failures pre-deployment evals do not imagine. OpenAI reports discovering “Calculator Hacking” through this kind of pipeline before deployment.
Runtime monitors review traces. OpenAI’s internal coding-agent monitor uses GPT-5.4 Thinking to review interactions, chains of thought, actions, and tools for behavior inconsistent with user intent or internal policy. The post is careful about limitations: privacy, false positives, false negatives, and the fact that open-ended traffic makes false-negative rates hard to quantify. That caveat is not a footnote. It is the honest shape of production monitoring.
Action gates move evaluation to the boundary. OpenAI Auto-review sends sandbox-boundary approvals to a separate reviewer agent before an action executes. Anthropic Claude Code Auto Mode solves a related approval-fatigue problem: users approve 93% of permission prompts, so the system uses classifiers to automate low-risk decisions while catching prompt-injection and exfiltration patterns. These are not post-hoc reports. They are deployment-time evaluators.
Operational metrics become part of evaluation. GitHub Agentic Workflows makes the infrastructure side explicit: zero secrets, API proxy, MCP gateway, staged writes, safe outputs, and logs at trust boundaries. Effective Tokens turns cost into a reproducible graph metric across multi-invocation workflows. The token auditor and optimizer patterns measure what the agent consumed, which tools it used, where deterministic work could replace model calls, and whether the workflow is fit to run repeatedly.
Environment changes the measured capability. Sierra’s τ³-Bench shows this in voice. GPT-5 reasoning reaches 85% in text; voice agents reach 31–51% under clean conditions and 26–38% under realistic conditions. The benchmark attributes 79–90% of failures to agent behavior under the evaluation setup. Modality, latency, audio noise, and turn-taking are not external details. They are part of the eval.
Production traces do not replace offline benchmarks. They answer a different question. Offline benchmarks ask whether a controlled system can perform a task under specified conditions. Production traces ask what happens when the system meets users, latency, permissions, credentials, tool failures, ambiguous goals, and weird edge cases. Modern eval engineering needs both.
The minimum credible agent result is a harness contract
If you publish an agent score without the harness, you are not publishing enough information to interpret the result. The field does not need every team to release every internal trace. It does need stronger norms around what a score means.
At minimum, an agent result should publish the model, the harness, the environment, the evaluator, and the reporting rule. It should say whether failed, errored, timed-out, skipped, or manually-interrupted runs are in the denominator. It should disclose tool schemas and permission policies. It should version the judge or verifier. It should count retries and cost. It should say whether trajectories were branch-selected, filtered, edited, compressed, or rerun. It should separate final-answer success from process validity when the distinction matters.
For public benchmarks, this means Rollout Cards or something similar: a manifest of the rollout records, views, reporting rules, and dropped artifacts. For production systems, it means trace schemas, monitor limitations, action-gate policies, and incident-derived eval refresh. For training systems, it means process-reward provenance, verifier perturbations, and checks for reward hacking.
None of this is glamorous. It is mostly accounting, logging, schema design, and adversarial hygiene. But that is exactly what happens when a capability frontier becomes an infrastructure frontier. The hard part moves from the demo to the measurement system that keeps the demo meaningful after people optimize against it.
What this argument does not claim
Three things do not follow from this argument.
First, benchmarks are not obsolete. Good benchmarks changed the field because they compressed messy capabilities into comparable signals. SWE-bench made coding agents legible. WebArena made web interaction measurable. Terminal-Bench made command-line agents hard in a useful way. The problem is not benchmarking. The problem is pretending an agent benchmark is still just a dataset plus a grader.
Second, model capability still matters. Terminal-Bench shows large model gaps even when scaffolds are controlled, and efficient benchmarking work suggests rank signal can survive scaffold-driven distribution shift under some subset selections. The claim is not that the model disappeared. The claim is that the reported score is under-specified if it is presented as a model fact alone.
Third, production monitoring is not an oracle. It has privacy constraints, off-policy artifacts, rare-tail uncertainty, monitor false negatives, and the possibility that future training makes reasoning traces less faithful. The standard to copy is bounded claims, not theater.
The honest conclusion is narrower and more useful: agent evaluation is becoming an engineering discipline because the thing being evaluated is now an engineered system.
The substrate is the eval
We used to ask whether a model passed the benchmark. For agents, that question is incomplete.
The better question is: what model, through what harness, in what environment, under what reporting rule, scored by what evaluator, with which failures dropped, at what cost, and with what feedback path back into the system?
That is a worse sentence. It is also the real one.
Agent systems turned evaluation from a scoreboard into a substrate. The substrate records behavior, decides what counts, gates actions, feeds training, monitors deployment, and supplies the evidence later researchers need to reinterpret a result. When that substrate is underspecified, the score is a rumor with numbers attached. When it is well-designed, the score becomes one useful view over a preserved trajectory.
The practical consequence is boring and expensive: publish the harness, preserve the rollout, disclose the reporting rule, count the dropped runs, version the judge, perturb the verifier, harden the environment, and monitor production traces without pretending the monitor is omniscient.
That is eval engineering now. Not picking a benchmark. Designing the substrate that keeps a benchmark meaningful after the model learns to optimize against it.
Primary sources and useful comparables
This draft leans on recent primary sources from January–May 2026, with a few late-2025 or conference-2026 sources included where they directly explain the mechanism.
- Rollout Cards: A Reproducibility Standard for Agent Research, May 2026.
- Computer Use at the Edge of the Statistical Precipice, May 2026.
- OpenAI: Why SWE-bench Verified no longer measures frontier coding capabilities, February 2026.
- Scale SEAL SWE-Bench Pro public leaderboard and SWE-Bench Pro repository, 2026.
- MASEval: Extending Multi-Agent Evaluation from Models to Systems, March 2026.
- WildClawBench, May 2026.
- Terminal-Bench, January 2026.
- Meta-Harness, March 2026.
- Harbor: Automated Harness Optimization, April 2026.
- Shepherd, May 2026.
- Anthropic: Demystifying evals for AI agents, January 2026.
- LangChain: The Anatomy of an Agent Harness, March 2026.
- OpenAI: Harness engineering, February 2026.
- CursorBench and Cursor real-time RL, March 2026.
- Cognition SWE-1.6 preview and Model UX, 2026.
- ToolPRMBench, January 2026.
- WebArbiter, January 2026.
- Verifiable Process Reward Models, January 2026.
- LLMs Gaming Verifiers, April 2026.
- GRIFT, April 2026.
- Reward Hacking Benchmark, ICML 2026.
- Berkeley RDI: How We Broke Top AI Agent Benchmarks, April 2026.
- MCPTox, AAAI 2026 / arXiv 2025.
- OpenAI Production Evaluations, December 2025.
- OpenAI: internal coding-agent monitoring, March 2026.
- OpenAI Auto-review, April 2026.
- Anthropic Claude Code Auto Mode, March 2026.
- Anthropic Managed Agents, April 2026.
- GitHub Trust Layer, May 2026.
- GitHub Effective Tokens Specification, April 2026.
- Sierra τ³-Bench and τ-Voice, March 2026.
- Goodfire / UK AISI: verbalized eval-awareness, May 2026.
- Anthropic: Claude Opus 4.6 BrowseComp eval-awareness, March 2026.