OpenAI · · 17 min read

A shared playbook for trustworthy third party evaluations

Mirrored from OpenAI for archival readability. Support the source by reading on the original site.

May 29, 2026

Safety

A shared playbook for trustworthy third party evaluations

What matters for effective independent evaluations of safeguards and capabilities for frontier models.

Loading…

Independent, trusted third party evaluations play a critical role in strengthening the safety ecosystem. These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations. In this post, we share lessons we’ve learned so far, and recommend approaches for designing evaluations that can validly assess frontier models that we hope help inform emerging standards in the space.

Earlier, many evaluations treated models like chatbots: the evaluation prompted a model as though it were a user asking a question, the model answered, and an evaluator judged the output. Today’s frontier models can do much more: they can use tools, keep track of information across many steps, and act within a larger workflow. This means that performance depends not only on the model, but also on the environment in which the task takes place, and on the setup that facilitates its actions. This surrounding setup, which we call the “harness,” can change key aspects of the system’s performance, including how it uses tools, keeps track of information, or recovers from mistakes.

Diagram comparing a prompt-response workflow with an agentic task workflow, showing how control loops, tools, context, budget, and safeguards enable autonomous task execution.

This changes how evaluations need to be conducted, and what readers should look for in evaluation reports. In our view, the most useful reports explicitly describe two things beyond the result itself: First, they specify what claim the evaluation setup was designed to test, and second, they share the available evidence that the evaluation result is valid.

Claims tested in evaluations typically fall into one of three buckets¹:

  • Capability elicitation: Can a model plausibly produce the capability being evaluated? 
  • Safeguard performance: How robust are the tested safeguards against the behavior or attack being evaluated?
  • Comparison: How do different models perform under equivalent conditions?

Evaluation reports also need to explain how evaluators checked for effects that could impact the validity of a result. These include:

  • Reward hacking: Exploiting shortcuts in the task or scorer, so the system gets credit without demonstrating the behavior the evaluation is meant to measure.
  • Refusals: Refusing in ways that obscure the behavior being tested.
  • Contamination: Overperforming because evaluation tasks, answers, or close variants appeared in training data or were discoverable during the evaluation, such as through browsing.
  • Broken problems: Underperforming because tasks are invalid. Reasons can include unfair scoring (e.g., correct answer requires unstated implementation details) and unsolvable environments (e.g., missing critical files or unreliable tools).
  • Sandbagging: Deliberately underperforming when they show awareness of being evaluated.

Selecting the right harness for an evaluation is crucial for optimal results

We’ve observed that the role of the harness is especially important for systems that act over longer trajectories. When models can use tools, maintain state, and recover from mistakes across many steps, the harness can change the observed level of performance, and even determine whether the capability that’s being assessed appears in the evaluation at all. For example, a harness that preserves state and retries failed actions may let a model finish a multi-step task that the same model never completes in a simpler harness.

In the table below, we separate three kinds of claims evaluators may want to make and the harness we believe each kind of claim requires.

Claim the evaluation is trying to support

Appropriate harness choice

Evidence to report

Capability under strong elicitation: System A can complete tasks of type X when the setup is designed to draw out its strongest credible performance.

Use the strongest credible elicitation setup for the system, including the harness, tools, scaffolding, and budget a capable user would reasonably use.

The harness and tool setup, elicitation guidance, budget/effort allowed, tokens/cost/time, and why the setup is a credible proxy for the claimed capability. If comparing systems under different optimized setups, label it as a system-to-system or strong-elicitation comparison.

Controlled comparison: System A outperforms System B under a shared evaluation setup.

Keep the tasks, scoring, and budget fixed. Use either a shared harness/tool setup or a fixed set of standardized harnesses chosen up front to provide reasonable max elicitation for the systems being compared.

The shared task set, tools, scoring method, harness, budget, token efficiency/cost, and known limitations. For coding-agent evaluations, an open-source harness such as Codex CLI can provide a fixed agent loop and tool interface across systems. The ideal approach for maximum elicitation would be to optimize a bespoke harness for each task and system, but doing so is currently impractical in practice.

Safeguard robustness under elicited attack: System A’s safeguards are sufficient for the relevant model behavior or elicited attack.

Use a safeguard-testing setup designed to elicit the strongest credible attack under the relevant adversary model.

How evaluators characterized the relevant model behavior, the safeguard configuration tested, the elicitation strategy, the harness used to carry it out, and the budget or effort allowed.

Capability claims are only as strong as the elicitation behind them: evaluators need to choose the harness that best fits the task and the capability the evaluation is trying to measure. A standardized harness may be right for comparing systems under identical conditions, but it can understate capability when it leaves out specific harness features that help the model perform the task. For example, GPT‑5.5’s performance on OpenAI’s cyber ranges shows how a harness choice can materially change measured capability on tasks that require long, multi-step tool use: the model performs better when the harness uses compaction to preserve task-relevant context as the interaction gets longer. This demonstrates that for certain models, a harness that omits compaction would under-elicit performance.

Higher success rates are better

Other published evaluations² also show harness and budget choices changing evaluation results. Increasing test-time compute can significantly change what capability an evaluation elicits, especially in domains where success is easy to verify, such as many cyber tasks. In UK AISI’s cyber range evaluation(opens in a new window), increasing the budget from 10M to 100M tokens improved performance by up to 59%, and performance was still increasing at the highest budget tested. Detailing this makes the evaluation more interpretable: it shows readers how the result depends on the tested elicitation setup. When performance is still improving with additional budget, the score should be described as performance under that harness and budget, not as a measured capability ceiling. Capability is often resource-dependent rather than a fixed quantity that can be cleanly measured once and for all. Where success can be measured across repeated attempts, reports should also consider expected cost per successful solve, not just success rate at a fixed token budget. This can make severity easier to interpret: a low success rate may still be practically meaningful if the cost of repeated attempts is within the relevant threat model. For capability claims, avoidable under-elicitation is a measurement failure: if the harness or budget prevents the system from exhibiting behavior it could otherwise produce, the score does not measure the capability being claimed. Where evaluators have pushed elicitation as far as is feasible and performance is still improving, reports should say so clearly and make clear that the result is only a lower-bound estimate.

Safeguard testing can understate whether an attack can succeed, and how severe it could be, when not accounting for the resources available to attackers, including custom harnesses. In UK AISI's GPT‑5.5 cyber evaluation(opens in a new window), their expert red teaming found a universal jailbreak that elicited violative cyber content across the malicious queries OpenAI provided, including in multi-turn agentic settings. They used Codex to create a custom harness to strengthen the model’s attack performance: it embedded a reusable safeguard-bypass pattern into the interaction, preserved that pattern across turns and blocks, and applied it across the malicious cyber queries OpenAI provided. Safeguard testing should match the adversary. If the claim is about robustness to expert misuse, the test should evaluate the strongest credible end-to-end attack strategy under a defined budget, including any harness needed to preserve and reuse that strategy. Otherwise, the results risk miscalibration: they could support only a narrower claim about resistance to simpler prompting, could miss both how severe the attack becomes and its probability of success once the elicitation method is operationalized, and could also overstate how likely or severe a problem is if given too much budget.

There is a time and place for standardized harness comparisons, but evaluators should be explicit about why using a consistent set of harnesses is appropriate and what claim it can support. METR's time-horizon evaluation(opens in a new window) is an example of a broader, appropriately fixed evaluation setup: it is designed to produce comparable results across the systems it evaluates. METR defines a common outcome, the typical duration for a human task at which an AI agent is predicted to succeed at a given reliability level. It applies a shared task suite, scoring method, fitting method, and a small set of reusable scaffolds such as Triframe and ReAct(opens in a new window) within each batch of estimates reported together. When METR expanded the task suite and moved evaluation infrastructure from a framework called Vivaria to one called Inspect, it reported the change (Time Horizon 1.1 update(opens in a new window)) and re-evaluated models under the new evaluation setup. That is the value of a standardized evaluation setup, including a consistent harness set: it can make readers confident that a difference in scores really reflects a difference between the systems being compared, rather than a change in the measurement setup.

We recommend that third party evaluation reports state what kind of claim their evaluation setup is meant to support; describe how closely what was tested reflects that broader claim; describe the harness choices that shaped the result; detail when those choices change between evaluations; and include supporting evidence to show how the result was produced and how well it generalizes to the claim.

Assess validity by checking for known hazards that can distort results

As models become more capable, evaluation scores become easier to misinterpret. Relative to real capabilities, evaluation scores can be artificially reduced if a model recognizes it is being evaluated and strategically underperforms. They can be inflated if the model exploits a shortcut in the task, prompt, scorer, or harness. They can also be distorted by contamination (where a model already knows or can find an answer without solving the task) or by “broken” problems that are ambiguous, incorrectly scored, unsolvable, or vulnerable to unintended shortcuts. Evaluation reports should therefore pair headline scores with a discussion of these hazards, so readers can assess whether the scores reflect the intended behavior.

Harnesses, budgets, tools, scoring rules, monitors, and review procedures all affect whether an agent is solving the intended task, avoiding it, memorizing it, or finding a path around it. A trustworthy report makes those checks visible: evaluators should review samples for these behaviors every time an assessment is run.

Reward hacking

Reward hacking means achieving high evaluation scores in ways that do not reflect the intended capability. Here, the concern is that the system gets credit by exploiting the task, scorer, prompt, or harness rather than by doing the work the evaluation was meant to measure. METR’s evaluation of GPT 5.4(opens in a new window) shows why this matters: despite the model succeeding on tasks at a rate that would have registered as a roughly 13-hour time horizon at first pass, human review showed that some of those successes came from reward hacking, and revising the results to account for only those instances without reward hacking lowered the estimate to about 6 hours. Evaluators should assess the need for such adjustments and, when they are needed, report them clearly: a capability estimate is much more useful when readers can see which apparent successes were disqualified, why they were disqualified, and how much the result depends on that judgment.

Chart showing AI model performance over time with a trend line and confidence intervals.

Refusals

Models can also underperform capability evaluations due to safeguards. A model may have lower evaluation performance than it is capable of due to refusing evaluation tasks instead of completing them. Reports should therefore explain if refusals were part of the evaluation results and if yes, then how many samples were compromised because of them.

Contamination

Contamination matters most for public or reused benchmarks. If tasks, answers, or close variants appear in training data or can be found by an agent with browsing, performance may reflect memorization or retrieval rather than model performance. Evaluators should prefer private or newly constructed tasks where possible and test whether models reproduce task-specific details before solving.

Broken problems

Broken problems include situations such as a problem that accidentally cannot be solved. Broken problems are common enough(opens in a new window) that they should be treated as a standard validity risk. They include incorrect ground truth, ambiguous prompts, missing files, flaky services, unfair scoring criteria, and environments that expose shortcuts such as hidden answer files or repository history containing the ground-truth patch.

Sandbagging

Sandbagging by a model or system means strategically underperforming when strong performance would conflict with some objective. Apollo’s GPT‑5.5 evaluation(opens in a new window) tested this by creating a context where the model understood that high evaluation scores might trigger unlearning or block deployment. GPT‑5.5 still answered correctly, suggesting little behavioral evidence of strategic underperformance³. But their review of reasoning traces added important context: Apollo found moderate-or-higher signs of alignment-evaluation awareness in 52% of sandbagging-condition samples, compared with 0% in counterfactuals. That evidence complicates interpretation of the result, but ultimately gives readers more signal: readers can see both the reassuring behavioral result and the reason to be cautious about overinterpreting it.

How we are supporting stronger evaluations

Strong claims require both the right harness to elicit the behavior and validity checks to show the result is sound. Our view that harnesses and validity checks are part of the evaluation result shapes how we support third party evaluations in practice: 

  • We are sharing specific maximum-elicitation guidance with evaluators.
  • We are asking capability evaluators to use Codex as a common floor for OpenAI models: tests should at least run a baseline through the same agentic interface users are likely to rely on, rather than only through a stripped-down model interface.
  • We are also making reasoning traces and other intermediate artifacts available where they are needed to assess deception, sandbagging, or evaluation awareness. METR and Apollo have used this access in OpenAI evaluations since GPT‑5. 
  • Finally, we are prioritizing research to more deeply understand when and how harness choices materially change results, from context management and tool access to retry behavior, scoring, and resource budgets.

What this means for evaluation standards and future research directions 

These recommendations are intended not only to improve individual evaluation reports, but also to inform emerging national (opens in a new window)and international (opens in a new window)standards for frontier AI evaluation and reporting. Going forward, third party evaluation standards should require enough detail for decision makers to understand what claims the specific evaluations support, what system was tested, how the result was elicited, and how evaluators checked its validity. For frontier systems being tested on tasks where agentic capabilities matter, details should include (subject to any security or confidentiality concerns):

  • The claim: whether the evaluation compares systems, estimates a capability ceiling, or tests safeguards.
  • Evaluation content: enough detail about the tasks or task distribution for readers to understand what skills, behaviors, or failure modes the evaluation is actually testing.
  • The tested system: the model, reasoning setting, tool access, harness, and safeguards.
  • The budget: turns, tokens, attempts/retries, wall-clock time, inference cost, and where applicable expected cost per successful solve.
  • Elicitation methods: harness choices used to draw out the result, and how closely what was tested reflects the broader claim being made.
  • Validity checks: how assessors looked for reward hacking, evaluation awareness, contamination, refusals, sandbagging and other behaviors that could undermine the result, including how confirmed cases affected scoring or interpretation.

Standards that leave out harness choices or validity checks can understate what a system can do or overstate confidence in a safety claim. Building strong harnesses and elicitation methods remains an open research area and should be a focus for further investigation and investment.

Author

OpenAI

Glossary

Because we use a number of terms of art in this post, we have included a glossary below that provides a plain-language explanation of what we are referring to:

  • Agentic system: A system that can work through a task over multiple steps, using tools, maintaining task state, and acting in an environment, rather than only returning a single response to a prompt.

  • Assessment: A broader judgment about whether evidence supports a claim, risk conclusion, or assurance position which may be based on evaluation data, document review, interview, process review, and other relevant artifacts.

  • Compaction: Method for preserving task-relevant context during long runs.

  • Configuration: Exact tested system and evaluation conditions, beyond the model name.

  • Contamination: When evaluation tasks, answers, or close variants appear in a model’s training data or are discoverable during the evaluation (e.g., via tools like browsing), making performance overstate the model’s true generalization.

  • Elicitation: Process of trying to draw out a capability or behavior from a system during an assessment.

  • Environment: Task setting in which a system is tested. This includes things like the external state the agent interacts with and modifies during an evaluation, such as a terminal environment or a video game.

  • Evaluation: Particular test or measurement within an assessment.

  • Evaluation awareness: Evaluation awareness refers to a model recognizing, or appearing to recognize, that it is being evaluated and potentially adjusting its behavior in response to that context. This can look like the model explicitly reasoning about being tested, inferring the purpose of the evaluation, or changing its behavior because it expects the result to affect how it is judged or deployed.

  • Harness: Model-facing structure that lets a model carry out a task: prompts, tools, interfaces, control logic, memory, retries, validators, and other supporting structures around the model.

  • Maximum elicitation: Testing aimed at finding the strongest credible performance or failure mode a system can produce under a defined budget, rather than simply running the system once through a standardized harness.

  • Reasoning traces: Records of the model's intermediate reasoning during a test.

  • Reward hacking: Achieving a high score through a shortcut or behavior outside the evaluator's intent.

  • Safeguards: Filters, monitors, blocking systems, and other protections applied around a model or product.

  • Sandbagging: Strategic underperformance in an evaluation in a way that undermines the result.

  • Scoring: Method used to decide how performance is measured or whether a task succeeded.

  • Standardized harness: Harness kept the same across systems rather than customized to a particular model or task, so differences in results are easier to attribute to the tested model.

  • Time horizon: Length of task a system can complete with a specified reliability, often expressed as how long the same task would take a human.

  • Tool access: External tools available to the model during the assessment.

  • Trajectories: Step-by-step paths a system follows while working through a task.

  • Universal jailbreak: Single attack pattern that causes a system to bypass safeguards across many prompts or tasks.

Footnotes

¹ This post does not try to resolve how third parties should evaluate misalignment or propensity-related claims. More work is needed to understand how harness choices affect those evaluations, especially because harnesses that support longer-horizon behavior may also change the kinds of unintended behavior evaluators need to monitor.

² HAL(opens in a new window) provides an additional example from agent evaluations: on CORE-Bench(opens in a new window), changing the harness to Claude Code changed both measured performance and the cost profile on the same benchmark. Other public examples include SWE-agent(opens in a new window), SeeAct(opens in a new window), ScienceAgentBench(opens in a new window), CORE-Bench(opens in a new window), and Inspect Cyber(opens in a new window); each tells readers what interface or harness produced the score.

³  Scoring 100% on both Sandbagging QA variants and 99.6% on the strategic-deception sandbagging task

Keep reading

View all

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from OpenAI