test: update QA parity fixtures for GPT-5.5

Peter Steinberger
2026-04-25 18:05:13 +01:00
parent 39343088ed
commit 6b3e4b88d6
59 changed files with 407 additions and 399 deletions

View File

@@ -52,6 +52,14 @@
]
},
"redirects": [
+{
+"source": "/help/gpt54-codex-agentic-parity",
+"destination": "/help/gpt55-codex-agentic-parity"
+},
+{
+"source": "/help/gpt54-codex-agentic-parity-maintainers",
+"destination": "/help/gpt55-codex-agentic-parity-maintainers"
+},
{
"source": "/mcp",
"destination": "/cli/mcp"
@@ -1649,8 +1657,8 @@
"concepts/typing-indicators",
"concepts/usage-tracking",
"concepts/timezone",
"help/gpt54-codex-agentic-parity",
"help/gpt54-codex-agentic-parity-maintainers"
"help/gpt55-codex-agentic-parity",
"help/gpt55-codex-agentic-parity-maintainers"
]
},
{

View File

@@ -1,12 +1,12 @@
---
-summary: "How to review the GPT-5.4 / Codex parity program as four merge units"
-title: "GPT-5.4 / Codex parity maintainer notes"
+summary: "How to review the GPT-5.5 / Codex parity program as four merge units"
+title: "GPT-5.5 / Codex parity maintainer notes"
read_when:
-- Reviewing the GPT-5.4 / Codex parity PR series
+- Reviewing the GPT-5.5 / Codex parity PR series
- Maintaining the six-contract agentic architecture behind the parity program
---
-This note explains how to review the GPT-5.4 / Codex parity program as four merge units without losing the original six-contract architecture.
+This note explains how to review the GPT-5.5 / Codex parity program as four merge units without losing the original six-contract architecture.
## Merge units
@@ -59,7 +59,7 @@ Does not own:
Owns:
-- first-wave GPT-5.4 vs Opus 4.6 scenario pack
+- first-wave GPT-5.5 vs Opus 4.6 scenario pack
- parity documentation
- parity report and release-gate mechanics
@@ -123,7 +123,7 @@ Expected artifacts from PR D:
## Release gate
-Do not claim GPT-5.4 parity or superiority over Opus 4.6 until:
+Do not claim GPT-5.5 parity or superiority over Opus 4.6 until:
- PR A, PR B, and PR C are merged
- PR D runs the first-wave parity pack cleanly
@@ -132,7 +132,7 @@ Do not claim GPT-5.4 parity or superiority over Opus 4.6 until:
```mermaid
flowchart LR
A["PR A-C merged"] --> B["Run GPT-5.4 parity pack"]
A["PR A-C merged"] --> B["Run GPT-5.5 parity pack"]
A --> C["Run Opus 4.6 parity pack"]
B --> D["qa-suite-summary.json"]
C --> E["qa-suite-summary.json"]
@@ -146,7 +146,7 @@ flowchart LR
The parity harness is not the only evidence source. Keep this split explicit in review:
-- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
+- PR D owns the scenario-based GPT-5.5 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Quick maintainer merge workflow
@@ -179,13 +179,13 @@ If any one of the evidence bar items is missing, request changes instead of merg
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
-| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
+| GPT-5.5 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |
| ----------------------------------------------------------- | --------------------------------------------------------------------------------------- |
-| GPT-5.4 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
+| GPT-5.5 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
| Tool use felt brittle with strict OpenAI/Codex schemas | PR C keeps tool registration and parameter-free invocation predictable |
| `/elevated full` hints were sometimes misleading | PR B ties guidance to actual runtime capability and blocked reasons |
| Long tasks could disappear into replay/compaction ambiguity | PR C emits explicit paused, blocked, abandoned, and replay-invalid state |
@@ -193,4 +193,4 @@ If any one of the evidence bar items is missing, request changes instead of merg
## Related
-- [GPT-5.4 / Codex agentic parity](/help/gpt54-codex-agentic-parity)
+- [GPT-5.5 / Codex agentic parity](/help/gpt55-codex-agentic-parity)

View File

@@ -1,15 +1,15 @@
---
-summary: "How OpenClaw closes agentic execution gaps for GPT-5.4 and Codex-style models"
-title: "GPT-5.4 / Codex agentic parity"
+summary: "How OpenClaw closes agentic execution gaps for GPT-5.5 and Codex-style models"
+title: "GPT-5.5 / Codex agentic parity"
read_when:
-- Debugging GPT-5.4 or Codex agent behavior
+- Debugging GPT-5.5 or Codex agent behavior
- Comparing OpenClaw agentic behavior across frontier models
- Reviewing the strict-agentic, tool-schema, elevation, and replay fixes
---
-# GPT-5.4 / Codex Agentic Parity in OpenClaw
+# GPT-5.5 / Codex Agentic Parity in OpenClaw
-OpenClaw already worked well with tool-using frontier models, but GPT-5.4 and Codex-style models were still underperforming in a few practical ways:
+OpenClaw already worked well with tool-using frontier models, but GPT-5.5 and Codex-style models were still underperforming in a few practical ways:
- they could stop after planning instead of doing the work
- they could use strict OpenAI/Codex tool schemas incorrectly
@@ -27,7 +27,7 @@ This slice adds an opt-in `strict-agentic` execution contract for embedded Pi GP
When enabled, OpenClaw stops accepting plan-only turns as “good enough” completion. If the model only says what it intends to do and does not actually use tools or make progress, OpenClaw retries with an act-now steer and then fails closed with an explicit blocked state instead of silently ending the task.
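As a minimal sketch of the opt-in shape, an agent config enabling the contract could look like the following; the `executionContract` and `onPlanOnlyTurn` key names are illustrative assumptions, not the documented OpenClaw config schema:
```json
{
  "_note": "hypothetical sketch; key names are illustrative, not the documented schema",
  "agent": {
    "model": "gpt-5.5-codex",
    "executionContract": "strict-agentic",
    "onPlanOnlyTurn": {
      "retryWithActNowSteer": true,
      "thenFailClosedAs": "blocked"
    }
  }
}
```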
-This improves the GPT-5.4 experience most on:
+This improves the GPT-5.5 experience most on:
- short “ok do it” follow-ups
- code tasks where the first step is obvious
@@ -40,7 +40,7 @@ This slice makes OpenClaw tell the truth about two things:
- why the provider/runtime call failed
- whether `/elevated full` is actually available
-That means GPT-5.4 gets better runtime signals for missing scope, auth refresh failures, HTML 403 auth failures, proxy issues, DNS or timeout failures, and blocked full-access modes. The model is less likely to hallucinate the wrong remediation or keep asking for a permission mode the runtime cannot provide.
+That means GPT-5.5 gets better runtime signals for missing scope, auth refresh failures, HTML 403 auth failures, proxy issues, DNS or timeout failures, and blocked full-access modes. The model is less likely to hallucinate the wrong remediation or keep asking for a permission mode the runtime cannot provide.
### PR C: execution correctness
@@ -53,7 +53,7 @@ The tool-compat work reduces schema friction for strict OpenAI/Codex tool regist
### PR D: parity harness
-This slice adds the first-wave QA-lab parity pack so GPT-5.4 and Opus 4.6 can be exercised through the same scenarios and compared using shared evidence.
+This slice adds the first-wave QA-lab parity pack so GPT-5.5 and Opus 4.6 can be exercised through the same scenarios and compared using shared evidence.
The parity pack is the proof layer. It does not change runtime behavior by itself.
@@ -62,7 +62,7 @@ After you have two `qa-suite-summary.json` artifacts, generate the release-gate
```bash
pnpm openclaw qa parity-report \
--repo-root . \
---candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
+--candidate-summary .artifacts/qa-e2e/gpt55/qa-suite-summary.json \
--baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
--output-dir .artifacts/qa-e2e/parity
```
@@ -73,16 +73,16 @@ That command writes:
- a machine-readable JSON verdict
- an explicit `pass` / `fail` gate result
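As a hedged sketch of how a CI job could consume that gate result (the `.verdict` field name and the summary path are assumptions inferred from the surrounding prose, not a confirmed schema):
```bash
# Hypothetical CI gate: fail the job unless the parity verdict is "pass".
# The .verdict field and the summary path are assumptions, not a documented contract.
SUMMARY=.artifacts/qa-e2e/parity/qa-agentic-parity-summary.json
if jq -e '.verdict == "pass"' "$SUMMARY" > /dev/null; then
  echo "parity gate passed"
else
  echo "parity gate failed - see qa-agentic-parity-report.md" >&2
  exit 1
fi
```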
-## Why this improves GPT-5.4 in practice
+## Why this improves GPT-5.5 in practice
-Before this work, GPT-5.4 on OpenClaw could feel less agentic than Opus in real coding sessions because the runtime tolerated behaviors that are especially harmful for GPT-5-style models:
+Before this work, GPT-5.5 on OpenClaw could feel less agentic than Opus in real coding sessions because the runtime tolerated behaviors that are especially harmful for GPT-5-style models:
- commentary-only turns
- schema friction around tools
- vague permission feedback
- silent replay or compaction breakage
-The goal is not to make GPT-5.4 imitate Opus. The goal is to give GPT-5.4 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states.
+The goal is not to make GPT-5.5 imitate Opus. The goal is to give GPT-5.5 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states.
That changes the user experience from:
@@ -92,15 +92,15 @@ to:
- “the model either acted, or OpenClaw surfaced the exact reason it could not”
-## Before vs after for GPT-5.4 users
+## Before vs after for GPT-5.5 users
| Before this program | After PR A-D |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
-| GPT-5.4 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
+| GPT-5.5 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
| Strict tool schemas could reject parameter-free or OpenAI/Codex-shaped tools in confusing ways | PR C makes provider-owned tool registration and invocation more predictable |
-| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.4 and the user truthful runtime and permission hints |
+| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.5 and the user truthful runtime and permission hints |
| Replay or compaction failures could feel like the task silently disappeared | PR C surfaces paused, blocked, abandoned, and replay-invalid outcomes explicitly |
-| “GPT-5.4 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |
+| “GPT-5.5 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |
## Architecture
@@ -123,7 +123,7 @@ flowchart TD
```mermaid
flowchart LR
A["Merged runtime slices (PR A-C)"] --> B["Run GPT-5.4 parity pack"]
A["Merged runtime slices (PR A-C)"] --> B["Run GPT-5.5 parity pack"]
A --> C["Run Opus 4.6 parity pack"]
B --> D["qa-suite-summary.json"]
C --> E["qa-suite-summary.json"]
@@ -162,7 +162,7 @@ Checks that a task with a real mutating write keeps replay-unsafety explicit ins
## Scenario matrix
-| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
+| Scenario | What it tests | Good GPT-5.5 behavior | Failure signal |
| ---------------------------------- | --------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
@@ -172,7 +172,7 @@ Checks that a task with a real mutating write keeps replay-unsafety explicit ins
## Release gate
-GPT-5.4 can only be considered at parity or better when the merged runtime passes the parity pack and the runtime-truthfulness regressions at the same time.
+GPT-5.5 can only be considered at parity or better when the merged runtime passes the parity pack and the runtime-truthfulness regressions at the same time.
Required outcomes:
@@ -191,24 +191,24 @@ For the first-wave harness, the gate compares:
Parity evidence is intentionally split across two layers:
-- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
+- PR D proves same-scenario GPT-5.5 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## Goal-to-evidence matrix
| Completion gate item | Owning PR | Evidence source | Pass signal |
| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
-| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
-| GPT-5.4 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
-| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
+| GPT-5.5 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
+| GPT-5.5 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
+| GPT-5.5 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
-| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
+| GPT-5.5 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
-- `pass` means GPT-5.4 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
+- `pass` means GPT-5.5 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
- `fail` means at least one hard gate tripped: weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.
- “shared/base CI issue” is not itself a parity result. If CI noise outside PR D blocks a run, the verdict should wait for a clean merged-runtime execution instead of being inferred from branch-era logs.
- Auth, proxy, DNS, and `/elevated full` truthfulness still come from PR B's deterministic suites, so the final release claim needs both: a passing PR D parity verdict and green PR B truthfulness coverage.
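For orientation, a minimal sketch of what that machine-readable verdict could look like; every field name below is an assumption inferred from the bullets above, not a confirmed schema:
```json
{
  "_note": "hypothetical sketch; field names inferred from the prose, not a confirmed schema",
  "candidate": "gpt-5.5",
  "baseline": "opus-4.6",
  "scenarioCoverageMatches": true,
  "fakeSuccessCount": 0,
  "regressions": [],
  "verdict": "pass"
}
```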
@@ -218,7 +218,7 @@ Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readabl
Use `strict-agentic` when:
- the agent is expected to act immediately when a next step is obvious
-- GPT-5.4 or Codex-family models are the primary runtime
+- GPT-5.5 or Codex-family models are the primary runtime
- you prefer explicit blocked states over “helpful” recap-only replies
Keep the default contract when:
@@ -229,4 +229,4 @@ Keep the default contract when:
## Related
-- [GPT-5.4 / Codex parity maintainer notes](/help/gpt54-codex-agentic-parity-maintainers)
+- [GPT-5.5 / Codex parity maintainer notes](/help/gpt55-codex-agentic-parity-maintainers)