test: update QA parity fixtures for GPT-5.5

Peter Steinberger
2026-04-25 18:05:13 +01:00
parent 39343088ed
commit 6b3e4b88d6
59 changed files with 407 additions and 399 deletions

View File

@@ -52,6 +52,14 @@
]
},
"redirects": [
+{
+"source": "/help/gpt54-codex-agentic-parity",
+"destination": "/help/gpt55-codex-agentic-parity"
+},
+{
+"source": "/help/gpt54-codex-agentic-parity-maintainers",
+"destination": "/help/gpt55-codex-agentic-parity-maintainers"
+},
{
"source": "/mcp",
"destination": "/cli/mcp"
@@ -1649,8 +1657,8 @@
"concepts/typing-indicators",
"concepts/usage-tracking",
"concepts/timezone",
"help/gpt54-codex-agentic-parity",
"help/gpt54-codex-agentic-parity-maintainers"
"help/gpt55-codex-agentic-parity",
"help/gpt55-codex-agentic-parity-maintainers"
]
},
{

View File

@@ -1,12 +1,12 @@
---
-summary: "How to review the GPT-5.4 / Codex parity program as four merge units"
-title: "GPT-5.4 / Codex parity maintainer notes"
+summary: "How to review the GPT-5.5 / Codex parity program as four merge units"
+title: "GPT-5.5 / Codex parity maintainer notes"
read_when:
-- Reviewing the GPT-5.4 / Codex parity PR series
+- Reviewing the GPT-5.5 / Codex parity PR series
- Maintaining the six-contract agentic architecture behind the parity program
---
-This note explains how to review the GPT-5.4 / Codex parity program as four merge units without losing the original six-contract architecture.
+This note explains how to review the GPT-5.5 / Codex parity program as four merge units without losing the original six-contract architecture.
## Merge units
@@ -59,7 +59,7 @@ Does not own:
Owns:
-- first-wave GPT-5.4 vs Opus 4.6 scenario pack
+- first-wave GPT-5.5 vs Opus 4.6 scenario pack
- parity documentation
- parity report and release-gate mechanics
@@ -123,7 +123,7 @@ Expected artifacts from PR D:
## Release gate
-Do not claim GPT-5.4 parity or superiority over Opus 4.6 until:
+Do not claim GPT-5.5 parity or superiority over Opus 4.6 until:
- PR A, PR B, and PR C are merged
- PR D runs the first-wave parity pack cleanly
@@ -132,7 +132,7 @@ Do not claim GPT-5.4 parity or superiority over Opus 4.6 until:
```mermaid
flowchart LR
A["PR A-C merged"] --> B["Run GPT-5.4 parity pack"]
A["PR A-C merged"] --> B["Run GPT-5.5 parity pack"]
A --> C["Run Opus 4.6 parity pack"]
B --> D["qa-suite-summary.json"]
C --> E["qa-suite-summary.json"]
@@ -146,7 +146,7 @@ flowchart LR
The parity harness is not the only evidence source. Keep this split explicit in review:
-- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
+- PR D owns the scenario-based GPT-5.5 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Quick maintainer merge workflow
@@ -179,13 +179,13 @@ If any one of the evidence bar items is missing, request changes instead of merg
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
-| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
+| GPT-5.5 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |
| ----------------------------------------------------------- | --------------------------------------------------------------------------------------- |
-| GPT-5.4 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
+| GPT-5.5 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
| Tool use felt brittle with strict OpenAI/Codex schemas | PR C keeps tool registration and parameter-free invocation predictable |
| `/elevated full` hints were sometimes misleading | PR B ties guidance to actual runtime capability and blocked reasons |
| Long tasks could disappear into replay/compaction ambiguity | PR C emits explicit paused, blocked, abandoned, and replay-invalid state |
@@ -193,4 +193,4 @@ If any one of the evidence bar items is missing, request changes instead of merg
## Related
-- [GPT-5.4 / Codex agentic parity](/help/gpt54-codex-agentic-parity)
+- [GPT-5.5 / Codex agentic parity](/help/gpt55-codex-agentic-parity)

View File

@@ -1,15 +1,15 @@
---
-summary: "How OpenClaw closes agentic execution gaps for GPT-5.4 and Codex-style models"
-title: "GPT-5.4 / Codex agentic parity"
+summary: "How OpenClaw closes agentic execution gaps for GPT-5.5 and Codex-style models"
+title: "GPT-5.5 / Codex agentic parity"
read_when:
-- Debugging GPT-5.4 or Codex agent behavior
+- Debugging GPT-5.5 or Codex agent behavior
- Comparing OpenClaw agentic behavior across frontier models
- Reviewing the strict-agentic, tool-schema, elevation, and replay fixes
---
-# GPT-5.4 / Codex Agentic Parity in OpenClaw
+# GPT-5.5 / Codex Agentic Parity in OpenClaw
-OpenClaw already worked well with tool-using frontier models, but GPT-5.4 and Codex-style models were still underperforming in a few practical ways:
+OpenClaw already worked well with tool-using frontier models, but GPT-5.5 and Codex-style models were still underperforming in a few practical ways:
- they could stop after planning instead of doing the work
- they could use strict OpenAI/Codex tool schemas incorrectly
@@ -27,7 +27,7 @@ This slice adds an opt-in `strict-agentic` execution contract for embedded Pi GP
When enabled, OpenClaw stops accepting plan-only turns as “good enough” completion. If the model only says what it intends to do and does not actually use tools or make progress, OpenClaw retries with an act-now steer and then fails closed with an explicit blocked state instead of silently ending the task.
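As a minimal sketch of the opt-in shape, an agent config enabling the contract could look like the following; the `executionContract` and `onPlanOnlyTurn` key names are illustrative assumptions, not the documented OpenClaw config schema:
```json
{
  "_note": "hypothetical sketch; key names are illustrative, not the documented schema",
  "agent": {
    "model": "gpt-5.5-codex",
    "executionContract": "strict-agentic",
    "onPlanOnlyTurn": {
      "retryWithActNowSteer": true,
      "thenFailClosedAs": "blocked"
    }
  }
}
```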
-This improves the GPT-5.4 experience most on:
+This improves the GPT-5.5 experience most on:
- short “ok do it” follow-ups
- code tasks where the first step is obvious
@@ -40,7 +40,7 @@ This slice makes OpenClaw tell the truth about two things:
- why the provider/runtime call failed
- whether `/elevated full` is actually available
-That means GPT-5.4 gets better runtime signals for missing scope, auth refresh failures, HTML 403 auth failures, proxy issues, DNS or timeout failures, and blocked full-access modes. The model is less likely to hallucinate the wrong remediation or keep asking for a permission mode the runtime cannot provide.
+That means GPT-5.5 gets better runtime signals for missing scope, auth refresh failures, HTML 403 auth failures, proxy issues, DNS or timeout failures, and blocked full-access modes. The model is less likely to hallucinate the wrong remediation or keep asking for a permission mode the runtime cannot provide.
### PR C: execution correctness
@@ -53,7 +53,7 @@ The tool-compat work reduces schema friction for strict OpenAI/Codex tool regist
### PR D: parity harness
-This slice adds the first-wave QA-lab parity pack so GPT-5.4 and Opus 4.6 can be exercised through the same scenarios and compared using shared evidence.
+This slice adds the first-wave QA-lab parity pack so GPT-5.5 and Opus 4.6 can be exercised through the same scenarios and compared using shared evidence.
The parity pack is the proof layer. It does not change runtime behavior by itself.
@@ -62,7 +62,7 @@ After you have two `qa-suite-summary.json` artifacts, generate the release-gate
```bash
pnpm openclaw qa parity-report \
--repo-root . \
---candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
+--candidate-summary .artifacts/qa-e2e/gpt55/qa-suite-summary.json \
--baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
--output-dir .artifacts/qa-e2e/parity
```
@@ -73,16 +73,16 @@ That command writes:
- a machine-readable JSON verdict
- an explicit `pass` / `fail` gate result
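As a hedged sketch of how a CI job could consume that gate result (the `.verdict` field name and the summary path are assumptions inferred from the surrounding prose, not a confirmed schema):
```bash
# Hypothetical CI gate: fail the job unless the parity verdict is "pass".
# The .verdict field and the summary path are assumptions, not a documented contract.
SUMMARY=.artifacts/qa-e2e/parity/qa-agentic-parity-summary.json
if jq -e '.verdict == "pass"' "$SUMMARY" > /dev/null; then
  echo "parity gate passed"
else
  echo "parity gate failed - see qa-agentic-parity-report.md" >&2
  exit 1
fi
```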
-## Why this improves GPT-5.4 in practice
+## Why this improves GPT-5.5 in practice
-Before this work, GPT-5.4 on OpenClaw could feel less agentic than Opus in real coding sessions because the runtime tolerated behaviors that are especially harmful for GPT-5-style models:
+Before this work, GPT-5.5 on OpenClaw could feel less agentic than Opus in real coding sessions because the runtime tolerated behaviors that are especially harmful for GPT-5-style models:
- commentary-only turns
- schema friction around tools
- vague permission feedback
- silent replay or compaction breakage
-The goal is not to make GPT-5.4 imitate Opus. The goal is to give GPT-5.4 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states.
+The goal is not to make GPT-5.5 imitate Opus. The goal is to give GPT-5.5 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states.
That changes the user experience from:
@@ -92,15 +92,15 @@ to:
- “the model either acted, or OpenClaw surfaced the exact reason it could not”
-## Before vs after for GPT-5.4 users
+## Before vs after for GPT-5.5 users
| Before this program | After PR A-D |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
-| GPT-5.4 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
+| GPT-5.5 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
| Strict tool schemas could reject parameter-free or OpenAI/Codex-shaped tools in confusing ways | PR C makes provider-owned tool registration and invocation more predictable |
-| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.4 and the user truthful runtime and permission hints |
+| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.5 and the user truthful runtime and permission hints |
| Replay or compaction failures could feel like the task silently disappeared | PR C surfaces paused, blocked, abandoned, and replay-invalid outcomes explicitly |
-| “GPT-5.4 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |
+| “GPT-5.5 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |
## Architecture
@@ -123,7 +123,7 @@ flowchart TD
```mermaid
flowchart LR
A["Merged runtime slices (PR A-C)"] --> B["Run GPT-5.4 parity pack"]
A["Merged runtime slices (PR A-C)"] --> B["Run GPT-5.5 parity pack"]
A --> C["Run Opus 4.6 parity pack"]
B --> D["qa-suite-summary.json"]
C --> E["qa-suite-summary.json"]
@@ -162,7 +162,7 @@ Checks that a task with a real mutating write keeps replay-unsafety explicit ins
## Scenario matrix
-| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
+| Scenario | What it tests | Good GPT-5.5 behavior | Failure signal |
| ---------------------------------- | --------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
@@ -172,7 +172,7 @@ Checks that a task with a real mutating write keeps replay-unsafety explicit ins
## Release gate
-GPT-5.4 can only be considered at parity or better when the merged runtime passes the parity pack and the runtime-truthfulness regressions at the same time.
+GPT-5.5 can only be considered at parity or better when the merged runtime passes the parity pack and the runtime-truthfulness regressions at the same time.
Required outcomes:
@@ -191,24 +191,24 @@ For the first-wave harness, the gate compares:
Parity evidence is intentionally split across two layers:
-- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
+- PR D proves same-scenario GPT-5.5 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## Goal-to-evidence matrix
| Completion gate item | Owning PR | Evidence source | Pass signal |
| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
-| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
-| GPT-5.4 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
-| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
+| GPT-5.5 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
+| GPT-5.5 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
+| GPT-5.5 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
-| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
+| GPT-5.5 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
-- `pass` means GPT-5.4 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
+- `pass` means GPT-5.5 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
- `fail` means at least one hard gate tripped: weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.
- “shared/base CI issue” is not itself a parity result. If CI noise outside PR D blocks a run, the verdict should wait for a clean merged-runtime execution instead of being inferred from branch-era logs.
- Auth, proxy, DNS, and `/elevated full` truthfulness still come from PR B's deterministic suites, so the final release claim needs both: a passing PR D parity verdict and green PR B truthfulness coverage.
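For orientation, a minimal sketch of what that machine-readable verdict could look like; every field name below is an assumption inferred from the bullets above, not a confirmed schema:
```json
{
  "_note": "hypothetical sketch; field names inferred from the prose, not a confirmed schema",
  "candidate": "gpt-5.5",
  "baseline": "opus-4.6",
  "scenarioCoverageMatches": true,
  "fakeSuccessCount": 0,
  "regressions": [],
  "verdict": "pass"
}
```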
@@ -218,7 +218,7 @@ Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readabl
Use `strict-agentic` when:
- the agent is expected to act immediately when a next step is obvious
-- GPT-5.4 or Codex-family models are the primary runtime
+- GPT-5.5 or Codex-family models are the primary runtime
- you prefer explicit blocked states over “helpful” recap-only replies
Keep the default contract when:
@@ -229,4 +229,4 @@ Keep the default contract when:
## Related
-- [GPT-5.4 / Codex parity maintainer notes](/help/gpt54-codex-agentic-parity-maintainers)
+- [GPT-5.5 / Codex parity maintainer notes](/help/gpt55-codex-agentic-parity-maintainers)