docs: clarify parity verdict interpretation

Eva
2026-04-11 03:30:59 +07:00
committed by Peter Steinberger
parent db09edacfc
commit c73d005c7a
2 changed files with 29 additions and 0 deletions


@@ -141,3 +141,13 @@ The parity harness is not the only evidence source. Keep this split explicit in
- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |
| ----------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| GPT-5.4 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
| Tool use felt brittle with strict OpenAI/Codex schemas | PR C keeps tool registration and parameter-free invocation predictable |
| `/elevated full` hints were sometimes misleading | PR B ties guidance to actual runtime capability and blocked reasons |
| Long tasks could disappear into replay/compaction ambiguity | PR C emits explicit paused, blocked, abandoned, and replay-invalid state |
| Parity claims were anecdotal | PR D produces a report plus JSON verdict with the same scenario coverage on both models |


@@ -83,6 +83,16 @@ to:
- “the model either acted, or OpenClaw surfaced the exact reason it could not”
## Before vs after for GPT-5.4 users
| Before this program | After PR A-D |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| GPT-5.4 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
| Strict tool schemas could reject parameter-free or OpenAI/Codex-shaped tools in confusing ways | PR C makes provider-owned tool registration and invocation more predictable |
| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.4 and the user truthful runtime and permission hints |
| Replay or compaction failures could feel like the task silently disappeared | PR C surfaces paused, blocked, abandoned, and replay-invalid outcomes explicitly |
| “GPT-5.4 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |
## Architecture
```mermaid
@@ -170,6 +180,15 @@ Parity evidence is intentionally split across two layers:
- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
- `pass` means GPT-5.4 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
- `fail` means at least one hard gate tripped: weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.
- “shared/base CI issue” is not itself a parity result. If CI noise outside PR D blocks a run, hold the verdict until a clean merged-runtime execution completes instead of inferring it from branch-era logs.
- Auth, proxy, DNS, and `/elevated full` truthfulness still come from PR B's deterministic suites, so the final release claim needs both: a passing PR D parity verdict and green PR B truthfulness coverage.
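The release-gate logic above can be sketched as a small check. Only the `verdict` field with `pass`/`fail` values comes from this document; the function name `release_gate` and the boolean flag for PR B coverage are hypothetical plumbing, and the real schema of `qa-agentic-parity-summary.json` may carry more fields:

```python
import json

def release_gate(parity_summary_path: str, pr_b_truthfulness_green: bool) -> bool:
    """Sketch of the final release claim: it needs BOTH a passing PR D
    parity verdict AND green PR B deterministic truthfulness coverage.

    `pr_b_truthfulness_green` is a hypothetical stand-in for however the
    PR B suite results are reported; the JSON field name `verdict` is the
    one assumption taken from the docs.
    """
    with open(parity_summary_path) as f:
        summary = json.load(f)
    # "fail" or any missing/unknown verdict blocks the release claim;
    # a shared/base CI issue should never be recorded as a verdict here.
    parity_pass = summary.get("verdict") == "pass"
    return parity_pass and pr_b_truthfulness_green
```

Note that the two evidence layers stay independent: a passing parity verdict cannot compensate for missing PR B coverage, and vice versa.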
## Who should enable `strict-agentic`
Use `strict-agentic` when: