test(qa): add compaction retry parity scenario

This commit is contained in:
Eva
2026-04-11 05:35:08 +07:00
committed by Peter Steinberger
parent 3211aa2540
commit fd45ea2bf1
9 changed files with 230 additions and 8 deletions

View File

@@ -105,6 +105,7 @@ PR D is the proof layer. It should not be the reason runtime-correctness PRs are
### PR D
- the scenario pack is understandable and reproducible
- the pack includes a mutating replay-safety lane, not only read-only flows
- reports are readable by humans and automation
- parity claims are evidence-backed, not anecdotal
@@ -142,6 +143,16 @@ The parity harness is not the only evidence source. Keep this split explicit in
- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Goal-to-evidence map
| Completion gate item | Primary owner | Review artifact |
| ---------------------------------------- | ------------- | ------------------------------------------------------------------- |
| No plan-only stalls | PR A | strict-agentic runtime tests and `approval-turn-tool-followthrough` |
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |

View File

@@ -129,7 +129,7 @@ flowchart LR
## Scenario pack
The first-wave parity pack currently covers four scenarios:
The first-wave parity pack currently covers five scenarios:
### `approval-turn-tool-followthrough`
@@ -147,14 +147,19 @@ Checks that the model can read source and docs, synthesize findings, and continu
Checks that mixed-mode tasks involving attachments remain actionable and do not collapse into vague narration.
### `compaction-retry-mutating-tool`
Checks that a task with a real mutating write keeps replay-unsafety explicit instead of quietly looking replay-safe if the run compacts, retries, or loses reply state under pressure.
## Scenario matrix
| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
| ---------------------------------- | -------------------------------------- | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
| ---------------------------------- | --------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
| `compaction-retry-mutating-tool` | Mutating work under compaction pressure | Performs a real write and keeps replay-unsafety explicit after the side effect | mutating write happens but replay safety is implied, missing, or contradictory |
## Release gate
@@ -180,6 +185,16 @@ Parity evidence is intentionally split across two layers:
- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## Goal-to-evidence matrix
| Completion gate item | Owning PR | Evidence source | Pass signal |
| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
| GPT-5.4 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.