test(qa): add compaction retry parity scenario

Author: Eva
Date: 2026-04-11 05:35:08 +07:00
Committed by: Peter Steinberger
Parent: 3211aa2540
Commit: fd45ea2bf1
9 changed files with 230 additions and 8 deletions


@@ -105,6 +105,7 @@ PR D is the proof layer. It should not be the reason runtime-correctness PRs are
### PR D
- the scenario pack is understandable and reproducible
- the pack includes a mutating replay-safety lane, not only read-only flows
- reports are readable by humans and automation
- parity claims are evidence-backed, not anecdotal
@@ -142,6 +143,16 @@ The parity harness is not the only evidence source. Keep this split explicit in
- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Goal-to-evidence map
| Completion gate item | Primary owner | Review artifact |
| ---------------------------------------- | ------------- | ------------------------------------------------------------------- |
| No plan-only stalls | PR A | strict-agentic runtime tests and `approval-turn-tool-followthrough` |
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |


@@ -129,7 +129,7 @@ flowchart LR
## Scenario pack
-The first-wave parity pack currently covers four scenarios:
+The first-wave parity pack currently covers five scenarios:
### `approval-turn-tool-followthrough`
@@ -147,14 +147,19 @@ Checks that the model can read source and docs, synthesize findings, and continu
Checks that mixed-mode tasks involving attachments remain actionable and do not collapse into vague narration.
### `compaction-retry-mutating-tool`
Checks that a task with a real mutating write keeps replay-unsafety explicit instead of quietly appearing replay-safe when the run compacts, retries, or loses reply state under pressure.
## Scenario matrix
-| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
-| --- | --- | --- | --- |
-| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
-| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
-| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
-| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
+| --- | --- | --- | --- |
+| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
+| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
+| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
+| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| `compaction-retry-mutating-tool` | Mutating work under compaction pressure | Performs a real write and keeps replay-unsafety explicit after the side effect | mutating write happens but replay safety is implied, missing, or contradictory |
## Release gate
@@ -180,6 +185,16 @@ Parity evidence is intentionally split across two layers:
- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## Goal-to-evidence matrix
| Completion gate item | Owning PR | Evidence source | Pass signal |
| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
| GPT-5.4 no longer fakes progress or tool completion      | PR A + PR D | parity report scenario outcomes and fake-success count             | no suspicious pass results and no commentary-only completion                             |
| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
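A minimal sketch of consuming that machine-readable verdict, assuming a simplified schema (the `verdict` field name and scenario shape here are illustrative, not the summary file's actual contract):

```typescript
// Hypothetical, simplified shape of qa-agentic-parity-summary.json;
// the real schema may carry more fields.
interface ParityScenario {
  name: string;
  status: string;
}

interface ParitySummary {
  verdict: "pass" | "fail";
  scenarios: ParityScenario[];
}

// Treat the parity gate as passing only when the top-level verdict is
// "pass" and no individual scenario regressed.
function parityGatePasses(summary: ParitySummary): boolean {
  return (
    summary.verdict === "pass" &&
    summary.scenarios.every((scenario) => scenario.status === "pass")
  );
}

const example: ParitySummary = {
  verdict: "pass",
  scenarios: [
    { name: "Approval turn tool followthrough", status: "pass" },
    { name: "Compaction retry after mutating tool", status: "pass" },
  ],
};
```

Automation can branch on the boolean; humans still read `qa-agentic-parity-report.md` for the scenario-level detail behind it.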


@@ -137,6 +137,7 @@ describe("qa agentic parity report", () => {
candidateSummary: {
scenarios: [
{ name: "Approval turn tool followthrough", status: "pass" },
{ name: "Compaction retry after mutating tool", status: "pass" },
{ name: "Model switch with tool continuity", status: "pass" },
{ name: "Source and docs discovery report", status: "pass" },
{ name: "Image understanding from attachment", status: "pass" },
@@ -145,6 +146,7 @@ describe("qa agentic parity report", () => {
baselineSummary: {
scenarios: [
{ name: "Approval turn tool followthrough", status: "pass" },
{ name: "Compaction retry after mutating tool", status: "pass" },
{ name: "Model switch with tool continuity", status: "pass" },
{ name: "Source and docs discovery report", status: "pass" },
{ name: "Image understanding from attachment", status: "pass" },


@@ -17,6 +17,10 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
id: "image-understanding-attachment",
title: "Image understanding from attachment",
},
{
id: "compaction-retry-mutating-tool",
title: "Compaction retry after mutating tool",
},
] as const;
export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({ id }) => id);


@@ -334,6 +334,7 @@ describe("qa cli runtime", () => {
"model-switch-tool-continuity",
"source-docs-discovery-report",
"image-understanding-attachment",
"compaction-retry-mutating-tool",
],
}),
);


@@ -169,6 +169,77 @@ describe("qa mock openai server", () => {
]);
});
it("drives the compaction retry mutating tool parity flow", async () => {
const server = await startQaMockOpenAiServer({
host: "127.0.0.1",
port: 0,
});
cleanups.push(async () => {
await server.stop();
});
const writePlan = await fetch(`${server.baseUrl}/v1/responses`, {
method: "POST",
headers: {
"content-type": "application/json",
},
body: JSON.stringify({
stream: true,
model: "gpt-5.4",
input: [
{
role: "user",
content: [
{
type: "input_text",
text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
},
],
},
{
type: "function_call_output",
output: "compaction retry evidence block 0000\ncompaction retry evidence block 0001",
},
],
}),
});
expect(writePlan.status).toBe(200);
const writePlanBody = await writePlan.text();
expect(writePlanBody).toContain('"name":"write"');
expect(writePlanBody).toContain("compaction-retry-summary.txt");
const finalReply = await fetch(`${server.baseUrl}/v1/responses`, {
method: "POST",
headers: {
"content-type": "application/json",
},
body: JSON.stringify({
stream: false,
model: "gpt-5.4",
input: [
{
role: "user",
content: [
{
type: "input_text",
text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
},
],
},
{
type: "function_call_output",
output: "Replay safety: unsafe after write.\n",
},
],
}),
});
expect(finalReply.status).toBe(200);
const finalPayload = (await finalReply.json()) as {
output?: Array<{ content?: Array<{ text?: string }> }>;
};
expect(finalPayload.output?.[0]?.content?.[0]?.text).toContain("replay unsafe after write");
});
it("supports exact reply memory prompts and embeddings requests", async () => {
const server = await startQaMockOpenAiServer({
host: "127.0.0.1",


@@ -452,6 +452,12 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
}
return `Protocol note: Lobster Invaders built at lobster-invaders.html.`;
}
if (toolOutput && /compaction retry mutating tool check/i.test(prompt)) {
if (toolOutput.includes("Replay safety: unsafe after write.")) {
return "Protocol note: replay unsafe after write.";
}
return "";
}
if (toolOutput) {
const snippet = toolOutput.replace(/\s+/g, " ").trim().slice(0, 220);
return `Protocol note: I reviewed the requested material. Evidence snippet: ${snippet || "no content"}`;
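The fallback branch above squeezes arbitrary tool output into one short evidence line; a standalone sketch of the same normalization (helper names here are illustrative, not from the codebase):

```typescript
// Collapse whitespace runs to single spaces, trim the ends, and cap the
// result at 220 characters so the protocol note stays one short line.
function toEvidenceSnippet(toolOutput: string): string {
  return toolOutput.replace(/\s+/g, " ").trim().slice(0, 220);
}

// Mirror the fallback reply, substituting "no content" when the snippet
// normalizes to an empty string.
function evidenceNote(toolOutput: string): string {
  const snippet = toEvidenceSnippet(toolOutput);
  return `Protocol note: I reviewed the requested material. Evidence snippet: ${snippet || "no content"}`;
}
```

The 220-character cap keeps mock replies deterministic in length regardless of how large the seeded tool output gets.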
@@ -541,6 +547,17 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
});
}
}
if (/compaction retry mutating tool check/i.test(prompt)) {
if (!toolOutput) {
return buildToolCallEventsWithArgs("read", { path: "COMPACTION_RETRY_CONTEXT.md" });
}
if (toolOutput.includes("compaction retry evidence")) {
return buildToolCallEventsWithArgs("write", {
path: "compaction-retry-summary.txt",
content: "Replay safety: unsafe after write.\n",
});
}
}
if (/memory tools check/i.test(prompt)) {
if (!toolOutput) {
return buildToolCallEventsWithArgs("memory_search", {


@@ -7,6 +7,7 @@ Use this when tuning the harness on frontier models before the small-model pass.
- verify tool-first behavior on short approval turns
- verify model switching does not kill tool use
- verify repo-reading / discovery still finishes with a concrete report
- verify mutating work keeps replay-unsafety explicit under compaction pressure
- collect manual notes on personality without letting style hide execution regressions
## Frontier subset
@@ -19,6 +20,7 @@ Run this subset first on every harness tweak:
Longer spot-check after that:
- `compaction-retry-mutating-tool`
- `subagent-handoff`
## Baseline order
@@ -84,6 +86,7 @@ Use the QA Lab runner catalog or `openclaw models list --all` to pick the curren
- empty-promise rate
- tool continuity after model switch
- discovery report completeness and specificity
- replay-safety truth after a mutating write
- scope drift: unrelated scenario updates, grand wrap-ups, or invented completion tallies
- latency / obvious stall behavior
- token cost notes if a change makes the prompt materially heavier
@@ -126,4 +129,4 @@ Score it on:
## Deferred
- post-compaction next-action continuity should become an executable lane once we have a deterministic compaction trigger in QA
- deterministic mock compaction triggering is still deferred; the current replay-safety lane is a live-frontier-first executable scenario


@@ -0,0 +1,98 @@
# Compaction retry after mutating tool
```yaml qa-scenario
id: compaction-retry-mutating-tool
title: Compaction retry after mutating tool
surface: runtime
objective: Verify a real mutating tool step keeps replay-unsafety explicit instead of disappearing into a clean-looking success if the run compacts or retries.
successCriteria:
- Agent reads the seeded large context before it writes.
- Agent performs a real mutating write in the workspace.
- Final reply explicitly says replay is unsafe after the write instead of implying replay-safe completion.
- Scenario details preserve the observed compaction count for review context.
docsRefs:
- docs/help/testing.md
- docs/help/gpt54-codex-agentic-parity.md
codeRefs:
- extensions/qa-lab/src/suite.ts
- extensions/qa-lab/src/mock-openai-server.ts
- src/agents/pi-embedded-subscribe.ts
- src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
execution:
kind: flow
summary: Verify a mutating tool step keeps replay-unsafety explicit through compaction or retry pressure.
config:
contextFile: COMPACTION_RETRY_CONTEXT.md
outputFile: compaction-retry-summary.txt
promptSnippet: Compaction retry mutating tool check
replayUnsafeNeedle: replay unsafe after write
prompt: |-
Compaction retry mutating tool check: read `COMPACTION_RETRY_CONTEXT.md`, then create `compaction-retry-summary.txt` in the workspace containing exactly this line:
Replay safety: unsafe after write.
After the write, reply with one short sentence that includes the exact phrase `replay unsafe after write`.
If the run gets compacted or retried, keep that replay-safety truth explicit and do not claim the task stayed replay-safe.
```
```yaml qa-flow
steps:
- name: keeps replay-unsafety explicit after a mutating write under compaction pressure
actions:
- call: waitForGatewayHealthy
args:
- ref: env
- 60000
- call: reset
- call: fs.writeFile
args:
- expr: "path.join(env.gateway.workspaceDir, config.contextFile)"
- expr: "Array.from({ length: env.providerMode === 'mock-openai' ? 160 : 2200 }, (_, index) => `compaction retry evidence block ${String(index).padStart(4, '0')} :: keep replay truth explicit after side effects`).join('\\n')"
- utf8
- set: sessionKey
value:
expr: "`agent:qa:compaction-retry:${randomUUID().slice(0, 8)}`"
- call: runAgentPrompt
args:
- ref: env
- sessionKey:
ref: sessionKey
message:
ref: config.prompt
timeoutMs:
expr: liveTurnTimeoutMs(env, 90000)
- call: waitForCondition
saveAs: outbound
args:
- lambda:
expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && normalizeLowercaseStringOrEmpty(candidate.text).includes(config.replayUnsafeNeedle)).at(-1)"
- expr: liveTurnTimeoutMs(env, 45000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- call: fs.readFile
saveAs: writtenSummary
args:
- expr: "path.join(env.gateway.workspaceDir, config.outputFile)"
- utf8
- assert:
expr: "writtenSummary.includes('Replay safety: unsafe after write.')"
message:
expr: "`summary file missed replay marker: ${writtenSummary}`"
- if:
expr: "Boolean(env.mock)"
then:
- assert:
expr: "!env.mock || ([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName === 'write')"
message:
expr: "`expected write after seeded context read, got ${String(([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName ?? '')}`"
- call: readRawQaSessionStore
saveAs: store
args:
- ref: env
- set: sessionEntry
value:
expr: "store[sessionKey]"
- assert:
expr: "Boolean(sessionEntry)"
message:
expr: "`missing QA session entry for ${sessionKey}`"
detailsExpr: "`${outbound.text}\\ncompactionCount=${String(sessionEntry?.compactionCount ?? 0)}\\nstatus=${String(sessionEntry?.status ?? 'unknown')}`"
```