test(qa): add compaction retry parity scenario

Author: Eva
Date: 2026-04-11 05:35:08 +07:00
Committed by: Peter Steinberger
Parent: 3211aa2540
Commit: fd45ea2bf1
9 changed files with 230 additions and 8 deletions


@@ -105,6 +105,7 @@ PR D is the proof layer. It should not be the reason runtime-correctness PRs are
### PR D
- the scenario pack is understandable and reproducible
- the pack includes a mutating replay-safety lane, not only read-only flows
- reports are readable by humans and automation
- parity claims are evidence-backed, not anecdotal
@@ -142,6 +143,16 @@ The parity harness is not the only evidence source. Keep this split explicit in
- PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
- PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence
## Goal-to-evidence map
| Completion gate item | Primary owner | Review artifact |
| ---------------------------------------- | ------------- | ------------------------------------------------------------------- |
| No plan-only stalls | PR A | strict-agentic runtime tests and `approval-turn-tool-followthrough` |
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
## Reviewer shorthand: before vs after
| User-visible problem before | Review signal after |


@@ -129,7 +129,7 @@ flowchart LR
## Scenario pack
-The first-wave parity pack currently covers four scenarios:
+The first-wave parity pack currently covers five scenarios:
### `approval-turn-tool-followthrough`
@@ -147,14 +147,19 @@ Checks that the model can read source and docs, synthesize findings, and continu
Checks that mixed-mode tasks involving attachments remain actionable and do not collapse into vague narration.
### `compaction-retry-mutating-tool`
Checks that a task with a real mutating write keeps replay-unsafety explicit instead of quietly appearing replay-safe when the run compacts, retries, or loses reply state under pressure.
## Scenario matrix
-| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
-| --- | --- | --- | --- |
-| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
-| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
-| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
-| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
+| --- | --- | --- | --- |
+| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
+| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
+| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
+| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| `compaction-retry-mutating-tool` | Mutating work under compaction pressure | Performs a real write and keeps replay-unsafety explicit after the side effect | mutating write happens but replay safety is implied, missing, or contradictory |
## Release gate
@@ -180,6 +185,16 @@ Parity evidence is intentionally split across two layers:
- PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
- PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness
## Goal-to-evidence matrix
| Completion gate item | Owning PR | Evidence source | Pass signal |
| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
| GPT-5.4 no longer fakes progress or tool completion      | PR A + PR D | parity report scenario outcomes and fake-success count             | no suspicious pass results and no commentary-only completion                             |
| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
## How to read the parity verdict
Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
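A minimal sketch of consuming that machine-readable verdict, assuming a simplified schema (the `verdict` field name and scenario shape here are illustrative, not the summary file's actual contract):

```typescript
// Hypothetical, simplified shape of qa-agentic-parity-summary.json;
// the real schema may carry more fields.
interface ParityScenario {
  name: string;
  status: string;
}

interface ParitySummary {
  verdict: "pass" | "fail";
  scenarios: ParityScenario[];
}

// Treat the parity gate as passing only when the top-level verdict is
// "pass" and no individual scenario regressed.
function parityGatePasses(summary: ParitySummary): boolean {
  return (
    summary.verdict === "pass" &&
    summary.scenarios.every((scenario) => scenario.status === "pass")
  );
}

const example: ParitySummary = {
  verdict: "pass",
  scenarios: [
    { name: "Approval turn tool followthrough", status: "pass" },
    { name: "Compaction retry after mutating tool", status: "pass" },
  ],
};
```

Automation can branch on the boolean; humans still read `qa-agentic-parity-report.md` for the scenario-level detail behind it.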


@@ -137,6 +137,7 @@ describe("qa agentic parity report", () => {
candidateSummary: {
scenarios: [
{ name: "Approval turn tool followthrough", status: "pass" },
{ name: "Compaction retry after mutating tool", status: "pass" },
{ name: "Model switch with tool continuity", status: "pass" },
{ name: "Source and docs discovery report", status: "pass" },
{ name: "Image understanding from attachment", status: "pass" },
@@ -145,6 +146,7 @@ describe("qa agentic parity report", () => {
baselineSummary: {
scenarios: [
{ name: "Approval turn tool followthrough", status: "pass" },
{ name: "Compaction retry after mutating tool", status: "pass" },
{ name: "Model switch with tool continuity", status: "pass" },
{ name: "Source and docs discovery report", status: "pass" },
{ name: "Image understanding from attachment", status: "pass" },


@@ -17,6 +17,10 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
id: "image-understanding-attachment",
title: "Image understanding from attachment",
},
{
id: "compaction-retry-mutating-tool",
title: "Compaction retry after mutating tool",
},
] as const;
export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({ id }) => id);


@@ -334,6 +334,7 @@ describe("qa cli runtime", () => {
"model-switch-tool-continuity",
"source-docs-discovery-report",
"image-understanding-attachment",
"compaction-retry-mutating-tool",
],
}),
);


@@ -169,6 +169,77 @@ describe("qa mock openai server", () => {
]);
});
it("drives the compaction retry mutating tool parity flow", async () => {
const server = await startQaMockOpenAiServer({
host: "127.0.0.1",
port: 0,
});
cleanups.push(async () => {
await server.stop();
});
const writePlan = await fetch(`${server.baseUrl}/v1/responses`, {
method: "POST",
headers: {
"content-type": "application/json",
},
body: JSON.stringify({
stream: true,
model: "gpt-5.4",
input: [
{
role: "user",
content: [
{
type: "input_text",
text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
},
],
},
{
type: "function_call_output",
output: "compaction retry evidence block 0000\ncompaction retry evidence block 0001",
},
],
}),
});
expect(writePlan.status).toBe(200);
const writePlanBody = await writePlan.text();
expect(writePlanBody).toContain('"name":"write"');
expect(writePlanBody).toContain("compaction-retry-summary.txt");
const finalReply = await fetch(`${server.baseUrl}/v1/responses`, {
method: "POST",
headers: {
"content-type": "application/json",
},
body: JSON.stringify({
stream: false,
model: "gpt-5.4",
input: [
{
role: "user",
content: [
{
type: "input_text",
text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
},
],
},
{
type: "function_call_output",
output: "Replay safety: unsafe after write.\n",
},
],
}),
});
expect(finalReply.status).toBe(200);
const finalPayload = (await finalReply.json()) as {
output?: Array<{ content?: Array<{ text?: string }> }>;
};
expect(finalPayload.output?.[0]?.content?.[0]?.text).toContain("replay unsafe after write");
});
it("supports exact reply memory prompts and embeddings requests", async () => {
const server = await startQaMockOpenAiServer({
host: "127.0.0.1",


@@ -452,6 +452,12 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
}
return `Protocol note: Lobster Invaders built at lobster-invaders.html.`;
}
if (toolOutput && /compaction retry mutating tool check/i.test(prompt)) {
if (toolOutput.includes("Replay safety: unsafe after write.")) {
return "Protocol note: replay unsafe after write.";
}
return "";
}
if (toolOutput) {
const snippet = toolOutput.replace(/\s+/g, " ").trim().slice(0, 220);
return `Protocol note: I reviewed the requested material. Evidence snippet: ${snippet || "no content"}`;
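The fallback branch above squeezes arbitrary tool output into one short evidence line; a standalone sketch of the same normalization (helper names here are illustrative, not from the codebase):

```typescript
// Collapse whitespace runs to single spaces, trim the ends, and cap the
// result at 220 characters so the protocol note stays one short line.
function toEvidenceSnippet(toolOutput: string): string {
  return toolOutput.replace(/\s+/g, " ").trim().slice(0, 220);
}

// Mirror the fallback reply, substituting "no content" when the snippet
// normalizes to an empty string.
function evidenceNote(toolOutput: string): string {
  const snippet = toEvidenceSnippet(toolOutput);
  return `Protocol note: I reviewed the requested material. Evidence snippet: ${snippet || "no content"}`;
}
```

The 220-character cap keeps mock replies deterministic in length regardless of how large the seeded tool output gets.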
@@ -541,6 +547,17 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
});
}
}
if (/compaction retry mutating tool check/i.test(prompt)) {
if (!toolOutput) {
return buildToolCallEventsWithArgs("read", { path: "COMPACTION_RETRY_CONTEXT.md" });
}
if (toolOutput.includes("compaction retry evidence")) {
return buildToolCallEventsWithArgs("write", {
path: "compaction-retry-summary.txt",
content: "Replay safety: unsafe after write.\n",
});
}
}
if (/memory tools check/i.test(prompt)) {
if (!toolOutput) {
return buildToolCallEventsWithArgs("memory_search", {


@@ -7,6 +7,7 @@ Use this when tuning the harness on frontier models before the small-model pass.
- verify tool-first behavior on short approval turns
- verify model switching does not kill tool use
- verify repo-reading / discovery still finishes with a concrete report
- verify mutating work keeps replay-unsafety explicit under compaction pressure
- collect manual notes on personality without letting style hide execution regressions
## Frontier subset
@@ -19,6 +20,7 @@ Run this subset first on every harness tweak:
Longer spot-check after that:
- `compaction-retry-mutating-tool`
- `subagent-handoff`
## Baseline order
@@ -84,6 +86,7 @@ Use the QA Lab runner catalog or `openclaw models list --all` to pick the curren
- empty-promise rate
- tool continuity after model switch
- discovery report completeness and specificity
- replay-safety truth after a mutating write
- scope drift: unrelated scenario updates, grand wrap-ups, or invented completion tallies
- latency / obvious stall behavior
- token cost notes if a change makes the prompt materially heavier
@@ -126,4 +129,4 @@ Score it on:
## Deferred
- post-compaction next-action continuity should become an executable lane once we have a deterministic compaction trigger in QA
- deterministic mock compaction triggering is still deferred; the current replay-safety lane is a live-frontier-first executable scenario


@@ -0,0 +1,98 @@
# Compaction retry after mutating tool
```yaml qa-scenario
id: compaction-retry-mutating-tool
title: Compaction retry after mutating tool
surface: runtime
objective: Verify a real mutating tool step keeps replay-unsafety explicit instead of disappearing into a clean-looking success if the run compacts or retries.
successCriteria:
- Agent reads the seeded large context before it writes.
- Agent performs a real mutating write in the workspace.
- Final reply explicitly says replay is unsafe after the write instead of implying replay-safe completion.
- Scenario details preserve the observed compaction count for review context.
docsRefs:
- docs/help/testing.md
- docs/help/gpt54-codex-agentic-parity.md
codeRefs:
- extensions/qa-lab/src/suite.ts
- extensions/qa-lab/src/mock-openai-server.ts
- src/agents/pi-embedded-subscribe.ts
- src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
execution:
kind: flow
summary: Verify a mutating tool step keeps replay-unsafety explicit through compaction or retry pressure.
config:
contextFile: COMPACTION_RETRY_CONTEXT.md
outputFile: compaction-retry-summary.txt
promptSnippet: Compaction retry mutating tool check
replayUnsafeNeedle: replay unsafe after write
prompt: |-
Compaction retry mutating tool check: read `COMPACTION_RETRY_CONTEXT.md`, then create `compaction-retry-summary.txt` in the workspace containing exactly this line:
Replay safety: unsafe after write.
After the write, reply with one short sentence that includes the exact phrase `replay unsafe after write`.
If the run gets compacted or retried, keep that replay-safety truth explicit and do not claim the task stayed replay-safe.
```
```yaml qa-flow
steps:
- name: keeps replay-unsafety explicit after a mutating write under compaction pressure
actions:
- call: waitForGatewayHealthy
args:
- ref: env
- 60000
- call: reset
- call: fs.writeFile
args:
- expr: "path.join(env.gateway.workspaceDir, config.contextFile)"
- expr: "Array.from({ length: env.providerMode === 'mock-openai' ? 160 : 2200 }, (_, index) => `compaction retry evidence block ${String(index).padStart(4, '0')} :: keep replay truth explicit after side effects`).join('\\n')"
- utf8
- set: sessionKey
value:
expr: "`agent:qa:compaction-retry:${randomUUID().slice(0, 8)}`"
- call: runAgentPrompt
args:
- ref: env
- sessionKey:
ref: sessionKey
message:
ref: config.prompt
timeoutMs:
expr: liveTurnTimeoutMs(env, 90000)
- call: waitForCondition
saveAs: outbound
args:
- lambda:
expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && normalizeLowercaseStringOrEmpty(candidate.text).includes(config.replayUnsafeNeedle)).at(-1)"
- expr: liveTurnTimeoutMs(env, 45000)
- expr: "env.providerMode === 'mock-openai' ? 100 : 250"
- call: fs.readFile
saveAs: writtenSummary
args:
- expr: "path.join(env.gateway.workspaceDir, config.outputFile)"
- utf8
- assert:
expr: "writtenSummary.includes('Replay safety: unsafe after write.')"
message:
expr: "`summary file missed replay marker: ${writtenSummary}`"
- if:
expr: "Boolean(env.mock)"
then:
- assert:
expr: "!env.mock || ([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName === 'write')"
message:
expr: "`expected write after seeded context read, got ${String(([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName ?? '')}`"
- call: readRawQaSessionStore
saveAs: store
args:
- ref: env
- set: sessionEntry
value:
expr: "store[sessionKey]"
- assert:
expr: "Boolean(sessionEntry)"
message:
expr: "`missing QA session entry for ${sessionKey}`"
detailsExpr: "`${outbound.text}\\ncompactionCount=${String(sessionEntry?.compactionCount ?? 0)}\\nstatus=${String(sessionEntry?.status ?? 'unknown')}`"
```