mirror of https://github.com/openclaw/openclaw.git, synced 2026-04-12 01:31:08 +00:00
test(qa): add compaction retry parity scenario
@@ -105,6 +105,7 @@ PR D is the proof layer. It should not be the reason runtime-correctness PRs are
 ### PR D

 - the scenario pack is understandable and reproducible
+- the pack includes a mutating replay-safety lane, not only read-only flows
 - reports are readable by humans and automation
 - parity claims are evidence-backed, not anecdotal

@@ -142,6 +143,16 @@ The parity harness is not the only evidence source. Keep this split explicit in
 - PR D owns the scenario-based GPT-5.4 vs Opus 4.6 comparison
 - PR B deterministic suites still own auth/proxy/DNS and full-access truthfulness evidence

+## Goal-to-evidence map
+
+| Completion gate item | Primary owner | Review artifact |
+| ---------------------------------------- | ------------- | ------------------------------------------------------------------- |
+| No plan-only stalls | PR A | strict-agentic runtime tests and `approval-turn-tool-followthrough` |
+| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
+| No false `/elevated full` guidance | PR B | deterministic runtime-truthfulness suites |
+| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus `compaction-retry-mutating-tool` |
+| GPT-5.4 matches or beats Opus 4.6 | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` |
+
 ## Reviewer shorthand: before vs after

 | User-visible problem before | Review signal after |
@@ -129,7 +129,7 @@ flowchart LR

 ## Scenario pack

-The first-wave parity pack currently covers four scenarios:
+The first-wave parity pack currently covers five scenarios:

 ### `approval-turn-tool-followthrough`

@@ -147,14 +147,19 @@ Checks that the model can read source and docs, synthesize findings, and continu

 Checks that mixed-mode tasks involving attachments remain actionable and do not collapse into vague narration.

+### `compaction-retry-mutating-tool`
+
+Checks that a task with a real mutating write keeps replay-unsafety explicit instead of quietly looking replay-safe if the run compacts, retries, or loses reply state under pressure.
+
 ## Scenario matrix

-| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
-| ---------------------------------- | -------------------------------------- | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
-| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
-| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
-| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
-| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
+| ---------------------------------- | --------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
+| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
+| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
+| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
+| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
+| `compaction-retry-mutating-tool` | Mutating work under compaction pressure | Performs a real write and keeps replay-unsafety explicit after the side effect | mutating write happens but replay safety is implied, missing, or contradictory |

 ## Release gate

@@ -180,6 +185,16 @@ Parity evidence is intentionally split across two layers:
 - PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
 - PR B deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness

+## Goal-to-evidence matrix
+
+| Completion gate item | Owning PR | Evidence source | Pass signal |
+| -------------------------------------------------------- | ----------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
+| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
+| GPT-5.4 no longer fakes progress or fake tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
+| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
+| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
+| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |
+
 ## How to read the parity verdict

 Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.
@@ -137,6 +137,7 @@ describe("qa agentic parity report", () => {
       candidateSummary: {
         scenarios: [
           { name: "Approval turn tool followthrough", status: "pass" },
+          { name: "Compaction retry after mutating tool", status: "pass" },
           { name: "Model switch with tool continuity", status: "pass" },
           { name: "Source and docs discovery report", status: "pass" },
           { name: "Image understanding from attachment", status: "pass" },
@@ -145,6 +146,7 @@ describe("qa agentic parity report", () => {
       baselineSummary: {
         scenarios: [
           { name: "Approval turn tool followthrough", status: "pass" },
+          { name: "Compaction retry after mutating tool", status: "pass" },
           { name: "Model switch with tool continuity", status: "pass" },
           { name: "Source and docs discovery report", status: "pass" },
           { name: "Image understanding from attachment", status: "pass" },
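
The fixtures above give each summary a `scenarios` array of `{ name, status }` entries, and the goal-to-evidence matrix names "same scenario coverage and no regression" as the pass signal. A minimal sketch of that comparison follows. The `SuiteSummary` type, the `"fail"` status value, and `compareScenarioCoverage` are hypothetical names for illustration only; the real schema of `qa-agentic-parity-summary.json` is not shown in this commit and may differ.

```typescript
// Hypothetical shapes modeled on the test fixtures; not the real report schema.
type ScenarioStatus = "pass" | "fail";

interface ScenarioResult {
  name: string;
  status: ScenarioStatus;
}

interface SuiteSummary {
  scenarios: ScenarioResult[];
}

interface CoverageCheck {
  missingFromCandidate: string[];
  missingFromBaseline: string[];
  candidateRegressions: string[];
}

// Compares two summaries by scenario name and reports coverage gaps plus
// scenarios where the baseline passed but the candidate did not.
function compareScenarioCoverage(candidate: SuiteSummary, baseline: SuiteSummary): CoverageCheck {
  const candidateByName = new Map(candidate.scenarios.map((s) => [s.name, s.status] as const));
  const baselineByName = new Map(baseline.scenarios.map((s) => [s.name, s.status] as const));

  return {
    missingFromCandidate: baseline.scenarios
      .filter((s) => !candidateByName.has(s.name))
      .map((s) => s.name),
    missingFromBaseline: candidate.scenarios
      .filter((s) => !baselineByName.has(s.name))
      .map((s) => s.name),
    candidateRegressions: candidate.scenarios
      .filter((s) => s.status !== "pass" && baselineByName.get(s.name) === "pass")
      .map((s) => s.name),
  };
}
```

Under these assumptions, a parity claim would require all three arrays to be empty; adding the new scenario to only one side of the fixture would surface as a coverage gap rather than a silent pass.
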
@@ -17,6 +17,10 @@ export const QA_AGENTIC_PARITY_SCENARIOS = [
     id: "image-understanding-attachment",
     title: "Image understanding from attachment",
   },
+  {
+    id: "compaction-retry-mutating-tool",
+    title: "Compaction retry after mutating tool",
+  },
 ] as const;

 export const QA_AGENTIC_PARITY_SCENARIO_IDS = QA_AGENTIC_PARITY_SCENARIOS.map(({ id }) => id);
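
Because the scenario list is declared `as const`, adding one entry is enough for the derived ID list to pick it up. The sketch below illustrates that pattern on a shortened two-entry list. `SCENARIOS`, `ScenarioId`, `SCENARIO_IDS`, and `isScenarioId` are illustrative names, not identifiers from the repository.

```typescript
// Shortened, illustrative copy of the scenario list (two entries only).
const SCENARIOS = [
  { id: "image-understanding-attachment", title: "Image understanding from attachment" },
  { id: "compaction-retry-mutating-tool", title: "Compaction retry after mutating tool" },
] as const;

// Union of the literal ids, derived from the tuple so there is no second list to keep in sync.
type ScenarioId = (typeof SCENARIOS)[number]["id"];

// Mapping a readonly tuple of literals yields an array typed with the literal union.
const SCENARIO_IDS: readonly ScenarioId[] = SCENARIOS.map(({ id }) => id);

// Type guard for untrusted input such as a CLI argument.
function isScenarioId(value: string): value is ScenarioId {
  return SCENARIO_IDS.some((id) => id === value);
}
```

With this shape, a typo in a scenario id passed to code typed against `ScenarioId` fails at compile time, and runtime input can be narrowed with the guard.
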
@@ -334,6 +334,7 @@ describe("qa cli runtime", () => {
           "model-switch-tool-continuity",
           "source-docs-discovery-report",
           "image-understanding-attachment",
+          "compaction-retry-mutating-tool",
         ],
       }),
     );
@@ -169,6 +169,77 @@ describe("qa mock openai server", () => {
     ]);
   });

+  it("drives the compaction retry mutating tool parity flow", async () => {
+    const server = await startQaMockOpenAiServer({
+      host: "127.0.0.1",
+      port: 0,
+    });
+    cleanups.push(async () => {
+      await server.stop();
+    });
+
+    const writePlan = await fetch(`${server.baseUrl}/v1/responses`, {
+      method: "POST",
+      headers: {
+        "content-type": "application/json",
+      },
+      body: JSON.stringify({
+        stream: true,
+        model: "gpt-5.4",
+        input: [
+          {
+            role: "user",
+            content: [
+              {
+                type: "input_text",
+                text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
+              },
+            ],
+          },
+          {
+            type: "function_call_output",
+            output: "compaction retry evidence block 0000\ncompaction retry evidence block 0001",
+          },
+        ],
+      }),
+    });
+    expect(writePlan.status).toBe(200);
+    const writePlanBody = await writePlan.text();
+    expect(writePlanBody).toContain('"name":"write"');
+    expect(writePlanBody).toContain("compaction-retry-summary.txt");
+
+    const finalReply = await fetch(`${server.baseUrl}/v1/responses`, {
+      method: "POST",
+      headers: {
+        "content-type": "application/json",
+      },
+      body: JSON.stringify({
+        stream: false,
+        model: "gpt-5.4",
+        input: [
+          {
+            role: "user",
+            content: [
+              {
+                type: "input_text",
+                text: "Compaction retry mutating tool check: read COMPACTION_RETRY_CONTEXT.md, then create compaction-retry-summary.txt and keep replay safety explicit.",
+              },
+            ],
+          },
+          {
+            type: "function_call_output",
+            output: "Replay safety: unsafe after write.\n",
+          },
+        ],
+      }),
+    });
+    expect(finalReply.status).toBe(200);
+    const finalPayload = (await finalReply.json()) as {
+      output?: Array<{ content?: Array<{ text?: string }> }>;
+    };
+    expect(finalPayload.output?.[0]?.content?.[0]?.text).toContain("replay unsafe after write");
+  });
+
   it("supports exact reply memory prompts and embeddings requests", async () => {
     const server = await startQaMockOpenAiServer({
       host: "127.0.0.1",
@@ -452,6 +452,12 @@ function buildAssistantText(input: ResponsesInputItem[], body: Record<string, un
     }
     return `Protocol note: Lobster Invaders built at lobster-invaders.html.`;
   }
+  if (toolOutput && /compaction retry mutating tool check/i.test(prompt)) {
+    if (toolOutput.includes("Replay safety: unsafe after write.")) {
+      return "Protocol note: replay unsafe after write.";
+    }
+    return "";
+  }
   if (toolOutput) {
     const snippet = toolOutput.replace(/\s+/g, " ").trim().slice(0, 220);
     return `Protocol note: I reviewed the requested material. Evidence snippet: ${snippet || "no content"}`;
@@ -541,6 +547,17 @@ async function buildResponsesPayload(body: Record<string, unknown>) {
       });
     }
   }
+  if (/compaction retry mutating tool check/i.test(prompt)) {
+    if (!toolOutput) {
+      return buildToolCallEventsWithArgs("read", { path: "COMPACTION_RETRY_CONTEXT.md" });
+    }
+    if (toolOutput.includes("compaction retry evidence")) {
+      return buildToolCallEventsWithArgs("write", {
+        path: "compaction-retry-summary.txt",
+        content: "Replay safety: unsafe after write.\n",
+      });
+    }
+  }
   if (/memory tools check/i.test(prompt)) {
     if (!toolOutput) {
       return buildToolCallEventsWithArgs("memory_search", {
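
Taken together, the two mock-server hunks above behave like a small three-step script keyed on the latest tool output: read the context file, write the summary, then reply with the replay marker. The sketch below collapses both branches into one standalone function to make that ordering visible. `MockStep` and `nextMockStep` are hypothetical names, and the sketch assumes the payload builder falls through to the text reply when neither tool branch matches, which the diff suggests but does not show.

```typescript
// Hypothetical condensed model of the mock's compaction-retry script.
type MockStep =
  | { kind: "tool"; name: "read" | "write"; args: Record<string, string> }
  | { kind: "reply"; text: string };

const COMPACTION_RETRY_PROMPT = /compaction retry mutating tool check/i;

// Returns the next scripted step, or undefined when the prompt is not this scenario.
function nextMockStep(prompt: string, toolOutput: string): MockStep | undefined {
  if (!COMPACTION_RETRY_PROMPT.test(prompt)) {
    return undefined;
  }
  if (!toolOutput) {
    // Step 1: no tool output yet, so plan the seeded context read.
    return { kind: "tool", name: "read", args: { path: "COMPACTION_RETRY_CONTEXT.md" } };
  }
  if (toolOutput.includes("compaction retry evidence")) {
    // Step 2: the read result came back, so plan the mutating write.
    return {
      kind: "tool",
      name: "write",
      args: { path: "compaction-retry-summary.txt", content: "Replay safety: unsafe after write.\n" },
    };
  }
  if (toolOutput.includes("Replay safety: unsafe after write.")) {
    // Step 3: the write result came back, so state replay-unsafety explicitly.
    return { kind: "reply", text: "Protocol note: replay unsafe after write." };
  }
  // Unrecognized tool output: empty text, mirroring the `return "";` branch.
  return { kind: "reply", text: "" };
}
```

The empty reply for unrecognized tool output matters: it keeps the mock from emitting a generic "I reviewed the requested material" note that could be mistaken for a clean completion.
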
@@ -7,6 +7,7 @@ Use this when tuning the harness on frontier models before the small-model pass.
 - verify tool-first behavior on short approval turns
 - verify model switching does not kill tool use
 - verify repo-reading / discovery still finishes with a concrete report
+- verify mutating work keeps replay-unsafety explicit under compaction pressure
 - collect manual notes on personality without letting style hide execution regressions

 ## Frontier subset
@@ -19,6 +20,7 @@ Run this subset first on every harness tweak:

 Longer spot-check after that:

+- `compaction-retry-mutating-tool`
 - `subagent-handoff`

 ## Baseline order
@@ -84,6 +86,7 @@ Use the QA Lab runner catalog or `openclaw models list --all` to pick the curren
 - empty-promise rate
 - tool continuity after model switch
 - discovery report completeness and specificity
+- replay-safety truth after a mutating write
 - scope drift: unrelated scenario updates, grand wrap-ups, or invented completion tallies
 - latency / obvious stall behavior
 - token cost notes if a change makes the prompt materially heavier
@@ -126,4 +129,4 @@ Score it on:

 ## Deferred

-- post-compaction next-action continuity should become an executable lane once we have a deterministic compaction trigger in QA
+- deterministic mock compaction triggering is still deferred; the current replay-safety lane is a live-frontier-first executable scenario
98
qa/scenarios/compaction-retry-mutating-tool.md
Normal file
@@ -0,0 +1,98 @@
+# Compaction retry after mutating tool
+
+```yaml qa-scenario
+id: compaction-retry-mutating-tool
+title: Compaction retry after mutating tool
+surface: runtime
+objective: Verify a real mutating tool step keeps replay-unsafety explicit instead of disappearing into a clean-looking success if the run compacts or retries.
+successCriteria:
+  - Agent reads the seeded large context before it writes.
+  - Agent performs a real mutating write in the workspace.
+  - Final reply explicitly says replay is unsafe after the write instead of implying replay-safe completion.
+  - Scenario details preserve the observed compaction count for review context.
+docsRefs:
+  - docs/help/testing.md
+  - docs/help/gpt54-codex-agentic-parity.md
+codeRefs:
+  - extensions/qa-lab/src/suite.ts
+  - extensions/qa-lab/src/mock-openai-server.ts
+  - src/agents/pi-embedded-subscribe.ts
+  - src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
+execution:
+  kind: flow
+  summary: Verify a mutating tool step keeps replay-unsafety explicit through compaction or retry pressure.
+  config:
+    contextFile: COMPACTION_RETRY_CONTEXT.md
+    outputFile: compaction-retry-summary.txt
+    promptSnippet: Compaction retry mutating tool check
+    replayUnsafeNeedle: replay unsafe after write
+    prompt: |-
+      Compaction retry mutating tool check: read `COMPACTION_RETRY_CONTEXT.md`, then create `compaction-retry-summary.txt` in the workspace containing exactly this line:
+      Replay safety: unsafe after write.
+
+      After the write, reply with one short sentence that includes the exact phrase `replay unsafe after write`.
+      If the run gets compacted or retried, keep that replay-safety truth explicit and do not claim the task stayed replay-safe.
+```
+
+```yaml qa-flow
+steps:
+  - name: keeps replay-unsafety explicit after a mutating write under compaction pressure
+    actions:
+      - call: waitForGatewayHealthy
+        args:
+          - ref: env
+          - 60000
+      - call: reset
+      - call: fs.writeFile
+        args:
+          - expr: "path.join(env.gateway.workspaceDir, config.contextFile)"
+          - expr: "Array.from({ length: env.providerMode === 'mock-openai' ? 160 : 2200 }, (_, index) => `compaction retry evidence block ${String(index).padStart(4, '0')} :: keep replay truth explicit after side effects`).join('\\n')"
+          - utf8
+      - set: sessionKey
+        value:
+          expr: "`agent:qa:compaction-retry:${randomUUID().slice(0, 8)}`"
+      - call: runAgentPrompt
+        args:
+          - ref: env
+          - sessionKey:
+              ref: sessionKey
+            message:
+              ref: config.prompt
+            timeoutMs:
+              expr: liveTurnTimeoutMs(env, 90000)
+      - call: waitForCondition
+        saveAs: outbound
+        args:
+          - lambda:
+              expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && normalizeLowercaseStringOrEmpty(candidate.text).includes(config.replayUnsafeNeedle)).at(-1)"
+          - expr: liveTurnTimeoutMs(env, 45000)
+          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
+      - call: fs.readFile
+        saveAs: writtenSummary
+        args:
+          - expr: "path.join(env.gateway.workspaceDir, config.outputFile)"
+          - utf8
+      - assert:
+          expr: "writtenSummary.includes('Replay safety: unsafe after write.')"
+          message:
+            expr: "`summary file missed replay marker: ${writtenSummary}`"
+      - if:
+          expr: "Boolean(env.mock)"
+        then:
+          - assert:
+              expr: "!env.mock || ([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName === 'write')"
+              message:
+                expr: "`expected write after seeded context read, got ${String(([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].toReversed().find((request) => String(request.allInputText ?? '').includes(config.promptSnippet) && String(request.toolOutput ?? '').includes('compaction retry evidence block'))?.plannedToolName ?? '')}`"
+      - call: readRawQaSessionStore
+        saveAs: store
+        args:
+          - ref: env
+      - set: sessionEntry
+        value:
+          expr: "store[sessionKey]"
+      - assert:
+          expr: "Boolean(sessionEntry)"
+          message:
+            expr: "`missing QA session entry for ${sessionKey}`"
+    detailsExpr: "`${outbound.text}\\ncompactionCount=${String(sessionEntry?.compactionCount ?? 0)}\\nstatus=${String(sessionEntry?.status ?? 'unknown')}`"
+```
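
The `fs.writeFile` step in the flow above seeds the context file from an inline expression: 160 numbered lines in `mock-openai` mode and 2200 otherwise. Pulled out as a standalone function, the same expression looks like the sketch below. `buildContextFixture` is an illustrative name and does not exist in the repository.

```typescript
// Illustrative extraction of the inline fixture expression from the flow.
// Each line carries a zero-padded index so truncation after compaction is easy to spot.
function buildContextFixture(blockCount: number): string {
  return Array.from(
    { length: blockCount },
    (_, index) =>
      `compaction retry evidence block ${String(index).padStart(4, "0")} :: keep replay truth explicit after side effects`,
  ).join("\n");
}

const mockFixture = buildContextFixture(160); // size used when providerMode is 'mock-openai'
const liveFixture = buildContextFixture(2200); // size used for live providers

console.log(mockFixture.split("\n").length, liveFixture.split("\n").length); // prints "160 2200"
```

Every generated line contains the substring `compaction retry evidence`, which is what the mock server looks for in the tool output before it plans the write.
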