mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 09:10:45 +00:00
qa: salvage GPT-5.4 parity proof slice (#65664)
* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no fake progress / fake tool completion') for the two first/second-wave parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions via /debug/requests on the mock server (see approval-turn-tool-followthrough for the pattern). Most parity scenarios use this seam to require a specific plannedToolName, but source-docs-discovery-report and subagent-handoff only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied to the 'worked, failed, blocked' prompt in /debug/requests. Failure message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same debug-friendly failure message.

Both assertions are guarded by the `!env.mock || ...` pattern so they no-op in live-frontier mode, where the real provider exposes plannedToolName through a different channel (or not at all).

Not touched: memory-recall is also in the parity pack, but its pass path is legitimately 'read the fact from prior-turn context'. That is a valid recall strategy, not fake progress, so it is out of scope for this PR. memory-recall's fake-progress story (no real memory_search call) would require bigger mock-server changes and belongs in a follow-up that extends the mock memory pipeline.
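The guard described above can be sketched as a small predicate over the /debug/requests snapshot. This is an illustrative stand-in, not the repo's code: the request shape (allInputText, plannedToolName) follows the commit text, and the helper name is hypothetical.

```typescript
// Hypothetical sketch of the fake-progress guard: a prose-only pass is
// rejected unless the mock's /debug/requests log contains a request,
// pinned to this scenario's prompt substring, that actually planned
// the expected tool.
interface DebugRequest {
  allInputText?: string;
  plannedToolName?: string;
}

function sawPlannedToolCall(
  requests: DebugRequest[],
  promptNeedle: string,
  expectedTool: string,
): boolean {
  return requests.some(
    (request) =>
      String(request.allInputText ?? "")
        .toLowerCase()
        .includes(promptNeedle.toLowerCase()) &&
      request.plannedToolName === expectedTool,
  );
}
```

In the scenarios themselves this check sits behind `!env.mock || ...`, so live-frontier runs skip it entirely.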
Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery scenario's .includes('worked, failed, blocked') assertion is case-sensitive, but the real prompt says 'Worked, Failed, Blocked...', so the mock-mode assertion never matches. Fix: lowercase-normalize allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson separately, incurring two round-trips to /debug/requests. Fix: hoist the fetch to a set step (discoveryDebugRequests / subagentDebugRequests) and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request log and matched the first request with 'delegate' in its input text, which could false-pass on a stale prior scenario. Fix: reverse the array and take the most recent matching request instead.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I used on pass 1 usually lands on the FOLLOW-UP request after the mock runs sessions_spawn, not the pre-tool planning request that actually has plannedToolName === 'sessions_spawn'. The mock only plans that tool on requests with !toolOutput (mock-openai-server.ts:662), so the post-tool request has plannedToolName unset and the assertion fails even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a !request.toolOutput filter so the match is pinned to the pre-tool planning phase. The case-insensitive regex, the fetchJson dedupe, and the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass).
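Finding 1 reduces to a one-line string-matching bug. A minimal repro, assuming only that the prompt capitalizes 'Worked, Failed, Blocked' while the assertion needle is lowercase (both quoted from the review finding):

```typescript
// The scenario prompt capitalizes the report labels.
const promptText = "Worked, Failed, Blocked follow-up report";

// Case-sensitive check: never matches the capitalized prompt, so the
// mock-mode assertion silently fails.
const naiveMatch = promptText.includes("worked, failed, blocked");

// Fix: lowercase-normalize the haystack before the contains check.
const normalizedMatch = promptText
  .toLowerCase()
  .includes("worked, failed, blocked");
```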
Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix filtered to pre-tool requests but still used a broad `/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis` scenario runs BEFORE `subagent-handoff` in catalog order (scenarios are sorted by path), and the fanout prompt reads 'Subagent fanout synthesis check: delegate exactly two bounded subagents sequentially' — which contains 'delegate' and also plans sessions_spawn pre-tool. That produces a cross-scenario false pass where the fanout's earlier sessions_spawn request satisfies the handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i` to `/delegate one bounded qa task/i`, which is the exact scenario-unique substring from the `subagent-handoff` config.prompt. That pins the assertion to this scenario's request window and closes the cross-scenario false positive.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the debug request log was 'lowercased for case-insensitive matching', but the code actually lowercases each request's allInputText inline inside the .some() predicate, not the discoveryDebugRequests snapshot. Rewrite the comment to describe the inline-lowercase pattern so a future reader matches the code they see.
2. subagent-handoff.md: the comment said the assertion 'must be pinned to THIS scenario's request window' but the implementation actually relies on matching a scenario-unique prompt substring (/delegate one bounded qa task/i), not a request-window. Rewrite the comment to describe the substring pinning and keep the pre-tool filter rationale intact.
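The pass-3 cross-scenario false positive can be demonstrated with the two regexes and the fanout prompt quoted verbatim from the finding above:

```typescript
// The subagent-fanout-synthesis prompt runs earlier in catalog order.
const fanoutPrompt =
  "Subagent fanout synthesis check: delegate exactly two bounded subagents sequentially";

// Broad pass-2 regex: matches the fanout prompt too, so the fanout's
// sessions_spawn request could satisfy the handoff assertion.
const broadMatch = /delegate|subagent handoff/i.test(fanoutPrompt);

// Pinned pass-3 regex: the scenario-unique substring from the
// subagent-handoff config.prompt does not appear in the fanout prompt.
const pinnedMatch = /delegate one bounded qa task/i.test(fanoutPrompt);
```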
No runtime change; comment-only fix to keep reviewer expectations aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios
* Guard mock-only image parity assertions
* Expand agentic parity second wave
* test(qa): pad parity suspicious-pass isolation to second wave
* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a template string that interpolates candidateLabel and baselineLabel, so any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate title in saved reports. Default CLI flags still produce openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.
2. Stale 'declared first-wave parity scenarios' comment in scopeSummaryToParityPack: the parity pack is now the ten-scenario first-wave+second-wave set (PR D + PR E). Comment updated to drop the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison labels' — asserts that non-default labels (openai/gpt-5.4-alt vs openai/gpt-5.4) render in the H1.

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts (13/13 pass).
Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity
* test(qa): update readable-report test to cover all 10 parity scenarios
* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels
* Tighten parity label and scenario checks
* fix: tighten parity label provenance checks
* fix: scope parity tool-call metrics to tool lanes
* Fix parity report label and fake-success checks
* fix(qa): tighten parity report edge cases
* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs - one against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the aggregate metrics and verdict in PR D (#64441) can be computed. Today the qa-lab mock server only implements /v1/responses, so the baseline run against Claude Opus 4.6 requires a real Anthropic API key. That makes the gate impossible to prove end-to-end from a local worktree and means the CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing mock OpenAI server.
The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as string or [{type:text,text}], messages with string or block content, text and tool_result and tool_use and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run through the exact same scenario prompt-matching logic (same subagent fanout state machine, same extractRememberedFact helper, same '/debug/requests' telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic message response with text and tool_use content blocks and a correct stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock mode, so real Anthropic SSE isn't necessary for the parity baseline. Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline model-list probes from the suite runner resolve without extra config.

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery prompts (tool_use stop_reason, correct input path, /debug/requests records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the shared scenario logic (subagent-handoff two-stage flow: tool_use - tool_result - 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227

Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper by giving the baseline lane a local-only mock path.
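The first bullet of the adapter can be sketched as follows. This is a hedged stand-in, not the repo's implementation: the type names are simplified, and the real route also maps tool_use / tool_result / image blocks and the system field, as the bullet list above describes.

```typescript
// Simplified translation of Anthropic Messages content (string or text
// blocks) into Responses-style message items.
type AnthropicTextBlock = { type: "text"; text: string };

type AnthropicMessage = {
  role: "user" | "assistant";
  content: string | AnthropicTextBlock[];
};

type ResponsesMessageItem = {
  type: "message";
  role: "user" | "assistant";
  text: string;
};

function toResponsesInput(
  messages: AnthropicMessage[],
): ResponsesMessageItem[] {
  return messages.map((message) => ({
    type: "message",
    role: message.role,
    // String content passes through; block content keeps the text
    // pieces, joined in their original order.
    text:
      typeof message.content === "string"
        ? message.content
        : message.content.map((block) => block.text).join("\n"),
  }));
}
```

Because both lanes end up in the same ResponsesInputItem[] shape, the shared dispatcher never needs to know which provider format the request arrived in.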
* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in `convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were pushed to `items` inside the per-block loop while the surrounding user/assistant message was only pushed after the loop finished. That reordered the function_call_output BEFORE its parent user message whenever a user turn mixed `tool_result` with fresh text/image blocks, which broke `extractToolOutput` (it scans AFTER the last user-role index; function_call_output placed BEFORE that index is invisible to it) and made the downstream scenario dispatcher behave as if no tool output had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during the per-block loop, push the parent role message first (when it has any text/image pieces), then push the accumulated function_call / function_call_output items in original order. tool_result-only user turns still omit the parent message as before, so the non-mixed subagent-fanout-synthesis two-stage flow that already worked keeps working.

Regression added:

- `places tool_result after the parent user message even in mixed-content turns` — sends a user turn that mixes a `tool_result` block with a trailing fresh text block, then inspects `/debug/last-request` to assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found the function_call_output AFTER the last user index) and `prompt === 'Keep going with the fanout.'` (extractLastUserText picked up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (19/19 pass).
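The buffering fix can be sketched with simplified block/item shapes (the real adapter also handles tool_use, images, and assistant turns; the function name here is illustrative):

```typescript
// Buffer tool_result items during the per-block loop and append them
// only AFTER the parent message, so a mixed-content user turn keeps
// its function_call_output after the last user-role item.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

type Item =
  | { type: "message"; role: string; text: string }
  | { type: "function_call_output"; call_id: string; output: string };

function convertTurn(role: string, blocks: Block[]): Item[] {
  const items: Item[] = [];
  const textPieces: string[] = [];
  const toolOutputs: Item[] = []; // buffered instead of pushed inline

  for (const block of blocks) {
    if (block.type === "text") {
      textPieces.push(block.text);
    } else {
      toolOutputs.push({
        type: "function_call_output",
        call_id: block.tool_use_id,
        output: block.content,
      });
    }
  }

  // Parent message first (omitted for tool_result-only turns, which
  // preserves the already-working non-mixed flow)...
  if (textPieces.length > 0) {
    items.push({ type: "message", role, text: textPieces.join("\n") });
  }
  // ...then the buffered tool outputs, in their original order.
  items.push(...toolOutputs);
  return items;
}
```

With the buggy inline push, the same mixed turn would emit the function_call_output first, placing it before the last user-role index that extractToolOutput scans from.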
Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock
* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider
* Handle invalid Anthropic mock JSON
* fix: wire mock parity providers by model ref
* fix(qa): support Anthropic message streaming in mock parity lane
* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.
Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change. The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227

Unblocks the final parity run for #64441 / #64662 by making summaries self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689 (re-opened as #64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so '[] or undefined' both encode as `null` in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it.
Downstream code (parity gate, parity run wrapper) can now import the type and keep consumers type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' | 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified (no filter)` — passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass). Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (replacement of #64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios`, which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list.
This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request. buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved, so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass). Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length-check serializes null (pass-2 behavior, preserving unfiltered semantics).

This keeps both codex-connector findings satisfied simultaneously:

- explicit --scenario filter reorders/dedupes through the executed list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).
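The final scenarioIds semantics across the pass-2/3/4 fixes can be sketched with pared-down shapes. Both helper names here are hypothetical, and the real summary carries many more run fields (timestamps, provider/model splits, and so on):

```typescript
// Pass-2 semantics inside the builder: `length > 0`, not truthiness,
// so both `[]` and `undefined` serialize as null ('no explicit
// scenario filter').
type SummaryRun = { scenarioIds: string[] | null };

function buildRunMetadata(scenarioIds?: string[]): SummaryRun {
  return {
    scenarioIds:
      scenarioIds && scenarioIds.length > 0 ? [...scenarioIds] : null,
  };
}

// Pass-4 call-site rule: record the post-selection executed list
// (deduped and catalog-ordered) only when the caller passed an
// explicit non-empty filter; otherwise hand the builder undefined so
// it serializes null.
function summaryScenarioIds(
  callerScenarioIds: string[] | undefined,
  executedScenarioIds: string[],
): SummaryRun {
  const filtered =
    callerScenarioIds && callerScenarioIds.length > 0
      ? executedScenarioIds
      : undefined;
  return buildRunMetadata(filtered);
}
```

This keeps provenance honest in both directions: an explicit filter is recorded as what actually ran, while an unfiltered run stays null rather than dumping the full catalog.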
Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type
* qa-lab: stage mock auth profiles so the parity gate runs without real credentials
* fix(qa): clean up mock auth staging follow-ups
* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock
* ci: use supported parity gate runner label
* ci: watch gateway changes in parity gate
* docs: pin parity runbook alternate models
* fix(ci): watch qa-channel parity inputs
* qa: roll up parity proof closeout
* qa: harden mock parity review fixes
* qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection
* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch
* qa-lab: clean up positive-tone comment + fix stale test expectations
* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording
* qa-lab: refresh mock provider routing expectation
* docs: drop stale parity rollup rewrite from proof slice
* qa: run parity gate against mock lane
* deps: sync qa-lab lockfile
* build: refresh a2ui bundle hash
* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>
@@ -151,6 +151,20 @@ steps:
         ref: imageStartedAtMs
       timeoutMs:
         expr: liveTurnTimeoutMs(env, 45000)
+    # Tool-call assertion (criterion 2 of the parity completion
+    # gate in #64227): the restored `image_generate` capability
+    # must have actually fired as a real tool call. Without this
+    # assertion, a prose reply that just mentions a MEDIA path
+    # could satisfy the scenario, so strengthen it by requiring
+    # the mock to have recorded `plannedToolName: "image_generate"`
+    # against a post-restart request. The `!env.mock || ...`
+    # guard means this check only runs in mock mode (where
+    # `/debug/requests` is available); live-frontier runs skip
+    # it and still pass the rest of the scenario.
+    - assert:
+        expr: "!env.mock || [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].some((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check') && request.plannedToolName === 'image_generate')"
+        message:
+          expr: "`expected image_generate tool call during capability flip scenario, saw plannedToolNames=${JSON.stringify([...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => String(request.allInputText ?? '').toLowerCase().includes('capability flip image check')).map((request) => request.plannedToolName ?? null))}`"
 finally:
   - call: patchConfig
     args:
@@ -64,9 +64,26 @@ steps:
         expr: "!missingColorGroup"
         message:
           expr: "`missing expected colors in image description: ${outbound.text}`"
+    # Image-processing assertion: verify the mock actually received an
+    # image on the scenario-unique prompt. This is as strong as a
+    # tool-call assertion for this scenario — unlike the
+    # `source-docs-discovery-report` / `subagent-handoff` /
+    # `config-restart-capability-flip` scenarios that rely on a real
+    # tool call to satisfy the parity criterion, image understanding
+    # is handled inside the provider's vision capability and does NOT
+    # emit a tool call the mock can record as `plannedToolName`. The
+    # `imageInputCount` field IS the tool-call evidence for vision
+    # scenarios: it proves the attachment reached the provider, which
+    # is the only thing an external harness can verify in mock mode.
+    # Match on the scenario-unique prompt substring so the assertion
+    # can't be accidentally satisfied by some other scenario's image
+    # request that happens to share a debug log with this one.
+    - set: imageRequest
+      value:
+        expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].find((request) => String(request.prompt ?? '').includes('Image understanding check')) : null"
     - assert:
-        expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0) >= 1)"
+        expr: "!env.mock || (imageRequest && (imageRequest.imageInputCount ?? 0) >= 1)"
         message:
-          expr: "`expected at least one input image, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.prompt ?? '').includes('Image understanding check'))?.imageInputCount ?? 0)}`"
+          expr: "`expected at least one input image on the Image understanding check request, got imageInputCount=${String(imageRequest?.imageInputCount ?? 0)}`"
         detailsExpr: outbound.text
 ```
qa/scenarios/instruction-followthrough-repo-contract.md (new file, 127 lines)
@@ -0,0 +1,127 @@
+# Instruction followthrough repo contract
+
+```yaml qa-scenario
+id: instruction-followthrough-repo-contract
+title: Instruction followthrough repo contract
+surface: repo-contract
+objective: Verify the agent reads repo instruction files first, follows the required tool order, and completes the first feasible action instead of stopping at a plan.
+successCriteria:
+  - Agent reads the seeded instruction files before writing the requested artifact.
+  - Agent writes the requested artifact in the same run instead of returning only a plan.
+  - Agent does not ask for permission before the first feasible action.
+  - Final reply makes the completed read/write sequence explicit.
+docsRefs:
+  - docs/help/testing.md
+  - docs/channels/qa-channel.md
+codeRefs:
+  - src/agents/system-prompt.ts
+  - src/agents/pi-embedded-runner/run/incomplete-turn.ts
+  - extensions/qa-lab/src/mock-openai-server.ts
+execution:
+  kind: flow
+  summary: Verify the agent reads repo instructions first, then completes the first bounded followthrough task without stalling.
+config:
+  workspaceFiles:
+    AGENT.md: |-
+      # Repo contract
+
+      Step order:
+      1. Read AGENT.md.
+      2. Read SOUL.md.
+      3. Read FOLLOWTHROUGH_INPUT.md.
+      4. Write ./repo-contract-summary.txt.
+      5. Reply with three labeled lines exactly once: Read, Wrote, Status.
+
+      Do not stop after planning.
+      Do not ask for permission before the first feasible action.
+    SOUL.md: |-
+      # Execution style
+
+      Stay brief, honest, and action-first.
+      If the next tool action is feasible, do it before replying.
+    FOLLOWTHROUGH_INPUT.md: |-
+      Mission: prove you followed the repo contract.
+      Evidence path: AGENT.md -> SOUL.md -> FOLLOWTHROUGH_INPUT.md -> repo-contract-summary.txt
+  prompt: |-
+    Repo contract followthrough check. Read AGENT.md, SOUL.md, and FOLLOWTHROUGH_INPUT.md first.
+    Then follow the repo contract exactly, write ./repo-contract-summary.txt, and reply with
+    three labeled lines: Read, Wrote, Status.
+    Do not stop after planning and do not ask for permission before the first feasible action.
+  expectedReplyAll:
+    - "read:"
+    - "wrote:"
+    - "status:"
+  forbiddenNeedles:
+    - need permission
+    - need your approval
+    - can you approve
+    - i would
+    - i can
+    - next i would
+```
+
+```yaml qa-flow
+steps:
+  - name: follows repo instructions instead of stopping at a plan
+    actions:
+      - call: reset
+      - forEach:
+          items:
+            expr: "Object.entries(config.workspaceFiles ?? {})"
+          item: workspaceFile
+          actions:
+            - call: fs.writeFile
+              args:
+                - expr: "path.join(env.gateway.workspaceDir, String(workspaceFile[0]))"
+                - expr: "`${String(workspaceFile[1] ?? '').trimEnd()}\\n`"
+                - utf8
+      - set: artifactPath
+        value:
+          expr: "path.join(env.gateway.workspaceDir, 'repo-contract-summary.txt')"
+      - call: runAgentPrompt
+        args:
+          - ref: env
+          - sessionKey: agent:qa:repo-contract
+            message:
+              expr: config.prompt
+            timeoutMs:
+              expr: liveTurnTimeoutMs(env, 40000)
+      - call: waitForCondition
+        saveAs: artifact
+        args:
+          - lambda:
+              async: true
+              expr: "((await fs.readFile(artifactPath, 'utf8').catch(() => null))?.includes('Mission: prove you followed the repo contract.') ? await fs.readFile(artifactPath, 'utf8').catch(() => null) : undefined)"
+          - expr: liveTurnTimeoutMs(env, 30000)
+          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
+      - set: expectedReplyAll
+        value:
+          expr: config.expectedReplyAll.map(normalizeLowercaseStringOrEmpty)
+      - call: waitForCondition
+        saveAs: outbound
+        args:
+          - lambda:
+              expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && expectedReplyAll.every((needle) => normalizeLowercaseStringOrEmpty(candidate.text).includes(needle))).at(-1)"
+          - expr: liveTurnTimeoutMs(env, 30000)
+          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
+      - assert:
+          expr: "!config.forbiddenNeedles.some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
+          message:
+            expr: "`repo contract followthrough bounced for permission or stalled: ${outbound.text}`"
+      - set: followthroughDebugRequests
+        value:
+          expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => /repo contract followthrough check/i.test(String(request.allInputText ?? ''))) : []"
+      - assert:
+          expr: "!env.mock || followthroughDebugRequests.filter((request) => request.plannedToolName === 'read').length >= 3"
+          message:
+            expr: "`expected three read tool calls before write, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+      - assert:
+          expr: "!env.mock || followthroughDebugRequests.some((request) => request.plannedToolName === 'write')"
+          message:
+            expr: "`expected write tool call during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+      - assert:
+          expr: "!env.mock || (() => { const readIndices = followthroughDebugRequests.map((r, i) => r.plannedToolName === 'read' ? i : -1).filter(i => i >= 0); const firstWrite = followthroughDebugRequests.findIndex((r) => r.plannedToolName === 'write'); return readIndices.length >= 3 && firstWrite >= 0 && readIndices[2] < firstWrite; })()"
+          message:
+            expr: "`expected all 3 reads before any write during repo contract followthrough, saw plannedToolNames=${JSON.stringify(followthroughDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+          detailsExpr: outbound.text
+```
@@ -1,5 +1,36 @@
 # Memory recall after context switch
 
+<!--
+This scenario deliberately stays prose-only and does NOT gate on a
+`/debug/requests` tool-call assertion, even though it is one of the
+scenarios in the parity pack. The adversarial review in the umbrella
+#64227 thread called this out as a coverage gap, but the underlying
+behavior the scenario tests is legitimately prose-shaped: the agent is
+supposed to pull a prior-turn fact ("ALPHA-7") back across an
+intervening context switch and reply with the code. In a real
+conversation, the model can do this EITHER by calling a memory-search
+tool (which the qa-lab mock server doesn't currently expose) OR by
+reading the fact directly from prior-turn context in its own
+conversation window. Both strategies are valid parity behavior.
+
+Forcing a `plannedToolName` assertion here would either require
+extending the mock with a synthetic `memory_search` tool lane (PR O
+scope, not PR J) or fabricating a tool-call requirement the real
+providers never implement. Either path would make this scenario test
+the harness, not the models. So we keep it prose-only, covered by the
+`recallExpectedAny` / `rememberAckAny` assertions above, and flag the
+exception explicitly rather than silently.
+
+Criterion 2 of the parity completion gate (no fake progress or fake
+tool completion) is enforced for this scenario through the parity
+report's failure-tone fake-success detector: a scenario marked `pass`
+whose details text matches patterns like "timed out", "failed to", or
+"could not" gets flagged via `SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS`
+in `extensions/qa-lab/src/agentic-parity-report.ts`. Positive-tone
+detection was removed because it false-positives on legitimate passes
+where the details field is the model's outbound prose.
+-->
+
 ```yaml qa-scenario
 id: memory-recall
 title: Memory recall after context switch

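The failure-tone fake-success detector that the comment above describes can be sketched as follows. This is an illustrative reconstruction, not the actual code in `extensions/qa-lab/src/agentic-parity-report.ts`: only the constant name `SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS` and the example phrases come from the text, and the result shape is an assumption.

```typescript
// Assumed pattern list; the real one in agentic-parity-report.ts may differ.
const SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS: RegExp[] = [
  /timed out/i,
  /failed to/i,
  /could not/i,
];

// Hypothetical shape for one row of the parity report.
interface ScenarioResult {
  status: "pass" | "fail";
  details: string;
}

// Flags a scenario that claims `pass` while its details text reads like a
// failure report -- the "fake success" shape criterion 2 guards against.
// Note there is deliberately no positive-tone branch: per the comment above,
// positive-tone detection false-positives on legitimate passes.
function isSuspiciousPass(result: ScenarioResult): boolean {
  return (
    result.status === "pass" &&
    SUSPICIOUS_PASS_FAILURE_TONE_PATTERNS.some((p) => p.test(result.details))
  );
}
```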
@@ -69,13 +69,22 @@ steps:
       expr: hasModelSwitchContinuityEvidence(outbound.text)
       message:
         expr: "`switch reply missed kickoff continuity: ${outbound.text}`"
-  - assert:
-      expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.plannedToolName) === 'read')"
-      message:
-        expr: "`expected read after switch, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.plannedToolName ?? '')}`"
-  - assert:
-      expr: "!env.mock || (((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.model) === 'gpt-5.4-alt')"
-      message:
-        expr: "`expected alternate model, got ${String((await fetchJson(`${env.mock.baseUrl}/debug/requests`)).find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))?.model ?? '')}`"
+  - if:
+      expr: "Boolean(env.mock)"
+      then:
+        - set: switchDebugRequests
+          value:
+            expr: "await fetchJson(`${env.mock.baseUrl}/debug/requests`)"
+        - set: switchRequest
+          value:
+            expr: "switchDebugRequests.find((request) => String(request.allInputText ?? '').includes(config.promptSnippet))"
+        - assert:
+            expr: "switchRequest?.plannedToolName === 'read'"
+            message:
+              expr: "`expected read after switch, got ${String(switchRequest?.plannedToolName ?? '')}`"
+        - assert:
+            expr: "String(switchRequest?.model ?? '') === String(alternate?.model ?? '')"
+            message:
+              expr: "`expected alternate model, got ${String(switchRequest?.model ?? '')}`"
 detailsExpr: outbound.text
 ```

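The memory-recall steps above pin every `/debug/requests` lookup to this scenario by matching `config.promptSnippet` against the request's `allInputText`, so the assertions cannot accidentally match another scenario's traffic. A minimal sketch of that pinning pattern, assuming the debug-log field names used in the exprs; the helper name is hypothetical:

```typescript
// Illustrative shape for one entry of the mock server's /debug/requests log,
// limited to the fields the memory-recall assertions read.
interface DebugRequest {
  allInputText?: string;
  plannedToolName?: string;
  model?: string;
}

// Returns the first logged request whose input text contains the
// scenario-unique prompt snippet, or undefined if this scenario never
// produced a request.
function findRequestForPrompt(
  requests: DebugRequest[],
  promptSnippet: string,
): DebugRequest | undefined {
  return requests.find((request) =>
    String(request.allInputText ?? "").includes(promptSnippet),
  );
}
```

The if/set rewrite in the hunk fetches the log once into `switchDebugRequests` and reuses the pinned `switchRequest` across both assertions, instead of re-fetching inside every expr.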
@@ -56,5 +56,20 @@ steps:
       expr: "!reportsDiscoveryScopeLeak(outbound.text)"
       message:
         expr: "`discovery report drifted beyond scope: ${outbound.text}`"
+  # Parity gate criterion 2 (no fake progress / fake tool completion):
+  # require an actual read tool call before the prose report. Without this,
+  # a model could fabricate a plausible Worked/Failed/Blocked/Follow-up
+  # report without ever touching the repo files the prompt names. The
+  # debug request log is fetched once and reused for both the assertion
+  # and its failure-message diagnostic. Each request's allInputText is
+  # lowercased inline at match time (the real prompt writes it as
+  # "Worked, Failed, Blocked") so the contains check is case-insensitive.
+  - set: discoveryDebugRequests
+    value:
+      expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
+  - assert:
+      expr: "!env.mock || discoveryDebugRequests.some((request) => String(request.allInputText ?? '').toLowerCase().includes('worked, failed, blocked') && request.plannedToolName === 'read')"
+      message:
+        expr: "`expected at least one read tool call during discovery report scenario, saw plannedToolNames=${JSON.stringify(discoveryDebugRequests.map((request) => request.plannedToolName ?? null))}`"
 detailsExpr: outbound.text
 ```

@@ -113,6 +113,28 @@ steps:
       expr: "sawAlpha && sawBeta"
       message:
         expr: "`fanout child sessions missing (alpha=${String(sawAlpha)} beta=${String(sawBeta)})`"
+  # Tool-call assertion (criterion 2 of the
+  # parity completion gate in #64227): the
+  # scenario must have actually invoked
+  # `sessions_spawn` at least twice with
+  # distinct labels, not just ended up with
+  # two rows in the session store through
+  # prose trickery. The session store alone
+  # can be populated by other flows or by a
+  # model that fabricates "delegation"
+  # narration. `plannedToolName` on the
+  # mock's `/debug/requests` log is the
+  # tool-call ground truth: two recorded
+  # sessions_spawn requests with distinct
+  # labels means the model really dispatched
+  # both subagents.
+  - set: fanoutSpawnRequests
+    value:
+      expr: "[...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].filter((request) => request.plannedToolName === 'sessions_spawn' && /subagent fanout synthesis check/i.test(String(request.allInputText ?? '')))"
+  - assert:
+      expr: "fanoutSpawnRequests.length >= 2"
+      message:
+        expr: "`expected at least two sessions_spawn tool calls during subagent fanout scenario, saw ${fanoutSpawnRequests.length}`"
   - set: details
     value:
       expr: "outbound.text"

@@ -46,5 +46,25 @@ steps:
       expr: "!['failed to delegate','could not delegate','subagent unavailable'].some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(needle))"
       message:
        expr: "`subagent handoff reported failure: ${outbound.text}`"
+  # Parity gate criterion 2 (no fake progress / fake tool completion):
+  # require an actual sessions_spawn tool call. Without this, a model
+  # could produce the three labeled sections ("Delegated task", "Result",
+  # "Evidence") as free-form prose without ever delegating to a real
+  # subagent. The assertion is pinned to THIS scenario by matching the
+  # scenario-unique prompt substring "Delegate one bounded QA task"
+  # (not a broad /delegate|subagent/ regex) so the earlier
+  # subagent-fanout-synthesis scenario — which also contains "delegate"
+  # and produces its own pre-tool sessions_spawn request — cannot
+  # satisfy the assertion here. The match is also constrained to
+  # pre-tool requests (no toolOutput) because the mock only plans
+  # sessions_spawn on requests with no toolOutput; the follow-up
+  # request after the tool runs has plannedToolName unset.
+  - set: subagentDebugRequests
+    value:
+      expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))] : []"
+  - assert:
+      expr: "!env.mock || subagentDebugRequests.some((request) => !request.toolOutput && /delegate one bounded qa task/i.test(String(request.allInputText ?? '')) && request.plannedToolName === 'sessions_spawn')"
+      message:
+        expr: "`expected sessions_spawn tool call during subagent handoff scenario, saw plannedToolNames=${JSON.stringify(subagentDebugRequests.map((request) => request.plannedToolName ?? null))}`"
 detailsExpr: outbound.text
 ```
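The pre-tool pinning that the handoff comments describe can be sketched as a standalone predicate. The field shapes mirror the scenario's expr; the helper name is illustrative and the `toolOutput` type is an assumption:

```typescript
// Illustrative shape for one /debug/requests entry, limited to the fields
// the handoff assertion reads. `toolOutput` is present only on the follow-up
// request issued after a tool has run.
interface DebugRequest {
  allInputText?: string;
  plannedToolName?: string;
  toolOutput?: unknown;
}

// True when some pre-tool request (no toolOutput yet) contains the
// scenario-unique phrase and plans a sessions_spawn call -- i.e. the model
// really delegated rather than narrating three labeled sections as prose.
function sawPinnedSpawn(requests: DebugRequest[]): boolean {
  return requests.some(
    (request) =>
      !request.toolOutput &&
      /delegate one bounded qa task/i.test(String(request.allInputText ?? "")) &&
      request.plannedToolName === "sessions_spawn",
  );
}
```

The `!request.toolOutput` guard matters because, per the comment, the follow-up request after the tool runs has `plannedToolName` unset, so only the pre-tool request can legitimately satisfy the check.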