mirror of https://github.com/openclaw/openclaw.git synced 2026-04-12 09:41:11 +00:00

Eva fd45ea2bf1 test(qa): add compaction retry parity scenario

2026-04-11 14:22:48 +01:00


GPT-5.4 / Codex Agentic Parity in OpenClaw

OpenClaw already worked well with tool-using frontier models, but GPT-5.4 and Codex-style models were still underperforming in a few practical ways:

they could stop after planning instead of doing the work
they could use strict OpenAI/Codex tool schemas incorrectly
they could ask for `/elevated full` even when full access was impossible
they could lose long-running task state during replay or compaction
parity claims against Claude Opus 4.6 were based on anecdotes instead of repeatable scenarios

This parity program fixes those gaps in four reviewable slices.

What changed

PR A: strict-agentic execution

This slice adds an opt-in strict-agentic execution contract for embedded Pi GPT-5 runs.

When enabled, OpenClaw stops accepting plan-only turns as “good enough” completion. If the model only says what it intends to do and does not actually use tools or make progress, OpenClaw retries with an act-now steer and then fails closed with an explicit blocked state instead of silently ending the task.

This improves the GPT-5.4 experience most on:

short “ok do it” follow-ups
code tasks where the first step is obvious
flows where `update_plan` should be progress tracking rather than filler text
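
The retry-then-fail-closed behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenClaw's implementation: `TurnResult`, `Outcome`, `runStrictAgentic`, `maxSteers`, and the steer text are hypothetical names chosen for the example.

```typescript
// Hypothetical types for illustration; not OpenClaw's real API.
type TurnResult = { toolCalls: number; text: string };
type Outcome =
  | { status: "progressed"; attempts: number }
  | { status: "blocked"; attempts: number; reason: string };

// Runs a turn. If the model only narrates a plan (no tool calls), retry with
// an "act now" steer. After maxSteers steered retries, fail closed with an
// explicit blocked state instead of silently ending the task.
function runStrictAgentic(
  runTurn: (steer?: string) => TurnResult,
  maxSteers = 1,
): Outcome {
  let steer: string | undefined;
  for (let attempt = 1; attempt <= maxSteers + 1; attempt++) {
    const result = runTurn(steer);
    if (result.toolCalls > 0) {
      return { status: "progressed", attempts: attempt };
    }
    // Plan-only turn: steer the next attempt toward a concrete action.
    steer = "Act now: take the first concrete tool action.";
  }
  return {
    status: "blocked",
    attempts: maxSteers + 1,
    reason: "plan-only turns with no tool activity",
  };
}
```

The point of the sketch is the shape of the contract: a plan-only turn is never returned as a completion, so the caller always sees either real tool activity or a named blocked state.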

PR B: runtime truthfulness

This slice makes OpenClaw tell the truth about two things:

why the provider/runtime call failed
whether `/elevated full` is actually available

That means GPT-5.4 gets better runtime signals for missing scope, auth refresh failures, HTML 403 auth failures, proxy issues, DNS or timeout failures, and blocked full-access modes. The model is less likely to hallucinate the wrong remediation or keep asking for a permission mode the runtime cannot provide.
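
One way to picture these signals is a small classifier that maps a raw failure to a named category. The sketch below is illustrative only: `FailureSignal`, `FailureKind`, `classifyFailure`, and the `duringTokenRefresh` flag are hypothetical, and the detection rules (Node.js-style error codes, HTTP 407 for proxies, an `insufficient_scope` marker in the body) are assumptions rather than OpenClaw's actual logic.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's real types.
type FailureKind =
  | "missing_scope"
  | "auth_refresh_failed"
  | "html_403_auth"
  | "proxy"
  | "dns"
  | "timeout"
  | "unknown";

interface FailureSignal {
  status?: number; // HTTP status, if a response arrived
  contentType?: string; // response Content-Type header
  body?: string; // response body text, if any
  errorCode?: string; // Node.js-style network error code
  duringTokenRefresh?: boolean; // assumed flag set by the caller
}

function classifyFailure(s: FailureSignal): FailureKind {
  // Network-level failures never produced an HTTP response.
  if (s.errorCode === "ENOTFOUND" || s.errorCode === "EAI_AGAIN") return "dns";
  if (s.errorCode === "ETIMEDOUT") return "timeout";
  // 407 is Proxy Authentication Required.
  if (s.status === 407) return "proxy";
  if (s.duringTokenRefresh) return "auth_refresh_failed";
  if (s.status === 403) {
    // An HTML 403 usually comes from a gateway or login wall, not the API.
    if ((s.contentType ?? "").includes("text/html")) return "html_403_auth";
    if ((s.body ?? "").includes("insufficient_scope")) return "missing_scope";
  }
  return "unknown";
}
```

Returning a closed set of categories, with an explicit `unknown`, keeps the model from guessing at a remediation when the runtime itself does not know the cause.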

PR C: execution correctness

This slice improves two kinds of correctness:

provider-owned OpenAI/Codex tool-schema compatibility
replay and long-task liveness surfacing

The tool-compat work reduces schema friction for strict OpenAI/Codex tool registration, especially around parameter-free tools and strict object-root expectations. The replay/liveness work makes long-running tasks more observable, so paused, blocked, and abandoned states are visible instead of disappearing into generic failure text.
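
As a rough illustration of the parameter-free case, the sketch below normalizes a tool's parameter schema to a strict object root. `JsonSchema` and `toStrictObjectRoot` are hypothetical names, and the rules applied (object root, every property listed in `required`, `additionalProperties: false`) reflect a common reading of strict function-calling requirements rather than OpenClaw's actual compatibility code.

```typescript
// Hypothetical helper for illustration; not OpenClaw's real API.
type JsonSchema = { [key: string]: unknown };

// Normalizes tool parameters so that even a parameter-free tool registers
// with an explicit, closed object schema.
function toStrictObjectRoot(parameters?: JsonSchema): JsonSchema {
  if (parameters?.type !== undefined && parameters.type !== "object") {
    // A non-object root cannot be repaired safely, so report it explicitly.
    throw new Error(
      `tool parameters must have an object root, got ${String(parameters.type)}`,
    );
  }
  const properties = (parameters?.properties as JsonSchema | undefined) ?? {};
  return {
    ...parameters,
    type: "object",
    properties,
    // Assumption: strict mode expects every property to be listed here.
    required: Object.keys(properties),
    additionalProperties: false,
  };
}
```

For a tool with no parameters, `toStrictObjectRoot()` yields an empty but fully specified object schema, which avoids the confusing rejections that a missing or partial schema can trigger.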

PR D: parity harness

This slice adds the first-wave QA-lab parity pack so GPT-5.4 and Opus 4.6 can be exercised through the same scenarios and compared using shared evidence.

The parity pack is the proof layer. It does not change runtime behavior by itself.

After you have two `qa-suite-summary.json` artifacts, generate the release-gate comparison with:

pnpm openclaw qa parity-report \
  --repo-root . \
  --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
  --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/parity

That command writes:

a human-readable Markdown report
a machine-readable JSON verdict
an explicit pass / fail gate result

Why this improves GPT-5.4 in practice

Before this work, GPT-5.4 on OpenClaw could feel less agentic than Opus in real coding sessions because the runtime tolerated behaviors that are especially harmful for GPT-5-style models:

commentary-only turns
schema friction around tools
vague permission feedback
silent replay or compaction breakage

The goal is not to make GPT-5.4 imitate Opus. The goal is to give GPT-5.4 a runtime contract that rewards real progress, supplies cleaner tool and permission semantics, and turns failure modes into explicit machine- and human-readable states.

That changes the user experience from:

“the model had a good plan but stopped”

to:

“the model either acted, or OpenClaw surfaced the exact reason it could not”

Before vs after for GPT-5.4 users

| Before this program | After PR A-D |
| --- | --- |
| GPT-5.4 could stop after a reasonable plan without taking the next tool step | PR A turns “plan only” into “act now or surface a blocked state” |
| Strict tool schemas could reject parameter-free or OpenAI/Codex-shaped tools in confusing ways | PR C makes provider-owned tool registration and invocation more predictable |
| `/elevated full` guidance could be vague or wrong in blocked runtimes | PR B gives GPT-5.4 and the user truthful runtime and permission hints |
| Replay or compaction failures could feel like the task silently disappeared | PR C surfaces paused, blocked, abandoned, and replay-invalid outcomes explicitly |
| “GPT-5.4 feels worse than Opus” was mostly anecdotal | PR D turns that into the same scenario pack, the same metrics, and a hard pass/fail gate |

Architecture

flowchart TD
    A["User request"] --> B["Embedded Pi runtime"]
    B --> C["Strict-agentic execution contract"]
    B --> D["Provider-owned tool compatibility"]
    B --> E["Runtime truthfulness"]
    B --> F["Replay and liveness state"]
    C --> G["Tool call or explicit blocked state"]
    D --> G
    E --> G
    F --> G
    G --> H["QA-lab parity pack"]
    H --> I["Scenario report and parity gate"]

Release flow

flowchart LR
    A["Merged runtime slices (PR A-C)"] --> B["Run GPT-5.4 parity pack"]
    A --> C["Run Opus 4.6 parity pack"]
    B --> D["qa-suite-summary.json"]
    C --> E["qa-suite-summary.json"]
    D --> F["openclaw qa parity-report"]
    E --> F
    F --> G["qa-agentic-parity-report.md"]
    F --> H["qa-agentic-parity-summary.json"]
    H --> I{"Gate pass?"}
    I -- "yes" --> J["Evidence-backed parity claim"]
    I -- "no" --> K["Keep runtime/review loop open"]

Scenario pack

The first-wave parity pack currently covers five scenarios:

`approval-turn-tool-followthrough`

Checks that the model does not stop at “I’ll do that” after a short approval. It should take the first concrete action in the same turn.

`model-switch-tool-continuity`

Checks that tool-using work remains coherent across model/runtime switching boundaries instead of resetting into commentary or losing execution context.

`source-docs-discovery-report`

Checks that the model can read source and docs, synthesize findings, and continue the task agentically rather than producing a thin summary and stopping early.

`image-understanding-attachment`

Checks that mixed-mode tasks involving attachments remain actionable and do not collapse into vague narration.

`compaction-retry-mutating-tool`

Checks that a task with a real mutating write keeps replay-unsafety explicit instead of quietly looking replay-safe if the run compacts, retries, or loses reply state under pressure.
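
The replay-safety property this last scenario checks can be sketched as a simple rule over the tool history. The names `ToolEvent`, `ReplaySafety`, and `replaySafetyAfter` are hypothetical, and treating any attempted mutating call as unsafe is a conservative assumption made for the example, not a description of OpenClaw's internals.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's real types.
type ToolEvent = { tool: string; mutating: boolean };
type ReplaySafety = { replaySafe: boolean; reason: string };

// Once any mutating tool has been attempted, the run must stay marked
// replay-unsafe, even if compaction or a retry later trims the transcript.
function replaySafetyAfter(events: ToolEvent[]): ReplaySafety {
  const firstWrite = events.find((e) => e.mutating);
  if (firstWrite) {
    return {
      replaySafe: false,
      reason: `mutating tool "${firstWrite.tool}" may already have applied a side effect`,
    };
  }
  return { replaySafe: true, reason: "no mutating tool has been attempted" };
}
```

The failure the scenario looks for is the opposite outcome: a write has happened, but the reported state still implies that replaying the run is harmless.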

Scenario matrix

| Scenario | What it tests | Good GPT-5.4 behavior | Failure signal |
| --- | --- | --- | --- |
| `approval-turn-tool-followthrough` | Short approval turns after a plan | Starts the first concrete tool action immediately instead of restating intent | plan-only follow-up, no tool activity, or blocked turn without a real blocker |
| `model-switch-tool-continuity` | Runtime/model switching under tool use | Preserves task context and continues acting coherently | resets into commentary, loses tool context, or stops after switch |
| `source-docs-discovery-report` | Source reading + synthesis + action | Finds sources, uses tools, and produces a useful report without stalling | thin summary, missing tool work, or incomplete-turn stop |
| `image-understanding-attachment` | Attachment-driven agentic work | Interprets the attachment, connects it to tools, and continues the task | vague narration, attachment ignored, or no concrete next action |
| `compaction-retry-mutating-tool` | Mutating work under compaction pressure | Performs a real write and keeps replay-unsafety explicit after the side effect | mutating write happens but replay safety is implied, missing, or contradictory |

Release gate

GPT-5.4 can be considered at parity or better only when the merged runtime passes the parity pack and the runtime-truthfulness regressions at the same time.

Required outcomes:

no plan-only stall when the next tool action is clear
no fake completion without real execution
no incorrect `/elevated full` guidance
no silent replay or compaction abandonment
parity-pack metrics that are at least as strong as the agreed Opus 4.6 baseline

For the first-wave harness, the gate compares:

completion rate
unintended-stop rate
valid-tool-call rate
fake-success count
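
A comparison over these four metrics might look like the sketch below. The field names in `ParityMetrics` are hypothetical (the real `qa-suite-summary.json` schema is not shown here), and the failure conditions follow the ones listed under "How to read the parity verdict": weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.

```typescript
// Hypothetical summary shape; the real qa-suite-summary.json may differ.
interface ParityMetrics {
  scenarios: string[];
  completionRate: number; // higher is better
  unintendedStopRate: number; // lower is better
  validToolCallRate: number; // higher is better
  fakeSuccessCount: number; // must be zero
}

interface GateResult {
  verdict: "pass" | "fail";
  failures: string[];
}

function evaluateGate(candidate: ParityMetrics, baseline: ParityMetrics): GateResult {
  const failures: string[] = [];
  // Both runs must cover exactly the same scenario set.
  const key = (m: ParityMetrics) => m.scenarios.slice().sort().join("\n");
  if (key(candidate) !== key(baseline)) failures.push("mismatched scenario coverage");
  if (candidate.completionRate < baseline.completionRate) failures.push("weaker completion rate");
  if (candidate.unintendedStopRate > baseline.unintendedStopRate) failures.push("worse unintended-stop rate");
  if (candidate.validToolCallRate < baseline.validToolCallRate) failures.push("weaker valid-tool-call rate");
  if (candidate.fakeSuccessCount > 0) failures.push("fake-success case present");
  return { verdict: failures.length === 0 ? "pass" : "fail", failures };
}
```

Collecting every tripped condition, rather than stopping at the first, keeps the verdict binary while still telling a reviewer which metric to look at.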

Parity evidence is intentionally split across two layers:

PR D proves same-scenario GPT-5.4 vs Opus 4.6 behavior with QA-lab
PR B’s deterministic suites prove auth, proxy, DNS, and `/elevated full` truthfulness outside the harness

Goal-to-evidence matrix

| Completion gate item | Owning PR | Evidence source | Pass signal |
| --- | --- | --- | --- |
| GPT-5.4 no longer stalls after planning | PR A | `approval-turn-tool-followthrough` plus PR A runtime suites | approval turns trigger real work or an explicit blocked state |
| GPT-5.4 no longer fakes progress or tool completion | PR A + PR D | parity report scenario outcomes and fake-success count | no suspicious pass results and no commentary-only completion |
| GPT-5.4 no longer gives false `/elevated full` guidance | PR B | deterministic truthfulness suites | blocked reasons and full-access hints stay runtime-accurate |
| Replay/liveness failures stay explicit | PR C + PR D | PR C lifecycle/replay suites plus `compaction-retry-mutating-tool` | mutating work keeps replay-unsafety explicit instead of silently disappearing |
| GPT-5.4 matches or beats Opus 4.6 on the agreed metrics | PR D | `qa-agentic-parity-report.md` and `qa-agentic-parity-summary.json` | same scenario coverage and no regression on completion, stop behavior, or valid tool use |

How to read the parity verdict

Use the verdict in `qa-agentic-parity-summary.json` as the final machine-readable decision for the first-wave parity pack.

`pass` means GPT-5.4 covered the same scenarios as Opus 4.6 and did not regress on the agreed aggregate metrics.
`fail` means at least one hard gate tripped: weaker completion, worse unintended stops, weaker valid tool use, any fake-success case, or mismatched scenario coverage.
“shared/base CI issue” is not itself a parity result. If CI noise outside PR D blocks a run, the verdict should wait for a clean merged-runtime execution instead of being inferred from branch-era logs.
Auth, proxy, DNS, and `/elevated full` truthfulness still come from PR B’s deterministic suites, so the final release claim needs both: a passing PR D parity verdict and green PR B truthfulness coverage.
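
Put together, the rules above amount to a small decision function. The sketch below is an illustration only: `ReleaseInputs`, `ReleaseDecision`, and `decideRelease` are hypothetical names, and representing "no clean merged-runtime run yet" as an undefined verdict is an assumption made for the example.

```typescript
// Hypothetical shapes for illustration; not part of the OpenClaw CLI.
type ParityVerdict = "pass" | "fail";

interface ReleaseInputs {
  // Undefined when CI noise blocked the run and no clean
  // merged-runtime execution exists yet.
  parityVerdict?: ParityVerdict;
  // True when PR B's deterministic truthfulness suites are green.
  truthfulnessSuitesGreen: boolean;
}

type ReleaseDecision = "claim-parity" | "wait-for-clean-run" | "no-claim";

function decideRelease(inputs: ReleaseInputs): ReleaseDecision {
  // A blocked run is not a parity result, so do not infer one.
  if (inputs.parityVerdict === undefined) return "wait-for-clean-run";
  // The release claim needs both evidence layers at once.
  return inputs.parityVerdict === "pass" && inputs.truthfulnessSuitesGreen
    ? "claim-parity"
    : "no-claim";
}
```

For example, a passing parity verdict with red truthfulness suites yields `no-claim`, which matches the two-layer evidence split described in the release gate.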

Who should enable `strict-agentic`

Use `strict-agentic` when:

the agent is expected to act immediately when a next step is obvious
GPT-5.4 or Codex-family models are the primary runtime
you prefer explicit blocked states over “helpful” recap-only replies

Keep the default contract when:

you want the existing looser behavior
you are not using GPT-5-family models
you are testing prompts rather than runtime enforcement
