From e8b446b98505e00ceb257eb7d040905d6254519b Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Tue, 7 Apr 2026 10:38:36 +0100
Subject: [PATCH] docs(qa): expand frontier bakeoff runbook

---
 qa/README.md                |  5 +++++
 qa/frontier-harness-plan.md | 23 +++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/qa/README.md b/qa/README.md
index f5d6621866d..3063c079026 100644
--- a/qa/README.md
+++ b/qa/README.md
@@ -8,4 +8,9 @@ Files:
 - `frontier-harness-plan.md` - big-model bakeoff and tuning loop for harness work.
 - `seed-scenarios.json` - repo-backed baseline QA scenarios.
 
+Key workflow:
+
+- `qa suite` is the executable frontier subset / regression loop.
+- `qa manual` is the scoped personality and style probe after the executable subset is green.
+
 Keep this folder in git. Add new scenarios here before wiring them into automation.
diff --git a/qa/frontier-harness-plan.md b/qa/frontier-harness-plan.md
index 0b1930dcbb9..164816f0a7b 100644
--- a/qa/frontier-harness-plan.md
+++ b/qa/frontier-harness-plan.md
@@ -84,6 +84,7 @@ Use the QA Lab runner catalog or `openclaw models list --all` to pick the curren
 - empty-promise rate
 - tool continuity after model switch
 - discovery report completeness and specificity
+- scope drift: unrelated scenario updates, grand wrap-ups, or invented completion tallies
 - latency / obvious stall behavior
 - token cost notes if a change makes the prompt materially heavier
 
@@ -95,11 +96,33 @@ Run this after the executable subset, not before:
 read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences
 ```
 
+GPT manual lane:
+
+```bash
+pnpm openclaw qa manual \
+  --provider-mode live-frontier \
+  --model openai/gpt-5.4 \
+  --alt-model openai/gpt-5.4 \
+  --fast \
+  --message "read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences"
+```
+
+Claude manual lane:
+
+```bash
+pnpm openclaw qa manual \
+  --provider-mode live-frontier \
+  --model anthropic/claude-sonnet-4-6 \
+  --alt-model anthropic/claude-opus-4-6 \
+  --message "read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences"
+```
+
 Score it on:
 
 - did it read first
 - did it say something specific instead of generic fluff
 - did the agent still sound like itself while doing useful work
+- did it stay on the scoped ask instead of widening into a suite recap or fake completion claim
 
 ## Deferred
 