docs(qa): expand frontier bakeoff runbook

Vincent Koc
2026-04-07 10:38:36 +01:00
committed by Peter Steinberger
parent f93b217834
commit e8b446b985
2 changed files with 28 additions and 0 deletions

@@ -8,4 +8,9 @@ Files:
- `frontier-harness-plan.md` - big-model bakeoff and tuning loop for harness work.
- `seed-scenarios.json` - repo-backed baseline QA scenarios.
Key workflow:
- `qa suite` is the executable frontier subset / regression loop.
- `qa manual` is the scoped personality and style probe after the executable subset is green.
Keep this folder in git. Add new scenarios here before wiring them into automation.

@@ -84,6 +84,7 @@ Use the QA Lab runner catalog or `openclaw models list --all` to pick the curren
- empty-promise rate
- tool continuity after model switch
- discovery report completeness and specificity
- scope drift: unrelated scenario updates, grand wrap-ups, or invented completion tallies
- latency / obvious stall behavior
- token cost notes if a change makes the prompt materially heavier
@@ -95,11 +96,33 @@ Run this after the executable subset, not before:
read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences
```
GPT manual lane:
```bash
pnpm openclaw qa manual \
--provider-mode live-frontier \
--model openai/gpt-5.4 \
--alt-model openai/gpt-5.4 \
--fast \
--message "read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences"
```
Claude manual lane:
```bash
pnpm openclaw qa manual \
--provider-mode live-frontier \
--model anthropic/claude-sonnet-4-6 \
--alt-model anthropic/claude-opus-4-6 \
--message "read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences"
```
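Both lanes send the same prompt, so one way to keep the wording from drifting between them is to build each command from a single shared variable. The following is a minimal sketch, not part of the repo: `manual_lane_cmd` and `PROMPT` are hypothetical names, and the script only prints the commands (a dry run) using the flags already shown above.

```shell
#!/usr/bin/env bash
# Sketch: build both manual-lane commands from one shared prompt.
# manual_lane_cmd and PROMPT are hypothetical names; only flags that
# already appear in the runbook are used. Nothing is executed here,
# the commands are printed for review.
set -e

PROMPT="read QA_KICKOFF_TASK.md, tell me what feels half-baked about this qa mission, and keep it to two short sentences"

# Args: model, alt model, then any extra flags (for example --fast).
manual_lane_cmd() {
  local model="$1" alt_model="$2"
  shift 2
  printf '%s ' pnpm openclaw qa manual \
    --provider-mode live-frontier \
    --model "$model" \
    --alt-model "$alt_model" \
    "$@"
  printf '%s "%s"\n' --message "$PROMPT"
}

manual_lane_cmd openai/gpt-5.4 openai/gpt-5.4 --fast
manual_lane_cmd anthropic/claude-sonnet-4-6 anthropic/claude-opus-4-6
```

Printing rather than running keeps the sketch safe to try without live provider credentials; piping a printed line into a shell is left as a deliberate, manual step.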
Score it on:
- did it read `QA_KICKOFF_TASK.md` before answering
- did it say something specific instead of generic fluff
- did the agent still sound like itself while doing useful work
- did it stay on the scoped ask instead of widening into a suite recap or fake completion claim
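The four checks above are human judgments, but recording them in a fixed shape makes runs easier to compare across lanes. A minimal sketch, assuming a reviewer enters `y` or `n` for each check by hand: `score_manual_run` and its output format are hypothetical and not part of `openclaw`.

```shell
#!/usr/bin/env bash
# Sketch: record the four manual-lane checks as yes/no answers.
# score_manual_run and its output format are hypothetical; the
# answers come from a human reviewer, not from the tool.
set -e

# Args: lane label, then y/n for each check in this order:
# read first, specific, in voice, in scope.
score_manual_run() {
  local lane="$1" failed=0 check answer
  shift
  for check in read-first specific in-voice in-scope; do
    answer="$1"
    shift
    if [ "$answer" = "y" ]; then
      echo "$lane $check: pass"
    else
      echo "$lane $check: FAIL"
      failed=$((failed + 1))
    fi
  done
  echo "$lane failed checks: $failed"
}

score_manual_run gpt y y n y
```

The function always returns success so a failed check shows up in the log instead of aborting a wrapper script.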
## Deferred