fix: stabilize qa lab mock suite

Peter Steinberger
2026-04-24 02:46:25 +01:00
parent 2779020cbe
commit 903308dbf2
26 changed files with 302 additions and 104 deletions


@@ -207,7 +207,7 @@ refs and write a judged Markdown report:
```bash
pnpm openclaw qa character-eval \
- --model openai-codex/gpt-5.5,thinking=xhigh \
+ --model openai/gpt-5.4,thinking=medium,fast \
--model openai/gpt-5.2,thinking=xhigh \
+ --model openai/gpt-5,thinking=xhigh \
--model anthropic/claude-opus-4-6,thinking=high \
@@ -215,7 +215,7 @@ pnpm openclaw qa character-eval \
--model zai/glm-5.1,thinking=high \
--model moonshot/kimi-k2.5,thinking=high \
--model google/gemini-3.1-pro-preview,thinking=high \
- --judge-model openai-codex/gpt-5.5,thinking=xhigh,fast \
+ --judge-model openai/gpt-5.4,thinking=xhigh,fast \
--judge-model anthropic/claude-opus-4-6,thinking=high \
--blind-judge-models \
--concurrency 16 \
@@ -227,13 +227,13 @@ scenarios should set the persona through `SOUL.md`, then run ordinary user turns
such as chat, workspace help, and small file tasks. The candidate model should
not be told that it is being evaluated. The command preserves each full
transcript, records basic run stats, then asks the judge models in fast mode with
- `xhigh` reasoning to rank the runs by naturalness, vibe, and humor.
+ `xhigh` reasoning where supported to rank the runs by naturalness, vibe, and humor.
Use `--blind-judge-models` when comparing providers: the judge prompt still gets
every transcript and run status, but candidate refs are replaced with neutral
labels such as `candidate-01`; the report maps rankings back to real refs after
parsing.
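As an illustrative sketch of blind judging (the model refs and levels here are placeholders; only the flags come from the docs above), a two-provider comparison might look like:

```shell
# Illustrative blind comparison; `--blind-judge-models` replaces candidate
# refs with neutral labels (candidate-01, ...) in the judge prompt, and the
# report maps rankings back to real refs after parsing.
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=medium \
  --model anthropic/claude-opus-4-6,thinking=high \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --blind-judge-models
```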
- Candidate runs default to `high` thinking, with `xhigh` for OpenAI models that
- support it. Override a specific candidate inline with
+ Candidate runs default to `high` thinking, with `medium` for GPT-5.4 and `xhigh`
+ for older OpenAI eval refs that support it. Override a specific candidate inline with
`--model provider/model,thinking=<level>`. `--thinking <level>` still sets a
global fallback, and the older `--model-thinking <provider/model=level>` form is
kept for compatibility.
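A sketch of the three override forms described above (the model ref is illustrative; the flag shapes are the documented ones):

```shell
# Inline per-candidate level (preferred form)
pnpm openclaw qa character-eval --model openai/gpt-5.2,thinking=xhigh

# Global fallback for candidates without an inline level
pnpm openclaw qa character-eval --thinking high

# Older per-model form, kept for compatibility
pnpm openclaw qa character-eval --model-thinking openai/gpt-5.2=xhigh
```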
@@ -247,12 +247,12 @@ Candidate and judge model runs both default to concurrency 16. Lower
`--concurrency` or `--judge-concurrency` when provider limits or local gateway
pressure make a run too noisy.
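For example (the values are illustrative), a gentler run against a rate-limited provider might lower both knobs below the default of 16:

```shell
# Reduce candidate and judge concurrency for noisy or rate-limited providers
pnpm openclaw qa character-eval \
  --concurrency 4 \
  --judge-concurrency 2
```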
When no candidate `--model` is passed, the character eval defaults to
- `openai-codex/gpt-5.5`, `openai/gpt-5.4`, `openai/gpt-5.2`, `anthropic/claude-opus-4-6`,
+ `openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`,
`anthropic/claude-sonnet-4-6`, `zai/glm-5.1`,
`moonshot/kimi-k2.5`, and
`google/gemini-3.1-pro-preview`.
When no `--judge-model` is passed, the judges default to
- `openai-codex/gpt-5.5,thinking=xhigh,fast` and
+ `openai/gpt-5.4,thinking=xhigh,fast` and
`anthropic/claude-opus-4-6,thinking=high`.
## Related docs