openclaw/docs/concepts/qa-e2e-automation.md at be46d0ddc6ed93558e3dcdee07f48cd560e4eded


mirror of https://github.com/openclaw/openclaw.git synced 2026-04-12 09:41:11 +00:00


Peter Steinberger be46d0ddc6 test: update character eval public panel

2026-04-09 01:25:59 +01:00


summary: Private QA automation shape for qa-lab, qa-channel, seeded scenarios, and protocol reports

read_when:

Extending qa-lab or qa-channel
Adding repo-backed QA scenarios
Building higher-realism QA automation around the Gateway dashboard

QA E2E Automation

The private QA stack is meant to exercise OpenClaw in a more realistic, channel-shaped way than a single unit test can.

Current pieces:

extensions/qa-channel: synthetic message channel with DM, channel, thread, reaction, edit, and delete surfaces.
extensions/qa-lab: debugger UI and QA bus for observing the transcript, injecting inbound messages, and exporting a Markdown report.
qa/: repo-backed seed assets for the kickoff task and baseline QA scenarios.

The current QA operator flow is a two-pane QA site:

Left: Gateway dashboard (Control UI) with the agent.
Right: QA Lab, showing the Slack-ish transcript and scenario plan.

Run it with:

pnpm qa:lab:up

That builds the QA site, starts the Docker-backed gateway lane, and exposes the QA Lab page where an operator or automation loop can give the agent a QA mission, observe real channel behavior, and record what worked, failed, or stayed blocked.

For faster QA Lab UI iteration without rebuilding the Docker image each time, start the stack with a bind-mounted QA Lab bundle:

pnpm openclaw qa docker-build-image
pnpm qa:lab:build
pnpm qa:lab:up:fast
pnpm qa:lab:watch

qa:lab:up:fast keeps the Docker services on a prebuilt image and bind-mounts extensions/qa-lab/web/dist into the qa-lab container. qa:lab:watch rebuilds that bundle on change, and the browser auto-reloads when the QA Lab asset hash changes.
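The reload-on-hash-change idea can be illustrated with a minimal sketch. This is not the QA Lab implementation: how the real asset hash is computed is not stated here, so the `bundle_hash` and `should_reload` helpers below are hypothetical and simply hash every file under a bundle directory.

```python
import hashlib
from pathlib import Path
from typing import Optional


def bundle_hash(dist_dir: Path) -> str:
    """Return one SHA-256 digest covering every file under dist_dir (hypothetical helper)."""
    digest = hashlib.sha256()
    # Sort paths so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in dist_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(dist_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def should_reload(last_seen: Optional[str], current: str) -> bool:
    """Reload only when a previously seen hash differs from the current one."""
    return last_seen is not None and last_seen != current
```

A page following this pattern would remember the hash it loaded with, poll for the current one, and reload once `should_reload` returns true.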

Repo-backed seeds

Seed assets live in qa/:

qa/scenarios/index.md
qa/scenarios/*.md

These are intentionally in git so the QA plan is visible to both humans and the agent. The baseline list should stay broad enough to cover:

DM and channel chat
thread behavior
message action lifecycle
cron callbacks
memory recall
model switching
subagent handoff
repo-reading and docs-reading
one small build task such as Lobster Invaders

Reporting

qa-lab exports a Markdown protocol report from the observed bus timeline. The report should answer:

What worked
What failed
What stayed blocked
What follow-up scenarios are worth adding
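The four questions above map naturally onto four report sections. The sketch below shows one way to group findings into such a Markdown report; the `Finding` type, the status keys, and the report title are assumptions made for illustration and are not the qa-lab exporter's actual data model.

```python
from dataclasses import dataclass
from typing import List

# Section headings taken from the questions the report should answer.
# The dictionary keys are hypothetical status names.
SECTIONS = {
    "worked": "What worked",
    "failed": "What failed",
    "blocked": "What stayed blocked",
    "follow_up": "What follow-up scenarios are worth adding",
}


@dataclass
class Finding:
    status: str    # one of the SECTIONS keys
    scenario: str  # scenario name, e.g. a file stem under qa/scenarios/
    note: str      # short human-readable observation


def render_report(findings: List[Finding]) -> str:
    """Render findings as Markdown, one section per question, in a fixed order."""
    for finding in findings:
        if finding.status not in SECTIONS:
            raise ValueError(f"unknown status: {finding.status!r}")
    lines = ["# QA protocol report", ""]
    for key, heading in SECTIONS.items():
        lines.append(f"## {heading}")
        lines.append("")
        matching = [f for f in findings if f.status == key]
        if not matching:
            lines.append("- (none)")
        for finding in matching:
            lines.append(f"- {finding.scenario}: {finding.note}")
        lines.append("")
    return "\n".join(lines)
```

Emitting every section even when it is empty keeps reports from different runs easy to compare side by side.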

For character and style checks, run the same scenario across multiple live model refs and write a judged Markdown report:

pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --blind-judge-models \
  --concurrency 16 \
  --judge-concurrency 16
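Each --model and --judge-model value in this command is a comma-separated string: a provider/model ref followed by inline options. As a rough illustration of that shape, the sketch below parses such a value. It is not the OpenClaw CLI parser: `ModelRef` and `parse_model_ref` are hypothetical names, and the only options handled are thinking=<level>, fast, no-fast, and fast=false, which are the ones this document mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRef:
    ref: str                        # "provider/model"
    thinking: Optional[str] = None  # None means "use the default or global fallback"
    fast: Optional[bool] = None     # None means "use the provider default"


def parse_model_ref(value: str) -> ModelRef:
    """Parse 'provider/model[,thinking=<level>][,fast|,no-fast|,fast=false]'."""
    ref, *options = [part.strip() for part in value.split(",")]
    if "/" not in ref:
        raise ValueError(f"expected provider/model, got {ref!r}")
    parsed = ModelRef(ref=ref)
    for option in options:
        if option == "fast":
            parsed.fast = True
        elif option in ("no-fast", "fast=false"):
            parsed.fast = False
        elif option.startswith("thinking=") and len(option) > len("thinking="):
            parsed.thinking = option.split("=", 1)[1]
        else:
            raise ValueError(f"unknown inline option: {option!r}")
    return parsed
```

Leaving `thinking` and `fast` as `None` when they are not given keeps "not specified" distinct from an explicit override, so defaults can be applied in a later step.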

The command runs local QA gateway child processes, not Docker. Character eval scenarios should set the persona through SOUL.md, then run ordinary user turns such as chat, workspace help, and small file tasks. The candidate model should not be told that it is being evaluated.

The command preserves each full transcript, records basic run stats, then asks the judge models in fast mode with xhigh reasoning to rank the runs by naturalness, vibe, and humor. Use --blind-judge-models when comparing providers: the judge prompt still gets every transcript and run status, but candidate refs are replaced with neutral labels such as candidate-01; the report maps rankings back to real refs after parsing.

Candidate runs default to high thinking, with xhigh for OpenAI models that support it. Override a specific candidate inline with --model provider/model,thinking=<level>. --thinking <level> still sets a global fallback, and the older --model-thinking <provider/model=level> form is kept for compatibility.

OpenAI candidate refs default to fast mode so priority processing is used where the provider supports it. Add ,fast, ,no-fast, or ,fast=false inline when a single candidate or judge needs an override. Pass --fast only when you want to force fast mode on for every candidate model.

Candidate and judge durations are recorded in the report for benchmark analysis, but judge prompts explicitly say not to rank by speed. Candidate and judge model runs both default to concurrency 16. Lower --concurrency or --judge-concurrency when provider limits or local gateway pressure make a run too noisy.

When no candidate --model is passed, the character eval defaults to openai/gpt-5.4, openai/gpt-5.2, openai/gpt-5, anthropic/claude-opus-4-6, anthropic/claude-sonnet-4-6, zai/glm-5.1, moonshot/kimi-k2.5, and google/gemini-3.1-pro-preview. When no --judge-model is passed, the judges default to openai/gpt-5.4,thinking=xhigh,fast and anthropic/claude-opus-4-6,thinking=high.
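The blind-judging step described for --blind-judge-models, where candidate refs are swapped for neutral labels and rankings are mapped back afterwards, can be sketched as follows. The helper names are hypothetical, and assigning labels in input order is an assumption; the actual command may order or shuffle candidates differently.

```python
from typing import Dict, List


def blind_labels(refs: List[str]) -> Dict[str, str]:
    """Map neutral labels (candidate-01, candidate-02, ...) to real model refs."""
    return {f"candidate-{i:02d}": ref for i, ref in enumerate(refs, start=1)}


def unblind_ranking(ranking: List[str], labels: Dict[str, str]) -> List[str]:
    """Translate a judge's ranking of neutral labels back to real refs."""
    unknown = [label for label in ranking if label not in labels]
    if unknown:
        raise ValueError(f"judge returned unknown labels: {unknown}")
    return [labels[label] for label in ranking]


# Usage: the judge only ever sees the keys of `labels`.
labels = blind_labels(["openai/gpt-5.4", "anthropic/claude-opus-4-6"])
ranked = unblind_ranking(["candidate-02", "candidate-01"], labels)
```

Rejecting labels the judge was never given guards against a judge response that invents or misspells a candidate name.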
