feat: add Mantis visual task video QA

2026-05-06 11:50:43 +00:00 · 2026-05-05 05:34:49 +01:00
parent b0f841ef37
commit 35266879de
12 changed files with 1612 additions and 18 deletions
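
The docs change below says that with `--expect-text` the vision prompt asks for a structured JSON verdict, and that a negative response which merely quotes the target text must fail. The following is a minimal sketch of that rule, not the Mantis implementation. The verdict fields (`visible`, `evidence`) and the function name `evaluateExpectText` are assumptions made for illustration, since the actual schema is not shown in this view.

```typescript
// Hypothetical verdict shape; the real Mantis schema is not shown in this commit view.
interface VisualVerdict {
  visible: boolean;
  evidence: string;
}

interface AssertionResult {
  passed: boolean;
  reason: string;
}

// Decide pass/fail from the model's structured verdict only.
// The raw output is never substring-matched against expectText, so a
// negative answer that quotes the target text cannot pass by accident.
function evaluateExpectText(modelOutput: string, expectText: string): AssertionResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(modelOutput);
  } catch {
    return { passed: false, reason: "model output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { passed: false, reason: "verdict is not a JSON object" };
  }
  const verdict = parsed as Partial<VisualVerdict>;
  if (verdict.visible !== true) {
    return { passed: false, reason: `model did not report "${expectText}" as visible` };
  }
  if (typeof verdict.evidence !== "string" || verdict.evidence.trim() === "") {
    return { passed: false, reason: "positive verdict has no supporting evidence" };
  }
  return { passed: true, reason: verdict.evidence };
}
```

Under these assumptions, free-form prose such as `The page shows Example Domain` fails because it is not a JSON verdict, and `{"visible": false, ...}` fails even when its evidence string repeats the expected text.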
--- a/docs/concepts/qa-e2e-automation.md
+++ b/docs/concepts/qa-e2e-automation.md
@@ -132,12 +132,37 @@ pnpm openclaw qa mantis slack-desktop-smoke \

 That command leases a Crabbox desktop/browser machine, runs the Slack live lane
 inside the VM, opens Slack Web in the VNC browser, captures the desktop, and
-copies `slack-qa/` plus `slack-desktop-smoke.png` back to the Mantis artifact
-directory. Reuse `--lease-id <cbx_...>` after logging in to Slack Web manually
+copies `slack-qa/`, `slack-desktop-smoke.png`, and (when video capture is available) `slack-desktop-smoke.mp4`
+back to the Mantis artifact directory. Reuse `--lease-id <cbx_...>` after logging in to Slack Web manually
 through VNC. With `--gateway-setup`, Mantis leaves a persistent OpenClaw Slack
 gateway running inside the VM on port `38973`; without it, the command runs the
 normal bot-to-bot Slack QA lane and exits after artifact capture.

+For an agent/CV-style desktop task, run:
+
+```bash
+pnpm openclaw qa mantis visual-task \
+  --browser-url https://example.net \
+  --expect-text "Example Domain" \
+  --vision-model openai/gpt-5.4
+```
+
+`visual-task` leases or reuses a Crabbox desktop/browser machine, starts
+`crabbox record --while`, drives the visible browser through a nested
+`visual-driver`, captures `visual-task.png`, runs `openclaw infer image describe`
+against the screenshot when `--vision-mode image-describe` is selected, and
+writes `visual-task.mp4`, `mantis-visual-task-summary.json`,
+`mantis-visual-task-driver-result.json`, and `mantis-visual-task-report.md`.
+When `--expect-text` is set, the vision prompt asks for a structured JSON
+verdict and only passes when the model reports positive visible evidence; a
+negative response that merely quotes the target text fails the assertion.
+Use `--vision-mode metadata` for a no-model smoke that proves the desktop,
+browser, screenshot, and video plumbing without calling an image-understanding
+provider. Recording is a required artifact for `visual-task`; if Crabbox records
+no non-empty `visual-task.mp4`, the task fails even when the visual driver
+passed. On failure, Mantis keeps the lease for VNC unless the task had already
+passed and `--keep-lease` was not set.
+
 Before using pooled live credentials, run:

 ```bash
@@ -266,7 +291,7 @@ Scenarios (`extensions/qa-lab/src/live-transports/discord/discord-live.runtime.t
 - `discord-canary`
 - `discord-mention-gating`
 - `discord-native-help-command-registration`
-- `discord-status-reactions-tool-only` — opt-in Mantis scenario. Runs by itself because it switches the SUT to always-on, tool-only guild replies with `messages.statusReactions.enabled=true`, then captures a REST reaction timeline plus an HTML/PNG visual artifact.
+- `discord-status-reactions-tool-only` — opt-in Mantis scenario. Runs by itself because it switches the SUT to always-on, tool-only guild replies with `messages.statusReactions.enabled=true`, then captures a REST reaction timeline plus HTML/PNG visual artifacts. Mantis before/after reports also preserve scenario-provided MP4 artifacts as `baseline.mp4` and `candidate.mp4`.

 Run the Mantis status-reaction scenario explicitly: