feat: add Mantis Discord smoke runner (#76696)

* docs: add Mantis QA system design * feat: add Mantis Discord smoke runner * fix: harden Mantis Discord smoke * fix: redact Mantis Discord artifacts * fix: satisfy Mantis redaction lint * fix: redact Mantis mismatch failures * test: avoid promise assertions in Mantis tests
2026-05-06 14:50:45 +00:00 · 2026-05-03 15:25:56 +01:00
parent 6aa4fb7a69
commit 0bf06e953f
11 changed files with 1509 additions and 1 deletions
--- a/docs/concepts/mantis.md
+++ b/docs/concepts/mantis.md
@@ -0,0 +1,412 @@
+---
+summary: "Mantis is the visual end-to-end verification system for reproducing OpenClaw bugs on live transports, capturing before and after evidence, and attaching artifacts to PRs."
+title: "Mantis"
+read_when:
+  - Building or running live visual QA for OpenClaw bugs
+  - Adding before and after verification for a pull request
+  - Adding Discord, Slack, WhatsApp, or other live transport scenarios
+  - Debugging QA runs that need screenshots, browser automation, or VNC access
+---
+
+Mantis is the OpenClaw end-to-end verification system for bugs that need a real
+runtime, a real transport, and visible proof. It runs a scenario against a known
+bad ref, captures evidence, runs the same scenario against a candidate ref, and
+publishes the comparison as artifacts that a maintainer can inspect from a PR or
+from a local command.
+
+Mantis starts with Discord because Discord gives us a high-value first lane:
+real bot auth, real guild channels, reactions, threads, native commands, and a
+browser UI where humans can visually confirm what the transport showed.
+
+## Goals
+
+- Reproduce a bug from a GitHub issue or PR with the same transport shape users
+  see.
+- Capture a **before** artifact on the baseline ref before applying the fix.
+- Capture an **after** artifact on the candidate ref after applying the fix.
+- Use a deterministic oracle whenever possible, such as a Discord REST reaction
+  read or channel transcript check.
+- Capture screenshots when the bug has a visible UI surface.
+- Run locally from an agent-controlled CLI and remotely from GitHub.
+- Preserve enough machine state for VNC rescue when login, browser automation, or
+  provider auth gets stuck.
+- Post concise status to an operator Discord channel when the run is blocked,
+  needs manual VNC help, or finishes.
+
+## Non Goals
+
+- Mantis is not a replacement for unit tests. A Mantis run should usually become
+  a smaller regression test after the fix is understood.
+- Mantis is not the normal fast CI gate. It is slower, uses live credentials, and
+  is reserved for bugs where the live environment matters.
+- Mantis should not require a human for normal operation. Manual VNC is a rescue
+  path, not the happy path.
+- Mantis does not store raw secrets in artifacts, logs, screenshots, Markdown
+  reports, or PR comments.
+
+## Ownership
+
+Mantis lives in the OpenClaw QA stack.
+
+- OpenClaw owns the scenario runtime, transport adapters, evidence schema, and
+  local CLI under `pnpm openclaw qa mantis`.
+- QA Lab owns the live transport harness pieces, browser capture helpers, and
+  artifact writers.
+- Crabbox owns warmed Linux machines when a remote VM is needed.
+- GitHub Actions owns the remote workflow entrypoint and artifact retention.
+- ClawSweeper owns GitHub comment routing: parsing maintainer commands,
+  dispatching the workflow, and posting the final PR comment.
+- OpenClaw agents drive Mantis through Codex when a scenario needs agentic setup,
+  debugging, or stuck-state reporting.
+
+This boundary keeps transport knowledge in OpenClaw, machine scheduling in
+Crabbox, and maintainer workflow glue in ClawSweeper.
+
+## Command Shape
+
+The first local command verifies the Discord bot, guild, channel, message send,
+reaction send, and artifact path:
+
+```bash
+pnpm openclaw qa mantis discord-smoke \
+  --output-dir .artifacts/qa-e2e/mantis/discord-smoke
+```
+
+The later before and after runner should accept this shape:
+
+```bash
+pnpm openclaw qa mantis run \
+  --transport discord \
+  --scenario discord-status-reactions-tool-only \
+  --baseline origin/main \
+  --candidate HEAD \
+  --output-dir .artifacts/qa-e2e/mantis/local-discord-status-reactions
+```
+
+The GitHub smoke workflow is `Mantis Discord Smoke`. The before and after GitHub
+workflow should accept equivalent inputs:
+
+- `transport`: `discord` for the first version.
+- `scenario`: one or more scenario ids.
+- `baseline_ref`: default `origin/main` or the linked issue's reported bad tag.
+- `candidate_ref`: the PR head SHA.
+- `machine_provider`: `aws` by default, with later `hetzner` fallback.
+- `post_to_pr`: whether ClawSweeper should comment with the result.
+
+ClawSweeper command examples:
+
+```text
+@clawsweeper mantis discord discord-status-reactions-tool-only
+@clawsweeper verify e2e discord
+```
+
+The first command is explicit and scenario-focused. The second can later map a PR
+or issue to recommended Mantis scenarios from labels, changed files, and
+ClawSweeper review findings.
+
+## Run Lifecycle
+
+1. Acquire credentials.
+2. Allocate or reuse a VM.
+3. Prepare a clean checkout for the baseline ref.
+4. Install dependencies and build only what the scenario needs.
+5. Start a child OpenClaw Gateway with an isolated state directory.
+6. Configure the live transport, provider, model, and browser profile.
+7. Run the scenario and capture baseline evidence.
+8. Stop the gateway and preserve logs.
+9. Prepare the candidate ref in the same VM.
+10. Run the same scenario and capture candidate evidence.
+11. Compare the oracle results and visual evidence.
+12. Write Markdown, JSON, logs, screenshots, and optional trace artifacts.
+13. Upload GitHub Actions artifacts.
+14. Post a concise PR or Discord status message.
+
+The scenario should be able to fail in two different ways:
+
+- **Bug reproduced**: baseline failed in the expected way.
+- **Harness failure**: environment setup, credentials, Discord API, browser, or
+  provider failed before the bug oracle was meaningful.
+
+The final report must separate these cases so maintainers do not confuse a flaky
+environment with product behavior.
+
+## Discord MVP
+
+The first scenario should target Discord status reactions in guild channels where
+the source reply delivery mode is `message_tool_only`.
+
+Why it is a good Mantis seed:
+
+- It is visible in Discord as reactions on the triggering message.
+- It has a strong REST oracle through Discord message reaction state.
+- It exercises a real OpenClaw Gateway, Discord bot auth, message dispatch,
+  source reply delivery mode, status reaction state, and model turn lifecycle.
+- It is narrow enough to keep the first implementation honest.
+
+Expected scenario shape:
+
+```yaml
+id: discord-status-reactions-tool-only
+transport: discord
+baseline:
+  expect:
+    reproduced: true
+candidate:
+  expect:
+    fixed: true
+config:
+  messages:
+    ackReaction: "👀"
+    ackReactionScope: "group-mentions"
+    groupChat:
+      visibleReplies: "message_tool"
+    statusReactions:
+      enabled: true
+      timing:
+        debounceMs: 0
+discord:
+  requireMention: true
+  notifyChannel: operator-notify
+evidence:
+  rest:
+    messageReactions: true
+  browser:
+    screenshotMessageRow: true
+```
+
+Baseline evidence should show the queued acknowledgement reaction but no
+lifecycle transition in tool-only mode. Candidate evidence should show lifecycle
+status reactions running when `messages.statusReactions.enabled` is explicitly
+true.
+
+## Existing QA Pieces
+
+Mantis should build on the existing private QA stack instead of starting from
+zero:
+
+- `pnpm openclaw qa discord` already runs a live Discord lane with driver and
+  SUT bots.
+- The live transport runner already writes reports and observed-message
+  artifacts under `.artifacts/qa-e2e/`.
+- Convex credential leases already provide exclusive access to shared live
+  transport credentials.
+- The browser control service already supports screenshots, snapshots,
+  headless managed profiles, and remote CDP profiles.
+- QA Lab already has a debugger UI and bus for transport-shaped testing.
+
+The first Mantis implementation can be a thin before/after runner over these
+pieces, plus one visual evidence layer.
+
+## Evidence Model
+
+Every run writes a stable artifact directory:
+
+```text
+.artifacts/qa-e2e/mantis/<run-id>/
+  mantis-report.md
+  mantis-summary.json
+  baseline/
+    summary.json
+    discord-message.json
+    screenshot-message-row.png
+    gateway-debug/
+  candidate/
+    summary.json
+    discord-message.json
+    screenshot-message-row.png
+    gateway-debug/
+  comparison.json
+  run.log
+```
+
+`mantis-summary.json` should be the machine-readable source of truth. The
+Markdown report is for PR comments and human review.
+
+The summary must include:
+
+- refs and SHAs tested
+- transport and scenario id
+- machine provider and machine id or lease id
+- credential source without secret values
+- baseline result
+- candidate result
+- whether the bug reproduced on baseline
+- whether the candidate fixed it
+- artifact paths
+- sanitized setup or cleanup issues
+
+Screenshots are evidence, not secrets. They still need redaction discipline:
+private channel names, user names, or message content may appear. For public PRs,
+prefer GitHub Actions artifact links over inline images until the redaction story
+is stronger.
+
+## Browser And VNC
+
+The browser lane has two modes:
+
+- **Headless automation**: default for CI. Chrome runs with CDP enabled, and
+  Playwright or OpenClaw browser control captures screenshots.
+- **VNC rescue**: enabled on the same VM when login, MFA, Discord anti-automation,
+  or visual debugging needs a human.
+
+The Discord observer browser profile should be persistent enough to avoid
+logging in for every run, but isolated from personal browser state. A profile
+belongs to the Mantis machine pool, not to a developer laptop.
+
+When Mantis gets stuck, it posts a Discord status message with:
+
+- run id
+- scenario id
+- machine provider
+- artifact directory
+- VNC or noVNC connection instructions if available
+- short blocker text
+
+The first private deployment can post these messages to the existing operator
+channel and move to a dedicated Mantis channel later.
+
+## Machines
+
+Mantis should prefer AWS through Crabbox for the first remote implementation.
+Crabbox gives us warmed machines, lease tracking, hydration, logs, results, and
+cleanup. If AWS capacity is too slow or unavailable, add a Hetzner provider
+behind the same machine interface.
+
+Minimum VM requirements:
+
+- Linux with a desktop-capable Chrome or Chromium install
+- CDP access for browser automation
+- VNC or noVNC for rescue
+- Node 22 and pnpm
+- OpenClaw checkout and dependency cache
+- Playwright Chromium browser cache when Playwright is used
+- enough CPU and memory for one OpenClaw Gateway, one browser, and one model run
+- outbound access to Discord, GitHub, model providers, and the credential broker
+
+The VM should not keep long-lived raw secrets outside the expected credential or
+browser profile stores.
+
+## Secrets
+
+Secrets live in GitHub organization or repository secrets for remote runs, and in
+a local operator-controlled secret file for local runs.
+
+Recommended secret names:
+
+- `OPENCLAW_QA_DISCORD_MANTIS_BOT_TOKEN`
+- `OPENCLAW_QA_DISCORD_DRIVER_BOT_TOKEN`
+- `OPENCLAW_QA_DISCORD_SUT_BOT_TOKEN`
+- `OPENCLAW_QA_DISCORD_GUILD_ID`
+- `OPENCLAW_QA_DISCORD_CHANNEL_ID`
+- `OPENCLAW_QA_DISCORD_NOTIFY_CHANNEL_ID`
+- `OPENCLAW_QA_REDACT_PUBLIC_METADATA=1` for public GitHub artifact uploads
+- `OPENCLAW_QA_CONVEX_SITE_URL`
+- `OPENCLAW_QA_CONVEX_SECRET_CI`
+
+Long term, the Convex credential pool should remain the normal source for live
+transport credentials. GitHub secrets bootstrap the broker and fallback lanes.
+
+The Mantis runner must never print:
+
+- Discord bot tokens
+- provider API keys
+- browser cookies
+- auth profile contents
+- VNC passwords
+- raw credential payloads
+
+Public artifact uploads should also redact Discord target metadata such as bot,
+guild, channel, and message ids. The GitHub smoke workflow enables
+`OPENCLAW_QA_REDACT_PUBLIC_METADATA=1` for this reason.
+
+If a token is accidentally pasted into an issue, PR, chat, or log, rotate it
+after the new secret has been stored.
+
+## GitHub Artifacts And PR Comments
+
+The first GitHub version should upload screenshots as Actions artifacts and link
+them from the PR comment. Inline images can come later once redaction, retention,
+and public/private repo behavior are settled.
+
+The PR comment should be short:
+
+```md
+Mantis Discord verification: pass
+
+- Scenario: `discord-status-reactions-tool-only`
+- Baseline: reproduced on `<sha>`
+- Candidate: fixed on `<sha>`
+- Evidence: <artifact link>
+- Screenshots: baseline and candidate message-row captures in the artifact
+```
+
+When the run fails because the harness failed, the comment must say that instead
+of implying the candidate failed.
+
+## Private Deployment Notes
+
+A private deployment may already have a Mantis Discord application. Reuse that
+application instead of creating another app when it has the right bot
+permissions and can be safely rotated.
+
+Set the initial operator notification channel through secrets or deployment
+configuration. It can point at an existing maintainer or operations channel
+first, then move to a dedicated Mantis channel once one exists.
+
+Do not put guild ids, channel ids, bot tokens, browser cookies, or VNC passwords
+in this document. Store them in GitHub secrets, the credential broker, or the
+operator's local secret store.
+
+## Adding A Scenario
+
+A Mantis scenario should declare:
+
+- id and title
+- transport
+- required credentials
+- baseline ref policy
+- candidate ref policy
+- OpenClaw config patch
+- setup steps
+- stimulus
+- expected baseline oracle
+- expected candidate oracle
+- visual capture targets
+- timeout budget
+- cleanup steps
+
+Scenarios should prefer small, typed oracles:
+
+- Discord reaction state for reaction bugs
+- Discord message references for threading bugs
+- Slack thread ts and reaction API state for Slack bugs
+- email message ids and headers for email bugs
+- browser screenshots when UI is the only reliable observable
+
+Vision checks should be additive. If a platform API can prove the bug, use the
+API as the pass/fail oracle and keep screenshots for human confidence.
+
+## Provider Expansion
+
+After Discord, the same runner can add:
+
+- Slack: reactions, threads, app mentions, modals, file uploads.
+- Email: Gmail auth and message threading using `gog` where connectors are not
+  enough.
+- WhatsApp: QR login, re-identification, message delivery, media, reactions.
+- Telegram: group mention gating, commands, reactions where available.
+- Matrix: encrypted rooms, thread or reply relations, restart resume.
+
+Each transport should have one cheap smoke scenario and one or more bug-class
+scenarios. Expensive visual scenarios should stay opt-in.
+
+## Open Questions
+
+- Which Discord bot should be the driver, and which should be the SUT, when the
+  existing Mantis bot is reused?
+- Should the observer browser login use a human Discord account, a test account,
+  or only bot-readable REST evidence for the first phase?
+- How long should GitHub retain Mantis artifacts for PRs?
+- When should ClawSweeper automatically recommend Mantis instead of waiting for a
+  maintainer command?
+- Should screenshots be redacted or cropped before upload for public PRs?