openclaw/docs/concepts/mantis.md

---
summary: "Mantis is the visual end-to-end verification system for reproducing OpenClaw bugs on live transports, capturing before and after evidence, and attaching artifacts to PRs."
title: "Mantis"
read_when:
  - Building or running live visual QA for OpenClaw bugs
  - Adding before and after verification for a pull request
  - Adding Discord, Slack, WhatsApp, or other live transport scenarios
  - Debugging QA runs that need screenshots, browser automation, or VNC access
---

Mantis is the OpenClaw end-to-end verification system for bugs that need a real
runtime, a real transport, and visible proof. It runs a scenario against a known
bad ref, captures evidence, runs the same scenario against a candidate ref, and
publishes the comparison as artifacts that a maintainer can inspect from a PR or
from a local command.

Mantis starts with Discord because Discord gives us a high-value first lane:
real bot auth, real guild channels, reactions, threads, native commands, and a
browser UI where humans can visually confirm what the transport showed.

## Goals

- Reproduce a bug from a GitHub issue or PR with the same transport shape users
  see.
- Capture a **before** artifact on the baseline ref before applying the fix.
- Capture an **after** artifact on the candidate ref after applying the fix.
- Use a deterministic oracle whenever possible, such as a Discord REST reaction
  read or channel transcript check.
- Capture screenshots when the bug has a visible UI surface.
- Run locally from an agent-controlled CLI and remotely from GitHub.
- Preserve enough machine state for VNC rescue when login, browser automation, or
  provider auth gets stuck.
- Post concise status to an operator Discord channel when the run is blocked,
  needs manual VNC help, or finishes.

## Non Goals

- Mantis is not a replacement for unit tests. A Mantis run should usually become
  a smaller regression test after the fix is understood.
- Mantis is not the normal fast CI gate. It is slower, uses live credentials, and
  is reserved for bugs where the live environment matters.
- Mantis should not require a human for normal operation. Manual VNC is a rescue
  path, not the happy path.
- Mantis does not store raw secrets in artifacts, logs, screenshots, Markdown
  reports, or PR comments.

## Ownership

Mantis lives in the OpenClaw QA stack.

- OpenClaw owns the scenario runtime, transport adapters, evidence schema, and
  local CLI under `pnpm openclaw qa mantis`.
- QA Lab owns the live transport harness pieces, browser capture helpers, and
  artifact writers.
- Crabbox owns warmed Linux machines when a remote VM is needed.
- GitHub Actions owns the remote workflow entrypoint and artifact retention.
- ClawSweeper owns GitHub comment routing: parsing maintainer commands,
  dispatching the workflow, and posting the final PR comment.
- OpenClaw agents drive Mantis through Codex when a scenario needs agentic setup,
  debugging, or stuck-state reporting.

This boundary keeps transport knowledge in OpenClaw, machine scheduling in
Crabbox, and maintainer workflow glue in ClawSweeper.

## Command Shape

The first local command verifies the Discord bot, guild, channel, message send,
reaction send, and artifact path:

```bash
pnpm openclaw qa mantis discord-smoke \
  --output-dir .artifacts/qa-e2e/mantis/discord-smoke
```

The local before and after runner accepts this shape:

```bash
pnpm openclaw qa mantis run \
  --transport discord \
  --scenario discord-status-reactions-tool-only \
  --baseline origin/main \
  --candidate HEAD \
  --output-dir .artifacts/qa-e2e/mantis/local-discord-status-reactions
```

The runner creates detached baseline and candidate worktrees under the output
directory, installs dependencies, builds each ref, runs the scenario with
`--allow-failures`, then writes `baseline/`, `candidate/`, `comparison.json`,
and `mantis-report.md`. For the first Discord scenario, a successful verification
means baseline status is `fail` and candidate status is `pass`.

The GitHub smoke workflow is `Mantis Discord Smoke`. The before and after GitHub
workflow for the first real scenario is `Mantis Discord Status Reactions`. It
accepts:

- `baseline_ref`: the ref expected to reproduce queued-only behavior.
- `candidate_ref`: the ref expected to show `queued -> thinking -> done`.

It checks out the workflow harness ref, builds separate baseline and candidate
worktrees, runs `discord-status-reactions-tool-only` against each worktree, and
uploads `baseline/`, `candidate/`, `comparison.json`, and `mantis-report.md` as
Actions artifacts.

You can also trigger the status-reactions run directly from a PR comment:

```text
@Mantis discord status reactions
```

The comment trigger is intentionally narrow. It only runs on pull request
comments from users with write, maintain, or admin access, and it only recognizes
Discord status-reaction requests. By default it uses the known bad baseline ref
and the current PR head SHA as the candidate. Maintainers can override either
ref:

```text
@Mantis discord status reactions baseline=origin/main candidate=HEAD
```

ClawSweeper command examples:

```text
@clawsweeper mantis discord discord-status-reactions-tool-only
@clawsweeper verify e2e discord
```

The first command is explicit and scenario-focused. The second can later map a PR
or issue to recommended Mantis scenarios from labels, changed files, and
ClawSweeper review findings.

## Run Lifecycle

1. Acquire credentials.
2. Allocate or reuse a VM.
3. Prepare a clean checkout for the baseline ref.
4. Install dependencies and build only what the scenario needs.
5. Start a child OpenClaw Gateway with an isolated state directory.
6. Configure the live transport, provider, model, and browser profile.
7. Run the scenario and capture baseline evidence.
8. Stop the gateway and preserve logs.
9. Prepare the candidate ref in the same VM.
10. Run the same scenario and capture candidate evidence.
11. Compare the oracle results and visual evidence.
12. Write Markdown, JSON, logs, screenshots, and optional trace artifacts.
13. Upload GitHub Actions artifacts.
14. Post a concise PR or Discord status message.

The scenario should be able to fail in two different ways:

- **Bug reproduced**: baseline failed in the expected way.
- **Harness failure**: environment setup, credentials, Discord API, browser, or
  provider failed before the bug oracle was meaningful.

The final report must separate these cases so maintainers do not confuse a flaky
environment with product behavior.

## Discord MVP

The first scenario should target Discord status reactions in guild channels where
the source reply delivery mode is `message_tool_only`.

Why it is a good Mantis seed:

- It is visible in Discord as reactions on the triggering message.
- It has a strong REST oracle through Discord message reaction state.
- It exercises a real OpenClaw Gateway, Discord bot auth, message dispatch,
  source reply delivery mode, status reaction state, and model turn lifecycle.
- It is narrow enough to keep the first implementation honest.

Expected scenario shape:

```yaml
id: discord-status-reactions-tool-only
transport: discord
baseline:
  expect:
    reproduced: true
candidate:
  expect:
    fixed: true
config:
  messages:
    ackReaction: "👀"
    ackReactionScope: "group-mentions"
    groupChat:
      visibleReplies: "message_tool"
    statusReactions:
      enabled: true
      timing:
        debounceMs: 0
discord:
  requireMention: true
  notifyChannel: operator-notify
evidence:
  rest:
    messageReactions: true
  browser:
    screenshotMessageRow: true
```

Baseline evidence should show the queued acknowledgement reaction but no
lifecycle transition in tool-only mode. Candidate evidence should show lifecycle
status reactions running when `messages.statusReactions.enabled` is explicitly
true.

The executable first slice is the opt-in Discord live QA scenario:

```bash
pnpm openclaw qa discord \
  --scenario discord-status-reactions-tool-only \
  --provider-mode live-frontier \
  --model openai/gpt-5.4 \
  --alt-model openai/gpt-5.4 \
  --fast \
  --output-dir .artifacts/qa-e2e/mantis/discord-status-reactions-candidate
```

It configures the SUT with always-on guild handling, `visibleReplies:
"message_tool"`, `ackReaction: "👀"`, and explicit status reactions. The oracle
polls the real Discord triggering message and expects the observed sequence
`👀 -> 🤔 -> 👍`. Artifacts include `discord-qa-reaction-timelines.json`,
`discord-status-reactions-tool-only-timeline.html`, and
`discord-status-reactions-tool-only-timeline.png`.

## Existing QA Pieces

Mantis should build on the existing private QA stack instead of starting from
zero:

- `pnpm openclaw qa discord` already runs a live Discord lane with driver and
  SUT bots.
- The live transport runner already writes reports and observed-message
  artifacts under `.artifacts/qa-e2e/`.
- Convex credential leases already provide exclusive access to shared live
  transport credentials.
- The browser control service already supports screenshots, snapshots,
  headless managed profiles, and remote CDP profiles.
- QA Lab already has a debugger UI and bus for transport-shaped testing.

The first Mantis implementation can be a thin before/after runner over these
pieces, plus one visual evidence layer.

## Evidence Model

Every run writes a stable artifact directory:

```text
.artifacts/qa-e2e/mantis/<run-id>/
  mantis-report.md
  mantis-summary.json
  baseline/
    summary.json
    discord-message.json
    screenshot-message-row.png
    gateway-debug/
  candidate/
    summary.json
    discord-message.json
    screenshot-message-row.png
    gateway-debug/
  comparison.json
  run.log
```

`mantis-summary.json` should be the machine-readable source of truth. The
Markdown report is for PR comments and human review.

The summary must include:

- refs and SHAs tested
- transport and scenario id
- machine provider and machine id or lease id
- credential source without secret values
- baseline result
- candidate result
- whether the bug reproduced on baseline
- whether the candidate fixed it
- artifact paths
- sanitized setup or cleanup issues

Screenshots are evidence, not secrets. They still need redaction discipline:
private channel names, user names, or message content may appear. For public PRs,
prefer GitHub Actions artifact links over inline images until the redaction story
is stronger.

## Browser And VNC

The browser lane has two modes:

- **Headless automation**: default for CI. Chrome runs with CDP enabled, and
  Playwright or OpenClaw browser control captures screenshots.
- **VNC rescue**: enabled on the same VM when login, MFA, Discord anti-automation,
  or visual debugging needs a human.

The Discord observer browser profile should be persistent enough to avoid
logging in for every run, but isolated from personal browser state. A profile
belongs to the Mantis machine pool, not to a developer laptop.

When Mantis gets stuck, it posts a Discord status message with:

- run id
- scenario id
- machine provider
- artifact directory
- VNC or noVNC connection instructions if available
- short blocker text

The first private deployment can post these messages to the existing operator
channel and move to a dedicated Mantis channel later.

## Machines

Mantis should prefer AWS through Crabbox for the first remote implementation.
Crabbox gives us warmed machines, lease tracking, hydration, logs, results, and
cleanup. If AWS capacity is too slow or unavailable, add a Hetzner provider
behind the same machine interface.

Minimum VM requirements:

- Linux with a desktop-capable Chrome or Chromium install
- CDP access for browser automation
- VNC or noVNC for rescue
- Node 22 and pnpm
- OpenClaw checkout and dependency cache
- Playwright Chromium browser cache when Playwright is used
- enough CPU and memory for one OpenClaw Gateway, one browser, and one model run
- outbound access to Discord, GitHub, model providers, and the credential broker

The VM should not keep long-lived raw secrets outside the expected credential or
browser profile stores.

## Secrets

Secrets live in GitHub organization or repository secrets for remote runs, and in
a local operator-controlled secret file for local runs.

Recommended secret names:

- `OPENCLAW_QA_DISCORD_MANTIS_BOT_TOKEN`
- `OPENCLAW_QA_DISCORD_DRIVER_BOT_TOKEN`
- `OPENCLAW_QA_DISCORD_SUT_BOT_TOKEN`
- `OPENCLAW_QA_DISCORD_GUILD_ID`
- `OPENCLAW_QA_DISCORD_CHANNEL_ID`
- `OPENCLAW_QA_DISCORD_NOTIFY_CHANNEL_ID`
- `OPENCLAW_QA_REDACT_PUBLIC_METADATA=1` for public GitHub artifact uploads
- `OPENCLAW_QA_CONVEX_SITE_URL`
- `OPENCLAW_QA_CONVEX_SECRET_CI`

Long term, the Convex credential pool should remain the normal source for live
transport credentials. GitHub secrets bootstrap the broker and fallback lanes.

The Mantis runner must never print:

- Discord bot tokens
- provider API keys
- browser cookies
- auth profile contents
- VNC passwords
- raw credential payloads

Public artifact uploads should also redact Discord target metadata such as bot,
guild, channel, and message ids. The GitHub smoke workflow enables
`OPENCLAW_QA_REDACT_PUBLIC_METADATA=1` for this reason.

If a token is accidentally pasted into an issue, PR, chat, or log, rotate it
after the new secret has been stored.

## GitHub Artifacts And PR Comments

Mantis workflows should upload the full evidence bundle as a short-lived Actions
artifact. When the workflow is run for a bug report or fix PR, it should also
publish the redacted PNG screenshots to the `qa-artifacts` branch and upsert a
comment on that bug or fix PR with inline before/after screenshots. Do not post
the primary proof only on a generic QA automation PR. Raw logs, observed
messages, and other bulky evidence stay in the Actions artifact.

Production workflows should post those comments with the Mantis GitHub App, not
with `github-actions[bot]`. Store the app id and private key as
`MANTIS_GITHUB_APP_ID` and `MANTIS_GITHUB_APP_PRIVATE_KEY` GitHub Actions
secrets. The workflow uses a hidden marker as the upsert key, updates that
comment when the token can edit it, and creates a new Mantis-owned comment when
an older bot-owned marker cannot be edited.

The PR comment should be short and visual:

```md
Mantis Discord Status Reactions QA

Summary: Mantis reran the reported Discord status-reaction bug against the known
bad baseline and the candidate fix. The baseline reproduced the bug, while the
candidate showed the expected queued -> thinking -> done sequence.

- Scenario: `discord-status-reactions-tool-only`
- Run: <workflow run link>
- Artifact: <artifact link>
- Baseline: `<status>` at `<sha>`
- Candidate: `<status>` at `<sha>`

| Baseline            | Candidate           |
| ------------------- | ------------------- |
| <inline screenshot> | <inline screenshot> |
```

When the run fails because the harness failed, the comment must say that instead
of implying the candidate failed.

## Private Deployment Notes

A private deployment may already have a Mantis Discord application. Reuse that
application instead of creating another app when it has the right bot
permissions and can be safely rotated.

Set the initial operator notification channel through secrets or deployment
configuration. It can point at an existing maintainer or operations channel
first, then move to a dedicated Mantis channel once one exists.

Do not put guild ids, channel ids, bot tokens, browser cookies, or VNC passwords
in this document. Store them in GitHub secrets, the credential broker, or the
operator's local secret store.

## Adding A Scenario

A Mantis scenario should declare:

- id and title
- transport
- required credentials
- baseline ref policy
- candidate ref policy
- OpenClaw config patch
- setup steps
- stimulus
- expected baseline oracle
- expected candidate oracle
- visual capture targets
- timeout budget
- cleanup steps

Scenarios should prefer small, typed oracles:

- Discord reaction state for reaction bugs
- Discord message references for threading bugs
- Slack thread ts and reaction API state for Slack bugs
- email message ids and headers for email bugs
- browser screenshots when UI is the only reliable observable

Vision checks should be additive. If a platform API can prove the bug, use the
API as the pass/fail oracle and keep screenshots for human confidence.

## Provider Expansion

After Discord, the same runner can add:

- Slack: reactions, threads, app mentions, modals, file uploads.
- Email: Gmail auth and message threading using `gog` where connectors are not
  enough.
- WhatsApp: QR login, re-identification, message delivery, media, reactions.
- Telegram: group mention gating, commands, reactions where available.
- Matrix: encrypted rooms, thread or reply relations, restart resume.

Each transport should have one cheap smoke scenario and one or more bug-class
scenarios. Expensive visual scenarios should stay opt-in.

## Open Questions

- Which Discord bot should be the driver, and which should be the SUT, when the
  existing Mantis bot is reused?
- Should the observer browser login use a human Discord account, a test account,
  or only bot-readable REST evidence for the first phase?
- How long should GitHub retain Mantis artifacts for PRs?
- When should ClawSweeper automatically recommend Mantis instead of waiting for a
  maintainer command?
- Should screenshots be redacted or cropped before upload for public PRs?