openclaw/docs/concepts/qa-e2e-automation.md
2026-04-05 23:21:56 +01:00


Title: QA E2E Automation
Summary: Design note for a full end-to-end QA system built on a synthetic message-channel plugin, Dockerized OpenClaw, and subagent-driven scenario execution
Read when:
  • You are designing a true end-to-end QA harness for OpenClaw
  • You want a synthetic message channel for automated feature verification
  • You want subagents to discover features, run scenarios, and propose fixes

QA E2E Automation

This note proposes a true end-to-end QA system for OpenClaw built around a real channel plugin dedicated to testing.

The core idea:

  • run OpenClaw inside Docker in a realistic gateway configuration
  • expose a synthetic but full-featured message channel as a normal plugin
  • let a QA harness inject inbound traffic and inspect outbound state
  • let OpenClaw agents and subagents explore, verify, and report on behavior
  • optionally escalate failing scenarios into host-side fix workflows that open PRs

This is not a unit-test replacement. It is a product-level system test layer.

Chosen direction

The initial direction for this project is:

  • build the full system inside this repo
  • test against a matrix, not a single model/provider pair
  • use Markdown reports as the first output artifact
  • defer auto-PR and auto-fix work until later
  • treat Slack-class semantics as the MVP transport target
  • keep orchestration simple in v1, with a host-side controller that exercises the moving parts directly
  • evolve toward OpenClaw becoming the orchestration layer later, once the transport, scenario, and reporting model are proven

Goals

  • Test OpenClaw through a real messaging-channel boundary, not only chat.send or embedded mocks.
  • Verify channel semantics that matter for real use:
    • DMs
    • channels/groups
    • threads
    • edits
    • deletes
    • reactions
    • polls
    • attachments
  • Verify agent behavior across realistic user flows:
    • memory
    • thread binding
    • model switching
    • cron jobs
    • subagents
    • approvals
    • routing
    • channel-specific message actions
  • Make the QA runner capable of feature discovery:
    • read docs
    • inspect plugin capability discovery
    • inspect code and config
    • generate a scenario protocol
  • Support deterministic protocol tests and best-effort real-model tests as separate lanes.
  • Allow automated bug triage artifacts that can feed a host-side fix worker.

Non-goals

  • Not a replacement for existing unit, contract, or live tests.
  • Not a production channel.
  • Not a requirement that all bug fixing happen from inside the Dockerized OpenClaw runtime.
  • Not a reason to add test-only core branches for one channel.

Why a channel plugin

OpenClaw already has the right boundary:

  • core owns the shared message tool, prompt wiring, outer session bookkeeping, and dispatch
  • channel plugins own:
    • config
    • pairing
    • security
    • session grammar
    • threading
    • outbound delivery
    • channel-owned actions and capability discovery

That means the cleanest design is:

  • a real channel plugin for QA transport semantics
  • a separate QA control plane for injection and inspection

This keeps the test transport inside the same architecture used by Slack, Discord, Teams, and similar channels.

System overview

The system has six pieces.

  1. qa-channel plugin
    • Bundled extension under extensions/qa-channel
    • Normal ChannelPlugin
    • Behaves like a Slack/Discord/Teams-class channel
    • Registers channel-owned message actions through the shared message tool
  2. qa-bus sidecar
    • Small HTTP and/or WS service
    • Canonical state store for synthetic conversations, messages, threads, reactions, edits, and event history
    • Accepts inbound events from the harness
    • Exposes inspection and wait APIs for assertions
  3. Dockerized OpenClaw gateway
    • Runs as close to real deployment as practical
    • Loads qa-channel
    • Uses normal config, routing, session, cron, and plugin loading
  4. QA orchestrator
    • Host-side runner or dedicated OpenClaw-driven controller
    • Provisions scenario environments
    • Seeds config
    • Resets state
    • Executes test matrix
    • Collects structured outcomes
  5. Auto-fix worker
    • Host-side workflow
    • Creates a worktree
    • Launches a coding agent
    • Runs scoped verification
    • Opens a PR

The auto-fix worker should start outside the container. It needs direct repo and GitHub access, clean worktree control, and better isolation from the runtime under test.

  6. qa-lab extension
    • Bundled extension under extensions/qa-lab
    • Owns the QA harness, Markdown report flow, and private debugger UI
    • Registers hidden CLI entrypoints such as openclaw qa run and openclaw qa ui
    • Stays separate from the shipped Control UI bundle

High-level flow

  1. Start qa-bus.
  2. Start OpenClaw in Docker with qa-channel enabled.
  3. QA orchestrator injects inbound messages into qa-bus.
  4. qa-channel receives them as normal inbound traffic.
  5. OpenClaw runs the agent loop normally.
  6. Outbound replies and channel actions flow back through qa-channel into qa-bus.
  7. QA orchestrator inspects state or waits on events.
  8. Orchestrator records pass/fail/flaky/unknown plus artifacts.
  9. Severe failures optionally emit a bug packet for the host-side fix worker.

Lanes

The system should have two distinct lanes.

Lane A: deterministic protocol lane

Use a deterministic or tightly controlled model setup.

Preferred options:

  • a canned provider fixture
  • the bundled synthetic provider when useful
  • fixed prompts with exact assertions

Purpose:

  • verify transport and product semantics
  • keep flakiness low
  • catch regressions in routing, memory plumbing, thread binding, cron, and tool invocation

Lane B: quality lane

Use real providers and real models in a matrix.

Purpose:

  • verify that the agent can still do good work end to end
  • evaluate feature discoverability and instruction following
  • surface model-specific breakage or degraded behavior

Expected result type:

  • best-effort
  • rubric-based
  • more tolerant of wording variation

Matrix guidance for v1:

  • start with a small curated matrix, not "everything configured"
  • keep deterministic protocol runs separate from quality runs
  • report matrix cells independently so one provider/model failure does not hide transport correctness

Do not mix these lanes. Protocol correctness and model quality should fail independently.
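
As a minimal sketch of independent reporting, results can be keyed by lane and matrix cell so that one failing provider/model cell never changes the verdict of another lane. The result shape, the cell naming, and both function names below are assumptions made for illustration, not existing OpenClaw types.

```typescript
// Hypothetical result shape; field names are assumptions for illustration.
type Lane = "deterministic" | "quality";
type Outcome = "pass" | "fail" | "flaky" | "unknown";

interface ScenarioResult {
  scenarioId: string;
  lane: Lane;
  cell: string; // matrix cell, e.g. "provider-a/model-x" or a fixture name
  outcome: Outcome;
}

type CellSummary = Record<Outcome, number>;

// Count outcomes per lane and matrix cell so each cell is reported on its own.
function summarizeByCell(results: ScenarioResult[]): Map<string, CellSummary> {
  const summary = new Map<string, CellSummary>();
  for (const r of results) {
    const key = `${r.lane}:${r.cell}`;
    const counts = summary.get(key) ?? { pass: 0, fail: 0, flaky: 0, unknown: 0 };
    counts[r.outcome] += 1;
    summary.set(key, counts);
  }
  return summary;
}

// A lane is green only when none of its own cells recorded a failure.
function laneIsGreen(summary: Map<string, CellSummary>, lane: Lane): boolean {
  let green = true;
  summary.forEach((counts, key) => {
    if (key.startsWith(`${lane}:`) && counts.fail > 0) green = false;
  });
  return green;
}
```

With this shape, a failing quality cell leaves the deterministic lane's verdict untouched, which is the separation the lanes are meant to provide.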

Use the existing bootstrap seam first

Before the custom channel exists, OpenClaw already has a useful bootstrap path:

  • admin-scoped synthetic originating-route fields on chat.send
  • synthetic message-channel headers for HTTP flows

That is enough to build a first QA controller for:

  • thread/session routing
  • ACP bind flows
  • subagent delivery
  • cron wake paths
  • memory persistence checks

This should be Phase 0 because it de-risks the scenario protocol before the full channel lands.

qa-lab extension design

qa-lab is the private operator-facing half of this system.

Suggested package:

  • extensions/qa-lab/

Suggested responsibilities:

  • host the synthetic bus state machine
  • host the scenario runner
  • write Markdown reports
  • serve a private debugger UI on a separate local server
  • keep that UI entirely outside the shipped Control UI bundle

Suggested UI shape:

  • left rail for conversations and threads
  • center transcript pane
  • right rail for event stream and report inspection
  • bottom inject-composer for inbound QA traffic

qa-channel plugin design

Package layout

Suggested package:

  • extensions/qa-channel/

Suggested file layout:

  • package.json
  • openclaw.plugin.json
  • index.ts
  • setup-entry.ts
  • api.ts
  • runtime-api.ts
  • src/channel.ts
  • src/channel-api.ts
  • src/config-schema.ts
  • src/setup-core.ts
  • src/setup-surface.ts
  • src/runtime.ts
  • src/channel.runtime.ts
  • src/inbound.ts
  • src/outbound.ts
  • src/state-client.ts
  • src/targets.ts
  • src/threading.ts
  • src/message-actions.ts
  • src/probe.ts
  • src/doctor.ts
  • src/*.test.ts

Model it after Slack, Discord, Teams, or Google Chat packaging, not as a one-off test helper.

Capabilities

MVP capabilities:

  • one account
  • DMs
  • channels
  • threads
  • send text
  • reply in thread
  • read
  • edit
  • delete
  • react
  • search
  • upload-file
  • download-file

Phase 2 capabilities:

  • polls
  • member-info
  • channel-info
  • channel-list
  • pin and unpin
  • permissions
  • topic create and edit

These map naturally onto the shared message tool action model already used by channel plugins.

Conversation model

Use a stable synthetic id grammar that is simple to parse yet covers DMs, channels, and threads.

Suggested ids:

  • DM conversation: dm:<user-id>
  • channel: chan:<space-id>
  • thread: thread:<space-id>:<thread-id>
  • message id: msg:<ulid>

Suggested target forms:

  • qa:dm:<user-id>
  • qa:chan:<space-id>
  • qa:thread:<space-id>:<thread-id>

The plugin should own translation between external target strings and canonical conversation ids.
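
The translation could look roughly like the sketch below. It assumes ids never contain a colon, and the ConversationRef type and both function names are hypothetical; only the qa:dm, qa:chan, and qa:thread forms come from the grammar above.

```typescript
// Canonical conversation reference; this shape is an assumption for illustration.
type ConversationRef =
  | { kind: "dm"; id: string; userId: string }
  | { kind: "channel"; id: string; spaceId: string }
  | { kind: "thread"; id: string; spaceId: string; threadId: string };

// Translate an external target such as "qa:dm:user-a" into a canonical id.
// Returns null for malformed targets so callers can surface a clear error.
// Assumes ids themselves never contain ":".
function parseQaTarget(target: string): ConversationRef | null {
  const [prefix, kind, a, b, ...rest] = target.split(":");
  if (prefix !== "qa" || rest.length > 0) return null;
  if (kind === "dm" && a && b === undefined) {
    return { kind: "dm", id: `dm:${a}`, userId: a };
  }
  if (kind === "chan" && a && b === undefined) {
    return { kind: "channel", id: `chan:${a}`, spaceId: a };
  }
  if (kind === "thread" && a && b) {
    return { kind: "thread", id: `thread:${a}:${b}`, spaceId: a, threadId: b };
  }
  return null;
}

// Inverse direction: canonical conversation id back to an external target.
function formatQaTarget(ref: ConversationRef): string {
  return `qa:${ref.id}`;
}
```

Returning null instead of throwing keeps the "malformed target" failure scenario easy to assert on.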

Pairing and security

Even though this is a QA channel, it should still implement real policy surfaces:

  • DM allowlist / pairing flow
  • group policy
  • mention gating where relevant
  • trusted sender ids

Reason:

  • these are product features and should be testable through the QA transport
  • the QA lane should be able to verify policy failures, not only happy paths

Threading model

Threading is one of the main reasons to build this channel.

Required semantics:

  • create thread from a top-level message
  • reply inside an existing thread
  • list thread messages
  • preserve parent message linkage
  • let OpenClaw thread binding attach a session to a thread

The QA bus must preserve:

  • conversation id
  • thread id
  • parent message id
  • sender id
  • timestamps

Channel-owned message actions

The plugin should implement actions.describeMessageTool(...) and actions.handleAction(...).

MVP action list:

  • send
  • read
  • reply
  • react
  • edit
  • delete
  • thread-create
  • thread-reply
  • search
  • upload-file
  • download-file

This is enough to test the shared message tool end to end with real channel semantics.
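
The sketch below shows one way the action list could be dispatched by name. It is deliberately standalone: the types and the dispatchAction helper are hypothetical and do not mirror the real signatures of actions.describeMessageTool(...) or actions.handleAction(...), which this note does not specify.

```typescript
// Hypothetical shapes; these do not mirror the real ChannelPlugin types.
const MVP_ACTIONS = [
  "send", "read", "reply", "react", "edit", "delete",
  "thread-create", "thread-reply", "search", "upload-file", "download-file",
] as const;

type QaAction = (typeof MVP_ACTIONS)[number];

type ActionResult =
  | { ok: true; action: QaAction }
  | { ok: false; error: string };

// Type guard: is this name one of the MVP actions listed above?
function isQaAction(name: string): name is QaAction {
  return (MVP_ACTIONS as readonly string[]).includes(name);
}

// Dispatch a requested action by name. Unknown or unimplemented names yield a
// structured error rather than a throw, so "unsupported action" stays testable.
function dispatchAction(
  name: string,
  handlers: Partial<Record<QaAction, () => void>>,
): ActionResult {
  if (!isQaAction(name)) return { ok: false, error: `unsupported action: ${name}` };
  const handler = handlers[name];
  if (!handler) return { ok: false, error: `action not implemented: ${name}` };
  handler();
  return { ok: true, action: name };
}
```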

qa-bus design

qa-bus is the transport simulator and assertion backend.

It should not know OpenClaw internals. It should know channel state.

For v1, keep qa-bus in this repo so:

  • fixtures and scenarios evolve with product code
  • the transport contract can change in lock-step with the plugin
  • CI and local dev do not need another repo checkout

Responsibilities

  • accept inbound user/platform events
  • persist canonical conversation state
  • persist append-only event log
  • expose inspection APIs
  • expose blocking wait APIs
  • support reset per scenario or per suite

Transport

HTTP is enough for MVP.

Suggested endpoints:

  • POST /reset
  • POST /inbound/message
  • POST /inbound/edit
  • POST /inbound/delete
  • POST /inbound/reaction
  • POST /inbound/thread/create
  • GET /state/conversations
  • GET /state/messages
  • GET /state/threads
  • GET /events
  • POST /wait

Optional WS stream:

  • /stream

Useful for live event taps and debugging.
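
A thin harness-side client over these endpoints might look like the following sketch. The endpoint paths are the suggested ones above; the request body fields and the names HttpFn, InboundMessage, and createBusClient are assumptions. The HTTP function is injected so the sketch does not depend on any particular HTTP library.

```typescript
// Injected transport: the real harness would wrap its HTTP client of choice.
type HttpFn = (method: "GET" | "POST", path: string, body?: unknown) => Promise<unknown>;

// Hypothetical request body for POST /inbound/message.
interface InboundMessage {
  to: string; // external target, e.g. "qa:dm:user-a"
  senderId: string;
  text: string;
}

// Map harness operations onto the suggested qa-bus endpoints.
function createBusClient(http: HttpFn) {
  return {
    reset: () => http("POST", "/reset"),
    injectMessage: (msg: InboundMessage) => http("POST", "/inbound/message", msg),
    injectReaction: (messageId: string, emoji: string, senderId: string) =>
      http("POST", "/inbound/reaction", { messageId, emoji, senderId }),
    listConversations: () => http("GET", "/state/conversations"),
    listMessages: () => http("GET", "/state/messages"),
    listEvents: () => http("GET", "/events"),
    wait: (request: { kind: string; match: unknown; timeoutMs: number }) =>
      http("POST", "/wait", request),
  };
}
```

Injecting the transport also lets scenario-runner unit tests use a recording fake instead of a live bus.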

State model

Persist three layers.

  1. Conversation snapshot
    • participants
    • type
    • thread topology
    • latest message pointers
  2. Message snapshot
    • sender
    • content
    • attachments
    • edit history
    • reactions
    • parent and thread linkage
  3. Append-only event log
    • canonical timestamp
    • causal ordering
    • source: inbound, outbound, action, system
    • payload

The append-only log matters because many QA assertions are event-oriented, not just state-oriented.
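
The three layers could be typed along these lines. This is a sketch under stated assumptions: every field name, the event type strings, and the createEventLog helper are hypothetical, and a monotonically increasing sequence number stands in for causal ordering within a single bus instance.

```typescript
type BusEventSource = "inbound" | "outbound" | "action" | "system";

interface ConversationSnapshot {
  id: string; // e.g. "dm:user-a" or "chan:general"
  type: "dm" | "channel";
  participants: string[];
  threadIds: string[];
  latestMessageId?: string;
}

interface MessageSnapshot {
  id: string;
  conversationId: string;
  threadId?: string;
  parentId?: string;
  senderId: string;
  content: string;
  editHistory: string[];
  reactions: Record<string, string[]>; // emoji -> sender ids
}

interface BusEvent {
  seq: number; // ordering within this bus instance
  timestamp: string; // ISO-8601, assigned by the bus
  source: BusEventSource;
  type: string; // e.g. "message.created" (name is an assumption)
  payload: unknown;
}

// In-memory append-only log; the clock is injectable for deterministic tests.
function createEventLog(now: () => Date = () => new Date()) {
  const events: BusEvent[] = [];
  return {
    append(source: BusEventSource, type: string, payload: unknown): BusEvent {
      const event: BusEvent = {
        seq: events.length + 1,
        timestamp: now().toISOString(),
        source,
        type,
        payload,
      };
      events.push(event);
      return event;
    },
    // Events with a sequence number greater than `seq`, as a new array.
    since: (seq: number): BusEvent[] => events.filter((e) => e.seq > seq),
    // Per-scenario reset: clears history and restarts sequence numbering.
    reset(): void {
      events.length = 0;
    },
  };
}
```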

Assertion API

The harness needs waiters, not just snapshots.

Suggested POST /wait contract:

  • kind
  • match
  • timeoutMs

Examples:

  • wait for outbound message matching text regex
  • wait for thread creation
  • wait for reaction added
  • wait for message edit
  • wait for no event of type X within Y ms

This gives stable tests without custom polling code in every scenario.
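
A minimal in-memory sketch of the waiter follows. The kind, match, and timeoutMs fields come from the suggested contract; the specific match fields (textRegex, conversationId), the event shape, and the createWaitHub helper are assumptions.

```typescript
interface WaitRequest {
  kind: string; // e.g. "outbound-message", "reaction-added" (names are assumptions)
  match: { textRegex?: string; conversationId?: string };
  timeoutMs: number;
}

interface ObservedEvent {
  kind: string;
  conversationId: string;
  text?: string;
}

// True when the event has the requested kind and satisfies every given matcher.
function matchesWait(req: WaitRequest, ev: ObservedEvent): boolean {
  if (ev.kind !== req.kind) return false;
  const { textRegex, conversationId } = req.match;
  if (conversationId !== undefined && ev.conversationId !== conversationId) return false;
  if (textRegex !== undefined && !new RegExp(textRegex).test(ev.text ?? "")) return false;
  return true;
}

function createWaitHub() {
  const seen: ObservedEvent[] = [];
  const listeners = new Set<(ev: ObservedEvent) => void>();
  return {
    // Number of events seen so far; pass it to wait() to ignore older events.
    cursor: () => seen.length,
    publish(ev: ObservedEvent): void {
      seen.push(ev);
      listeners.forEach((fn) => fn(ev));
    },
    // Resolves with the first matching event at or after `from`, or null on timeout.
    wait(req: WaitRequest, from = 0): Promise<ObservedEvent | null> {
      const past = seen.slice(from).find((ev) => matchesWait(req, ev));
      if (past) return Promise.resolve(past);
      return new Promise((resolve) => {
        const listener = (ev: ObservedEvent) => {
          if (!matchesWait(req, ev)) return;
          clearTimeout(timer);
          listeners.delete(listener);
          resolve(ev);
        };
        const timer = setTimeout(() => {
          listeners.delete(listener);
          resolve(null);
        }, req.timeoutMs);
        listeners.add(listener);
      });
    },
  };
}
```

Checking already-seen events first avoids a race where the reply arrives before the harness issues the wait. For the "no event of type X within Y ms" case, a caller would pass the current cursor as `from` and treat a null result as success.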

QA orchestrator design

The orchestrator should own scenario planning and artifact collection.

Start host-side; this is the chosen v1 direction. Later, OpenClaw can orchestrate parts of the run.

Why:

  • simpler to iterate while the transport and scenario protocol are still moving
  • easier access to the repo, logs, Docker, and test fixtures
  • easier artifact collection and report generation
  • avoids over-coupling the first version to subagent behavior before the QA protocol itself is stable

Inputs

  • docs pages
  • channel capability discovery
  • configured provider/model lane
  • scenario catalog
  • repo/test metadata

Outputs

  • structured protocol report
  • scenario transcript
  • captured channel state
  • gateway logs
  • failure packets

For v1, the primary output is a Markdown report.

Suggested report sections:

  • suite summary
  • environment
  • provider/model matrix
  • scenarios passed
  • scenarios failed
  • flaky or inconclusive scenarios
  • captured evidence links or inline excerpts
  • suspected ownership or file hints
  • follow-up recommendations
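
As a rough sketch, a report with a subset of these sections could be rendered from structured results as shown below. The SuiteReport shape and the function names are assumptions; only the section titles follow the list above.

```typescript
type ReportOutcome = "pass" | "fail" | "flaky" | "unknown";

interface ReportEntry {
  scenarioId: string;
  cell: string; // provider/model matrix cell or fixture name
  outcome: ReportOutcome;
  evidence?: string; // short inline excerpt or link
}

interface SuiteReport {
  suite: string;
  environment: Record<string, string>;
  entries: ReportEntry[];
}

// Render one titled section as Markdown lines, with "None." when it is empty.
function renderSection(title: string, entries: ReportEntry[]): string[] {
  const lines = [`## ${title}`, ""];
  if (entries.length === 0) lines.push("None.");
  for (const e of entries) {
    lines.push(`- ${e.scenarioId} (${e.cell})${e.evidence ? `: ${e.evidence}` : ""}`);
  }
  lines.push("");
  return lines;
}

// Assemble the full report: summary, environment, then one section per outcome group.
function renderMarkdownReport(report: SuiteReport): string {
  const pick = (wanted: ReportOutcome[]) =>
    report.entries.filter((e) => wanted.includes(e.outcome));
  const failed = pick(["fail"]);
  const lines = [
    `# QA report: ${report.suite}`,
    "",
    `Scenarios run: ${report.entries.length}; failed: ${failed.length}`,
    "",
    "## Environment",
    "",
    ...Object.keys(report.environment).map((k) => `- ${k}: ${report.environment[k]}`),
    "",
    ...renderSection("Scenarios passed", pick(["pass"])),
    ...renderSection("Scenarios failed", failed),
    ...renderSection("Flaky or inconclusive scenarios", pick(["flaky", "unknown"])),
  ];
  return lines.join("\n");
}
```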

Scenario format

Use a data-driven scenario spec.

Suggested shape:

{
  "id": "thread-memory-recall",
  "lane": "deterministic",
  "preconditions": ["qa-channel", "memory-enabled"],
  "steps": [
    {
      "type": "injectMessage",
      "to": "qa:dm:user-a",
      "text": "Remember that the deploy key is kiwi."
    },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } },
    { "type": "injectMessage", "to": "qa:dm:user-a", "text": "What was the deploy key?" },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } }
  ],
  "assertions": [{ "type": "outboundTextIncludes", "value": "kiwi" }]
}

Keep the execution engine generic and the scenario catalog declarative.
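
A generic engine for the two step types in the example could be sketched as follows. The ScenarioDriver interface, the optional per-step timeoutMs, and the 30-second default are assumptions added for illustration; the step and assertion names are taken from the example spec.

```typescript
type ScenarioStep =
  | { type: "injectMessage"; to: string; text: string }
  | { type: "waitForOutbound"; match: { textIncludes: string }; timeoutMs?: number };

interface Scenario {
  id: string;
  lane: "deterministic" | "quality";
  preconditions: string[];
  steps: ScenarioStep[];
  assertions: { type: "outboundTextIncludes"; value: string }[];
}

// Hypothetical transport seam; a real driver would talk to qa-bus.
interface ScenarioDriver {
  inject(to: string, text: string): Promise<void>;
  // Resolves with the matching outbound text, or null on timeout.
  waitForOutbound(textIncludes: string, timeoutMs: number): Promise<string | null>;
}

interface ScenarioRunResult {
  id: string;
  outcome: "pass" | "fail";
  failedStep?: number; // zero-based index of the step that timed out
}

// Run steps in order, stop at the first timed-out wait, then check assertions
// against every outbound text collected along the way.
async function runScenario(scenario: Scenario, driver: ScenarioDriver): Promise<ScenarioRunResult> {
  const outbound: string[] = [];
  let index = 0;
  for (const step of scenario.steps) {
    if (step.type === "injectMessage") {
      await driver.inject(step.to, step.text);
    } else {
      const text = await driver.waitForOutbound(step.match.textIncludes, step.timeoutMs ?? 30_000);
      if (text === null) return { id: scenario.id, outcome: "fail", failedStep: index };
      outbound.push(text);
    }
    index += 1;
  }
  const ok = scenario.assertions.every((a) => outbound.some((t) => t.includes(a.value)));
  return { id: scenario.id, outcome: ok ? "pass" : "fail" };
}
```

Because the engine only sees the driver interface, the same scenario file can run against the Phase 0 bootstrap path and, later, against qa-bus.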

Feature discovery

The orchestrator can discover candidate scenarios from three sources.

  1. Docs
    • channel docs
    • testing docs
    • gateway docs
    • subagents docs
    • cron docs
  2. Runtime capability discovery
    • channel message action discovery
    • plugin status and channel capabilities
    • configured providers/models
  3. Code hints
    • known action names
    • channel-specific feature flags
    • config schema

This should produce a proposed protocol with:

  • must-test
  • can-test
  • blocked
  • unsupported
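
One possible bucketing rule is sketched below. The four bucket names come from the list above, but the inputs and the decision order are assumptions; the note does not define how candidates are classified.

```typescript
type ProtocolBucket = "must-test" | "can-test" | "blocked" | "unsupported";

// Hypothetical merged view of one candidate feature across the three sources.
interface FeatureCandidate {
  feature: string;
  documented: boolean; // mentioned in docs
  advertised: boolean; // reported by runtime capability discovery
  prerequisitesMet: boolean; // e.g. required provider or config is present
}

// Assumed rule: runtime discovery decides support, prerequisites decide
// whether a supported feature can run now, and docs decide priority.
function classifyCandidate(c: FeatureCandidate): ProtocolBucket {
  if (!c.advertised) return "unsupported";
  if (!c.prerequisitesMet) return "blocked";
  return c.documented ? "must-test" : "can-test";
}

// Group candidate feature names by bucket to form the proposed protocol.
function proposeProtocol(candidates: FeatureCandidate[]): Record<ProtocolBucket, string[]> {
  const protocol: Record<ProtocolBucket, string[]> = {
    "must-test": [],
    "can-test": [],
    blocked: [],
    unsupported: [],
  };
  for (const c of candidates) protocol[classifyCandidate(c)].push(c.feature);
  return protocol;
}
```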

Scenario classes

Recommended catalog:

  • transport basics
    • DM send and reply
    • channel send
    • thread create and reply
    • reaction add and read
    • edit and delete
  • policy
    • allowlist
    • pairing
    • group mention gating
  • shared message tool
    • read
    • search
    • reply
    • react
    • upload and download
  • agent quality
    • follows channel context
    • obeys thread semantics
    • uses memory across turns
    • switches model when instructed
  • automation
    • cron add and run
    • cron delivery into channel
    • scheduled reminders
  • subagents
    • spawn
    • announce
    • threaded follow-up
    • nested orchestration when enabled
  • failure handling
    • unsupported action
    • timeout
    • malformed target
    • policy denial

OpenClaw as orchestrator

Longer-term, OpenClaw itself can coordinate the QA run.

Suggested architecture:

  • one controller session
  • N worker subagents
  • each worker owns one scenario or scenario shard
  • workers report structured results back to controller

Good fits for existing OpenClaw primitives:

  • sessions_spawn
  • subagents
  • cron-based wakeups for long-running suites
  • thread-bound sessions for scenario-local follow-up

Best near-term use:

  • controller generates the plan
  • workers execute scenarios in parallel
  • controller synthesizes report
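
The fan-out step can be as simple as round-robin sharding, sketched here. The shardScenarios helper is hypothetical and says nothing about how sessions_spawn or subagents are actually invoked.

```typescript
// Distribute scenario ids across a fixed number of workers, round-robin,
// so each worker owns one scenario shard.
function shardScenarios(scenarioIds: string[], workerCount: number): string[][] {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError("workerCount must be a positive integer");
  }
  const shards: string[][] = Array.from({ length: workerCount }, () => []);
  scenarioIds.forEach((id, i) => {
    const shard = shards[i % workerCount];
    if (shard) shard.push(id);
  });
  return shards;
}
```

Round-robin keeps shard sizes within one scenario of each other; a later version might instead balance by expected scenario duration.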

Avoid making the controller also own host Git operations in the first version.

Chosen direction:

  • v1: host-side controller
  • v2+: OpenClaw-native orchestration once the scenario protocol and transport model are stable

Auto-fix workflow

The system should emit a structured bug packet when a scenario fails.

Suggested bug packet:

  • scenario id
  • lane
  • failure kind
  • minimal repro steps
  • channel event transcript
  • gateway transcript
  • logs
  • suspected files
  • confidence
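
A possible typed form of the packet, with a validator the host-side worker could run before acting on it, is sketched below. The field names, the numeric 0 to 1 confidence scale, and validateBugPacket are assumptions.

```typescript
interface BugPacket {
  scenarioId: string;
  lane: string;
  failureKind: string; // e.g. "timeout", "assertion" (values are assumptions)
  reproSteps: string[];
  channelEvents: unknown[];
  gatewayTranscript: string;
  logs: string[];
  suspectedFiles: string[];
  confidence: number; // assumed scale: 0 (guess) to 1 (certain)
}

// Return a list of problems; an empty list means the packet is usable.
function validateBugPacket(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["packet must be an object"];
  const packet = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["scenarioId", "lane", "failureKind"]) {
    const field = packet[key];
    if (typeof field !== "string" || field === "") {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  const steps = packet["reproSteps"];
  if (!Array.isArray(steps) || steps.length === 0) {
    problems.push("reproSteps must list at least one step");
  }
  const confidence = packet["confidence"];
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    problems.push("confidence must be a number in [0, 1]");
  }
  return problems;
}
```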

Host-side fix worker flow:

  1. receive bug packet
  2. create detached worktree
  3. launch coding agent in worktree
  4. write failing regression first when practical
  5. implement fix
  6. run scoped verification
  7. open PR

This should remain host-side at first because it needs:

  • repo write access
  • worktree hygiene
  • git credentials
  • GitHub auth

Chosen direction:

  • do not auto-open PRs in v1
  • emit Markdown reports and structured failure packets first
  • add host-side worktree + PR automation later

Rollout plan

Phase 0: bootstrap on existing synthetic ingress

Build a first QA runner without a new channel:

  • use chat.send with admin-scoped synthetic originating-route fields
  • run deterministic scenarios against routing, memory, cron, subagents, and ACP
  • validate protocol format and artifact collection

Exit criteria:

  • scenario runner exists
  • structured protocol report exists
  • failure artifacts exist

Phase 1: MVP qa-channel

Build the plugin and bus with:

  • DM
  • channels
  • threads
  • read
  • reply
  • react
  • edit
  • delete
  • search

Target semantics:

  • Slack-class transport behavior
  • not full Teams-class parity yet

Exit criteria:

  • OpenClaw in Docker can talk to qa-bus
  • harness can inject + inspect
  • one green end-to-end suite across message transport and agent behavior

Phase 2: protocol expansion

Add:

  • attachments
  • polls
  • pins
  • richer policy tests
  • quality lane with real provider/model matrix

Exit criteria:

  • scenario matrix covers major built-in features
  • deterministic and quality lanes are separated

Phase 3: subagent-driven QA

Add:

  • controller agent
  • worker subagents
  • scenario discovery from docs + capability discovery
  • parallel execution

Exit criteria:

  • one controller can fan out and synthesize a suite report

Phase 4: auto-fix loop

Add:

  • bug packet emission
  • host-side worktree runner
  • PR creation

Exit criteria:

  • selected failures can auto-produce draft PRs

Risks

Risk: too much magic in one layer

If the QA channel, bus, and orchestrator all become smart at once, debugging will be painful.

Mitigation:

  • keep qa-channel transport-focused
  • keep qa-bus state-focused
  • keep orchestrator separate

Risk: flaky assertions from model variance

Mitigation:

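  • run protocol checks in the deterministic lane with exact assertions
  • confine real-model runs to the quality lane
  • apply different pass criteria per lane: exact match for protocol, rubric-based for quality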

Risk: test-only branches leaking into core

Mitigation:

  • no core special cases for qa-channel
  • use normal plugin seams
  • use admin synthetic ingress only as bootstrap

Risk: auto-fix overreach

Mitigation:

  • keep fix worker host-side
  • require explicit policy for when PRs can open automatically
  • gate with scoped tests

Risk: building a fake platform nobody uses

Mitigation:

  • emulate Slack/Discord/Teams semantics, not an abstract transport
  • prioritize features that stress shared OpenClaw boundaries

MVP recommendation

If building this now, start with this exact order.

  1. Host-side scenario runner using existing synthetic originating-route support.
  2. qa-bus sidecar with state, events, reset, and wait APIs.
  3. extensions/qa-channel MVP with DMs, channels, threads, reply, read, react, edit, delete, and search.
  4. Markdown report generator for suite + matrix output.
  5. One deterministic end-to-end suite:
    • inject inbound DM
    • verify reply
    • create thread
    • verify follow-up in thread
    • verify memory recall on later turn
  6. Add curated real-model matrix quality lane.
  7. Add controller subagent orchestration.
  8. Add host-side auto-fix worktree runner.

This order gets real value quickly without requiring the full grand design to land before the first useful signal appears.

Current product decisions

  • qa-bus lives inside this repo
  • the first controller is host-side
  • Slack-class behavior is the MVP target
  • the quality lane uses a curated matrix
  • first version produces Markdown reports, not PRs
  • OpenClaw-native orchestration is a later phase, not a v1 requirement