20 KiB
title, summary, read_when
| title | summary | read_when | |||
|---|---|---|---|---|---|
| QA E2E Automation | Design note for a full end-to-end QA system built on a synthetic message-channel plugin, Dockerized OpenClaw, and subagent-driven scenario execution |
|
QA E2E Automation
This note proposes a true end-to-end QA system for OpenClaw built around a real channel plugin dedicated to testing.
The core idea:
- run OpenClaw inside Docker in a realistic gateway configuration
- expose a synthetic but full-featured message channel as a normal plugin
- let a QA harness inject inbound traffic and inspect outbound state
- let OpenClaw agents and subagents explore, verify, and report on behavior
- optionally escalate failing scenarios into host-side fix workflows that open PRs
This is not a unit-test replacement. It is a product-level system test layer.
Chosen direction
The initial direction for this project is:
- build the full system inside this repo
- test against a matrix, not a single model/provider pair
- use Markdown reports as the first output artifact
- defer auto-PR and auto-fix work until later
- treat Slack-class semantics as the MVP transport target
- keep orchestration simple in v1, with a host-side controller that exercises the moving parts directly
- evolve toward OpenClaw becoming the orchestration layer later, once the transport, scenario, and reporting model are proven
Goals
- Test OpenClaw through a real messaging-channel boundary, not only
chat.sendor embedded mocks. - Verify channel semantics that matter for real use:
- DMs
- channels/groups
- threads
- edits
- deletes
- reactions
- polls
- attachments
- Verify agent behavior across realistic user flows:
- memory
- thread binding
- model switching
- cron jobs
- subagents
- approvals
- routing
- channel-specific
messageactions
- Make the QA runner capable of feature discovery:
- read docs
- inspect plugin capability discovery
- inspect code and config
- generate a scenario protocol
- Support deterministic protocol tests and best-effort real-model tests as separate lanes.
- Allow automated bug triage artifacts that can feed a host-side fix worker.
Non-goals
- Not a replacement for existing unit, contract, or live tests.
- Not a production channel.
- Not a requirement that all bug fixing happen from inside the Dockerized OpenClaw runtime.
- Not a reason to add test-only core branches for one channel.
Why a channel plugin
OpenClaw already has the right boundary:
- core owns the shared
messagetool, prompt wiring, outer session bookkeeping, and dispatch - channel plugins own:
- config
- pairing
- security
- session grammar
- threading
- outbound delivery
- channel-owned actions and capability discovery
That means the cleanest design is:
- a real channel plugin for QA transport semantics
- a separate QA control plane for injection and inspection
This keeps the test transport inside the same architecture used by Slack, Discord, Teams, and similar channels.
System overview
The system has six pieces.
qa-channelplugin
- Bundled extension under
extensions/qa-channel - Normal
ChannelPlugin - Behaves like a Slack/Discord/Teams-class channel
- Registers channel-owned message actions through the shared
messagetool
qa-bussidecar
- Small HTTP and/or WS service
- Canonical state store for synthetic conversations, messages, threads, reactions, edits, and event history
- Accepts inbound events from the harness
- Exposes inspection and wait APIs for assertions
- Dockerized OpenClaw gateway
- Runs as close to real deployment as practical
- Loads
qa-channel - Uses normal config, routing, session, cron, and plugin loading
- QA orchestrator
- Host-side runner or dedicated OpenClaw-driven controller
- Provisions scenario environments
- Seeds config
- Resets state
- Executes test matrix
- Collects structured outcomes
- Auto-fix worker
- Host-side workflow
- Creates a worktree
- launches a coding agent
- runs scoped verification
- opens a PR
The auto-fix worker should start outside the container. It needs direct repo and GitHub access, clean worktree control, and better isolation from the runtime under test.
qa-labextension
- Bundled extension under
extensions/qa-lab - Owns the QA harness, Markdown report flow, and private debugger UI
- Registers hidden CLI entrypoints such as
openclaw qa runandopenclaw qa ui - Stays separate from the shipped Control UI bundle
High-level flow
- Start
qa-bus. - Start OpenClaw in Docker with
qa-channelenabled. - QA orchestrator injects inbound messages into
qa-bus. qa-channelreceives them as normal inbound traffic.- OpenClaw runs the agent loop normally.
- Outbound replies and channel actions flow back through
qa-channelintoqa-bus. - QA orchestrator inspects state or waits on events.
- Orchestrator records pass/fail/flaky/unknown plus artifacts.
- Severe failures optionally emit a bug packet for the host-side fix worker.
Lanes
The system should have two distinct lanes.
Lane A: deterministic protocol lane
Use a deterministic or tightly controlled model setup.
Preferred options:
- a canned provider fixture
- the bundled
syntheticprovider when useful - fixed prompts with exact assertions
Purpose:
- verify transport and product semantics
- keep flakiness low
- catch regressions in routing, memory plumbing, thread binding, cron, and tool invocation
Lane B: quality lane
Use real providers and real models in a matrix.
Purpose:
- verify that the agent can still do good work end to end
- evaluate feature discoverability and instruction following
- surface model-specific breakage or degraded behavior
Expected result type:
- best-effort
- rubric-based
- more tolerant of wording variation
Matrix guidance for v1:
- start with a small curated matrix, not "everything configured"
- keep deterministic protocol runs separate from quality runs
- report matrix cells independently so one provider/model failure does not hide transport correctness
Do not mix these lanes. Protocol correctness and model quality should fail independently.
Use existing bootstrap seam first
Before the custom channel exists, OpenClaw already has a useful bootstrap path:
- admin-scoped synthetic originating-route fields on
chat.send - synthetic message-channel headers for HTTP flows
That is enough to build a first QA controller for:
- thread/session routing
- ACP bind flows
- subagent delivery
- cron wake paths
- memory persistence checks
This should be Phase 0 because it de-risks the scenario protocol before the full channel lands.
qa-lab extension design
qa-lab is the private operator-facing half of this system.
Suggested package:
extensions/qa-lab/
Suggested responsibilities:
- host the synthetic bus state machine
- host the scenario runner
- write Markdown reports
- serve a private debugger UI on a separate local server
- keep that UI entirely outside the shipped Control UI bundle
Suggested UI shape:
- left rail for conversations and threads
- center transcript pane
- right rail for event stream and report inspection
- bottom inject-composer for inbound QA traffic
qa-channel plugin design
Package layout
Suggested package:
extensions/qa-channel/
Suggested file layout:
package.jsonopenclaw.plugin.jsonindex.tssetup-entry.tsapi.tsruntime-api.tssrc/channel.tssrc/channel-api.tssrc/config-schema.tssrc/setup-core.tssrc/setup-surface.tssrc/runtime.tssrc/channel.runtime.tssrc/inbound.tssrc/outbound.tssrc/state-client.tssrc/targets.tssrc/threading.tssrc/message-actions.tssrc/probe.tssrc/doctor.tssrc/*.test.ts
Model it after Slack, Discord, Teams, or Google Chat packaging, not as a one-off test helper.
Capabilities
MVP capabilities:
- one account
- DMs
- channels
- threads
- send text
- reply in thread
- read
- edit
- delete
- react
- search
- upload-file
- download-file
Phase 2 capabilities:
- polls
- member-info
- channel-info
- channel-list
- pin and unpin
- permissions
- topic create and edit
These map naturally onto the shared message tool action model already used by
channel plugins.
Conversation model
Use a stable synthetic grammar that supports both simplicity and realistic coverage.
Suggested ids:
- DM conversation:
dm:<user-id> - channel:
chan:<space-id> - thread:
thread:<space-id>:<thread-id> - message id:
msg:<ulid>
Suggested target forms:
qa:dm:<user-id>qa:chan:<space-id>qa:thread:<space-id>:<thread-id>
The plugin should own translation between external target strings and canonical conversation ids.
Pairing and security
Even though this is a QA channel, it should still implement real policy surfaces:
- DM allowlist / pairing flow
- group policy
- mention gating where relevant
- trusted sender ids
Reason:
- these are product features and should be testable through the QA transport
- the QA lane should be able to verify policy failures, not only happy paths
Threading model
Threading is one of the main reasons to build this channel.
Required semantics:
- create thread from a top-level message
- reply inside an existing thread
- list thread messages
- preserve parent message linkage
- let OpenClaw thread binding attach a session to a thread
The QA bus must preserve:
- conversation id
- thread id
- parent message id
- sender id
- timestamps
Channel-owned message actions
The plugin should implement actions.describeMessageTool(...) and
actions.handleAction(...).
MVP action list:
sendreadreplyreacteditdeletethread-createthread-replysearchupload-filedownload-file
This is enough to test the shared message tool end to end with real channel
semantics.
qa-bus design
qa-bus is the transport simulator and assertion backend.
It should not know OpenClaw internals. It should know channel state.
For v1, keep qa-bus in this repo so:
- fixtures and scenarios evolve with product code
- the transport contract can change in lock-step with the plugin
- CI and local dev do not need another repo checkout
Responsibilities
- accept inbound user/platform events
- persist canonical conversation state
- persist append-only event log
- expose inspection APIs
- expose blocking wait APIs
- support reset per scenario or per suite
Transport
HTTP is enough for MVP.
Suggested endpoints:
POST /resetPOST /inbound/messagePOST /inbound/editPOST /inbound/deletePOST /inbound/reactionPOST /inbound/thread/createGET /state/conversationsGET /state/messagesGET /state/threadsGET /eventsPOST /wait
Optional WS stream:
/stream
Useful for live event taps and debugging.
State model
Persist three layers.
- Conversation snapshot
- participants
- type
- thread topology
- latest message pointers
- Message snapshot
- sender
- content
- attachments
- edit history
- reactions
- parent and thread linkage
- Append-only event log
- canonical timestamp
- causal ordering
- source: inbound, outbound, action, system
- payload
The append-only log matters because many QA assertions are event-oriented, not just state-oriented.
Assertion API
The harness needs waiters, not just snapshots.
Suggested POST /wait contract:
kindmatchtimeoutMs
Examples:
- wait for outbound message matching text regex
- wait for thread creation
- wait for reaction added
- wait for message edit
- wait for no event of type X within Y ms
This gives stable tests without custom polling code in every scenario.
QA orchestrator design
The orchestrator should own scenario planning and artifact collection.
Start host-side. Later, OpenClaw can orchestrate parts of it.
This is the chosen v1 direction.
Why:
- simpler to iterate while the transport and scenario protocol are still moving
- easier access to the repo, logs, Docker, and test fixtures
- easier artifact collection and report generation
- avoids over-coupling the first version to subagent behavior before the QA protocol itself is stable
Inputs
- docs pages
- channel capability discovery
- configured provider/model lane
- scenario catalog
- repo/test metadata
Outputs
- structured protocol report
- scenario transcript
- captured channel state
- gateway logs
- failure packets
For v1, the primary output is a Markdown report.
Suggested report sections:
- suite summary
- environment
- provider/model matrix
- scenarios passed
- scenarios failed
- flaky or inconclusive scenarios
- captured evidence links or inline excerpts
- suspected ownership or file hints
- follow-up recommendations
Scenario format
Use a data-driven scenario spec.
Suggested shape:
{
"id": "thread-memory-recall",
"lane": "deterministic",
"preconditions": ["qa-channel", "memory-enabled"],
"steps": [
{
"type": "injectMessage",
"to": "qa:dm:user-a",
"text": "Remember that the deploy key is kiwi."
},
{ "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } },
{ "type": "injectMessage", "to": "qa:dm:user-a", "text": "What was the deploy key?" },
{ "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } }
],
"assertions": [{ "type": "outboundTextIncludes", "value": "kiwi" }]
}
Keep the execution engine generic and the scenario catalog declarative.
Feature discovery
The orchestrator can discover candidate scenarios from three sources.
- Docs
- channel docs
- testing docs
- gateway docs
- subagents docs
- cron docs
- Runtime capability discovery
- channel
messageaction discovery - plugin status and channel capabilities
- configured providers/models
- Code hints
- known action names
- channel-specific feature flags
- config schema
This should produce a proposed protocol with:
- must-test
- can-test
- blocked
- unsupported
Scenario classes
Recommended catalog:
- transport basics
- DM send and reply
- channel send
- thread create and reply
- reaction add and read
- edit and delete
- policy
- allowlist
- pairing
- group mention gating
- shared
messagetool- read
- search
- reply
- react
- upload and download
- agent quality
- follows channel context
- obeys thread semantics
- uses memory across turns
- switches model when instructed
- automation
- cron add and run
- cron delivery into channel
- scheduled reminders
- subagents
- spawn
- announce
- threaded follow-up
- nested orchestration when enabled
- failure handling
- unsupported action
- timeout
- malformed target
- policy denial
OpenClaw as orchestrator
Longer-term, OpenClaw itself can coordinate the QA run.
Suggested architecture:
- one controller session
- N worker subagents
- each worker owns one scenario or scenario shard
- workers report structured results back to controller
Good fits for existing OpenClaw primitives:
sessions_spawnsubagents- cron-based wakeups for long-running suites
- thread-bound sessions for scenario-local follow-up
Best near-term use:
- controller generates the plan
- workers execute scenarios in parallel
- controller synthesizes report
Avoid making the controller also own host Git operations in the first version.
Chosen direction:
- v1: host-side controller
- v2+: OpenClaw-native orchestration once the scenario protocol and transport model are stable
Auto-fix workflow
The system should emit a structured bug packet when a scenario fails.
Suggested bug packet:
- scenario id
- lane
- failure kind
- minimal repro steps
- channel event transcript
- gateway transcript
- logs
- suspected files
- confidence
Host-side fix worker flow:
- receive bug packet
- create detached worktree
- launch coding agent in worktree
- write failing regression first when practical
- implement fix
- run scoped verification
- open PR
This should remain host-side at first because it needs:
- repo write access
- worktree hygiene
- git credentials
- GitHub auth
Chosen direction:
- do not auto-open PRs in v1
- emit Markdown reports and structured failure packets first
- add host-side worktree + PR automation later
Rollout plan
Phase 0: bootstrap on existing synthetic ingress
Build a first QA runner without a new channel:
- use
chat.sendwith admin-scoped synthetic originating-route fields - run deterministic scenarios against routing, memory, cron, subagents, and ACP
- validate protocol format and artifact collection
Exit criteria:
- scenario runner exists
- structured protocol report exists
- failure artifacts exist
Phase 1: MVP qa-channel
Build the plugin and bus with:
- DM
- channels
- threads
- read
- reply
- react
- edit
- delete
- search
Target semantics:
- Slack-class transport behavior
- not full Teams-class parity yet
Exit criteria:
- OpenClaw in Docker can talk to
qa-bus - harness can inject + inspect
- one green end-to-end suite across message transport and agent behavior
Phase 2: protocol expansion
Add:
- attachments
- polls
- pins
- richer policy tests
- quality lane with real provider/model matrix
Exit criteria:
- scenario matrix covers major built-in features
- deterministic and quality lanes are separated
Phase 3: subagent-driven QA
Add:
- controller agent
- worker subagents
- scenario discovery from docs + capability discovery
- parallel execution
Exit criteria:
- one controller can fan out and synthesize a suite report
Phase 4: auto-fix loop
Add:
- bug packet emission
- host-side worktree runner
- PR creation
Exit criteria:
- selected failures can auto-produce draft PRs
Risks
Risk: too much magic in one layer
If the QA channel, bus, and orchestrator all become smart at once, debugging will be painful.
Mitigation:
- keep
qa-channeltransport-focused - keep
qa-busstate-focused - keep orchestrator separate
Risk: flaky assertions from model variance
Mitigation:
- deterministic lane
- quality lane
- different pass criteria
Risk: test-only branches leaking into core
Mitigation:
- no core special cases for
qa-channel - use normal plugin seams
- use admin synthetic ingress only as bootstrap
Risk: auto-fix overreach
Mitigation:
- keep fix worker host-side
- require explicit policy for when PRs can open automatically
- gate with scoped tests
Risk: building a fake platform nobody uses
Mitigation:
- emulate Slack/Discord/Teams semantics, not an abstract transport
- prioritize features that stress shared OpenClaw boundaries
MVP recommendation
If building this now, start with this exact order.
- Host-side scenario runner using existing synthetic originating-route support.
qa-bussidecar with state, events, reset, and wait APIs.extensions/qa-channelMVP with DMs, channels, threads, reply, read, react, edit, delete, and search.- Markdown report generator for suite + matrix output.
- One deterministic end-to-end suite:
- inject inbound DM
- verify reply
- create thread
- verify follow-up in thread
- verify memory recall on later turn
- Add curated real-model matrix quality lane.
- Add controller subagent orchestration.
- Add host-side auto-fix worktree runner.
This order gets real value quickly without requiring the full grand design to land before the first useful signal appears.
Current product decisions
qa-buslives inside this repo- the first controller is host-side
- Slack-class behavior is the MVP target
- the quality lane uses a curated matrix
- first version produces Markdown reports, not PRs
- OpenClaw-native orchestration is a later phase, not a v1 requirement