openclaw/docs/concepts/qa-e2e-automation.md
2026-04-05 23:21:56 +01:00

---
title: "QA E2E Automation"
summary: "Design note for a full end-to-end QA system built on a synthetic message-channel plugin, Dockerized OpenClaw, and subagent-driven scenario execution"
read_when:
- You are designing a true end-to-end QA harness for OpenClaw
- You want a synthetic message channel for automated feature verification
- You want subagents to discover features, run scenarios, and propose fixes
---
# QA E2E Automation
This note proposes a true end-to-end QA system for OpenClaw built around a
real channel plugin dedicated to testing.
The core idea:
- run OpenClaw inside Docker in a realistic gateway configuration
- expose a synthetic but full-featured message channel as a normal plugin
- let a QA harness inject inbound traffic and inspect outbound state
- let OpenClaw agents and subagents explore, verify, and report on behavior
- optionally escalate failing scenarios into host-side fix workflows that open PRs
This is not a unit-test replacement. It is a product-level system test layer.
## Chosen direction
The initial direction for this project is:
- build the full system inside this repo
- test against a matrix, not a single model/provider pair
- use Markdown reports as the first output artifact
- defer auto-PR and auto-fix work until later
- treat Slack-class semantics as the MVP transport target
- keep orchestration simple in v1, with a host-side controller that exercises
the moving parts directly
- evolve toward OpenClaw becoming the orchestration layer later, once the
transport, scenario, and reporting model are proven
## Goals
- Test OpenClaw through a real messaging-channel boundary, not only `chat.send`
  or embedded mocks.
- Verify channel semantics that matter for real use:
  - DMs
  - channels/groups
  - threads
  - edits
  - deletes
  - reactions
  - polls
  - attachments
- Verify agent behavior across realistic user flows:
  - memory
  - thread binding
  - model switching
  - cron jobs
  - subagents
  - approvals
  - routing
  - channel-specific `message` actions
- Make the QA runner capable of feature discovery:
  - read docs
  - inspect plugin capability discovery
  - inspect code and config
  - generate a scenario protocol
- Support deterministic protocol tests and best-effort real-model tests as
  separate lanes.
- Allow automated bug triage artifacts that can feed a host-side fix worker.
## Non-goals
- Not a replacement for existing unit, contract, or live tests.
- Not a production channel.
- Not a requirement that all bug fixing happen from inside the Dockerized
OpenClaw runtime.
- Not a reason to add test-only core branches for one channel.
## Why a channel plugin
OpenClaw already has the right boundary:
- core owns the shared `message` tool, prompt wiring, outer session
  bookkeeping, and dispatch
- channel plugins own:
  - config
  - pairing
  - security
  - session grammar
  - threading
  - outbound delivery
  - channel-owned actions and capability discovery
That means the cleanest design is:
- a real channel plugin for QA transport semantics
- a separate QA control plane for injection and inspection
This keeps the test transport inside the same architecture used by Slack,
Discord, Teams, and similar channels.
## System overview
The system has six pieces.
1. `qa-channel` plugin
   - Bundled extension under `extensions/qa-channel`
   - Normal `ChannelPlugin`
   - Behaves like a Slack/Discord/Teams-class channel
   - Registers channel-owned message actions through the shared `message` tool
2. `qa-bus` sidecar
   - Small HTTP and/or WS service
   - Canonical state store for synthetic conversations, messages, threads,
     reactions, edits, and event history
   - Accepts inbound events from the harness
   - Exposes inspection and wait APIs for assertions
3. Dockerized OpenClaw gateway
   - Runs as close to real deployment as practical
   - Loads `qa-channel`
   - Uses normal config, routing, session, cron, and plugin loading
4. QA orchestrator
   - Host-side runner or dedicated OpenClaw-driven controller
   - Provisions scenario environments
   - Seeds config
   - Resets state
   - Executes test matrix
   - Collects structured outcomes
5. Auto-fix worker
   - Host-side workflow
   - Creates a worktree
   - Launches a coding agent
   - Runs scoped verification
   - Opens a PR

   The auto-fix worker should start outside the container. It needs direct repo
   and GitHub access, clean worktree control, and better isolation from the
   runtime under test.
6. `qa-lab` extension
   - Bundled extension under `extensions/qa-lab`
   - Owns the QA harness, Markdown report flow, and private debugger UI
   - Registers hidden CLI entrypoints such as `openclaw qa run` and
     `openclaw qa ui`
   - Stays separate from the shipped Control UI bundle
## High-level flow
1. Start `qa-bus`.
2. Start OpenClaw in Docker with `qa-channel` enabled.
3. QA orchestrator injects inbound messages into `qa-bus`.
4. `qa-channel` receives them as normal inbound traffic.
5. OpenClaw runs the agent loop normally.
6. Outbound replies and channel actions flow back through `qa-channel` into
`qa-bus`.
7. QA orchestrator inspects state or waits on events.
8. Orchestrator records pass/fail/flaky/unknown plus artifacts.
9. Severe failures optionally emit a bug packet for the host-side fix worker.
## Lanes
The system should have two distinct lanes.
### Lane A: deterministic protocol lane
Use a deterministic or tightly controlled model setup.
Preferred options:
- a canned provider fixture
- the bundled `synthetic` provider when useful
- fixed prompts with exact assertions
Purpose:
- verify transport and product semantics
- keep flakiness low
- catch regressions in routing, memory plumbing, thread binding, cron, and tool
invocation
### Lane B: quality lane
Use real providers and real models in a matrix.
Purpose:
- verify that the agent can still do good work end to end
- evaluate feature discoverability and instruction following
- surface model-specific breakage or degraded behavior
Expected result type:
- best-effort
- rubric-based
- more tolerant of wording variation
Matrix guidance for v1:
- start with a small curated matrix, not "everything configured"
- keep deterministic protocol runs separate from quality runs
- report matrix cells independently so one provider/model failure does not hide
transport correctness
Do not mix these lanes. Protocol correctness and model quality should fail
independently.
## Use existing bootstrap seam first
Before the custom channel exists, OpenClaw already has a useful bootstrap path:
- admin-scoped synthetic originating-route fields on `chat.send`
- synthetic message-channel headers for HTTP flows
That is enough to build a first QA controller for:
- thread/session routing
- ACP bind flows
- subagent delivery
- cron wake paths
- memory persistence checks
This should be Phase 0 because it de-risks the scenario protocol before the
full channel lands.
## `qa-lab` extension design
`qa-lab` is the private operator-facing half of this system.
Suggested package:
- `extensions/qa-lab/`
Suggested responsibilities:
- host the synthetic bus state machine
- host the scenario runner
- write Markdown reports
- serve a private debugger UI on a separate local server
- keep that UI entirely outside the shipped Control UI bundle
Suggested UI shape:
- left rail for conversations and threads
- center transcript pane
- right rail for event stream and report inspection
- bottom inject-composer for inbound QA traffic
## `qa-channel` plugin design
### Package layout
Suggested package:
- `extensions/qa-channel/`
Suggested file layout:
- `package.json`
- `openclaw.plugin.json`
- `index.ts`
- `setup-entry.ts`
- `api.ts`
- `runtime-api.ts`
- `src/channel.ts`
- `src/channel-api.ts`
- `src/config-schema.ts`
- `src/setup-core.ts`
- `src/setup-surface.ts`
- `src/runtime.ts`
- `src/channel.runtime.ts`
- `src/inbound.ts`
- `src/outbound.ts`
- `src/state-client.ts`
- `src/targets.ts`
- `src/threading.ts`
- `src/message-actions.ts`
- `src/probe.ts`
- `src/doctor.ts`
- `src/*.test.ts`
Model it after Slack, Discord, Teams, or Google Chat packaging, not as a one-off
test helper.
### Capabilities
MVP capabilities:
- one account
- DMs
- channels
- threads
- send text
- reply in thread
- read
- edit
- delete
- react
- search
- upload-file
- download-file
Phase 2 capabilities:
- polls
- member-info
- channel-info
- channel-list
- pin and unpin
- permissions
- topic create and edit
These map naturally onto the shared `message` tool action model already used by
channel plugins.
### Conversation model
Use a stable synthetic grammar that supports both simplicity and realistic
coverage.
Suggested ids:
- DM conversation: `dm:<user-id>`
- channel: `chan:<space-id>`
- thread: `thread:<space-id>:<thread-id>`
- message id: `msg:<ulid>`
Suggested target forms:
- `qa:dm:<user-id>`
- `qa:chan:<space-id>`
- `qa:thread:<space-id>:<thread-id>`
The plugin should own translation between external target strings and canonical
conversation ids.
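The translation between external targets and canonical ids could look like the following minimal sketch. The names `ConversationRef`, `parseQaTarget`, and `toConversationId` are hypothetical, and the sketch assumes ids never contain `:`; only the two grammars come from this note.

```typescript
// Hypothetical helper types; only the id and target grammars come from the note.
type ConversationRef =
  | { kind: "dm"; userId: string }
  | { kind: "chan"; spaceId: string }
  | { kind: "thread"; spaceId: string; threadId: string };

// Parse an external target such as "qa:thread:space-1:t-9".
// Returns null for anything outside the suggested grammar.
function parseQaTarget(target: string): ConversationRef | null {
  const parts = target.split(":");
  if (parts[0] !== "qa" || parts.some((part) => part.length === 0)) return null;
  const [, kind, first, second] = parts;
  if (kind === "dm" && parts.length === 3) return { kind: "dm", userId: first };
  if (kind === "chan" && parts.length === 3) return { kind: "chan", spaceId: first };
  if (kind === "thread" && parts.length === 4) {
    return { kind: "thread", spaceId: first, threadId: second };
  }
  return null;
}

// Canonical conversation id: the target without its "qa:" prefix.
function toConversationId(ref: ConversationRef): string {
  switch (ref.kind) {
    case "dm":
      return `dm:${ref.userId}`;
    case "chan":
      return `chan:${ref.spaceId}`;
    case "thread":
      return `thread:${ref.spaceId}:${ref.threadId}`;
  }
}
```

Returning `null` instead of throwing keeps the "malformed target" failure scenario easy to assert on.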
### Pairing and security
Even though this is a QA channel, it should still implement real policy
surfaces:
- DM allowlist / pairing flow
- group policy
- mention gating where relevant
- trusted sender ids
Reason:
- these are product features and should be testable through the QA transport
- the QA lane should be able to verify policy failures, not only happy paths
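A policy gate covering these surfaces might be sketched as below. Every name here (`QaPolicy`, `InboundEnvelope`, `evaluatePolicy`) is hypothetical, and the rule that trusted senders bypass mention gating is an assumption made for illustration, not something this note specifies.

```typescript
// Hypothetical config shape; the note only lists the policy surfaces.
interface QaPolicy {
  dmAllowlist: string[]; // sender ids allowed to open DMs
  trustedSenders: string[]; // ASSUMPTION: these ids bypass mention gating
  requireMentionInGroups: boolean;
}

interface InboundEnvelope {
  senderId: string;
  conversationKind: "dm" | "chan" | "thread";
  mentionsBot: boolean;
}

type PolicyDecision =
  | { allow: true }
  | { allow: false; reason: "dm-not-allowlisted" | "mention-required" };

// Decide whether an inbound message may reach the agent loop.
function evaluatePolicy(policy: QaPolicy, msg: InboundEnvelope): PolicyDecision {
  if (msg.conversationKind === "dm") {
    return policy.dmAllowlist.includes(msg.senderId)
      ? { allow: true }
      : { allow: false, reason: "dm-not-allowlisted" };
  }
  const trusted = policy.trustedSenders.includes(msg.senderId);
  if (policy.requireMentionInGroups && !msg.mentionsBot && !trusted) {
    return { allow: false, reason: "mention-required" };
  }
  return { allow: true };
}
```

A typed denial reason gives the QA lane something concrete to assert on when it verifies policy failures.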
### Threading model
Threading is one of the main reasons to build this channel.
Required semantics:
- create thread from a top-level message
- reply inside an existing thread
- list thread messages
- preserve parent message linkage
- let OpenClaw thread binding attach a session to a thread
The QA bus must preserve:
- conversation id
- thread id
- parent message id
- sender id
- timestamps
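The required thread semantics can be illustrated with a small in-memory store. This is a sketch under stated assumptions: `QaMessage` and `ThreadStore` are hypothetical names, a counter stands in for the suggested ULID message ids, and thread ids are derived from the root message purely for readability.

```typescript
// Hypothetical record shape; the fields mirror the list the bus must preserve.
interface QaMessage {
  id: string;
  conversationId: string; // e.g. "chan:space-1"
  threadId?: string; // set on a thread root and on every reply in the thread
  parentMessageId?: string; // set on replies only; always points at the root
  senderId: string;
  timestamp: number;
  text: string;
}

class ThreadStore {
  private readonly messages: QaMessage[] = [];
  private counter = 0;

  // Post a top-level message. A counter stands in for a ULID here.
  post(conversationId: string, senderId: string, text: string, timestamp: number): QaMessage {
    this.counter += 1;
    const message: QaMessage = { id: `msg:${this.counter}`, conversationId, senderId, timestamp, text };
    this.messages.push(message);
    return message;
  }

  // Reply to a message. The first reply to a top-level message creates the thread;
  // replying to a reply stays in the same thread, linked to the root.
  replyTo(parentId: string, senderId: string, text: string, timestamp: number): QaMessage {
    const parent = this.messages.find((m) => m.id === parentId);
    if (!parent) throw new Error(`unknown parent message: ${parentId}`);
    const root = parent.parentMessageId
      ? this.messages.find((m) => m.id === parent.parentMessageId) ?? parent
      : parent;
    if (!root.threadId) root.threadId = `t-${root.id.slice("msg:".length)}`;
    const reply = this.post(root.conversationId, senderId, text, timestamp);
    reply.threadId = root.threadId;
    reply.parentMessageId = root.id;
    return reply;
  }

  // List the root and all replies of a thread in insertion order.
  listThread(threadId: string): QaMessage[] {
    return this.messages.filter((m) => m.threadId === threadId);
  }
}
```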
### Channel-owned message actions
The plugin should implement `actions.describeMessageTool(...)` and
`actions.handleAction(...)`.
MVP action list:
- `send`
- `read`
- `reply`
- `react`
- `edit`
- `delete`
- `thread-create`
- `thread-reply`
- `search`
- `upload-file`
- `download-file`
This is enough to test the shared `message` tool end to end with real channel
semantics.
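The real signatures of `actions.describeMessageTool(...)` and `actions.handleAction(...)` are not shown in this note, so the following is a standalone sketch of the dispatch idea only. The types `ActionResult` and `ActionHandler` and the free function `handleAction` are hypothetical stand-ins.

```typescript
// The MVP action names from the list above.
const MVP_ACTIONS = [
  "send", "read", "reply", "react", "edit", "delete",
  "thread-create", "thread-reply", "search", "upload-file", "download-file",
] as const;

type QaAction = (typeof MVP_ACTIONS)[number];

// Hypothetical result and handler shapes, not the plugin SDK's own types.
type ActionResult = { ok: true; data?: unknown } | { ok: false; error: string };
type ActionHandler = (params: Record<string, unknown>) => ActionResult;

function isQaAction(name: string): name is QaAction {
  return (MVP_ACTIONS as readonly string[]).includes(name);
}

// Route an action name to its handler, reporting unsupported or
// unimplemented actions as structured errors rather than throwing.
function handleAction(
  handlers: Partial<Record<QaAction, ActionHandler>>,
  name: string,
  params: Record<string, unknown>,
): ActionResult {
  if (!isQaAction(name)) return { ok: false, error: `unsupported action: ${name}` };
  const handler = handlers[name];
  if (!handler) return { ok: false, error: `not implemented: ${name}` };
  return handler(params);
}
```

Structured errors line up with the "unsupported action" failure-handling scenario: the harness can assert on the error text without parsing a stack trace.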
## `qa-bus` design
`qa-bus` is the transport simulator and assertion backend.
It should not know OpenClaw internals. It should know channel state.
For v1, keep `qa-bus` in this repo so:
- fixtures and scenarios evolve with product code
- the transport contract can change in lock-step with the plugin
- CI and local dev do not need another repo checkout
### Responsibilities
- accept inbound user/platform events
- persist canonical conversation state
- persist an append-only event log
- expose inspection APIs
- expose blocking wait APIs
- support reset per scenario or per suite
### Transport
HTTP is enough for MVP.
Suggested endpoints:
- `POST /reset`
- `POST /inbound/message`
- `POST /inbound/edit`
- `POST /inbound/delete`
- `POST /inbound/reaction`
- `POST /inbound/thread/create`
- `GET /state/conversations`
- `GET /state/messages`
- `GET /state/threads`
- `GET /events`
- `POST /wait`
Optional WS stream:
- `/stream`
Useful for live event taps and debugging.
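A harness-side client for these endpoints might look like the sketch below. The endpoint paths come from the list above; the class name `QaBusClient`, the request body fields (`to`, `senderId`, `text`, and so on), and the assumption that every endpoint answers with JSON are all hypothetical. The transport is injected so the sketch does not depend on a running `qa-bus`.

```typescript
type Method = "GET" | "POST";
type Transport = (method: Method, path: string, body?: unknown) => Promise<unknown>;

class QaBusClient {
  private readonly send: Transport;

  constructor(send: Transport) {
    this.send = send;
  }

  reset(): Promise<unknown> {
    return this.send("POST", "/reset");
  }

  // Body fields are illustrative; the note does not define request schemas.
  injectMessage(body: { to: string; senderId: string; text: string }): Promise<unknown> {
    return this.send("POST", "/inbound/message", body);
  }

  injectReaction(body: { messageId: string; senderId: string; emoji: string }): Promise<unknown> {
    return this.send("POST", "/inbound/reaction", body);
  }

  listMessages(): Promise<unknown> {
    return this.send("GET", "/state/messages");
  }

  events(): Promise<unknown> {
    return this.send("GET", "/events");
  }

  wait(body: { kind: string; match: Record<string, unknown>; timeoutMs: number }): Promise<unknown> {
    return this.send("POST", "/wait", body);
  }
}

// One possible transport built on the standard fetch API.
// ASSUMPTION: every endpoint responds with a JSON body.
function fetchTransport(baseUrl: string): Transport {
  return async (method, path, body) => {
    const res = await fetch(baseUrl + path, {
      method,
      headers: body === undefined ? undefined : { "content-type": "application/json" },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`${method} ${path} failed: ${res.status}`);
    return res.json();
  };
}
```

Injecting the transport also lets the scenario runner's own tests record calls without any network access.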
### State model
Persist three layers.
1. Conversation snapshot
   - participants
   - type
   - thread topology
   - latest message pointers
2. Message snapshot
   - sender
   - content
   - attachments
   - edit history
   - reactions
   - parent and thread linkage
3. Append-only event log
   - canonical timestamp
   - causal ordering
   - source: inbound, outbound, action, system
   - payload
The append-only log matters because many QA assertions are event-oriented, not
just state-oriented.
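The third layer could be sketched as follows. `BusEvent` and `EventLog` are hypothetical names, the `kind` values are illustrative, and a monotonic sequence number is used here as a simple stand-in for causal ordering.

```typescript
type BusEventSource = "inbound" | "outbound" | "action" | "system";

interface BusEvent {
  seq: number; // monotonic sequence number, assigned by the log
  timestamp: number; // canonical timestamp in ms since epoch
  source: BusEventSource;
  kind: string; // e.g. "message", "edit", "reaction" (illustrative names)
  payload: unknown;
}

class EventLog {
  private readonly events: BusEvent[] = [];

  // Append a new event. Existing entries are never mutated or removed.
  append(source: BusEventSource, kind: string, payload: unknown, timestamp: number = Date.now()): BusEvent {
    const event: BusEvent = Object.freeze({
      seq: this.events.length + 1,
      timestamp,
      source,
      kind,
      payload,
    });
    this.events.push(event);
    return event;
  }

  // Events strictly after a cursor, so a scenario can ignore earlier history.
  since(seq: number): readonly BusEvent[] {
    return this.events.filter((event) => event.seq > seq);
  }
}
```

Freezing each entry is a cheap guard against accidental mutation; it is shallow, so payloads would need their own treatment if they must be immutable too.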
### Assertion API
The harness needs waiters, not just snapshots.
Suggested `POST /wait` contract:
- `kind`
- `match`
- `timeoutMs`
Examples:
- wait for outbound message matching text regex
- wait for thread creation
- wait for reaction added
- wait for message edit
- wait for no event of type X within Y ms
This gives stable tests without custom polling code in every scenario.
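One way to back `POST /wait` is an in-process hub that resolves waiters as events are published. This is a minimal sketch: `WaitHub`, `QaEvent`, the `textIncludes` and `textRegex` matchers, and the `afterSeq` cursor are assumptions for illustration; only `kind`, `match`, and `timeoutMs` come from the suggested contract.

```typescript
interface QaEvent {
  seq: number;
  kind: string;
  text?: string;
}

interface WaitRequest {
  kind: string;
  match: { textIncludes?: string; textRegex?: string };
  timeoutMs: number;
  afterSeq?: number; // hypothetical cursor: ignore events at or before this seq
}

// Pure predicate, so matching rules can be tested without timers.
function matchesEvent(req: WaitRequest, ev: QaEvent): boolean {
  if (ev.kind !== req.kind) return false;
  if (ev.seq <= (req.afterSeq ?? 0)) return false;
  const text = ev.text ?? "";
  if (req.match.textIncludes !== undefined && !text.includes(req.match.textIncludes)) return false;
  if (req.match.textRegex !== undefined && !new RegExp(req.match.textRegex).test(text)) return false;
  return true;
}

interface Pending {
  req: WaitRequest;
  resolve: (ev: QaEvent | null) => void;
  timer?: ReturnType<typeof setTimeout>;
}

class WaitHub {
  private readonly seen: QaEvent[] = [];
  private pending: Pending[] = [];

  // Record an event and resolve every waiter it satisfies.
  publish(ev: QaEvent): void {
    this.seen.push(ev);
    const ready = this.pending.filter((p) => matchesEvent(p.req, ev));
    this.pending = this.pending.filter((p) => !ready.includes(p));
    for (const p of ready) {
      clearTimeout(p.timer);
      p.resolve(ev);
    }
  }

  // Resolve with the first matching event, or with null once timeoutMs elapses.
  wait(req: WaitRequest): Promise<QaEvent | null> {
    const hit = this.seen.find((ev) => matchesEvent(req, ev));
    if (hit) return Promise.resolve(hit);
    return new Promise((resolve) => {
      const entry: Pending = { req, resolve };
      entry.timer = setTimeout(() => {
        this.pending = this.pending.filter((p) => p !== entry);
        resolve(null);
      }, req.timeoutMs);
      this.pending.push(entry);
    });
  }
}
```

The "no event of type X within Y ms" case falls out of the same call: the scenario passes when `wait` resolves with `null`.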
## QA orchestrator design
The orchestrator should own scenario planning and artifact collection.
Start host-side; this is the chosen v1 direction. Later, OpenClaw can
orchestrate parts of it.
Why:
- simpler to iterate while the transport and scenario protocol are still moving
- easier access to the repo, logs, Docker, and test fixtures
- easier artifact collection and report generation
- avoids over-coupling the first version to subagent behavior before the QA
protocol itself is stable
### Inputs
- docs pages
- channel capability discovery
- configured provider/model lane
- scenario catalog
- repo/test metadata
### Outputs
- structured protocol report
- scenario transcript
- captured channel state
- gateway logs
- failure packets
For v1, the primary output is a Markdown report.
Suggested report sections:
- suite summary
- environment
- provider/model matrix
- scenarios passed
- scenarios failed
- flaky or inconclusive scenarios
- captured evidence links or inline excerpts
- suspected ownership or file hints
- follow-up recommendations
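A report generator for two of these sections could be as small as the sketch below. `CellResult` and `renderReport` are hypothetical names, and the `cell` string is a free-form provider/model label. Each lane and matrix cell gets its own row, so a failing quality cell does not mask a green deterministic lane.

```typescript
type Outcome = "pass" | "fail" | "flaky" | "unknown";

interface CellResult {
  scenarioId: string;
  lane: "deterministic" | "quality";
  cell: string; // provider/model label for this matrix cell
  outcome: Outcome;
}

// Render the matrix and failure sections of the Markdown report.
function renderReport(results: CellResult[]): string {
  const lines: string[] = [
    "# QA report",
    "",
    "## Provider/model matrix",
    "",
    "| Lane | Cell | Pass | Fail | Flaky | Unknown |",
    "| --- | --- | --- | --- | --- | --- |",
  ];
  // Count outcomes per (lane, cell) pair, preserving first-seen order.
  const groups = new Map<string, Record<Outcome, number>>();
  for (const r of results) {
    const key = `${r.lane} | ${r.cell}`;
    const counts = groups.get(key) ?? { pass: 0, fail: 0, flaky: 0, unknown: 0 };
    counts[r.outcome] += 1;
    groups.set(key, counts);
  }
  for (const [key, c] of groups) {
    lines.push(`| ${key} | ${c.pass} | ${c.fail} | ${c.flaky} | ${c.unknown} |`);
  }
  lines.push("", "## Scenarios failed", "");
  for (const r of results.filter((x) => x.outcome === "fail")) {
    lines.push(`- ${r.scenarioId} (${r.lane}, ${r.cell})`);
  }
  return lines.join("\n");
}
```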
## Scenario format
Use a data-driven scenario spec.
Suggested shape:
```json
{
  "id": "thread-memory-recall",
  "lane": "deterministic",
  "preconditions": ["qa-channel", "memory-enabled"],
  "steps": [
    {
      "type": "injectMessage",
      "to": "qa:dm:user-a",
      "text": "Remember that the deploy key is kiwi."
    },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } },
    { "type": "injectMessage", "to": "qa:dm:user-a", "text": "What was the deploy key?" },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } }
  ],
  "assertions": [{ "type": "outboundTextIncludes", "value": "kiwi" }]
}
```
Keep the execution engine generic and the scenario catalog declarative.
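A generic engine for this spec might be sketched as below. The `Scenario` type mirrors the JSON shape above; the `ScenarioDriver` interface, `runScenario`, and the result shape are hypothetical, and the sketch handles only the two step types and one assertion type that appear in the example.

```typescript
type Match = { textIncludes: string };

type Step =
  | { type: "injectMessage"; to: string; text: string }
  | { type: "waitForOutbound"; match: Match };

interface Scenario {
  id: string;
  lane: "deterministic" | "quality";
  preconditions: string[];
  steps: Step[];
  assertions: Array<{ type: "outboundTextIncludes"; value: string }>;
}

// Hypothetical seam between the engine and the transport (for example qa-bus).
interface ScenarioDriver {
  inject(to: string, text: string): Promise<void>;
  // Resolves with the outbound text, or null when the wait times out.
  waitForOutbound(match: Match): Promise<string | null>;
}

interface ScenarioRun {
  id: string;
  outcome: "pass" | "fail";
  failedStep?: number; // index of the step that timed out, if any
  outbound: string[];
}

// Execute steps in order, then evaluate assertions over collected outbound text.
async function runScenario(scenario: Scenario, driver: ScenarioDriver): Promise<ScenarioRun> {
  const outbound: string[] = [];
  for (const [index, step] of scenario.steps.entries()) {
    if (step.type === "injectMessage") {
      await driver.inject(step.to, step.text);
    } else {
      const text = await driver.waitForOutbound(step.match);
      if (text === null) {
        return { id: scenario.id, outcome: "fail", failedStep: index, outbound };
      }
      outbound.push(text);
    }
  }
  const ok = scenario.assertions.every((a) => outbound.some((text) => text.includes(a.value)));
  return { id: scenario.id, outcome: ok ? "pass" : "fail", outbound };
}
```

Because the engine only sees the driver interface, the same scenario file can run against the Phase 0 bootstrap path and later against `qa-bus` without changes to the catalog.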
## Feature discovery
The orchestrator can discover candidate scenarios from three sources.
1. Docs
   - channel docs
   - testing docs
   - gateway docs
   - subagents docs
   - cron docs
2. Runtime capability discovery
   - channel `message` action discovery
   - plugin status and channel capabilities
   - configured providers/models
3. Code hints
   - known action names
   - channel-specific feature flags
   - config schema
This should produce a proposed protocol with:
- must-test
- can-test
- blocked
- unsupported
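How a candidate lands in one of these buckets is not defined in this note. The sketch below shows one possible rule set, purely as an assumption: a feature the channel does not advertise is `unsupported`, an advertised feature with unmet preconditions is `blocked`, and documentation decides between `must-test` and `can-test`. `Candidate` and `classify` are hypothetical names.

```typescript
type Bucket = "must-test" | "can-test" | "blocked" | "unsupported";

interface Candidate {
  feature: string;
  documented: boolean; // found in docs
  advertised: boolean; // reported by runtime capability discovery
  preconditionsMet: boolean; // required config, providers, or plugins are present
}

// ASSUMED rules; the note only names the four buckets.
function classify(candidate: Candidate): Bucket {
  if (!candidate.advertised) return "unsupported";
  if (!candidate.preconditionsMet) return "blocked";
  return candidate.documented ? "must-test" : "can-test";
}

// Group candidates into a proposed protocol keyed by bucket.
function proposeProtocol(candidates: Candidate[]): Record<Bucket, string[]> {
  const protocol: Record<Bucket, string[]> = {
    "must-test": [],
    "can-test": [],
    blocked: [],
    unsupported: [],
  };
  for (const candidate of candidates) {
    protocol[classify(candidate)].push(candidate.feature);
  }
  return protocol;
}
```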
## Scenario classes
Recommended catalog:
- transport basics
  - DM send and reply
  - channel send
  - thread create and reply
  - reaction add and read
  - edit and delete
- policy
  - allowlist
  - pairing
  - group mention gating
- shared `message` tool
  - read
  - search
  - reply
  - react
  - upload and download
- agent quality
  - follows channel context
  - obeys thread semantics
  - uses memory across turns
  - switches model when instructed
- automation
  - cron add and run
  - cron delivery into channel
  - scheduled reminders
- subagents
  - spawn
  - announce
  - threaded follow-up
  - nested orchestration when enabled
- failure handling
  - unsupported action
  - timeout
  - malformed target
  - policy denial
## OpenClaw as orchestrator
Longer-term, OpenClaw itself can coordinate the QA run.
Suggested architecture:
- one controller session
- N worker subagents
- each worker owns one scenario or scenario shard
- workers report structured results back to controller
Good fits for existing OpenClaw primitives:
- `sessions_spawn`
- `subagents`
- cron-based wakeups for long-running suites
- thread-bound sessions for scenario-local follow-up
Best near-term use:
- controller generates the plan
- workers execute scenarios in parallel
- controller synthesizes report
Avoid making the controller also own host Git operations in the first version.
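The "each worker owns one scenario or scenario shard" split can be illustrated with a plain round-robin partition. `shardScenarios` is a hypothetical helper; how the controller actually hands shards to subagents is outside this sketch.

```typescript
// Split scenario ids across at most `workers` shards, round-robin.
// Never returns empty shards: fewer scenarios than workers yields fewer shards.
function shardScenarios<T>(items: readonly T[], workers: number): T[][] {
  if (!Number.isInteger(workers) || workers < 1) {
    throw new RangeError("workers must be a positive integer");
  }
  const shardCount = Math.min(workers, items.length);
  const shards: T[][] = Array.from({ length: shardCount }, () => []);
  items.forEach((item, index) => {
    shards[index % shardCount].push(item);
  });
  return shards;
}
```

Round-robin keeps shard sizes within one item of each other; a real controller might instead weight shards by expected scenario duration.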
Chosen direction:
- v1: host-side controller
- v2+: OpenClaw-native orchestration once the scenario protocol and transport
model are stable
## Auto-fix workflow
The system should emit a structured bug packet when a scenario fails.
Suggested bug packet:
- scenario id
- lane
- failure kind
- minimal repro steps
- channel event transcript
- gateway transcript
- logs
- suspected files
- confidence
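As a typed shape, the packet might look like the sketch below. The field names are a direct transliteration of the list above; the value types, the example `failureKind` values, and the `[0, 1]` range for `confidence` are assumptions, and `validateBugPacket` is a hypothetical helper.

```typescript
interface BugPacket {
  scenarioId: string;
  lane: "deterministic" | "quality";
  failureKind: string; // e.g. "timeout" or "policy-denial" (illustrative values)
  reproSteps: string[];
  channelEvents: unknown[]; // channel event transcript
  gatewayTranscript: string;
  logs: string[];
  suspectedFiles: string[];
  confidence: number; // ASSUMED to be a score in [0, 1]
}

// Return a list of problems; an empty list means the packet is usable
// by a downstream fix worker.
function validateBugPacket(packet: BugPacket): string[] {
  const problems: string[] = [];
  if (packet.scenarioId.trim() === "") problems.push("scenarioId is empty");
  if (packet.reproSteps.length === 0) problems.push("reproSteps is empty");
  // The negated form also rejects NaN.
  if (!(packet.confidence >= 0 && packet.confidence <= 1)) {
    problems.push("confidence must be within [0, 1]");
  }
  return problems;
}
```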
Host-side fix worker flow:
1. receive bug packet
2. create detached worktree
3. launch coding agent in worktree
4. write failing regression first when practical
5. implement fix
6. run scoped verification
7. open PR
This should remain host-side at first because it needs:
- repo write access
- worktree hygiene
- git credentials
- GitHub auth
Chosen direction:
- do not auto-open PRs in v1
- emit Markdown reports and structured failure packets first
- add host-side worktree + PR automation later
## Rollout plan
### Phase 0: bootstrap on existing synthetic ingress
Build a first QA runner without a new channel:
- use `chat.send` with admin-scoped synthetic originating-route fields
- run deterministic scenarios against routing, memory, cron, subagents, and ACP
- validate protocol format and artifact collection
Exit criteria:
- scenario runner exists
- structured protocol report exists
- failure artifacts exist
### Phase 1: MVP `qa-channel`
Build the plugin and bus with:
- DM
- channels
- threads
- read
- reply
- react
- edit
- delete
- search
Target semantics:
- Slack-class transport behavior
- not full Teams-class parity yet
Exit criteria:
- OpenClaw in Docker can talk to `qa-bus`
- harness can inject + inspect
- one green end-to-end suite across message transport and agent behavior
### Phase 2: protocol expansion
Add:
- attachments
- polls
- pins
- richer policy tests
- quality lane with real provider/model matrix
Exit criteria:
- scenario matrix covers major built-in features
- deterministic and quality lanes are separated
### Phase 3: subagent-driven QA
Add:
- controller agent
- worker subagents
- scenario discovery from docs + capability discovery
- parallel execution
Exit criteria:
- one controller can fan out and synthesize a suite report
### Phase 4: auto-fix loop
Add:
- bug packet emission
- host-side worktree runner
- PR creation
Exit criteria:
- selected failures can auto-produce draft PRs
## Risks
### Risk: too much magic in one layer
If the QA channel, bus, and orchestrator all become smart at once, debugging
will be painful.
Mitigation:
- keep `qa-channel` transport-focused
- keep `qa-bus` state-focused
- keep orchestrator separate
### Risk: flaky assertions from model variance
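Mitigation:
- run exact assertions only in the deterministic lane
- score the quality lane against rubrics that tolerate wording variation
- keep separate pass criteria for each lane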
### Risk: test-only branches leaking into core
Mitigation:
- no core special cases for `qa-channel`
- use normal plugin seams
- use admin synthetic ingress only as bootstrap
### Risk: auto-fix overreach
Mitigation:
- keep fix worker host-side
- require explicit policy for when PRs can open automatically
- gate with scoped tests
### Risk: building a fake platform nobody uses
Mitigation:
- emulate Slack/Discord/Teams semantics, not an abstract transport
- prioritize features that stress shared OpenClaw boundaries
## MVP recommendation
If building this now, follow this order.
1. Host-side scenario runner using existing synthetic originating-route support.
2. `qa-bus` sidecar with state, events, reset, and wait APIs.
3. `extensions/qa-channel` MVP with DMs, channels, threads, reply, read, react,
edit, delete, and search.
4. Markdown report generator for suite + matrix output.
5. One deterministic end-to-end suite:
   - inject inbound DM
   - verify reply
   - create thread
   - verify follow-up in thread
   - verify memory recall on later turn
6. Add the quality lane with a curated real-model matrix.
7. Add controller subagent orchestration.
8. Add host-side auto-fix worktree runner.
This order gets real value quickly without requiring the full grand design to
land before the first useful signal appears.
## Current product decisions
- `qa-bus` lives inside this repo
- the first controller is host-side
- Slack-class behavior is the MVP target
- the quality lane uses a curated matrix
- first version produces Markdown reports, not PRs
- OpenClaw-native orchestration is a later phase, not a v1 requirement