vultr/openclaw

mirror of https://github.com/openclaw/openclaw.git synced 2026-04-12 01:31:08 +00:00

Peter Steinberger 17a324b0de chore: polish qa lab follow-ups

2026-04-05 23:21:56 +01:00

title: QA E2E Automation
summary: Design note for a full end-to-end QA system built on a synthetic message-channel plugin, Dockerized OpenClaw, and subagent-driven scenario execution
read_when:
- You are designing a true end-to-end QA harness for OpenClaw
- You want a synthetic message channel for automated feature verification
- You want subagents to discover features, run scenarios, and propose fixes

QA E2E Automation

This note proposes a true end-to-end QA system for OpenClaw built around a real channel plugin dedicated to testing.

The core idea:

run OpenClaw inside Docker in a realistic gateway configuration
expose a synthetic but full-featured message channel as a normal plugin
let a QA harness inject inbound traffic and inspect outbound state
let OpenClaw agents and subagents explore, verify, and report on behavior
optionally escalate failing scenarios into host-side fix workflows that open PRs

This is not a unit-test replacement. It is a product-level system test layer.

Chosen direction

The initial direction for this project is:

build the full system inside this repo
test against a matrix, not a single model/provider pair
use Markdown reports as the first output artifact
defer auto-PR and auto-fix work until later
treat Slack-class semantics as the MVP transport target
keep orchestration simple in v1, with a host-side controller that exercises the moving parts directly
evolve toward OpenClaw becoming the orchestration layer later, once the transport, scenario, and reporting model are proven

Goals

Test OpenClaw through a real messaging-channel boundary, not only chat.send or embedded mocks.
Verify channel semantics that matter for real use:
- DMs
- channels/groups
- threads
- edits
- deletes
- reactions
- polls
- attachments
Verify agent behavior across realistic user flows:
- memory
- thread binding
- model switching
- cron jobs
- subagents
- approvals
- routing
- channel-specific message actions
Make the QA runner capable of feature discovery:
- read docs
- inspect plugin capability discovery
- inspect code and config
- generate a scenario protocol
Support deterministic protocol tests and best-effort real-model tests as separate lanes.
Allow automated bug triage artifacts that can feed a host-side fix worker.

Non-goals

Not a replacement for existing unit, contract, or live tests.
Not a production channel.
Not a requirement that all bug fixing happen from inside the Dockerized OpenClaw runtime.
Not a reason to add test-only core branches for one channel.

Why a channel plugin

OpenClaw already has the right boundary:

core owns the shared message tool, prompt wiring, outer session bookkeeping, and dispatch
channel plugins own:
- config
- pairing
- security
- session grammar
- threading
- outbound delivery
- channel-owned actions and capability discovery

That means the cleanest design is:

a real channel plugin for QA transport semantics
a separate QA control plane for injection and inspection

This keeps the test transport inside the same architecture used by Slack, Discord, Teams, and similar channels.

System overview

The system has six pieces.

qa-channel plugin

Bundled extension under extensions/qa-channel
Normal ChannelPlugin
Behaves like a Slack/Discord/Teams-class channel
Registers channel-owned message actions through the shared message tool

qa-bus sidecar

Small HTTP and/or WS service
Canonical state store for synthetic conversations, messages, threads, reactions, edits, and event history
Accepts inbound events from the harness
Exposes inspection and wait APIs for assertions

Dockerized OpenClaw gateway

Runs as close to real deployment as practical
Loads qa-channel
Uses normal config, routing, session, cron, and plugin loading

QA orchestrator

Host-side runner or dedicated OpenClaw-driven controller
Provisions scenario environments
Seeds config
Resets state
Executes test matrix
Collects structured outcomes

Auto-fix worker

Host-side workflow
Creates a worktree
Launches a coding agent
Runs scoped verification
Opens a PR

The auto-fix worker should start outside the container. It needs direct repo and GitHub access, clean worktree control, and better isolation from the runtime under test.

qa-lab extension

Bundled extension under extensions/qa-lab
Owns the QA harness, Markdown report flow, and private debugger UI
Registers hidden CLI entrypoints such as openclaw qa run and openclaw qa ui
Stays separate from the shipped Control UI bundle

High-level flow

1. Start qa-bus.
2. Start OpenClaw in Docker with qa-channel enabled.
3. QA orchestrator injects inbound messages into qa-bus.
4. qa-channel receives them as normal inbound traffic.
5. OpenClaw runs the agent loop normally.
6. Outbound replies and channel actions flow back through qa-channel into qa-bus.
7. QA orchestrator inspects state or waits on events.
8. Orchestrator records pass/fail/flaky/unknown plus artifacts.
9. Severe failures optionally emit a bug packet for the host-side fix worker.

Lanes

The system should have two distinct lanes.

Lane A: deterministic protocol lane

Use a deterministic or tightly controlled model setup.

Preferred options:

a canned provider fixture
the bundled synthetic provider when useful
fixed prompts with exact assertions

Purpose:

verify transport and product semantics
keep flakiness low
catch regressions in routing, memory plumbing, thread binding, cron, and tool invocation

Lane B: quality lane

Use real providers and real models in a matrix.

Purpose:

verify that the agent can still do good work end to end
evaluate feature discoverability and instruction following
surface model-specific breakage or degraded behavior

Expected result type:

best-effort
rubric-based
more tolerant of wording variation

Matrix guidance for v1:

start with a small curated matrix, not "everything configured"
keep deterministic protocol runs separate from quality runs
report matrix cells independently so one provider/model failure does not hide transport correctness

Do not mix these lanes. Protocol correctness and model quality should fail independently.
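
The matrix guidance above can be sketched as a small data model. This is a minimal illustration, not part of the existing codebase: the type names, the `expandMatrix` and `summarizeByCell` helpers, and the provider/model strings are all assumptions. It shows each lane getting its own curated target list and each matrix cell being summarized on its own.

```typescript
// Lane names come from the note; provider and model names are placeholders.
type Lane = "deterministic" | "quality";
type Outcome = "pass" | "fail" | "flaky" | "unknown";

interface ModelTarget {
  provider: string;
  model: string;
}

interface MatrixCell extends ModelTarget {
  lane: Lane;
}

interface ScenarioResult {
  cell: MatrixCell;
  scenarioId: string;
  outcome: Outcome;
}

// Each lane gets its own curated target list, so the lanes never share cells.
function expandMatrix(config: Record<Lane, ModelTarget[]>): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const lane of Object.keys(config) as Lane[]) {
    for (const target of config[lane]) {
      cells.push({ lane, provider: target.provider, model: target.model });
    }
  }
  return cells;
}

function cellKey(cell: MatrixCell): string {
  return `${cell.lane}/${cell.provider}/${cell.model}`;
}

// Worst outcome wins within a cell; cells are never merged with each other.
const SEVERITY: Record<Outcome, number> = { pass: 0, unknown: 1, flaky: 2, fail: 3 };

function summarizeByCell(results: ScenarioResult[]): Record<string, Outcome> {
  const summary: Record<string, Outcome> = {};
  for (const result of results) {
    const key = cellKey(result.cell);
    const previous = summary[key];
    if (previous === undefined || SEVERITY[result.outcome] > SEVERITY[previous]) {
      summary[key] = result.outcome;
    }
  }
  return summary;
}
```

Because the summary is keyed by lane, provider, and model, a failing quality cell leaves the deterministic cell's result untouched.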

Use existing bootstrap seam first

Before the custom channel exists, OpenClaw already has a useful bootstrap path:

admin-scoped synthetic originating-route fields on chat.send
synthetic message-channel headers for HTTP flows

That is enough to build a first QA controller for:

thread/session routing
ACP bind flows
subagent delivery
cron wake paths
memory persistence checks

This should be Phase 0 because it de-risks the scenario protocol before the full channel lands.

`qa-lab` extension design

qa-lab is the private operator-facing half of this system.

Suggested package:

extensions/qa-lab/

Suggested responsibilities:

host the synthetic bus state machine
host the scenario runner
write Markdown reports
serve a private debugger UI on a separate local server
keep that UI entirely outside the shipped Control UI bundle

Suggested UI shape:

left rail for conversations and threads
center transcript pane
right rail for event stream and report inspection
bottom inject-composer for inbound QA traffic

`qa-channel` plugin design

Package layout

Suggested package:

extensions/qa-channel/

Suggested file layout:

package.json
openclaw.plugin.json
index.ts
setup-entry.ts
api.ts
runtime-api.ts
src/channel.ts
src/channel-api.ts
src/config-schema.ts
src/setup-core.ts
src/setup-surface.ts
src/runtime.ts
src/channel.runtime.ts
src/inbound.ts
src/outbound.ts
src/state-client.ts
src/targets.ts
src/threading.ts
src/message-actions.ts
src/probe.ts
src/doctor.ts
src/*.test.ts

Model it after Slack, Discord, Teams, or Google Chat packaging, not as a one-off test helper.

Capabilities

MVP capabilities:

one account
DMs
channels
threads
send text
reply in thread
read
edit
delete
react
search
upload-file
download-file

Phase 2 capabilities:

polls
member-info
channel-info
channel-list
pin and unpin
permissions
topic create and edit

These map naturally onto the shared message tool action model already used by channel plugins.

Conversation model

Use a stable synthetic grammar that supports both simplicity and realistic coverage.

Suggested ids:

DM conversation: dm:<user-id>
channel: chan:<space-id>
thread: thread:<space-id>:<thread-id>
message id: msg:<ulid>

Suggested target forms:

qa:dm:<user-id>
qa:chan:<space-id>
qa:thread:<space-id>:<thread-id>

The plugin should own translation between external target strings and canonical conversation ids.
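
One way to picture that translation is the sketch below. It assumes ids never contain a colon, which the note does not state, and the names `ConversationRef`, `parseTarget`, `toConversationId`, and `toTarget` are hypothetical. Only the `qa:` prefix and the `dm:` / `chan:` / `thread:` grammar come from the suggestions above.

```typescript
type ConversationRef =
  | { kind: "dm"; userId: string }
  | { kind: "chan"; spaceId: string }
  | { kind: "thread"; spaceId: string; threadId: string };

const TARGET_PREFIX = "qa:";

// Parse an external target such as "qa:thread:<space-id>:<thread-id>".
// Assumes ids never contain ":"; anything else is rejected as malformed.
function parseTarget(target: string): ConversationRef {
  if (target.indexOf(TARGET_PREFIX) !== 0) {
    throw new Error(`malformed target: ${target}`);
  }
  const parts = target.slice(TARGET_PREFIX.length).split(":");
  if (parts.some((part) => part === "")) {
    throw new Error(`malformed target: ${target}`);
  }
  if (parts[0] === "dm" && parts.length === 2) {
    return { kind: "dm", userId: parts[1] };
  }
  if (parts[0] === "chan" && parts.length === 2) {
    return { kind: "chan", spaceId: parts[1] };
  }
  if (parts[0] === "thread" && parts.length === 3) {
    return { kind: "thread", spaceId: parts[1], threadId: parts[2] };
  }
  throw new Error(`malformed target: ${target}`);
}

// Canonical conversation id, matching the suggested id grammar.
function toConversationId(ref: ConversationRef): string {
  switch (ref.kind) {
    case "dm":
      return `dm:${ref.userId}`;
    case "chan":
      return `chan:${ref.spaceId}`;
    case "thread":
      return `thread:${ref.spaceId}:${ref.threadId}`;
  }
}

// External target string for a canonical reference.
function toTarget(ref: ConversationRef): string {
  return TARGET_PREFIX + toConversationId(ref);
}
```

Rejecting malformed targets with an error, instead of guessing, also gives the "malformed target" failure scenario something concrete to assert on.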

Pairing and security

Even though this is a QA channel, it should still implement real policy surfaces:

DM allowlist / pairing flow
group policy
mention gating where relevant
trusted sender ids

Reason:

these are product features and should be testable through the QA transport
the QA lane should be able to verify policy failures, not only happy paths
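
To make "verify policy failures" concrete, here is a deliberately small sketch of an inbound policy check. It is not OpenClaw's policy engine: the `QaPolicy` fields, the `InboundContext` shape, and the reason strings are assumptions chosen for illustration, and trusted sender ids are left out for brevity.

```typescript
// Illustrative policy shape; field names are assumptions, not OpenClaw config keys.
interface QaPolicy {
  dmAllowlist: string[];
  requireMentionInGroups: boolean;
}

interface InboundContext {
  kind: "dm" | "group";
  senderId: string;
  mentionsAgent: boolean;
}

type PolicyDecision =
  | { allowed: true }
  | { allowed: false; reason: "dm-not-allowlisted" | "mention-required" };

// DMs are gated by the allowlist; group traffic is gated by mentions when enabled.
function evaluatePolicy(policy: QaPolicy, ctx: InboundContext): PolicyDecision {
  if (ctx.kind === "dm") {
    return policy.dmAllowlist.indexOf(ctx.senderId) !== -1
      ? { allowed: true }
      : { allowed: false, reason: "dm-not-allowlisted" };
  }
  if (policy.requireMentionInGroups && !ctx.mentionsAgent) {
    return { allowed: false, reason: "mention-required" };
  }
  return { allowed: true };
}
```

A denial that carries an explicit reason lets a scenario assert on why a message was dropped, not only on the absence of a reply.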

Threading model

Threading is one of the main reasons to build this channel.

Required semantics:

create thread from a top-level message
reply inside an existing thread
list thread messages
preserve parent message linkage
let OpenClaw thread binding attach a session to a thread

The QA bus must preserve:

conversation id
thread id
parent message id
sender id
timestamps

Channel-owned message actions

The plugin should implement actions.describeMessageTool(...) and actions.handleAction(...).

MVP action list:

send
read
reply
react
edit
delete
thread-create
thread-reply
search
upload-file
download-file

This is enough to test the shared message tool end to end with real channel semantics.
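
The real `ChannelPlugin` action types are not reproduced in this note, so the following is only a stand-in sketch of how the MVP action list could drive both discovery and dispatch. Every type and function signature here (`ActionRequest`, `ActionResult`, `describeActions`, and this `handleAction`) is hypothetical; only the action names are taken from the list above.

```typescript
// Stand-in shapes for illustration; not the real ChannelPlugin action API.
const MVP_ACTIONS = [
  "send", "read", "reply", "react", "edit", "delete",
  "thread-create", "thread-reply", "search", "upload-file", "download-file",
] as const;

type QaAction = (typeof MVP_ACTIONS)[number];

interface ActionRequest {
  action: string;
  target: string;
  params: Record<string, unknown>;
}

type ActionResult =
  | { ok: true; data: unknown }
  | { ok: false; error: "unsupported-action" | "not-implemented"; detail: string };

type ActionHandler = (req: ActionRequest) => ActionResult;
type HandlerTable = Partial<Record<QaAction, ActionHandler>>;

function isQaAction(name: string): name is QaAction {
  return (MVP_ACTIONS as readonly string[]).indexOf(name) !== -1;
}

// Capability discovery: advertise only the actions that have a handler.
function describeActions(handlers: HandlerTable): QaAction[] {
  return MVP_ACTIONS.filter((action) => handlers[action] !== undefined);
}

// Dispatch with explicit failure results so "unsupported action" scenarios are testable.
function handleAction(handlers: HandlerTable, req: ActionRequest): ActionResult {
  if (!isQaAction(req.action)) {
    return { ok: false, error: "unsupported-action", detail: req.action };
  }
  const handler = handlers[req.action];
  if (handler === undefined) {
    return { ok: false, error: "not-implemented", detail: req.action };
  }
  return handler(req);
}
```

Deriving the advertised list from the handler table keeps discovery and behavior from drifting apart as actions are added.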

`qa-bus` design

qa-bus is the transport simulator and assertion backend.

It should not know OpenClaw internals. It should know channel state.

For v1, keep qa-bus in this repo so:

fixtures and scenarios evolve with product code
the transport contract can change in lock-step with the plugin
CI and local dev do not need another repo checkout

Responsibilities

accept inbound user/platform events
persist canonical conversation state
persist append-only event log
expose inspection APIs
expose blocking wait APIs
support reset per scenario or per suite

Transport

HTTP is enough for MVP.

Suggested endpoints:

POST /reset
POST /inbound/message
POST /inbound/edit
POST /inbound/delete
POST /inbound/reaction
POST /inbound/thread/create
GET /state/conversations
GET /state/messages
GET /state/threads
GET /events
POST /wait

Optional WS stream:

/stream

Useful for live event taps and debugging.
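
On the harness side, the suggested endpoints could be wrapped in small request builders, sketched below. The endpoint paths come from the list above; the body fields of `InboundMessage` and the `conversation` query parameter are assumptions, since the note does not fix a payload schema. The builders return plain descriptors rather than performing I/O, so they can be unit tested without a running bus.

```typescript
interface BusRequest {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Hypothetical payload for POST /inbound/message; the note does not fix a body schema.
interface InboundMessage {
  to: string;
  senderId: string;
  text: string;
}

// Join a base URL and a path without producing a double slash.
function joinUrl(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + path;
}

function postJson(baseUrl: string, path: string, payload: unknown): BusRequest {
  return {
    method: "POST",
    url: joinUrl(baseUrl, path),
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  };
}

const resetRequest = (baseUrl: string): BusRequest => postJson(baseUrl, "/reset", {});

const inboundMessageRequest = (baseUrl: string, message: InboundMessage): BusRequest =>
  postJson(baseUrl, "/inbound/message", message);

// GET /state/messages, optionally narrowed by a hypothetical "conversation" query parameter.
function messagesRequest(baseUrl: string, conversationId?: string): BusRequest {
  const query =
    conversationId === undefined ? "" : `?conversation=${encodeURIComponent(conversationId)}`;
  return { method: "GET", url: joinUrl(baseUrl, "/state/messages") + query, headers: {} };
}
```

A descriptor can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.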

State model

Persist three layers.

Conversation snapshot

participants
type
thread topology
latest message pointers

Message snapshot

sender
content
attachments
edit history
reactions
parent and thread linkage

Append-only event log

canonical timestamp
causal ordering
source: inbound, outbound, action, system
payload

The append-only log matters because many QA assertions are event-oriented, not just state-oriented.
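
A minimal in-memory sketch of two of these layers, the message snapshot and the append-only event log, might look like the following. The class name `QaBusState`, its methods, and the event `kind` strings are assumptions for illustration. The conversation snapshot, attachments, and reactions are omitted to keep the example short.

```typescript
type BusEventSource = "inbound" | "outbound" | "action" | "system";

interface BusEvent {
  seq: number; // monotonic sequence number, used for causal ordering
  timestamp: number; // epoch milliseconds
  source: BusEventSource;
  kind: string;
  payload: Record<string, unknown>;
}

interface MessageSnapshot {
  id: string;
  conversationId: string;
  threadId?: string;
  parentId?: string;
  senderId: string;
  content: string;
  editHistory: string[];
}

class QaBusState {
  private log: BusEvent[] = [];
  private messages: Record<string, MessageSnapshot> = {};
  private nextSeq = 1;

  // Every mutation goes through append, so the log stays complete.
  private append(source: BusEventSource, kind: string, payload: Record<string, unknown>): BusEvent {
    const event: BusEvent = { seq: this.nextSeq++, timestamp: Date.now(), source, kind, payload };
    this.log.push(event);
    return event;
  }

  postMessage(source: BusEventSource, message: Omit<MessageSnapshot, "editHistory">): BusEvent {
    this.messages[message.id] = { ...message, editHistory: [] };
    return this.append(source, "message", { messageId: message.id });
  }

  editMessage(source: BusEventSource, messageId: string, content: string): BusEvent {
    const current = this.messages[messageId];
    if (current === undefined) throw new Error(`unknown message: ${messageId}`);
    current.editHistory.push(current.content); // keep the previous text
    current.content = content;
    return this.append(source, "edit", { messageId });
  }

  getMessage(messageId: string): MessageSnapshot | undefined {
    return this.messages[messageId];
  }

  // Events after a given sequence number, in append order.
  eventsAfter(seq: number): BusEvent[] {
    return this.log.filter((event) => event.seq > seq);
  }

  // Per-scenario or per-suite reset.
  reset(): void {
    this.log = [];
    this.messages = {};
    this.nextSeq = 1;
  }
}
```

Snapshots answer "what does the conversation look like now", while `eventsAfter` answers "what happened since I injected this message", which is the question most waiters ask.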

Assertion API

The harness needs waiters, not just snapshots.

Suggested POST /wait contract:

kind
match
timeoutMs

Examples:

wait for outbound message matching text regex
wait for thread creation
wait for reaction added
wait for message edit
wait for no event of type X within Y ms

This gives stable tests without custom polling code in every scenario.
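
The decision at the heart of a waiter can be written as a pure function that a polling loop or long-poll handler calls repeatedly. The sketch below is one possible reading of the `kind` / `match` / `timeoutMs` contract: the `textRegex` and `absent` fields inside `match`, the event shape, and the status names are assumptions.

```typescript
interface LoggedEvent {
  seq: number;
  timestamp: number; // epoch milliseconds
  kind: string;
  text?: string;
}

interface WaitRequest {
  kind: string;
  // Hypothetical match fields: an optional text regex, and a flag for negative waits.
  match: { textRegex?: string; absent?: boolean };
  timeoutMs: number;
}

type WaitStatus = "satisfied" | "pending" | "failed";

function eventMatches(event: LoggedEvent, req: WaitRequest): boolean {
  if (event.kind !== req.kind) return false;
  if (req.match.textRegex === undefined) return true;
  return event.text !== undefined && new RegExp(req.match.textRegex).test(event.text);
}

// Decide the state of a wait that started at `startedAt`, evaluated at time `now`.
function evaluateWait(
  log: LoggedEvent[],
  req: WaitRequest,
  startedAt: number,
  now: number,
): WaitStatus {
  const deadline = startedAt + req.timeoutMs;
  const seen = log.some(
    (event) => event.timestamp >= startedAt && event.timestamp <= deadline && eventMatches(event, req),
  );
  const expired = now >= deadline;
  if (req.match.absent === true) {
    // Negative wait: a matching event is a failure, a quiet window is success.
    return seen ? "failed" : expired ? "satisfied" : "pending";
  }
  return seen ? "satisfied" : expired ? "failed" : "pending";
}
```

Keeping time as an explicit argument makes the negative case ("no event of type X within Y ms") testable without real sleeps.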

QA orchestrator design

The orchestrator should own scenario planning and artifact collection.

Start host-side; this is the chosen v1 direction. Later, OpenClaw can orchestrate parts of it.

Why:

simpler to iterate while the transport and scenario protocol are still moving
easier access to the repo, logs, Docker, and test fixtures
easier artifact collection and report generation
avoids over-coupling the first version to subagent behavior before the QA protocol itself is stable

Inputs

docs pages
channel capability discovery
configured provider/model lane
scenario catalog
repo/test metadata

Outputs

structured protocol report
scenario transcript
captured channel state
gateway logs
failure packets

For v1, the primary output is a Markdown report.

Suggested report sections:

suite summary
environment
provider/model matrix
scenarios passed
scenarios failed
flaky or inconclusive scenarios
captured evidence links or inline excerpts
suspected ownership or file hints
follow-up recommendations
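
As a rough illustration of the report flow, the renderer below turns a structured result into Markdown covering most of the suggested sections. The `SuiteReport` shape and function names are assumptions, evidence is rendered inline next to each scenario, and the ownership-hint section is left out for brevity.

```typescript
type Outcome = "pass" | "fail" | "flaky" | "unknown";

interface ScenarioOutcome {
  id: string;
  outcome: Outcome;
  evidence?: string;
}

interface SuiteReport {
  suite: string;
  environment: Record<string, string>;
  matrix: string[];
  scenarios: ScenarioOutcome[];
  recommendations: string[];
}

// Render items as a bullet list, with an explicit placeholder for empty sections.
function bullets(items: string[]): string[] {
  return items.length > 0 ? items.map((item) => `- ${item}`) : ["- none"];
}

function scenarioLines(scenarios: ScenarioOutcome[], wanted: Outcome[]): string[] {
  const selected = scenarios.filter((s) => wanted.indexOf(s.outcome) !== -1);
  return bullets(selected.map((s) => (s.evidence ? `${s.id}: ${s.evidence}` : s.id)));
}

function renderReport(report: SuiteReport): string {
  const passed = report.scenarios.filter((s) => s.outcome === "pass").length;
  const envKeys = Object.keys(report.environment).sort();
  return [
    `# QA report: ${report.suite}`,
    "",
    "## Suite summary",
    `${passed} of ${report.scenarios.length} scenarios passed.`,
    "",
    "## Environment",
    ...bullets(envKeys.map((key) => `${key}: ${report.environment[key]}`)),
    "",
    "## Provider/model matrix",
    ...bullets(report.matrix),
    "",
    "## Scenarios passed",
    ...scenarioLines(report.scenarios, ["pass"]),
    "",
    "## Scenarios failed",
    ...scenarioLines(report.scenarios, ["fail"]),
    "",
    "## Flaky or inconclusive scenarios",
    ...scenarioLines(report.scenarios, ["flaky", "unknown"]),
    "",
    "## Follow-up recommendations",
    ...bullets(report.recommendations),
  ].join("\n");
}
```

Rendering from a structured object keeps the Markdown a view over data that a later failure-packet or PR workflow can reuse.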

Scenario format

Use a data-driven scenario spec.

Suggested shape:

{
  "id": "thread-memory-recall",
  "lane": "deterministic",
  "preconditions": ["qa-channel", "memory-enabled"],
  "steps": [
    {
      "type": "injectMessage",
      "to": "qa:dm:user-a",
      "text": "Remember that the deploy key is kiwi."
    },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } },
    { "type": "injectMessage", "to": "qa:dm:user-a", "text": "What was the deploy key?" },
    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } }
  ],
  "assertions": [{ "type": "outboundTextIncludes", "value": "kiwi" }]
}

Keep the execution engine generic and the scenario catalog declarative.
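
A declarative catalog benefits from a load-time check before anything is executed. The sketch below types the suggested shape and validates untrusted input against it. The type names and `validateScenario` are hypothetical, and only the two step types that appear in the example scenario are modeled.

```typescript
type Step =
  | { type: "injectMessage"; to: string; text: string }
  | { type: "waitForOutbound"; match: { textIncludes: string } };

interface Scenario {
  id: string;
  lane: "deterministic" | "quality";
  preconditions: string[];
  steps: Step[];
  assertions: Array<{ type: "outboundTextIncludes"; value: string }>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the scenario is structurally valid.
function validateScenario(raw: unknown): string[] {
  if (!isRecord(raw)) return ["scenario must be an object"];
  const errors: string[] = [];
  if (typeof raw.id !== "string" || raw.id === "") {
    errors.push("id must be a non-empty string");
  }
  if (raw.lane !== "deterministic" && raw.lane !== "quality") {
    errors.push("lane must be 'deterministic' or 'quality'");
  }
  if (!Array.isArray(raw.steps) || raw.steps.length === 0) {
    errors.push("steps must be a non-empty array");
    return errors;
  }
  raw.steps.forEach((step: unknown, index: number) => {
    if (!isRecord(step)) {
      errors.push(`steps[${index}] must be an object`);
    } else if (step.type === "injectMessage") {
      if (typeof step.to !== "string" || typeof step.text !== "string") {
        errors.push(`steps[${index}] needs string 'to' and 'text'`);
      }
    } else if (step.type === "waitForOutbound") {
      if (!isRecord(step.match) || typeof step.match.textIncludes !== "string") {
        errors.push(`steps[${index}] needs match.textIncludes`);
      }
    } else {
      errors.push(`steps[${index}] has an unknown type`);
    }
  });
  return errors;
}
```

Collecting all problems instead of throwing on the first one makes catalog errors cheaper to fix in a single pass.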

Feature discovery

The orchestrator can discover candidate scenarios from three sources.

Docs

channel docs
testing docs
gateway docs
subagents docs
cron docs

Runtime capability discovery

channel message action discovery
plugin status and channel capabilities
configured providers/models

Code hints

known action names
channel-specific feature flags
config schema

This should produce a proposed protocol with:

must-test
can-test
blocked
unsupported
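
The note does not define how candidates land in each bucket, so the rules below are purely an illustration of one plausible policy: a feature the channel does not advertise is unsupported, an advertised feature with unmet preconditions is blocked, and documented features take priority over undocumented ones. The `Candidate` fields and function names are assumptions.

```typescript
type Bucket = "must-test" | "can-test" | "blocked" | "unsupported";

// Hypothetical merged view of one candidate feature across the three sources.
interface Candidate {
  feature: string;
  documented: boolean; // mentioned in docs
  advertised: boolean; // reported by runtime capability discovery
  preconditionsMet: boolean; // required config, provider, or plugin is available
}

// Illustrative rules only; the real bucketing policy is still to be decided.
function classify(candidate: Candidate): Bucket {
  if (!candidate.advertised) return "unsupported";
  if (!candidate.preconditionsMet) return "blocked";
  return candidate.documented ? "must-test" : "can-test";
}

function proposeProtocol(candidates: Candidate[]): Record<Bucket, string[]> {
  const protocol: Record<Bucket, string[]> = {
    "must-test": [],
    "can-test": [],
    blocked: [],
    unsupported: [],
  };
  for (const candidate of candidates) {
    protocol[classify(candidate)].push(candidate.feature);
  }
  return protocol;
}
```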

Scenario classes

Recommended catalog:

transport basics
- DM send and reply
- channel send
- thread create and reply
- reaction add and read
- edit and delete
policy
- allowlist
- pairing
- group mention gating
shared message tool
- read
- search
- reply
- react
- upload and download
agent quality
- follows channel context
- obeys thread semantics
- uses memory across turns
- switches model when instructed
automation
- cron add and run
- cron delivery into channel
- scheduled reminders
subagents
- spawn
- announce
- threaded follow-up
- nested orchestration when enabled
failure handling
- unsupported action
- timeout
- malformed target
- policy denial

OpenClaw as orchestrator

Longer-term, OpenClaw itself can coordinate the QA run.

Suggested architecture:

one controller session
N worker subagents
each worker owns one scenario or scenario shard
workers report structured results back to controller

Good fits for existing OpenClaw primitives:

sessions_spawn
subagents
cron-based wakeups for long-running suites
thread-bound sessions for scenario-local follow-up

Best near-term use:

controller generates the plan
workers execute scenarios in parallel
controller synthesizes report

Avoid making the controller also own host Git operations in the first version.

Chosen direction:

v1: host-side controller
v2+: OpenClaw-native orchestration once the scenario protocol and transport model are stable

Auto-fix workflow

The system should emit a structured bug packet when a scenario fails.

Suggested bug packet:

scenario id
lane
failure kind
minimal repro steps
channel event transcript
gateway transcript
logs
suspected files
confidence
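
The packet fields above map naturally onto a typed record. In the sketch below, the failure kinds are borrowed from the failure-handling scenario class plus a generic assertion failure. The 0 to 1 confidence scale, the transcript trimming, and all names are assumptions made for illustration.

```typescript
type FailureKind =
  | "assertion"
  | "timeout"
  | "unsupported-action"
  | "malformed-target"
  | "policy-denial";

interface BugPacket {
  scenarioId: string;
  lane: "deterministic" | "quality";
  failureKind: FailureKind;
  reproSteps: string[];
  channelEvents: string[];
  gatewayTranscript: string[];
  logs: string[];
  suspectedFiles: string[];
  confidence: number; // assumed scale: 0 (guess) to 1 (certain)
}

type BugPacketInput = Omit<BugPacket, "confidence"> & { confidence?: number };

// Keep only the most recent entries so packets stay small enough to hand to a worker.
function tail(lines: string[], max: number): string[] {
  return lines.length > max ? lines.slice(lines.length - max) : lines.slice();
}

function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

function buildBugPacket(input: BugPacketInput, maxEntries = 200): BugPacket {
  return {
    ...input,
    channelEvents: tail(input.channelEvents, maxEntries),
    gatewayTranscript: tail(input.gatewayTranscript, maxEntries),
    logs: tail(input.logs, maxEntries),
    confidence: clamp01(input.confidence === undefined ? 0 : input.confidence),
  };
}
```

Trimming to the tail assumes the entries closest to the failure are the most useful; a real implementation might instead window around the failing step.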

Host-side fix worker flow:

receive bug packet
create detached worktree
launch coding agent in worktree
write failing regression first when practical
implement fix
run scoped verification
open PR

This should remain host-side at first because it needs:

repo write access
worktree hygiene
git credentials
GitHub auth

Chosen direction:

do not auto-open PRs in v1
emit Markdown reports and structured failure packets first
add host-side worktree + PR automation later

Rollout plan

Phase 0: bootstrap on existing synthetic ingress

Build a first QA runner without a new channel:

use chat.send with admin-scoped synthetic originating-route fields
run deterministic scenarios against routing, memory, cron, subagents, and ACP
validate protocol format and artifact collection

Exit criteria:

scenario runner exists
structured protocol report exists
failure artifacts exist

Phase 1: MVP `qa-channel`

Build the plugin and bus with:

DM
channels
threads
read
reply
react
edit
delete
search

Target semantics:

Slack-class transport behavior
not full Teams-class parity yet

Exit criteria:

OpenClaw in Docker can talk to qa-bus
harness can inject + inspect
one green end-to-end suite across message transport and agent behavior

Phase 2: protocol expansion

Add:

attachments
polls
pins
richer policy tests
quality lane with real provider/model matrix

Exit criteria:

scenario matrix covers major built-in features
deterministic and quality lanes are separated

Phase 3: subagent-driven QA

Add:

controller agent
worker subagents
scenario discovery from docs + capability discovery
parallel execution

Exit criteria:

one controller can fan out and synthesize a suite report

Phase 4: auto-fix loop

Add:

bug packet emission
host-side worktree runner
PR creation

Exit criteria:

selected failures can auto-produce draft PRs

Risks

Risk: too much magic in one layer

If the QA channel, bus, and orchestrator all become smart at once, debugging will be painful.

Mitigation:

keep qa-channel transport-focused
keep qa-bus state-focused
keep orchestrator separate

Risk: flaky assertions from model variance

Mitigation:

deterministic lane
quality lane
different pass criteria

Risk: test-only branches leaking into core

Mitigation:

no core special cases for qa-channel
use normal plugin seams
use admin synthetic ingress only as bootstrap

Risk: auto-fix overreach

Mitigation:

keep fix worker host-side
require explicit policy for when PRs can open automatically
gate with scoped tests

Risk: building a fake platform nobody uses

Mitigation:

emulate Slack/Discord/Teams semantics, not an abstract transport
prioritize features that stress shared OpenClaw boundaries

MVP recommendation

If building this now, start with this exact order.

1. Host-side scenario runner using existing synthetic originating-route support.
2. qa-bus sidecar with state, events, reset, and wait APIs.
3. extensions/qa-channel MVP with DMs, channels, threads, reply, read, react, edit, delete, and search.
4. Markdown report generator for suite + matrix output.
5. One deterministic end-to-end suite:
- inject inbound DM
- verify reply
- create thread
- verify follow-up in thread
- verify memory recall on later turn
6. Add curated real-model matrix quality lane.
7. Add controller subagent orchestration.
8. Add host-side auto-fix worktree runner.

This order gets real value quickly without requiring the full grand design to land before the first useful signal appears.

Current product decisions

qa-bus lives inside this repo
the first controller is host-side
Slack-class behavior is the MVP target
the quality lane uses a curated matrix
first version produces Markdown reports, not PRs
OpenClaw-native orchestration is a later phase, not a v1 requirement
