mirror of
https://github.com/openclaw/openclaw.git
synced 2026-04-12 17:51:22 +00:00
527 lines
12 KiB
Markdown
# QA Refactor

Status: foundational migration landed.

## Goal

Move OpenClaw QA from a split-definition model to a single source of truth for:

- scenario metadata
- prompts sent to the model
- setup and teardown
- harness logic
- assertions and success criteria
- artifacts and report hints

The desired end state is a generic QA harness that loads powerful scenario definition files instead of hardcoding most behavior in TypeScript.

## Current State

The primary source of truth now lives in `qa/scenarios.md`.

Implemented:

- `qa/scenarios.md`
  - canonical QA pack
  - operator identity
  - kickoff mission
  - scenario metadata
  - handler bindings
- `extensions/qa-lab/src/scenario-catalog.ts`
  - markdown pack parser + zod validation
- `extensions/qa-lab/src/qa-agent-bootstrap.ts`
  - plan rendering from the markdown pack
- `extensions/qa-lab/src/qa-agent-workspace.ts`
  - seeds generated compatibility files plus `QA_SCENARIOS.md`
- `extensions/qa-lab/src/suite.ts`
  - selects executable scenarios through markdown-defined handler bindings
- QA bus protocol + UI
  - generic inline attachments for image/video/audio/file rendering

Remaining split surfaces:

- `extensions/qa-lab/src/suite.ts`
  - still owns most executable custom handler logic
- `extensions/qa-lab/src/report.ts`
  - still derives report structure from runtime outputs

So the source-of-truth split is fixed, but execution is still mostly handler-backed rather than fully declarative.

## What The Real Scenario Surface Looks Like

Reading the current suite shows a few distinct scenario classes.

### Simple interaction

- channel baseline
- DM baseline
- threaded follow-up
- model switch
- approval followthrough
- reaction/edit/delete

### Config and runtime mutation

- config patch skill disable
- config apply restart wake-up
- config restart capability flip
- runtime inventory drift check

### Filesystem and repo assertions

- source/docs discovery report
- build Lobster Invaders
- generated image artifact lookup

### Memory orchestration

- memory recall
- memory tools in channel context
- memory failure fallback
- session memory ranking
- thread memory isolation
- memory dreaming sweep

### Tool and plugin integration

- MCP plugin-tools call
- skill visibility
- skill hot install
- native image generation
- image roundtrip
- image understanding from attachment

### Multi-turn and multi-actor

- subagent handoff
- subagent fanout synthesis
- restart recovery style flows

These categories matter because they drive DSL requirements. A flat list of prompt + expected text is not enough.

## Direction

### Single source of truth

Use `qa/scenarios.md` as the authored source of truth.

The pack should stay:

- human-readable in review
- machine-parseable
- rich enough to drive:
  - suite execution
  - QA workspace bootstrap
  - QA Lab UI metadata
  - docs/discovery prompts
  - report generation

### Preferred authoring format

Use markdown as the top-level format, with structured YAML inside it.

Recommended shape:

- YAML frontmatter
  - id
  - title
  - surface
  - tags
  - docs refs
  - code refs
  - model/provider overrides
  - prerequisites
- prose sections
  - objective
  - notes
  - debugging hints
- fenced YAML blocks
  - setup
  - steps
  - assertions
  - cleanup

This gives:

- better PR readability than giant JSON
- richer context than pure YAML
- strict parsing and zod validation

Raw JSON is acceptable only as an intermediate generated form.

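To make the frontmatter fields above more concrete, here is a minimal, dependency-free sketch of the kind of checks a schema would perform. The real pack is validated with zod; the type name `ScenarioFrontmatterSketch`, the function `validateFrontmatter`, and the camelCase keys `docsRefs` and `codeRefs` are assumptions for illustration and may not match the actual schema.

```ts
// Hypothetical frontmatter shape; the real zod schema may use different keys.
type ScenarioFrontmatterSketch = {
  id: string;
  title: string;
  surface: string;
  tags?: string[];
  docsRefs?: string[];
  codeRefs?: string[];
};

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns a list of problems; an empty list means the input is usable.
function validateFrontmatter(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["frontmatter must be an object"];
  }
  const record = input as Record<string, unknown>;
  const problems: string[] = [];
  // Required scalar fields.
  for (const key of ["id", "title", "surface"]) {
    if (typeof record[key] !== "string" || record[key] === "") {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  // Optional list fields.
  for (const key of ["tags", "docsRefs", "codeRefs"]) {
    if (record[key] !== undefined && !isStringArray(record[key])) {
      problems.push(`${key} must be a list of strings`);
    }
  }
  return problems;
}

// Narrowing helper so callers get a typed value after validation.
function asFrontmatter(input: unknown): ScenarioFrontmatterSketch | null {
  return validateFrontmatter(input).length === 0
    ? (input as ScenarioFrontmatterSketch)
    : null;
}
```

Returning a list of problems rather than throwing on the first one keeps review feedback complete when a scenario file has several mistakes at once.
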
## Proposed Scenario File Shape

Example:

````md
---
id: image-generation-roundtrip
title: Image generation roundtrip
surface: image
tags: [media, image, roundtrip]
models:
  primary: openai/gpt-5.4
requires:
  tools: [image_generate]
  plugins: [openai, qa-channel]
docsRefs:
  - docs/help/testing.md
  - docs/concepts/model-providers.md
codeRefs:
  - extensions/qa-lab/src/suite.ts
  - src/gateway/chat-attachments.ts
---

# Objective

Verify generated media is reattached on the follow-up turn.

# Setup

```yaml scenario.setup
- action: config.patch
  patch:
    agents:
      defaults:
        imageGenerationModel:
          primary: openai/gpt-image-1
- action: session.create
  key: agent:qa:image-roundtrip
```

# Steps

```yaml scenario.steps
- action: agent.send
  session: agent:qa:image-roundtrip
  message: |
    Image generation check: generate a QA lighthouse image and summarize it in one short sentence.
- action: artifact.capture
  kind: generated-image
  promptSnippet: Image generation check
  saveAs: lighthouseImage
- action: agent.send
  session: agent:qa:image-roundtrip
  message: |
    Roundtrip image inspection check: describe the generated lighthouse attachment in one short sentence.
  attachments:
    - fromArtifact: lighthouseImage
```

# Expect

```yaml scenario.expect
- assert: outbound.textIncludes
  value: lighthouse
- assert: requestLog.matches
  where:
    promptIncludes: Roundtrip image inspection check
    imageInputCountGte: 1
- assert: artifact.exists
  ref: lighthouseImage
```
````

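The example relies on fenced blocks whose info string names the block, such as `yaml scenario.setup`. The following is a rough sketch of how a loader might collect those blocks before handing each body to a YAML parser. `extractNamedYamlBlocks` and `NamedBlockSketch` are hypothetical names, and this is not the parser in `scenario-catalog.ts`; it only recognizes three-backtick fences that start at column zero.

```ts
type NamedBlockSketch = { name: string; body: string };

// Built from a string so this sample contains no literal fence of its own.
const FENCE = "`".repeat(3);
const OPEN_FENCE = new RegExp("^" + FENCE + "yaml\\s+(\\S+)\\s*$");
const CLOSE_FENCE = new RegExp("^" + FENCE + "\\s*$");

// Collects every block opened with "yaml <name>" and returns the raw YAML
// text of each one, in document order. Parsing the YAML is left to the caller.
function extractNamedYamlBlocks(markdown: string): NamedBlockSketch[] {
  const blocks: NamedBlockSketch[] = [];
  let current: { name: string; lines: string[] } | null = null;
  for (const line of markdown.split(/\r?\n/)) {
    if (current === null) {
      const open = OPEN_FENCE.exec(line);
      if (open) {
        current = { name: open[1] ?? "", lines: [] };
      }
    } else if (CLOSE_FENCE.test(line)) {
      blocks.push({ name: current.name, body: current.lines.join("\n") });
      current = null;
    } else {
      current.lines.push(line);
    }
  }
  // An unterminated block is dropped here; a stricter loader would report it.
  return blocks;
}
```

Keying blocks by name rather than by the preceding heading means prose headings can be reworded in review without changing what the runner executes.
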
## Runner Capabilities The DSL Must Cover

Based on the current suite, the generic runner needs more than prompt execution.

### Environment and setup actions

- `bus.reset`
- `gateway.waitHealthy`
- `channel.waitReady`
- `session.create`
- `thread.create`
- `workspace.writeSkill`

### Agent turn actions

- `agent.send`
- `agent.wait`
- `bus.injectInbound`
- `bus.injectOutbound`

### Config and runtime actions

- `config.get`
- `config.patch`
- `config.apply`
- `gateway.restart`
- `tools.effective`
- `skills.status`

### File and artifact actions

- `file.write`
- `file.read`
- `file.delete`
- `file.touchTime`
- `artifact.captureGeneratedImage`
- `artifact.capturePath`

### Memory and cron actions

- `memory.indexForce`
- `memory.searchCli`
- `doctor.memory.status`
- `cron.list`
- `cron.run`
- `cron.waitCompletion`
- `sessionTranscript.write`

### MCP actions

- `mcp.callTool`

### Assertions

- `outbound.textIncludes`
- `outbound.inThread`
- `outbound.notInRoot`
- `tool.called`
- `tool.notPresent`
- `skill.visible`
- `skill.disabled`
- `file.contains`
- `memory.contains`
- `requestLog.matches`
- `sessionStore.matches`
- `cron.managedPresent`
- `artifact.exists`

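One way to read the action and assertion names above is as keys into two registries that a generic engine dispatches on. The sketch below illustrates that idea under simplifying assumptions: handlers are synchronous, `agent.send` is replaced by an echo stub, and every type and function name (`RunContext`, `runScenario`, and so on) is hypothetical rather than taken from `suite.ts`.

```ts
type RunContext = { outbound: string[] };
type StepSketch = { action: string; [key: string]: unknown };
type ExpectSketch = { assert: string; [key: string]: unknown };

type ActionHandler = (step: StepSketch, ctx: RunContext) => void;
type AssertionHandler = (expectation: ExpectSketch, ctx: RunContext) => boolean;

const actionRegistry = new Map<string, ActionHandler>();
const assertionRegistry = new Map<string, AssertionHandler>();

// Stand-in for `agent.send`: echoes the message instead of calling a model.
actionRegistry.set("agent.send", (step, ctx) => {
  ctx.outbound.push(`echo: ${String(step["message"] ?? "")}`);
});

assertionRegistry.set("outbound.textIncludes", (expectation, ctx) =>
  ctx.outbound.some((text) => text.includes(String(expectation["value"]))),
);

// Runs steps in order, then evaluates expectations. Returns failure messages;
// an empty list means the scenario passed.
function runScenario(steps: StepSketch[], expectations: ExpectSketch[]): string[] {
  const ctx: RunContext = { outbound: [] };
  const failures: string[] = [];
  for (const step of steps) {
    const handler = actionRegistry.get(step.action);
    if (!handler) {
      failures.push(`unknown action: ${step.action}`);
      continue;
    }
    handler(step, ctx);
  }
  for (const expectation of expectations) {
    const check = assertionRegistry.get(expectation.assert);
    if (!check) {
      failures.push(`unknown assertion: ${expectation.assert}`);
    } else if (!check(expectation, ctx)) {
      failures.push(`failed: ${expectation.assert}`);
    }
  }
  return failures;
}
```

A real engine would need async handlers, timeouts, and cleanup, but the dispatch shape stays the same: unknown names fail loudly instead of being silently skipped.
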
## Variables and Artifact References

The DSL must support saved outputs and later references.

Examples from the current suite:

- create a thread, then reuse `threadId`
- create a session, then reuse `sessionKey`
- generate an image, then attach the file on the next turn
- generate a wake marker string, then assert that it appears later

Needed capabilities:

- `saveAs`
- `${vars.name}`
- `${artifacts.name}`
- typed references for paths, session keys, thread ids, markers, tool outputs

Without variable support, the harness will keep leaking scenario logic back into TypeScript.

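As a minimal sketch of the `${vars.name}` and `${artifacts.name}` references listed above, the helper below resolves placeholders against a scope that `saveAs` would populate. The `ScopeSketch` type and the `interpolate` function are illustrative names only, and all values are treated as plain strings, which sidesteps the typed-reference question.

```ts
type ScopeSketch = {
  vars: Record<string, string>;
  artifacts: Record<string, string>;
};

// Replaces ${vars.name} and ${artifacts.name}. Unknown references throw so a
// typo fails loudly instead of sending a literal placeholder to the agent.
function interpolate(template: string, scope: ScopeSketch): string {
  return template.replace(
    /\$\{(vars|artifacts)\.([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match, kind: "vars" | "artifacts", name: string) => {
      const value = scope[kind][name];
      if (value === undefined) {
        throw new Error(`unresolved reference: ${kind}.${name}`);
      }
      return value;
    },
  );
}
```

Text that does not match the placeholder pattern is left untouched, so prompts can still contain ordinary dollar signs and braces.
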
## What Should Stay As Escape Hatches

A fully pure declarative runner is not realistic in phase 1.

Some scenarios are inherently orchestration-heavy:

- memory dreaming sweep
- config apply restart wake-up
- config restart capability flip
- generated image artifact resolution by timestamp/path
- discovery-report evaluation

These should use explicit custom handlers for now.

Recommended rule:

- 85-90% declarative
- explicit `customHandler` steps for the hard remainder
- named and documented custom handlers only
- no anonymous inline code in the scenario file

That keeps the generic engine clean while still allowing progress.

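The "named and documented custom handlers only" rule can be checked mechanically at load time. The sketch below assumes a step is either a declarative `action` or a `customHandler` reference, and that the registry requires a description for every handler. The step shape, the registry, and the handler name `memory-dreaming-sweep` are hypothetical.

```ts
type DeclarativeStepSketch = { action: string };
type CustomStepSketch = { customHandler: string };
type ScenarioStepSketch = DeclarativeStepSketch | CustomStepSketch;

// Requiring `description` makes "documented" part of the registration type.
type CustomHandlerEntry = { description: string; run: () => void };

const customHandlerRegistry = new Map<string, CustomHandlerEntry>([
  [
    "memory-dreaming-sweep",
    { description: "Orchestrates the dreaming sweep end to end.", run: () => {} },
  ],
]);

function isCustomStep(step: ScenarioStepSketch): step is CustomStepSketch {
  return "customHandler" in step;
}

// Returns the names of custom handlers that a scenario references but the
// registry does not define.
function findUnknownCustomHandlers(
  steps: ScenarioStepSketch[],
  registry: Map<string, CustomHandlerEntry>,
): string[] {
  return steps
    .filter(isCustomStep)
    .map((step) => step.customHandler)
    .filter((name) => !registry.has(name));
}
```

Because scenario files can only name a handler, never define one, any inline code in a pack fails validation by construction.
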
## Architecture Change

### Current

Scenario markdown is already the source of truth for:

- suite execution
- workspace bootstrap files
- QA Lab UI scenario catalog
- report metadata
- discovery prompts

Generated compatibility:

- seeded workspace still includes `QA_KICKOFF_TASK.md`
- seeded workspace still includes `QA_SCENARIO_PLAN.md`
- seeded workspace now also includes `QA_SCENARIOS.md`

## Refactor Plan

### Phase 1: loader and schema

Done.

- added `qa/scenarios.md`
- added parser for named markdown YAML pack content
- validated with zod
- switched consumers to the parsed pack
- removed repo-level `qa/seed-scenarios.json` and `qa/QA_KICKOFF_TASK.md`

### Phase 2: generic engine

- split `extensions/qa-lab/src/suite.ts` into:
  - loader
  - engine
  - action registry
  - assertion registry
  - custom handlers
- keep existing helper functions as engine operations

Deliverable:

- engine executes simple declarative scenarios

### Phase 3: migrate simple scenarios

Start with scenarios that are mostly prompt + wait + assert:

- threaded follow-up
- image understanding from attachment
- skill visibility and invocation
- channel baseline

Deliverable:

- first real markdown-defined scenarios shipping through the generic engine

### Phase 4: migrate medium scenarios

- image generation roundtrip
- memory tools in channel context
- session memory ranking
- subagent handoff
- subagent fanout synthesis

Deliverable:

- variables, artifacts, tool assertions, request-log assertions proven out

### Phase 5: keep hard scenarios on custom handlers

- memory dreaming sweep
- config apply restart wake-up
- config restart capability flip
- runtime inventory drift check

Deliverable:

- same authoring format, but with explicit custom-step blocks where needed

### Phase 6: delete hardcoded scenario map

Once the pack coverage is good enough:

- remove most scenario-specific TypeScript branching from `extensions/qa-lab/src/suite.ts`

## Fake Slack / Rich Media Support

The current QA bus is text-first.

Relevant files:

- `extensions/qa-channel/src/protocol.ts`
- `extensions/qa-lab/src/bus-state.ts`
- `extensions/qa-lab/src/bus-queries.ts`
- `extensions/qa-lab/src/bus-server.ts`
- `extensions/qa-lab/web/src/ui-render.ts`

Today the QA bus supports:

- text
- reactions
- threads

It does not yet model inline media attachments.

### Needed transport contract

Add a generic QA bus attachment model:

````ts
type QaBusAttachment = {
  id: string;
  kind: "image" | "video" | "audio" | "file";
  mimeType: string;
  fileName?: string;
  inline?: boolean;
  url?: string;
  contentBase64?: string;
  width?: number;
  height?: number;
  durationMs?: number;
  altText?: string;
  transcript?: string;
};
````

Then add `attachments?: QaBusAttachment[]` to:

- `QaBusMessage`
- `QaBusInboundMessageInput`
- `QaBusOutboundMessageInput`

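To illustrate how the optional `attachments` field might be consumed, here is a small sketch with a trimmed copy of the attachment type and a simplified message shape. Both `AttachmentSketch` and `MessageSketch` are stand-ins for the real protocol types, and the rule that a usable attachment carries either `url` or `contentBase64` is an assumption, not something the contract above states.

```ts
// Trimmed copy of the proposed attachment fields, for illustration only.
type AttachmentSketch = {
  id: string;
  kind: "image" | "video" | "audio" | "file";
  mimeType: string;
  url?: string;
  contentBase64?: string;
};

// Simplified stand-in for QaBusMessage; the real type has more fields.
type MessageSketch = {
  id: string;
  text: string;
  attachments?: AttachmentSketch[];
};

// Assumption: an attachment needs its payload as `url` or `contentBase64`.
// Returns the ids of attachments that carry neither.
function attachmentsMissingPayload(message: MessageSketch): string[] {
  return (message.attachments ?? [])
    .filter((a) => a.url === undefined && a.contentBase64 === undefined)
    .map((a) => a.id);
}

// Text-only messages stay valid because the field is optional.
const textOnly: MessageSketch = { id: "m1", text: "hello" };

const withMedia: MessageSketch = {
  id: "m2",
  text: "generated image attached",
  attachments: [
    { id: "a1", kind: "image", mimeType: "image/png", url: "/files/lighthouse.png" },
    { id: "a2", kind: "file", mimeType: "text/plain" },
  ],
};
```

Keeping the field optional means existing text, reaction, and thread scenarios need no changes when attachments are introduced.
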
### Why generic first

Do not build a Slack-only media model.

Instead:

- one generic QA transport model
- multiple renderers on top of it
  - current QA Lab chat
  - future fake Slack web
  - any other fake transport views

This prevents duplicate logic and lets media scenarios stay transport-agnostic.

### UI work needed

Update the QA UI to render:

- inline image preview
- inline audio player
- inline video player
- file attachment chip

The current UI can already render threads and reactions, so attachment rendering should layer onto the same message card model.

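The four render targets above map naturally onto a switch over the attachment `kind`. The following sketch returns HTML strings for each case; it does not reflect how `ui-render.ts` is actually structured, and `AttachmentView`, `renderAttachment`, and the `attachment-chip` class name are made up for illustration.

```ts
// Minimal view model; assumes the attachment has already been given a URL.
type AttachmentView = {
  kind: "image" | "video" | "audio" | "file";
  url: string;
  fileName?: string;
  altText?: string;
};

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// One branch per kind. Adding a new kind to the union makes this switch
// non-exhaustive, which the compiler reports as a missing return.
function renderAttachment(attachment: AttachmentView): string {
  const url = escapeHtml(attachment.url);
  switch (attachment.kind) {
    case "image":
      return `<img src="${url}" alt="${escapeHtml(attachment.altText ?? "")}">`;
    case "audio":
      return `<audio controls src="${url}"></audio>`;
    case "video":
      return `<video controls src="${url}"></video>`;
    case "file":
      return `<a class="attachment-chip" href="${url}">${escapeHtml(attachment.fileName ?? "attachment")}</a>`;
  }
}
```

Because the renderer only depends on the generic attachment fields, a future fake Slack view could reuse the same switch with different markup per branch.
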
### Scenario work enabled by media transport

Once attachments flow through the QA bus, we can add richer fake-chat scenarios:

- inline image reply in fake Slack
- audio attachment understanding
- video attachment understanding
- mixed attachment ordering
- thread reply with media retained

## Recommendation

The next implementation chunk should be:

1. add markdown scenario loader + zod schema
2. generate the current catalog from markdown
3. migrate a few simple scenarios first
4. add generic QA bus attachment support
5. render inline image in the QA UI
6. then expand to audio and video

This is the smallest path that proves both goals:

- generic markdown-defined QA
- richer fake messaging surfaces

## Open Questions

- whether scenario files should allow embedded markdown prompt templates with variable interpolation
- whether setup/cleanup should be named sections or just ordered action lists
- whether artifact references should be strongly typed in schema or string-based
- whether custom handlers should live in one registry or per-surface registries
- whether the generated JSON compatibility file should remain checked in during migration