openclaw/docs/concepts/personal-agent-benchmark-pack.md at 1bbbb44d2ba8c19d3c8c3ba8bbe02bbd0ce18dfb

mirror of https://github.com/openclaw/openclaw.git synced 2026-05-25 11:43:03 +00:00

Firas Alswihry a9eaf0c993 test(qa-lab): add personal no-fake-progress scenario (#83824)

2026-05-19 01:16:00 +00:00

summary: Local qa-channel scenarios for privacy-preserving personal assistant workflow checks.

read_when:
- Running local personal agent reliability checks
- Extending the repo-backed QA scenario catalog
- Verifying reminder, reply, memory, redaction, safe tool followthrough, task status, share-safe diagnostics, and proof-backed completion claims

title: Personal agent benchmark pack

The Personal Agent Benchmark Pack is a small repo-backed QA scenario pack for local personal assistant workflows. It is not a generic model benchmark and it does not require a new runner. The pack reuses the private QA stack described in QA overview, the synthetic QA channel, and the existing qa/scenarios markdown catalog.

The first pack is intentionally narrow:

fake personal reminders through local cron delivery
fake DM and thread reply routing through qa-channel
fake preference recall from the temporary QA workspace memory files
fake secret no-echo checks
safe read-backed tool followthrough after a short approval-style turn
approval denial stop behavior for a sensitive local read request
proof-backed task status reporting that keeps pending, blocked, and done separate
share-safe diagnostics artifacts that keep useful status while omitting raw personal content
proof-backed completion claims that avoid fake progress before local evidence exists
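
The status and completion items above share one rule: a task is reported as done only when local evidence backs the claim. The following is a minimal sketch of that rule, not the pack's implementation. The `TaskRecord` shape, its field names, and the `classifyTask` helper are all hypothetical.

```typescript
// Hypothetical types and helper for illustration; not taken from qa-lab.
type TaskStatus = "pending" | "blocked" | "done";

interface TaskRecord {
  id: string;
  // Why the task cannot proceed, if anything is blocking it.
  blockedReason?: string;
  // Pointer to local proof of completion, e.g. an artifact path (assumed field).
  evidence?: string;
}

// A task only counts as "done" when non-empty evidence exists.
// Without evidence it is "blocked" if a reason is recorded, else "pending".
function classifyTask(task: TaskRecord): TaskStatus {
  if (task.evidence) return "done";
  if (task.blockedReason) return "blocked";
  return "pending";
}

const report = [
  { id: "reminder", evidence: "artifacts/reminder.log" },
  { id: "sensitive-read", blockedReason: "approval denied" },
  { id: "follow-up" },
].map((task) => `${task.id}: ${classifyTask(task)}`);

console.log(report);
// ["reminder: done", "sensitive-read: blocked", "follow-up: pending"]
```

Keeping the three states as a closed union makes it harder for a report to blur "not started" with "cannot proceed", or to claim completion without proof.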

Scenarios

The machine-readable pack metadata lives in extensions/qa-lab/src/scenario-packs.ts. Run the pack with --pack personal-agent:

OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --pack personal-agent \
  --concurrency 1

--pack is additive with repeated --scenario flags. Explicit scenarios run first, then the pack scenarios run in QA_PERSONAL_AGENT_SCENARIO_IDS order with duplicates removed.
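
The ordering rule above can be sketched as a small merge function. This is an illustration under assumptions, not the qa-lab code: `resolveScenarioOrder` and the placeholder ids are hypothetical, and the sketch assumes that "duplicates removed" keeps the first occurrence of each id.

```typescript
// Hypothetical helper; the real resolution logic in qa-lab may differ.
function resolveScenarioOrder(
  explicitScenarios: readonly string[],
  packScenarioIds: readonly string[],
): string[] {
  const seen = new Set<string>();
  const ordered: string[] = [];
  // Explicit --scenario values come first, then pack ids in declared order.
  for (const id of [...explicitScenarios, ...packScenarioIds]) {
    if (!seen.has(id)) {
      seen.add(id);
      ordered.push(id);
    }
  }
  return ordered;
}

// Placeholder ids for illustration only; not real scenario ids.
const packIds = ["pack-a", "pack-b", "pack-c"];
const resolved = resolveScenarioOrder(["pack-b", "extra-1"], packIds);

console.log(resolved);
// ["pack-b", "extra-1", "pack-a", "pack-c"]
```

Under this assumption, naming a pack scenario explicitly moves it to the front of the run rather than running it twice.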

The pack is designed for qa-channel with mock-openai or another local QA provider lane. It should not be pointed at live chat services or real personal accounts.

Privacy Model

The scenarios use only fake users, fake preferences, fake secrets, and the temporary QA gateway workspace created by the suite. They must not read or write real OpenClaw user memory, sessions, credentials, launch agents, global configs, or live gateway state.

Artifacts stay under the existing QA suite artifact directory and should be treated like test output. Redaction checks use fake markers so failures are safe to inspect and file in issues.
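
A fake-marker redaction check might look roughly like the sketch below. The marker value, the `RedactionResult` shape, and the `checkNoEcho` helper are hypothetical; the pack's actual assertions live in its scenario files.

```typescript
// Hypothetical fake marker: not a real credential and not taken from the repo.
const FAKE_SECRET_MARKER = "FAKE-SECRET-0000";

interface RedactionResult {
  ok: boolean;
  // Indexes of the messages that echoed the marker.
  leakedIn: number[];
}

// Scan outbound messages for the fake marker and report where it leaked.
function checkNoEcho(messages: readonly string[], marker: string): RedactionResult {
  const leakedIn: number[] = [];
  messages.forEach((message, index) => {
    if (message.includes(marker)) leakedIn.push(index);
  });
  return { ok: leakedIn.length === 0, leakedIn };
}

const clean = checkNoEcho(["I saved your note.", "The value is stored."], FAKE_SECRET_MARKER);
const leaky = checkNoEcho(["Done.", "Sure, it is FAKE-SECRET-0000."], FAKE_SECRET_MARKER);

console.log(clean.ok, leaky.ok, leaky.leakedIn);
// true false [1]
```

Because the marker is synthetic, a failing result can be pasted into an issue without exposing anything personal.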

Extending The Pack

Add new cases under qa/scenarios/personal/, then add the scenario id to QA_PERSONAL_AGENT_SCENARIO_IDS. Keep each case small, local, deterministic in mock-openai, and focused on one personal assistant behavior.
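
Because adding a case takes two steps, a file under `qa/scenarios/personal/` can exist without being registered in the pack. A consistency check along these lines could catch that drift. It is only a sketch: `diffRegistration` and the placeholder ids are hypothetical, and it assumes the caller has already collected the scenario ids from the catalog.

```typescript
// Hypothetical helper; compares catalog scenario ids with the pack's id list.
function diffRegistration(
  catalogIds: readonly string[],
  packIds: readonly string[],
): { unregistered: string[]; missingFromCatalog: string[] } {
  const catalog = new Set(catalogIds);
  const pack = new Set(packIds);
  return {
    // Scenarios that exist in the catalog but were never added to the pack.
    unregistered: catalogIds.filter((id) => !pack.has(id)),
    // Pack entries that point at no catalog scenario.
    missingFromCatalog: packIds.filter((id) => !catalog.has(id)),
  };
}

// Placeholder ids for illustration only; not real scenario ids.
const drift = diffRegistration(
  ["case-one", "case-two", "case-three"],
  ["case-one", "case-two", "case-gone"],
);

console.log(drift);
// { unregistered: ["case-three"], missingFromCatalog: ["case-gone"] }
```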

Good follow-up candidates:

redacted trajectory export checks
local-only plugin workflow checks

Avoid adding a new runner, plugin, dependency, live transport, or model judge until the scenario catalog has enough stable cases to justify that surface.
