docs(test): explain cheap docker reruns

2026-05-06 08:40:44 +00:00 · 2026-04-26 23:56:08 +01:00
parent 2fe11020d2
commit 199d5f765f
3 changed files with 55 additions and 18 deletions
--- a/.agents/skills/openclaw-testing/SKILL.md
+++ b/.agents/skills/openclaw-testing/SKILL.md
@@ -101,9 +101,11 @@ docker_lanes: install-e2e bundled-channel-update-acpx
 ```

 That skips the three chunk matrix and runs one targeted Docker job against the
-prepared GHCR images and the prepared OpenClaw npm tarball. Live-only targeted
-reruns skip the E2E images and build only the live-test image. Release-path
-normal mode remains max three Docker chunk jobs:
+prepared GHCR images and a fresh OpenClaw npm tarball for the selected ref.
+Reruns usually need that new tarball because the fix being tested changed the
+package contents even if the SHA-tagged GHCR Docker image can be reused.
+Live-only targeted reruns skip the E2E images and build only the live-test
+image. Release-path normal mode remains max three Docker chunk jobs:

 - `core`
 - `package-update`
@@ -112,17 +114,50 @@ normal mode remains max three Docker chunk jobs:
 Docker E2E images never copy repo sources as the app under test: the bare image
 is a Node/Git runner, and the functional image installs the same prebuilt npm
 tarball that bare lanes mount. `scripts/package-openclaw-for-docker.mjs` is the
-single packer for local scripts and CI. `scripts/test-docker-all.mjs
--plan-json` is the scheduler-owned CI plan for image kind, package, live image,
-lane, and credential needs. Docker lane definitions live in the single scenario
-catalog `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in
+single packer for local scripts and CI and validates the tarball inventory
+before Docker consumes it. `scripts/test-docker-all.mjs --plan-json` is the
+scheduler-owned CI plan for image kind, package, live image, lane, and
+credential needs. Docker lane definitions live in the single scenario catalog
+`scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in
 `scripts/lib/docker-e2e-plan.mjs`. `scripts/docker-e2e.mjs` converts plan and
 summary JSON into GitHub outputs and step summaries. Every scheduler run writes
-`.artifacts/docker-tests/**/summary.json`. Read it
+`.artifacts/docker-tests/**/summary.json` plus `failures.json`. Read those
 before rerunning. Lane entries include `command`, `rerunCommand`, status,
 timing, timeout state, image kind, and log file path. The summary also includes
 top-level phase timings for preflight, image build, package prep, lane pools,
-and cleanup.
+and cleanup. Use `pnpm test:docker:timings <summary.json>` to rank slow lanes
+and phases before deciding whether a broader rerun is justified.
+
+## Cheap Docker Reruns
+
+First derive the smallest rerun command from artifacts:
+
+```bash
+pnpm test:docker:rerun <github-run-id>
+pnpm test:docker:rerun .artifacts/docker-tests/<run>/failures.json
+```
+
+The script downloads Docker E2E artifacts for a GitHub run, reads
+`summary.json`/`failures.json`, and prints a combined targeted workflow command
+plus per-lane commands. Prefer the combined targeted command when several lanes
+failed for the same patch:
+
+```bash
+gh workflow run openclaw-live-and-e2e-checks-reusable.yml \
+  -f ref=<sha> \
+  -f include_repo_e2e=false \
+  -f include_release_path_suites=false \
+  -f include_openwebui=false \
+  -f docker_lanes='install-e2e bundled-channel-update-acpx' \
+  -f include_live_suites=false \
+  -f live_models_only=false
+```
+
+That path still runs the prepare job, so it creates a new tarball for `<sha>`.
+If the SHA-tagged GHCR bare/functional image already exists, CI skips rebuilding
+that image and only uploads the fresh package artifact before the targeted lane
+job. Do not rerun the full three-chunk release path unless the failed lane list
+or touched surface really requires it.

 ## Docker Expected Timings

@@ -158,12 +193,14 @@ lane log/artifacts first, not “run the whole thing again.”
 ## Failure Workflow

 1. Identify exact failing job, SHA, lane, and artifact path.
-2. Read `summary.json` and the failed lane log tail.
-3. If the lane has `rerunCommand`, use that command as the starting point.
-4. For Docker release failures, dispatch `docker_lanes=<failed-lane>` on GitHub
-   before considering local Docker.
-5. Patch narrowly, then rerun the failed file/lane only.
-6. Broaden to `pnpm check:changed` or CI only after the isolated proof passes.
+2. Read `failures.json`, `summary.json`, and the failed lane log tail.
+3. Use `pnpm test:docker:rerun <run-id|failures.json>` to generate targeted
+   GitHub rerun commands.
+4. If the lane has `rerunCommand`, use that only as a local starting point.
+5. For Docker release failures, dispatch targeted `docker_lanes=<failed-lane>`
+   on GitHub before considering local Docker.
+6. Patch narrowly, then rerun the failed file/lane only.
+7. Broaden to `pnpm check:changed` or CI only after the isolated proof passes.

 ## When To Escalate

@@ -171,6 +208,6 @@ lane log/artifacts first, not “run the whole thing again.”
  validation.
 - Build output, lazy imports, package boundaries, or published surfaces:
  include `pnpm build`.
-- Workflow edits: run `actionlint` or equivalent workflow sanity.
+- Workflow edits: run `pnpm check:workflows`.
 - Release branch or tag validation: use release docs and GitHub workflows; avoid
  local Docker unless Peter explicitly asks.
--- a/docs/ci.md
+++ b/docs/ci.md
--- a/docs/reference/test.md
+++ b/docs/reference/test.md
@@ -33,7 +33,7 @@ title: "Tests"
 - Gateway integration: opt-in via `OPENCLAW_TEST_INCLUDE_GATEWAY=1 pnpm test` or `pnpm test:gateway`.
 - `pnpm test:e2e`: Runs gateway end-to-end smoke tests (multi-instance WS/HTTP/node pairing). Defaults to `threads` + `isolate: false` with adaptive workers in `vitest.e2e.config.ts`; tune with `OPENCLAW_E2E_WORKERS=<n>` and set `OPENCLAW_E2E_VERBOSE=1` for verbose logs.
 - `pnpm test:live`: Runs provider live tests (minimax/zai). Requires API keys and `LIVE=1` (or provider-specific `*_LIVE_TEST=1`) to unskip.
-- `pnpm test:docker:all`: Builds the shared live-test image, packs OpenClaw once as an npm tarball, builds/reuses a bare Node/Git runner image plus a functional image that installs that tarball into `/app`, then runs Docker smoke lanes with `OPENCLAW_SKIP_DOCKER_BUILD=1` through a weighted scheduler. The bare image (`OPENCLAW_DOCKER_E2E_BARE_IMAGE`) is used for installer/update/plugin-dependency lanes; those lanes mount the prebuilt tarball instead of using copied repo sources. The functional image (`OPENCLAW_DOCKER_E2E_FUNCTIONAL_IMAGE`) is used for normal built-app functionality lanes. `scripts/package-openclaw-for-docker.mjs` is the single local/CI package packer. Docker lane definitions live in `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in `scripts/lib/docker-e2e-plan.mjs`; `scripts/test-docker-all.mjs` executes the selected plan. `node scripts/test-docker-all.mjs --plan-json` emits the scheduler-owned CI plan for selected lanes, image kinds, package/live-image needs, and credential checks without building or running Docker. `OPENCLAW_DOCKER_ALL_PARALLELISM=<n>` controls process slots and defaults to 10; `OPENCLAW_DOCKER_ALL_TAIL_PARALLELISM=<n>` controls the provider-sensitive tail pool and defaults to 10. Heavy lane caps default to `OPENCLAW_DOCKER_ALL_LIVE_LIMIT=9`, `OPENCLAW_DOCKER_ALL_NPM_LIMIT=10`, and `OPENCLAW_DOCKER_ALL_SERVICE_LIMIT=7`; provider caps default to one heavy lane per provider via `OPENCLAW_DOCKER_ALL_LIVE_CLAUDE_LIMIT=4`, `OPENCLAW_DOCKER_ALL_LIVE_CODEX_LIMIT=4`, and `OPENCLAW_DOCKER_ALL_LIVE_GEMINI_LIMIT=4`. Use `OPENCLAW_DOCKER_ALL_WEIGHT_LIMIT` or `OPENCLAW_DOCKER_ALL_DOCKER_LIMIT` for larger hosts. Lane starts are staggered by 2 seconds by default to avoid local Docker daemon create storms; override with `OPENCLAW_DOCKER_ALL_START_STAGGER_MS=<ms>`. The runner preflights Docker by default, cleans stale OpenClaw E2E containers, emits active-lane status every 30 seconds, shares provider CLI tool caches between compatible lanes, retries transient live-provider failures once by default (`OPENCLAW_DOCKER_ALL_LIVE_RETRIES=<n>`), and stores lane timings in `.artifacts/docker-tests/lane-timings.json` for longest-first ordering on later runs. Use `OPENCLAW_DOCKER_ALL_DRY_RUN=1` to print the lane manifest without running Docker, `OPENCLAW_DOCKER_ALL_STATUS_INTERVAL_MS=<ms>` to tune status output, or `OPENCLAW_DOCKER_ALL_TIMINGS=0` to disable timing reuse. Use `OPENCLAW_DOCKER_ALL_LIVE_MODE=skip` for deterministic/local lanes only or `OPENCLAW_DOCKER_ALL_LIVE_MODE=only` for live-provider lanes only; package aliases are `pnpm test:docker:local:all` and `pnpm test:docker:live:all`. Live-only mode merges main and tail live lanes into one longest-first pool so provider buckets can pack Claude, Codex, and Gemini work together. The runner stops scheduling new pooled lanes after the first failure unless `OPENCLAW_DOCKER_ALL_FAIL_FAST=0` is set, and each lane has a 120-minute fallback timeout overrideable with `OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS`; selected live/tail lanes use tighter per-lane caps. CLI backend Docker setup commands have their own timeout via `OPENCLAW_LIVE_CLI_BACKEND_SETUP_TIMEOUT_SECONDS` (default 180). Per-lane logs and `summary.json` phase timings are written under `.artifacts/docker-tests/<run-id>/`.
+- `pnpm test:docker:all`: Builds the shared live-test image, packs OpenClaw once as an npm tarball, builds/reuses a bare Node/Git runner image plus a functional image that installs that tarball into `/app`, then runs Docker smoke lanes with `OPENCLAW_SKIP_DOCKER_BUILD=1` through a weighted scheduler. The bare image (`OPENCLAW_DOCKER_E2E_BARE_IMAGE`) is used for installer/update/plugin-dependency lanes; those lanes mount the prebuilt tarball instead of using copied repo sources. The functional image (`OPENCLAW_DOCKER_E2E_FUNCTIONAL_IMAGE`) is used for normal built-app functionality lanes. `scripts/package-openclaw-for-docker.mjs` is the single local/CI package packer and validates the tarball plus `dist/postinstall-inventory.json` before Docker consumes it. Docker lane definitions live in `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in `scripts/lib/docker-e2e-plan.mjs`; `scripts/test-docker-all.mjs` executes the selected plan. `node scripts/test-docker-all.mjs --plan-json` emits the scheduler-owned CI plan for selected lanes, image kinds, package/live-image needs, and credential checks without building or running Docker. `OPENCLAW_DOCKER_ALL_PARALLELISM=<n>` controls process slots and defaults to 10; `OPENCLAW_DOCKER_ALL_TAIL_PARALLELISM=<n>` controls the provider-sensitive tail pool and defaults to 10. Heavy lane caps default to `OPENCLAW_DOCKER_ALL_LIVE_LIMIT=9`, `OPENCLAW_DOCKER_ALL_NPM_LIMIT=10`, and `OPENCLAW_DOCKER_ALL_SERVICE_LIMIT=7`; provider caps default to one heavy lane per provider via `OPENCLAW_DOCKER_ALL_LIVE_CLAUDE_LIMIT=4`, `OPENCLAW_DOCKER_ALL_LIVE_CODEX_LIMIT=4`, and `OPENCLAW_DOCKER_ALL_LIVE_GEMINI_LIMIT=4`. Use `OPENCLAW_DOCKER_ALL_WEIGHT_LIMIT` or `OPENCLAW_DOCKER_ALL_DOCKER_LIMIT` for larger hosts. Lane starts are staggered by 2 seconds by default to avoid local Docker daemon create storms; override with `OPENCLAW_DOCKER_ALL_START_STAGGER_MS=<ms>`. The runner preflights Docker by default, cleans stale OpenClaw E2E containers, emits active-lane status every 30 seconds, shares provider CLI tool caches between compatible lanes, retries transient live-provider failures once by default (`OPENCLAW_DOCKER_ALL_LIVE_RETRIES=<n>`), and stores lane timings in `.artifacts/docker-tests/lane-timings.json` for longest-first ordering on later runs. Use `OPENCLAW_DOCKER_ALL_DRY_RUN=1` to print the lane manifest without running Docker, `OPENCLAW_DOCKER_ALL_STATUS_INTERVAL_MS=<ms>` to tune status output, or `OPENCLAW_DOCKER_ALL_TIMINGS=0` to disable timing reuse. Use `OPENCLAW_DOCKER_ALL_LIVE_MODE=skip` for deterministic/local lanes only or `OPENCLAW_DOCKER_ALL_LIVE_MODE=only` for live-provider lanes only; package aliases are `pnpm test:docker:local:all` and `pnpm test:docker:live:all`. Live-only mode merges main and tail live lanes into one longest-first pool so provider buckets can pack Claude, Codex, and Gemini work together. The runner stops scheduling new pooled lanes after the first failure unless `OPENCLAW_DOCKER_ALL_FAIL_FAST=0` is set, and each lane has a 120-minute fallback timeout overrideable with `OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS`; selected live/tail lanes use tighter per-lane caps. CLI backend Docker setup commands have their own timeout via `OPENCLAW_LIVE_CLI_BACKEND_SETUP_TIMEOUT_SECONDS` (default 180). Per-lane logs, `summary.json`, `failures.json`, and phase timings are written under `.artifacts/docker-tests/<run-id>/`; use `pnpm test:docker:timings <summary.json>` to inspect slow lanes and `pnpm test:docker:rerun <run-id|summary.json|failures.json>` to print cheap targeted rerun commands.
 - `pnpm test:docker:browser-cdp-snapshot`: Builds a Chromium-backed source E2E container, starts raw CDP plus an isolated Gateway, runs `browser doctor --deep`, and verifies CDP role snapshots include link URLs, cursor-promoted clickables, iframe refs, and frame metadata.
 - CLI backend live Docker probes can be run as focused lanes, for example `pnpm test:docker:live-cli-backend:codex`, `pnpm test:docker:live-cli-backend:codex:resume`, or `pnpm test:docker:live-cli-backend:codex:mcp`. Claude and Gemini have matching `:resume` and `:mcp` aliases.
 - `pnpm test:docker:openwebui`: Starts Dockerized OpenClaw + Open WebUI, signs in through Open WebUI, checks `/api/models`, then runs a real proxied chat through `/api/chat/completions`. Requires a usable live model key (for example OpenAI in `~/.profile`), pulls an external Open WebUI image, and is not expected to be CI-stable like the normal unit/e2e suites.
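
Reviewer sketch, not part of the patch: the combined targeted dispatch described under "Cheap Docker Reruns" in `SKILL.md` is the same `gh workflow run` invocation with only `ref` and `docker_lanes` varying. The helper below is a hypothetical illustration of assembling that command from a SHA and a list of failed lanes. The function name `targeted_rerun_cmd` is an assumption and does not exist in the repo; the workflow file name and `-f` inputs are copied from the example in the diff. It only prints the command and does not try to reproduce what `pnpm test:docker:rerun` does, since the `failures.json` schema is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print, but do not run, the
# targeted dispatch command for one ref and one or more failed lanes.
targeted_rerun_cmd() {
  if [ "$#" -lt 2 ]; then
    echo "usage: targeted_rerun_cmd <sha> <lane> [<lane>...]" >&2
    return 1
  fi
  local ref="$1"
  shift
  # "$*" joins the remaining arguments with spaces (default IFS), which
  # matches the space-separated docker_lanes value shown in the docs.
  local lanes="$*"
  printf "gh workflow run openclaw-live-and-e2e-checks-reusable.yml -f ref=%s -f include_repo_e2e=false -f include_release_path_suites=false -f include_openwebui=false -f docker_lanes='%s' -f include_live_suites=false -f live_models_only=false\n" "$ref" "$lanes"
}

# Example: two lanes that failed for the same patch.
targeted_rerun_cmd "<sha>" install-e2e bundled-channel-update-acpx
```

Printing the command rather than executing it keeps the sketch side-effect free; in practice the output of `pnpm test:docker:rerun` should be preferred, because it reads the real artifacts.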