ci: run full docker checks daily

Peter Steinberger
2026-04-23 14:32:30 +01:00
parent be81fa4424
commit 1bbca1d910
4 changed files with 65 additions and 14 deletions

@@ -27,7 +27,7 @@ jobs:
with:
ref: ${{ github.sha }}
include_repo_e2e: true
-include_release_path_suites: false
+include_release_path_suites: true
include_openwebui: true
include_live_suites: true
secrets:

@@ -56,7 +56,7 @@ Jobs are ordered so cheap checks fail before expensive ones run:
Scope logic lives in `scripts/ci-changed-scope.mjs` and is covered by unit tests in `src/scripts/ci-changed-scope.test.ts`.
CI workflow edits validate the Node CI graph plus workflow linting, but do not force Windows, Android, or macOS native builds by themselves; those platform lanes stay scoped to platform source changes.
Windows Node checks are scoped to Windows-specific process/path wrappers, npm/pnpm/UI runner helpers, package manager config, and the CI workflow surfaces that execute that lane; unrelated source, plugin, install-smoke, and test-only changes stay on the Linux Node lanes so they do not reserve a 16-vCPU Windows worker for coverage that is already exercised by the normal test shards.
-The separate `install-smoke` workflow reuses the same scope script through its own `preflight` job. It computes `run_install_smoke` from the narrower changed-smoke signal, so Docker/install smoke runs for install, packaging, container-relevant changes, bundled extension production changes, and the core plugin/channel/gateway/Plugin SDK surfaces that the Docker smoke jobs exercise. Test-only and docs-only edits do not reserve Docker workers. Its QR package smoke forces the Docker `pnpm install` layer to rerun while preserving the BuildKit pnpm store cache, so it still exercises installation without redownloading dependencies on every run. Its gateway-network e2e reuses the runtime image built earlier in the job, so it adds real container-to-container WebSocket coverage without adding another Docker build. Local `test:docker:all` prebuilds one shared live-test image and one shared `scripts/e2e/Dockerfile` built-app image, then runs the live/E2E smoke lanes in parallel with `OPENCLAW_SKIP_DOCKER_BUILD=1`; tune the default concurrency of 4 with `OPENCLAW_DOCKER_ALL_PARALLELISM`. Startup- or provider-sensitive lanes run exclusively after the parallel pool. The reusable live/E2E workflow mirrors the shared-image pattern by building and pushing one SHA-tagged GHCR Docker E2E image before the Docker matrix, then running the matrix with `OPENCLAW_SKIP_DOCKER_BUILD=1`. QR and installer Docker tests keep their own install-focused Dockerfiles. A separate `docker-e2e-fast` job runs the bounded bundled-plugin Docker profile under a 120-second command timeout: setup-entry dependency repair plus synthetic bundled-loader failure isolation. The full bundled update/channel matrix remains manual/full-suite because it performs repeated real npm update and doctor repair passes.
+The separate `install-smoke` workflow reuses the same scope script through its own `preflight` job. It computes `run_install_smoke` from the narrower changed-smoke signal, so Docker/install smoke runs for install, packaging, container-relevant changes, bundled extension production changes, and the core plugin/channel/gateway/Plugin SDK surfaces that the Docker smoke jobs exercise. Test-only and docs-only edits do not reserve Docker workers. Its QR package smoke forces the Docker `pnpm install` layer to rerun while preserving the BuildKit pnpm store cache, so it still exercises installation without redownloading dependencies on every run. Its gateway-network e2e reuses the runtime image built earlier in the job, so it adds real container-to-container WebSocket coverage without adding another Docker build. Local `test:docker:all` prebuilds one shared live-test image and one shared `scripts/e2e/Dockerfile` built-app image, then runs the live/E2E smoke lanes in parallel with `OPENCLAW_SKIP_DOCKER_BUILD=1`; tune the default concurrency of 4 with `OPENCLAW_DOCKER_ALL_PARALLELISM`. The local aggregate stops scheduling new pooled lanes after the first failure by default, and each lane has a 120-minute timeout overrideable with `OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS`. Startup- or provider-sensitive lanes run exclusively after the parallel pool. The reusable live/E2E workflow mirrors the shared-image pattern by building and pushing one SHA-tagged GHCR Docker E2E image before the Docker matrix, then running the matrix with `OPENCLAW_SKIP_DOCKER_BUILD=1`. The scheduled live/E2E workflow runs the full release-path Docker suite daily. QR and installer Docker tests keep their own install-focused Dockerfiles. A separate `docker-e2e-fast` job runs the bounded bundled-plugin Docker profile under a 120-second command timeout: setup-entry dependency repair plus synthetic bundled-loader failure isolation. The full bundled update/channel matrix remains manual/full-suite because it performs repeated real npm update and doctor repair passes.
Local changed-lane logic lives in `scripts/changed-lanes.mjs` and is executed by `scripts/check-changed.mjs`. That local gate is stricter about architecture boundaries than the broad CI platform scope: core production changes run core prod typecheck plus core tests, core test-only changes run only core test typecheck/tests, extension production changes run extension prod typecheck plus extension tests, and extension test-only changes run only extension test typecheck/tests. Public Plugin SDK or plugin-contract changes expand to extension validation because extensions depend on those core contracts. Release metadata-only version bumps run targeted version/config/root-dependency checks. Unknown root/config changes fail safe to all lanes.

@@ -32,7 +32,7 @@ title: "Tests"
- Gateway integration: opt-in via `OPENCLAW_TEST_INCLUDE_GATEWAY=1 pnpm test` or `pnpm test:gateway`.
- `pnpm test:e2e`: Runs gateway end-to-end smoke tests (multi-instance WS/HTTP/node pairing). Defaults to `threads` + `isolate: false` with adaptive workers in `vitest.e2e.config.ts`; tune with `OPENCLAW_E2E_WORKERS=<n>` and set `OPENCLAW_E2E_VERBOSE=1` for verbose logs.
- `pnpm test:live`: Runs provider live tests (minimax/zai). Requires API keys and `LIVE=1` (or provider-specific `*_LIVE_TEST=1`) to unskip.
-- `pnpm test:docker:all`: Builds the shared live-test image and Docker E2E image once, then runs the Docker smoke lanes with `OPENCLAW_SKIP_DOCKER_BUILD=1` at concurrency 4 by default. Tune with `OPENCLAW_DOCKER_ALL_PARALLELISM=<n>`. Startup- or provider-sensitive lanes run exclusively after the parallel pool. Per-lane logs are written under `.artifacts/docker-tests/<run-id>/`.
+- `pnpm test:docker:all`: Builds the shared live-test image and Docker E2E image once, then runs the Docker smoke lanes with `OPENCLAW_SKIP_DOCKER_BUILD=1` at concurrency 4 by default. Tune with `OPENCLAW_DOCKER_ALL_PARALLELISM=<n>`. The runner stops scheduling new pooled lanes after the first failure unless `OPENCLAW_DOCKER_ALL_FAIL_FAST=0` is set, and each lane has a 120-minute timeout overrideable with `OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS`. Startup- or provider-sensitive lanes run exclusively after the parallel pool. Per-lane logs are written under `.artifacts/docker-tests/<run-id>/`.
- `pnpm test:docker:openwebui`: Starts Dockerized OpenClaw + Open WebUI, signs in through Open WebUI, checks `/api/models`, then runs a real proxied chat through `/api/chat/completions`. Requires a usable live model key (for example OpenAI in `~/.profile`), pulls an external Open WebUI image, and is not expected to be CI-stable like the normal unit/e2e suites.
- `pnpm test:docker:mcp-channels`: Starts a seeded Gateway container and a second client container that spawns `openclaw mcp serve`, then verifies routed conversation discovery, transcript reads, attachment metadata, live event queue behavior, outbound send routing, and Claude-style channel + permission notifications over the real stdio bridge. The Claude notification assertion reads the raw stdio MCP frames directly so the smoke reflects what the bridge actually emits.

@@ -8,6 +8,7 @@ const ROOT_DIR = path.resolve(path.dirname(fileURLToPath(import.meta.url)), ".."
const DEFAULT_E2E_IMAGE = "openclaw-docker-e2e:local";
const DEFAULT_PARALLELISM = 4;
const DEFAULT_FAILURE_TAIL_LINES = 80;
+const DEFAULT_LANE_TIMEOUT_MS = 120 * 60 * 1000;
const lanes = [
["live-models", "OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:live-models"],
@@ -62,6 +63,13 @@ function parsePositiveInt(raw, fallback, label) {
return parsed;
}
+function parseBool(raw, fallback) {
+if (raw === undefined || raw === "") {
+return fallback;
+}
+return !/^(?:0|false|no)$/i.test(raw);
+}
function utcStampForPath() {
return new Date().toISOString().replaceAll("-", "").replaceAll(":", "").replace(/\..*$/, "Z");
}
@@ -86,7 +94,7 @@ function commandEnv(extra = {}) {
};
}
-function runShellCommand({ command, env, label, logFile }) {
+function runShellCommand({ command, env, label, logFile, timeoutMs }) {
return new Promise((resolve) => {
const child = spawn("bash", ["-lc", command], {
cwd: ROOT_DIR,
@@ -94,6 +102,21 @@ function runShellCommand({ command, env, label, logFile }) {
stdio: logFile ? ["ignore", "pipe", "pipe"] : "inherit",
});
activeChildren.add(child);
+let timedOut = false;
+let killTimer;
+const timeoutTimer =
+timeoutMs > 0
+? setTimeout(() => {
+timedOut = true;
+if (stream) {
+stream.write(`\n==> [${label}] timeout after ${timeoutMs}ms; sending SIGTERM\n`);
+}
+child.kill("SIGTERM");
+killTimer = setTimeout(() => child.kill("SIGKILL"), 10_000);
+killTimer.unref?.();
+}, timeoutMs)
+: undefined;
+timeoutTimer?.unref?.();
let stream;
if (logFile) {
@@ -105,13 +128,19 @@ function runShellCommand({ command, env, label, logFile }) {
}
child.on("close", (status, signal) => {
+if (timeoutTimer) {
+clearTimeout(timeoutTimer);
+}
+if (killTimer) {
+clearTimeout(killTimer);
+}
activeChildren.delete(child);
const exitCode = typeof status === "number" ? status : signal ? 128 : 1;
if (stream) {
stream.write(`\n==> [${label}] finished: ${utcStamp()} status=${exitCode}\n`);
stream.end();
}
-resolve({ status: exitCode, signal });
+resolve({ signal, status: exitCode, timedOut });
});
});
}
@@ -137,7 +166,7 @@ function laneEnv(name, baseEnv, logDir) {
return env;
}
-async function runLane(lane, baseEnv, logDir) {
+async function runLane(lane, baseEnv, logDir, timeoutMs) {
const [name, command] = lane;
const logFile = path.join(logDir, `${name}.log`);
const env = laneEnv(name, baseEnv, logDir);
@@ -153,31 +182,41 @@ async function runLane(lane, baseEnv, logDir) {
);
console.log(`==> [${name}] start`);
const startedAt = Date.now();
-const result = await runShellCommand({ command, env, label: name, logFile });
+const result = await runShellCommand({ command, env, label: name, logFile, timeoutMs });
const elapsedSeconds = Math.round((Date.now() - startedAt) / 1000);
if (result.status === 0) {
console.log(`==> [${name}] pass ${elapsedSeconds}s`);
} else {
-console.error(`==> [${name}] fail status=${result.status} ${elapsedSeconds}s log=${logFile}`);
+const timeoutLabel = result.timedOut ? " timeout" : "";
+console.error(
+`==> [${name}] fail${timeoutLabel} status=${result.status} ${elapsedSeconds}s log=${logFile}`,
+);
}
return {
command,
logFile,
name,
status: result.status,
+timedOut: result.timedOut,
};
}
-async function runLanePool(poolLanes, baseEnv, logDir, parallelism) {
+async function runLanePool(poolLanes, baseEnv, logDir, parallelism, options) {
const failures = [];
let nextIndex = 0;
async function worker() {
while (nextIndex < poolLanes.length) {
+if (options.failFast && failures.length > 0) {
+return;
+}
const lane = poolLanes[nextIndex++];
-const result = await runLane(lane, baseEnv, logDir);
+const result = await runLane(lane, baseEnv, logDir, options.timeoutMs);
if (result.status !== 0) {
failures.push(result);
+if (options.failFast) {
+return;
+}
}
}
}
@@ -187,12 +226,15 @@ async function runLanePool(poolLanes, baseEnv, logDir, parallelism) {
return failures;
}
-async function runLanesSerial(serialLanes, baseEnv, logDir) {
+async function runLanesSerial(serialLanes, baseEnv, logDir, options) {
const failures = [];
for (const lane of serialLanes) {
-const result = await runLane(lane, baseEnv, logDir);
+const result = await runLane(lane, baseEnv, logDir, options.timeoutMs);
if (result.status !== 0) {
failures.push(result);
+if (options.failFast) {
+break;
+}
}
}
return failures;
@@ -242,6 +284,12 @@ async function main() {
DEFAULT_FAILURE_TAIL_LINES,
"OPENCLAW_DOCKER_ALL_FAILURE_TAIL_LINES",
);
+const laneTimeoutMs = parsePositiveInt(
+process.env.OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS,
+DEFAULT_LANE_TIMEOUT_MS,
+"OPENCLAW_DOCKER_ALL_LANE_TIMEOUT_MS",
+);
+const failFast = parseBool(process.env.OPENCLAW_DOCKER_ALL_FAIL_FAST, true);
const runId = process.env.OPENCLAW_DOCKER_ALL_RUN_ID || utcStampForPath();
const logDir = path.resolve(
process.env.OPENCLAW_DOCKER_ALL_LOG_DIR ||
@@ -258,6 +306,8 @@ async function main() {
console.log(`==> Docker test logs: ${logDir}`);
console.log(`==> Parallelism: ${parallelism}`);
+console.log(`==> Lane timeout: ${laneTimeoutMs}ms`);
+console.log(`==> Fail fast: ${failFast ? "yes" : "no"}`);
console.log(`==> Live-test bundled plugin deps: ${baseEnv.OPENCLAW_DOCKER_BUILD_EXTENSIONS}`);
await runForeground("Build shared live-test image once", "pnpm test:docker:live-build", baseEnv);
@@ -267,14 +317,15 @@ async function main() {
baseEnv,
);
-const failures = await runLanePool(lanes, baseEnv, logDir, parallelism);
+const options = { failFast, timeoutMs: laneTimeoutMs };
+const failures = await runLanePool(lanes, baseEnv, logDir, parallelism, options);
if (failures.length > 0) {
await printFailureSummary(failures, tailLines);
process.exit(1);
}
console.log("==> Running provider-sensitive Docker lanes exclusively");
-failures.push(...(await runLanesSerial(exclusiveLanes, baseEnv, logDir)));
+failures.push(...(await runLanesSerial(exclusiveLanes, baseEnv, logDir, options)));
if (failures.length > 0) {
await printFailureSummary(failures, tailLines);
process.exit(1);