ci: rebalance test workers

This commit is contained in:
Peter Steinberger
2026-04-22 22:26:02 +01:00
parent 65ae1e54de
commit 77dbc1cda6
20 changed files with 460 additions and 111 deletions

View File

@@ -53,7 +53,7 @@ Local changed-lane logic lives in `scripts/changed-lanes.mjs` and is executed by
On pushes, the `checks` matrix adds the push-only `compat-node22` lane. On pull requests, that lane is skipped and the matrix stays focused on the normal test/channel lanes.
The slowest Node test families are split or balanced so each job stays small: channel contracts split registry and core coverage into eight weighted shards each, bundled plugin tests balance across six extension workers, auto-reply runs as three balanced workers instead of six tiny workers, and agentic gateway/plugin configs are spread across the existing source-only agentic Node jobs instead of waiting on built artifacts. The broad agents lane uses the shared Vitest file-parallel scheduler because it is import/scheduling dominated rather than owned by a single slow test file. `check-additional` keeps package-boundary compile/canary work together and separates it from runtime topology gateway/architecture work; the boundary guard shard runs its small independent guards concurrently inside one job, and the gateway watch regression uses the minimal `gatewayWatch` build profile instead of rebuilding the full CI artifact sidecar set.
The slowest Node test families are split or balanced so each job stays small: channel contracts split registry and core coverage into six weighted shards total, bundled plugin tests balance across six extension workers, auto-reply runs as three balanced workers instead of six tiny workers, and agentic gateway/plugin configs are spread across the existing source-only agentic Node jobs instead of waiting on built artifacts. Broad browser, QA, media, and miscellaneous plugin tests use their dedicated Vitest configs instead of the shared plugin catch-all. The broad agents lane uses the shared Vitest file-parallel scheduler because it is import/scheduling dominated rather than owned by a single slow test file. `runtime-config` runs with the infra core-runtime shard to keep the shared runtime shard from owning the tail. `check-additional` keeps package-boundary compile/canary work together and separates it from runtime topology gateway/architecture work; the boundary guard shard runs its small independent guards concurrently inside one job, and the gateway watch regression uses the minimal `gatewayWatch` build profile instead of rebuilding the full CI artifact sidecar set.
GitHub may mark superseded jobs as `cancelled` when a newer push lands on the same PR or `main` ref. Treat that as CI noise unless the newest run for the same ref is also failing. Aggregate shard checks use `!cancelled() && always()` so they still report normal shard failures but do not queue after the whole workflow has already been superseded.
The CI concurrency key is versioned (`CI-v6-*`) so a GitHub-side zombie in an old queue group cannot indefinitely block newer main runs.
@@ -85,4 +85,5 @@ pnpm test:channels
pnpm test:contracts:channels
pnpm check:docs # docs format + lint + broken links
pnpm build # build dist when CI artifact/build-smoke lanes matter
node scripts/ci-run-timings.mjs <run-id> # summarize wall time, queue time, and slowest jobs
```