diff --git a/.agents/skills/openclaw-test-heap-leaks/SKILL.md b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
new file mode 100644
index 00000000000..a2ab2878430
--- /dev/null
+++ b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
@@ -0,0 +1,71 @@
+---
+name: openclaw-test-heap-leaks
+description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, distinguish transformed-module retention from real data leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
+---
+
+# OpenClaw Test Heap Leaks
+
+Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available.
+
+## Workflow
+
+1. Reproduce the failing shape first.
+   - Match the real entrypoint if possible. For Linux CI-style unit failures, start with:
+     - `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
+   - Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
+   - If the report is about a specific shard or worker budget, preserve that shape.
+
+2. Wait for repeated snapshots before concluding anything.
+   - Take at least two intervals from the same lane.
+   - Compare snapshots from the same PID inside one lane directory such as `.tmp/heapsnap/unit-fast/`.
+   - Use `scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
+
+3. Classify the growth before choosing a fix.
+   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as retained module graph growth in long-lived workers.
+   - If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
+
+4. Fix the right layer.
+   - For retained transformed-module growth in shared workers:
+     - Move hotspot files out of `unit-fast` by updating `test/fixtures/test-parallel.behavior.json`.
+     - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
+     - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
+   - For real leaks:
+     - Patch the implicated test or runtime cleanup path.
+     - Look for missing `afterEach`/`afterAll`, module-reset gaps, retained global state, unreleased DB handles, or listeners/timers that survive the file.
+
+5. Verify with the most direct proof.
+   - Re-run the targeted lane or file with heap snapshots enabled if the suite still finishes in reasonable time.
+   - If snapshot overhead pushes tests over Vitest timeouts, fall back to the same lane without snapshots and confirm the RSS trend or OOM is reduced.
+   - For wrapper-only changes, at minimum verify the expected lanes start and the snapshot files are written.
+
+## Heuristics
+
+- Do not call everything a leak. In this repo, large `unit-fast` growth can be a worker-lifetime problem rather than an application object leak.
+- `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
+- The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
+- When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
+- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition.
+
+## Snapshot Comparison
+
+- Direct comparison:
+  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot`
+- Auto-select earliest/latest snapshots per PID within one lane:
+  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast`
+- Useful flags:
+  - `--top 40`
+  - `--min-kb 32`
+  - `--pid 16133`
+
+Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak.
+
+## Output Expectations
+
+When using this skill, report:
+
+- The exact reproduce command.
+- Which lane and PID were compared.
+- The dominant retained object families from the snapshot delta.
+- Whether the issue is a real leak or shared-worker retained module growth.
+- The concrete fix or impact-reduction patch.
+- What you verified, and what snapshot overhead prevented you from verifying.
diff --git a/.agents/skills/openclaw-test-heap-leaks/agents/openai.yaml b/.agents/skills/openclaw-test-heap-leaks/agents/openai.yaml
new file mode 100644
index 00000000000..b5157911b77
--- /dev/null
+++ b/.agents/skills/openclaw-test-heap-leaks/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Test Heap Leaks"
+  short_description: "Investigate test OOMs with heap snapshots"
+  default_prompt: "Use $openclaw-test-heap-leaks to investigate test memory growth with heap snapshots and reduce its impact."
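Step 3 of the SKILL.md workflow (classify the growth before choosing a fix) can be sketched in code. The helper below is hypothetical and not part of this patch: `classifyGrowth`, `MODULE_GRAPH_HINTS`, and the row shape (`type`, `name`, `sizeDelta`) are assumptions for illustration, and the hint strings are guesses at how V8 labels these node families. Adjust them to whatever the delta output actually prints.

```javascript
// Hypothetical sketch, not part of the patch. The hint strings are assumptions
// about V8 node names and should be tuned against real delta output.
const MODULE_GRAPH_HINTS = [
  "Module",
  "system / Context",
  "Bytecode",
  "DescriptorArray",
  "system / Map",
];

// Sums positive deltas into two buckets and reports which one dominates.
function classifyGrowth(rows) {
  let moduleBytes = 0;
  let runtimeBytes = 0;
  for (const row of rows) {
    if (row.sizeDelta <= 0) {
      continue; // only growth matters for this classification
    }
    const looksLikeModuleGraph =
      row.type === "code" || MODULE_GRAPH_HINTS.some((hint) => row.name.includes(hint));
    if (looksLikeModuleGraph) {
      moduleBytes += row.sizeDelta;
    } else {
      runtimeBytes += row.sizeDelta;
    }
  }
  const verdict = moduleBytes > runtimeBytes ? "retained-module-graph" : "likely-runtime-leak";
  return { moduleBytes, runtimeBytes, verdict };
}

const sample = classifyGrowth([
  { type: "object", name: "Module", sizeDelta: 5000 },
  { type: "hidden", name: "system / Context", sizeDelta: 3000 },
  { type: "object", name: "Database", sizeDelta: 2000 },
]);
console.log(sample.verdict);
```

Substring matching like this cannot recognise transformed source strings, which appear as long string nodes, so a verdict from such a helper is a starting point for reading the delta and not a replacement for it.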
diff --git a/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
new file mode 100644
index 00000000000..ccb705c4c82
--- /dev/null
+++ b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
@@ -0,0 +1,265 @@
+#!/usr/bin/env node
+
+import fs from "node:fs";
+import path from "node:path";
+
+function printUsage() {
+  console.error(
+    "Usage: node heapsnapshot-delta.mjs <before.heapsnapshot> <after.heapsnapshot> [--top N] [--min-kb N]",
+  );
+  console.error(
+    " or: node heapsnapshot-delta.mjs --lane-dir <dir> [--pid PID] [--top N] [--min-kb N]",
+  );
+}
+
+function fail(message) {
+  console.error(message);
+  process.exit(1);
+}
+
+function parseArgs(argv) {
+  const options = {
+    top: 30,
+    minKb: 64,
+    laneDir: null,
+    pid: null,
+    files: [],
+  };
+
+  for (let index = 0; index < argv.length; index += 1) {
+    const arg = argv[index];
+    if (arg === "--top") {
+      options.top = Number.parseInt(argv[index + 1] ?? "", 10);
+      index += 1;
+      continue;
+    }
+    if (arg === "--min-kb") {
+      options.minKb = Number.parseInt(argv[index + 1] ?? "", 10);
+      index += 1;
+      continue;
+    }
+    if (arg === "--lane-dir") {
+      options.laneDir = argv[index + 1] ?? null;
+      index += 1;
+      continue;
+    }
+    if (arg === "--pid") {
+      options.pid = Number.parseInt(argv[index + 1] ??
"", 10); + index += 1; + continue; + } + options.files.push(arg); + } + + if (!Number.isFinite(options.top) || options.top <= 0) { + fail("--top must be a positive integer"); + } + if (!Number.isFinite(options.minKb) || options.minKb < 0) { + fail("--min-kb must be a non-negative integer"); + } + if (options.pid !== null && (!Number.isInteger(options.pid) || options.pid <= 0)) { + fail("--pid must be a positive integer"); + } + + return options; +} + +function parseHeapFilename(filePath) { + const base = path.basename(filePath); + const match = base.match( + /^Heap\.(?\d{8}\.\d{6})\.(?\d+)\.0\.(?\d+)\.heapsnapshot$/u, + ); + if (!match?.groups) { + return null; + } + return { + filePath, + pid: Number.parseInt(match.groups.pid, 10), + stamp: match.groups.stamp, + sequence: Number.parseInt(match.groups.seq, 10), + }; +} + +function resolvePair(options) { + if (options.laneDir) { + const entries = fs + .readdirSync(options.laneDir) + .map((name) => parseHeapFilename(path.join(options.laneDir, name))) + .filter((entry) => entry !== null) + .filter((entry) => options.pid === null || entry.pid === options.pid) + .toSorted((left, right) => { + if (left.pid !== right.pid) { + return left.pid - right.pid; + } + if (left.stamp !== right.stamp) { + return left.stamp.localeCompare(right.stamp); + } + return left.sequence - right.sequence; + }); + + if (entries.length === 0) { + fail(`No matching heap snapshots found in ${options.laneDir}`); + } + + const groups = new Map(); + for (const entry of entries) { + const group = groups.get(entry.pid) ?? []; + group.push(entry); + groups.set(entry.pid, group); + } + + const candidates = Array.from(groups.values()) + .map((group) => ({ + pid: group[0].pid, + before: group[0], + after: group.at(-1), + count: group.length, + })) + .filter((entry) => entry.count >= 2); + + if (candidates.length === 0) { + fail(`Need at least two snapshots for one PID in ${options.laneDir}`); + } + + const chosen = + options.pid !== null + ? 
(candidates.find((entry) => entry.pid === options.pid) ?? null)
+        : candidates.toSorted((left, right) => right.count - left.count || left.pid - right.pid)[0];
+
+    if (!chosen) {
+      fail(`No PID with at least two snapshots matched in ${options.laneDir}`);
+    }
+
+    return {
+      before: chosen.before.filePath,
+      after: chosen.after.filePath,
+      pid: chosen.pid,
+      snapshotCount: chosen.count,
+    };
+  }
+
+  if (options.files.length !== 2) {
+    printUsage();
+    process.exit(1);
+  }
+
+  return {
+    before: options.files[0],
+    after: options.files[1],
+    pid: null,
+    snapshotCount: 2,
+  };
+}
+
+function loadSummary(filePath) {
+  const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
+  const meta = data.snapshot?.meta;
+  if (!meta) {
+    fail(`Invalid heap snapshot: ${filePath}`);
+  }
+
+  const nodeFieldCount = meta.node_fields.length;
+  const typeNames = meta.node_types[0];
+  const strings = data.strings;
+  const typeIndex = meta.node_fields.indexOf("type");
+  const nameIndex = meta.node_fields.indexOf("name");
+  const selfSizeIndex = meta.node_fields.indexOf("self_size");
+
+  const summary = new Map();
+  for (let offset = 0; offset < data.nodes.length; offset += nodeFieldCount) {
+    const type = typeNames[data.nodes[offset + typeIndex]];
+    const name = strings[data.nodes[offset + nameIndex]];
+    const selfSize = data.nodes[offset + selfSizeIndex];
+    const key = `${type}\t${name}`;
+    const current = summary.get(key) ?? {
+      type,
+      name,
+      selfSize: 0,
+      count: 0,
+    };
+    current.selfSize += selfSize;
+    current.count += 1;
+    summary.set(key, current);
+  }
+  return {
+    nodeCount: data.snapshot.node_count,
+    summary,
+  };
+}
+
+function formatBytes(bytes) {
+  if (Math.abs(bytes) >= 1024 ** 2) {
+    return `${(bytes / 1024 ** 2).toFixed(2)} MiB`;
+  }
+  if (Math.abs(bytes) >= 1024) {
+    return `${(bytes / 1024).toFixed(1)} KiB`;
+  }
+  return `${bytes} B`;
+}
+
+function formatDelta(bytes) {
+  return `${bytes >= 0 ?
"+" : "-"}${formatBytes(Math.abs(bytes))}`; +} + +function truncate(text, maxLength) { + return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}…`; +} + +function main() { + const options = parseArgs(process.argv.slice(2)); + const pair = resolvePair(options); + const before = loadSummary(pair.before); + const after = loadSummary(pair.after); + const minBytes = options.minKb * 1024; + + const rows = []; + for (const [key, next] of after.summary) { + const previous = before.summary.get(key) ?? { selfSize: 0, count: 0 }; + const sizeDelta = next.selfSize - previous.selfSize; + const countDelta = next.count - previous.count; + if (sizeDelta < minBytes) { + continue; + } + rows.push({ + type: next.type, + name: next.name, + sizeDelta, + countDelta, + afterSize: next.selfSize, + afterCount: next.count, + }); + } + + rows.sort( + (left, right) => right.sizeDelta - left.sizeDelta || right.countDelta - left.countDelta, + ); + + console.log(`before: ${pair.before}`); + console.log(`after: ${pair.after}`); + if (pair.pid !== null) { + console.log(`pid: ${pair.pid} (${pair.snapshotCount} snapshots found)`); + } + console.log( + `nodes: ${before.nodeCount} -> ${after.nodeCount} (${after.nodeCount - before.nodeCount >= 0 ? "+" : ""}${after.nodeCount - before.nodeCount})`, + ); + console.log(`filter: top=${options.top} min=${options.minKb} KiB`); + console.log(""); + + if (rows.length === 0) { + console.log("No entries exceeded the minimum delta."); + return; + } + + for (const row of rows.slice(0, options.top)) { + console.log( + [ + formatDelta(row.sizeDelta).padStart(11), + `count ${row.countDelta >= 0 ? 
"+" : ""}${row.countDelta}`.padStart(10), + row.type.padEnd(16), + truncate(row.name || "(empty)", 96), + ].join(" "), + ); + } +} + +main(); diff --git a/scripts/test-parallel-memory.mjs b/scripts/test-parallel-memory.mjs index a4fa2602cd1..b036fc22fa6 100644 --- a/scripts/test-parallel-memory.mjs +++ b/scripts/test-parallel-memory.mjs @@ -11,7 +11,7 @@ const ANSI_ESCAPE_PATTERN = new RegExp( const COMPLETED_TEST_FILE_LINE_PATTERN = /(?(?:src|extensions|test|ui)\/\S+?\.(?:live\.test|e2e\.test|test)\.ts)\s+\(.*\)\s+(?\d+(?:\.\d+)?)(?ms|s)\s*$/; -const PS_COLUMNS = ["pid=", "ppid=", "rss="]; +const PS_COLUMNS = ["pid=", "ppid=", "rss=", "comm="]; function parseDurationMs(rawValue, unit) { const parsed = Number.parseFloat(rawValue); @@ -41,7 +41,7 @@ export function parseCompletedTestFileLines(text) { .filter((entry) => entry !== null); } -export function sampleProcessTreeRssKb(rootPid) { +export function getProcessTreeRecords(rootPid) { if (!Number.isInteger(rootPid) || rootPid <= 0 || process.platform === "win32") { return null; } @@ -54,13 +54,13 @@ export function sampleProcessTreeRssKb(rootPid) { } const childPidsByParent = new Map(); - const rssByPid = new Map(); + const recordsByPid = new Map(); for (const line of result.stdout.split(/\r?\n/u)) { const trimmed = line.trim(); if (!trimmed) { continue; } - const [pidRaw, parentRaw, rssRaw] = trimmed.split(/\s+/u); + const [pidRaw, parentRaw, rssRaw, commandRaw] = trimmed.split(/\s+/u, 4); const pid = Number.parseInt(pidRaw ?? "", 10); const parentPid = Number.parseInt(parentRaw ?? "", 10); const rssKb = Number.parseInt(rssRaw ?? "", 10); @@ -70,27 +70,30 @@ export function sampleProcessTreeRssKb(rootPid) { const siblings = childPidsByParent.get(parentPid) ?? []; siblings.push(pid); childPidsByParent.set(parentPid, siblings); - rssByPid.set(pid, rssKb); + recordsByPid.set(pid, { + pid, + parentPid, + rssKb, + command: commandRaw ?? 
"", + }); } - if (!rssByPid.has(rootPid)) { + if (!recordsByPid.has(rootPid)) { return null; } - let rssKb = 0; - let processCount = 0; const queue = [rootPid]; const visited = new Set(); + const records = []; while (queue.length > 0) { const pid = queue.shift(); if (pid === undefined || visited.has(pid)) { continue; } visited.add(pid); - const currentRssKb = rssByPid.get(pid); - if (currentRssKb !== undefined) { - rssKb += currentRssKb; - processCount += 1; + const record = recordsByPid.get(pid); + if (record) { + records.push(record); } for (const childPid of childPidsByParent.get(pid) ?? []) { if (!visited.has(childPid)) { @@ -99,5 +102,21 @@ export function sampleProcessTreeRssKb(rootPid) { } } + return records; +} + +export function sampleProcessTreeRssKb(rootPid) { + const records = getProcessTreeRecords(rootPid); + if (!records) { + return null; + } + + let rssKb = 0; + let processCount = 0; + for (const record of records) { + rssKb += record.rssKb; + processCount += 1; + } + return { rssKb, processCount }; } diff --git a/scripts/test-parallel.mjs b/scripts/test-parallel.mjs index 841132d69e0..c908ede7e4a 100644 --- a/scripts/test-parallel.mjs +++ b/scripts/test-parallel.mjs @@ -4,7 +4,11 @@ import os from "node:os"; import path from "node:path"; import { channelTestPrefixes } from "../vitest.channel-paths.mjs"; import { isUnitConfigTestFile } from "../vitest.unit-paths.mjs"; -import { parseCompletedTestFileLines, sampleProcessTreeRssKb } from "./test-parallel-memory.mjs"; +import { + getProcessTreeRecords, + parseCompletedTestFileLines, + sampleProcessTreeRssKb, +} from "./test-parallel-memory.mjs"; import { appendCapturedOutput, hasFatalTestRunOutput, @@ -725,6 +729,25 @@ const memoryTraceEnabled = (rawMemoryTrace !== "0" && rawMemoryTrace !== "false" && isCI)); const memoryTracePollMs = Math.max(250, parseEnvNumber("OPENCLAW_TEST_MEMORY_TRACE_POLL_MS", 1000)); const memoryTraceTopCount = Math.max(1, parseEnvNumber("OPENCLAW_TEST_MEMORY_TRACE_TOP_COUNT", 
6));
+const heapSnapshotIntervalMs = Math.max(
+  0,
+  parseEnvNumber("OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS", 0),
+);
+const heapSnapshotEnabled = process.platform !== "win32" && heapSnapshotIntervalMs > 0;
+const heapSnapshotSignal = process.env.OPENCLAW_TEST_HEAPSNAPSHOT_SIGNAL?.trim() || "SIGUSR2";
+const heapSnapshotBaseDir = heapSnapshotEnabled
+  ? path.resolve(
+      process.env.OPENCLAW_TEST_HEAPSNAPSHOT_DIR?.trim() ||
+        path.join(os.tmpdir(), `openclaw-heapsnapshots-${Date.now()}`),
+    )
+  : null;
+const ensureNodeOptionFlag = (nodeOptions, flagPrefix, nextValue) =>
+  nodeOptions.includes(flagPrefix) ? nodeOptions : `${nodeOptions} ${nextValue}`.trim();
+const isNodeLikeProcess = (command) => /(?:^|\/)node(?:$|\.exe$)/iu.test(command);
 
 const runOnce = (entry, extraArgs = []) =>
   new Promise((resolve) => {
@@ -757,23 +780,44 @@ const runOnce = (entry, extraArgs = []) =>
       (acc, flag) => (acc.includes(flag) ? acc : `${acc} ${flag}`.trim()),
       nodeOptions,
     );
-    const heapFlag =
+    const heapSnapshotDir =
+      heapSnapshotBaseDir === null ? null : path.join(heapSnapshotBaseDir, entry.name);
+    let resolvedNodeOptions =
       maxOldSpaceSizeMb && !nextNodeOptions.includes("--max-old-space-size=")
-        ? `--max-old-space-size=${maxOldSpaceSizeMb}`
-        : null;
-    const resolvedNodeOptions = heapFlag
-      ? `${nextNodeOptions} ${heapFlag}`.trim()
-      : nextNodeOptions;
+        ?
`${nextNodeOptions} --max-old-space-size=${maxOldSpaceSizeMb}`.trim()
+        : nextNodeOptions;
+    if (heapSnapshotEnabled && heapSnapshotDir) {
+      try {
+        fs.mkdirSync(heapSnapshotDir, { recursive: true });
+      } catch (err) {
+        console.error(`[test-parallel] failed to create heap snapshot dir ${heapSnapshotDir}: ${String(err)}`);
+        resolve(1);
+        return;
+      }
+      resolvedNodeOptions = ensureNodeOptionFlag(
+        resolvedNodeOptions,
+        "--diagnostic-dir=",
+        `--diagnostic-dir=${heapSnapshotDir}`,
+      );
+      resolvedNodeOptions = ensureNodeOptionFlag(
+        resolvedNodeOptions,
+        "--heapsnapshot-signal=",
+        `--heapsnapshot-signal=${heapSnapshotSignal}`,
+      );
+    }
     let output = "";
     let fatalSeen = false;
     let childError = null;
     let child;
     let pendingLine = "";
     let memoryPollTimer = null;
+    let heapSnapshotTimer = null;
     const memoryFileRecords = [];
     let initialTreeSample = null;
     let latestTreeSample = null;
     let peakTreeSample = null;
+    let heapSnapshotSequence = 0;
     const updatePeakTreeSample = (sample, reason) => {
       if (!sample) {
         return;
@@ -782,6 +826,35 @@ const runOnce = (entry, extraArgs = []) =>
         peakTreeSample = { ...sample, reason };
       }
     };
+    const triggerHeapSnapshot = (reason) => {
+      if (!heapSnapshotEnabled || !child?.pid || !heapSnapshotDir) {
+        return;
+      }
+      const records = getProcessTreeRecords(child.pid) ?? [];
+      const targetPids = records
+        .filter((record) => record.pid !== process.pid && isNodeLikeProcess(record.command))
+        .map((record) => record.pid);
+      if (targetPids.length === 0) {
+        return;
+      }
+      heapSnapshotSequence += 1;
+      let signaledCount = 0;
+      for (const pid of targetPids) {
+        try {
+          process.kill(pid, heapSnapshotSignal);
+          signaledCount += 1;
+        } catch {
+          // Process likely exited between ps sampling and signal delivery.
+        }
+      }
+      if (signaledCount > 0) {
+        console.log(
+          `[test-parallel][heap] ${entry.name} seq=${String(heapSnapshotSequence)} reason=${reason} signaled=${String(
+            signaledCount,
+          )}/${String(targetPids.length)} dir=${heapSnapshotDir}`,
+        );
+      }
+    };
     const captureTreeSample = (reason) => {
       if (!memoryTraceEnabled || !child?.pid) {
         return null;
@@ -877,6 +950,11 @@ const runOnce = (entry, extraArgs = []) =>
           captureTreeSample("poll");
         }, memoryTracePollMs);
       }
+      if (heapSnapshotEnabled) {
+        heapSnapshotTimer = setInterval(() => {
+          triggerHeapSnapshot("interval");
+        }, heapSnapshotIntervalMs);
+      }
     } catch (err) {
       console.error(`[test-parallel] spawn failed: ${String(err)}`);
       resolve(1);
@@ -905,6 +983,9 @@ const runOnce = (entry, extraArgs = []) =>
       if (memoryPollTimer) {
         clearInterval(memoryPollTimer);
       }
+      if (heapSnapshotTimer) {
+        clearInterval(heapSnapshotTimer);
+      }
       children.delete(child);
       const resolvedCode = resolveTestRunExitCode({ code, signal, output, fatalSeen, childError });
       logMemoryTraceSummary();
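To illustrate how the wrapper composes `NODE_OPTIONS` for snapshot-enabled lanes, here is a minimal standalone sketch. `ensureNodeOptionFlag` is copied from the patch; `withHeapSnapshotFlags` and the sample directory are hypothetical names introduced only for this example. It shows that a flag the caller already supplied is left untouched, while missing flags are appended.

```javascript
// Copied from the patch: append nextValue unless a flag with this prefix is already present.
const ensureNodeOptionFlag = (nodeOptions, flagPrefix, nextValue) =>
  nodeOptions.includes(flagPrefix) ? nodeOptions : `${nodeOptions} ${nextValue}`.trim();

// Hypothetical wrapper for illustration; the patch performs these two calls inline in runOnce.
function withHeapSnapshotFlags(nodeOptions, dir, signal) {
  let next = ensureNodeOptionFlag(nodeOptions, "--diagnostic-dir=", `--diagnostic-dir=${dir}`);
  next = ensureNodeOptionFlag(next, "--heapsnapshot-signal=", `--heapsnapshot-signal=${signal}`);
  return next;
}

// No pre-existing options: both flags are appended.
const fresh = withHeapSnapshotFlags("", ".tmp/heapsnap/unit-fast", "SIGUSR2");
console.log(fresh);

// A caller-supplied signal wins; only the diagnostic dir is added.
const preset = withHeapSnapshotFlags(
  "--heapsnapshot-signal=SIGUSR1",
  ".tmp/heapsnap/unit-fast",
  "SIGUSR2",
);
console.log(preset);
```

One consequence of this prefix check is that a `--heapsnapshot-signal` already present in the environment takes precedence for what the workers listen on, while `triggerHeapSnapshot` still sends whatever `OPENCLAW_TEST_HEAPSNAPSHOT_SIGNAL` resolves to, so the two should be kept in agreement.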