Tests: Add tooling / skill for detecting and fixing memory leaks in tests (#50654)

* Tests: add periodic heap snapshot tooling

* Skills: add test heap leak workflow

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update scripts/test-parallel.mjs

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Harold Hunt
2026-03-19 17:59:13 -04:00
committed by GitHub
parent da8fb70525
commit bbd62469fa
5 changed files with 459 additions and 19 deletions

@@ -0,0 +1,71 @@
---
name: openclaw-test-heap-leaks
description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, distinguish transformed-module retention from real data leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
---
# OpenClaw Test Heap Leaks
Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available.
## Workflow
1. Reproduce the failing shape first.
   - Match the real entrypoint if possible. For Linux CI-style unit failures, start with:
     - `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
   - Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
   - If the report is about a specific shard or worker budget, preserve that shape.
2. Wait for repeated snapshots before concluding anything.
   - Take at least two intervals from the same lane.
   - Compare snapshots from the same PID inside one lane directory such as `.tmp/heapsnap/unit-fast/`.
   - Use `scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
3. Classify the growth before choosing a fix.
   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as retained module graph growth in long-lived workers.
   - If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
4. Fix the right layer.
   - For retained transformed-module growth in shared workers:
     - Move hotspot files out of `unit-fast` by updating `test/fixtures/test-parallel.behavior.json`.
     - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
     - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
   - For real leaks:
     - Patch the implicated test or runtime cleanup path.
     - Look for missing `afterEach`/`afterAll`, module-reset gaps, retained global state, unreleased DB handles, or listeners/timers that survive the file.
5. Verify with the most direct proof.
   - Re-run the targeted lane or file with heap snapshots enabled if the suite still finishes in reasonable time.
   - If snapshot overhead pushes tests over Vitest timeouts, fall back to the same lane without snapshots and confirm the RSS trend or OOM is reduced.
   - For wrapper-only changes, at minimum verify the expected lanes start and the snapshot files are written.
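Step 2 above pairs the earliest and latest snapshot from one worker PID. A minimal sketch of that pairing follows, assuming Node's default `Heap.<date>.<time>.<pid>.<thread>.<seq>.heapsnapshot` file naming. The helper `pairSnapshotsByPid` and the sample file names are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: group snapshot file names by PID and pick the
// earliest and latest file for each PID that has at least two snapshots.
const HEAP_NAME = /^Heap\.(\d{8}\.\d{6})\.(\d+)\.\d+\.(\d+)\.heapsnapshot$/;

function pairSnapshotsByPid(fileNames) {
  const byPid = new Map();
  for (const name of fileNames) {
    const match = HEAP_NAME.exec(name);
    if (!match) continue; // ignore unrelated files
    const [, stamp, pid, seq] = match;
    const group = byPid.get(pid) ?? [];
    group.push({ name, stamp, seq: Number(seq) });
    byPid.set(pid, group);
  }
  const pairs = [];
  for (const [pid, group] of byPid) {
    if (group.length < 2) continue; // a single snapshot cannot show growth
    group.sort((a, b) => a.stamp.localeCompare(b.stamp) || a.seq - b.seq);
    pairs.push({
      pid: Number(pid),
      before: group[0].name,
      after: group[group.length - 1].name,
    });
  }
  return pairs;
}

// Made-up file names for illustration.
const pairs = pairSnapshotsByPid([
  "Heap.20260319.170100.16133.0.001.heapsnapshot",
  "Heap.20260319.170300.16133.0.003.heapsnapshot",
  "Heap.20260319.170200.16133.0.002.heapsnapshot",
  "Heap.20260319.170100.16200.0.001.heapsnapshot",
]);
console.log(pairs);
```

PID 16200 is skipped in this sample because it has only one snapshot, which mirrors the "at least two intervals" rule.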
## Heuristics
- Do not call everything a leak. In this repo, large `unit-fast` growth can be a worker-lifetime problem rather than an application object leak.
- `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
- The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
- When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition.
## Snapshot Comparison
- Direct comparison:
  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot`
- Auto-select earliest/latest snapshots per PID within one lane:
  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast`
- Useful flags:
  - `--top 40`
  - `--min-kb 32`
  - `--pid 16133`
Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak.
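The reading rule above can be sketched as a small function. This is an illustration only: `classifyDelta`, `looksLikeModuleArtifact`, and the sample rows are assumptions, not part of `heapsnapshot-delta.mjs`, and the matching rules are a rough reading of the object families named in the workflow rather than a complete classifier.

```javascript
// Hypothetical heuristic: treat code objects and a few well-known names as
// module-graph artifacts. The rule set is an assumption for illustration.
function looksLikeModuleArtifact(row) {
  if (row.type === "code") return true; // bytecode and related code objects
  return /^(Module|system \/ Context)$/.test(row.name);
}

function classifyDelta(rows) {
  let moduleBytes = 0;
  let runtimeBytes = 0;
  for (const row of rows) {
    if (row.sizeDelta <= 0) continue; // only growth matters here
    if (looksLikeModuleArtifact(row)) moduleBytes += row.sizeDelta;
    else runtimeBytes += row.sizeDelta;
  }
  let verdict = "no-growth";
  if (moduleBytes + runtimeBytes > 0) {
    verdict = moduleBytes >= runtimeBytes ? "retained-module-growth" : "likely-leak";
  }
  return { moduleBytes, runtimeBytes, verdict };
}

// Made-up delta rows in the shape { type, name, sizeDelta }.
const result = classifyDelta([
  { type: "code", name: "(bytecode)", sizeDelta: 4000000 },
  { type: "object", name: "Module", sizeDelta: 1000000 },
  { type: "object", name: "Timeout", sizeDelta: 200000 },
]);
console.log(result.verdict); // retained-module-growth
```

A verdict from a heuristic like this is only a starting point; the dominant families still need to be read in the actual delta output.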
## Output Expectations
When using this skill, report:
- The exact reproduce command.
- Which lane and PID were compared.
- The dominant retained object families from the snapshot delta.
- Whether the issue is a real leak or shared-worker retained module growth.
- The concrete fix or impact-reduction patch.
- What you verified, and what snapshot overhead prevented you from verifying.

@@ -0,0 +1,4 @@
interface:
display_name: "Test Heap Leaks"
short_description: "Investigate test OOMs with heap snapshots"
default_prompt: "Use $openclaw-test-heap-leaks to investigate test memory growth with heap snapshots and reduce its impact."

@@ -0,0 +1,265 @@
#!/usr/bin/env node
import fs from "node:fs";
import path from "node:path";
function printUsage() {
console.error(
"Usage: node heapsnapshot-delta.mjs <before.heapsnapshot> <after.heapsnapshot> [--top N] [--min-kb N]",
);
console.error(
" or: node heapsnapshot-delta.mjs --lane-dir <dir> [--pid PID] [--top N] [--min-kb N]",
);
}
function fail(message) {
console.error(message);
process.exit(1);
}
function parseArgs(argv) {
const options = {
top: 30,
minKb: 64,
laneDir: null,
pid: null,
files: [],
};
for (let index = 0; index < argv.length; index += 1) {
const arg = argv[index];
if (arg === "--top") {
options.top = Number.parseInt(argv[index + 1] ?? "", 10);
index += 1;
continue;
}
if (arg === "--min-kb") {
options.minKb = Number.parseInt(argv[index + 1] ?? "", 10);
index += 1;
continue;
}
if (arg === "--lane-dir") {
options.laneDir = argv[index + 1] ?? null;
index += 1;
continue;
}
if (arg === "--pid") {
options.pid = Number.parseInt(argv[index + 1] ?? "", 10);
index += 1;
continue;
}
options.files.push(arg);
}
if (!Number.isFinite(options.top) || options.top <= 0) {
fail("--top must be a positive integer");
}
if (!Number.isFinite(options.minKb) || options.minKb < 0) {
fail("--min-kb must be a non-negative integer");
}
if (options.pid !== null && (!Number.isInteger(options.pid) || options.pid <= 0)) {
fail("--pid must be a positive integer");
}
return options;
}
function parseHeapFilename(filePath) {
const base = path.basename(filePath);
const match = base.match(
/^Heap\.(?<stamp>\d{8}\.\d{6})\.(?<pid>\d+)\.0\.(?<seq>\d+)\.heapsnapshot$/u,
);
if (!match?.groups) {
return null;
}
return {
filePath,
pid: Number.parseInt(match.groups.pid, 10),
stamp: match.groups.stamp,
sequence: Number.parseInt(match.groups.seq, 10),
};
}
function resolvePair(options) {
if (options.laneDir) {
const entries = fs
.readdirSync(options.laneDir)
.map((name) => parseHeapFilename(path.join(options.laneDir, name)))
.filter((entry) => entry !== null)
.filter((entry) => options.pid === null || entry.pid === options.pid)
.toSorted((left, right) => {
if (left.pid !== right.pid) {
return left.pid - right.pid;
}
if (left.stamp !== right.stamp) {
return left.stamp.localeCompare(right.stamp);
}
return left.sequence - right.sequence;
});
if (entries.length === 0) {
fail(`No matching heap snapshots found in ${options.laneDir}`);
}
const groups = new Map();
for (const entry of entries) {
const group = groups.get(entry.pid) ?? [];
group.push(entry);
groups.set(entry.pid, group);
}
const candidates = Array.from(groups.values())
.map((group) => ({
pid: group[0].pid,
before: group[0],
after: group.at(-1),
count: group.length,
}))
.filter((entry) => entry.count >= 2);
if (candidates.length === 0) {
fail(`Need at least two snapshots for one PID in ${options.laneDir}`);
}
const chosen =
options.pid !== null
? (candidates.find((entry) => entry.pid === options.pid) ?? null)
: candidates.toSorted((left, right) => right.count - left.count || left.pid - right.pid)[0];
if (!chosen) {
fail(`No PID with at least two snapshots matched in ${options.laneDir}`);
}
return {
before: chosen.before.filePath,
after: chosen.after.filePath,
pid: chosen.pid,
snapshotCount: chosen.count,
};
}
if (options.files.length !== 2) {
printUsage();
process.exit(1);
}
return {
before: options.files[0],
after: options.files[1],
pid: null,
snapshotCount: 2,
};
}
function loadSummary(filePath) {
const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
const meta = data.snapshot?.meta;
if (!meta) {
fail(`Invalid heap snapshot: ${filePath}`);
}
const nodeFieldCount = meta.node_fields.length;
const typeNames = meta.node_types[0];
const strings = data.strings;
const typeIndex = meta.node_fields.indexOf("type");
const nameIndex = meta.node_fields.indexOf("name");
const selfSizeIndex = meta.node_fields.indexOf("self_size");
const summary = new Map();
for (let offset = 0; offset < data.nodes.length; offset += nodeFieldCount) {
const type = typeNames[data.nodes[offset + typeIndex]];
const name = strings[data.nodes[offset + nameIndex]];
const selfSize = data.nodes[offset + selfSizeIndex];
const key = `${type}\t${name}`;
const current = summary.get(key) ?? {
type,
name,
selfSize: 0,
count: 0,
};
current.selfSize += selfSize;
current.count += 1;
summary.set(key, current);
}
return {
nodeCount: data.snapshot.node_count,
summary,
};
}
function formatBytes(bytes) {
if (Math.abs(bytes) >= 1024 ** 2) {
return `${(bytes / 1024 ** 2).toFixed(2)} MiB`;
}
if (Math.abs(bytes) >= 1024) {
return `${(bytes / 1024).toFixed(1)} KiB`;
}
return `${bytes} B`;
}
function formatDelta(bytes) {
return `${bytes >= 0 ? "+" : "-"}${formatBytes(Math.abs(bytes))}`;
}
function truncate(text, maxLength) {
return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}…`;
}
function main() {
const options = parseArgs(process.argv.slice(2));
const pair = resolvePair(options);
const before = loadSummary(pair.before);
const after = loadSummary(pair.after);
const minBytes = options.minKb * 1024;
const rows = [];
for (const [key, next] of after.summary) {
const previous = before.summary.get(key) ?? { selfSize: 0, count: 0 };
const sizeDelta = next.selfSize - previous.selfSize;
const countDelta = next.count - previous.count;
if (sizeDelta < minBytes) {
continue;
}
rows.push({
type: next.type,
name: next.name,
sizeDelta,
countDelta,
afterSize: next.selfSize,
afterCount: next.count,
});
}
rows.sort(
(left, right) => right.sizeDelta - left.sizeDelta || right.countDelta - left.countDelta,
);
console.log(`before: ${pair.before}`);
console.log(`after: ${pair.after}`);
if (pair.pid !== null) {
console.log(`pid: ${pair.pid} (${pair.snapshotCount} snapshots found)`);
}
console.log(
`nodes: ${before.nodeCount} -> ${after.nodeCount} (${after.nodeCount - before.nodeCount >= 0 ? "+" : ""}${after.nodeCount - before.nodeCount})`,
);
console.log(`filter: top=${options.top} min=${options.minKb} KiB`);
console.log("");
if (rows.length === 0) {
console.log("No entries exceeded the minimum delta.");
return;
}
for (const row of rows.slice(0, options.top)) {
console.log(
[
formatDelta(row.sizeDelta).padStart(11),
`count ${row.countDelta >= 0 ? "+" : ""}${row.countDelta}`.padStart(10),
row.type.padEnd(16),
truncate(row.name || "(empty)", 96),
].join(" "),
);
}
}
main();
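
`loadSummary` above relies on the heap snapshot layout, where `nodes` is one flat integer array and `meta.node_fields` names each slot of a node record. A minimal sketch of that walk on a tiny hand-made snapshot object follows. The `tinySnapshot` data and the `summarize` name are invented for illustration; a real `.heapsnapshot` file has more fields per node and far more nodes.

```javascript
// Invented miniature snapshot: three nodes, five fields per node.
const tinySnapshot = {
  snapshot: {
    meta: {
      node_fields: ["type", "name", "id", "self_size", "edge_count"],
      node_types: [["hidden", "object", "string"]],
    },
  },
  // Each node occupies node_fields.length consecutive slots.
  nodes: [
    1, 0, 1, 64, 0, // object "Module", 64 bytes
    1, 0, 3, 32, 0, // object "Module", 32 bytes
    2, 1, 5, 100, 0, // string "source text", 100 bytes
  ],
  strings: ["Module", "source text"],
};

function summarize(data) {
  const { node_fields: fields, node_types: types } = data.snapshot.meta;
  const stride = fields.length;
  const typeAt = fields.indexOf("type");
  const nameAt = fields.indexOf("name");
  const sizeAt = fields.indexOf("self_size");
  const totals = new Map();
  for (let offset = 0; offset < data.nodes.length; offset += stride) {
    // type and name are stored as indexes into lookup tables.
    const type = types[0][data.nodes[offset + typeAt]];
    const name = data.strings[data.nodes[offset + nameAt]];
    const key = `${type}\t${name}`;
    const entry = totals.get(key) ?? { selfSize: 0, count: 0 };
    entry.selfSize += data.nodes[offset + sizeAt];
    entry.count += 1;
    totals.set(key, entry);
  }
  return totals;
}

const totals = summarize(tinySnapshot);
console.log(totals.get("object\tModule")); // { selfSize: 96, count: 2 }
```

Aggregating by `type` plus `name` keeps the summary small enough to diff between two snapshots, at the cost of losing per-object retention paths.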

@@ -11,7 +11,7 @@ const ANSI_ESCAPE_PATTERN = new RegExp(
const COMPLETED_TEST_FILE_LINE_PATTERN =
/(?<file>(?:src|extensions|test|ui)\/\S+?\.(?:live\.test|e2e\.test|test)\.ts)\s+\(.*\)\s+(?<duration>\d+(?:\.\d+)?)(?<unit>ms|s)\s*$/;
-const PS_COLUMNS = ["pid=", "ppid=", "rss="];
+const PS_COLUMNS = ["pid=", "ppid=", "rss=", "comm="];
function parseDurationMs(rawValue, unit) {
const parsed = Number.parseFloat(rawValue);
@@ -41,7 +41,7 @@ export function parseCompletedTestFileLines(text) {
.filter((entry) => entry !== null);
}
-export function sampleProcessTreeRssKb(rootPid) {
+export function getProcessTreeRecords(rootPid) {
if (!Number.isInteger(rootPid) || rootPid <= 0 || process.platform === "win32") {
return null;
}
@@ -54,13 +54,13 @@ export function sampleProcessTreeRssKb(rootPid) {
}
const childPidsByParent = new Map();
-const rssByPid = new Map();
+const recordsByPid = new Map();
for (const line of result.stdout.split(/\r?\n/u)) {
const trimmed = line.trim();
if (!trimmed) {
continue;
}
-const [pidRaw, parentRaw, rssRaw] = trimmed.split(/\s+/u);
+const [pidRaw, parentRaw, rssRaw, commandRaw] = trimmed.split(/\s+/u, 4);
const pid = Number.parseInt(pidRaw ?? "", 10);
const parentPid = Number.parseInt(parentRaw ?? "", 10);
const rssKb = Number.parseInt(rssRaw ?? "", 10);
@@ -70,27 +70,30 @@ export function sampleProcessTreeRssKb(rootPid) {
const siblings = childPidsByParent.get(parentPid) ?? [];
siblings.push(pid);
childPidsByParent.set(parentPid, siblings);
-rssByPid.set(pid, rssKb);
+recordsByPid.set(pid, {
+pid,
+parentPid,
+rssKb,
+command: commandRaw ?? "",
+});
}
-if (!rssByPid.has(rootPid)) {
+if (!recordsByPid.has(rootPid)) {
return null;
}
-let rssKb = 0;
-let processCount = 0;
const queue = [rootPid];
const visited = new Set();
+const records = [];
while (queue.length > 0) {
const pid = queue.shift();
if (pid === undefined || visited.has(pid)) {
continue;
}
visited.add(pid);
-const currentRssKb = rssByPid.get(pid);
-if (currentRssKb !== undefined) {
-rssKb += currentRssKb;
-processCount += 1;
+const record = recordsByPid.get(pid);
+if (record) {
+records.push(record);
}
for (const childPid of childPidsByParent.get(pid) ?? []) {
if (!visited.has(childPid)) {
@@ -99,5 +102,21 @@ export function sampleProcessTreeRssKb(rootPid) {
}
}
return records;
}
export function sampleProcessTreeRssKb(rootPid) {
const records = getProcessTreeRecords(rootPid);
if (!records) {
return null;
}
let rssKb = 0;
let processCount = 0;
for (const record of records) {
rssKb += record.rssKb;
processCount += 1;
}
return { rssKb, processCount };
}
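
The refactor above splits tree discovery (`getProcessTreeRecords`) from aggregation (`sampleProcessTreeRssKb`). A minimal sketch of the same idea follows, run against a hard-coded `ps`-style listing so that it needs no subprocess. `collectTree`, `sumRss`, and the sample listing are hypothetical names and data, not code from this commit.

```javascript
// Invented listing in "pid ppid rss comm" column order.
const sampleListing = `
  100     1  2000 node
  101   100  5000 node
  102   100  1000 sh
  103   102  7000 node
  200     1  9000 node
`;

function collectTree(text, rootPid) {
  const records = new Map();
  const children = new Map();
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    const [pidRaw, ppidRaw, rssRaw, command = ""] = trimmed.split(/\s+/, 4);
    const pid = Number.parseInt(pidRaw, 10);
    const parentPid = Number.parseInt(ppidRaw, 10);
    const rssKb = Number.parseInt(rssRaw, 10);
    if (![pid, parentPid, rssKb].every((value) => Number.isInteger(value))) continue;
    records.set(pid, { pid, parentPid, rssKb, command });
    children.set(parentPid, [...(children.get(parentPid) ?? []), pid]);
  }
  if (!records.has(rootPid)) return null;
  // Breadth-first walk from the root, guarding against repeated PIDs.
  const found = [];
  const queue = [rootPid];
  const visited = new Set();
  while (queue.length > 0) {
    const pid = queue.shift();
    if (visited.has(pid)) continue;
    visited.add(pid);
    found.push(records.get(pid));
    queue.push(...(children.get(pid) ?? []));
  }
  return found;
}

function sumRss(records) {
  return records.reduce((total, record) => total + record.rssKb, 0);
}

const tree = collectTree(sampleListing, 100);
console.log(tree.map((record) => record.pid)); // [ 100, 101, 102, 103 ]
console.log(sumRss(tree)); // 15000
```

Returning full records instead of a running total is what lets a caller filter by command name later, for example to signal only Node processes.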

@@ -4,7 +4,11 @@ import os from "node:os";
import path from "node:path";
import { channelTestPrefixes } from "../vitest.channel-paths.mjs";
import { isUnitConfigTestFile } from "../vitest.unit-paths.mjs";
-import { parseCompletedTestFileLines, sampleProcessTreeRssKb } from "./test-parallel-memory.mjs";
+import {
+getProcessTreeRecords,
+parseCompletedTestFileLines,
+sampleProcessTreeRssKb,
+} from "./test-parallel-memory.mjs";
import {
appendCapturedOutput,
hasFatalTestRunOutput,
@@ -725,6 +729,25 @@ const memoryTraceEnabled =
(rawMemoryTrace !== "0" && rawMemoryTrace !== "false" && isCI));
const memoryTracePollMs = Math.max(250, parseEnvNumber("OPENCLAW_TEST_MEMORY_TRACE_POLL_MS", 1000));
const memoryTraceTopCount = Math.max(1, parseEnvNumber("OPENCLAW_TEST_MEMORY_TRACE_TOP_COUNT", 6));
const heapSnapshotIntervalMs = Math.max(
0,
parseEnvNumber("OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS", 0),
);
const heapSnapshotMinIntervalMs = 5000;
const heapSnapshotEnabled =
process.platform !== "win32" &&
heapSnapshotIntervalMs >= heapSnapshotMinIntervalMs;
const heapSnapshotSignal = process.env.OPENCLAW_TEST_HEAPSNAPSHOT_SIGNAL?.trim() || "SIGUSR2";
const heapSnapshotBaseDir = heapSnapshotEnabled
? path.resolve(
process.env.OPENCLAW_TEST_HEAPSNAPSHOT_DIR?.trim() ||
path.join(os.tmpdir(), `openclaw-heapsnapshots-${Date.now()}`),
)
: null;
const ensureNodeOptionFlag = (nodeOptions, flagPrefix, nextValue) =>
nodeOptions.includes(flagPrefix) ? nodeOptions : `${nodeOptions} ${nextValue}`.trim();
const isNodeLikeProcess = (command) => /(?:^|\/)node(?:$|\.exe$)/iu.test(command);
const runOnce = (entry, extraArgs = []) =>
new Promise((resolve) => {
@@ -757,23 +780,44 @@ const runOnce = (entry, extraArgs = []) =>
(acc, flag) => (acc.includes(flag) ? acc : `${acc} ${flag}`.trim()),
nodeOptions,
);
-const heapFlag =
+const heapSnapshotDir =
+heapSnapshotBaseDir === null ? null : path.join(heapSnapshotBaseDir, entry.name);
+let resolvedNodeOptions =
maxOldSpaceSizeMb && !nextNodeOptions.includes("--max-old-space-size=")
-? `--max-old-space-size=${maxOldSpaceSizeMb}`
-: null;
-const resolvedNodeOptions = heapFlag
-? `${nextNodeOptions} ${heapFlag}`.trim()
-: nextNodeOptions;
+? `${nextNodeOptions} --max-old-space-size=${maxOldSpaceSizeMb}`.trim()
+: nextNodeOptions;
if (heapSnapshotEnabled && heapSnapshotDir) {
try {
fs.mkdirSync(heapSnapshotDir, { recursive: true });
} catch (err) {
console.error(`[test-parallel] failed to create heap snapshot dir ${heapSnapshotDir}: ${String(err)}`);
resolve(1);
return;
}
resolvedNodeOptions = ensureNodeOptionFlag(
resolvedNodeOptions,
"--diagnostic-dir=",
`--diagnostic-dir=${heapSnapshotDir}`,
);
resolvedNodeOptions = ensureNodeOptionFlag(
resolvedNodeOptions,
"--heapsnapshot-signal=",
`--heapsnapshot-signal=${heapSnapshotSignal}`,
);
}
}
let output = "";
let fatalSeen = false;
let childError = null;
let child;
let pendingLine = "";
let memoryPollTimer = null;
let heapSnapshotTimer = null;
const memoryFileRecords = [];
let initialTreeSample = null;
let latestTreeSample = null;
let peakTreeSample = null;
let heapSnapshotSequence = 0;
const updatePeakTreeSample = (sample, reason) => {
if (!sample) {
return;
@@ -782,6 +826,35 @@ const runOnce = (entry, extraArgs = []) =>
peakTreeSample = { ...sample, reason };
}
};
const triggerHeapSnapshot = (reason) => {
if (!heapSnapshotEnabled || !child?.pid || !heapSnapshotDir) {
return;
}
const records = getProcessTreeRecords(child.pid) ?? [];
const targetPids = records
.filter((record) => record.pid !== process.pid && isNodeLikeProcess(record.command))
.map((record) => record.pid);
if (targetPids.length === 0) {
return;
}
heapSnapshotSequence += 1;
let signaledCount = 0;
for (const pid of targetPids) {
try {
process.kill(pid, heapSnapshotSignal);
signaledCount += 1;
} catch {
// Process likely exited between ps sampling and signal delivery.
}
}
if (signaledCount > 0) {
console.log(
`[test-parallel][heap] ${entry.name} seq=${String(heapSnapshotSequence)} reason=${reason} signaled=${String(
signaledCount,
)}/${String(targetPids.length)} dir=${heapSnapshotDir}`,
);
}
};
const captureTreeSample = (reason) => {
if (!memoryTraceEnabled || !child?.pid) {
return null;
@@ -877,6 +950,11 @@ const runOnce = (entry, extraArgs = []) =>
captureTreeSample("poll");
}, memoryTracePollMs);
}
if (heapSnapshotEnabled) {
heapSnapshotTimer = setInterval(() => {
triggerHeapSnapshot("interval");
}, heapSnapshotIntervalMs);
}
} catch (err) {
console.error(`[test-parallel] spawn failed: ${String(err)}`);
resolve(1);
@@ -905,6 +983,9 @@ const runOnce = (entry, extraArgs = []) =>
if (memoryPollTimer) {
clearInterval(memoryPollTimer);
}
if (heapSnapshotTimer) {
clearInterval(heapSnapshotTimer);
}
children.delete(child);
const resolvedCode = resolveTestRunExitCode({ code, signal, output, fatalSeen, childError });
logMemoryTraceSummary();
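
The wrapper changes above come down to two steps: append `--diagnostic-dir` and `--heapsnapshot-signal` to the child's `NODE_OPTIONS` unless the caller already set them, then signal only Node-like PIDs on a timer. A minimal sketch of the option assembly and the PID filter follows. The two small helpers are copied from the diff; `withHeapSnapshotFlags` and the sample records are hypothetical, and no process is spawned or signaled here.

```javascript
// Same shape as the helpers added in the diff above.
const ensureNodeOptionFlag = (nodeOptions, flagPrefix, nextValue) =>
  nodeOptions.includes(flagPrefix) ? nodeOptions : `${nodeOptions} ${nextValue}`.trim();

const isNodeLikeProcess = (command) => /(?:^|\/)node(?:$|\.exe$)/iu.test(command);

// Hypothetical wrapper combining both flag additions.
function withHeapSnapshotFlags(nodeOptions, dir, signal) {
  let next = ensureNodeOptionFlag(nodeOptions, "--diagnostic-dir=", `--diagnostic-dir=${dir}`);
  next = ensureNodeOptionFlag(next, "--heapsnapshot-signal=", `--heapsnapshot-signal=${signal}`);
  return next;
}

const fresh = withHeapSnapshotFlags(
  "--max-old-space-size=6144",
  ".tmp/heapsnap/unit-fast",
  "SIGUSR2",
);
console.log(fresh);

// A signal flag that is already present is left alone.
const preset = withHeapSnapshotFlags("--heapsnapshot-signal=SIGUSR1", "/tmp/x", "SIGUSR2");
console.log(preset);

// Made-up process records; only exact "node" binaries should be targeted.
const sampleRecords = [
  { pid: 10, command: "node" },
  { pid: 11, command: "/usr/local/bin/node" },
  { pid: 12, command: "sh" },
  { pid: 13, command: "nodemon" },
];
const targetPids = sampleRecords
  .filter((record) => isNodeLikeProcess(record.command))
  .map((record) => record.pid);
console.log(targetPids); // [ 10, 11 ]
```

Filtering by command name matters because the default action for `SIGUSR2` is to terminate a process that has no handler installed, so shells and other helpers in the tree should not receive it.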