Skills: harden heap snapshot diffing

This commit is contained in:
Gustavo Madeira Santana
2026-03-30 19:36:12 -04:00
parent bbd495ed63
commit ca6432b0d9
2 changed files with 324 additions and 32 deletions

View File

@@ -1,11 +1,11 @@
---
name: openclaw-test-heap-leaks
description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, distinguish transformed-module retention from real data leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, triage likely transformed-module retention versus likely runtime leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
---
# OpenClaw Test Heap Leaks
Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available.
Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.
## Workflow
@@ -14,19 +14,23 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when
- `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
- Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
- If the report is about a specific shard or worker budget, preserve that shape.
- Before you analyze snapshots, identify the real lane names from `[test-parallel] start ...` lines or `pnpm test --plan`. Do not assume a single `unit-fast` lane; local plans often split into `unit-fast-batch-*`.
2. Wait for repeated snapshots before concluding anything.
- Take at least two intervals from the same lane.
- Compare snapshots from the same PID inside one lane directory such as `.tmp/heapsnap/unit-fast/`.
- Use `scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
- Compare snapshots from the same PID inside the real lane directory such as `.tmp/heapsnap/unit-fast-batch-2/`.
- Use `.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
- If the helper suggests transformed-module retention, confirm the top entries in DevTools retainers/dominators before calling it solved.
3. Classify the growth before choosing a fix.
- If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as retained module graph growth in long-lived workers.
- If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as likely retained module graph growth in long-lived workers.
- If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
- If the names are ambiguous, stop short of a confident label and inspect retainers/dominators in DevTools for the top deltas.
4. Fix the right layer.
- For retained transformed-module growth in shared workers:
- Move hotspot files out of `unit-fast` by updating `test/fixtures/test-parallel.behavior.json`.
- For likely retained transformed-module growth in shared workers:
- Prefer timing and hotspot-driven scheduling fixes first. Check whether the file is already represented in `test/fixtures/test-timings.unit.json` and whether `scripts/test-update-memory-hotspots.mjs` should refresh the measured hotspot manifest before hand-editing behavior overrides.
- Move hotspot files out of the real shared lane by updating `test/fixtures/test-parallel.behavior.json` only when timing-driven peeling is insufficient.
- Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
- If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
- For real leaks:
@@ -40,24 +44,24 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when
## Heuristics
- Do not call everything a leak. In this repo, large `unit-fast` growth can be a worker-lifetime problem rather than an application object leak.
- Do not call everything a leak. In this repo, large `unit-fast` or `unit-fast-batch-*` growth can be a worker-lifetime problem rather than an application object leak.
- `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
- The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
- When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition.
- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition, then confirm ambiguous calls with retainer evidence.
## Snapshot Comparison
- Direct comparison:
- `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot`
- Auto-select earliest/latest snapshots per PID within one lane:
- `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast`
- `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast-batch-2`
- Useful flags:
- `--top 40`
- `--min-kb 32`
- `--pid 16133`
Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak.
Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.
## Output Expectations
@@ -66,6 +70,6 @@ When using this skill, report:
- The exact reproduce command.
- Which lane and PID were compared.
- The dominant retained object families from the snapshot delta.
- Whether the issue is a real leak or shared-worker retained module growth.
- Whether the issue is a likely real leak or likely shared-worker retained module growth, plus whether retainers/dominators confirmed it.
- The concrete fix or impact-reduction patch.
- What you verified, and what snapshot overhead prevented you from verifying.

View File

@@ -64,6 +64,243 @@ function parseArgs(argv) {
return options;
}
class JsonStreamScanner {
constructor(filePath) {
this.stream = fs.createReadStream(filePath, {
encoding: "utf8",
highWaterMark: 1024 * 1024,
});
this.iterator = this.stream[Symbol.asyncIterator]();
this.buffer = "";
this.offset = 0;
this.done = false;
}
compactBuffer() {
if (this.offset > 65536) {
this.buffer = this.buffer.slice(this.offset);
this.offset = 0;
}
}
async ensureAvailable(count = 1) {
while (!this.done && this.buffer.length - this.offset < count) {
const next = await this.iterator.next();
if (next.done) {
this.done = true;
break;
}
this.buffer += next.value;
}
}
async peek() {
await this.ensureAvailable(1);
return this.buffer[this.offset] ?? null;
}
async next() {
await this.ensureAvailable(1);
if (this.offset >= this.buffer.length) {
return null;
}
const char = this.buffer[this.offset];
this.offset += 1;
this.compactBuffer();
return char;
}
async skipWhitespace() {
while (true) {
const char = await this.peek();
if (char === null || !/\s/u.test(char)) {
return;
}
await this.next();
}
}
async expectChar(expected) {
const char = await this.next();
if (char !== expected) {
fail(`Expected ${expected} but found ${char ?? "<eof>"}`);
}
}
async find(sequence) {
let matched = 0;
while (true) {
const char = await this.next();
if (char === null) {
fail(`Could not find ${sequence}`);
}
if (char === sequence[matched]) {
matched += 1;
if (matched === sequence.length) {
return;
}
continue;
}
matched = char === sequence[0] ? 1 : 0;
if (matched === sequence.length) {
return;
}
}
}
async readBalancedObject() {
const start = await this.next();
if (start !== "{") {
fail(`Expected { but found ${start ?? "<eof>"}`);
}
let text = "{";
let depth = 1;
let inString = false;
let escaped = false;
while (depth > 0) {
const char = await this.next();
if (char === null) {
fail("Unexpected EOF while reading JSON object");
}
text += char;
if (inString) {
if (escaped) {
escaped = false;
} else if (char === "\\") {
escaped = true;
} else if (char === '"') {
inString = false;
}
continue;
}
if (char === '"') {
inString = true;
} else if (char === "{") {
depth += 1;
} else if (char === "}") {
depth -= 1;
}
}
return text;
}
async parseNumberArray(onValue) {
await this.skipWhitespace();
await this.expectChar("[");
await this.skipWhitespace();
if ((await this.peek()) === "]") {
await this.next();
return;
}
let token = "";
let index = 0;
const flush = () => {
if (token.length === 0) {
fail("Unexpected empty number token");
}
const value = Number.parseInt(token, 10);
if (!Number.isFinite(value)) {
fail(`Invalid numeric token: ${token}`);
}
onValue(value, index);
index += 1;
token = "";
};
while (true) {
const char = await this.next();
if (char === null) {
fail("Unexpected EOF while reading number array");
}
if (char === "]") {
flush();
return;
}
if (char === ",") {
flush();
continue;
}
if (/\s/u.test(char)) {
continue;
}
token += char;
}
}
async readJsonString() {
await this.expectChar('"');
let value = "";
while (true) {
const char = await this.next();
if (char === null) {
fail("Unexpected EOF while reading JSON string");
}
if (char === '"') {
return value;
}
if (char !== "\\") {
value += char;
continue;
}
const escaped = await this.next();
if (escaped === null) {
fail("Unexpected EOF while reading JSON string escape");
}
if (escaped === "u") {
let hex = "";
for (let index = 0; index < 4; index += 1) {
const hexChar = await this.next();
if (hexChar === null) {
fail("Unexpected EOF while reading JSON unicode escape");
}
hex += hexChar;
}
value += String.fromCharCode(Number.parseInt(hex, 16));
continue;
}
value +=
escaped === "b"
? "\b"
: escaped === "f"
? "\f"
: escaped === "n"
? "\n"
: escaped === "r"
? "\r"
: escaped === "t"
? "\t"
: escaped;
}
}
async parseStringArray(onValue) {
await this.skipWhitespace();
await this.expectChar("[");
await this.skipWhitespace();
if ((await this.peek()) === "]") {
await this.next();
return;
}
let index = 0;
while (true) {
const value = await this.readJsonString();
onValue(value, index);
index += 1;
await this.skipWhitespace();
const separator = await this.next();
if (separator === "]") {
return;
}
if (separator !== ",") {
fail(`Expected , or ] but found ${separator ?? "<eof>"}`);
}
await this.skipWhitespace();
}
}
}
function parseHeapFilename(filePath) {
const base = path.basename(filePath);
const match = base.match(
@@ -151,38 +388,89 @@ function resolvePair(options) {
};
}
function loadSummary(filePath) {
const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
const meta = data.snapshot?.meta;
async function parseSnapshotMeta(scanner) {
await scanner.find('"snapshot":');
await scanner.skipWhitespace();
const metaObjectText = await scanner.readBalancedObject();
const parsed = JSON.parse(metaObjectText);
return parsed?.meta ?? null;
}
async function buildSummary(filePath) {
const scanner = new JsonStreamScanner(filePath);
const meta = await parseSnapshotMeta(scanner);
if (!meta) {
fail(`Invalid heap snapshot: ${filePath}`);
}
const nodeFieldCount = meta.node_fields.length;
const typeNames = meta.node_types[0];
const strings = data.strings;
const typeIndex = meta.node_fields.indexOf("type");
const nameIndex = meta.node_fields.indexOf("name");
const selfSizeIndex = meta.node_fields.indexOf("self_size");
if (typeIndex === -1 || nameIndex === -1 || selfSizeIndex === -1) {
fail(`Unsupported heap snapshot schema: ${filePath}`);
}
const summary = new Map();
for (let offset = 0; offset < data.nodes.length; offset += nodeFieldCount) {
const type = typeNames[data.nodes[offset + typeIndex]];
const name = strings[data.nodes[offset + nameIndex]];
const selfSize = data.nodes[offset + selfSizeIndex];
const key = `${type}\t${name}`;
const current = summary.get(key) ?? {
type,
name,
const summaryByIndex = new Map();
let nodeCount = 0;
let currentTypeId = 0;
let currentNameId = 0;
let currentSelfSize = 0;
await scanner.find('"nodes":');
await scanner.parseNumberArray((value, index) => {
const fieldIndex = index % nodeFieldCount;
if (fieldIndex === typeIndex) {
currentTypeId = value;
return;
}
if (fieldIndex === nameIndex) {
currentNameId = value;
return;
}
if (fieldIndex === selfSizeIndex) {
currentSelfSize = value;
}
if (fieldIndex !== nodeFieldCount - 1) {
return;
}
const key = `${currentTypeId}\t${currentNameId}`;
const current = summaryByIndex.get(key) ?? {
typeId: currentTypeId,
nameId: currentNameId,
selfSize: 0,
count: 0,
};
current.selfSize += selfSize;
current.selfSize += currentSelfSize;
current.count += 1;
summary.set(key, current);
summaryByIndex.set(key, current);
nodeCount += 1;
});
const requiredNameIds = new Set(
Array.from(summaryByIndex.values(), (entry) => entry.nameId).filter((value) => value >= 0),
);
const nameStrings = new Map();
await scanner.find('"strings":');
await scanner.parseStringArray((value, index) => {
if (requiredNameIds.has(index)) {
nameStrings.set(index, value);
}
});
const summary = new Map();
for (const entry of summaryByIndex.values()) {
const key = `${typeNames[entry.typeId] ?? "unknown"}\t${nameStrings.get(entry.nameId) ?? ""}`;
summary.set(key, {
type: typeNames[entry.typeId] ?? "unknown",
name: nameStrings.get(entry.nameId) ?? "",
selfSize: entry.selfSize,
count: entry.count,
});
}
return {
nodeCount: data.snapshot.node_count,
nodeCount,
summary,
};
}
@@ -205,11 +493,11 @@ function truncate(text, maxLength) {
return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}`;
}
function main() {
async function main() {
const options = parseArgs(process.argv.slice(2));
const pair = resolvePair(options);
const before = loadSummary(pair.before);
const after = loadSummary(pair.after);
const before = await buildSummary(pair.before);
const after = await buildSummary(pair.after);
const minBytes = options.minKb * 1024;
const rows = [];
@@ -262,4 +550,4 @@ function main() {
}
}
main();
await main();