Skills: harden heap snapshot diffing

2026-04-21 14:11:26 +00:00 · 2026-03-30 19:36:12 -04:00
parent bbd495ed63
commit ca6432b0d9
2 changed files with 324 additions and 32 deletions
--- a/.agents/skills/openclaw-test-heap-leaks/SKILL.md
+++ b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
@@ -1,11 +1,11 @@
 ---
 name: openclaw-test-heap-leaks
-description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, distinguish transformed-module retention from real data leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
+description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, triage likely transformed-module retention versus likely runtime leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
 ---

 # OpenClaw Test Heap Leaks

-Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available.
+Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.

 ## Workflow

@@ -14,19 +14,23 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when
   - `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
   - Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
   - If the report is about a specific shard or worker budget, preserve that shape.
+   - Before you analyze snapshots, identify the real lane names from `[test-parallel] start ...` lines or `pnpm test --plan`. Do not assume a single `unit-fast` lane; local plans often split into `unit-fast-batch-*`.

 2. Wait for repeated snapshots before concluding anything.
   - Take at least two intervals from the same lane.
-   - Compare snapshots from the same PID inside one lane directory such as `.tmp/heapsnap/unit-fast/`.
-   - Use `scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
+   - Compare snapshots from the same PID inside the real lane directory such as `.tmp/heapsnap/unit-fast-batch-2/`.
+   - Use `.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
+   - If the helper suggests transformed-module retention, confirm the top entries in DevTools retainers/dominators before calling it solved.

 3. Classify the growth before choosing a fix.
-   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as retained module graph growth in long-lived workers.
+   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as likely retained module graph growth in long-lived workers.
   - If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
+   - If the names are ambiguous, stop short of a confident label and inspect retainers/dominators in DevTools for the top deltas.

 4. Fix the right layer.
-   - For retained transformed-module growth in shared workers:
-   - Move hotspot files out of `unit-fast` by updating `test/fixtures/test-parallel.behavior.json`.
+   - For likely retained transformed-module growth in shared workers:
+   - Prefer timing and hotspot-driven scheduling fixes first. Check whether the file is already represented in `test/fixtures/test-timings.unit.json` and whether `scripts/test-update-memory-hotspots.mjs` should refresh the measured hotspot manifest before hand-editing behavior overrides.
+   - Move hotspot files out of the real shared lane by updating `test/fixtures/test-parallel.behavior.json` only when timing-driven peeling is insufficient.
   - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
   - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
   - For real leaks:
@@ -40,24 +44,24 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when

 ## Heuristics

-- Do not call everything a leak. In this repo, large `unit-fast` growth can be a worker-lifetime problem rather than an application object leak.
+- Do not call everything a leak. In this repo, large `unit-fast` or `unit-fast-batch-*` growth can be a worker-lifetime problem rather than an application object leak.
 - `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
 - The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
 - When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
-- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition.
+- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition, then confirm ambiguous calls with retainer evidence.

 ## Snapshot Comparison

 - Direct comparison:
  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot`
 - Auto-select earliest/latest snapshots per PID within one lane:
-  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast`
+  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast-batch-2`
 - Useful flags:
  - `--top 40`
  - `--min-kb 32`
  - `--pid 16133`

-Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak.
+Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.

 ## Output Expectations

@@ -66,6 +70,6 @@ When using this skill, report:
 - The exact reproduce command.
 - Which lane and PID were compared.
 - The dominant retained object families from the snapshot delta.
-- Whether the issue is a real leak or shared-worker retained module growth.
+- Whether the issue is a likely real leak or likely shared-worker retained module growth, plus whether retainers/dominators confirmed it.
 - The concrete fix or impact-reduction patch.
 - What you verified, and what snapshot overhead prevented you from verifying.
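Note on the "Snapshot Comparison" guidance in SKILL.md above: the advice to read the top positive deltas first can be illustrated with a small sketch. It is not the code in `heapsnapshot-delta.mjs`. The helper name `topPositiveDeltas` and the sample entries are hypothetical; only the `type\tname` key shape, the `selfSize`/`count` fields, and the `--top 40` / `--min-kb 32` values come from this commit.

```javascript
// Hypothetical helper: rank per-(type, name) self-size growth between two
// summaries shaped like the Map that buildSummary() returns.
function topPositiveDeltas(before, after, { top = 40, minBytes = 32 * 1024 } = {}) {
  const rows = [];
  for (const [key, next] of after) {
    // A key missing from the earlier snapshot counts as growth from zero.
    const prev = before.get(key) ?? { selfSize: 0, count: 0 };
    const sizeDelta = next.selfSize - prev.selfSize;
    if (sizeDelta < minBytes) {
      continue; // below the noise floor, or not growth at all
    }
    rows.push({
      type: next.type,
      name: next.name,
      sizeDelta,
      countDelta: next.count - prev.count,
    });
  }
  rows.sort((a, b) => b.sizeDelta - a.sizeDelta);
  return rows.slice(0, top);
}

// Synthetic summaries; the names are made up for illustration.
const before = new Map([
  ["string\ttransformed source", { type: "string", name: "transformed source", selfSize: 100000, count: 10 }],
  ["object\tSessionCache", { type: "object", name: "SessionCache", selfSize: 50000, count: 5 }],
]);
const after = new Map([
  ["string\ttransformed source", { type: "string", name: "transformed source", selfSize: 400000, count: 40 }],
  ["object\tSessionCache", { type: "object", name: "SessionCache", selfSize: 60000, count: 6 }],
  ["closure\thandler", { type: "closure", name: "handler", selfSize: 80000, count: 100 }],
]);

const rows = topPositiveDeltas(before, after);
console.log(rows);
```

In this synthetic pair the string family dominates, which under the skill's heuristics would point toward likely transformed-module retention. The `SessionCache` growth falls under the 32 KB floor and is dropped. Names alone remain triage evidence, so retainers or dominators should still back the call.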
--- a/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
+++ b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
@@ -64,6 +64,243 @@ function parseArgs(argv) {
  return options;
 }

+class JsonStreamScanner {
+  constructor(filePath) {
+    this.stream = fs.createReadStream(filePath, {
+      encoding: "utf8",
+      highWaterMark: 1024 * 1024,
+    });
+    this.iterator = this.stream[Symbol.asyncIterator]();
+    this.buffer = "";
+    this.offset = 0;
+    this.done = false;
+  }
+
+  compactBuffer() {
+    if (this.offset > 65536) {
+      this.buffer = this.buffer.slice(this.offset);
+      this.offset = 0;
+    }
+  }
+
+  async ensureAvailable(count = 1) {
+    while (!this.done && this.buffer.length - this.offset < count) {
+      const next = await this.iterator.next();
+      if (next.done) {
+        this.done = true;
+        break;
+      }
+      this.buffer += next.value;
+    }
+  }
+
+  async peek() {
+    await this.ensureAvailable(1);
+    return this.buffer[this.offset] ?? null;
+  }
+
+  async next() {
+    await this.ensureAvailable(1);
+    if (this.offset >= this.buffer.length) {
+      return null;
+    }
+    const char = this.buffer[this.offset];
+    this.offset += 1;
+    this.compactBuffer();
+    return char;
+  }
+
+  async skipWhitespace() {
+    while (true) {
+      const char = await this.peek();
+      if (char === null || !/\s/u.test(char)) {
+        return;
+      }
+      await this.next();
+    }
+  }
+
+  async expectChar(expected) {
+    const char = await this.next();
+    if (char !== expected) {
+      fail(`Expected ${expected} but found ${char ?? "<eof>"}`);
+    }
+  }
+
+  async find(sequence) {
+    let matched = 0;
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail(`Could not find ${sequence}`);
+      }
+      if (char === sequence[matched]) {
+        matched += 1;
+        if (matched === sequence.length) {
+          return;
+        }
+        continue;
+      }
+      matched = char === sequence[0] ? 1 : 0;
+      if (matched === sequence.length) {
+        return;
+      }
+    }
+  }
+
+  async readBalancedObject() {
+    const start = await this.next();
+    if (start !== "{") {
+      fail(`Expected { but found ${start ?? "<eof>"}`);
+    }
+    let text = "{";
+    let depth = 1;
+    let inString = false;
+    let escaped = false;
+    while (depth > 0) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading JSON object");
+      }
+      text += char;
+      if (inString) {
+        if (escaped) {
+          escaped = false;
+        } else if (char === "\\") {
+          escaped = true;
+        } else if (char === '"') {
+          inString = false;
+        }
+        continue;
+      }
+      if (char === '"') {
+        inString = true;
+      } else if (char === "{") {
+        depth += 1;
+      } else if (char === "}") {
+        depth -= 1;
+      }
+    }
+    return text;
+  }
+
+  async parseNumberArray(onValue) {
+    await this.skipWhitespace();
+    await this.expectChar("[");
+    await this.skipWhitespace();
+    if ((await this.peek()) === "]") {
+      await this.next();
+      return;
+    }
+
+    let token = "";
+    let index = 0;
+    const flush = () => {
+      if (token.length === 0) {
+        fail("Unexpected empty number token");
+      }
+      const value = Number.parseInt(token, 10);
+      if (!Number.isFinite(value)) {
+        fail(`Invalid numeric token: ${token}`);
+      }
+      onValue(value, index);
+      index += 1;
+      token = "";
+    };
+
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading number array");
+      }
+      if (char === "]") {
+        flush();
+        return;
+      }
+      if (char === ",") {
+        flush();
+        continue;
+      }
+      if (/\s/u.test(char)) {
+        continue;
+      }
+      token += char;
+    }
+  }
+
+  async readJsonString() {
+    await this.expectChar('"');
+    let value = "";
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading JSON string");
+      }
+      if (char === '"') {
+        return value;
+      }
+      if (char !== "\\") {
+        value += char;
+        continue;
+      }
+      const escaped = await this.next();
+      if (escaped === null) {
+        fail("Unexpected EOF while reading JSON string escape");
+      }
+      if (escaped === "u") {
+        let hex = "";
+        for (let index = 0; index < 4; index += 1) {
+          const hexChar = await this.next();
+          if (hexChar === null) {
+            fail("Unexpected EOF while reading JSON unicode escape");
+          }
+          hex += hexChar;
+        }
+        value += String.fromCharCode(Number.parseInt(hex, 16));
+        continue;
+      }
+      value +=
+        escaped === "b"
+          ? "\b"
+          : escaped === "f"
+            ? "\f"
+            : escaped === "n"
+              ? "\n"
+              : escaped === "r"
+                ? "\r"
+                : escaped === "t"
+                  ? "\t"
+                  : escaped;
+    }
+  }
+
+  async parseStringArray(onValue) {
+    await this.skipWhitespace();
+    await this.expectChar("[");
+    await this.skipWhitespace();
+    if ((await this.peek()) === "]") {
+      await this.next();
+      return;
+    }
+
+    let index = 0;
+    while (true) {
+      const value = await this.readJsonString();
+      onValue(value, index);
+      index += 1;
+      await this.skipWhitespace();
+      const separator = await this.next();
+      if (separator === "]") {
+        return;
+      }
+      if (separator !== ",") {
+        fail(`Expected , or ] but found ${separator ?? "<eof>"}`);
+      }
+      await this.skipWhitespace();
+    }
+  }
+}
+
 function parseHeapFilename(filePath) {
  const base = path.basename(filePath);
  const match = base.match(
@@ -151,38 +388,89 @@ function resolvePair(options) {
  };
 }

-function loadSummary(filePath) {
-  const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
-  const meta = data.snapshot?.meta;
+async function parseSnapshotMeta(scanner) {
+  await scanner.find('"snapshot":');
+  await scanner.skipWhitespace();
+  const metaObjectText = await scanner.readBalancedObject();
+  const parsed = JSON.parse(metaObjectText);
+  return parsed?.meta ?? null;
+}
+
+async function buildSummary(filePath) {
+  const scanner = new JsonStreamScanner(filePath);
+  const meta = await parseSnapshotMeta(scanner);
  if (!meta) {
    fail(`Invalid heap snapshot: ${filePath}`);
  }

  const nodeFieldCount = meta.node_fields.length;
  const typeNames = meta.node_types[0];
-  const strings = data.strings;
  const typeIndex = meta.node_fields.indexOf("type");
  const nameIndex = meta.node_fields.indexOf("name");
  const selfSizeIndex = meta.node_fields.indexOf("self_size");
+  if (typeIndex === -1 || nameIndex === -1 || selfSizeIndex === -1) {
+    fail(`Unsupported heap snapshot schema: ${filePath}`);
+  }

-  const summary = new Map();
-  for (let offset = 0; offset < data.nodes.length; offset += nodeFieldCount) {
-    const type = typeNames[data.nodes[offset + typeIndex]];
-    const name = strings[data.nodes[offset + nameIndex]];
-    const selfSize = data.nodes[offset + selfSizeIndex];
-    const key = `${type}\t${name}`;
-    const current = summary.get(key) ?? {
-      type,
-      name,
+  const summaryByIndex = new Map();
+  let nodeCount = 0;
+  let currentTypeId = 0;
+  let currentNameId = 0;
+  let currentSelfSize = 0;
+  await scanner.find('"nodes":');
+  await scanner.parseNumberArray((value, index) => {
+    const fieldIndex = index % nodeFieldCount;
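+    // Record the tracked fields without returning early, so a node is still
+    // finalized below even if a tracked field is last in the schema.
+    if (fieldIndex === typeIndex) {
+      currentTypeId = value;
+    }
+    if (fieldIndex === nameIndex) {
+      currentNameId = value;
+    }
+    if (fieldIndex === selfSizeIndex) {
+      currentSelfSize = value;
+    }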
+    if (fieldIndex !== nodeFieldCount - 1) {
+      return;
+    }
+    const key = `${currentTypeId}\t${currentNameId}`;
+    const current = summaryByIndex.get(key) ?? {
+      typeId: currentTypeId,
+      nameId: currentNameId,
      selfSize: 0,
      count: 0,
    };
-    current.selfSize += selfSize;
+    current.selfSize += currentSelfSize;
    current.count += 1;
-    summary.set(key, current);
+    summaryByIndex.set(key, current);
+    nodeCount += 1;
+  });
+
+  const requiredNameIds = new Set(
+    Array.from(summaryByIndex.values(), (entry) => entry.nameId).filter((value) => value >= 0),
+  );
+  const nameStrings = new Map();
+  await scanner.find('"strings":');
+  await scanner.parseStringArray((value, index) => {
+    if (requiredNameIds.has(index)) {
+      nameStrings.set(index, value);
+    }
+  });
+
+  const summary = new Map();
+  for (const entry of summaryByIndex.values()) {
+    const key = `${typeNames[entry.typeId] ?? "unknown"}\t${nameStrings.get(entry.nameId) ?? ""}`;
+    summary.set(key, {
+      type: typeNames[entry.typeId] ?? "unknown",
+      name: nameStrings.get(entry.nameId) ?? "",
+      selfSize: entry.selfSize,
+      count: entry.count,
+    });
  }
+
  return {
-    nodeCount: data.snapshot.node_count,
+    nodeCount,
    summary,
  };
 }
@@ -205,11 +493,11 @@ function truncate(text, maxLength) {
  return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}…`;
 }

-function main() {
+async function main() {
  const options = parseArgs(process.argv.slice(2));
  const pair = resolvePair(options);
-  const before = loadSummary(pair.before);
-  const after = loadSummary(pair.after);
+  const before = await buildSummary(pair.before);
+  const after = await buildSummary(pair.after);
  const minBytes = options.minKb * 1024;

  const rows = [];
@@ -262,4 +550,4 @@ function main() {
  }
 }

-main();
+await main();
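Note on the `buildSummary` change above: the rewrite appears to aggregate nodes by numeric type and name ids as values stream in, then resolve only the name strings it needs. A minimal in-memory sketch of that two-phase idea follows. It feeds values from a plain array instead of `JsonStreamScanner`. The helper names `createNodeAccumulator` and `resolveNames`, the abbreviated `node_types` list, and the sample nodes are all assumptions made for illustration.

```javascript
// Phase 1: consume the flat "nodes" array one value at a time, keyed by ids.
function createNodeAccumulator(meta) {
  const stride = meta.node_fields.length;
  const typeIndex = meta.node_fields.indexOf("type");
  const nameIndex = meta.node_fields.indexOf("name");
  const selfSizeIndex = meta.node_fields.indexOf("self_size");
  const byIds = new Map();
  let typeId = 0;
  let nameId = 0;
  let selfSize = 0;
  return {
    byIds,
    onValue(value, index) {
      const field = index % stride;
      if (field === typeIndex) typeId = value;
      if (field === nameIndex) nameId = value;
      if (field === selfSizeIndex) selfSize = value;
      if (field !== stride - 1) return; // node not complete yet
      const key = `${typeId}\t${nameId}`;
      const entry = byIds.get(key) ?? { typeId, nameId, selfSize: 0, count: 0 };
      entry.selfSize += selfSize;
      entry.count += 1;
      byIds.set(key, entry);
    },
  };
}

// Phase 2: translate ids to readable keys once the strings are available.
function resolveNames(byIds, typeNames, strings) {
  const summary = new Map();
  for (const entry of byIds.values()) {
    const type = typeNames[entry.typeId] ?? "unknown";
    const name = strings[entry.nameId] ?? "";
    summary.set(`${type}\t${name}`, { type, name, selfSize: entry.selfSize, count: entry.count });
  }
  return summary;
}

// Synthetic snapshot fragment (abbreviated type list, made-up nodes).
const meta = {
  node_fields: ["type", "name", "id", "self_size", "edge_count", "trace_node_id", "detachedness"],
  node_types: [["hidden", "array", "string", "object"]],
};
const nodes = [
  3, 1, 1, 64, 0, 0, 0, // object "Foo", 64 bytes
  3, 1, 3, 32, 0, 0, 0, // object "Foo", 32 bytes
  2, 2, 5, 100, 0, 0, 0, // string "bar", 100 bytes
];
const strings = ["", "Foo", "bar"];

const accumulator = createNodeAccumulator(meta);
nodes.forEach((value, index) => accumulator.onValue(value, index));
const summary = resolveNames(accumulator.byIds, meta.node_types[0], strings);
console.log(summary);
```

Keying by ids first means the accumulator never needs the `strings` table while nodes stream past, which is presumably why the script can stop calling `JSON.parse` on the whole snapshot file.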