diff --git a/CHANGELOG.md b/CHANGELOG.md index b02eed075c6..69cf9bdf3c1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -45,6 +45,8 @@ Docs: https://docs.openclaw.ai - Discord: own the Carbon interaction listener and hand off Discord slash/component handling asynchronously, so compaction or long session locks no longer trip `InteractionEventListener` listener timeouts. Fixes #73204. Thanks @slideshow-dingo. - Compaction/diagnostics: keep unknown compaction failure classifications stable while logging sanitized detail for unclassified provider errors such as missing Ollama provider adapters. Thanks @gzsiang. - Models/fallbacks: record first-class `model.fallback_step` trajectory events with from/to models, failure detail, chain position, and final outcome so support exports preserve the primary model failure even when a later fallback also fails. Fixes #71744. Thanks @nikolaykazakovvs-ux. +- Gateway/agents: block agent `exec` from launching interactive `openclaw channels login` flows and abort active agent runs after invalid-config recovery restores last-known-good config, preventing known channel-login and reload paths from wedging replies. Refs #72338. Thanks @midhunmonachan. +- Gateway/diagnostics: emit payload-free liveness warnings with event-loop delay, event-loop utilization, CPU-core ratio, and active-session counts so live-but-stalled Gateways capture CPU-spin context in stability bundles. Refs #72338. Thanks @midhunmonachan and @DougButdorf. - Gateway/startup: keep value-option foreground starts on the gateway fast path and skip proxy bootstrap unless proxy env is configured, reducing normal gateway startup RSS and avoiding full CLI graph loading. Thanks @vincentkoc. - Heartbeat/models: show heartbeat model bleed guidance on context-overflow resets when the last runtime model matches configured `heartbeat.model`, so smaller local heartbeat models point users to `isolatedSession` or `lightContext` instead of only compaction-buffer tuning. Fixes #67314. 
Thanks @Knightmare6890. - Subagents/models: persist `sessions_spawn.model` and configured subagent models as child-session model overrides before the first turn, so spawned subagents actually run on the requested provider/model instead of reverting to the target agent default. Fixes #73180. Thanks @danielzinhu99. diff --git a/docs/cli/channels.md b/docs/cli/channels.md index 230c0de2131..3af98b8be85 100644 --- a/docs/cli/channels.md +++ b/docs/cli/channels.md @@ -92,6 +92,7 @@ openclaw channels logout --channel whatsapp - `channels login` supports `--verbose`. - `channels login` and `logout` can infer the channel when only one supported login target is configured. +- Run `channels login` from a terminal on the gateway host. Agent `exec` blocks this interactive login flow; channel-native agent login tools, such as `whatsapp_login`, should be used from chat when available. ## Troubleshooting diff --git a/docs/gateway/diagnostics.md b/docs/gateway/diagnostics.md index 0a558d7d5d5..02c5f8fae44 100644 --- a/docs/gateway/diagnostics.md +++ b/docs/gateway/diagnostics.md @@ -71,6 +71,12 @@ export keeps only that a message was omitted and the byte count. The Gateway records a bounded, payload-free stability stream by default when diagnostics are enabled. It is for operational facts, not content. +The same diagnostic heartbeat records liveness warnings when the Gateway keeps +running but the Node.js event loop or CPU looks saturated. These +`diagnostic.liveness.warning` events include event-loop delay, event-loop +utilization, CPU-core ratio, and active/waiting/queued session counts. They do +not restart the Gateway by themselves. + Inspect the live recorder: ```bash diff --git a/docs/gateway/health.md b/docs/gateway/health.md index 59a599978ce..9bf1681652b 100644 --- a/docs/gateway/health.md +++ b/docs/gateway/health.md @@ -24,7 +24,7 @@ Short guide to verify channel connectivity without guessing. 
- Creds on disk: `ls -l ~/.openclaw/credentials/whatsapp//creds.json` (mtime should be recent). - Session store: `ls -l ~/.openclaw/agents//sessions/sessions.json` (path can be overridden in config). Count and recent recipients are surfaced via `status`. - Relink flow: `openclaw channels logout && openclaw channels login --verbose` when status codes 409–515 or `loggedOut` appear in logs. (Note: the QR login flow auto-restarts once for status 515 after pairing.) -- Diagnostics are enabled by default. The gateway records operational facts unless `diagnostics.enabled: false` is set. Memory events record RSS/heap byte counts, threshold pressure, and growth pressure. Oversized-payload events record what was rejected, truncated, or chunked, plus sizes and limits when available. They do not record the message text, attachment contents, webhook body, raw request or response body, tokens, cookies, or secret values. The same heartbeat starts the bounded stability recorder, which is available through `openclaw gateway stability` or the `diagnostics.stability` Gateway RPC. Fatal Gateway exits, shutdown timeouts, and restart startup failures persist the latest recorder snapshot under `~/.openclaw/logs/stability/` when events exist; inspect the newest saved bundle with `openclaw gateway stability --bundle latest`. +- Diagnostics are enabled by default. The gateway records operational facts unless `diagnostics.enabled: false` is set. Memory events record RSS/heap byte counts, threshold pressure, and growth pressure. Liveness warnings record event-loop delay, event-loop utilization, CPU-core ratio, and active/waiting/queued session counts when the process is running but saturated. Oversized-payload events record what was rejected, truncated, or chunked, plus sizes and limits when available. They do not record the message text, attachment contents, webhook body, raw request or response body, tokens, cookies, or secret values. 
The same heartbeat starts the bounded stability recorder, which is available through `openclaw gateway stability` or the `diagnostics.stability` Gateway RPC. Fatal Gateway exits, shutdown timeouts, and restart startup failures persist the latest recorder snapshot under `~/.openclaw/logs/stability/` when events exist; inspect the newest saved bundle with `openclaw gateway stability --bundle latest`. - For bug reports, run `openclaw gateway diagnostics export` and attach the generated zip. The export combines a Markdown summary, the newest stability bundle, sanitized log metadata, sanitized Gateway status/health snapshots, and config shape. It is meant to be shared: chat text, webhook bodies, tool outputs, credentials, cookies, account/message identifiers, and secret values are omitted or redacted. See [Diagnostics Export](/gateway/diagnostics). ## Health monitor config diff --git a/docs/tools/exec.md b/docs/tools/exec.md index 69bf097ba36..d8fcafcd1d5 100644 --- a/docs/tools/exec.md +++ b/docs/tools/exec.md @@ -78,6 +78,7 @@ Notes: - Host execution (`gateway`/`node`) rejects `env.PATH` and loader overrides (`LD_*`/`DYLD_*`) to prevent binary hijacking or injected code. - OpenClaw sets `OPENCLAW_SHELL=exec` in the spawned command environment (including PTY and sandbox execution) so shell/profile rules can detect exec-tool context. +- `openclaw channels login` is blocked from `exec` because it is an interactive channel-auth flow; run it in a terminal on the gateway host, or use the channel-native login tool from chat when one exists. - Important: sandboxing is **off by default**. If sandboxing is off, implicit `host=auto` resolves to `gateway`. Explicit `host=sandbox` still fails closed instead of silently running on the gateway host. Enable sandboxing or use `host=gateway` with approvals. 
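The exec note above can be illustrated with a deliberately simplified sketch of the command check. This is not the guard added in `src/agents/bash-tools.exec.ts`: it assumes plain whitespace tokenization instead of real shell parsing, it skips the `npx`/`bunx` option handling, and it does not unwrap shell wrappers such as `sudo ... bash -lc '...'`. The names `baseName` and `isChannelsLoginCommand` are hypothetical.

```typescript
// Simplified sketch: whitespace tokenization stands in for real shell parsing,
// and only a `<runner> [exec|dlx|run] openclaw` prefix is stripped.
const PACKAGE_RUNNERS = new Set(["pnpm", "npm", "yarn"]);
const RUNNER_SUBCOMMANDS = new Set(["exec", "dlx", "run"]);

// Reduce a token such as "/usr/local/bin/OpenClaw.cmd" to "openclaw".
function baseName(token: string | undefined): string {
  if (!token) {
    return "";
  }
  const last = token.split(/[\\/]/).pop() ?? "";
  return last.toLowerCase().replace(/\.(?:cmd|exe)$/, "");
}

// True when the command is `openclaw channels login` (or the `channel` alias),
// optionally launched through a package runner.
function isChannelsLoginCommand(raw: string): boolean {
  let argv = raw
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  if (PACKAGE_RUNNERS.has(baseName(argv[0]))) {
    const skip = RUNNER_SUBCOMMANDS.has(argv[1] ?? "") ? 2 : 1;
    argv = argv.slice(skip);
  }
  return (
    baseName(argv[0]) === "openclaw" &&
    (argv[1] === "channels" || argv[1] === "channel") &&
    argv[2] === "login"
  );
}
```

Matching on the normalized base name rather than the raw first token is what lets path-qualified and Windows-suffixed launchers hit the same check.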
diff --git a/src/agents/bash-tools.exec.script-preflight.test.ts b/src/agents/bash-tools.exec.script-preflight.test.ts index de6426bacfc..a6bdc32b417 100644 --- a/src/agents/bash-tools.exec.script-preflight.test.ts +++ b/src/agents/bash-tools.exec.script-preflight.test.ts @@ -24,6 +24,7 @@ const isWin = process.platform === "win32"; const describeNonWin = isWin ? describe.skip : describe; const describeWin = isWin ? describe : describe.skip; +const parseOpenClawChannelsLoginShellCommand = __testing.parseOpenClawChannelsLoginShellCommand; const validateExecScriptPreflight = __testing.validateScriptFileForShellBleed; const createPreflightTool = () => createExecTool({ host: "gateway", security: "full", ask: "on-miss" }); @@ -66,6 +67,35 @@ async function expectSymlinkSwapDuringPreflightToAvoidErrors(params: { }); } +describe("exec interactive OpenClaw channel login guard", () => { + it("recognizes direct and package-runner channel login commands before execution", () => { + expect( + parseOpenClawChannelsLoginShellCommand("openclaw channels login --channel whatsapp"), + ).toBe(true); + expect( + parseOpenClawChannelsLoginShellCommand( + "pnpm exec openclaw channels login --channel whatsapp --verbose", + ), + ).toBe(true); + expect(parseOpenClawChannelsLoginShellCommand("openclaw channels status --deep")).toBe(false); + }); + + it("blocks interactive channel login commands from exec", async () => { + const tool = createPreflightTool(); + + await expect( + tool.execute("call-openclaw-channel-login", { + command: "openclaw channels login --channel whatsapp --verbose", + }), + ).rejects.toThrow(/exec cannot run interactive OpenClaw channel login commands/); + await expect( + tool.execute("call-wrapped-openclaw-channel-login", { + command: "sudo -u openclaw bash -lc 'openclaw channels login --channel whatsapp'", + }), + ).rejects.toThrow(/exec cannot run interactive OpenClaw channel login commands/); + }); +}); + describeNonWin("exec script preflight", () => { it("blocks 
shell env var injection tokens in python scripts before execution", async () => { await withTempDir("openclaw-exec-preflight-", async (tmp) => { diff --git a/src/agents/bash-tools.exec.ts b/src/agents/bash-tools.exec.ts index 871fcb947f0..05e93c87c9c 100644 --- a/src/agents/bash-tools.exec.ts +++ b/src/agents/bash-tools.exec.ts @@ -1074,7 +1074,69 @@ function parseExecApprovalShellCommand(raw: string): ParsedExecApprovalCommand | }; } -function rejectExecApprovalShellCommand(command: string): void { +function normalizeCommandBaseName(token: string | undefined): string { + if (!token) { + return ""; + } + const base = normalizeLowercaseStringOrEmpty(token.split(/[\\/]/u).at(-1)); + return base.replace(/\.(?:cmd|exe)$/u, ""); +} + +function stripOpenClawPackageRunner(argv: string[]): string[] { + const commandName = normalizeCommandBaseName(argv[0]); + if (commandName === "openclaw") { + return argv; + } + if ( + (commandName === "pnpm" || commandName === "npm" || commandName === "yarn") && + normalizeCommandBaseName(argv[1]) === "openclaw" + ) { + return argv.slice(1); + } + if ( + (commandName === "pnpm" || commandName === "npm" || commandName === "yarn") && + (argv[1] === "exec" || argv[1] === "dlx" || argv[1] === "run") && + normalizeCommandBaseName(argv[2]) === "openclaw" + ) { + return argv.slice(2); + } + if (commandName === "npx" || commandName === "bunx") { + let idx = 1; + while (idx < argv.length) { + const token = argv[idx]; + if (token === "--") { + idx += 1; + break; + } + if (!token.startsWith("-") || token === "-") { + break; + } + idx += 1; + if ((token === "-p" || token === "--package") && idx < argv.length) { + idx += 1; + } + } + if (normalizeCommandBaseName(argv[idx]) === "openclaw") { + return argv.slice(idx); + } + } + return argv; +} + +function parseOpenClawChannelsLoginShellCommand(raw: string): boolean { + const argv = splitShellArgs(raw); + if (!argv) { + return false; + } + const openclawArgv = stripOpenClawPackageRunner(argv); + return ( 
+ normalizeCommandBaseName(openclawArgv[0]) === "openclaw" && + (openclawArgv[1] === "channels" || openclawArgv[1] === "channel") && + openclawArgv[2] === "login" + ); +} + +function rejectUnsafeControlShellCommand(command: string): void { const isEnvAssignmentToken = (token: string): boolean => /^[A-Za-z_][A-Za-z0-9_]*=.*$/u.test(token); const shellWrappers = new Set(["bash", "dash", "fish", "ksh", "sh", "zsh"]); @@ -1295,15 +1357,22 @@ function rejectExecApprovalShellCommand(command: string): void { return argv ? buildCandidates(argv) : [line]; }); for (const candidate of candidates) { - if (!parseExecApprovalShellCommand(candidate)) { - continue; + if (parseExecApprovalShellCommand(candidate)) { + throw new Error( + [ + "exec cannot run /approve commands.", + "Show the /approve command to the user as chat text, or route it through the approval command handler instead of shell execution.", + ].join(" "), + ); + } + if (parseOpenClawChannelsLoginShellCommand(candidate)) { + throw new Error( + [ + "exec cannot run interactive OpenClaw channel login commands.", + "Run `openclaw channels login` in a terminal on the gateway host, or use the channel-specific login agent tool when available (for WhatsApp: `whatsapp_login`).", + ].join(" "), + ); } - throw new Error( - [ - "exec cannot run /approve commands.", - "Show the /approve command to the user as chat text, or route it through the approval command handler instead of shell execution.", - ].join(" "), - ); } } @@ -1532,7 +1601,7 @@ export function createExecTool( const rawWorkdir = explicitWorkdir ?? defaultWorkdir ?? 
process.cwd(); workdir = resolveWorkdir(rawWorkdir, warnings); } - rejectExecApprovalShellCommand(params.command); + rejectUnsafeControlShellCommand(params.command); const inheritedBaseEnv = coerceEnv(process.env); const hostEnvResult = @@ -1813,5 +1882,6 @@ export function createExecTool( export const execTool = createExecTool(); export const __testing = { + parseOpenClawChannelsLoginShellCommand, validateScriptFileForShellBleed, }; diff --git a/src/gateway/server-reload-handlers.test.ts b/src/gateway/server-reload-handlers.test.ts new file mode 100644 index 00000000000..7b37b390719 --- /dev/null +++ b/src/gateway/server-reload-handlers.test.ts @@ -0,0 +1,51 @@ +import { afterEach, describe, expect, it, vi } from "vitest"; +import { + __testing as embeddedRunTesting, + clearActiveEmbeddedRun, + setActiveEmbeddedRun, + type EmbeddedPiQueueHandle, +} from "../agents/pi-embedded-runner/runs.js"; +import { __testing } from "./server-reload-handlers.js"; + +describe("gateway reload recovery handlers", () => { + afterEach(() => { + embeddedRunTesting.resetActiveEmbeddedRuns(); + }); + + it("aborts active agent runs after last-known-good config recovery", () => { + const sessionId = "config-recovery-session"; + const sessionKey = "agent:main:telegram:direct:123"; + let handle!: EmbeddedPiQueueHandle; + handle = { + abort: vi.fn(() => { + clearActiveEmbeddedRun(sessionId, handle, sessionKey); + }), + isCompacting: () => false, + isStreaming: () => false, + queueMessage: async () => {}, + }; + const logReload = { info: vi.fn(), warn: vi.fn() }; + setActiveEmbeddedRun(sessionId, handle, sessionKey); + + __testing.abortActiveAgentRunsAfterConfigRecovery({ + reason: "invalid-config", + logReload, + }); + + expect(handle.abort).toHaveBeenCalledOnce(); + expect(logReload.warn).toHaveBeenCalledWith( + "config recovery aborted active agent run(s) after reload-invalid-config", + ); + }); + + it("does not warn when config recovery has no active agent runs to abort", () => { + const 
logReload = { info: vi.fn(), warn: vi.fn() }; + + __testing.abortActiveAgentRunsAfterConfigRecovery({ + reason: "invalid-config", + logReload, + }); + + expect(logReload.warn).not.toHaveBeenCalled(); + }); +}); diff --git a/src/gateway/server-reload-handlers.ts b/src/gateway/server-reload-handlers.ts index e6520231383..c0faf00cffa 100644 --- a/src/gateway/server-reload-handlers.ts +++ b/src/gateway/server-reload-handlers.ts @@ -1,6 +1,7 @@ import { resetModelCatalogCache } from "../agents/model-catalog.js"; import { disposeAllSessionMcpRuntimes } from "../agents/pi-bundle-mcp-tools.js"; import { getActiveEmbeddedRunCount } from "../agents/pi-embedded-runner/run-state.js"; +import { abortEmbeddedPiRun } from "../agents/pi-embedded-runner/runs.js"; import { getTotalPendingReplies } from "../auto-reply/reply/dispatcher-registry.js"; import type { CliDeps } from "../cli/deps.types.js"; import { isRestartEnabled } from "../config/commands.flags.js"; @@ -59,6 +60,23 @@ const MCP_RUNTIME_RELOAD_DISPOSE_TIMEOUT_MS = 5_000; const CHANNEL_RELOAD_DEFERRAL_POLL_MS = 500; const CHANNEL_RELOAD_STILL_PENDING_WARN_MS = 30_000; +function abortActiveAgentRunsAfterConfigRecovery(params: { + reason: string; + logReload: GatewayReloadLog; +}) { + const aborted = abortEmbeddedPiRun(undefined, { mode: "all" }); + if (!aborted) { + return; + } + params.logReload.warn( + `config recovery aborted active agent run(s) after reload-${params.reason}`, + ); +} + +export const __testing = { + abortActiveAgentRunsAfterConfigRecovery, +}; + async function disposeMcpRuntimesWithTimeout(params: { dispose: () => Promise; timeoutMs: number; @@ -418,6 +436,7 @@ export function startManagedGatewayConfigReloader(params: ManagedGatewayConfigRe await params.recoverSnapshot({ snapshot, reason: `reload-${reason}` }), promoteSnapshot: async (snapshot, _reason) => await params.promoteSnapshot(snapshot), onRecovered: ({ reason, snapshot, recoveredSnapshot }) => { + abortActiveAgentRunsAfterConfigRecovery({ 
reason, logReload: params.logReload }); enqueueConfigRecoveryNotice({ cfg: recoveredSnapshot.config, phase: "reload", diff --git a/src/infra/diagnostic-events.ts b/src/infra/diagnostic-events.ts index 8ed0caf6bfb..e00039fb980 100644 --- a/src/infra/diagnostic-events.ts +++ b/src/infra/diagnostic-events.ts @@ -166,6 +166,24 @@ export type DiagnosticHeartbeatEvent = DiagnosticBaseEvent & { queued: number; }; +export type DiagnosticLivenessWarningReason = "event_loop_delay" | "event_loop_utilization" | "cpu"; + +export type DiagnosticLivenessWarningEvent = DiagnosticBaseEvent & { + type: "diagnostic.liveness.warning"; + reasons: DiagnosticLivenessWarningReason[]; + intervalMs: number; + eventLoopDelayP99Ms?: number; + eventLoopDelayMaxMs?: number; + eventLoopUtilization?: number; + cpuUserMs?: number; + cpuSystemMs?: number; + cpuTotalMs?: number; + cpuCoreRatio?: number; + active: number; + waiting: number; + queued: number; +}; + export type DiagnosticToolLoopEvent = DiagnosticBaseEvent & { type: "tool.loop"; sessionKey?: string; @@ -441,6 +459,7 @@ export type DiagnosticEventPayload = | DiagnosticLaneDequeueEvent | DiagnosticRunAttemptEvent | DiagnosticHeartbeatEvent + | DiagnosticLivenessWarningEvent | DiagnosticToolLoopEvent | DiagnosticToolExecutionStartedEvent | DiagnosticToolExecutionCompletedEvent diff --git a/src/logging/diagnostic-stability.ts b/src/logging/diagnostic-stability.ts index 189f4ca7e6f..ec213c76f03 100644 --- a/src/logging/diagnostic-stability.ts +++ b/src/logging/diagnostic-stability.ts @@ -45,6 +45,10 @@ export type DiagnosticStabilityEventRecord = { thresholdBytes?: number; rssGrowthBytes?: number; windowMs?: number; + eventLoopDelayP99Ms?: number; + eventLoopDelayMaxMs?: number; + eventLoopUtilization?: number; + cpuCoreRatio?: number; ageMs?: number; queueDepth?: number; queueSize?: number; @@ -266,6 +270,19 @@ function sanitizeDiagnosticEvent(event: DiagnosticEventPayload): DiagnosticStabi record.waiting = event.waiting; record.queued = 
event.queued; break; + case "diagnostic.liveness.warning": + record.level = "warning"; + record.durationMs = event.intervalMs; + record.count = event.reasons.length; + assignReasonCode(record, event.reasons[0]); + record.eventLoopDelayP99Ms = event.eventLoopDelayP99Ms; + record.eventLoopDelayMaxMs = event.eventLoopDelayMaxMs; + record.eventLoopUtilization = event.eventLoopUtilization; + record.cpuCoreRatio = event.cpuCoreRatio; + record.active = event.active; + record.waiting = event.waiting; + record.queued = event.queued; + break; case "tool.loop": record.toolName = event.toolName; record.level = event.level; diff --git a/src/logging/diagnostic.test.ts b/src/logging/diagnostic.test.ts index df164f08e26..b5eef251ed0 100644 --- a/src/logging/diagnostic.test.ts +++ b/src/logging/diagnostic.test.ts @@ -191,7 +191,7 @@ describe("stuck session diagnostics threshold", () => { enabled: true, }, }, - { emitMemorySample }, + { emitMemorySample, sampleLiveness: () => null }, ); vi.advanceTimersByTime(30_000); @@ -203,6 +203,93 @@ describe("stuck session diagnostics threshold", () => { expect(emitMemorySample).toHaveBeenLastCalledWith({ emitSample: true }); }); + it("emits idle liveness warnings into the stability recorder", () => { + const emitMemorySample = createEmitMemorySampleMock(); + const events: string[] = []; + const unsubscribe = onDiagnosticEvent((event) => events.push(event.type)); + + try { + startDiagnosticHeartbeat( + { + diagnostics: { + enabled: true, + }, + }, + { + emitMemorySample, + sampleLiveness: () => ({ + reasons: ["cpu"], + intervalMs: 30_000, + eventLoopDelayP99Ms: 12, + eventLoopDelayMaxMs: 22, + eventLoopUtilization: 0.99, + cpuUserMs: 29_000, + cpuSystemMs: 1_000, + cpuTotalMs: 30_000, + cpuCoreRatio: 1, + }), + }, + ); + + vi.advanceTimersByTime(30_000); + } finally { + unsubscribe(); + } + + expect(events).toContain("diagnostic.liveness.warning"); + expect(emitMemorySample).toHaveBeenLastCalledWith({ emitSample: true }); + 
expect(getDiagnosticStabilitySnapshot({ limit: 10 }).events).toContainEqual( + expect.objectContaining({ + type: "diagnostic.liveness.warning", + level: "warning", + reason: "cpu", + durationMs: 30_000, + count: 1, + eventLoopDelayP99Ms: 12, + eventLoopDelayMaxMs: 22, + eventLoopUtilization: 0.99, + cpuCoreRatio: 1, + active: 0, + waiting: 0, + queued: 0, + }), + ); + }); + + it("throttles repeated liveness warnings", () => { + const events: string[] = []; + const unsubscribe = onDiagnosticEvent((event) => events.push(event.type)); + + try { + startDiagnosticHeartbeat( + { + diagnostics: { + enabled: true, + }, + }, + { + emitMemorySample: createEmitMemorySampleMock(), + sampleLiveness: () => ({ + reasons: ["event_loop_delay"], + intervalMs: 30_000, + eventLoopDelayP99Ms: 1_500, + eventLoopDelayMaxMs: 2_000, + }), + }, + ); + + vi.advanceTimersByTime(30_000); + vi.advanceTimersByTime(90_000); + expect(events.filter((event) => event === "diagnostic.liveness.warning")).toHaveLength(1); + + vi.advanceTimersByTime(30_000); + } finally { + unsubscribe(); + } + + expect(events.filter((event) => event === "diagnostic.liveness.warning")).toHaveLength(2); + }); + it("does not start the heartbeat when diagnostics are disabled by config", () => { const emitMemorySample = createEmitMemorySampleMock(); diff --git a/src/logging/diagnostic.ts b/src/logging/diagnostic.ts index b52e03bd289..3de591c3bcd 100644 --- a/src/logging/diagnostic.ts +++ b/src/logging/diagnostic.ts @@ -1,9 +1,11 @@ +import { monitorEventLoopDelay, performance } from "node:perf_hooks"; import { getRuntimeConfig } from "../config/config.js"; import type { OpenClawConfig } from "../config/types.openclaw.js"; import { areDiagnosticsEnabledForProcess, emitDiagnosticEvent, isDiagnosticsEnabled, + type DiagnosticLivenessWarningReason, } from "../infra/diagnostic-events.js"; import { emitDiagnosticMemorySample, resetDiagnosticMemoryForTest } from "./diagnostic-memory.js"; import { @@ -44,11 +46,18 @@ const 
DEFAULT_STUCK_SESSION_WARN_MS = 120_000; const MIN_STUCK_SESSION_WARN_MS = 1_000; const MAX_STUCK_SESSION_WARN_MS = 24 * 60 * 60 * 1000; const RECENT_DIAGNOSTIC_ACTIVITY_MS = 120_000; +const DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS = 1_000; +const DEFAULT_LIVENESS_EVENT_LOOP_UTILIZATION_WARN = 0.95; +const DEFAULT_LIVENESS_CPU_CORE_RATIO_WARN = 0.9; +const DEFAULT_LIVENESS_WARN_COOLDOWN_MS = 120_000; let commandPollBackoffRuntimePromise: Promise< typeof import("../agents/command-poll-backoff.runtime.js") > | null = null; type EmitDiagnosticMemorySample = typeof emitDiagnosticMemorySample; +type EventLoopDelayMonitor = ReturnType<typeof monitorEventLoopDelay>; +type EventLoopUtilization = ReturnType<typeof performance.eventLoopUtilization>; +type CpuUsage = ReturnType<typeof process.cpuUsage>; type DiagnosticWorkSnapshot = { activeCount: number; @@ -56,6 +65,35 @@ type DiagnosticWorkSnapshot = { queuedCount: number; }; +type DiagnosticLivenessSample = { + reasons: DiagnosticLivenessWarningReason[]; + intervalMs: number; + eventLoopDelayP99Ms?: number; + eventLoopDelayMaxMs?: number; + eventLoopUtilization?: number; + cpuUserMs?: number; + cpuSystemMs?: number; + cpuTotalMs?: number; + cpuCoreRatio?: number; +}; + +type SampleDiagnosticLiveness = ( + now: number, + work: DiagnosticWorkSnapshot, +) => DiagnosticLivenessSample | null; + +type StartDiagnosticHeartbeatOptions = { + getConfig?: () => OpenClawConfig; + emitMemorySample?: EmitDiagnosticMemorySample; + sampleLiveness?: SampleDiagnosticLiveness; +}; + +let diagnosticLivenessMonitor: EventLoopDelayMonitor | null = null; +let lastDiagnosticLivenessWallAt = 0; +let lastDiagnosticLivenessCpuUsage: CpuUsage | null = null; +let lastDiagnosticLivenessEventLoopUtilization: EventLoopUtilization | null = null; +let lastDiagnosticLivenessWarnAt = 0; + function loadCommandPollBackoffRuntime() { commandPollBackoffRuntimePromise ??= import("../agents/command-poll-backoff.runtime.js"); return commandPollBackoffRuntimePromise; @@ -87,6 +125,159 @@ function hasRecentDiagnosticActivity(now: number): boolean { return 
lastActivityAt > 0 && now - lastActivityAt <= RECENT_DIAGNOSTIC_ACTIVITY_MS; } +function roundDiagnosticMetric(value: number, digits = 3): number { + if (!Number.isFinite(value)) { + return 0; + } + const factor = 10 ** digits; + return Math.round(value * factor) / factor; +} + +function nanosecondsToMilliseconds(value: number): number { + return roundDiagnosticMetric(value / 1_000_000, 1); +} + +function formatOptionalDiagnosticMetric(value: number | undefined): string { + return value === undefined ? "unknown" : String(value); +} + +function startDiagnosticLivenessSampler(): void { + lastDiagnosticLivenessWallAt = Date.now(); + lastDiagnosticLivenessCpuUsage = process.cpuUsage(); + lastDiagnosticLivenessEventLoopUtilization = performance.eventLoopUtilization(); + lastDiagnosticLivenessWarnAt = 0; + + if (diagnosticLivenessMonitor) { + diagnosticLivenessMonitor.reset(); + return; + } + + try { + diagnosticLivenessMonitor = monitorEventLoopDelay({ resolution: 20 }); + diagnosticLivenessMonitor.enable(); + diagnosticLivenessMonitor.reset(); + } catch (err) { + diagnosticLivenessMonitor = null; + diag.debug(`diagnostic liveness monitor unavailable: ${String(err)}`); + } +} + +function stopDiagnosticLivenessSampler(): void { + diagnosticLivenessMonitor?.disable(); + diagnosticLivenessMonitor = null; + lastDiagnosticLivenessWallAt = 0; + lastDiagnosticLivenessCpuUsage = null; + lastDiagnosticLivenessEventLoopUtilization = null; + lastDiagnosticLivenessWarnAt = 0; +} + +function sampleDiagnosticLiveness(now: number): DiagnosticLivenessSample | null { + if ( + !diagnosticLivenessMonitor || + !lastDiagnosticLivenessCpuUsage || + !lastDiagnosticLivenessEventLoopUtilization || + lastDiagnosticLivenessWallAt <= 0 + ) { + startDiagnosticLivenessSampler(); + return null; + } + + const intervalMs = Math.max(1, now - lastDiagnosticLivenessWallAt); + const cpuUsage = process.cpuUsage(lastDiagnosticLivenessCpuUsage); + const currentEventLoopUtilization = 
performance.eventLoopUtilization(); + const eventLoopUtilization = performance.eventLoopUtilization( + currentEventLoopUtilization, + lastDiagnosticLivenessEventLoopUtilization, + ).utilization; + const eventLoopDelayP99Ms = nanosecondsToMilliseconds(diagnosticLivenessMonitor.percentile(99)); + const eventLoopDelayMaxMs = nanosecondsToMilliseconds(diagnosticLivenessMonitor.max); + diagnosticLivenessMonitor.reset(); + lastDiagnosticLivenessWallAt = now; + lastDiagnosticLivenessCpuUsage = process.cpuUsage(); + lastDiagnosticLivenessEventLoopUtilization = currentEventLoopUtilization; + + const cpuUserMs = roundDiagnosticMetric(cpuUsage.user / 1_000, 1); + const cpuSystemMs = roundDiagnosticMetric(cpuUsage.system / 1_000, 1); + const cpuTotalMs = roundDiagnosticMetric(cpuUserMs + cpuSystemMs, 1); + const cpuCoreRatio = roundDiagnosticMetric(cpuTotalMs / intervalMs, 3); + const eventLoopUtilizationRatio = roundDiagnosticMetric(eventLoopUtilization, 3); + const reasons: DiagnosticLivenessWarningReason[] = []; + + if ( + eventLoopDelayP99Ms >= DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS || + eventLoopDelayMaxMs >= DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS + ) { + reasons.push("event_loop_delay"); + } + if (eventLoopUtilizationRatio >= DEFAULT_LIVENESS_EVENT_LOOP_UTILIZATION_WARN) { + reasons.push("event_loop_utilization"); + } + if (cpuCoreRatio >= DEFAULT_LIVENESS_CPU_CORE_RATIO_WARN) { + reasons.push("cpu"); + } + if (reasons.length === 0) { + return null; + } + + return { + reasons, + intervalMs, + eventLoopDelayP99Ms, + eventLoopDelayMaxMs, + eventLoopUtilization: eventLoopUtilizationRatio, + cpuUserMs, + cpuSystemMs, + cpuTotalMs, + cpuCoreRatio, + }; +} + +function shouldEmitDiagnosticLivenessWarning(now: number): boolean { + if ( + lastDiagnosticLivenessWarnAt > 0 && + now - lastDiagnosticLivenessWarnAt < DEFAULT_LIVENESS_WARN_COOLDOWN_MS + ) { + return false; + } + lastDiagnosticLivenessWarnAt = now; + return true; +} + +function emitDiagnosticLivenessWarning( + 
sample: DiagnosticLivenessSample, + work: DiagnosticWorkSnapshot, +): void { + diag.warn( + `liveness warning: reasons=${sample.reasons.join(",")} interval=${Math.round( + sample.intervalMs / 1000, + )}s eventLoopDelayP99Ms=${formatOptionalDiagnosticMetric( + sample.eventLoopDelayP99Ms, + )} eventLoopDelayMaxMs=${formatOptionalDiagnosticMetric( + sample.eventLoopDelayMaxMs, + )} eventLoopUtilization=${formatOptionalDiagnosticMetric( + sample.eventLoopUtilization, + )} cpuCoreRatio=${formatOptionalDiagnosticMetric(sample.cpuCoreRatio)} active=${ + work.activeCount + } waiting=${work.waitingCount} queued=${work.queuedCount}`, + ); + emitDiagnosticEvent({ + type: "diagnostic.liveness.warning", + reasons: sample.reasons, + intervalMs: sample.intervalMs, + eventLoopDelayP99Ms: sample.eventLoopDelayP99Ms, + eventLoopDelayMaxMs: sample.eventLoopDelayMaxMs, + eventLoopUtilization: sample.eventLoopUtilization, + cpuUserMs: sample.cpuUserMs, + cpuSystemMs: sample.cpuSystemMs, + cpuTotalMs: sample.cpuTotalMs, + cpuCoreRatio: sample.cpuCoreRatio, + active: work.activeCount, + waiting: work.waitingCount, + queued: work.queuedCount, + }); + markActivity(); +} + export function resolveStuckSessionWarnMs(config?: OpenClawConfig): number { const raw = config?.diagnostics?.stuckSessionWarnMs; if (typeof raw !== "number" || !Number.isFinite(raw)) { @@ -393,7 +584,7 @@ let heartbeatInterval: NodeJS.Timeout | null = null; export function startDiagnosticHeartbeat( config?: OpenClawConfig, - opts?: { getConfig?: () => OpenClawConfig; emitMemorySample?: EmitDiagnosticMemorySample }, + opts?: StartDiagnosticHeartbeatOptions, ) { if (!areDiagnosticsEnabledForProcess() || !isDiagnosticsEnabled(config)) { return; @@ -403,6 +594,7 @@ export function startDiagnosticHeartbeat( if (heartbeatInterval) { return; } + startDiagnosticLivenessSampler(); heartbeatInterval = setInterval(() => { let heartbeatConfig = config; if (!heartbeatConfig) { @@ -416,8 +608,11 @@ export function 
startDiagnosticHeartbeat( const now = Date.now(); pruneDiagnosticSessionStates(now, true); const work = getDiagnosticWorkSnapshot(); + const livenessSample = (opts?.sampleLiveness ?? sampleDiagnosticLiveness)(now, work); + const shouldEmitLivenessWarning = + livenessSample !== null && shouldEmitDiagnosticLivenessWarning(now); const shouldRecordMemorySample = - hasRecentDiagnosticActivity(now) || hasOpenDiagnosticWork(work); + shouldEmitLivenessWarning || hasRecentDiagnosticActivity(now) || hasOpenDiagnosticWork(work); (opts?.emitMemorySample ?? emitDiagnosticMemorySample)({ emitSample: shouldRecordMemorySample, }); @@ -426,6 +621,10 @@ export function startDiagnosticHeartbeat( return; } + if (shouldEmitLivenessWarning && livenessSample) { + emitDiagnosticLivenessWarning(livenessSample, work); + } + diag.debug( `heartbeat: webhooks=${webhookStats.received}/${webhookStats.processed}/${webhookStats.errors} active=${work.activeCount} waiting=${work.waitingCount} queued=${work.queuedCount}`, ); @@ -471,6 +670,7 @@ export function stopDiagnosticHeartbeat() { clearInterval(heartbeatInterval); heartbeatInterval = null; } + stopDiagnosticLivenessSampler(); stopDiagnosticStabilityRecorder(); uninstallDiagnosticStabilityFatalHook(); }
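
The liveness thresholds and cooldown in this diff can be replayed in isolation to see why the throttle test expects one warning after 120 s and a second one at 150 s. The following is a minimal sketch under stated assumptions, not the Gateway's sampler: it takes already-measured numbers instead of calling `perf_hooks`, and the names `classifyLiveness`, `createWarnThrottle`, and `warningTicks` are hypothetical. The threshold values are copied from the `DEFAULT_LIVENESS_*` constants above.

```typescript
type LivenessReason = "event_loop_delay" | "event_loop_utilization" | "cpu";

// Values copied from the DEFAULT_LIVENESS_* constants in the diff.
const DELAY_WARN_MS = 1_000;
const UTILIZATION_WARN = 0.95;
const CPU_CORE_RATIO_WARN = 0.9;
const WARN_COOLDOWN_MS = 120_000;

type MeasuredSample = {
  intervalMs: number;
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
  cpuTotalMs: number;
};

// Map one measured interval to the reasons that would trigger a warning.
function classifyLiveness(sample: MeasuredSample): LivenessReason[] {
  const reasons: LivenessReason[] = [];
  if (sample.delayP99Ms >= DELAY_WARN_MS || sample.delayMaxMs >= DELAY_WARN_MS) {
    reasons.push("event_loop_delay");
  }
  if (sample.utilization >= UTILIZATION_WARN) {
    reasons.push("event_loop_utilization");
  }
  // CPU time divided by wall time: 1.0 means one fully busy core.
  if (sample.cpuTotalMs / sample.intervalMs >= CPU_CORE_RATIO_WARN) {
    reasons.push("cpu");
  }
  return reasons;
}

// Allow the first warning, then suppress repeats until the cooldown elapses.
function createWarnThrottle(cooldownMs: number): (now: number) => boolean {
  let lastWarnAt = 0;
  return (now: number): boolean => {
    if (lastWarnAt > 0 && now - lastWarnAt < cooldownMs) {
      return false;
    }
    lastWarnAt = now;
    return true;
  };
}

// Replay heartbeat ticks, assuming every tick sees a saturated sample.
function warningTicks(tickCount: number, tickMs = 30_000): number[] {
  const allow = createWarnThrottle(WARN_COOLDOWN_MS);
  const emitted: number[] = [];
  for (let i = 1; i <= tickCount; i += 1) {
    const now = i * tickMs;
    if (allow(now)) {
      emitted.push(now);
    }
  }
  return emitted;
}
```

With a 30 s heartbeat, the ticks at 60 s, 90 s, and 120 s fall 30 s, 60 s, and 90 s after the first warning, all inside the 120 s cooldown. The tick at 150 s is exactly 120 s later, which is no longer less than the cooldown, so it emits again.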