Improve gateway diagnostics export for support reports (#70324)

Merged via squash.

Prepared head SHA: 3d6ee85993
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
Gustavo Madeira Santana
2026-04-22 20:47:14 -04:00
committed by GitHub
parent 6b41ef311f
commit 28818f9140
54 changed files with 5385 additions and 56 deletions


@@ -1,4 +1,4 @@
81a8a7de5d4bf02cf3e697a641fe89844f98ed58d47890f12800181fde5a97b1 config-baseline.json
dab963eda8866b8bffd5c9032f92f0f6b08ed54dda837f1f5c513fca5d2c78e9 config-baseline.core.json
b05357fa162ba1f1d4ed192671b758d3905602678ff61148568840c6544d6222 config-baseline.json
a4e167f169db58d71c385a31fa2b980772f9fee963e70dd9553f63536cae5aed config-baseline.core.json
35d132fe176bd2bf9f0e46b29de91baba63ec4db3317cc5b294a982b46d16ba9 config-baseline.channel.json
71b5ff17041bc48a62300ad9f44fa8bb14d9dcd7f4c3549c0576d3059ce6ff36 config-baseline.plugin.json
3703c5345288adb9eee8cda3b592147cf4fed25a7782bed21ca83c88c3ca1cc0 config-baseline.plugin.json


@@ -111,6 +111,59 @@ Options:
- `--days <days>`: number of days to include (default `30`).
### `gateway stability`
Fetch the recent diagnostic stability recorder from a running Gateway.
```bash
openclaw gateway stability
openclaw gateway stability --type payload.large
openclaw gateway stability --bundle latest
openclaw gateway stability --bundle latest --export
openclaw gateway stability --json
```
Options:
- `--limit <limit>`: maximum number of recent events to include (default `25`, max `1000`).
- `--type <type>`: filter by diagnostic event type, such as `payload.large` or `diagnostic.memory.pressure`.
- `--since-seq <seq>`: include only events after a diagnostic sequence number.
- `--bundle [path]`: read a persisted stability bundle instead of calling the running Gateway. Use `--bundle latest` (or just `--bundle`) for the newest bundle under the state directory, or pass a bundle JSON path directly.
- `--export`: write a shareable support diagnostics zip instead of printing stability details.
- `--output <path>`: output path for `--export`.
Notes:
- The recorder is active by default. Set `diagnostics.enabled: false` only when you need to disable Gateway diagnostic heartbeat collection.
- Records keep operational metadata: event names, counts, byte sizes, memory readings, queue/session state, channel/plugin names, and redacted session summaries. They do not keep chat text, webhook bodies, tool outputs, raw request or response bodies, tokens, cookies, secret values, hostnames, or raw session ids.
- On fatal Gateway exits, shutdown timeouts, and restart startup failures, OpenClaw writes the same diagnostic snapshot to `~/.openclaw/logs/stability/openclaw-stability-*.json` when the recorder has events. Inspect the newest bundle with `openclaw gateway stability --bundle latest`; `--limit`, `--type`, and `--since-seq` also apply to bundle output.
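The filter options above compose in a predictable way. As a rough sketch only, assuming a hypothetical event shape with `seq` and `type` fields (the recorder's real schema is not shown here) and assuming filters run before the limit is applied, the selection could look like this:

```typescript
// Hypothetical event shape; the real recorder schema is not documented here.
interface StabilityEvent {
  seq: number;
  type: string;
}

interface SelectOptions {
  limit?: number; // mirrors --limit (default 25, max 1000)
  type?: string; // mirrors --type
  sinceSeq?: number; // mirrors --since-seq
}

const DEFAULT_LIMIT = 25;
const MAX_LIMIT = 1000;

// Assumed order of operations: filter first, then keep the most recent `limit` events.
function selectEvents(events: StabilityEvent[], opts: SelectOptions = {}): StabilityEvent[] {
  const requested = opts.limit === undefined ? DEFAULT_LIMIT : opts.limit;
  // Clamping to at least 1 is an assumption of this sketch.
  const limit = Math.max(1, Math.min(MAX_LIMIT, Math.floor(requested)));
  const filtered = events.filter((event) => {
    if (opts.type !== undefined && event.type !== opts.type) return false;
    if (opts.sinceSeq !== undefined && event.seq <= opts.sinceSeq) return false;
    return true;
  });
  // A negative start index keeps the tail of the array, i.e. the newest events.
  return filtered.slice(-limit);
}
```

The same function would apply equally to events fetched from a running Gateway or read from a persisted bundle, which matches the note that `--limit`, `--type`, and `--since-seq` also apply to bundle output.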
### `gateway diagnostics export`
Write a local diagnostics zip that is designed to be attached to bug reports.
```bash
openclaw gateway diagnostics export
openclaw gateway diagnostics export --output openclaw-diagnostics.zip
openclaw gateway diagnostics export --json
```
Options:
- `--output <path>`: output zip path. Defaults to a support export under the state directory.
- `--log-lines <count>`: maximum sanitized log lines to include (default `5000`).
- `--log-bytes <bytes>`: maximum log bytes to inspect (default `1000000`).
- `--url <url>`: Gateway WebSocket URL for the health snapshot.
- `--token <token>`: Gateway token for the health snapshot.
- `--password <password>`: Gateway password for the health snapshot.
- `--timeout <ms>`: status/health snapshot timeout (default `3000`).
- `--no-stability-bundle`: skip persisted stability bundle lookup.
- `--json`: print the written path, size, and manifest as JSON.
The export contains a manifest, a Markdown summary, config shape, sanitized config details, sanitized log summaries, sanitized Gateway status/health snapshots, and the newest stability bundle when one exists.
It is meant to be shared. It keeps operational details that help debugging, such as safe OpenClaw log fields, subsystem names, status codes, durations, configured modes, ports, plugin ids, provider ids, non-secret feature settings, and redacted operational log messages. It omits or redacts chat text, webhook bodies, tool outputs, credentials, cookies, account/message identifiers, prompt/instruction text, hostnames, and secret values. When a LogTape-style message looks like user/chat/tool payload text, the export keeps only that a message was omitted plus its byte count.
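To illustrate the last point, here is a minimal sketch of replacing a payload-like message with an omission marker and its byte count. The `ExportedMessage` shape, the function names, and the `looksLikePayload` callback are hypothetical stand-ins; the heuristic the export really uses is not described here.

```typescript
// Count UTF-8 bytes without relying on Node- or browser-specific APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      // Valid surrogate pair: one 4-byte code point spread over two UTF-16 units.
      bytes += 4;
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Hypothetical output shape for one sanitized log message.
type ExportedMessage =
  | { kind: "kept"; message: string }
  | { kind: "omitted"; bytes: number };

// Hypothetical classifier standing in for whatever heuristic the export uses.
type PayloadHeuristic = (message: string) => boolean;

function sanitizeLogMessage(message: string, looksLikePayload: PayloadHeuristic): ExportedMessage {
  if (looksLikePayload(message)) {
    // Keep only the fact of omission and the size, never the text itself.
    return { kind: "omitted", bytes: utf8ByteLength(message) };
  }
  return { kind: "kept", message };
}
```

Recording the byte count rather than a hash or prefix keeps the entry useful for spotting oversized messages without leaking any of the content.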
### `gateway status`
`gateway status` shows the Gateway service (launchd/systemd/schtasks) plus an optional probe of connectivity/auth capability.


@@ -26,6 +26,8 @@ Short guide to verify channel connectivity without guessing.
- Creds on disk: `ls -l ~/.openclaw/credentials/whatsapp/<accountId>/creds.json` (mtime should be recent).
- Session store: `ls -l ~/.openclaw/agents/<agentId>/sessions/sessions.json` (path can be overridden in config). Count and recent recipients are surfaced via `status`.
- Relink flow: `openclaw channels logout && openclaw channels login --verbose` when status codes 409–515 or `loggedOut` appear in logs. (Note: the QR login flow auto-restarts once for status 515 after pairing.)
- Diagnostics are enabled by default. The gateway records operational facts unless `diagnostics.enabled: false` is set. Memory events record RSS/heap byte counts, threshold pressure, and growth pressure. Oversized-payload events record what was rejected, truncated, or chunked, plus sizes and limits when available. They do not record the message text, attachment contents, webhook body, raw request or response body, tokens, cookies, or secret values. The same heartbeat starts the bounded stability recorder, which is available through `openclaw gateway stability` or the `diagnostics.stability` Gateway RPC. Fatal Gateway exits, shutdown timeouts, and restart startup failures persist the latest recorder snapshot under `~/.openclaw/logs/stability/` when events exist; inspect the newest saved bundle with `openclaw gateway stability --bundle latest`.
- For bug reports, run `openclaw gateway diagnostics export` and attach the generated zip. The export combines a Markdown summary, the newest stability bundle, sanitized log metadata, sanitized Gateway status/health snapshots, and config shape. It is meant to be shared: chat text, webhook bodies, tool outputs, credentials, cookies, account/message identifiers, and secret values are omitted or redacted.
## Health monitor config


@@ -18,6 +18,13 @@ handshake time.
- WebSocket, text frames with JSON payloads.
- First frame **must** be a `connect` request.
- Pre-connect frames are capped at 64 KiB. After a successful handshake, clients
should follow the `hello-ok.policy.maxPayload` and
`hello-ok.policy.maxBufferedBytes` limits. With diagnostics enabled,
oversized inbound frames and slow outbound buffers emit `payload.large` events
before the gateway closes or drops the affected frame. These events keep
sizes, limits, surfaces, and safe reason codes. They do not keep the message
body, attachment contents, raw frame body, tokens, cookies, or secret values.
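
The size check described above can be sketched as follows. This is an illustration under stated assumptions, not the gateway's implementation: the `PayloadLargeEvent` fields, the surface names (`ws.preconnect`, `ws.inbound`), and the `frame_too_large` reason code are all hypothetical, and a frame exactly at the limit is assumed to be allowed.

```typescript
// Hypothetical shapes; field names are illustrative, not the gateway's real schema.
interface PayloadPolicy {
  maxPayload: number; // bytes, as advertised in hello-ok.policy.maxPayload
}

interface PayloadLargeEvent {
  type: "payload.large";
  surface: string; // where the oversized frame was seen
  bytes: number; // observed frame size
  limit: number; // limit that was exceeded
  reason: string; // safe reason code, never the frame body
}

const PRE_CONNECT_LIMIT = 64 * 1024; // 64 KiB cap on pre-connect frames

// Returns an event describing the violation, or null when the frame fits.
function checkInboundFrame(
  frameBytes: number,
  connected: boolean,
  policy: PayloadPolicy,
): PayloadLargeEvent | null {
  const limit = connected ? policy.maxPayload : PRE_CONNECT_LIMIT;
  if (frameBytes <= limit) return null;
  return {
    type: "payload.large",
    surface: connected ? "ws.inbound" : "ws.preconnect",
    bytes: frameBytes,
    limit,
    reason: "frame_too_large",
  };
}
```

Note that the function only ever receives the frame's size, so the event cannot carry the message body by construction.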
## Handshake (connect)
@@ -265,6 +272,12 @@ implemented in `src/gateway/server-methods/*.ts`.
### System and identity
- `health` returns the cached or freshly probed gateway health snapshot.
- `diagnostics.stability` returns the recent bounded diagnostic stability
recorder. It keeps operational metadata such as event names, counts, byte
sizes, memory readings, queue/session state, channel/plugin names, and
redacted session summaries. It does not keep chat text, webhook bodies, tool
outputs, raw request or response bodies, tokens, cookies, or secret values.
Operator read scope is required.
- `status` returns the `/status`-style gateway summary; sensitive fields are
included only for admin-scoped operator clients.
- `gateway.identity.get` returns the gateway device identity used by relay and


@@ -329,6 +329,20 @@ Think of the suites as “increasing realism” (and increasing flakiness/cost):
- `pnpm test:perf:profile:main` writes a main-thread CPU profile for Vitest/Vite startup and transform overhead.
- `pnpm test:perf:profile:runner` writes runner CPU+heap profiles for the unit suite with file parallelism disabled.
### Stability (gateway)
- Command: `pnpm test:stability:gateway`
- Config: `vitest.gateway.config.ts`, forced to one worker
- Scope:
  - Starts a real loopback Gateway with diagnostics enabled by default
  - Drives synthetic gateway message, memory, and large-payload churn through the diagnostic event path
  - Queries `diagnostics.stability` over the Gateway WS RPC
  - Covers diagnostic stability bundle persistence helpers
  - Asserts the recorder remains bounded, synthetic RSS samples stay under the pressure budget, and per-session queue depths drain back to zero
- Expectations:
  - CI-safe and keyless
  - Narrow lane for stability-regression follow-up, not a substitute for the full Gateway suite
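
The "recorder remains bounded" property checked by this lane can be pictured with a minimal sketch. The `BoundedRecorder` class below is hypothetical and written only for illustration; it is not the recorder used by the Gateway, whose implementation is not shown here.

```typescript
// One recorded entry: a monotonically increasing sequence number plus a value.
interface RecordedEvent<T> {
  seq: number;
  value: T;
}

// Minimal bounded recorder: keeps at most `capacity` events and drops the oldest.
class BoundedRecorder<T> {
  private readonly capacity: number;
  private events: RecordedEvent<T>[] = [];
  private nextSeq = 1;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  // Appends an event and evicts the oldest one once the buffer is full.
  record(value: T): number {
    const seq = this.nextSeq++;
    this.events.push({ seq, value });
    if (this.events.length > this.capacity) this.events.shift();
    return seq;
  }

  // Returns a copy so callers cannot mutate the internal buffer.
  snapshot(): RecordedEvent<T>[] {
    return this.events.slice();
  }
}
```

A stability test in this style would drive far more events than the capacity through `record` and then assert that `snapshot().length` never exceeds the capacity, while sequence numbers keep increasing so that a `--since-seq` style filter stays meaningful.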
### E2E (gateway smoke)
- Command: `pnpm test:e2e`