diff --git a/docs/docs.json b/docs/docs.json index 269c3a4f55a..18cc6e8691d 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -1436,11 +1436,12 @@ "group": "Health and diagnostics", "pages": [ "gateway/health", - "gateway/diagnostics", "gateway/heartbeat", "gateway/doctor", - "gateway/logging", "logging", + "gateway/opentelemetry", + "gateway/logging", + "gateway/diagnostics", "gateway/troubleshooting" ] }, diff --git a/docs/gateway/configuration-reference.md b/docs/gateway/configuration-reference.md index eab8453bf1d..52b98097267 100644 --- a/docs/gateway/configuration-reference.md +++ b/docs/gateway/configuration-reference.md @@ -909,7 +909,7 @@ Notes: - `enabled`: master toggle for instrumentation output (default: `true`). - `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`). - `stuckSessionWarnMs`: age threshold in ms for emitting stuck-session warnings while a session remains in processing state. -- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). +- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry). - `otel.endpoint`: collector URL for OTel export. - `otel.protocol`: `"http/protobuf"` (default) or `"grpc"`. - `otel.headers`: extra HTTP/gRPC metadata headers sent with OTel export requests. diff --git a/docs/gateway/diagnostics.md b/docs/gateway/diagnostics.md index 54b05ff9760..0a558d7d5d5 100644 --- a/docs/gateway/diagnostics.md +++ b/docs/gateway/diagnostics.md @@ -129,9 +129,10 @@ diagnostic event collection: Disabling diagnostics reduces bug-report detail. It does not affect normal Gateway logging. 
-## Related docs +## Related -- [Health Checks](/gateway/health) +- [Health checks](/gateway/health) - [Gateway CLI](/cli/gateway#gateway-diagnostics-export) -- [Gateway Protocol](/gateway/protocol#system-and-identity) +- [Gateway protocol](/gateway/protocol#system-and-identity) - [Logging](/logging) +- [OpenTelemetry export](/gateway/opentelemetry) — separate flow for streaming diagnostics to a collector diff --git a/docs/gateway/logging.md b/docs/gateway/logging.md index 005ef554bc0..881ae5d3ada 100644 --- a/docs/gateway/logging.md +++ b/docs/gateway/logging.md @@ -114,5 +114,6 @@ This keeps existing file logs stable while making interactive output scannable. ## Related -- [Logging overview](/logging) +- [Logging](/logging) +- [OpenTelemetry export](/gateway/opentelemetry) - [Diagnostics export](/gateway/diagnostics) diff --git a/docs/gateway/opentelemetry.md b/docs/gateway/opentelemetry.md new file mode 100644 index 00000000000..697c15216e5 --- /dev/null +++ b/docs/gateway/opentelemetry.md @@ -0,0 +1,304 @@ +--- +summary: "Export OpenClaw diagnostics to any OpenTelemetry collector via the diagnostics-otel plugin (OTLP/HTTP)" +title: "OpenTelemetry export" +read_when: + - You want to send OpenClaw model usage, message flow, or session metrics to an OpenTelemetry collector + - You are wiring traces, metrics, or logs into Grafana, Datadog, Honeycomb, New Relic, Tempo, or another OTLP backend + - You need the exact metric names, span names, or attribute shapes to build dashboards or alerts +--- + +OpenClaw exports diagnostics through the bundled `diagnostics-otel` plugin +using **OTLP/HTTP (protobuf)**. Any collector or backend that accepts OTLP/HTTP +works without code changes. For local file logs and how to read them, see +[Logging](/logging). + +## How it fits together + +- **Diagnostics events** are structured, in-process records emitted by the + Gateway and bundled plugins for model runs, message flow, sessions, queues, + and exec. 
+- **`diagnostics-otel` plugin** subscribes to those events and exports them as + OpenTelemetry **metrics**, **traces**, and **logs** over OTLP/HTTP. +- Exporters only attach when both the diagnostics surface and the plugin are + enabled, so the in-process cost stays near zero by default. + +## Quick start + +```json5 +{ + plugins: { + allow: ["diagnostics-otel"], + entries: { + "diagnostics-otel": { enabled: true }, + }, + }, + diagnostics: { + enabled: true, + otel: { + enabled: true, + endpoint: "http://otel-collector:4318", + protocol: "http/protobuf", + serviceName: "openclaw-gateway", + traces: true, + metrics: true, + logs: true, + sampleRate: 0.2, + flushIntervalMs: 60000, + }, + }, +} +``` + +You can also enable the plugin from the CLI: + +```bash +openclaw plugins enable diagnostics-otel +``` + + +`protocol` currently supports `http/protobuf` only. `grpc` is ignored. + + +## Signals exported + +| Signal | What goes in it | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------- | +| **Metrics** | Counters and histograms for token usage, cost, run duration, message flow, queue lanes, session state, exec, and memory pressure. | +| **Traces** | Spans for model usage, model calls, tool execution, exec, webhook/message processing, context assembly, and tool loops. | +| **Logs** | Structured `logging.file` records exported over OTLP when `diagnostics.otel.logs` is enabled. | + +Toggle `traces`, `metrics`, and `logs` independently. All three default to on +when `diagnostics.otel.enabled` is true. + +## Configuration reference + +```json5 +{ + diagnostics: { + enabled: true, + otel: { + enabled: true, + endpoint: "http://otel-collector:4318", + protocol: "http/protobuf", // grpc is ignored + serviceName: "openclaw-gateway", + headers: { "x-collector-token": "..." 
}, + traces: true, + metrics: true, + logs: true, + sampleRate: 0.2, // root-span sampler, 0.0..1.0 + flushIntervalMs: 60000, // metric export interval (min 1000ms) + captureContent: { + enabled: false, + inputMessages: false, + outputMessages: false, + toolInputs: false, + toolOutputs: false, + systemPrompt: false, + }, + }, + }, +} +``` + +### Environment variables + +| Variable | Purpose | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `OTEL_EXPORTER_OTLP_ENDPOINT` | Override `diagnostics.otel.endpoint`. If the value already contains `/v1/traces`, `/v1/metrics`, or `/v1/logs`, it is used as-is. | +| `OTEL_SERVICE_NAME` | Override `diagnostics.otel.serviceName`. | +| `OTEL_EXPORTER_OTLP_PROTOCOL` | Override the wire protocol (only `http/protobuf` is honored today). | +| `OTEL_SEMCONV_STABILITY_OPT_IN` | Set to `gen_ai_latest_experimental` to emit the latest experimental GenAI span attribute (`gen_ai.provider.name`) instead of the legacy `gen_ai.system`. GenAI metrics always use bounded, low-cardinality semantic attributes regardless. | +| `OPENCLAW_OTEL_PRELOADED` | Set to `1` when another preload or host process already registered the global OpenTelemetry SDK. The plugin then skips its own NodeSDK lifecycle but still wires diagnostic listeners and honors `traces`/`metrics`/`logs`. | + +## Privacy and content capture + +Raw model/tool content is **not** exported by default. Spans carry bounded +identifiers (channel, provider, model, error category, hash-only request ids) +and never include prompt text, response text, tool inputs, tool outputs, or +session keys. + +Set `diagnostics.otel.captureContent.*` to `true` only when your collector and +retention policy are approved for prompt, response, tool, or system-prompt +text. 
Each subkey is opt-in independently: + +- `inputMessages` — user prompt content. +- `outputMessages` — model response content. +- `toolInputs` — tool argument payloads. +- `toolOutputs` — tool result payloads. +- `systemPrompt` — assembled system/developer prompt. + +When any subkey is enabled, model and tool spans get bounded, redacted +`openclaw.content.*` attributes for that class only. + +## Sampling and flushing + +- **Traces:** `diagnostics.otel.sampleRate` (root-span only, `0.0` drops all, + `1.0` keeps all). +- **Metrics:** `diagnostics.otel.flushIntervalMs` (minimum `1000`). +- **Logs:** OTLP logs respect `logging.level` (file log level). Console + redaction does **not** apply to OTLP logs. High-volume installs should + prefer OTLP collector sampling/filtering over local sampling. + +## Exported metrics + +### Model usage + +- `openclaw.tokens` (counter, attrs: `openclaw.token`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`) +- `openclaw.cost.usd` (counter, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`) +- `openclaw.run.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`) +- `openclaw.context.tokens` (histogram, attrs: `openclaw.context`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`) +- `gen_ai.client.token.usage` (histogram, GenAI semantic-conventions metric, attrs: `gen_ai.token.type` = `input`/`output`, `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`) +- `gen_ai.client.operation.duration` (histogram, seconds, GenAI semantic-conventions metric, attrs: `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`, optional `error.type`) + +### Message flow + +- `openclaw.webhook.received` (counter, attrs: `openclaw.channel`, `openclaw.webhook`) +- `openclaw.webhook.error` (counter, attrs: `openclaw.channel`, `openclaw.webhook`) +- `openclaw.webhook.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.webhook`) +- 
`openclaw.message.queued` (counter, attrs: `openclaw.channel`, `openclaw.source`) +- `openclaw.message.processed` (counter, attrs: `openclaw.channel`, `openclaw.outcome`) +- `openclaw.message.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.outcome`) +- `openclaw.message.delivery.started` (counter, attrs: `openclaw.channel`, `openclaw.delivery.kind`) +- `openclaw.message.delivery.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`) + +### Queues and sessions + +- `openclaw.queue.lane.enqueue` (counter, attrs: `openclaw.lane`) +- `openclaw.queue.lane.dequeue` (counter, attrs: `openclaw.lane`) +- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`) +- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`) +- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`) +- `openclaw.session.stuck` (counter, attrs: `openclaw.state`) +- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`) +- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`) + +### Exec + +- `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`) + +### Diagnostics internals (memory and tool loop) + +- `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`) +- `openclaw.memory.rss_bytes` (histogram) +- `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`) +- `openclaw.tool.loop.iterations` (counter, attrs: `openclaw.toolName`, `openclaw.outcome`) +- `openclaw.tool.loop.duration_ms` (histogram, attrs: `openclaw.toolName`, `openclaw.outcome`) + +## Exported spans + +- `openclaw.model.usage` + - `openclaw.channel`, `openclaw.provider`, `openclaw.model` + - `openclaw.tokens.*` (input/output/cache_read/cache_write/total) + - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are 
opted in + - `gen_ai.request.model`, `gen_ai.operation.name`, `gen_ai.usage.*` +- `openclaw.run` + - `openclaw.outcome`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`, `openclaw.errorCategory` +- `openclaw.model.call` + - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are opted in + - `gen_ai.request.model`, `gen_ai.operation.name`, `openclaw.provider`, `openclaw.model`, `openclaw.api`, `openclaw.transport` + - `openclaw.provider.request_id_hash` (bounded SHA-based hash of the upstream provider request id; raw ids are not exported) +- `openclaw.tool.execution` + - `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.errorCategory`, `openclaw.tool.params.*` +- `openclaw.exec` + - `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`, `openclaw.exec.command_length`, `openclaw.exec.exit_code`, `openclaw.exec.timed_out` +- `openclaw.webhook.processed` + - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId` +- `openclaw.webhook.error` + - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`, `openclaw.error` +- `openclaw.message.processed` + - `openclaw.channel`, `openclaw.outcome`, `openclaw.chatId`, `openclaw.messageId`, `openclaw.reason` +- `openclaw.message.delivery` + - `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`, `openclaw.delivery.result_count` +- `openclaw.session.stuck` + - `openclaw.state`, `openclaw.ageMs`, `openclaw.queueDepth` +- `openclaw.context.assembled` + - `openclaw.prompt.size`, `openclaw.history.size`, `openclaw.context.tokens`, `openclaw.errorCategory` (no prompt, history, response, or session-key content) +- `openclaw.tool.loop` + - `openclaw.toolName`, `openclaw.outcome`, `openclaw.iterations`, `openclaw.errorCategory` (no loop messages, params, or tool output) +- `openclaw.memory.pressure` + - `openclaw.memory.level`, `openclaw.memory.heap_used_bytes`, `openclaw.memory.rss_bytes` + +When content 
capture is explicitly enabled, model and tool spans can also +include bounded, redacted `openclaw.content.*` attributes for the specific +content classes you opted into. + +## Diagnostic event catalog + +The events below back the metrics and spans above. Plugins can also subscribe +to them directly without OTLP export. + +**Model usage** + +- `model.usage` — tokens, cost, duration, context, provider/model/channel, + session ids. `usage` is provider/turn accounting for cost and telemetry; + `context.used` is the current prompt/context snapshot and can be lower than + provider `usage.total` when cached input or tool-loop calls are involved. + +**Message flow** + +- `webhook.received` / `webhook.processed` / `webhook.error` +- `message.queued` / `message.processed` +- `message.delivery.started` / `message.delivery.completed` / `message.delivery.error` + +**Queue and session** + +- `queue.lane.enqueue` / `queue.lane.dequeue` +- `session.state` / `session.stuck` +- `run.attempt` +- `diagnostic.heartbeat` (aggregate counters: webhooks/queue/session) + +**Exec** + +- `exec.process.completed` — terminal outcome, duration, target, mode, exit + code, and failure kind. Command text and working directories are not + included. + +## Without an exporter + +You can keep diagnostics events available to plugins or custom sinks without +running `diagnostics-otel`: + +```json5 +{ + diagnostics: { enabled: true }, +} +``` + +For targeted debug output without raising `logging.level`, use diagnostics +flags. Flags are case-insensitive and support wildcards (e.g. `telegram.*` or +`*`): + +```json5 +{ + diagnostics: { flags: ["telegram.http"] }, +} +``` + +Or as a one-off env override: + +```bash +OPENCLAW_DIAGNOSTICS=telegram.http,telegram.payload openclaw gateway +``` + +Flag output goes to the standard log file (`logging.file`) and is still +redacted by `logging.redactSensitive`. Full guide: +[Diagnostics flags](/diagnostics/flags). 
+ +## Disable + +```json5 +{ + diagnostics: { otel: { enabled: false } }, +} +``` + +You can also leave `diagnostics-otel` out of `plugins.allow`, or run +`openclaw plugins disable diagnostics-otel`. + +## Related + +- [Logging](/logging) — file logs, console output, CLI tailing, and the Control UI Logs tab +- [Gateway logging internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture +- [Diagnostics flags](/diagnostics/flags) — targeted debug-log flags +- [Diagnostics export](/gateway/diagnostics) — operator support-bundle tool (separate from OTEL export) +- [Configuration reference](/gateway/configuration-reference#diagnostics) — full `diagnostics.*` field reference diff --git a/docs/logging.md b/docs/logging.md index 6da54994f7b..4d7a3e015b8 100644 --- a/docs/logging.md +++ b/docs/logging.md @@ -1,14 +1,12 @@ --- -summary: "Logging overview: file logs, console output, CLI tailing, and the Control UI" +summary: "File logs, console output, CLI tailing, and the Control UI Logs tab" read_when: - - You need a beginner-friendly overview of logging - - You want to configure log levels or formats + - You need a beginner-friendly overview of OpenClaw logging + - You want to configure log levels, formats, or redaction - You are troubleshooting and need to find logs quickly -title: "Logging overview" +title: "Logging" --- -# Logging - OpenClaw has two main log surfaces: - **File logs** (JSON lines) written by the Gateway. @@ -171,308 +169,35 @@ Tool summaries can redact sensitive tokens before they hit the console: Redaction affects **console output only** and does not alter file logs. -## Diagnostics + OpenTelemetry +## Diagnostics and OpenTelemetry -Diagnostics are structured, machine-readable events for model runs **and** +Diagnostics are structured, machine-readable events for model runs and message-flow telemetry (webhooks, queueing, session state). They do **not** -replace logs; they exist to feed metrics, traces, and other exporters. 
+replace logs — they feed metrics, traces, and exporters. Events are emitted +in-process whether or not you export them. -Diagnostics events are emitted in-process, but exporters only attach when -diagnostics + the exporter plugin are enabled. +Two adjacent surfaces: -### OpenTelemetry vs OTLP +- **OpenTelemetry export** — send metrics, traces, and logs over OTLP/HTTP to + any OpenTelemetry-compatible collector or backend (Grafana, Datadog, + Honeycomb, New Relic, Tempo, etc.). Full configuration, signal catalog, + metric/span names, env vars, and privacy model live on a dedicated page: + [OpenTelemetry export](/gateway/opentelemetry). +- **Diagnostics flags** — targeted debug-log flags that route extra logs to + `logging.file` without raising `logging.level`. Flags are case-insensitive + and support wildcards (`telegram.*`, `*`). Configure under `diagnostics.flags` + or via the `OPENCLAW_DIAGNOSTICS=...` env override. Full guide: + [Diagnostics flags](/diagnostics/flags). -- **OpenTelemetry (OTel)**: the data model + SDKs for traces, metrics, and logs. -- **OTLP**: the wire protocol used to export OTel data to a collector/backend. -- OpenClaw exports via **OTLP/HTTP (protobuf)** today. +To enable diagnostics events for plugins or custom sinks without OTLP export: -### Signals exported - -- **Metrics**: counters + histograms (token usage, message flow, queueing). -- **Traces**: spans for model usage + webhook/message processing. -- **Logs**: exported over OTLP when `diagnostics.otel.logs` is enabled. Log - volume can be high; keep `logging.level` and exporter filters in mind. - -### Diagnostic event catalog - -Model usage: - -- `model.usage`: tokens, cost, duration, context, provider/model/channel, session ids. - `usage` is provider/turn accounting for cost and telemetry; `context.used` - is the current prompt/context snapshot and can be lower than provider - `usage.total` when cached input or tool-loop calls are involved. 
- -Message flow: - -- `webhook.received`: webhook ingress per channel. -- `webhook.processed`: webhook handled + duration. -- `webhook.error`: webhook handler errors. -- `message.queued`: message enqueued for processing. -- `message.processed`: outcome + duration + optional error. -- `message.delivery.started`: outbound delivery attempt started. -- `message.delivery.completed`: outbound delivery attempt finished + duration/result count. -- `message.delivery.error`: outbound delivery attempt failed + duration/bounded error category. - -Queue + session: - -- `queue.lane.enqueue`: command queue lane enqueue + depth. -- `queue.lane.dequeue`: command queue lane dequeue + wait time. -- `session.state`: session state transition + reason. -- `session.stuck`: session stuck warning + age. -- `run.attempt`: run retry/attempt metadata. -- `diagnostic.heartbeat`: aggregate counters (webhooks/queue/session). - -Exec: - -- `exec.process.completed`: terminal exec process outcome, duration, target, mode, - exit code, and failure kind. Command text and working directories are not - included. - -### Enable diagnostics (no exporter) - -Use this if you want diagnostics events available to plugins or custom sinks: - -```json +```json5 { - "diagnostics": { - "enabled": true - } + diagnostics: { enabled: true }, } ``` -### Diagnostics flags (targeted logs) - -Use flags to turn on extra, targeted debug logs without raising `logging.level`. -Flags are case-insensitive and support wildcards (e.g. `telegram.*` or `*`). - -```json -{ - "diagnostics": { - "flags": ["telegram.http"] - } -} -``` - -Env override (one-off): - -``` -OPENCLAW_DIAGNOSTICS=telegram.http,telegram.payload -``` - -Notes: - -- Flag logs go to the standard log file (same as `logging.file`). -- Output is still redacted according to `logging.redactSensitive`. -- Full guide: [/diagnostics/flags](/diagnostics/flags). - -### Export to OpenTelemetry - -Diagnostics can be exported via the `diagnostics-otel` plugin (OTLP/HTTP). 
This -works with any OpenTelemetry collector/backend that accepts OTLP/HTTP. - -```json -{ - "plugins": { - "allow": ["diagnostics-otel"], - "entries": { - "diagnostics-otel": { - "enabled": true - } - } - }, - "diagnostics": { - "enabled": true, - "otel": { - "enabled": true, - "endpoint": "http://otel-collector:4318", - "protocol": "http/protobuf", - "serviceName": "openclaw-gateway", - "traces": true, - "metrics": true, - "logs": true, - "sampleRate": 0.2, - "flushIntervalMs": 60000, - "captureContent": { - "enabled": false, - "inputMessages": false, - "outputMessages": false, - "toolInputs": false, - "toolOutputs": false, - "systemPrompt": false - } - } - } -} -``` - -Notes: - -- You can also enable the plugin with `openclaw plugins enable diagnostics-otel`. -- `protocol` currently supports `http/protobuf` only. `grpc` is ignored. -- Metrics include token usage, cost, context size, run duration, and message-flow - counters/histograms (webhooks, queueing, session state, queue depth/wait), - plus GenAI token usage and model-call duration histograms. -- Traces/metrics can be toggled with `traces` / `metrics` (default: on). Traces - include model usage spans plus webhook/message processing spans when enabled. -- Raw model/tool content is not exported by default. Use - `diagnostics.otel.captureContent` only when your collector and retention policy - are approved for prompt, response, tool, or system prompt text. -- Set `headers` when your collector requires auth. -- Environment variables supported: `OTEL_EXPORTER_OTLP_ENDPOINT`, - `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_PROTOCOL`. -- Set `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` to emit the - latest experimental GenAI provider span attribute (`gen_ai.provider.name`) - instead of the legacy span attribute (`gen_ai.system`). GenAI metrics always - use bounded, low-cardinality semantic attributes. 
-- Set `OPENCLAW_OTEL_PRELOADED=1` when another preload or host process already - registered the global OpenTelemetry SDK. In that mode the plugin does not start - or shut down its own SDK, but it still wires OpenClaw diagnostic listeners and - honors `diagnostics.otel.traces`, `metrics`, and `logs`. - -### Exported metrics (names + types) - -Model usage: - -- `openclaw.tokens` (counter, attrs: `openclaw.token`, `openclaw.channel`, - `openclaw.provider`, `openclaw.model`) -- `openclaw.cost.usd` (counter, attrs: `openclaw.channel`, `openclaw.provider`, - `openclaw.model`) -- `openclaw.run.duration_ms` (histogram, attrs: `openclaw.channel`, - `openclaw.provider`, `openclaw.model`) -- `openclaw.context.tokens` (histogram, attrs: `openclaw.context`, - `openclaw.channel`, `openclaw.provider`, `openclaw.model`) -- `gen_ai.client.token.usage` (histogram, GenAI semantic-conventions metric, - attrs: `gen_ai.token.type` = `input`/`output`, `gen_ai.provider.name`, - `gen_ai.operation.name`, `gen_ai.request.model`) -- `gen_ai.client.operation.duration` (histogram, seconds, GenAI - semantic-conventions metric, attrs: `gen_ai.provider.name`, - `gen_ai.operation.name`, `gen_ai.request.model`, optional `error.type`) - -Message flow: - -- `openclaw.webhook.received` (counter, attrs: `openclaw.channel`, - `openclaw.webhook`) -- `openclaw.webhook.error` (counter, attrs: `openclaw.channel`, - `openclaw.webhook`) -- `openclaw.webhook.duration_ms` (histogram, attrs: `openclaw.channel`, - `openclaw.webhook`) -- `openclaw.message.queued` (counter, attrs: `openclaw.channel`, - `openclaw.source`) -- `openclaw.message.processed` (counter, attrs: `openclaw.channel`, - `openclaw.outcome`) -- `openclaw.message.duration_ms` (histogram, attrs: `openclaw.channel`, - `openclaw.outcome`) -- `openclaw.message.delivery.started` (counter, attrs: `openclaw.channel`, - `openclaw.delivery.kind`) -- `openclaw.message.delivery.duration_ms` (histogram, attrs: - `openclaw.channel`, `openclaw.delivery.kind`, 
`openclaw.outcome`, - `openclaw.errorCategory`) - -Queues + sessions: - -- `openclaw.queue.lane.enqueue` (counter, attrs: `openclaw.lane`) -- `openclaw.queue.lane.dequeue` (counter, attrs: `openclaw.lane`) -- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or - `openclaw.channel=heartbeat`) -- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`) -- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`) -- `openclaw.session.stuck` (counter, attrs: `openclaw.state`) -- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`) -- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`) - -Exec: - -- `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`, - `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`) - -Diagnostics internals (memory + tool loop): - -- `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`) -- `openclaw.memory.rss_bytes` (histogram) -- `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`) -- `openclaw.tool.loop.iterations` (counter, attrs: `openclaw.toolName`, - `openclaw.outcome`) -- `openclaw.tool.loop.duration_ms` (histogram, attrs: `openclaw.toolName`, - `openclaw.outcome`) - -### Exported spans (names + key attributes) - -- `openclaw.model.usage` - - `openclaw.channel`, `openclaw.provider`, `openclaw.model` - - `openclaw.tokens.*` (input/output/cache_read/cache_write/total) - - `gen_ai.system` by default, or `gen_ai.provider.name` when latest GenAI - semantic conventions are opted in - - `gen_ai.request.model`, `gen_ai.operation.name`, `gen_ai.usage.*` -- `openclaw.run` - - `openclaw.outcome`, `openclaw.channel`, `openclaw.provider`, - `openclaw.model`, `openclaw.errorCategory` -- `openclaw.model.call` - - `gen_ai.system` by default, or `gen_ai.provider.name` when latest GenAI - semantic conventions are opted in - - `gen_ai.request.model`, `gen_ai.operation.name`, - `openclaw.provider`, `openclaw.model`, `openclaw.api`, - 
`openclaw.transport`, `openclaw.provider.request_id_hash` (bounded - SHA-based hash of the upstream provider request id; raw ids are not - exported) -- `openclaw.tool.execution` - - `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.errorCategory`, - `openclaw.tool.params.*` -- `openclaw.exec` - - `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, - `openclaw.failureKind`, `openclaw.exec.command_length`, - `openclaw.exec.exit_code`, `openclaw.exec.timed_out` -- `openclaw.webhook.processed` - - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId` -- `openclaw.webhook.error` - - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`, - `openclaw.error` -- `openclaw.message.processed` - - `openclaw.channel`, `openclaw.outcome`, `openclaw.chatId`, - `openclaw.messageId`, `openclaw.reason` -- `openclaw.message.delivery` - - `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, - `openclaw.errorCategory`, `openclaw.delivery.result_count` -- `openclaw.session.stuck` - - `openclaw.state`, `openclaw.ageMs`, `openclaw.queueDepth` -- `openclaw.context.assembled` - - `openclaw.prompt.size`, `openclaw.history.size`, - `openclaw.context.tokens`, `openclaw.errorCategory` (no prompt, - history, response, or session-key content) -- `openclaw.tool.loop` - - `openclaw.toolName`, `openclaw.outcome`, `openclaw.iterations`, - `openclaw.errorCategory` (no loop messages, params, or tool output) -- `openclaw.memory.pressure` - - `openclaw.memory.level`, `openclaw.memory.heap_used_bytes`, - `openclaw.memory.rss_bytes` - -When content capture is explicitly enabled, model/tool spans can also include -bounded, redacted `openclaw.content.*` attributes for the specific content -classes you opted into. - -### Sampling + flushing - -- Trace sampling: `diagnostics.otel.sampleRate` (0.0–1.0, root spans only). -- Metric export interval: `diagnostics.otel.flushIntervalMs` (min 1000ms). 
- -### Protocol notes - -- OTLP/HTTP endpoints can be set via `diagnostics.otel.endpoint` or - `OTEL_EXPORTER_OTLP_ENDPOINT`. -- If the endpoint already contains `/v1/traces` or `/v1/metrics`, it is used as-is. -- If the endpoint already contains `/v1/logs`, it is used as-is for logs. -- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` controls only the - GenAI span provider attribute shape. Existing dashboards that read - `gen_ai.system` can keep the default until they migrate. -- `OPENCLAW_OTEL_PRELOADED=1` reuses an externally registered OpenTelemetry SDK - for traces/metrics instead of starting a plugin-owned NodeSDK. -- `diagnostics.otel.logs` enables OTLP log export for the main logger output. - -### Log export behavior - -- OTLP logs use the same structured records written to `logging.file`. -- Respect `logging.level` (file log level). Console redaction does **not** apply - to OTLP logs. -- High-volume installs should prefer OTLP collector sampling/filtering. +For OTLP export to a collector, see [OpenTelemetry export](/gateway/opentelemetry). ## Troubleshooting tips @@ -483,5 +208,7 @@ classes you opted into. ## Related -- [Gateway Logging Internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture -- [Diagnostics](/gateway/configuration-reference#diagnostics) — OpenTelemetry export and cache trace config +- [OpenTelemetry export](/gateway/opentelemetry) — OTLP/HTTP export, metric/span catalog, privacy model +- [Diagnostics flags](/diagnostics/flags) — targeted debug-log flags +- [Gateway logging internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture +- [Configuration reference](/gateway/configuration-reference#diagnostics) — full `diagnostics.*` field reference
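
Reviewer note on the endpoint rule carried over to the new OpenTelemetry page ("If the value already contains `/v1/traces`, `/v1/metrics`, or `/v1/logs`, it is used as-is"): the sketch below shows one way that rule could be read. It is not taken from the `diagnostics-otel` source. The helper names `resolveSignalUrl` and `pickEndpoint` are hypothetical, and two behaviors are assumptions rather than anything this diff confirms: appending `/v1/<signal>` to a bare base URL (the usual OTLP/HTTP convention) and treating an empty environment value as unset.

```typescript
type Signal = "traces" | "metrics" | "logs";

// Hypothetical helper, not from the plugin source.
function resolveSignalUrl(endpoint: string, signal: Signal): string {
  const signalPath = `/v1/${signal}`;
  // Documented rule: an endpoint that already names the signal path is used as-is.
  if (endpoint.includes(signalPath)) {
    return endpoint;
  }
  // Assumption: otherwise strip trailing slashes and append the conventional
  // OTLP/HTTP path for the signal.
  return endpoint.replace(/\/+$/, "") + signalPath;
}

// Hypothetical helper: the docs say OTEL_EXPORTER_OTLP_ENDPOINT overrides
// diagnostics.otel.endpoint. Treating a blank value as unset is an assumption.
function pickEndpoint(
  configEndpoint: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  const fromEnv = env["OTEL_EXPORTER_OTLP_ENDPOINT"]?.trim();
  return fromEnv ? fromEnv : configEndpoint;
}
```

Under these assumptions, `resolveSignalUrl("http://otel-collector:4318", "traces")` yields `http://otel-collector:4318/v1/traces`, while an endpoint that already ends in `/v1/metrics` is returned unchanged for the metrics signal. It may be worth stating on the new page what happens when the endpoint names one signal path but a different signal is being exported, since neither the old nor the new wording covers that case.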