Mirror of https://github.com/openclaw/openclaw.git (synced 2026-05-06 08:20:43 +00:00)
docs: split OpenTelemetry export into its own page under gateway
Logging.md had grown to 487 lines with ~300 lines dedicated to OpenTelemetry export — wire protocol, full metric/span catalog, env vars, captureContent shape, sampling, the diagnostic event catalog, and protocol notes — leaving the genuine logging overview buried behind exporter reference material.

Move the OTEL surface to a dedicated page and slim logging.md to a focused logs overview:

- Add docs/gateway/opentelemetry.md (OpenTelemetry export). Same content reorganized: how it fits together, quick start, signals, configuration reference + env vars table, privacy/captureContent, sampling/flushing, full metric and span catalog, diagnostic event catalog, no-exporter mode, diagnostics flags pointer, disable.
- docs/logging.md: drop the OTEL section in favor of a short 'Diagnostics and OpenTelemetry' summary that cross-links the new page and the diagnostics-flags page. Drops 273 lines net. Also drops the redundant body H1, retitles to 'Logging' (was 'Logging overview' which mismatched sidebar usage), and refreshes the Related list.
- docs/docs.json: insert gateway/opentelemetry into the 'Health and diagnostics' sidebar group, reorder pages so the user-facing health/run pages come before exporter/internals pages, and put logging next to opentelemetry where readers naturally associate them.
- docs/gateway/diagnostics.md, docs/gateway/logging.md, docs/gateway/configuration-reference.md: cross-link the new page and sentence-case stale Title-Cased Related entries on diagnostics.md.
@@ -1436,11 +1436,12 @@
"group": "Health and diagnostics",
"pages": [
  "gateway/health",
  "gateway/diagnostics",
  "gateway/heartbeat",
  "gateway/doctor",
  "gateway/logging",
  "logging",
  "gateway/opentelemetry",
  "gateway/logging",
  "gateway/diagnostics",
  "gateway/troubleshooting"
]
},

@@ -909,7 +909,7 @@ Notes:
- `enabled`: master toggle for instrumentation output (default: `true`).
- `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`).
- `stuckSessionWarnMs`: age threshold in ms for emitting stuck-session warnings while a session remains in processing state.
- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`).
- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry).
- `otel.endpoint`: collector URL for OTel export.
- `otel.protocol`: `"http/protobuf"` (default) or `"grpc"`.
- `otel.headers`: extra HTTP/gRPC metadata headers sent with OTel export requests.

@@ -129,9 +129,10 @@ diagnostic event collection:
Disabling diagnostics reduces bug-report detail. It does not affect normal
Gateway logging.

## Related docs
## Related

- [Health Checks](/gateway/health)
- [Health checks](/gateway/health)
- [Gateway CLI](/cli/gateway#gateway-diagnostics-export)
- [Gateway Protocol](/gateway/protocol#system-and-identity)
- [Gateway protocol](/gateway/protocol#system-and-identity)
- [Logging](/logging)
- [OpenTelemetry export](/gateway/opentelemetry) — separate flow for streaming diagnostics to a collector

@@ -114,5 +114,6 @@ This keeps existing file logs stable while making interactive output scannable.

## Related

- [Logging overview](/logging)
- [Logging](/logging)
- [OpenTelemetry export](/gateway/opentelemetry)
- [Diagnostics export](/gateway/diagnostics)

304
docs/gateway/opentelemetry.md
Normal file
@@ -0,0 +1,304 @@
---
summary: "Export OpenClaw diagnostics to any OpenTelemetry collector via the diagnostics-otel plugin (OTLP/HTTP)"
title: "OpenTelemetry export"
read_when:
  - You want to send OpenClaw model usage, message flow, or session metrics to an OpenTelemetry collector
  - You are wiring traces, metrics, or logs into Grafana, Datadog, Honeycomb, New Relic, Tempo, or another OTLP backend
  - You need the exact metric names, span names, or attribute shapes to build dashboards or alerts
---

OpenClaw exports diagnostics through the bundled `diagnostics-otel` plugin
using **OTLP/HTTP (protobuf)**. Any collector or backend that accepts OTLP/HTTP
works without code changes. For local file logs and how to read them, see
[Logging](/logging).

## How it fits together

- **Diagnostics events** are structured, in-process records emitted by the
  Gateway and bundled plugins for model runs, message flow, sessions, queues,
  and exec.
- **`diagnostics-otel` plugin** subscribes to those events and exports them as
  OpenTelemetry **metrics**, **traces**, and **logs** over OTLP/HTTP.
- Exporters only attach when both the diagnostics surface and the plugin are
  enabled, so the in-process cost stays near zero by default.

## Quick start

```json5
{
  plugins: {
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://otel-collector:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: true,
      sampleRate: 0.2,
      flushIntervalMs: 60000,
    },
  },
}
```

You can also enable the plugin from the CLI:

```bash
openclaw plugins enable diagnostics-otel
```

<Note>
  `protocol` currently supports `http/protobuf` only. `grpc` is ignored.
</Note>

## Signals exported

| Signal      | What goes in it |
| ----------- | --------------- |
| **Metrics** | Counters and histograms for token usage, cost, run duration, message flow, queue lanes, session state, exec, and memory pressure. |
| **Traces**  | Spans for model usage, model calls, tool execution, exec, webhook/message processing, context assembly, and tool loops. |
| **Logs**    | Structured `logging.file` records exported over OTLP when `diagnostics.otel.logs` is enabled. |

Toggle `traces`, `metrics`, and `logs` independently. All three default to on
when `diagnostics.otel.enabled` is true.
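
The gating rules described so far (diagnostics on, plugin on, export on, then one toggle per signal) can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the plugin's code: `active_signals` is a hypothetical helper, the config is a plain dict shaped like the quick-start example, the plugin is assumed to be off unless its entry says otherwise, and `plugins.allow` handling is left out.

```python
from typing import Any

SIGNALS = ("traces", "metrics", "logs")


def active_signals(config: dict[str, Any]) -> set[str]:
    """Return the signals that would be exported for a config dict.

    Hypothetical helper that only mirrors the rules stated in the text.
    """
    diagnostics = config.get("diagnostics", {})
    otel = diagnostics.get("otel", {})
    plugin = config.get("plugins", {}).get("entries", {}).get("diagnostics-otel", {})

    # Exporters attach only when the diagnostics surface, the plugin,
    # and diagnostics.otel.enabled are all on.
    if not diagnostics.get("enabled", True):  # documented default: true
        return set()
    if not plugin.get("enabled", False):  # assumption: plugin is off by default
        return set()
    if not otel.get("enabled", False):  # documented default: false
        return set()

    # Each signal toggle defaults to on once export is enabled.
    return {name for name in SIGNALS if otel.get(name, True)}
```

With the quick-start config and `logs: false`, the helper would report only traces and metrics; with the plugin entry missing, it reports nothing.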

## Configuration reference

```json5
{
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://otel-collector:4318",
      protocol: "http/protobuf", // grpc is ignored
      serviceName: "openclaw-gateway",
      headers: { "x-collector-token": "..." },
      traces: true,
      metrics: true,
      logs: true,
      sampleRate: 0.2, // root-span sampler, 0.0..1.0
      flushIntervalMs: 60000, // metric export interval (min 1000ms)
      captureContent: {
        enabled: false,
        inputMessages: false,
        outputMessages: false,
        toolInputs: false,
        toolOutputs: false,
        systemPrompt: false,
      },
    },
  },
}
```

### Environment variables

| Variable                        | Purpose |
| ------------------------------- | ------- |
| `OTEL_EXPORTER_OTLP_ENDPOINT`   | Override `diagnostics.otel.endpoint`. If the value already contains `/v1/traces`, `/v1/metrics`, or `/v1/logs`, it is used as-is. |
| `OTEL_SERVICE_NAME`             | Override `diagnostics.otel.serviceName`. |
| `OTEL_EXPORTER_OTLP_PROTOCOL`   | Override the wire protocol (only `http/protobuf` is honored today). |
| `OTEL_SEMCONV_STABILITY_OPT_IN` | Set to `gen_ai_latest_experimental` to emit the latest experimental GenAI span attribute (`gen_ai.provider.name`) instead of the legacy `gen_ai.system`. GenAI metrics always use bounded, low-cardinality semantic attributes regardless. |
| `OPENCLAW_OTEL_PRELOADED`       | Set to `1` when another preload or host process already registered the global OpenTelemetry SDK. The plugin then skips its own NodeSDK lifecycle but still wires diagnostic listeners and honors `traces`/`metrics`/`logs`. |
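
The endpoint rule in the table can be read as a small resolution step per signal. The sketch below is one plausible reading in Python, not the plugin's implementation: `resolve_signal_url` is a hypothetical name, and appending `/v1/<signal>` to a base URL is assumed from the usual OTLP/HTTP convention. What happens when the endpoint carries a different signal's path is not stated in this page, so the sketch does not try to handle that case specially.

```python
from typing import Mapping, Optional

# Standard OTLP/HTTP per-signal paths.
SIGNAL_PATHS = {
    "traces": "/v1/traces",
    "metrics": "/v1/metrics",
    "logs": "/v1/logs",
}


def resolve_signal_url(
    configured: Optional[str], signal: str, env: Mapping[str, str]
) -> Optional[str]:
    """Pick the export URL for one signal (hypothetical helper)."""
    # The environment variable overrides diagnostics.otel.endpoint.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or configured
    if not endpoint:
        return None

    path = SIGNAL_PATHS[signal]
    # An endpoint that already names this signal's path is used as-is.
    if path in endpoint:
        return endpoint
    # Assumption: otherwise the signal path is appended to the base URL.
    return endpoint.rstrip("/") + path
```

Passing the environment as a mapping (for example `os.environ`) keeps the function easy to exercise without touching the real process environment.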

## Privacy and content capture

Raw model/tool content is **not** exported by default. Spans carry bounded
identifiers (channel, provider, model, error category, hash-only request ids)
and never include prompt text, response text, tool inputs, tool outputs, or
session keys.

Set `diagnostics.otel.captureContent.*` to `true` only when your collector and
retention policy are approved for prompt, response, tool, or system-prompt
text. Each subkey is opt-in independently:

- `inputMessages` — user prompt content.
- `outputMessages` — model response content.
- `toolInputs` — tool argument payloads.
- `toolOutputs` — tool result payloads.
- `systemPrompt` — assembled system/developer prompt.

When any subkey is enabled, model and tool spans get bounded, redacted
`openclaw.content.*` attributes for that class only.
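
A rough Python sketch of the per-class opt-in follows. Several details are assumptions made for illustration: the attribute key suffixes under `openclaw.content.`, the 256-character bound, and the pluggable `redact` callable are all hypothetical, and the role of `captureContent.enabled` is not described here, so the sketch looks only at the individual subkeys.

```python
from typing import Callable, Mapping

CONTENT_CLASSES = (
    "inputMessages",
    "outputMessages",
    "toolInputs",
    "toolOutputs",
    "systemPrompt",
)


def content_attributes(
    capture: Mapping[str, bool],
    payloads: Mapping[str, str],
    redact: Callable[[str], str] = lambda text: text,
    max_chars: int = 256,
) -> dict[str, str]:
    """Build span attributes only for content classes that were opted in."""
    attrs: dict[str, str] = {}
    for name in CONTENT_CLASSES:
        # Every class stays off unless its subkey is explicitly true.
        if not capture.get(name, False):
            continue
        if name not in payloads:
            continue
        text = redact(payloads[name])
        # Hypothetical key shape; the value is truncated to keep it bounded.
        attrs[f"openclaw.content.{name}"] = text[:max_chars]
    return attrs
```

Enabling only `toolInputs` yields a single attribute for tool arguments, and prompt text in the same payload map is never touched.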

## Sampling and flushing

- **Traces:** `diagnostics.otel.sampleRate` (root-span only, `0.0` drops all,
  `1.0` keeps all).
- **Metrics:** `diagnostics.otel.flushIntervalMs` (minimum `1000`).
- **Logs:** OTLP logs respect `logging.level` (file log level). Console
  redaction does **not** apply to OTLP logs. High-volume installs should
  prefer OTLP collector sampling/filtering over local sampling.
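
To make the two numeric knobs concrete, here is a small Python sketch. It assumes a trace-ID ratio decision of the kind OpenTelemetry SDKs commonly use for root spans, and it assumes values below the minimum flush interval are clamped rather than rejected; neither detail is confirmed by this page, and both function names are hypothetical.

```python
def keep_root_span(trace_id: int, sample_rate: float) -> bool:
    """Ratio-based decision for a root span, deterministic per trace id."""
    rate = min(max(sample_rate, 0.0), 1.0)
    # Compare the low 64 bits of the trace id against rate * 2**64.
    bound = round(rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def effective_flush_interval_ms(configured_ms: int) -> int:
    """Apply the documented 1000 ms floor (assumed to clamp)."""
    return max(configured_ms, 1000)
```

Because the decision depends only on the trace id, every process that sees the same trace reaches the same verdict, which is why a rate of `0.2` keeps roughly one trace in five rather than one span in five.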

## Exported metrics

### Model usage

- `openclaw.tokens` (counter, attrs: `openclaw.token`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `openclaw.cost.usd` (counter, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `openclaw.run.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `openclaw.context.tokens` (histogram, attrs: `openclaw.context`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `gen_ai.client.token.usage` (histogram, GenAI semantic-conventions metric, attrs: `gen_ai.token.type` = `input`/`output`, `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`)
- `gen_ai.client.operation.duration` (histogram, seconds, GenAI semantic-conventions metric, attrs: `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`, optional `error.type`)

### Message flow

- `openclaw.webhook.received` (counter, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.webhook.error` (counter, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.webhook.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.message.queued` (counter, attrs: `openclaw.channel`, `openclaw.source`)
- `openclaw.message.processed` (counter, attrs: `openclaw.channel`, `openclaw.outcome`)
- `openclaw.message.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.outcome`)
- `openclaw.message.delivery.started` (counter, attrs: `openclaw.channel`, `openclaw.delivery.kind`)
- `openclaw.message.delivery.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`)

### Queues and sessions

- `openclaw.queue.lane.enqueue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.lane.dequeue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`)
- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
- `openclaw.session.stuck` (counter, attrs: `openclaw.state`)
- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`)
- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`)

### Exec

- `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`)

### Diagnostics internals (memory and tool loop)

- `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`)
- `openclaw.memory.rss_bytes` (histogram)
- `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`)
- `openclaw.tool.loop.iterations` (counter, attrs: `openclaw.toolName`, `openclaw.outcome`)
- `openclaw.tool.loop.duration_ms` (histogram, attrs: `openclaw.toolName`, `openclaw.outcome`)

## Exported spans

- `openclaw.model.usage`
  - `openclaw.channel`, `openclaw.provider`, `openclaw.model`
  - `openclaw.tokens.*` (input/output/cache_read/cache_write/total)
  - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`, `gen_ai.usage.*`
- `openclaw.run`
  - `openclaw.outcome`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`, `openclaw.errorCategory`
- `openclaw.model.call`
  - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`, `openclaw.provider`, `openclaw.model`, `openclaw.api`, `openclaw.transport`
  - `openclaw.provider.request_id_hash` (bounded SHA-based hash of the upstream provider request id; raw ids are not exported)
- `openclaw.tool.execution`
  - `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.errorCategory`, `openclaw.tool.params.*`
- `openclaw.exec`
  - `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`, `openclaw.exec.command_length`, `openclaw.exec.exit_code`, `openclaw.exec.timed_out`
- `openclaw.webhook.processed`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`
- `openclaw.webhook.error`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`, `openclaw.error`
- `openclaw.message.processed`
  - `openclaw.channel`, `openclaw.outcome`, `openclaw.chatId`, `openclaw.messageId`, `openclaw.reason`
- `openclaw.message.delivery`
  - `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`, `openclaw.delivery.result_count`
- `openclaw.session.stuck`
  - `openclaw.state`, `openclaw.ageMs`, `openclaw.queueDepth`
- `openclaw.context.assembled`
  - `openclaw.prompt.size`, `openclaw.history.size`, `openclaw.context.tokens`, `openclaw.errorCategory` (no prompt, history, response, or session-key content)
- `openclaw.tool.loop`
  - `openclaw.toolName`, `openclaw.outcome`, `openclaw.iterations`, `openclaw.errorCategory` (no loop messages, params, or tool output)
- `openclaw.memory.pressure`
  - `openclaw.memory.level`, `openclaw.memory.heap_used_bytes`, `openclaw.memory.rss_bytes`

When content capture is explicitly enabled, model and tool spans can also
include bounded, redacted `openclaw.content.*` attributes for the specific
content classes you opted into.
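
The `openclaw.provider.request_id_hash` attribute listed under `openclaw.model.call` is described only as a bounded SHA-based hash. As a sketch of what such a value could look like, the Python below truncates a SHA-256 hex digest. The choice of SHA-256, the 16-character length, and the function name are assumptions made for illustration.

```python
import hashlib


def request_id_hash(request_id: str, length: int = 16) -> str:
    """Return a short, fixed-length hash of a provider request id.

    Assumed scheme: SHA-256 hex digest truncated to `length` characters.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return digest[:length]
```

A value like this still lets you correlate spans that share an upstream request without putting the raw provider id into the telemetry backend.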

## Diagnostic event catalog

The events below back the metrics and spans above. Plugins can also subscribe
to them directly without OTLP export.

**Model usage**

- `model.usage` — tokens, cost, duration, context, provider/model/channel,
  session ids. `usage` is provider/turn accounting for cost and telemetry;
  `context.used` is the current prompt/context snapshot and can be lower than
  provider `usage.total` when cached input or tool-loop calls are involved.

**Message flow**

- `webhook.received` / `webhook.processed` / `webhook.error`
- `message.queued` / `message.processed`
- `message.delivery.started` / `message.delivery.completed` / `message.delivery.error`

**Queue and session**

- `queue.lane.enqueue` / `queue.lane.dequeue`
- `session.state` / `session.stuck`
- `run.attempt`
- `diagnostic.heartbeat` (aggregate counters: webhooks/queue/session)

**Exec**

- `exec.process.completed` — terminal outcome, duration, target, mode, exit
  code, and failure kind. Command text and working directories are not
  included.

## Without an exporter

You can keep diagnostics events available to plugins or custom sinks without
running `diagnostics-otel`:

```json5
{
  diagnostics: { enabled: true },
}
```

For targeted debug output without raising `logging.level`, use diagnostics
flags. Flags are case-insensitive and support wildcards (e.g. `telegram.*` or
`*`):

```json5
{
  diagnostics: { flags: ["telegram.http"] },
}
```

Or as a one-off env override:

```bash
OPENCLAW_DIAGNOSTICS=telegram.http,telegram.payload openclaw gateway
```

Flag output goes to the standard log file (`logging.file`) and is still
redacted by `logging.redactSensitive`. Full guide:
[Diagnostics flags](/diagnostics/flags).
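
The flag rules above (comma-separated list, case-insensitive, `*` wildcards) can be approximated with glob matching. The Python below is a sketch under that assumption: `parse_flags` and `flag_enabled` are hypothetical names, and `fnmatchcase` also treats `?` and `[...]` as special, which real flag names are assumed not to contain.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def parse_flags(raw: str) -> list[str]:
    """Split an OPENCLAW_DIAGNOSTICS-style value into lowercase flags."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def flag_enabled(name: str, flags: Iterable[str]) -> bool:
    """Check a flag name against configured patterns, ignoring case."""
    target = name.lower()
    return any(fnmatchcase(target, pattern.lower()) for pattern in flags)
```

For example, `flag_enabled("telegram.http", ["telegram.*"])` matches, while a `discord.http` check against the same list does not.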

## Disable

```json5
{
  diagnostics: { otel: { enabled: false } },
}
```

You can also leave `diagnostics-otel` out of `plugins.allow`, or run
`openclaw plugins disable diagnostics-otel`.

## Related

- [Logging](/logging) — file logs, console output, CLI tailing, and the Control UI Logs tab
- [Gateway logging internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture
- [Diagnostics flags](/diagnostics/flags) — targeted debug-log flags
- [Diagnostics export](/gateway/diagnostics) — operator support-bundle tool (separate from OTEL export)
- [Configuration reference](/gateway/configuration-reference#diagnostics) — full `diagnostics.*` field reference

327
docs/logging.md
@@ -1,14 +1,12 @@
---
summary: "Logging overview: file logs, console output, CLI tailing, and the Control UI"
summary: "File logs, console output, CLI tailing, and the Control UI Logs tab"
read_when:
  - You need a beginner-friendly overview of logging
  - You want to configure log levels or formats
  - You need a beginner-friendly overview of OpenClaw logging
  - You want to configure log levels, formats, or redaction
  - You are troubleshooting and need to find logs quickly
title: "Logging overview"
title: "Logging"
---

# Logging

OpenClaw has two main log surfaces:

- **File logs** (JSON lines) written by the Gateway.

@@ -171,308 +169,35 @@ Tool summaries can redact sensitive tokens before they hit the console:

Redaction affects **console output only** and does not alter file logs.

## Diagnostics + OpenTelemetry
## Diagnostics and OpenTelemetry

Diagnostics are structured, machine-readable events for model runs **and**
Diagnostics are structured, machine-readable events for model runs and
message-flow telemetry (webhooks, queueing, session state). They do **not**
replace logs; they exist to feed metrics, traces, and other exporters.
replace logs — they feed metrics, traces, and exporters. Events are emitted
in-process whether or not you export them.

Diagnostics events are emitted in-process, but exporters only attach when
diagnostics + the exporter plugin are enabled.
Two adjacent surfaces:

### OpenTelemetry vs OTLP
- **OpenTelemetry export** — send metrics, traces, and logs over OTLP/HTTP to
  any OpenTelemetry-compatible collector or backend (Grafana, Datadog,
  Honeycomb, New Relic, Tempo, etc.). Full configuration, signal catalog,
  metric/span names, env vars, and privacy model live on a dedicated page:
  [OpenTelemetry export](/gateway/opentelemetry).
- **Diagnostics flags** — targeted debug-log flags that route extra logs to
  `logging.file` without raising `logging.level`. Flags are case-insensitive
  and support wildcards (`telegram.*`, `*`). Configure under `diagnostics.flags`
  or via the `OPENCLAW_DIAGNOSTICS=...` env override. Full guide:
  [Diagnostics flags](/diagnostics/flags).

- **OpenTelemetry (OTel)**: the data model + SDKs for traces, metrics, and logs.
- **OTLP**: the wire protocol used to export OTel data to a collector/backend.
- OpenClaw exports via **OTLP/HTTP (protobuf)** today.
To enable diagnostics events for plugins or custom sinks without OTLP export:

### Signals exported

- **Metrics**: counters + histograms (token usage, message flow, queueing).
- **Traces**: spans for model usage + webhook/message processing.
- **Logs**: exported over OTLP when `diagnostics.otel.logs` is enabled. Log
  volume can be high; keep `logging.level` and exporter filters in mind.

### Diagnostic event catalog

Model usage:

- `model.usage`: tokens, cost, duration, context, provider/model/channel, session ids.
  `usage` is provider/turn accounting for cost and telemetry; `context.used`
  is the current prompt/context snapshot and can be lower than provider
  `usage.total` when cached input or tool-loop calls are involved.

Message flow:

- `webhook.received`: webhook ingress per channel.
- `webhook.processed`: webhook handled + duration.
- `webhook.error`: webhook handler errors.
- `message.queued`: message enqueued for processing.
- `message.processed`: outcome + duration + optional error.
- `message.delivery.started`: outbound delivery attempt started.
- `message.delivery.completed`: outbound delivery attempt finished + duration/result count.
- `message.delivery.error`: outbound delivery attempt failed + duration/bounded error category.

Queue + session:

- `queue.lane.enqueue`: command queue lane enqueue + depth.
- `queue.lane.dequeue`: command queue lane dequeue + wait time.
- `session.state`: session state transition + reason.
- `session.stuck`: session stuck warning + age.
- `run.attempt`: run retry/attempt metadata.
- `diagnostic.heartbeat`: aggregate counters (webhooks/queue/session).

Exec:

- `exec.process.completed`: terminal exec process outcome, duration, target, mode,
  exit code, and failure kind. Command text and working directories are not
  included.

### Enable diagnostics (no exporter)

Use this if you want diagnostics events available to plugins or custom sinks:

```json
```json5
{
  "diagnostics": {
    "enabled": true
  }
  diagnostics: { enabled: true },
}
```

### Diagnostics flags (targeted logs)

Use flags to turn on extra, targeted debug logs without raising `logging.level`.
Flags are case-insensitive and support wildcards (e.g. `telegram.*` or `*`).

```json
{
  "diagnostics": {
    "flags": ["telegram.http"]
  }
}
```

Env override (one-off):

```
OPENCLAW_DIAGNOSTICS=telegram.http,telegram.payload
```

Notes:

- Flag logs go to the standard log file (same as `logging.file`).
- Output is still redacted according to `logging.redactSensitive`.
- Full guide: [/diagnostics/flags](/diagnostics/flags).

### Export to OpenTelemetry

Diagnostics can be exported via the `diagnostics-otel` plugin (OTLP/HTTP). This
works with any OpenTelemetry collector/backend that accepts OTLP/HTTP.

```json
{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": {
        "enabled": true
      }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://otel-collector:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 0.2,
      "flushIntervalMs": 60000,
      "captureContent": {
        "enabled": false,
        "inputMessages": false,
        "outputMessages": false,
        "toolInputs": false,
        "toolOutputs": false,
        "systemPrompt": false
      }
    }
  }
}
```

Notes:

- You can also enable the plugin with `openclaw plugins enable diagnostics-otel`.
- `protocol` currently supports `http/protobuf` only. `grpc` is ignored.
- Metrics include token usage, cost, context size, run duration, and message-flow
  counters/histograms (webhooks, queueing, session state, queue depth/wait),
  plus GenAI token usage and model-call duration histograms.
- Traces/metrics can be toggled with `traces` / `metrics` (default: on). Traces
  include model usage spans plus webhook/message processing spans when enabled.
- Raw model/tool content is not exported by default. Use
  `diagnostics.otel.captureContent` only when your collector and retention policy
  are approved for prompt, response, tool, or system prompt text.
- Set `headers` when your collector requires auth.
- Environment variables supported: `OTEL_EXPORTER_OTLP_ENDPOINT`,
  `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_PROTOCOL`.
- Set `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` to emit the
  latest experimental GenAI provider span attribute (`gen_ai.provider.name`)
  instead of the legacy span attribute (`gen_ai.system`). GenAI metrics always
  use bounded, low-cardinality semantic attributes.
- Set `OPENCLAW_OTEL_PRELOADED=1` when another preload or host process already
  registered the global OpenTelemetry SDK. In that mode the plugin does not start
  or shut down its own SDK, but it still wires OpenClaw diagnostic listeners and
  honors `diagnostics.otel.traces`, `metrics`, and `logs`.

### Exported metrics (names + types)

Model usage:

- `openclaw.tokens` (counter, attrs: `openclaw.token`, `openclaw.channel`,
  `openclaw.provider`, `openclaw.model`)
- `openclaw.cost.usd` (counter, attrs: `openclaw.channel`, `openclaw.provider`,
  `openclaw.model`)
- `openclaw.run.duration_ms` (histogram, attrs: `openclaw.channel`,
  `openclaw.provider`, `openclaw.model`)
- `openclaw.context.tokens` (histogram, attrs: `openclaw.context`,
  `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `gen_ai.client.token.usage` (histogram, GenAI semantic-conventions metric,
  attrs: `gen_ai.token.type` = `input`/`output`, `gen_ai.provider.name`,
  `gen_ai.operation.name`, `gen_ai.request.model`)
- `gen_ai.client.operation.duration` (histogram, seconds, GenAI
  semantic-conventions metric, attrs: `gen_ai.provider.name`,
  `gen_ai.operation.name`, `gen_ai.request.model`, optional `error.type`)

Message flow:

- `openclaw.webhook.received` (counter, attrs: `openclaw.channel`,
  `openclaw.webhook`)
- `openclaw.webhook.error` (counter, attrs: `openclaw.channel`,
  `openclaw.webhook`)
- `openclaw.webhook.duration_ms` (histogram, attrs: `openclaw.channel`,
  `openclaw.webhook`)
- `openclaw.message.queued` (counter, attrs: `openclaw.channel`,
  `openclaw.source`)
- `openclaw.message.processed` (counter, attrs: `openclaw.channel`,
  `openclaw.outcome`)
- `openclaw.message.duration_ms` (histogram, attrs: `openclaw.channel`,
  `openclaw.outcome`)
- `openclaw.message.delivery.started` (counter, attrs: `openclaw.channel`,
  `openclaw.delivery.kind`)
- `openclaw.message.delivery.duration_ms` (histogram, attrs:
  `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`,
  `openclaw.errorCategory`)

Queues + sessions:

- `openclaw.queue.lane.enqueue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.lane.dequeue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or
  `openclaw.channel=heartbeat`)
- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
- `openclaw.session.stuck` (counter, attrs: `openclaw.state`)
- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`)
- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`)

Exec:

- `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`,
  `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`)

Diagnostics internals (memory + tool loop):

- `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`)
- `openclaw.memory.rss_bytes` (histogram)
- `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`)
- `openclaw.tool.loop.iterations` (counter, attrs: `openclaw.toolName`,
  `openclaw.outcome`)
- `openclaw.tool.loop.duration_ms` (histogram, attrs: `openclaw.toolName`,
  `openclaw.outcome`)

### Exported spans (names + key attributes)

- `openclaw.model.usage`
  - `openclaw.channel`, `openclaw.provider`, `openclaw.model`
  - `openclaw.tokens.*` (input/output/cache_read/cache_write/total)
  - `gen_ai.system` by default, or `gen_ai.provider.name` when latest GenAI
    semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`, `gen_ai.usage.*`
- `openclaw.run`
  - `openclaw.outcome`, `openclaw.channel`, `openclaw.provider`,
    `openclaw.model`, `openclaw.errorCategory`
- `openclaw.model.call`
  - `gen_ai.system` by default, or `gen_ai.provider.name` when latest GenAI
    semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`,
    `openclaw.provider`, `openclaw.model`, `openclaw.api`,
    `openclaw.transport`, `openclaw.provider.request_id_hash` (bounded
    SHA-based hash of the upstream provider request id; raw ids are not
    exported)
- `openclaw.tool.execution`
  - `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.errorCategory`,
    `openclaw.tool.params.*`
- `openclaw.exec`
  - `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`,
    `openclaw.failureKind`, `openclaw.exec.command_length`,
    `openclaw.exec.exit_code`, `openclaw.exec.timed_out`
- `openclaw.webhook.processed`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`
- `openclaw.webhook.error`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`,
    `openclaw.error`
- `openclaw.message.processed`
  - `openclaw.channel`, `openclaw.outcome`, `openclaw.chatId`,
    `openclaw.messageId`, `openclaw.reason`
- `openclaw.message.delivery`
  - `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`,
    `openclaw.errorCategory`, `openclaw.delivery.result_count`
- `openclaw.session.stuck`
  - `openclaw.state`, `openclaw.ageMs`, `openclaw.queueDepth`
- `openclaw.context.assembled`
  - `openclaw.prompt.size`, `openclaw.history.size`,
    `openclaw.context.tokens`, `openclaw.errorCategory` (no prompt,
    history, response, or session-key content)
- `openclaw.tool.loop`
  - `openclaw.toolName`, `openclaw.outcome`, `openclaw.iterations`,
    `openclaw.errorCategory` (no loop messages, params, or tool output)
- `openclaw.memory.pressure`
  - `openclaw.memory.level`, `openclaw.memory.heap_used_bytes`,
    `openclaw.memory.rss_bytes`

When content capture is explicitly enabled, model/tool spans can also include
bounded, redacted `openclaw.content.*` attributes for the specific content
classes you opted into.

### Sampling + flushing

- Trace sampling: `diagnostics.otel.sampleRate` (0.0–1.0, root spans only).
- Metric export interval: `diagnostics.otel.flushIntervalMs` (min 1000ms).

### Protocol notes

- OTLP/HTTP endpoints can be set via `diagnostics.otel.endpoint` or
  `OTEL_EXPORTER_OTLP_ENDPOINT`.
- If the endpoint already contains `/v1/traces` or `/v1/metrics`, it is used as-is.
- If the endpoint already contains `/v1/logs`, it is used as-is for logs.
- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` controls only the
  GenAI span provider attribute shape. Existing dashboards that read
  `gen_ai.system` can keep the default until they migrate.
- `OPENCLAW_OTEL_PRELOADED=1` reuses an externally registered OpenTelemetry SDK
  for traces/metrics instead of starting a plugin-owned NodeSDK.
- `diagnostics.otel.logs` enables OTLP log export for the main logger output.

### Log export behavior

- OTLP logs use the same structured records written to `logging.file`.
- Respect `logging.level` (file log level). Console redaction does **not** apply
  to OTLP logs.
- High-volume installs should prefer OTLP collector sampling/filtering.
For OTLP export to a collector, see [OpenTelemetry export](/gateway/opentelemetry).

## Troubleshooting tips

@@ -483,5 +208,7 @@ classes you opted into.

## Related

- [Gateway Logging Internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture
- [Diagnostics](/gateway/configuration-reference#diagnostics) — OpenTelemetry export and cache trace config
- [OpenTelemetry export](/gateway/opentelemetry) — OTLP/HTTP export, metric/span catalog, privacy model
- [Diagnostics flags](/diagnostics/flags) — targeted debug-log flags
- [Gateway logging internals](/gateway/logging) — WS log styles, subsystem prefixes, and console capture
- [Configuration reference](/gateway/configuration-reference#diagnostics) — full `diagnostics.*` field reference