mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 21:10:43 +00:00
---
summary: "Export OpenClaw diagnostics to any OpenTelemetry collector via the diagnostics-otel plugin (OTLP/HTTP)"
title: "OpenTelemetry export"
read_when:
  - You want to send OpenClaw model usage, message flow, or session metrics to an OpenTelemetry collector
  - You are wiring traces, metrics, or logs into Grafana, Datadog, Honeycomb, New Relic, Tempo, or another OTLP backend
  - You need the exact metric names, span names, or attribute shapes to build dashboards or alerts
---

OpenClaw exports diagnostics through the bundled `diagnostics-otel` plugin
using **OTLP/HTTP (protobuf)**. Any collector or backend that accepts OTLP/HTTP
works without code changes. For local file logs and how to read them, see
[Logging](/logging).

## How it fits together

- **Diagnostics events** are structured, in-process records emitted by the
  Gateway and bundled plugins for model runs, message flow, sessions, queues,
  and exec.
- **`diagnostics-otel` plugin** subscribes to those events and exports them as
  OpenTelemetry **metrics**, **traces**, and **logs** over OTLP/HTTP.
- **Provider calls** receive a W3C `traceparent` header from OpenClaw's
  trusted model-call span context when the provider transport accepts custom
  headers. Plugin-emitted trace context is not propagated.
- Exporters only attach when both the diagnostics surface and the plugin are
  enabled, so the in-process cost stays near zero by default.

## Quick start

```json5
{
  plugins: {
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://otel-collector:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: true,
      sampleRate: 0.2,
      flushIntervalMs: 60000,
    },
  },
}
```

You can also enable the plugin from the CLI:

```bash
openclaw plugins enable diagnostics-otel
```

<Note>
`protocol` currently supports `http/protobuf` only. `grpc` is ignored.
</Note>

## Signals exported

| Signal      | What goes in it                                                                                                                            |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| **Metrics** | Counters and histograms for token usage, cost, run duration, message flow, queue lanes, session state, exec, and memory pressure.          |
| **Traces**  | Spans for model usage, model calls, harness lifecycle, tool execution, exec, webhook/message processing, context assembly, and tool loops. |
| **Logs**    | Structured `logging.file` records exported over OTLP when `diagnostics.otel.logs` is enabled.                                              |

Toggle `traces`, `metrics`, and `logs` independently. All three default to on
when `diagnostics.otel.enabled` is true.

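The gating and defaulting described above can be sketched as a small function. This is an illustration only, not the plugin's code: the `OtelConfig` type and the `activeSignals` helper are hypothetical names, and the sketch assumes that an omitted toggle counts as on.

```typescript
// Hypothetical config shape mirroring the documented `diagnostics.otel` keys.
interface OtelConfig {
  enabled?: boolean;
  traces?: boolean;
  metrics?: boolean;
  logs?: boolean;
}

type Signal = "traces" | "metrics" | "logs";

// Returns the signals that would be exported.
// Assumptions: an omitted toggle counts as "on", and nothing is exported
// unless diagnostics, the plugin, and `otel.enabled` are all switched on.
function activeSignals(
  diagnosticsEnabled: boolean,
  pluginEnabled: boolean,
  otel: OtelConfig,
): Signal[] {
  if (!diagnosticsEnabled || !pluginEnabled || otel.enabled !== true) {
    return [];
  }
  const all: Signal[] = ["traces", "metrics", "logs"];
  // Only an explicit `false` switches a signal off.
  return all.filter((signal) => otel[signal] !== false);
}
```

Under these assumptions, `activeSignals(true, true, { enabled: true, logs: false })` yields traces and metrics only, and disabling the plugin yields an empty list regardless of the toggles.
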
## Configuration reference

```json5
{
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://otel-collector:4318",
      tracesEndpoint: "http://otel-collector:4318/v1/traces",
      metricsEndpoint: "http://otel-collector:4318/v1/metrics",
      logsEndpoint: "http://otel-collector:4318/v1/logs",
      protocol: "http/protobuf", // grpc is ignored
      serviceName: "openclaw-gateway",
      headers: { "x-collector-token": "..." },
      traces: true,
      metrics: true,
      logs: true,
      sampleRate: 0.2, // root-span sampler, 0.0..1.0
      flushIntervalMs: 60000, // metric export interval (min 1000ms)
      captureContent: {
        enabled: false,
        inputMessages: false,
        outputMessages: false,
        toolInputs: false,
        toolOutputs: false,
        systemPrompt: false,
      },
    },
  },
}
```

### Environment variables

| Variable                                                                                                          | Purpose                                                                                                                                                                                                                                    |
| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `OTEL_EXPORTER_OTLP_ENDPOINT`                                                                                     | Override `diagnostics.otel.endpoint`. If the value already contains `/v1/traces`, `/v1/metrics`, or `/v1/logs`, it is used as-is.                                                                                                          |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` / `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` / `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | Signal-specific endpoint overrides used when the matching `diagnostics.otel.*Endpoint` config key is unset. Signal-specific config wins over signal-specific env, which wins over the shared endpoint.                                     |
| `OTEL_SERVICE_NAME`                                                                                               | Override `diagnostics.otel.serviceName`.                                                                                                                                                                                                   |
| `OTEL_EXPORTER_OTLP_PROTOCOL`                                                                                     | Override the wire protocol (only `http/protobuf` is honored today).                                                                                                                                                                        |
| `OTEL_SEMCONV_STABILITY_OPT_IN`                                                                                   | Set to `gen_ai_latest_experimental` to emit the latest experimental GenAI span attribute (`gen_ai.provider.name`) instead of the legacy `gen_ai.system`. GenAI metrics always use bounded, low-cardinality semantic attributes regardless. |
| `OPENCLAW_OTEL_PRELOADED`                                                                                         | Set to `1` when another preload or host process already registered the global OpenTelemetry SDK. The plugin then skips its own NodeSDK lifecycle but still wires diagnostic listeners and honors `traces`/`metrics`/`logs`.                |

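The endpoint precedence in the table can be read as the following sketch. It is not the plugin's resolver: `EndpointConfig` and `resolveSignalEndpoint` are hypothetical names, and appending `/v1/<signal>` to a shared base endpoint is an assumption based on the usual OTLP/HTTP convention.

```typescript
type Signal = "traces" | "metrics" | "logs";
type Env = Record<string, string | undefined>;

// Hypothetical config shape using the documented `diagnostics.otel` keys.
interface EndpointConfig {
  endpoint?: string;
  tracesEndpoint?: string;
  metricsEndpoint?: string;
  logsEndpoint?: string;
}

const SIGNAL_PATHS: Record<Signal, string> = {
  traces: "/v1/traces",
  metrics: "/v1/metrics",
  logs: "/v1/logs",
};

const ENV_KEYS: Record<Signal, string> = {
  traces: "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
  metrics: "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT",
  logs: "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT",
};

function signalConfigEndpoint(signal: Signal, config: EndpointConfig): string | undefined {
  if (signal === "traces") return config.tracesEndpoint;
  if (signal === "metrics") return config.metricsEndpoint;
  return config.logsEndpoint;
}

function resolveSignalEndpoint(signal: Signal, config: EndpointConfig, env: Env): string | undefined {
  // 1. Signal-specific config wins.
  const fromConfig = signalConfigEndpoint(signal, config);
  if (fromConfig) return fromConfig;
  // 2. Then the signal-specific environment variable.
  const fromEnv = env[ENV_KEYS[signal]];
  if (fromEnv) return fromEnv;
  // 3. Then the shared endpoint; the env var overrides the config key.
  const shared = env["OTEL_EXPORTER_OTLP_ENDPOINT"] || config.endpoint;
  if (!shared) return undefined;
  // A value that already contains a signal path is used as-is.
  const hasSignalPath = [SIGNAL_PATHS.traces, SIGNAL_PATHS.metrics, SIGNAL_PATHS.logs].some(
    (path) => shared.indexOf(path) !== -1,
  );
  if (hasSignalPath) return shared;
  // Assumption: otherwise the per-signal path is appended to the base URL.
  return shared.replace(/\/+$/, "") + SIGNAL_PATHS[signal];
}
```

Passing the environment as a plain object (for example `process.env` in Node.js) keeps the function easy to test without touching global state.
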
## Privacy and content capture

Raw model/tool content is **not** exported by default. Spans carry bounded
identifiers (channel, provider, model, error category, hash-only request ids)
and never include prompt text, response text, tool inputs, tool outputs, or
session keys.

Outbound model requests may include a W3C `traceparent` header. That header is
generated only from OpenClaw-owned diagnostic trace context for the active model
call. Existing caller-supplied `traceparent` headers are replaced, so plugins or
custom provider options cannot spoof cross-service trace ancestry.

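For readers unfamiliar with the header, the sketch below shows the W3C Trace Context `traceparent` layout and one way a caller-supplied value could be replaced. Both helper names are hypothetical and this is not OpenClaw's transport code; only the header format itself comes from the W3C specification.

```typescript
// Builds a W3C `traceparent` value: version "00", a 32-hex-digit trace id,
// a 16-hex-digit parent span id, and 2 hex digits of flags (01 = sampled).
function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) {
    throw new Error("trace id must be 32 lowercase hex digits and not all zeros");
  }
  if (!/^[0-9a-f]{16}$/.test(spanId) || /^0+$/.test(spanId)) {
    throw new Error("span id must be 16 lowercase hex digits and not all zeros");
  }
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Hypothetical helper: drops any caller-supplied `traceparent` (header names
// are case-insensitive) and sets the trusted value instead.
function withTrustedTraceparent(
  headers: Record<string, string>,
  traceparent: string,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    if (name.toLowerCase() !== "traceparent") {
      out[name] = headers[name];
    }
  }
  out["traceparent"] = traceparent;
  return out;
}
```

Replacing rather than merging is what prevents a spoofed parent: whatever the caller sent under any casing of `traceparent` never reaches the provider.
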
Set `diagnostics.otel.captureContent.*` to `true` only when your collector and
retention policy are approved for prompt, response, tool, or system-prompt
text. Each subkey is opt-in independently:

- `inputMessages`: user prompt content.
- `outputMessages`: model response content.
- `toolInputs`: tool argument payloads.
- `toolOutputs`: tool result payloads.
- `systemPrompt`: assembled system/developer prompt.

When any subkey is enabled, model and tool spans get bounded, redacted
`openclaw.content.*` attributes for that class only.

## Sampling and flushing

- **Traces:** `diagnostics.otel.sampleRate` (root-span only, `0.0` drops all,
  `1.0` keeps all).
- **Metrics:** `diagnostics.otel.flushIntervalMs` (minimum `1000`).
- **Logs:** OTLP logs respect `logging.level` (file log level). They use the
  diagnostic log-record redaction path, not console formatting. High-volume
  installs should prefer OTLP collector sampling/filtering over local sampling.
- **File-log correlation:** JSONL file logs include top-level `traceId`,
  `spanId`, `parentSpanId`, and `traceFlags` when the log call carries a valid
  diagnostic trace context, which lets log processors join local log lines with
  exported spans.

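To make "root-span only" concrete, here is a simplified ratio sampler. It is a sketch under stated assumptions, not the OpenTelemetry SDK algorithm or the plugin's sampler: the function names are hypothetical, the decision is derived from the last 8 hex digits of the trace id, and child spans are assumed to inherit the decision made for their root.

```typescript
// Decides whether a new root span is kept, deterministically per trace id.
function sampleRootSpan(traceId: string, sampleRate: number): boolean {
  if (sampleRate <= 0) return false; // 0.0 drops all
  if (sampleRate >= 1) return true; // 1.0 keeps all
  // Map the last 8 hex digits to an integer in [0, 2^32) and keep the
  // trace when it falls below the configured fraction of that range.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < sampleRate * 0x100000000;
}

// Assumption: the rate only applies when there is no parent; a child span
// follows whatever was decided for its parent.
function shouldRecordSpan(traceId: string, sampleRate: number, parentSampled?: boolean): boolean {
  return parentSampled ?? sampleRootSpan(traceId, sampleRate);
}
```

Because the decision depends only on the trace id, every span in a trace gets the same answer, so traces are kept or dropped whole rather than partially.
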
## Exported metrics

### Model usage

- `openclaw.tokens` (counter, attrs: `openclaw.token`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`, `openclaw.agent`)
- `openclaw.cost.usd` (counter, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `openclaw.run.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `openclaw.context.tokens` (histogram, attrs: `openclaw.context`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`)
- `gen_ai.client.token.usage` (histogram, GenAI semantic-conventions metric, attrs: `gen_ai.token.type` = `input`/`output`, `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`)
- `gen_ai.client.operation.duration` (histogram, seconds, GenAI semantic-conventions metric, attrs: `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`, optional `error.type`)
- `openclaw.model_call.duration_ms` (histogram, attrs: `openclaw.provider`, `openclaw.model`, `openclaw.api`, `openclaw.transport`)
- `openclaw.model_call.request_bytes` (histogram, UTF-8 byte size of the final model request payload; no raw payload content)
- `openclaw.model_call.response_bytes` (histogram, UTF-8 byte size of streamed model response events; no raw response content)
- `openclaw.model_call.time_to_first_byte_ms` (histogram, elapsed time before the first streamed response event)

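When building a dashboard from these attributes, a typical first step is grouping `openclaw.tokens` by model and token type. The sketch below does that over a simplified, hypothetical data-point shape (`CounterPoint` is not an OpenClaw or OpenTelemetry type), and the model name and token-type values in the usage note are made up for illustration.

```typescript
// Hypothetical, simplified shape of one exported counter data point.
interface CounterPoint {
  name: string;
  value: number;
  attributes: Record<string, string | undefined>;
}

// Sums `openclaw.tokens` per (model, token type) pair. Token types are kept
// separate rather than added together, since the set of `openclaw.token`
// values is not assumed here.
function tokensByModelAndType(points: CounterPoint[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const point of points) {
    if (point.name !== "openclaw.tokens") continue;
    const model = point.attributes["openclaw.model"] ?? "unknown";
    const tokenType = point.attributes["openclaw.token"] ?? "unknown";
    const key = `${model}/${tokenType}`;
    totals.set(key, (totals.get(key) ?? 0) + point.value);
  }
  return totals;
}
```

For example, two points for model `example-model` with token type `input` and values 10 and 5 produce a single `example-model/input` entry of 15, while points for other metrics are ignored.
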
### Message flow

- `openclaw.webhook.received` (counter, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.webhook.error` (counter, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.webhook.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.webhook`)
- `openclaw.message.queued` (counter, attrs: `openclaw.channel`, `openclaw.source`)
- `openclaw.message.processed` (counter, attrs: `openclaw.channel`, `openclaw.outcome`)
- `openclaw.message.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.outcome`)
- `openclaw.message.delivery.started` (counter, attrs: `openclaw.channel`, `openclaw.delivery.kind`)
- `openclaw.message.delivery.duration_ms` (histogram, attrs: `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`)

### Queues and sessions

- `openclaw.queue.lane.enqueue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.lane.dequeue` (counter, attrs: `openclaw.lane`)
- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`)
- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
- `openclaw.session.stuck` (counter, attrs: `openclaw.state`)
- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`)
- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`)

### Harness lifecycle

- `openclaw.harness.duration_ms` (histogram, attrs: `openclaw.harness.id`, `openclaw.harness.plugin`, `openclaw.outcome`, `openclaw.harness.phase` on errors)

### Exec

- `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`)

### Diagnostics internals (memory and tool loop)

- `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`)
- `openclaw.memory.rss_bytes` (histogram)
- `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`)
- `openclaw.tool.loop.iterations` (counter, attrs: `openclaw.toolName`, `openclaw.outcome`)
- `openclaw.tool.loop.duration_ms` (histogram, attrs: `openclaw.toolName`, `openclaw.outcome`)

## Exported spans

- `openclaw.model.usage`
  - `openclaw.channel`, `openclaw.provider`, `openclaw.model`
  - `openclaw.tokens.*` (input/output/cache_read/cache_write/total)
  - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`, `gen_ai.usage.*`
- `openclaw.run`
  - `openclaw.outcome`, `openclaw.channel`, `openclaw.provider`, `openclaw.model`, `openclaw.errorCategory`
- `openclaw.model.call`
  - `gen_ai.system` by default, or `gen_ai.provider.name` when the latest GenAI semantic conventions are opted in
  - `gen_ai.request.model`, `gen_ai.operation.name`, `openclaw.provider`, `openclaw.model`, `openclaw.api`, `openclaw.transport`
  - `openclaw.model_call.request_bytes`, `openclaw.model_call.response_bytes`, `openclaw.model_call.time_to_first_byte_ms`
  - `openclaw.provider.request_id_hash` (bounded SHA-based hash of the upstream provider request id; raw ids are not exported)
- `openclaw.harness.run`
  - `openclaw.harness.id`, `openclaw.harness.plugin`, `openclaw.outcome`, `openclaw.provider`, `openclaw.model`, `openclaw.channel`
  - On completion: `openclaw.harness.result_classification`, `openclaw.harness.yield_detected`, `openclaw.harness.items.started`, `openclaw.harness.items.completed`, `openclaw.harness.items.active`
  - On error: `openclaw.harness.phase`, `openclaw.errorCategory`, optional `openclaw.harness.cleanup_failed`
- `openclaw.tool.execution`
  - `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.errorCategory`, `openclaw.tool.params.*`
- `openclaw.exec`
  - `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`, `openclaw.exec.command_length`, `openclaw.exec.exit_code`, `openclaw.exec.timed_out`
- `openclaw.webhook.processed`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`
- `openclaw.webhook.error`
  - `openclaw.channel`, `openclaw.webhook`, `openclaw.chatId`, `openclaw.error`
- `openclaw.message.processed`
  - `openclaw.channel`, `openclaw.outcome`, `openclaw.chatId`, `openclaw.messageId`, `openclaw.reason`
- `openclaw.message.delivery`
  - `openclaw.channel`, `openclaw.delivery.kind`, `openclaw.outcome`, `openclaw.errorCategory`, `openclaw.delivery.result_count`
- `openclaw.session.stuck`
  - `openclaw.state`, `openclaw.ageMs`, `openclaw.queueDepth`
- `openclaw.context.assembled`
  - `openclaw.prompt.size`, `openclaw.history.size`, `openclaw.context.tokens`, `openclaw.errorCategory` (no prompt, history, response, or session-key content)
- `openclaw.tool.loop`
  - `openclaw.toolName`, `openclaw.outcome`, `openclaw.iterations`, `openclaw.errorCategory` (no loop messages, params, or tool output)
- `openclaw.memory.pressure`
  - `openclaw.memory.level`, `openclaw.memory.heap_used_bytes`, `openclaw.memory.rss_bytes`

When content capture is explicitly enabled, model and tool spans can also
include bounded, redacted `openclaw.content.*` attributes for the specific
content classes you opted into.

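Dashboards that filter model spans by provider need to know which attribute key is in use. The helper below is a hypothetical sketch of that choice; it assumes `OTEL_SEMCONV_STABILITY_OPT_IN` is read as a comma-separated list, as OpenTelemetry semantic-convention opt-ins generally are, which may not match the plugin's exact parsing.

```typescript
type ProviderAttributeKey = "gen_ai.system" | "gen_ai.provider.name";

// Returns the span attribute key that carries the provider name.
// `optIn` is the raw value of OTEL_SEMCONV_STABILITY_OPT_IN, if set.
function providerAttributeKey(optIn: string | undefined): ProviderAttributeKey {
  const values = (optIn ?? "").split(",").map((value) => value.trim());
  const latest = values.some((value) => value === "gen_ai_latest_experimental");
  return latest ? "gen_ai.provider.name" : "gen_ai.system";
}
```

In Node.js this could be called as `providerAttributeKey(process.env.OTEL_SEMCONV_STABILITY_OPT_IN)` when generating queries, so the same dashboard definition works with either convention.
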
## Diagnostic event catalog

The events below back the metrics and spans above. Plugins can also subscribe
to them directly without OTLP export.

**Model usage**

- `model.usage`: tokens, cost, duration, context, provider/model/channel,
  session ids. `usage` is provider/turn accounting for cost and telemetry;
  `context.used` is the current prompt/context snapshot and can be lower than
  provider `usage.total` when cached input or tool-loop calls are involved.

**Message flow**

- `webhook.received` / `webhook.processed` / `webhook.error`
- `message.queued` / `message.processed`
- `message.delivery.started` / `message.delivery.completed` / `message.delivery.error`

**Queue and session**

- `queue.lane.enqueue` / `queue.lane.dequeue`
- `session.state` / `session.stuck`
- `run.attempt`
- `diagnostic.heartbeat` (aggregate counters: webhooks/queue/session)

**Harness lifecycle**

- `harness.run.started` / `harness.run.completed` / `harness.run.error`:
  per-run lifecycle for the agent harness. Includes `harnessId`, optional
  `pluginId`, provider/model/channel, and run id. Completion adds
  `durationMs`, `outcome`, optional `resultClassification`, `yieldDetected`,
  and `itemLifecycle` counts. Errors add `phase`
  (`prepare`/`start`/`send`/`resolve`/`cleanup`), `errorCategory`, and
  optional `cleanupFailed`.

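A consumer of the harness events might model them as a discriminated union. The types below are hypothetical and only cover a subset of the fields listed above; in particular, the `type` discriminator and the field types are assumptions, and the real payloads may differ.

```typescript
type HarnessPhase = "prepare" | "start" | "send" | "resolve" | "cleanup";

// Hypothetical event shapes built from the documented field names.
interface HarnessBase {
  harnessId: string;
  pluginId?: string;
}
interface HarnessStarted extends HarnessBase {
  type: "harness.run.started";
}
interface HarnessCompleted extends HarnessBase {
  type: "harness.run.completed";
  durationMs: number;
  outcome: string;
}
interface HarnessFailed extends HarnessBase {
  type: "harness.run.error";
  phase: HarnessPhase;
  errorCategory: string;
  cleanupFailed?: boolean;
}
type HarnessEvent = HarnessStarted | HarnessCompleted | HarnessFailed;

// Produces a one-line summary; the switch narrows each event to the fields
// that exist for that stage of the lifecycle.
function describeHarnessEvent(event: HarnessEvent): string {
  switch (event.type) {
    case "harness.run.started":
      return `${event.harnessId} started`;
    case "harness.run.completed":
      return `${event.harnessId} finished in ${event.durationMs} ms (${event.outcome})`;
    case "harness.run.error": {
      const cleanup = event.cleanupFailed ? ", cleanup failed" : "";
      return `${event.harnessId} failed during ${event.phase} (${event.errorCategory}${cleanup})`;
    }
  }
}
```

Keeping `phase` and `errorCategory` on the error variant only mirrors the catalog: those fields are added on errors, just as `durationMs` and `outcome` are added on completion.
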
**Exec**

- `exec.process.completed`: terminal outcome, duration, target, mode, exit
  code, and failure kind. Command text and working directories are not
  included.

## Without an exporter

You can keep diagnostics events available to plugins or custom sinks without
running `diagnostics-otel`:

```json5
{
  diagnostics: { enabled: true },
}
```

For targeted debug output without raising `logging.level`, use diagnostics
flags. Flags are case-insensitive and support wildcards (e.g. `telegram.*` or
`*`):

```json5
{
  diagnostics: { flags: ["telegram.http"] },
}
```

Or as a one-off env override:

```bash
OPENCLAW_DIAGNOSTICS=telegram.http,telegram.payload openclaw gateway
```

Flag output goes to the standard log file (`logging.file`) and is still
redacted by `logging.redactSensitive`. Full guide:
[Diagnostics flags](/diagnostics/flags).

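The flag matching rules above (case-insensitive, with wildcards) can be illustrated with a short sketch. Both functions are hypothetical, and the wildcard semantics are an assumption: `*` is treated as "any run of characters", which may be broader or narrower than what OpenClaw implements.

```typescript
// Splits an OPENCLAW_DIAGNOSTICS-style value into individual flag patterns.
function parseFlags(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((flag) => flag.trim())
    .filter((flag) => flag.length > 0);
}

// Returns true when `flag` matches any configured pattern.
// Assumption: `*` matches any run of characters; everything else is literal.
function flagEnabled(patterns: string[], flag: string): boolean {
  const target = flag.toLowerCase();
  return patterns.some((pattern) => {
    const escaped = pattern
      .toLowerCase()
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except `*`
      .replace(/\*/g, ".*"); // expand the wildcard
    return new RegExp(`^${escaped}$`).test(target);
  });
}
```

With these rules, `telegram.*` enables both `telegram.http` and `Telegram.Payload`, while a plain `telegram.http` pattern leaves `telegram.payload` off.
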
## Disable

```json5
{
  diagnostics: { otel: { enabled: false } },
}
```

You can also leave `diagnostics-otel` out of `plugins.allow`, or run
`openclaw plugins disable diagnostics-otel`.

## Related

- [Logging](/logging): file logs, console output, CLI tailing, and the Control UI Logs tab
- [Gateway logging internals](/gateway/logging): WS log styles, subsystem prefixes, and console capture
- [Diagnostics flags](/diagnostics/flags): targeted debug-log flags
- [Diagnostics export](/gateway/diagnostics): operator support-bundle tool (separate from OTEL export)
- [Configuration reference](/gateway/configuration-reference#diagnostics): full `diagnostics.*` field reference