From 893f070560b36d72e823325e7832d784d22c813f Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Sun, 26 Apr 2026 02:26:01 -0700
Subject: [PATCH] docs(prometheus): rewrite with Steps quick start, Tabs for enable methods and pull-vs-push, AccordionGroup for label policy and troubleshooting; document the 2048-series cap and trusted-operator scope from the diagnostics-prometheus plugin code

---
 docs/gateway/prometheus.md | 190 ++++++++++++++++++++++++++++++-------
 1 file changed, 155 insertions(+), 35 deletions(-)

diff --git a/docs/gateway/prometheus.md b/docs/gateway/prometheus.md
index 7c408aa4b33..92a4753df66 100644
--- a/docs/gateway/prometheus.md
+++ b/docs/gateway/prometheus.md
@@ -1,47 +1,84 @@
 ---
 summary: "Expose OpenClaw diagnostics as Prometheus text metrics through the diagnostics-prometheus plugin"
 title: "Prometheus metrics"
+sidebarTitle: "Prometheus"
 read_when:
   - You want Prometheus, Grafana, VictoriaMetrics, or another scraper to collect OpenClaw Gateway metrics
   - You need the Prometheus metric names and label policy for dashboards or alerts
   - You want metrics without running an OpenTelemetry collector
 ---

-OpenClaw can expose diagnostics metrics through the bundled
-`diagnostics-prometheus` plugin. It listens to trusted internal diagnostics and
-renders a Prometheus text endpoint at:
+OpenClaw can expose diagnostics metrics through the bundled `diagnostics-prometheus` plugin. It listens to trusted internal diagnostics and renders a Prometheus text endpoint at:

 ```text
-/api/diagnostics/prometheus
+GET /api/diagnostics/prometheus
 ```

-The route uses Gateway authentication. Do not expose it as a public
-unauthenticated `/metrics` endpoint.
+Content type is `text/plain; version=0.0.4; charset=utf-8`, the standard Prometheus exposition format.
+
+
+The route uses Gateway authentication (operator scope). Do not expose it as a public unauthenticated `/metrics` endpoint. Scrape it through the same auth path you use for other operator APIs.
+
+
+For traces, logs, OTLP push, and OpenTelemetry GenAI semantic attributes, see [OpenTelemetry export](/gateway/opentelemetry).

 ## Quick start

-```json5
-{
-  plugins: {
-    allow: ["diagnostics-prometheus"],
-    entries: {
-      "diagnostics-prometheus": { enabled: true },
-    },
-  },
-  diagnostics: {
-    enabled: true,
-  },
-}
-```
+
+
+
+
+        ```json5
+        {
+          plugins: {
+            allow: ["diagnostics-prometheus"],
+            entries: {
+              "diagnostics-prometheus": { enabled: true },
+            },
+          },
+          diagnostics: {
+            enabled: true,
+          },
+        }
+        ```
+
+
+        ```bash
+        openclaw plugins enable diagnostics-prometheus
+        ```
+
+
+
+
+    The HTTP route is registered at plugin startup, so reload after enabling.
+
+
+    Send the same gateway auth your operator clients use:

-You can also enable the plugin from the CLI:
+    ```bash
+    curl -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" \
+      http://127.0.0.1:18789/api/diagnostics/prometheus
+    ```

-```bash
-openclaw plugins enable diagnostics-prometheus
-```
+
+
+    ```yaml
+    # prometheus.yml
+    scrape_configs:
+      - job_name: openclaw
+        scrape_interval: 30s
+        metrics_path: /api/diagnostics/prometheus
+        authorization:
+          credentials_file: /etc/prometheus/openclaw-gateway-token
+        static_configs:
+          - targets: ["openclaw-gateway:18789"]
+    ```
+
+

-Then scrape the protected Gateway route with the same Gateway authentication you
-use for operator APIs.
+
+`diagnostics.enabled: true` is required. Without it, the plugin still registers the HTTP route but no diagnostic events flow into the exporter, so the response is empty.
+

 ## Metrics exported

@@ -74,16 +111,99 @@ use for operator APIs.

 ## Label policy

-Prometheus labels stay bounded and low-cardinality. The exporter does not emit
-raw diagnostic identifiers such as `runId`, `sessionKey`, `sessionId`, `callId`,
-`toolCallId`, message IDs, chat IDs, or provider request IDs.
+
+
+    Prometheus labels stay bounded and low-cardinality.
The exporter does not emit raw diagnostic identifiers such as `runId`, `sessionKey`, `sessionId`, `callId`, `toolCallId`, message IDs, chat IDs, or provider request IDs.

-Label values are redacted and must match OpenClaw's low-cardinality character
-policy. Values that fail the policy are replaced with `unknown`, `other`, or
-`none`, depending on the metric.
+    Label values are redacted and must match OpenClaw's low-cardinality character policy. Values that fail the policy are replaced with `unknown`, `other`, or `none`, depending on the metric.

-The exporter caps retained time series in memory. If the cap is reached, new
-series are dropped and `openclaw_prometheus_series_dropped_total` increments.
+
+
+    The exporter caps retained time series in memory at **2048** series across counters, gauges, and histograms combined. New series beyond that cap are dropped, and `openclaw_prometheus_series_dropped_total` increments by one each time.

-For full traces, logs, OTLP export, and OpenTelemetry GenAI semantic attributes,
-use [OpenTelemetry export](/gateway/opentelemetry).
+    Watch this counter as a hard signal that an attribute upstream is leaking high-cardinality values. The exporter never lifts the cap automatically; if it climbs, fix the source rather than disabling the cap.
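
The cap behaviour described above can be pictured with a small Python sketch. This is not the plugin's code: `SeriesStore`, `MAX_SERIES`, `add`, and the `demo_events_total` / `leaky` names are hypothetical, and only the 2048 limit and the drop-and-count behaviour come from the text. It also assumes that the drop counter is kept outside the capped store and that every rejected write counts once.

```python
from typing import Dict, Tuple

MAX_SERIES = 2048  # cap stated in the docs; everything else here is illustrative

# A series is identified by its metric name plus its sorted label pairs.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class SeriesStore:
    """Hypothetical capped store: known series update, new ones past the cap are dropped."""

    def __init__(self, max_series: int = MAX_SERIES) -> None:
        self.max_series = max_series
        self.values: Dict[SeriesKey, float] = {}
        # Stands in for openclaw_prometheus_series_dropped_total.
        self.dropped_total = 0

    def add(self, name: str, labels: Dict[str, str], amount: float = 1.0) -> bool:
        key: SeriesKey = (name, tuple(sorted(labels.items())))
        if key not in self.values and len(self.values) >= self.max_series:
            # Assumption: each rejected write increments the drop counter by one.
            self.dropped_total += 1
            return False
        self.values[key] = self.values.get(key, 0.0) + amount
        return True


# Demo with a tiny cap so the effect is visible.
store = SeriesStore(max_series=3)
for value in ["a", "b", "c", "d", "e"]:
    # A label that takes a fresh value per event fills the store quickly.
    store.add("demo_events_total", {"leaky": value})
store.add("demo_events_total", {"leaky": "a"})  # an existing series still updates
```

With the cap lowered to 3 for the demo, the fourth and fifth label values are rejected and the drop counter ends at 2, while the first series keeps accumulating. A leaking label shows up the same way in the exported drop counter.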
+
+
+
+    - prompt text, response text, tool inputs, tool outputs, system prompts
+    - raw provider request IDs (only bounded hashes, where applicable, on spans — never on metrics)
+    - session keys and session IDs
+    - hostnames, file paths, secret values
+
+
+
+## PromQL recipes
+
+```promql
+# Tokens per minute, split by provider
+sum by (provider) (rate(openclaw_model_tokens_total[1m]))
+
+# Spend (USD) over the last hour, by model
+sum by (model) (increase(openclaw_model_cost_usd_total[1h]))
+
+# 95th percentile model run duration
+histogram_quantile(
+  0.95,
+  sum by (le, provider, model)
+    (rate(openclaw_run_duration_seconds_bucket[5m]))
+)
+
+# Queue wait time SLO (95p under 2s)
+histogram_quantile(
+  0.95,
+  sum by (le, lane) (rate(openclaw_queue_lane_wait_seconds_bucket[5m]))
+) < 2
+
+# Dropped Prometheus series (cardinality alarm)
+increase(openclaw_prometheus_series_dropped_total[15m]) > 0
+```
+
+
+Prefer `gen_ai_client_token_usage` for cross-provider dashboards: it follows the OpenTelemetry GenAI semantic conventions and is consistent with metrics from non-OpenClaw GenAI services.
+
+
+## Choosing between Prometheus and OpenTelemetry export
+
+OpenClaw supports both surfaces independently. You can run either, both, or neither.
+
+
+
+    - **Pull** model: Prometheus scrapes `/api/diagnostics/prometheus`.
+    - No external collector required.
+    - Authenticated through normal Gateway auth.
+    - Surface is metrics only (no traces or logs).
+    - Best for stacks already standardized on Prometheus + Grafana.
+
+
+    - **Push** model: OpenClaw sends OTLP/HTTP to a collector or OTLP-compatible backend.
+    - Surface includes metrics, traces, and logs.
+    - Bridges to Prometheus through an OpenTelemetry Collector (`prometheus` or `prometheusremotewrite` exporter) when you need both.
+    - See [OpenTelemetry export](/gateway/opentelemetry) for the full catalog.
+
+
+
+## Troubleshooting
+
+
+
+    - Check `diagnostics.enabled: true` in config.
+    - Confirm the plugin is enabled and loaded with `openclaw plugins list --enabled`.
+    - Generate some traffic; counters and histograms only emit lines after at least one event.
+
+
+    The endpoint requires the Gateway operator scope (`auth: "gateway"` with `gatewayRuntimeScopeSurface: "trusted-operator"`). Use the same token or password Prometheus uses for any other Gateway operator route. There is no public unauthenticated mode.
+
+
+    A new attribute is exceeding the **2048**-series cap. Inspect recent metrics for an unexpectedly high-cardinality label and fix it at the source. The exporter intentionally drops new series instead of silently rewriting labels.
+
+
+    The plugin keeps state in memory only. After a Gateway restart, counters reset to zero and gauges restart at their next reported value. Use PromQL `rate()` and `increase()` to handle resets cleanly.
+
+
+
+## Related
+
+- [Diagnostics export](/gateway/diagnostics) — local diagnostics zip for support bundles
+- [Health and readiness](/gateway/health) — `/healthz` and `/readyz` probes
+- [Logging](/logging) — file-based logging
+- [OpenTelemetry export](/gateway/opentelemetry) — OTLP push for traces, metrics, and logs