docs(logging): define diagnostics as telemetry surface

2026-05-06 16:10:49 +00:00 · 2026-03-09 13:28:09 -07:00
parent 2b2e5e2038
commit 16fda2d1ce
1 changed files with 53 additions and 0 deletions
--- a/docs/logging.md
+++ b/docs/logging.md
@@ -148,6 +148,37 @@ replace logs; they exist to feed metrics, traces, and other exporters.
 Diagnostics events are emitted in-process, but exporters only attach when
 diagnostics + the exporter plugin are enabled.

+### Telemetry surface ownership
+
+OpenClaw has separate surfaces for automation, runtime control, and telemetry:
+
+| If you want to...                                                                      | Use...                                  | Why                                                                |
+| -------------------------------------------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------ |
+| Export metrics, traces, or machine-readable health signals                             | Diagnostic events                       | Observability should be append-only telemetry, not a behavior hook |
+| Rewrite prompts, block tools, cancel outbound messages, or add policy/middleware       | Typed plugin hooks via `api.on(...)`    | Runtime hooks can mutate or block behavior                         |
+| Trigger coarse operator automation such as file writes, notifications, or side effects | HOOK.md hooks / `api.registerHook(...)` | Internal hooks are for operator automation, not telemetry schemas  |
+
+Future OTEL work should extend `src/infra/diagnostic-events.ts`, then map those
+events in the `diagnostics-otel` plugin. Do not add telemetry-only proposals by
+growing the hook APIs.
+
+### What diagnostic events are for
+
+Diagnostic events are the observability contract between the gateway runtime and
+telemetry consumers such as the `diagnostics-otel` plugin.
+
+Diagnostic events should be:
+
+- append-only signals for exporters, dashboards, alerts, and troubleshooting
+- safe to ignore without affecting runtime behavior
+- stable enough that exporters can map them into metrics, traces, or logs
+
+Diagnostic events should not be used for:
+
+- blocking, vetoing, or rewriting runtime behavior
+- policy enforcement or middleware ordering
+- side-effect automation that must run for the system to behave correctly
+
 ### OpenTelemetry vs OTLP

 - **OpenTelemetry (OTel)**: the data model + SDKs for traces, metrics, and logs.
@@ -184,6 +215,28 @@ Queue + session:
 - `run.attempt`: run retry/attempt metadata.
 - `diagnostic.heartbeat`: aggregate counters (webhooks/queue/session).

+Tool safety:
+
+- `tool.loop`: repeated-tool-loop warning/block telemetry emitted by the runtime.
+
+### What is still missing
+
+The current event catalog is useful, but still coarse in a few places. New
+observability work should generally extend `src/infra/diagnostic-events.ts`
+instead of asking hooks to carry telemetry-only meaning.
+
+Priority gaps for future telemetry work:
+
+- Run lifecycle: explicit run start, run end, and run error boundaries.
+- Model lifecycle: request/response/error boundaries in addition to aggregate
+  `model.usage`.
+- Tool lifecycle: tool call start/end/error boundaries, plus first-class exporter
+  mapping for existing `tool.loop` events.
+- Outbound delivery lifecycle: delivery attempted/sent/failed boundaries across
+  channels, separate from message processing.
+- Attribute hygiene: clearer redaction and cardinality guidance for exporter-safe
+  fields.
+
 ### Enable diagnostics (no exporter)

 Use this if you want diagnostics events available to plugins or custom sinks: