diff --git a/CHANGELOG.md b/CHANGELOG.md index 0aca7510e90..667c88e56f7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,7 @@ Docs: https://docs.openclaw.ai - Providers/Ollama: honor `/api/show` capabilities when registering local models so non-tool Ollama models no longer receive the agent tool surface, and keep native Ollama thinking opt-in instead of enabling it by default. Fixes #64710 and duplicate #65343. Thanks @yuan-b, @netherby, @xilopaint, and @Diyforfun2026. - Providers/Ollama: expose native Ollama thinking effort levels so `/think max` is accepted for reasoning-capable Ollama models and maps to Ollama's highest supported `think` effort. Fixes #71584. Thanks @g0st1n. - Agents/Ollama: validate explicit `--thinking max` against catalog-discovered Ollama reasoning metadata so local agent runs accept the same native thinking levels shown in the model catalog. Fixes #71584. Thanks @g0st1n. +- Docker/QA: add observability coverage to the normal Docker aggregate so QA-lab OTEL and Prometheus diagnostics run inside Docker. Thanks @vincentkoc. - Auto-reply: poison inbound message dedupe after replay-unsafe provider/runtime failures so retries stay safe before visible progress but cannot duplicate messages after block output, tool side effects, or session progress. Fixes #69303; keeps #58549 and #64606 as duplicate validation. Thanks @martingarramon, @NikolaFC, and @zeroth-blip. - Agents/model fallback: jump directly to a known later live-session model redirect instead of walking unrelated fallback candidates, while preserving the already-landed live-session/fallback loop guard. Fixes #57471; related loop family already closed via #58496. Thanks @yuxiaoyang2007-prog. - Gateway/Bonjour: keep @homebridge/ciao cancellation handlers registered across advertiser restarts so late probing cancellations cannot crash Linux and other mDNS-churned gateways. Thanks @codex. 
diff --git a/docs/concepts/qa-e2e-automation.md b/docs/concepts/qa-e2e-automation.md index d56e546b6f1..68e35f189f3 100644 --- a/docs/concepts/qa-e2e-automation.md +++ b/docs/concepts/qa-e2e-automation.md @@ -65,6 +65,14 @@ model calls must not export `StreamAbandoned` on successful turns; raw diagnosti `openclaw.content.*` attributes must stay out of the trace. It writes `otel-smoke-summary.json` next to the QA suite artifacts. +The normal Docker aggregate also runs an observability lane. It builds or +reuses a source-backed Docker observability image, runs the OTEL trace smoke +inside the container, then runs the `docker-prometheus-smoke` QA scenario with the +`diagnostics-prometheus` plugin enabled. Set +`OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=` to repeat both checks inside one +Docker run while preserving per-loop artifacts under +`.artifacts/docker-observability/...`. + For a transport-real Matrix smoke lane, run: ```bash diff --git a/docs/help/testing.md b/docs/help/testing.md index 0206aa1f901..7da6a728b1f 100644 --- a/docs/help/testing.md +++ b/docs/help/testing.md @@ -617,6 +617,7 @@ The live-model Docker runners also bind-mount only the needed CLI auth homes (or - CLI backend smoke: `pnpm test:docker:live-cli-backend` (script: `scripts/test-live-cli-backend-docker.sh`) - Codex app-server harness smoke: `pnpm test:docker:live-codex-harness` (script: `scripts/test-live-codex-harness-docker.sh`) - Gateway + dev agent: `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`) +- Docker observability smoke: included in `pnpm test:docker:all` and `pnpm test:docker:local:all` (script: `scripts/e2e/docker-observability-smoke.sh`). It runs QA-lab OTEL and Prometheus diagnostics checks inside a source-backed Docker image. Set `OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=` to repeat both checks in one container run. 
- Open WebUI live smoke: `pnpm test:docker:openwebui` (script: `scripts/e2e/openwebui-docker.sh`) - Onboarding wizard (TTY, full scaffolding): `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`) - Npm tarball onboarding/channel/agent smoke: `pnpm test:docker:npm-onboard-channel-agent` installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, verifies doctor repairs activated plugin runtime deps, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with `OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz`, skip the host rebuild with `OPENCLAW_NPM_ONBOARD_HOST_BUILD=0`, or switch channel with `OPENCLAW_NPM_ONBOARD_CHANNEL=discord`. diff --git a/qa/scenarios/runtime/docker-prometheus-smoke.md b/qa/scenarios/runtime/docker-prometheus-smoke.md new file mode 100644 index 00000000000..a91965e6ed9 --- /dev/null +++ b/qa/scenarios/runtime/docker-prometheus-smoke.md @@ -0,0 +1,156 @@ +# Docker Prometheus smoke + +```yaml qa-scenario +id: docker-prometheus-smoke +title: Docker Prometheus smoke +surface: telemetry +coverage: + primary: + - telemetry.prometheus + secondary: + - harness.qa-lab + - docker.e2e +objective: Verify a QA-lab gateway run emits protected, bounded Prometheus diagnostics metrics through the diagnostics-prometheus plugin. +successCriteria: + - The diagnostics-prometheus plugin exposes the protected scrape route. + - An unauthenticated scrape is rejected. + - A minimal QA-channel agent turn completes. + - The authenticated scrape includes release-critical diagnostics metric families. + - Prometheus output omits prompt content, session keys, auth tokens, raw ids, and file paths. 
+plugins: + - diagnostics-prometheus +gatewayConfigPatch: + diagnostics: + enabled: true +docsRefs: + - docs/gateway/prometheus.md + - docs/concepts/qa-e2e-automation.md +codeRefs: + - extensions/diagnostics-prometheus/src/service.ts + - src/diagnostics/internal-diagnostics.ts + - extensions/qa-lab/src/suite.ts +execution: + kind: flow + summary: Complete a minimal QA-lab turn and scrape the protected Prometheus route. + config: + prompt: Reply exactly DOCKER-PROMETHEUS-OK. Do not repeat DOCKER-PROMETHEUS-SECRET. + secretNeedle: DOCKER-PROMETHEUS-SECRET +``` + +```yaml qa-flow +steps: + - name: emits protected low-cardinality prometheus metrics + actions: + - call: waitForGatewayHealthy + args: + - ref: env + - 60000 + - call: waitForQaChannelReady + args: + - ref: env + - 60000 + - call: reset + - set: startCursor + value: + expr: state.getSnapshot().messages.length + - call: runAgentPrompt + args: + - ref: env + - sessionKey: agent:qa:docker-prometheus-smoke + message: + expr: config.prompt + timeoutMs: + expr: liveTurnTimeoutMs(env, 30000) + - call: waitForCondition + saveAs: outbound + args: + - lambda: + expr: "state.getSnapshot().messages.slice(startCursor).filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && String(candidate.text ?? '').trim().length > 0).at(-1)" + - expr: liveTurnTimeoutMs(env, 30000) + - expr: "env.providerMode === 'mock-openai' ? 100 : 250" + - assert: + expr: "String(outbound.text ?? '').trim().length > 0" + message: "expected non-empty qa output before scraping metrics" + - set: prometheusUrl + value: + expr: "`${env.gateway.baseUrl}/api/diagnostics/prometheus`" + - set: gatewayToken + value: + expr: "String(env.gateway.token ?? env.gateway.runtimeEnv.OPENCLAW_GATEWAY_TOKEN ?? 
'')" + - assert: + expr: "gatewayToken.length > 0" + message: "expected QA gateway token to be available for protected scrape" + - set: unauthenticatedScrape + value: + expr: |- + (async () => { + const response = await fetch(prometheusUrl); + await response.text().catch(() => ""); + return { status: response.status }; + })() + - assert: + expr: "unauthenticatedScrape.status === 401 || unauthenticatedScrape.status === 403" + message: + expr: "`expected unauthenticated prometheus scrape to be rejected, got ${unauthenticatedScrape.status}`" + - set: authenticatedScrape + value: + expr: |- + (async () => { + const response = await fetch(prometheusUrl, { + headers: { authorization: `Bearer ${gatewayToken}` }, + }); + const text = await response.text(); + return { + status: response.status, + contentType: response.headers.get("content-type") ?? "", + text, + }; + })() + - assert: + expr: "authenticatedScrape.status === 200" + message: + expr: "`expected authenticated prometheus scrape to return 200, got ${authenticatedScrape.status}`" + - assert: + expr: "authenticatedScrape.contentType.includes('text/plain')" + message: + expr: "`expected prometheus text content type, got ${authenticatedScrape.contentType}`" + - set: prometheusText + value: + expr: "String(authenticatedScrape.text ?? 
'')" + - assert: + expr: "prometheusText.includes('# TYPE openclaw_run_completed_total counter')" + message: "missing run completion counter" + - assert: + expr: "prometheusText.includes('# TYPE openclaw_run_duration_seconds histogram')" + message: "missing run duration histogram" + - assert: + expr: "prometheusText.includes('# TYPE openclaw_model_call_total counter')" + message: "missing model call counter" + - assert: + expr: "prometheusText.includes('# TYPE openclaw_harness_run_total counter')" + message: "missing harness run counter" + - assert: + expr: "!prometheusText.includes(config.secretNeedle)" + message: "prometheus output leaked prompt sentinel" + - assert: + expr: "!prometheusText.includes('DOCKER-PROMETHEUS-OK')" + message: "prometheus output leaked response content" + - assert: + expr: "!prometheusText.includes('agent:qa:docker-prometheus-smoke')" + message: "prometheus output leaked the session key" + - assert: + expr: "!prometheusText.includes(gatewayToken)" + message: "prometheus output leaked the gateway token" + - assert: + expr: "!/runId|sessionId|sessionKey|callId|toolCallId|messageId|providerRequestId/.test(prometheusText)" + message: "prometheus output leaked raw diagnostic identifiers" + - assert: + expr: "!/\\/tmp\\/|\\/private\\/tmp\\/|\\/app\\//.test(prometheusText)" + message: "prometheus output leaked a local file path" + - assert: + expr: "!prometheusText.includes('openclaw.content.')" + message: "prometheus output leaked content attributes" + - assert: + expr: "!/openclaw_prometheus_series_dropped_total(?:\\{[^}]*\\})?\\s+(?!0(?:\\.0+)?(?:\\s|$))/.test(prometheusText)" + message: "prometheus dropped series during the smoke" +``` diff --git a/scripts/e2e/Dockerfile.observability b/scripts/e2e/Dockerfile.observability new file mode 100644 index 00000000000..55ada3f2f22 --- /dev/null +++ b/scripts/e2e/Dockerfile.observability @@ -0,0 +1,55 @@ +# syntax=docker/dockerfile:1.7 + +FROM 
node:24-bookworm-slim@sha256:e8e2e91b1378f83c5b2dd15f0247f34110e2fe895f6ca7719dbb780f929368eb AS observability-runner + +RUN apt-get update \ + && apt-get install -y --no-install-recommends ca-certificates git \ + && rm -rf /var/lib/apt/lists/* + +RUN corepack enable + +RUN useradd --create-home --shell /bin/bash appuser \ + && mkdir -p /app \ + && chown appuser:appuser /app + +ENV HOME="/home/appuser" +ENV NODE_OPTIONS="--disable-warning=ExperimentalWarning" +ENV OPENCLAW_DISABLE_BONJOUR="1" + +USER appuser +WORKDIR /app + +COPY --chown=appuser:appuser package.json pnpm-lock.yaml pnpm-workspace.yaml .npmrc ./ +COPY --chown=appuser:appuser ui/package.json ./ui/package.json +COPY --chown=appuser:appuser patches ./patches +COPY --chown=appuser:appuser scripts/postinstall-bundled-plugins.mjs scripts/preinstall-package-manager-warning.mjs scripts/npm-runner.mjs scripts/windows-cmd-helpers.mjs ./scripts/ +RUN --mount=type=bind,source=extensions,target=/tmp/extensions,readonly \ + find /tmp/extensions -mindepth 2 -maxdepth 2 -name package.json -print | \ + while IFS= read -r manifest; do \ + dest="${manifest#/tmp/}"; \ + mkdir -p "$(dirname "$dest")"; \ + cp "$manifest" "$dest"; \ + done + +RUN --mount=type=cache,id=openclaw-pnpm-store,target=/home/appuser/.local/share/pnpm/store,sharing=locked \ + pnpm install --frozen-lockfile + +COPY --chown=appuser:appuser .oxlintrc.json tsconfig.json tsconfig.plugin-sdk.dts.json tsconfig.oxlint*.json tsdown.config.ts vitest.config.ts openclaw.mjs ./ +COPY --chown=appuser:appuser src ./src +COPY --chown=appuser:appuser test ./test +COPY --chown=appuser:appuser scripts ./scripts +COPY --chown=appuser:appuser docs ./docs +COPY --chown=appuser:appuser packages ./packages +COPY --chown=appuser:appuser qa ./qa +COPY --chown=appuser:appuser skills ./skills +COPY --chown=appuser:appuser ui ./ui +COPY --chown=appuser:appuser extensions ./extensions +COPY --chown=appuser:appuser vendor/a2ui/renderers/lit ./vendor/a2ui/renderers/lit +COPY 
--chown=appuser:appuser apps/shared/OpenClawKit/Sources/OpenClawKit/Resources ./apps/shared/OpenClawKit/Sources/OpenClawKit/Resources +COPY --chown=appuser:appuser apps/shared/OpenClawKit/Tools/CanvasA2UI ./apps/shared/OpenClawKit/Tools/CanvasA2UI + +RUN pnpm build +RUN mkdir -p dist/control-ui \ + && printf '%s\n' 'OpenClaw Control UI' > dist/control-ui/index.html + +CMD ["bash"] diff --git a/scripts/e2e/docker-observability-smoke.sh b/scripts/e2e/docker-observability-smoke.sh new file mode 100644 index 00000000000..885c1179d90 --- /dev/null +++ b/scripts/e2e/docker-observability-smoke.sh @@ -0,0 +1,52 @@ +#!/usr/bin/env bash +set -euo pipefail + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)" +source "$ROOT_DIR/scripts/lib/docker-e2e-image.sh" + +IMAGE_NAME="$(docker_e2e_resolve_image "openclaw-docker-observability-e2e:local" OPENCLAW_DOCKER_OBSERVABILITY_E2E_IMAGE)" +SKIP_BUILD="${OPENCLAW_DOCKER_OBSERVABILITY_E2E_SKIP_BUILD:-0}" +LOOPS="${OPENCLAW_DOCKER_OBSERVABILITY_LOOPS:-1}" +OUTPUT_DIR="${OPENCLAW_DOCKER_OBSERVABILITY_OUTPUT_DIR:-$ROOT_DIR/.artifacts/docker-observability/$(date +%Y%m%d-%H%M%S)}" + +if ! [[ "$LOOPS" =~ ^[1-9][0-9]*$ ]]; then + echo "OPENCLAW_DOCKER_OBSERVABILITY_LOOPS must be a positive integer, got: $LOOPS" >&2 + exit 1 +fi + +mkdir -p "$OUTPUT_DIR" + +docker_e2e_build_or_reuse "$IMAGE_NAME" docker-observability "$ROOT_DIR/scripts/e2e/Dockerfile.observability" "$ROOT_DIR" "" "$SKIP_BUILD" + +echo "Running Docker observability smoke with $LOOPS loop(s)..." 
+run_logged docker-observability docker run --rm \ + -e "OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=$LOOPS" \ + -v "$OUTPUT_DIR:/app/.artifacts/docker-observability-current" \ + "$IMAGE_NAME" \ + bash -lc ' +set -euo pipefail + +loops="${OPENCLAW_DOCKER_OBSERVABILITY_LOOPS:-1}" +artifact_root=".artifacts/docker-observability-current" +mkdir -p "$artifact_root" + +for i in $(seq 1 "$loops"); do + iteration_dir="$artifact_root/loop-$i" + mkdir -p "$iteration_dir" + + echo "== docker observability loop $i/$loops: otel ==" + pnpm qa:otel:smoke \ + --provider-mode mock-openai \ + --output-dir "$iteration_dir/otel" + + echo "== docker observability loop $i/$loops: prometheus ==" + pnpm openclaw qa suite \ + --provider-mode mock-openai \ + --scenario docker-prometheus-smoke \ + --concurrency 1 \ + --fast \ + --output-dir "$iteration_dir/prometheus" +done +' + +echo "Docker observability smoke passed. Artifacts: $OUTPUT_DIR" diff --git a/scripts/lib/docker-e2e-scenarios.mjs b/scripts/lib/docker-e2e-scenarios.mjs index 28acd792a13..bddda074b03 100644 --- a/scripts/lib/docker-e2e-scenarios.mjs +++ b/scripts/lib/docker-e2e-scenarios.mjs @@ -25,7 +25,10 @@ function lane(name, command, options = {}) { return { cacheKey: options.cacheKey, command, - e2eImageKind: options.e2eImageKind ?? (options.live ? undefined : "functional"), + e2eImageKind: + options.e2eImageKind === false + ? undefined + : (options.e2eImageKind ?? (options.live ? undefined : "functional")), estimateSeconds: options.estimateSeconds, live: options.live === true, name, @@ -181,6 +184,10 @@ export const mainLanes = [ { resources: ["service"], weight: 3 }, ), serviceLane("gateway-network", "OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:gateway-network"), + serviceLane("observability", "bash scripts/e2e/docker-observability-smoke.sh", { + e2eImageKind: false, + weight: 3, + }), serviceLane( "agents-delete-shared-workspace", "OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:agents-delete-shared-workspace",
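
Note on the `lane()` change in `scripts/lib/docker-e2e-scenarios.mjs`: the new `e2eImageKind: false` value appears to act as an explicit opt-out, so the observability lane can build or reuse its own image instead of inheriting the default `"functional"` kind. The following is a minimal sketch of that resolution logic in isolation, not the repository's code. `resolveE2eImageKind` is a hypothetical helper name, and `"bare"` is an illustrative value assumed only for this example.

```javascript
// Sketch only: mirrors the ternary added to lane() in the hunk above.
// `resolveE2eImageKind` is a hypothetical name; lane() inlines this expression.
function resolveE2eImageKind(options = {}) {
  // An explicit `false` opts the lane out of any shared e2e image.
  if (options.e2eImageKind === false) {
    return undefined;
  }
  // Otherwise keep the earlier behavior: an explicit kind wins, live lanes
  // get no image kind, and every other lane defaults to "functional".
  return options.e2eImageKind ?? (options.live ? undefined : "functional");
}

// Example inputs ("bare" is an assumed, illustrative kind).
const cases = [
  {},
  { live: true },
  { e2eImageKind: false, weight: 3 },
  { e2eImageKind: "bare" },
];

for (const options of cases) {
  console.log(JSON.stringify(options), "->", resolveE2eImageKind(options));
}
```

Before this change, passing a falsy-but-defined value could not suppress the default, because `??` only falls through on `null` or `undefined`. Treating `false` as a sentinel keeps every existing call site unchanged.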