From 759fe0bf95bbe01a2fca40d800f09bc1221af779 Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Sat, 25 Apr 2026 05:48:29 +0100
Subject: [PATCH] docs: cover reply media and voice-call fixes

---
 docs/cli/voicecall.md                  |  4 ++++
 docs/concepts/streaming.md             | 13 ++++++++++++
 docs/help/testing-live.md              | 29 ++++++++++++++++++++++++++
 docs/plugins/voice-call.md             | 11 ++++++++++
 docs/reference/rich-output-protocol.md |  5 +++++
 5 files changed, 62 insertions(+)

diff --git a/docs/cli/voicecall.md b/docs/cli/voicecall.md
index 4ad165c6069..dc42b95879a 100644
--- a/docs/cli/voicecall.md
+++ b/docs/cli/voicecall.md
@@ -33,6 +33,10 @@ scripts:
 openclaw voicecall setup --json
 ```
 
+For external providers (`twilio`, `telnyx`, `plivo`), setup must resolve a public
+webhook URL from `publicUrl`, a tunnel, or Tailscale exposure. A loopback/private
+serve fallback is rejected because carriers cannot reach it.
+
 `smoke` runs the same readiness checks. It will not place a real phone call
 unless both `--to` and `--yes` are present:
 
diff --git a/docs/concepts/streaming.md b/docs/concepts/streaming.md
index 5c80ffc2cf4..cc487bc2c1b 100644
--- a/docs/concepts/streaming.md
+++ b/docs/concepts/streaming.md
@@ -54,6 +54,19 @@ Legend:
 
 `message_end` still uses the chunker if the buffered text exceeds `maxChars`, so it can emit multiple chunks at the end.
 
+### Media delivery with block streaming
+
+`MEDIA:` directives are normal delivery metadata. When block streaming sends a
+media block early, OpenClaw remembers that delivery for the turn. If the final
+assistant payload repeats the same media URL, the final delivery strips the
+duplicate media instead of sending the attachment again.
+
+Exact duplicate final payloads are suppressed. If the final payload adds
+distinct text around media that was already streamed, OpenClaw still sends the
+new text while keeping the media single-delivery. This prevents duplicate voice
+notes or files on channels such as Telegram when an agent emits `MEDIA:` during
+streaming and the provider also includes it in the completed reply.
+
 ## Chunking algorithm (low/high bounds)
 
 Block chunking is implemented by `EmbeddedBlockChunker`:
diff --git a/docs/help/testing-live.md b/docs/help/testing-live.md
index 08fe18bc617..21c4f00f6f6 100644
--- a/docs/help/testing-live.md
+++ b/docs/help/testing-live.md
@@ -13,6 +13,35 @@ For quick start, QA runners, unit/integration suites, and Docker flows, see
 suites: model matrix, CLI backends, ACP, and media-provider live tests, plus
 credential handling.
 
+## Live: local profile smoke commands
+
+Source `~/.profile` before ad hoc live checks so provider keys and local tool
+paths match your shell:
+
+```bash
+source ~/.profile
+```
+
+Safe media smoke:
+
+```bash
+pnpm openclaw infer tts convert --local --json \
+  --text "OpenClaw live smoke." \
+  --output /tmp/openclaw-live-smoke.mp3
+```
+
+Safe voice-call readiness smoke:
+
+```bash
+pnpm openclaw voicecall setup --json
+pnpm openclaw voicecall smoke --to "+15555550123"
+```
+
+`voicecall smoke` is a dry run unless `--yes` is also present. Use `--yes` only
+when you intentionally want to place a real notify call. For Twilio, Telnyx, and
+Plivo, a successful readiness check requires a public webhook URL; local-only
+loopback/private fallbacks are rejected by design.
+
 ## Live: Android node capability sweep
 
 - Test: `src/gateway/android-node.capabilities.live.test.ts`
diff --git a/docs/plugins/voice-call.md b/docs/plugins/voice-call.md
index df9a55a1ab7..ad14fc1f571 100644
--- a/docs/plugins/voice-call.md
+++ b/docs/plugins/voice-call.md
@@ -152,6 +152,11 @@ whether the plugin is enabled, the provider and credentials are present, webhook
 exposure is configured, and only one audio mode is active. Use
 `openclaw voicecall setup --json` for scripts.
 
+For Twilio, Telnyx, and Plivo, setup must resolve to a public webhook URL. If the
+configured `publicUrl`, tunnel URL, Tailscale URL, or serve fallback resolves to
+loopback or private network space, setup fails instead of starting a provider
+that cannot receive real carrier webhooks.
+
 For a no-surprises smoke test, run:
 
 ```bash
@@ -478,6 +483,9 @@ Notes:
 - Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.
 - If a Twilio media stream is already active, Voice Call does not fall back to TwiML ``. If telephony TTS is unavailable in that state, the playback request fails instead of mixing two playback paths.
 - When telephony TTS falls back to a secondary provider, Voice Call logs a warning with the provider chain (`from`, `to`, `attempts`) for debugging.
+- When Twilio barge-in or stream teardown clears the pending TTS queue, queued
+  playback requests settle instead of hanging callers that are awaiting playback
+  completion.
 
 ### More examples
 
@@ -589,6 +597,9 @@ For outbound `conversation` calls, first-message handling is tied to live playba
 - Barge-in queue clear and auto-response are suppressed only while the initial greeting is actively speaking.
 - If initial playback fails, the call returns to `listening` and the initial message remains queued for retry.
 - Initial playback for Twilio streaming starts on stream connect without extra delay.
+- Barge-in aborts active playback and clears queued-but-not-yet-playing Twilio
+  TTS entries. Cleared entries resolve as skipped, so follow-up response logic
+  can continue without waiting on audio that will never play.
 - Realtime voice conversations use the realtime stream's own opening turn. Voice Call does not post a legacy `` TwiML update for that initial message, so outbound `` sessions stay attached.
 
 ### Twilio stream disconnect grace
diff --git a/docs/reference/rich-output-protocol.md b/docs/reference/rich-output-protocol.md
index be790e9defe..4703cfefaf6 100644
--- a/docs/reference/rich-output-protocol.md
+++ b/docs/reference/rich-output-protocol.md
@@ -15,6 +15,11 @@ Assistant output can carry a small set of delivery/render directives:
 
 These directives are separate. `MEDIA:` and reply/voice tags remain delivery metadata; `[embed ...]` is the web-only rich render path.
 
+When block streaming is enabled, `MEDIA:` remains single-delivery metadata for a
+turn. If the same media URL is sent in a streamed block and repeated in the final
+assistant payload, OpenClaw delivers the attachment once and strips the duplicate
+from the final payload.
+
 ## `[embed ...]`
 
 `[embed ...]` is the only agent-facing rich render syntax for the Control UI.
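
Reviewer note: the single-delivery rule for `MEDIA:` described in the `streaming.md` and `rich-output-protocol.md` hunks could be modelled roughly as below. This is only a minimal sketch for reasoning about the documented behaviour, not OpenClaw's implementation. `ReplyPayload`, `TurnDeliveryTracker`, `recordStreamed`, and `finalize` are hypothetical names, and the example URL is a placeholder.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.

interface ReplyPayload {
  text: string;
  mediaUrls: string[];
}

class TurnDeliveryTracker {
  // Media URLs already delivered during this turn.
  private readonly sentMedia = new Set<string>();
  // Normalized payloads already delivered, used to spot exact duplicates.
  private readonly sentPayloads = new Set<string>();

  private static key(p: ReplyPayload): string {
    return JSON.stringify([p.text.trim(), [...p.mediaUrls].sort()]);
  }

  /** Record a block that was streamed before the final payload arrived. */
  recordStreamed(block: ReplyPayload): void {
    block.mediaUrls.forEach((url) => this.sentMedia.add(url));
    this.sentPayloads.add(TurnDeliveryTracker.key(block));
  }

  /** Return what still needs delivering, or null when nothing is left. */
  finalize(final: ReplyPayload): ReplyPayload | null {
    // An exact duplicate of an already-streamed payload is suppressed.
    if (this.sentPayloads.has(TurnDeliveryTracker.key(final))) return null;
    // Otherwise keep the text and strip media that was already delivered.
    const mediaUrls = final.mediaUrls.filter((url) => !this.sentMedia.has(url));
    const text = final.text.trim();
    if (text === "" && mediaUrls.length === 0) return null;
    return { text, mediaUrls };
  }
}

const tracker = new TurnDeliveryTracker();
tracker.recordStreamed({ text: "", mediaUrls: ["https://example.com/note.ogg"] });

// Same media with new text: the text is kept, the media is stripped.
const withText = tracker.finalize({
  text: "Here is the voice note.",
  mediaUrls: ["https://example.com/note.ogg"],
});
console.log(withText);
```

In this model the tracker lives for a single turn, so a later turn that legitimately resends the same URL would start from a fresh tracker and is not affected.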