docs: cover reply media and voice-call fixes

2026-05-06 06:00:43 +00:00 · 2026-04-25 05:48:29 +01:00
parent 938b53698e
commit 759fe0bf95
5 changed files with 62 additions and 0 deletions
--- a/docs/cli/voicecall.md
+++ b/docs/cli/voicecall.md
@@ -33,6 +33,10 @@ scripts:
 openclaw voicecall setup --json
 ```

+For external providers (`twilio`, `telnyx`, `plivo`), setup must resolve a public
+webhook URL from `publicUrl`, a tunnel, or Tailscale exposure. A loopback/private
+serve fallback is rejected because carriers cannot reach it.
+
 `smoke` runs the same readiness checks. It will not place a real phone call
 unless both `--to` and `--yes` are present:

--- a/docs/concepts/streaming.md
+++ b/docs/concepts/streaming.md
@@ -54,6 +54,19 @@ Legend:

 `message_end` still uses the chunker if the buffered text exceeds `maxChars`, so it can emit multiple chunks at the end.

+### Media delivery with block streaming
+
+`MEDIA:` directives are normal delivery metadata. When block streaming sends a
+media block early, OpenClaw remembers that delivery for the turn. If the final
+assistant payload repeats the same media URL, the final delivery strips the
+duplicate media instead of sending the attachment again.
+
+A final payload that exactly duplicates an already-streamed block is suppressed.
+If the final payload adds distinct text around media that was already streamed,
+OpenClaw still sends the new text but delivers the media only once. This prevents
+duplicate voice notes or files on channels such as Telegram when an agent emits
+`MEDIA:` during streaming and the provider also includes it in the completed reply.
+
 ## Chunking algorithm (low/high bounds)

 Block chunking is implemented by `EmbeddedBlockChunker`:
--- a/docs/help/testing-live.md
+++ b/docs/help/testing-live.md
@@ -13,6 +13,35 @@ For quick start, QA runners, unit/integration suites, and Docker flows, see
 suites: model matrix, CLI backends, ACP, and media-provider live tests, plus
 credential handling.

+## Live: local profile smoke commands
+
+Source `~/.profile` before ad hoc live checks so provider keys and local tool
+paths match your shell:
+
+```bash
+source ~/.profile
+```
+
+Safe media smoke:
+
+```bash
+pnpm openclaw infer tts convert --local --json \
+  --text "OpenClaw live smoke." \
+  --output /tmp/openclaw-live-smoke.mp3
+```
+
+Safe voice-call readiness smoke:
+
+```bash
+pnpm openclaw voicecall setup --json
+pnpm openclaw voicecall smoke --to "+15555550123"
+```
+
+`voicecall smoke` is a dry run unless `--yes` is also present. Use `--yes` only
+when you intentionally want to place a real notify call. For Twilio, Telnyx, and
+Plivo, a successful readiness check requires a public webhook URL; local-only
+loopback/private fallbacks are rejected by design.
+
 ## Live: Android node capability sweep

 - Test: `src/gateway/android-node.capabilities.live.test.ts`
--- a/docs/plugins/voice-call.md
+++ b/docs/plugins/voice-call.md
@@ -152,6 +152,11 @@ whether the plugin is enabled, the provider and credentials are present, webhook
 exposure is configured, and only one audio mode is active. Use
 `openclaw voicecall setup --json` for scripts.

+For Twilio, Telnyx, and Plivo, setup must resolve to a public webhook URL. If the
+configured `publicUrl`, tunnel URL, Tailscale URL, or serve fallback resolves to
+loopback or private network space, setup fails instead of starting a provider
+that cannot receive real carrier webhooks.
+
 For a no-surprises smoke test, run:

 ```bash
@@ -478,6 +483,9 @@ Notes:
 - Core TTS is used when Twilio media streaming is enabled; otherwise calls fall back to provider native voices.
 - If a Twilio media stream is already active, Voice Call does not fall back to TwiML `<Say>`. If telephony TTS is unavailable in that state, the playback request fails instead of mixing two playback paths.
 - When telephony TTS falls back to a secondary provider, Voice Call logs a warning with the provider chain (`from`, `to`, `attempts`) for debugging.
+- When Twilio barge-in or stream teardown clears the pending TTS queue, queued
+  playback requests settle instead of leaving code that awaits playback
+  completion hanging.

 ### More examples

@@ -589,6 +597,9 @@ For outbound `conversation` calls, first-message handling is tied to live playba
 - Barge-in queue clear and auto-response are suppressed only while the initial greeting is actively speaking.
 - If initial playback fails, the call returns to `listening` and the initial message remains queued for retry.
 - Initial playback for Twilio streaming starts on stream connect without extra delay.
+- Barge-in aborts active playback and clears queued-but-not-yet-playing Twilio
+  TTS entries. Cleared entries resolve as skipped, so follow-up response logic
+  can continue without waiting on audio that will never play.
 - Realtime voice conversations use the realtime stream's own opening turn. Voice Call does not post a legacy `<Say>` TwiML update for that initial message, so outbound `<Connect><Stream>` sessions stay attached.

 ### Twilio stream disconnect grace
--- a/docs/reference/rich-output-protocol.md
+++ b/docs/reference/rich-output-protocol.md
@@ -15,6 +15,11 @@ Assistant output can carry a small set of delivery/render directives:

 These directives are separate. `MEDIA:` and reply/voice tags remain delivery metadata; `[embed ...]` is the web-only rich render path.

+When block streaming is enabled, `MEDIA:` remains single-delivery metadata for a
+turn. If the same media URL is sent in a streamed block and repeated in the final
+assistant payload, OpenClaw delivers the attachment once and strips the duplicate
+from the final payload.
+
 ## `[embed ...]`

 `[embed ...]` is the only agent-facing rich render syntax for the Control UI.