fix(voice): reuse preflight transcripts across channels

2026-05-06 10:30:44 +00:00 · 2026-04-26 05:29:24 +01:00
parent 46b9044c3f
commit 6a67f65568
30 changed files with 586 additions and 64 deletions
--- a/docs/channels/bluebubbles.md
+++ b/docs/channels/bluebubbles.md
@@ -21,6 +21,8 @@ need a separate `openclaw plugins install` step.
 - OpenClaw talks to it through its REST API (`GET /api/v1/ping`, `POST /message/text`, `POST /chat/:id/*`).
 - Incoming messages arrive via webhooks; outgoing replies, typing indicators, read receipts, and tapbacks are REST calls.
 - Attachments and stickers are ingested as inbound media (and surfaced to the agent when possible).
+- Auto-TTS replies that synthesize MP3 or CAF audio are delivered as iMessage
+  voice memo bubbles instead of plain file attachments.
 - Pairing/allowlist works the same way as other channels (`/channels/pairing` etc) with `channels.bluebubbles.allowFrom` + pairing codes.
 - Reactions are surfaced as system events just like Slack/Telegram so agents can "mention" them before replying.
 - Advanced features: edit, unsend, reply threading, message effects, group management.
--- a/docs/channels/whatsapp.md
+++ b/docs/channels/whatsapp.md
@@ -244,6 +244,7 @@ content and identifiers.

    - explicit WhatsApp mentions of the bot identity
    - configured mention regex patterns (`agents.list[].groupChat.mentionPatterns`, fallback `messages.groupChat.mentionPatterns`)
+    - inbound voice-note transcripts for authorized group messages
    - implicit reply-to-bot detection (reply sender matches bot identity)

    Security note:
@@ -296,6 +297,11 @@ When the linked self number is also present in `allowFrom`, WhatsApp self-chat s
    - `<media:document>`
    - `<media:sticker>`

+    Authorized group voice notes are transcribed before mention gating when the
+    body is only `<media:audio>`, so saying the bot mention in the voice note can
+    trigger the reply. If the transcript still does not mention the bot, the
+    transcript is kept in pending group history instead of the raw placeholder.
+
    Location bodies use terse coordinate text. Location labels/comments and contact/vCard details are rendered as fenced untrusted metadata, not inline prompt text.

  </Accordion>
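
The WhatsApp hunk above describes transcribing authorized, audio-only group messages before mention gating. A minimal sketch of that ordering follows. All names here (`GroupMessage`, `needsPreflight`, `gateGroupMessage`) are hypothetical and not taken from the OpenClaw source, and the transcript is assumed to be produced by a separate STT step.

```typescript
// Hypothetical types and helpers for illustration only; not OpenClaw's API.
interface GroupMessage {
  body: string;
  authorized: boolean;
}

interface GateResult {
  reply: boolean; // whether the bot should respond
  historyText: string; // text kept in pending group history
}

const AUDIO_PLACEHOLDER = "<media:audio>";

// Only authorized messages whose body is just the audio placeholder
// are transcribed before gating.
function needsPreflight(msg: GroupMessage): boolean {
  return msg.authorized && msg.body.trim() === AUDIO_PLACEHOLDER;
}

// Gate on the transcript when one was produced, otherwise on the raw body.
// Patterns are assumed to be non-global so `test` carries no state.
function gateGroupMessage(
  msg: GroupMessage,
  mentionPatterns: RegExp[],
  transcript?: string,
): GateResult {
  const text =
    needsPreflight(msg) && transcript !== undefined ? transcript : msg.body;
  const reply = mentionPatterns.some((pattern) => pattern.test(text));
  // Without a mention, the transcript (not the placeholder) goes to history.
  return { reply, historyText: text };
}
```

In this sketch an unauthorized voice note is never transcribed, so it is gated on, and stored as, the raw placeholder.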
--- a/docs/tools/media-overview.md
+++ b/docs/tools/media-overview.md
@@ -58,6 +58,10 @@ Video and music generation run as background tasks because provider processing t

 Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
 inbound audio through the batch `tools.media.audio` path when configured.
+Channel plugins that preflight a voice note for mention gating or command
+parsing mark the transcribed attachment on the inbound context, so the shared
+media-understanding pass reuses that transcript instead of making a second STT
+call for the same audio.
 Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
 streaming STT providers, so live phone audio can be forwarded to the selected
 vendor without waiting for a completed recording.
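
The media-overview hunk above says a preflighted attachment is marked on the inbound context so the shared pass can skip a second STT call. One way that marking might look is sketched below. The `InboundContext` and `Attachment` shapes and both helper names are assumptions made for illustration; the actual fields in the codebase may differ.

```typescript
// Hypothetical shapes for illustration; real inbound context fields may differ.
interface Attachment {
  id: string;
  kind: "audio" | "image" | "document";
  transcript?: string; // set when a channel plugin already transcribed it
}

interface InboundContext {
  attachments: Attachment[];
}

// Called by a channel plugin after its preflight STT call.
function markTranscribed(
  ctx: InboundContext,
  attachmentId: string,
  transcript: string,
): void {
  const attachment = ctx.attachments.find((a) => a.id === attachmentId);
  if (attachment) {
    attachment.transcript = transcript;
  }
}

// Called by the shared media-understanding pass: only audio attachments
// without a stored transcript still need an STT request.
function attachmentsNeedingStt(ctx: InboundContext): Attachment[] {
  return ctx.attachments.filter(
    (a) => a.kind === "audio" && a.transcript === undefined,
  );
}
```

Storing the transcript on the attachment itself, rather than in a side cache, keeps the reuse decision local to the context that travels with the message.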