fix(voice): reuse preflight transcripts across channels

Peter Steinberger
2026-04-26 05:29:24 +01:00
parent 46b9044c3f
commit 6a67f65568
30 changed files with 586 additions and 64 deletions

@@ -21,6 +21,8 @@ need a separate `openclaw plugins install` step.
- OpenClaw talks to it through its REST API (`GET /api/v1/ping`, `POST /message/text`, `POST /chat/:id/*`).
- Incoming messages arrive via webhooks; outgoing replies, typing indicators, read receipts, and tapbacks are REST calls.
- Attachments and stickers are ingested as inbound media (and surfaced to the agent when possible).
- Auto-TTS replies that synthesize MP3 or CAF audio are delivered as iMessage
voice memo bubbles instead of plain file attachments.
- Pairing/allowlist works the same way as other channels (`/channels/pairing` etc.) with `channels.bluebubbles.allowFrom` + pairing codes.
- Reactions are surfaced as system events just like Slack/Telegram so agents can "mention" them before replying.
- Advanced features: edit, unsend, reply threading, message effects, group management.

@@ -244,6 +244,7 @@ content and identifiers.
- explicit WhatsApp mentions of the bot identity
- configured mention regex patterns (`agents.list[].groupChat.mentionPatterns`, fallback `messages.groupChat.mentionPatterns`)
- inbound voice-note transcripts for authorized group messages
- implicit reply-to-bot detection (reply sender matches bot identity)
Security note:
@@ -296,6 +297,11 @@ When the linked self number is also present in `allowFrom`, WhatsApp self-chat s
- `<media:document>`
- `<media:sticker>`
Authorized group voice notes are transcribed before mention gating when the
body is only `<media:audio>`, so saying the bot mention in the voice note can
trigger the reply. If the transcript still does not mention the bot, the
transcript is kept in pending group history instead of the raw placeholder.
Location bodies use terse coordinate text. Location labels/comments and contact/vCard details are rendered as fenced untrusted metadata, not inline prompt text.
</Accordion>
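
The voice-note gating described in this hunk can be sketched roughly as follows. This is a minimal illustration, not the actual OpenClaw implementation: `GroupMessage`, `GateResult`, `gateGroupMessage`, and the synchronous `transcribe` callback are all hypothetical names, and a real STT call would be asynchronous.

```typescript
// Hypothetical sketch: none of these names are taken from the OpenClaw codebase.
const AUDIO_PLACEHOLDER = "<media:audio>";

interface GroupMessage {
  body: string;
  authorized: boolean; // sender passed the group allowlist checks
}

interface GateResult {
  triggerReply: boolean; // true when the text mentions the bot
  historyText: string; // what gets stored in pending group history
}

function gateGroupMessage(
  msg: GroupMessage,
  mentionPatterns: RegExp[], // assumed to be non-global (no /g flag)
  transcribe: () => string, // stand-in for an STT call
): GateResult {
  let text = msg.body;

  // Only authorized messages whose body is just the audio placeholder
  // are transcribed before mention gating.
  if (msg.authorized && msg.body.trim() === AUDIO_PLACEHOLDER) {
    const transcript = transcribe();
    if (transcript.length > 0) {
      text = transcript;
    }
  }

  const mentioned = mentionPatterns.some((pattern) => pattern.test(text));

  // When there is no mention, the transcript (not the raw placeholder)
  // is what stays in pending history.
  return { triggerReply: mentioned, historyText: text };
}
```

In this sketch an unauthorized voice note is never transcribed, so its history entry stays as the raw `<media:audio>` placeholder.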

@@ -58,6 +58,10 @@ Video and music generation run as background tasks because provider processing t
Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
inbound audio through the batch `tools.media.audio` path when configured.
Channel plugins that preflight a voice note for mention gating or command
parsing mark the transcribed attachment on the inbound context, so the shared
media-understanding pass reuses that transcript instead of making a second STT
call for the same audio.
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.
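
The transcript reuse described in this hunk (a channel plugin marks the transcribed attachment so the shared pass skips a second STT call) might look roughly like the sketch below. All names here (`Attachment`, `preflightTranscript`, `InboundContext`, `preflightVoiceNote`, `understandAudio`) are assumptions made for illustration, not the commit's actual types, and the transcriber is synchronous only to keep the example short.

```typescript
// Hypothetical sketch: field and function names are illustrative only.
interface Attachment {
  id: string;
  kind: "audio" | "image" | "document";
  // Set by a channel plugin that already transcribed this attachment.
  preflightTranscript?: string;
}

interface InboundContext {
  attachments: Attachment[];
}

// Stand-in for an STT provider call (asynchronous in practice).
type Transcriber = (attachment: Attachment) => string;

// Channel side: transcribe once for mention gating or command parsing,
// then record the result on the inbound context.
function preflightVoiceNote(
  ctx: InboundContext,
  attachmentId: string,
  transcribe: Transcriber,
): string | undefined {
  const audio = ctx.attachments.find(
    (a) => a.id === attachmentId && a.kind === "audio",
  );
  if (!audio) return undefined;
  if (audio.preflightTranscript === undefined) {
    audio.preflightTranscript = transcribe(audio);
  }
  return audio.preflightTranscript;
}

// Shared side: reuse marked transcripts and call STT only for audio
// attachments that no channel plugin has transcribed yet.
function understandAudio(
  ctx: InboundContext,
  transcribe: Transcriber,
): Map<string, string> {
  const transcripts = new Map<string, string>();
  for (const attachment of ctx.attachments) {
    if (attachment.kind !== "audio") continue;
    const text = attachment.preflightTranscript ?? transcribe(attachment);
    transcripts.set(attachment.id, text);
  }
  return transcripts;
}
```

Storing the transcript on the attachment itself, rather than in a channel-local cache, is one way to let any later stage see it without knowing which channel produced it.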