mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 10:30:44 +00:00
fix(voice): reuse preflight transcripts across channels
This commit is contained in:
@@ -21,6 +21,8 @@ need a separate `openclaw plugins install` step.
|
||||
- OpenClaw talks to it through its REST API (`GET /api/v1/ping`, `POST /message/text`, `POST /chat/:id/*`).
|
||||
- Incoming messages arrive via webhooks; outgoing replies, typing indicators, read receipts, and tapbacks are REST calls.
|
||||
- Attachments and stickers are ingested as inbound media (and surfaced to the agent when possible).
|
||||
- Auto-TTS replies that synthesize MP3 or CAF audio are delivered as iMessage
|
||||
voice memo bubbles instead of plain file attachments.
|
||||
- Pairing/allowlist works the same way as other channels (`/channels/pairing` etc) with `channels.bluebubbles.allowFrom` + pairing codes.
|
||||
- Reactions are surfaced as system events just like Slack/Telegram so agents can "mention" them before replying.
|
||||
- Advanced features: edit, unsend, reply threading, message effects, group management.
|
||||
|
||||
@@ -244,6 +244,7 @@ content and identifiers.
|
||||
|
||||
- explicit WhatsApp mentions of the bot identity
|
||||
- configured mention regex patterns (`agents.list[].groupChat.mentionPatterns`, fallback `messages.groupChat.mentionPatterns`)
|
||||
- inbound voice-note transcripts for authorized group messages
|
||||
- implicit reply-to-bot detection (reply sender matches bot identity)
|
||||
|
||||
Security note:
|
||||
@@ -296,6 +297,11 @@ When the linked self number is also present in `allowFrom`, WhatsApp self-chat s
|
||||
- `<media:document>`
|
||||
- `<media:sticker>`
|
||||
|
||||
Authorized group voice notes are transcribed before mention gating when the
|
||||
body is only `<media:audio>`, so saying the bot mention in the voice note can
|
||||
trigger the reply. If the transcript still does not mention the bot, the
|
||||
transcript is kept in pending group history instead of the raw placeholder.
|
||||
|
||||
Location bodies use terse coordinate text. Location labels/comments and contact/vCard details are rendered as fenced untrusted metadata, not inline prompt text.
|
||||
|
||||
</Accordion>
|
||||
|
||||
@@ -58,6 +58,10 @@ Video and music generation run as background tasks because provider processing t
|
||||
|
||||
Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
|
||||
inbound audio through the batch `tools.media.audio` path when configured.
|
||||
Channel plugins that preflight a voice note for mention gating or command
|
||||
parsing mark the transcribed attachment on the inbound context, so the shared
|
||||
media-understanding pass reuses that transcript instead of making a second STT
|
||||
call for the same audio.
|
||||
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
|
||||
streaming STT providers, so live phone audio can be forwarded to the selected
|
||||
vendor without waiting for a completed recording.
|
||||
|
||||
Reference in New Issue
Block a user