fix(tts): route WhatsApp MP3 TTS as voice notes

2026-05-06 15:30:47 +00:00 · 2026-04-26 03:25:55 +01:00
parent 90cd9fce85
commit 9b91040053
4 changed files with 34 additions and 36 deletions
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -754,10 +754,11 @@ These override the effective config from `messages.tts` plus the active

 - **Feishu / Matrix / Telegram / WhatsApp**: voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
  - 48kHz / 64kbps is a good voice message tradeoff.
- **Feishu**: when a voice-note reply is produced as MP3/WAV/M4A or another
-  likely audio file, the Feishu plugin transcodes it to 48kHz Ogg/Opus with
-  `ffmpeg` before sending the native `audio` bubble. If conversion fails, Feishu
-  receives the original file as an attachment.
+- **Feishu / WhatsApp**: when a voice-note reply is produced as MP3/WAV/M4A or
+  another likely audio file, the channel plugin transcodes it to 48kHz Ogg/Opus
+  with `ffmpeg` before sending the native voice message. If conversion fails,
+  Feishu receives the original file as an attachment; the WhatsApp send fails
+  rather than posting an incompatible PTT payload.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
 - **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
@@ -844,8 +845,8 @@ Notes:
 The `tts` tool converts text to speech and returns an audio attachment for
 reply delivery. When the channel is Feishu, Matrix, Telegram, or WhatsApp,
 the audio is delivered as a voice message rather than a file attachment.
-Feishu can transcode non-Opus TTS output on this path when `ffmpeg` is
-available.
+Feishu and WhatsApp can transcode non-Opus TTS output on this path when
+`ffmpeg` is available.
 WhatsApp sends visible text separately from PTT voice-note audio because clients
 do not consistently render captions on voice notes.
 It accepts optional `channel` and `timeoutMs` fields; `timeoutMs` is a