refactor(tts): resolve voice delivery from channel capabilities

Peter Steinberger
2026-04-26 06:51:19 +01:00
parent 2784710f4d
commit d613c8e29b
20 changed files with 349 additions and 71 deletions


@@ -1,2 +1,2 @@
690c1cd4c0c2c3d31577958120e14ac0bf555af529e03aa5e7965b1d04659c49 plugin-sdk-api-baseline.json
a0e6ba472ddd3acea34c0a8fda8cbb7d1172b1671a671d5fef5a9f42d749ce0d plugin-sdk-api-baseline.jsonl
a81b6ddeb1fd24bf234a3b7ba1d51d18d7060afa49378dd92988f326e140db13 plugin-sdk-api-baseline.json
90a6e45404c2c017c23ab9ee75e71503ec683a680f64266504fdab69e43f288b plugin-sdk-api-baseline.jsonl


@@ -646,6 +646,44 @@ or `messages.tts.prefsPath`.
These override the effective config from `messages.tts` plus the active
`agents.list[].tts` block for that host.
## Output formats (fixed)
TTS voice delivery is channel-capability driven. Channel plugins advertise
whether voice-style TTS should ask providers for a native `voice-note` target or
keep normal `audio-file` synthesis and only mark compatible output for voice
delivery.
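
As a rough illustration of the capability-driven choice described above, the sketch below resolves a synthesis target from what a channel advertises. The type and function names (`VoiceCapabilities`, `resolveDelivery`, and their fields) are hypothetical and are not taken from the OpenClaw plugin SDK; they only model the two behaviors the paragraph names.

```typescript
// Hypothetical model, not the OpenClaw plugin SDK API.
type TtsTarget = "voice-note" | "audio-file";

interface VoiceCapabilities {
  // Channel asks providers for a native voice-note target (for example Opus).
  nativeVoiceNote: boolean;
  // Formats the channel can mark for voice delivery while synthesis
  // stays on the normal audio-file path.
  voiceCompatibleFormats: string[];
}

interface DeliveryPlan {
  target: TtsTarget;
  deliverAsVoice: boolean;
}

function resolveDelivery(
  caps: VoiceCapabilities | undefined,
  wantsVoice: boolean,
  producedFormat: string,
): DeliveryPlan {
  // No voice request or no advertised capabilities: plain audio attachment.
  if (!wantsVoice || !caps) {
    return { target: "audio-file", deliverAsVoice: false };
  }
  // Channel wants a native voice-note target from the provider.
  if (caps.nativeVoiceNote) {
    return { target: "voice-note", deliverAsVoice: true };
  }
  // Otherwise keep audio-file synthesis and only mark compatible output.
  const compatible = caps.voiceCompatibleFormats.includes(
    producedFormat.toLowerCase(),
  );
  return { target: "audio-file", deliverAsVoice: compatible };
}
```

Under these assumptions, a channel that advertises `voiceCompatibleFormats: ["mp3", "caf"]` without `nativeVoiceNote` keeps the `audio-file` target and marks MP3 or CAF output for voice delivery, while a WAV result is sent as a normal attachment.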
- **Voice-note capable channels**: voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
- 48kHz / 64kbps is a good voice message tradeoff.
- **Feishu / WhatsApp**: when a voice-note reply is produced as MP3/WebM/WAV/M4A
or another likely audio file, the channel plugin transcodes it to 48kHz
Ogg/Opus with `ffmpeg` before sending the native voice message. WhatsApp sends
the result through the Baileys `audio` payload with `ptt: true` and
`audio/ogg; codecs=opus`. If conversion fails, Feishu receives the original
file as an attachment, while the WhatsApp send fails rather than posting an
incompatible PTT payload.
- **BlueBubbles**: keeps provider synthesis on the normal audio-file path; MP3
and CAF outputs are marked for iMessage voice memo delivery.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. When a channel advertises both a voice-note target and transcoding, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
- **Xiaomi MiMo**: MP3 by default, or WAV when configured. When a channel advertises both a voice-note target and transcoding, OpenClaw transcodes the Xiaomi output to 48kHz Opus with `ffmpeg` before delivery.
- **Local CLI**: uses the configured `outputFormat`. Voice-note targets are
converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM
with `ffmpeg`.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
- **Inworld**: MP3 for normal audio attachments, native `OGG_OPUS` for voice-note targets, and raw `PCM` at 22050 Hz for Talk/telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
- Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
guaranteed Opus voice messages.
- If the configured Microsoft output format fails, OpenClaw retries with MP3.
OpenAI/ElevenLabs output formats are fixed per channel (see above).
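
The `ffmpeg` transcode step and the per-channel failure handling described for Feishu and WhatsApp can be sketched as follows. This is a minimal illustration, not OpenClaw's implementation: the function names are hypothetical, the 64 kbps bitrate is an assumption borrowed from the Opus tradeoff noted in the list, and `libopus` must be available in the local `ffmpeg` build.

```typescript
// Hypothetical helpers, not OpenClaw code.

// Builds ffmpeg arguments for a 48 kHz Ogg/Opus voice-note file.
// The 64k bitrate is an assumed value, not confirmed by the source.
function buildOpusTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",               // overwrite the output file if it exists
    "-i", inputPath,    // source audio (MP3, WebM, WAV, M4A, ...)
    "-vn",              // ignore any video or cover-art stream
    "-c:a", "libopus",  // encode audio as Opus
    "-ar", "48000",     // 48 kHz sample rate
    "-b:a", "64k",      // assumed bitrate
    outputPath,         // expected to end in .ogg
  ];
}

type TranscodeFailurePolicy = "send-original-as-attachment" | "fail-send";

// Mirrors the documented behavior: Feishu falls back to the original file,
// WhatsApp refuses to post an incompatible PTT payload.
function onTranscodeFailure(channel: "feishu" | "whatsapp"): TranscodeFailurePolicy {
  return channel === "feishu" ? "send-original-as-attachment" : "fail-send";
}
```

The argument array can be passed to Node's `child_process.execFile("ffmpeg", args)`; keeping it as an array avoids shell quoting problems with file paths.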
## Auto-TTS behavior
When `messages.tts.auto` is enabled, OpenClaw: