fix(google): emit opus voice-note tts

2026-05-06 09:20:43 +00:00 · 2026-04-25 21:33:15 +01:00
parent d5b6667823
commit e2fd3dcee9
14 changed files with 300 additions and 123 deletions
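
Note: the doc changes below describe the new voice-note path as "Google PCM is wrapped as WAV and transcoded to 48 kHz Opus with `ffmpeg`". A minimal sketch of those two steps follows. It is not taken from this commit's code: the helper names `wrapPcmAsWav` and `opusVoiceNoteArgs` are hypothetical, and the 16-bit mono sample layout and the Ogg output container are assumptions. Only the 24 kHz input rate, the 48 kHz Opus target, and the use of `ffmpeg` come from the docs.

```typescript
// Hypothetical helpers for illustration only; names are not from the OpenClaw source.

// Prepend a 44-byte RIFF/WAVE header to raw little-endian PCM samples.
// Assumption: 16-bit mono; the docs only state that the PCM is 24 kHz.
function wrapPcmAsWav(
  pcm: Uint8Array,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16,
): Uint8Array {
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size: file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // data chunk size in bytes

  const wav = new Uint8Array(44 + pcm.length);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcm, 44);
  return wav;
}

// Build ffmpeg arguments that transcode a WAV file to 48 kHz mono Opus.
// Assumption: the output path ends in .ogg so ffmpeg selects the Ogg container.
function opusVoiceNoteArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y", // overwrite the output file if it exists
    "-i", inputPath,
    "-c:a", "libopus", // requires an ffmpeg build with libopus enabled
    "-ar", "48000", // resample from 24 kHz to 48 kHz
    "-ac", "1",
    outputPath,
  ];
}
```

In this sketch, the caller would write the result of `wrapPcmAsWav` to a temporary file and pass `opusVoiceNoteArgs(tmpWav, tmpOgg)` to a child process running `ffmpeg`. Whether the actual change uses temporary files or pipes is not shown in this part of the diff.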
--- a/docs/.generated/plugin-sdk-api-baseline.sha256
+++ b/docs/.generated/plugin-sdk-api-baseline.sha256
@@ -1,2 +1,2 @@
-f813474b1623f06e1465daacd56db970e8e92ab1be122faee0fa2a1dc2d4fc43  plugin-sdk-api-baseline.json
-b3ea88c0c9b4cf6d9a46f0d34149063303853e78ef9708224608e4da79b23190  plugin-sdk-api-baseline.jsonl
+c911117176b41eebf26470618274a7e093910e9b36855bc045bc8a92f6856745  plugin-sdk-api-baseline.json
+ff360635f95beb217b9dd207a87eaf331319a7671aea03acfe05911756741b21  plugin-sdk-api-baseline.jsonl
--- a/docs/providers/google.md
+++ b/docs/providers/google.md
@@ -252,8 +252,8 @@ The bundled `google` speech provider uses the Gemini API TTS path with

 - Default voice: `Kore`
 - Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
- Output: WAV for regular TTS attachments, PCM for Talk/telephony
- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus
+- Output: WAV for regular TTS attachments, Opus for voice-note targets, PCM for Talk/telephony
+- Voice-note output: Google PCM is wrapped as WAV and transcoded to 48 kHz Opus with `ffmpeg`

 To use Google as the default TTS provider:

--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -584,7 +584,7 @@ These override `messages.tts.*` for that host.
 - **Local CLI**: uses the configured `outputFormat`. Voice-note targets are
  converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM
  with `ffmpeg`.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
+- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
 - **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
 - **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
 - **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).