mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 14:20:44 +00:00
fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)
* Add Google Gemini TTS provider * Remove committed planning artifact * Explain Google media provider type shape * google: distill Gemini TTS provider * fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth) * fix: honor cfg-backed Google TTS selection (#67515) (thanks @barronlroth) * fix: narrow Google TTS directive aliases (#67515) (thanks @barronlroth) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
This commit is contained in:
@@ -9,12 +9,13 @@ title: "Text-to-Speech"
|
||||
|
||||
# Text-to-speech (TTS)
|
||||
|
||||
OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
|
||||
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
|
||||
It works anywhere OpenClaw can send audio.
|
||||
|
||||
## Supported services
|
||||
|
||||
- **ElevenLabs** (primary or fallback provider)
|
||||
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
|
||||
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
|
||||
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
|
||||
- **OpenAI** (primary or fallback provider; also used for summaries)
|
||||
@@ -34,9 +35,10 @@ or ElevenLabs.
|
||||
|
||||
## Optional keys
|
||||
|
||||
If you want OpenAI, ElevenLabs, or MiniMax:
|
||||
If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
|
||||
|
||||
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
|
||||
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
|
||||
- `MINIMAX_API_KEY`
|
||||
- `OPENAI_API_KEY`
|
||||
|
||||
@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
|
||||
}
|
||||
```
|
||||
|
||||
### Google Gemini primary
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
auto: "always",
|
||||
provider: "google",
|
||||
providers: {
|
||||
google: {
|
||||
apiKey: "gemini_api_key",
|
||||
model: "gemini-3.1-flash-tts-preview",
|
||||
voiceName: "Kore",
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key
|
||||
restricted to the Gemini API is valid here, and it is the same style of key used
|
||||
by the bundled Google image-generation provider. Resolution order is
|
||||
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
|
||||
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
|
||||
|
||||
### Disable Microsoft speech
|
||||
|
||||
```json5
|
||||
@@ -238,7 +266,7 @@ Then run:
|
||||
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
|
||||
- `enabled`: legacy toggle (doctor migrates this to `auto`).
|
||||
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
|
||||
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
|
||||
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
|
||||
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
|
||||
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
|
||||
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
|
||||
@@ -250,7 +278,7 @@ Then run:
|
||||
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
|
||||
- `timeoutMs`: request timeout (ms).
|
||||
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
|
||||
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
|
||||
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
|
||||
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
|
||||
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
|
||||
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
|
||||
@@ -268,6 +296,10 @@ Then run:
|
||||
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
|
||||
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
|
||||
- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
|
||||
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
|
||||
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
|
||||
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
|
||||
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
|
||||
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
|
||||
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
|
||||
- `providers.microsoft.lang`: language code (e.g. `en-US`).
|
||||
@@ -302,9 +334,9 @@ Here you go.
|
||||
|
||||
Available directive keys (when enabled):
|
||||
|
||||
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
|
||||
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
|
||||
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
|
||||
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
|
||||
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
|
||||
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
|
||||
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
|
||||
- `vol` / `volume` (MiniMax volume, 0-10)
|
||||
- `pitch` (MiniMax pitch, -12 to 12)
|
||||
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
|
||||
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
|
||||
- 44.1kHz / 128kbps is the default balance for speech clarity.
|
||||
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
|
||||
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
|
||||
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
|
||||
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
|
||||
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
|
||||
|
||||
Reference in New Issue
Block a user