diff --git a/CHANGELOG.md b/CHANGELOG.md
index 702eb3f1c4a..7e4a3ba5318 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,8 @@ Docs: https://docs.openclaw.ai
### Changes
+- Google/TTS: add Gemini text-to-speech support to the bundled `google` plugin, including provider registration, voice selection, WAV reply output, PCM telephony output, and setup/docs guidance. (#67515) Thanks @barronlroth.
+
### Fixes
- Gateway/tools: anchor trusted local `MEDIA:` tool-result passthrough on the exact raw name of this run's registered built-in tools, and reject client tool definitions whose names normalize-collide with a built-in or with another client tool in the same request (`400 invalid_request_error` on both JSON and SSE paths), so a client-supplied tool named like a built-in can no longer inherit its local-media trust. (#67303)
diff --git a/docs/providers/google.md b/docs/providers/google.md
index 70ee5d16693..7dabc43a100 100644
--- a/docs/providers/google.md
+++ b/docs/providers/google.md
@@ -1,6 +1,6 @@
---
title: "Google (Gemini)"
-summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, web search)"
+summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, TTS, web search)"
read_when:
- You want to use Google Gemini models with OpenClaw
- You need the API key or OAuth auth flow
@@ -9,7 +9,7 @@ read_when:
# Google (Gemini)
The Google plugin provides access to Gemini models through Google AI Studio, plus
-image generation, media understanding (image/audio/video), and web search via
+image generation, media understanding (image/audio/video), text-to-speech, and web search via
Gemini Grounding.
- Provider: `google`
@@ -133,6 +133,7 @@ Choose your preferred auth method and follow the setup steps.
| Chat completions | Yes |
| Image generation | Yes |
| Music generation | Yes |
+| Text-to-speech | Yes |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -233,6 +234,50 @@ To use Google as the default music provider:
See [Music Generation](/tools/music-generation) for shared tool parameters, provider selection, and failover behavior.
+## Text-to-speech
+
+The bundled `google` speech provider uses the Gemini API TTS path with
+`gemini-3.1-flash-tts-preview`.
+
+- Default voice: `Kore`
+- Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
+- Output: WAV for regular TTS attachments, PCM for Talk/telephony
+- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus
+
+To use Google as the default TTS provider:
+
+```json5
+{
+ messages: {
+ tts: {
+ auto: "always",
+ provider: "google",
+ providers: {
+ google: {
+ model: "gemini-3.1-flash-tts-preview",
+ voiceName: "Kore",
+ },
+ },
+ },
+ },
+}
+```
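+
+Under the hood the provider calls the Gemini `generateContent` endpoint with an
+`AUDIO` response modality and a prebuilt voice. A minimal sketch of the raw
+request (assuming your key is exported as `GEMINI_API_KEY`):
+
+```bash
+curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent" \
+  -H "x-goog-api-key: $GEMINI_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "contents": [{ "role": "user", "parts": [{ "text": "Hello from Gemini TTS." }] }],
+    "generationConfig": {
+      "responseModalities": ["AUDIO"],
+      "speechConfig": { "voiceConfig": { "prebuiltVoiceConfig": { "voiceName": "Kore" } } }
+    }
+  }'
+```
+
+The response carries base64-encoded 24kHz PCM under
+`candidates[].content.parts[].inlineData.data`; OpenClaw wraps it in a WAV
+header before attaching it.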
+
+Gemini API TTS accepts expressive square-bracket audio tags in the text, such as
+`[whispers]` or `[laughs]`. To keep tags out of the visible chat reply while
+sending them to TTS, put them inside a `[[tts:text]]...[[/tts:text]]` block:
+
+```text
+Here is the clean reply text.
+
+[[tts:text]][whispers] Here is the spoken version.[[/tts:text]]
+```
+
+A Google Cloud Console API key restricted to the Gemini API works for this
+provider. Note that this is the Gemini API TTS path, not the separate Google
+Cloud Text-to-Speech API.
+
## Advanced configuration
diff --git a/docs/tools/tts.md b/docs/tools/tts.md
index 0f4a7075e3f..cdb59116720 100644
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -9,12 +9,13 @@ title: "Text-to-Speech"
# Text-to-speech (TTS)
-OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
It works anywhere OpenClaw can send audio.
## Supported services
- **ElevenLabs** (primary or fallback provider)
+- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
@@ -34,9 +35,10 @@ or ElevenLabs.
## Optional keys
-If you want OpenAI, ElevenLabs, or MiniMax:
+If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
+- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `MINIMAX_API_KEY`
- `OPENAI_API_KEY`
@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```
+### Google Gemini primary
+
+```json5
+{
+ messages: {
+ tts: {
+ auto: "always",
+ provider: "google",
+ providers: {
+ google: {
+ apiKey: "gemini_api_key",
+ model: "gemini-3.1-flash-tts-preview",
+ voiceName: "Kore",
+ },
+ },
+ },
+ },
+}
+```
+
+Google Gemini TTS authenticates with a Gemini API key. A Google Cloud Console
+API key restricted to the Gemini API is valid here; it is the same style of key
+used by the bundled Google image-generation provider. Resolution order is
+`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
+`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
+
### Disable Microsoft speech
```json5
@@ -238,7 +266,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -250,7 +278,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -268,6 +296,10 @@ Then run:
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
+- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
+- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
+- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
+ - If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -302,9 +334,9 @@ Here you go.
Available directive keys (when enabled):
-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
-- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
-- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
+- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
+- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
- `pitch` (MiniMax pitch, -12 to 12)
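+
+For example, a reply can pin the Google voice for its spoken version using the
+`[[tts:key=value]]` form shown above (a sketch, assuming directives are
+enabled and voice overrides are allowed):
+
+```text
+[[tts:google_voice=Puck]]
+Here is the reply text, spoken with the Puck voice.
+```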
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
+- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
diff --git a/extensions/google/index.ts b/extensions/google/index.ts
index 697abbe8cb5..3f8b36eea9b 100644
--- a/extensions/google/index.ts
+++ b/extensions/google/index.ts
@@ -5,18 +5,19 @@ import { buildGoogleGeminiCliBackend } from "./cli-backend.js";
import { registerGoogleGeminiCliProvider } from "./gemini-cli-provider.js";
import { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
import { registerGoogleProvider } from "./provider-registration.js";
+import { buildGoogleSpeechProvider } from "./speech-provider.js";
import { createGeminiWebSearchProvider } from "./src/gemini-web-search-provider.js";
import { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";
let googleImageGenerationProviderPromise: Promise<ImageGenerationProvider> | null = null;
let googleMediaUnderstandingProviderPromise: Promise<GoogleMediaUnderstandingProvider> | null = null;
-type GoogleMediaUnderstandingProvider = MediaUnderstandingProvider & {
-  describeImage: NonNullable<MediaUnderstandingProvider["describeImage"]>;
-  describeImages: NonNullable<MediaUnderstandingProvider["describeImages"]>;
-  transcribeAudio: NonNullable<MediaUnderstandingProvider["transcribeAudio"]>;
-  describeVideo: NonNullable<MediaUnderstandingProvider["describeVideo"]>;
-};
+type GoogleMediaUnderstandingProvider = Required<
+ Pick<
+ MediaUnderstandingProvider,
+ "describeImage" | "describeImages" | "transcribeAudio" | "describeVideo"
+ >
+>;
async function loadGoogleImageGenerationProvider(): Promise<ImageGenerationProvider> {
if (!googleImageGenerationProviderPromise) {
@@ -113,6 +114,7 @@ export default definePluginEntry({
api.registerImageGenerationProvider(createLazyGoogleImageGenerationProvider());
api.registerMediaUnderstandingProvider(createLazyGoogleMediaUnderstandingProvider());
api.registerMusicGenerationProvider(buildGoogleMusicGenerationProvider());
+ api.registerSpeechProvider(buildGoogleSpeechProvider());
api.registerVideoGenerationProvider(buildGoogleVideoGenerationProvider());
api.registerWebSearchProvider(createGeminiWebSearchProvider());
},
diff --git a/extensions/google/openclaw.plugin.json b/extensions/google/openclaw.plugin.json
index 7eea69423ef..40f0ad25e4d 100644
--- a/extensions/google/openclaw.plugin.json
+++ b/extensions/google/openclaw.plugin.json
@@ -48,6 +48,7 @@
"mediaUnderstandingProviders": ["google"],
"imageGenerationProviders": ["google"],
"musicGenerationProviders": ["google"],
+ "speechProviders": ["google"],
"videoGenerationProviders": ["google"],
"webSearchProviders": ["gemini"]
},
diff --git a/extensions/google/plugin-registration.contract.test.ts b/extensions/google/plugin-registration.contract.test.ts
index 0b8fbe52bbf..3c18525e463 100644
--- a/extensions/google/plugin-registration.contract.test.ts
+++ b/extensions/google/plugin-registration.contract.test.ts
@@ -3,6 +3,7 @@ import { describePluginRegistrationContract } from "../../test/helpers/plugins/p
describePluginRegistrationContract({
...pluginRegistrationContractCases.google,
+ speechProviderIds: ["google"],
videoGenerationProviderIds: ["google"],
webSearchProviderIds: ["gemini"],
requireDescribeImages: true,
diff --git a/extensions/google/speech-provider.test.ts b/extensions/google/speech-provider.test.ts
new file mode 100644
index 00000000000..29ae0d57d9b
--- /dev/null
+++ b/extensions/google/speech-provider.test.ts
@@ -0,0 +1,248 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import { buildGoogleSpeechProvider, __testing } from "./speech-provider.js";
+
+function installGoogleTtsFetchMock(pcm = Buffer.from([1, 0, 2, 0])) {
+ const fetchMock = vi.fn().mockResolvedValue({
+ ok: true,
+ json: async () => ({
+ candidates: [
+ {
+ content: {
+ parts: [
+ {
+ inlineData: {
+ mimeType: "audio/L16;codec=pcm;rate=24000",
+ data: pcm.toString("base64"),
+ },
+ },
+ ],
+ },
+ },
+ ],
+ }),
+ });
+ vi.stubGlobal("fetch", fetchMock);
+ return fetchMock;
+}
+
+describe("Google speech provider", () => {
+ afterEach(() => {
+ vi.restoreAllMocks();
+ vi.unstubAllGlobals();
+ vi.unstubAllEnvs();
+ });
+
+ it("synthesizes Gemini PCM as WAV and preserves audio tags in the request text", async () => {
+ const fetchMock = installGoogleTtsFetchMock();
+ const provider = buildGoogleSpeechProvider();
+
+ const result = await provider.synthesize({
+ text: "[whispers] The door is open.",
+ cfg: {},
+ providerConfig: {
+ apiKey: "google-test-key",
+ model: "google/gemini-3.1-flash-tts",
+ voiceName: "Puck",
+ },
+ target: "audio-file",
+ timeoutMs: 12_345,
+ });
+
+ expect(fetchMock).toHaveBeenCalledWith(
+ "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
+ expect.objectContaining({
+ method: "POST",
+ body: JSON.stringify({
+ contents: [
+ {
+ role: "user",
+ parts: [{ text: "[whispers] The door is open." }],
+ },
+ ],
+ generationConfig: {
+ responseModalities: ["AUDIO"],
+ speechConfig: {
+ voiceConfig: {
+ prebuiltVoiceConfig: {
+ voiceName: "Puck",
+ },
+ },
+ },
+ },
+ }),
+ }),
+ );
+ const [, init] = fetchMock.mock.calls[0];
+ expect(new Headers(init.headers).get("x-goog-api-key")).toBe("google-test-key");
+ expect(result.outputFormat).toBe("wav");
+ expect(result.fileExtension).toBe(".wav");
+ expect(result.voiceCompatible).toBe(false);
+ expect(result.audioBuffer.subarray(0, 4).toString("ascii")).toBe("RIFF");
+ expect(result.audioBuffer.subarray(8, 12).toString("ascii")).toBe("WAVE");
+ expect(result.audioBuffer.readUInt32LE(24)).toBe(__testing.GOOGLE_TTS_SAMPLE_RATE);
+ expect(result.audioBuffer.subarray(44)).toEqual(Buffer.from([1, 0, 2, 0]));
+ });
+
+ it("falls back to GEMINI_API_KEY and configured Google API base URL", async () => {
+ vi.stubEnv("GEMINI_API_KEY", "env-google-key");
+ const fetchMock = installGoogleTtsFetchMock();
+ const provider = buildGoogleSpeechProvider();
+
+ expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 1 })).toBe(true);
+
+ await provider.synthesize({
+ text: "Read this plainly.",
+ cfg: {
+ models: {
+ providers: {
+ google: {
+ baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
+ models: [],
+ },
+ },
+ },
+ },
+ providerConfig: {},
+ target: "voice-note",
+ timeoutMs: 10_000,
+ });
+
+ expect(fetchMock).toHaveBeenCalledWith(
+ "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
+ expect.any(Object),
+ );
+ const [, init] = fetchMock.mock.calls[0];
+ expect(new Headers(init.headers).get("x-goog-api-key")).toBe("env-google-key");
+ });
+
+ it("can reuse a configured Google model-provider API key without auth profiles", async () => {
+ const fetchMock = installGoogleTtsFetchMock();
+ const provider = buildGoogleSpeechProvider();
+ const cfg = {
+ models: {
+ providers: {
+ google: {
+ apiKey: "model-provider-google-key",
+ baseUrl: "https://generativelanguage.googleapis.com",
+ models: [],
+ },
+ },
+ },
+ };
+
+ expect(provider.isConfigured({ cfg, providerConfig: {}, timeoutMs: 1 })).toBe(true);
+
+ await provider.synthesize({
+ text: "Use the configured model provider key.",
+ cfg,
+ providerConfig: {},
+ target: "audio-file",
+ timeoutMs: 10_000,
+ });
+
+ const [, init] = fetchMock.mock.calls[0];
+ expect(new Headers(init.headers).get("x-goog-api-key")).toBe("model-provider-google-key");
+ });
+
+ it("returns Gemini PCM directly for telephony synthesis", async () => {
+ const pcm = Buffer.from([3, 0, 4, 0]);
+ installGoogleTtsFetchMock(pcm);
+ const provider = buildGoogleSpeechProvider();
+
+ const result = await provider.synthesizeTelephony?.({
+ text: "Phone call audio.",
+ cfg: {},
+ providerConfig: {
+ apiKey: "google-test-key",
+ voice: "Kore",
+ },
+ timeoutMs: 5_000,
+ });
+
+ expect(result).toEqual({
+ audioBuffer: pcm,
+ outputFormat: "pcm",
+ sampleRate: 24_000,
+ });
+ });
+
+ it("resolves provider config and directive overrides", () => {
+ const provider = buildGoogleSpeechProvider();
+
+ expect(
+ provider.resolveConfig?.({
+ cfg: {},
+ rawConfig: {
+ providers: {
+ google: {
+ apiKey: "configured-key",
+ model: "google/gemini-3.1-flash-tts-preview",
+ voice: "Leda",
+ },
+ },
+ },
+ timeoutMs: 1,
+ }),
+ ).toEqual({
+ apiKey: "configured-key",
+ baseUrl: undefined,
+ model: "gemini-3.1-flash-tts-preview",
+ voiceName: "Leda",
+ });
+
+ expect(
+ provider.parseDirectiveToken?.({
+ key: "google_voice",
+ value: "Aoede",
+ policy: {
+ enabled: true,
+ allowText: true,
+ allowProvider: true,
+ allowVoice: true,
+ allowModelId: true,
+ allowVoiceSettings: true,
+ allowNormalization: true,
+ allowSeed: true,
+ },
+ }),
+ ).toEqual({
+ handled: true,
+ overrides: {
+ voiceName: "Aoede",
+ },
+ });
+
+ expect(
+ provider.parseDirectiveToken?.({
+ key: "google_model",
+ value: "gemini-3.1-flash-tts-preview",
+ policy: {
+ enabled: true,
+ allowText: true,
+ allowProvider: true,
+ allowVoice: true,
+ allowModelId: true,
+ allowVoiceSettings: true,
+ allowNormalization: true,
+ allowSeed: true,
+ },
+ }),
+ ).toEqual({
+ handled: true,
+ overrides: {
+ model: "gemini-3.1-flash-tts-preview",
+ },
+ });
+ });
+
+ it("lists Gemini prebuilt TTS voices", async () => {
+ const provider = buildGoogleSpeechProvider();
+
+ await expect(provider.listVoices?.({ providerConfig: {} })).resolves.toEqual(
+ expect.arrayContaining([
+ { id: "Kore", name: "Kore" },
+ { id: "Puck", name: "Puck" },
+ ]),
+ );
+ });
+});
diff --git a/extensions/google/speech-provider.ts b/extensions/google/speech-provider.ts
new file mode 100644
index 00000000000..0c22fb18f95
--- /dev/null
+++ b/extensions/google/speech-provider.ts
@@ -0,0 +1,391 @@
+import { assertOkOrThrowHttpError, postJsonRequest } from "openclaw/plugin-sdk/provider-http";
+import type { OpenClawConfig } from "openclaw/plugin-sdk/provider-onboard";
+import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
+import type {
+ SpeechDirectiveTokenParseContext,
+ SpeechProviderConfig,
+ SpeechProviderOverrides,
+ SpeechProviderPlugin,
+} from "openclaw/plugin-sdk/speech-core";
+import { asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
+import { normalizeOptionalString } from "openclaw/plugin-sdk/text-runtime";
+import { resolveGoogleGenerativeAiHttpRequestConfig } from "./api.js";
+
+const DEFAULT_GOOGLE_TTS_MODEL = "gemini-3.1-flash-tts-preview";
+const DEFAULT_GOOGLE_TTS_VOICE = "Kore";
+const GOOGLE_TTS_SAMPLE_RATE = 24_000;
+const GOOGLE_TTS_CHANNELS = 1;
+const GOOGLE_TTS_BITS_PER_SAMPLE = 16;
+
+const GOOGLE_TTS_VOICES = [
+ "Zephyr",
+ "Puck",
+ "Charon",
+ "Kore",
+ "Fenrir",
+ "Leda",
+ "Orus",
+ "Aoede",
+ "Callirrhoe",
+ "Autonoe",
+ "Enceladus",
+ "Iapetus",
+ "Umbriel",
+ "Algieba",
+ "Despina",
+ "Erinome",
+ "Algenib",
+ "Rasalgethi",
+ "Laomedeia",
+ "Achernar",
+ "Alnilam",
+ "Schedar",
+ "Gacrux",
+ "Pulcherrima",
+ "Achird",
+ "Zubenelgenubi",
+ "Vindemiatrix",
+ "Sadachbia",
+ "Sadaltager",
+ "Sulafat",
+] as const;
+
+type GoogleTtsProviderConfig = {
+ apiKey?: string;
+ baseUrl?: string;
+ model: string;
+ voiceName: string;
+};
+
+type GoogleTtsProviderOverrides = {
+ model?: string;
+ voiceName?: string;
+};
+
+type Maybe<T> = T | undefined;
+
+type GoogleInlineDataPart = {
+ mimeType?: string;
+ mime_type?: string;
+ data?: string;
+};
+
+type GoogleGenerateSpeechResponse = {
+ candidates?: Array<{
+ content?: {
+ parts?: Array<{
+ text?: string;
+ inlineData?: GoogleInlineDataPart;
+ inline_data?: GoogleInlineDataPart;
+ }>;
+ };
+ }>;
+};
+
+function normalizeGoogleTtsModel(model: unknown): string {
+ const trimmed = normalizeOptionalString(model);
+ if (!trimmed) {
+ return DEFAULT_GOOGLE_TTS_MODEL;
+ }
+ const withoutProvider = trimmed.startsWith("google/") ? trimmed.slice("google/".length) : trimmed;
+ return withoutProvider === "gemini-3.1-flash-tts" ? DEFAULT_GOOGLE_TTS_MODEL : withoutProvider;
+}
+
+function normalizeGoogleTtsVoiceName(voiceName: unknown): string {
+ return normalizeOptionalString(voiceName) ?? DEFAULT_GOOGLE_TTS_VOICE;
+}
+
+function resolveGoogleTtsEnvApiKey(): string | undefined {
+ return (
+ normalizeOptionalString(process.env.GEMINI_API_KEY) ??
+ normalizeOptionalString(process.env.GOOGLE_API_KEY)
+ );
+}
+
+function resolveGoogleTtsModelProviderApiKey(cfg?: OpenClawConfig): string | undefined {
+ return normalizeResolvedSecretInputString({
+ value: cfg?.models?.providers?.google?.apiKey,
+ path: "models.providers.google.apiKey",
+ });
+}
+
+function resolveGoogleTtsApiKey(params: {
+ cfg?: OpenClawConfig;
+ providerConfig: SpeechProviderConfig;
+}): string | undefined {
+ return (
+ readGoogleTtsProviderConfig(params.providerConfig).apiKey ??
+ resolveGoogleTtsModelProviderApiKey(params.cfg) ??
+ resolveGoogleTtsEnvApiKey()
+ );
+}
+
+function resolveGoogleTtsBaseUrl(params: {
+ cfg?: OpenClawConfig;
+ providerConfig: GoogleTtsProviderConfig;
+}): string | undefined {
+ return (
+ params.providerConfig.baseUrl ?? trimToUndefined(params.cfg?.models?.providers?.google?.baseUrl)
+ );
+}
+
+function resolveGoogleTtsConfigRecord(
+  rawConfig: Record<string, unknown>,
+): Record<string, unknown> | undefined {
+ const providers = asObject(rawConfig.providers);
+ return asObject(providers?.google) ?? asObject(rawConfig.google);
+}
+
+function normalizeGoogleTtsProviderConfig(
+  rawConfig: Record<string, unknown>,
+): GoogleTtsProviderConfig {
+ const raw = resolveGoogleTtsConfigRecord(rawConfig);
+ return {
+ apiKey: normalizeResolvedSecretInputString({
+ value: raw?.apiKey,
+ path: "messages.tts.providers.google.apiKey",
+ }),
+ baseUrl: trimToUndefined(raw?.baseUrl),
+ model: normalizeGoogleTtsModel(raw?.model),
+ voiceName: normalizeGoogleTtsVoiceName(raw?.voiceName ?? raw?.voice),
+ };
+}
+
+function readGoogleTtsProviderConfig(config: SpeechProviderConfig): GoogleTtsProviderConfig {
+ const normalized = normalizeGoogleTtsProviderConfig({});
+ return {
+ apiKey: trimToUndefined(config.apiKey) ?? normalized.apiKey,
+ baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
+ model: normalizeGoogleTtsModel(config.model ?? normalized.model),
+ voiceName: normalizeGoogleTtsVoiceName(
+ config.voiceName ?? config.voice ?? normalized.voiceName,
+ ),
+ };
+}
+
+function readGoogleTtsOverrides(
+  overrides: Maybe<SpeechProviderOverrides>,
+): GoogleTtsProviderOverrides {
+ if (!overrides) {
+ return {};
+ }
+ return {
+ model: normalizeOptionalString(overrides.model),
+ voiceName: normalizeOptionalString(overrides.voiceName ?? overrides.voice),
+ };
+}
+
+function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
+ handled: boolean;
+ overrides?: SpeechProviderOverrides;
+ warnings?: string[];
+} {
+ switch (ctx.key) {
+ case "voicename":
+ case "voice_name":
+ case "google_voice":
+ case "googlevoice":
+ if (!ctx.policy.allowVoice) {
+ return { handled: true };
+ }
+ return { handled: true, overrides: { voiceName: ctx.value } };
+ case "google_model":
+ case "googlemodel":
+ if (!ctx.policy.allowModelId) {
+ return { handled: true };
+ }
+ return { handled: true, overrides: { model: ctx.value } };
+ default:
+ return { handled: false };
+ }
+}
+
+function extractGoogleSpeechPcm(payload: GoogleGenerateSpeechResponse): Buffer {
+ for (const candidate of payload.candidates ?? []) {
+ for (const part of candidate.content?.parts ?? []) {
+ const inline = part.inlineData ?? part.inline_data;
+ const data = normalizeOptionalString(inline?.data);
+ if (!data) {
+ continue;
+ }
+ return Buffer.from(data, "base64");
+ }
+ }
+ throw new Error("Google TTS response missing audio data");
+}
+
+function wrapPcm16MonoToWav(pcm: Buffer, sampleRate = GOOGLE_TTS_SAMPLE_RATE): Buffer {
+ const byteRate = sampleRate * GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
+ const blockAlign = GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
+ const header = Buffer.alloc(44);
+
+ header.write("RIFF", 0, "ascii");
+ header.writeUInt32LE(36 + pcm.length, 4);
+ header.write("WAVE", 8, "ascii");
+ header.write("fmt ", 12, "ascii");
+ header.writeUInt32LE(16, 16);
+ header.writeUInt16LE(1, 20);
+ header.writeUInt16LE(GOOGLE_TTS_CHANNELS, 22);
+ header.writeUInt32LE(sampleRate, 24);
+ header.writeUInt32LE(byteRate, 28);
+ header.writeUInt16LE(blockAlign, 32);
+ header.writeUInt16LE(GOOGLE_TTS_BITS_PER_SAMPLE, 34);
+ header.write("data", 36, "ascii");
+ header.writeUInt32LE(pcm.length, 40);
+
+ return Buffer.concat([header, pcm]);
+}
+
+async function synthesizeGoogleTtsPcm(params: {
+ text: string;
+ apiKey: string;
+ baseUrl?: string;
+ model: string;
+ voiceName: string;
+ timeoutMs: number;
+}): Promise<Buffer> {
+ const { baseUrl, allowPrivateNetwork, headers, dispatcherPolicy } =
+ resolveGoogleGenerativeAiHttpRequestConfig({
+ apiKey: params.apiKey,
+ baseUrl: params.baseUrl,
+ capability: "audio",
+ transport: "http",
+ });
+
+ const { response: res, release } = await postJsonRequest({
+ url: `${baseUrl}/models/${params.model}:generateContent`,
+ headers,
+ body: {
+ contents: [
+ {
+ role: "user",
+ parts: [{ text: params.text }],
+ },
+ ],
+ generationConfig: {
+ responseModalities: ["AUDIO"],
+ speechConfig: {
+ voiceConfig: {
+ prebuiltVoiceConfig: {
+ voiceName: params.voiceName,
+ },
+ },
+ },
+ },
+ },
+ timeoutMs: params.timeoutMs,
+ fetchFn: fetch,
+ pinDns: false,
+ allowPrivateNetwork,
+ dispatcherPolicy,
+ });
+
+ try {
+ await assertOkOrThrowHttpError(res, "Google TTS failed");
+ return extractGoogleSpeechPcm((await res.json()) as GoogleGenerateSpeechResponse);
+ } finally {
+ await release();
+ }
+}
+
+export function buildGoogleSpeechProvider(): SpeechProviderPlugin {
+ return {
+ id: "google",
+ label: "Google",
+ autoSelectOrder: 50,
+ models: [DEFAULT_GOOGLE_TTS_MODEL],
+ voices: GOOGLE_TTS_VOICES,
+ resolveConfig: ({ rawConfig }) => normalizeGoogleTtsProviderConfig(rawConfig),
+ parseDirectiveToken,
+ resolveTalkConfig: ({ baseTtsConfig, talkProviderConfig }) => {
+ const base = normalizeGoogleTtsProviderConfig(baseTtsConfig);
+ return {
+ ...base,
+ ...(talkProviderConfig.apiKey === undefined
+ ? {}
+ : {
+ apiKey: normalizeResolvedSecretInputString({
+ value: talkProviderConfig.apiKey,
+ path: "talk.providers.google.apiKey",
+ }),
+ }),
+ ...(trimToUndefined(talkProviderConfig.baseUrl) == null
+ ? {}
+ : { baseUrl: trimToUndefined(talkProviderConfig.baseUrl) }),
+ ...(trimToUndefined(talkProviderConfig.modelId) == null
+ ? {}
+ : { model: normalizeGoogleTtsModel(talkProviderConfig.modelId) }),
+ ...(trimToUndefined(talkProviderConfig.voiceId) == null
+ ? {}
+ : { voiceName: normalizeGoogleTtsVoiceName(talkProviderConfig.voiceId) }),
+ };
+ },
+ resolveTalkOverrides: ({ params }) => ({
+ ...(trimToUndefined(params.voiceId) == null
+ ? {}
+ : { voiceName: normalizeGoogleTtsVoiceName(params.voiceId) }),
+ ...(trimToUndefined(params.modelId) == null
+ ? {}
+ : { model: normalizeGoogleTtsModel(params.modelId) }),
+ }),
+ listVoices: async () => GOOGLE_TTS_VOICES.map((voice) => ({ id: voice, name: voice })),
+ isConfigured: ({ cfg, providerConfig }) =>
+ Boolean(resolveGoogleTtsApiKey({ cfg, providerConfig })),
+ synthesize: async (req) => {
+ const config = readGoogleTtsProviderConfig(req.providerConfig);
+ const overrides = readGoogleTtsOverrides(req.providerOverrides);
+ const apiKey = resolveGoogleTtsApiKey({
+ cfg: req.cfg,
+ providerConfig: req.providerConfig,
+ });
+ if (!apiKey) {
+ throw new Error("Google API key missing");
+ }
+ const pcm = await synthesizeGoogleTtsPcm({
+ text: req.text,
+ apiKey,
+ baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
+ model: normalizeGoogleTtsModel(overrides.model ?? config.model),
+ voiceName: normalizeGoogleTtsVoiceName(overrides.voiceName ?? config.voiceName),
+ timeoutMs: req.timeoutMs,
+ });
+ return {
+ audioBuffer: wrapPcm16MonoToWav(pcm),
+ outputFormat: "wav",
+ fileExtension: ".wav",
+ voiceCompatible: false,
+ };
+ },
+ synthesizeTelephony: async (req) => {
+ const config = readGoogleTtsProviderConfig(req.providerConfig);
+ const apiKey = resolveGoogleTtsApiKey({
+ cfg: req.cfg,
+ providerConfig: req.providerConfig,
+ });
+ if (!apiKey) {
+ throw new Error("Google API key missing");
+ }
+ const pcm = await synthesizeGoogleTtsPcm({
+ text: req.text,
+ apiKey,
+ baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
+ model: config.model,
+ voiceName: config.voiceName,
+ timeoutMs: req.timeoutMs,
+ });
+ return {
+ audioBuffer: pcm,
+ outputFormat: "pcm",
+ sampleRate: GOOGLE_TTS_SAMPLE_RATE,
+ };
+ },
+ };
+}
+
+export const __testing = {
+ DEFAULT_GOOGLE_TTS_MODEL,
+ DEFAULT_GOOGLE_TTS_VOICE,
+ GOOGLE_TTS_SAMPLE_RATE,
+ normalizeGoogleTtsModel,
+ wrapPcm16MonoToWav,
+};
diff --git a/extensions/google/test-api.ts b/extensions/google/test-api.ts
index 2ae0aa66c1a..44dfec0dddd 100644
--- a/extensions/google/test-api.ts
+++ b/extensions/google/test-api.ts
@@ -1,5 +1,6 @@
export { buildGoogleGeminiCliBackend } from "./cli-backend.js";
export { buildGoogleImageGenerationProvider } from "./image-generation-provider.js";
export { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
+export { buildGoogleSpeechProvider } from "./speech-provider.js";
export { googleMediaUnderstandingProvider } from "./media-understanding-provider.js";
export { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";
diff --git a/extensions/speech-core/src/tts.ts b/extensions/speech-core/src/tts.ts
index f95fc5cc43c..2e3ec1d6321 100644
--- a/extensions/speech-core/src/tts.ts
+++ b/extensions/speech-core/src/tts.ts
@@ -474,9 +474,11 @@ export function getTtsProvider(config: ResolvedTtsConfig, prefsPath: string): Tt
return normalizeConfiguredSpeechProviderId(config.provider) ?? config.provider;
}
- for (const provider of sortSpeechProvidersForAutoSelection()) {
+ const effectiveCfg = config.sourceConfig;
+ for (const provider of sortSpeechProvidersForAutoSelection(effectiveCfg)) {
if (
provider.isConfigured({
+ cfg: effectiveCfg,
providerConfig: config.providerConfigs[provider.id] ?? {},
timeoutMs: config.timeoutMs,
})
diff --git a/test/helpers/plugins/plugin-registration-contract-cases.ts b/test/helpers/plugins/plugin-registration-contract-cases.ts
index 44e32e5af5a..7f912b6b8ba 100644
--- a/test/helpers/plugins/plugin-registration-contract-cases.ts
+++ b/test/helpers/plugins/plugin-registration-contract-cases.ts
@@ -55,6 +55,7 @@ export const pluginRegistrationContractCases = {
pluginId: "google",
providerIds: ["google", "google-gemini-cli"],
webSearchProviderIds: ["gemini"],
+ speechProviderIds: ["google"],
mediaUnderstandingProviderIds: ["google"],
imageGenerationProviderIds: ["google"],
requireDescribeImages: true,
diff --git a/test/helpers/plugins/tts-contract-suites.ts b/test/helpers/plugins/tts-contract-suites.ts
index 0d959bed618..4ebc8318397 100644
--- a/test/helpers/plugins/tts-contract-suites.ts
+++ b/test/helpers/plugins/tts-contract-suites.ts
@@ -307,7 +307,8 @@ function buildTestMicrosoftSpeechProvider(): SpeechProviderPlugin {
outputFormat: edgeConfig.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3",
};
},
- isConfigured: () => true,
+ isConfigured: ({ providerConfig }) =>
+      (providerConfig as Record<string, unknown> | undefined)?.enabled !== false,
synthesize: async () => ({
audioBuffer: createAudioBuffer(),
outputFormat: "mp3",
@@ -368,6 +369,32 @@ function buildTestElevenLabsSpeechProvider(): SpeechProviderPlugin {
};
}
+function buildTestGoogleSpeechProvider(): SpeechProviderPlugin {
+ return {
+ id: "google",
+ label: "Google",
+ autoSelectOrder: 50,
+ resolveConfig: ({ rawConfig }) => resolveTestProviderConfig(rawConfig, "google"),
+ isConfigured: ({ cfg, providerConfig }) =>
+      typeof (providerConfig as Record<string, unknown> | undefined)?.apiKey === "string" ||
+ typeof cfg?.models?.providers?.google?.apiKey === "string" ||
+ typeof process.env.GEMINI_API_KEY === "string" ||
+ typeof process.env.GOOGLE_API_KEY === "string",
+ synthesize: async () => ({
+ audioBuffer: createAudioBuffer(),
+ outputFormat: "wav",
+ fileExtension: ".wav",
+ voiceCompatible: false,
+ }),
+ synthesizeTelephony: async () => ({
+ audioBuffer: createAudioBuffer(),
+ outputFormat: "pcm",
+ sampleRate: 24_000,
+ }),
+ listVoices: async () => [{ id: "Kore", label: "Kore" }],
+ };
+}
+
+async function loadTtsRuntime(): Promise<typeof import("../../../src/tts/tts.js")> {
ttsRuntimePromise ??= import("../../../src/tts/tts.js");
return await ttsRuntimePromise;
@@ -395,6 +422,7 @@ function setupTestSpeechProviderRegistry() {
{ pluginId: "openai", provider: buildTestOpenAISpeechProvider(), source: "test" },
{ pluginId: "microsoft", provider: buildTestMicrosoftSpeechProvider(), source: "test" },
{ pluginId: "elevenlabs", provider: buildTestElevenLabsSpeechProvider(), source: "test" },
+ { pluginId: "google", provider: buildTestGoogleSpeechProvider(), source: "test" },
];
const { cacheKey } = pluginLoaderTesting.resolvePluginLoadCacheContext({ config: {} });
setActivePluginRegistry(registry, cacheKey);
@@ -613,6 +641,32 @@ export function describeTtsConfigContract() {
expect(provider).toBe(testCase.expected);
});
});
+
+ it("passes cfg into auto-selection so model-provider Google keys can configure TTS", () => {
+ const cfg = asLegacyOpenClawConfig({
+ agents: { defaults: { model: { primary: "openai/gpt-4o-mini" } } },
+ models: {
+ providers: {
+ google: {
+ apiKey: "model-provider-google-key",
+ },
+ },
+ },
+ messages: {
+ tts: {
+ providers: {
+ microsoft: {
+ enabled: false,
+ },
+ },
+ },
+ },
+ });
+ const config = resolveTtsConfig(cfg);
+ const prefsPath = `/tmp/tts-prefs-google-model-provider-${Date.now()}.json`;
+
+ expect(getTtsProvider(config, prefsPath)).toBe("google");
+ });
});
describe("resolveTtsConfig provider normalization", () => {