mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 05:10:44 +00:00
fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)
* Add Google Gemini TTS provider
* Remove committed planning artifact
* Explain Google media provider type shape
* google: distill Gemini TTS provider
* fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)
* fix: honor cfg-backed Google TTS selection (#67515) (thanks @barronlroth)
* fix: narrow Google TTS directive aliases (#67515) (thanks @barronlroth)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
@@ -6,6 +6,8 @@ Docs: https://docs.openclaw.ai

### Changes

- Google/TTS: add Gemini text-to-speech support to the bundled `google` plugin, including provider registration, voice selection, WAV reply output, PCM telephony output, and setup/docs guidance. (#67515) Thanks @barronlroth.

### Fixes

- Gateway/tools: anchor trusted local `MEDIA:` tool-result passthrough on the exact raw name of this run's registered built-in tools, and reject client tool definitions whose names normalize-collide with a built-in or with another client tool in the same request (`400 invalid_request_error` on both JSON and SSE paths), so a client-supplied tool named like a built-in can no longer inherit its local-media trust. (#67303)

@@ -1,6 +1,6 @@
---
title: "Google (Gemini)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, web search)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, TTS, web search)"
read_when:
- You want to use Google Gemini models with OpenClaw
- You need the API key or OAuth auth flow
@@ -9,7 +9,7 @@ read_when:
# Google (Gemini)

The Google plugin provides access to Gemini models through Google AI Studio, plus
image generation, media understanding (image/audio/video), and web search via
image generation, media understanding (image/audio/video), text-to-speech, and web search via
Gemini Grounding.

- Provider: `google`
@@ -133,6 +133,7 @@ Choose your preferred auth method and follow the setup steps.
| Chat completions | Yes |
| Image generation | Yes |
| Music generation | Yes |
| Text-to-speech | Yes |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -233,6 +234,50 @@ To use Google as the default music provider:
See [Music Generation](/tools/music-generation) for shared tool parameters, provider selection, and failover behavior.
</Note>

## Text-to-speech

The bundled `google` speech provider uses the Gemini API TTS path with
`gemini-3.1-flash-tts-preview`.

- Default voice: `Kore`
- Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
- Output: WAV for regular TTS attachments, PCM for Talk/telephony
- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus

To use Google as the default TTS provider:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
        },
      },
    },
  },
}
```

Gemini API TTS accepts expressive square-bracket audio tags in the text, such as
`[whispers]` or `[laughs]`. To keep tags out of the visible chat reply while
sending them to TTS, put them inside a `[[tts:text]]...[[/tts:text]]` block:

```text
Here is the clean reply text.

[[tts:text]][whispers] Here is the spoken version.[[/tts:text]]
```

<Note>
A Google Cloud Console API key restricted to the Gemini API is valid for this
provider. This is not the separate Cloud Text-to-Speech API path.
</Note>

## Advanced configuration

<AccordionGroup>

@@ -9,12 +9,13 @@ title: "Text-to-Speech"

# Text-to-speech (TTS)

OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
It works anywhere OpenClaw can send audio.

## Supported services

- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
@@ -34,9 +35,10 @@ or ElevenLabs.

## Optional keys

If you want OpenAI, ElevenLabs, or MiniMax:
If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:

- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `MINIMAX_API_KEY`
- `OPENAI_API_KEY`

@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```

### Google Gemini primary

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          apiKey: "gemini_api_key",
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
        },
      },
    },
  },
}
```

Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key
restricted to the Gemini API is valid here, and it is the same style of key used
by the bundled Google image-generation provider. Resolution order is
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.

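That first-defined-wins resolution order can be illustrated with a small standalone sketch. This is a hypothetical helper written for this doc, not the plugin's actual implementation; the option names here are illustrative only.

```typescript
// Hypothetical sketch of the documented Google TTS API-key resolution order.
// First defined value wins, mirroring the order listed above.
function resolveGoogleTtsApiKeySketch(opts: {
  ttsProviderKey?: string; // messages.tts.providers.google.apiKey
  modelProviderKey?: string; // models.providers.google.apiKey
  env?: { GEMINI_API_KEY?: string; GOOGLE_API_KEY?: string };
}): string | undefined {
  return (
    opts.ttsProviderKey ??
    opts.modelProviderKey ??
    opts.env?.GEMINI_API_KEY ??
    opts.env?.GOOGLE_API_KEY
  );
}
```

So a TTS-specific key always beats a shared model-provider key, which in turn beats either environment variable.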
### Disable Microsoft speech

```json5
@@ -238,7 +266,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -250,7 +278,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -268,6 +296,10 @@ Then run:
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -302,9 +334,9 @@ Here you go.

Available directive keys (when enabled):

- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
- `pitch` (MiniMax pitch, -12 to 12)
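For example, a reply could select the Google provider and voice per message via `[[tts:key=value]]` directives. The values below are hypothetical, this sketch assumes one key=value pair per directive token, and the corresponding `allowProvider` / `allowVoice` policy flags must be enabled:

```text
Here you go.

[[tts:provider=google]][[tts:google_voice=Puck]]
```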
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).

@@ -5,18 +5,19 @@ import { buildGoogleGeminiCliBackend } from "./cli-backend.js";
import { registerGoogleGeminiCliProvider } from "./gemini-cli-provider.js";
import { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
import { registerGoogleProvider } from "./provider-registration.js";
import { buildGoogleSpeechProvider } from "./speech-provider.js";
import { createGeminiWebSearchProvider } from "./src/gemini-web-search-provider.js";
import { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";

let googleImageGenerationProviderPromise: Promise<ImageGenerationProvider> | null = null;
let googleMediaUnderstandingProviderPromise: Promise<MediaUnderstandingProvider> | null = null;

type GoogleMediaUnderstandingProvider = MediaUnderstandingProvider & {
  describeImage: NonNullable<MediaUnderstandingProvider["describeImage"]>;
  describeImages: NonNullable<MediaUnderstandingProvider["describeImages"]>;
  transcribeAudio: NonNullable<MediaUnderstandingProvider["transcribeAudio"]>;
  describeVideo: NonNullable<MediaUnderstandingProvider["describeVideo"]>;
};
type GoogleMediaUnderstandingProvider = Required<
  Pick<
    MediaUnderstandingProvider,
    "describeImage" | "describeImages" | "transcribeAudio" | "describeVideo"
  >
>;

async function loadGoogleImageGenerationProvider(): Promise<ImageGenerationProvider> {
  if (!googleImageGenerationProviderPromise) {
@@ -113,6 +114,7 @@ export default definePluginEntry({
    api.registerImageGenerationProvider(createLazyGoogleImageGenerationProvider());
    api.registerMediaUnderstandingProvider(createLazyGoogleMediaUnderstandingProvider());
    api.registerMusicGenerationProvider(buildGoogleMusicGenerationProvider());
    api.registerSpeechProvider(buildGoogleSpeechProvider());
    api.registerVideoGenerationProvider(buildGoogleVideoGenerationProvider());
    api.registerWebSearchProvider(createGeminiWebSearchProvider());
  },

@@ -48,6 +48,7 @@
    "mediaUnderstandingProviders": ["google"],
    "imageGenerationProviders": ["google"],
    "musicGenerationProviders": ["google"],
    "speechProviders": ["google"],
    "videoGenerationProviders": ["google"],
    "webSearchProviders": ["gemini"]
  },

@@ -3,6 +3,7 @@ import { describePluginRegistrationContract } from "../../test/helpers/plugins/p

describePluginRegistrationContract({
  ...pluginRegistrationContractCases.google,
  speechProviderIds: ["google"],
  videoGenerationProviderIds: ["google"],
  webSearchProviderIds: ["gemini"],
  requireDescribeImages: true,

extensions/google/speech-provider.test.ts — new file, 248 lines
@@ -0,0 +1,248 @@
import { afterEach, describe, expect, it, vi } from "vitest";
import { buildGoogleSpeechProvider, __testing } from "./speech-provider.js";

function installGoogleTtsFetchMock(pcm = Buffer.from([1, 0, 2, 0])) {
  const fetchMock = vi.fn().mockResolvedValue({
    ok: true,
    json: async () => ({
      candidates: [
        {
          content: {
            parts: [
              {
                inlineData: {
                  mimeType: "audio/L16;codec=pcm;rate=24000",
                  data: pcm.toString("base64"),
                },
              },
            ],
          },
        },
      ],
    }),
  });
  vi.stubGlobal("fetch", fetchMock);
  return fetchMock;
}

describe("Google speech provider", () => {
  afterEach(() => {
    vi.restoreAllMocks();
    vi.unstubAllGlobals();
    vi.unstubAllEnvs();
  });

  it("synthesizes Gemini PCM as WAV and preserves audio tags in the request text", async () => {
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();

    const result = await provider.synthesize({
      text: "[whispers] The door is open.",
      cfg: {},
      providerConfig: {
        apiKey: "google-test-key",
        model: "google/gemini-3.1-flash-tts",
        voiceName: "Puck",
      },
      target: "audio-file",
      timeoutMs: 12_345,
    });

    expect(fetchMock).toHaveBeenCalledWith(
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
      expect.objectContaining({
        method: "POST",
        body: JSON.stringify({
          contents: [
            {
              role: "user",
              parts: [{ text: "[whispers] The door is open." }],
            },
          ],
          generationConfig: {
            responseModalities: ["AUDIO"],
            speechConfig: {
              voiceConfig: {
                prebuiltVoiceConfig: {
                  voiceName: "Puck",
                },
              },
            },
          },
        }),
      }),
    );
    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("google-test-key");
    expect(result.outputFormat).toBe("wav");
    expect(result.fileExtension).toBe(".wav");
    expect(result.voiceCompatible).toBe(false);
    expect(result.audioBuffer.subarray(0, 4).toString("ascii")).toBe("RIFF");
    expect(result.audioBuffer.subarray(8, 12).toString("ascii")).toBe("WAVE");
    expect(result.audioBuffer.readUInt32LE(24)).toBe(__testing.GOOGLE_TTS_SAMPLE_RATE);
    expect(result.audioBuffer.subarray(44)).toEqual(Buffer.from([1, 0, 2, 0]));
  });

  it("falls back to GEMINI_API_KEY and configured Google API base URL", async () => {
    vi.stubEnv("GEMINI_API_KEY", "env-google-key");
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();

    expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 1 })).toBe(true);

    await provider.synthesize({
      text: "Read this plainly.",
      cfg: {
        models: {
          providers: {
            google: {
              baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
              models: [],
            },
          },
        },
      },
      providerConfig: {},
      target: "voice-note",
      timeoutMs: 10_000,
    });

    expect(fetchMock).toHaveBeenCalledWith(
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
      expect.any(Object),
    );
    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("env-google-key");
  });

  it("can reuse a configured Google model-provider API key without auth profiles", async () => {
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();
    const cfg = {
      models: {
        providers: {
          google: {
            apiKey: "model-provider-google-key",
            baseUrl: "https://generativelanguage.googleapis.com",
            models: [],
          },
        },
      },
    };

    expect(provider.isConfigured({ cfg, providerConfig: {}, timeoutMs: 1 })).toBe(true);

    await provider.synthesize({
      text: "Use the configured model provider key.",
      cfg,
      providerConfig: {},
      target: "audio-file",
      timeoutMs: 10_000,
    });

    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("model-provider-google-key");
  });

  it("returns Gemini PCM directly for telephony synthesis", async () => {
    const pcm = Buffer.from([3, 0, 4, 0]);
    installGoogleTtsFetchMock(pcm);
    const provider = buildGoogleSpeechProvider();

    const result = await provider.synthesizeTelephony?.({
      text: "Phone call audio.",
      cfg: {},
      providerConfig: {
        apiKey: "google-test-key",
        voice: "Kore",
      },
      timeoutMs: 5_000,
    });

    expect(result).toEqual({
      audioBuffer: pcm,
      outputFormat: "pcm",
      sampleRate: 24_000,
    });
  });

  it("resolves provider config and directive overrides", () => {
    const provider = buildGoogleSpeechProvider();

    expect(
      provider.resolveConfig?.({
        cfg: {},
        rawConfig: {
          providers: {
            google: {
              apiKey: "configured-key",
              model: "google/gemini-3.1-flash-tts-preview",
              voice: "Leda",
            },
          },
        },
        timeoutMs: 1,
      }),
    ).toEqual({
      apiKey: "configured-key",
      baseUrl: undefined,
      model: "gemini-3.1-flash-tts-preview",
      voiceName: "Leda",
    });

    expect(
      provider.parseDirectiveToken?.({
        key: "google_voice",
        value: "Aoede",
        policy: {
          enabled: true,
          allowText: true,
          allowProvider: true,
          allowVoice: true,
          allowModelId: true,
          allowVoiceSettings: true,
          allowNormalization: true,
          allowSeed: true,
        },
      }),
    ).toEqual({
      handled: true,
      overrides: {
        voiceName: "Aoede",
      },
    });

    expect(
      provider.parseDirectiveToken?.({
        key: "google_model",
        value: "gemini-3.1-flash-tts-preview",
        policy: {
          enabled: true,
          allowText: true,
          allowProvider: true,
          allowVoice: true,
          allowModelId: true,
          allowVoiceSettings: true,
          allowNormalization: true,
          allowSeed: true,
        },
      }),
    ).toEqual({
      handled: true,
      overrides: {
        model: "gemini-3.1-flash-tts-preview",
      },
    });
  });

  it("lists Gemini prebuilt TTS voices", async () => {
    const provider = buildGoogleSpeechProvider();

    await expect(provider.listVoices?.({ providerConfig: {} })).resolves.toEqual(
      expect.arrayContaining([
        { id: "Kore", name: "Kore" },
        { id: "Puck", name: "Puck" },
      ]),
    );
  });
});
extensions/google/speech-provider.ts — new file, 391 lines
@@ -0,0 +1,391 @@
|
||||
import { assertOkOrThrowHttpError, postJsonRequest } from "openclaw/plugin-sdk/provider-http";
|
||||
import type { OpenClawConfig } from "openclaw/plugin-sdk/provider-onboard";
|
||||
import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
|
||||
import type {
|
||||
SpeechDirectiveTokenParseContext,
|
||||
SpeechProviderConfig,
|
||||
SpeechProviderOverrides,
|
||||
SpeechProviderPlugin,
|
||||
} from "openclaw/plugin-sdk/speech-core";
|
||||
import { asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
|
||||
import { normalizeOptionalString } from "openclaw/plugin-sdk/text-runtime";
|
||||
import { resolveGoogleGenerativeAiHttpRequestConfig } from "./api.js";
|
||||
|
||||
const DEFAULT_GOOGLE_TTS_MODEL = "gemini-3.1-flash-tts-preview";
|
||||
const DEFAULT_GOOGLE_TTS_VOICE = "Kore";
|
||||
const GOOGLE_TTS_SAMPLE_RATE = 24_000;
|
||||
const GOOGLE_TTS_CHANNELS = 1;
|
||||
const GOOGLE_TTS_BITS_PER_SAMPLE = 16;
|
||||
|
||||
const GOOGLE_TTS_VOICES = [
|
||||
"Zephyr",
|
||||
"Puck",
|
||||
"Charon",
|
||||
"Kore",
|
||||
"Fenrir",
|
||||
"Leda",
|
||||
"Orus",
|
||||
"Aoede",
|
||||
"Callirrhoe",
|
||||
"Autonoe",
|
||||
"Enceladus",
|
||||
"Iapetus",
|
||||
"Umbriel",
|
||||
"Algieba",
|
||||
"Despina",
|
||||
"Erinome",
|
||||
"Algenib",
|
||||
"Rasalgethi",
|
||||
"Laomedeia",
|
||||
"Achernar",
|
||||
"Alnilam",
|
||||
"Schedar",
|
||||
"Gacrux",
|
||||
"Pulcherrima",
|
||||
"Achird",
|
||||
"Zubenelgenubi",
|
||||
"Vindemiatrix",
|
||||
"Sadachbia",
|
||||
"Sadaltager",
|
||||
"Sulafat",
|
||||
] as const;
|
||||
|
||||
type GoogleTtsProviderConfig = {
|
||||
apiKey?: string;
|
||||
baseUrl?: string;
|
||||
model: string;
|
||||
voiceName: string;
|
||||
};
|
||||
|
||||
type GoogleTtsProviderOverrides = {
|
||||
model?: string;
|
||||
voiceName?: string;
|
||||
};
|
||||
|
||||
type Maybe<T> = T | undefined;
|
||||
|
||||
type GoogleInlineDataPart = {
|
||||
mimeType?: string;
|
||||
mime_type?: string;
|
||||
data?: string;
|
||||
};
|
||||
|
||||
type GoogleGenerateSpeechResponse = {
|
||||
candidates?: Array<{
|
||||
content?: {
|
||||
parts?: Array<{
|
||||
text?: string;
|
||||
inlineData?: GoogleInlineDataPart;
|
||||
inline_data?: GoogleInlineDataPart;
|
||||
}>;
|
||||
};
|
||||
}>;
|
||||
};
|
||||
|
||||
function normalizeGoogleTtsModel(model: unknown): string {
|
||||
const trimmed = normalizeOptionalString(model);
|
||||
if (!trimmed) {
|
||||
return DEFAULT_GOOGLE_TTS_MODEL;
|
||||
}
|
||||
const withoutProvider = trimmed.startsWith("google/") ? trimmed.slice("google/".length) : trimmed;
|
||||
return withoutProvider === "gemini-3.1-flash-tts" ? DEFAULT_GOOGLE_TTS_MODEL : withoutProvider;
|
||||
}
|
||||
|
||||
function normalizeGoogleTtsVoiceName(voiceName: unknown): string {
|
||||
return normalizeOptionalString(voiceName) ?? DEFAULT_GOOGLE_TTS_VOICE;
|
||||
}
|
||||
|
||||
function resolveGoogleTtsEnvApiKey(): string | undefined {
|
||||
return (
|
||||
normalizeOptionalString(process.env.GEMINI_API_KEY) ??
|
||||
normalizeOptionalString(process.env.GOOGLE_API_KEY)
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsModelProviderApiKey(cfg?: OpenClawConfig): string | undefined {
|
||||
return normalizeResolvedSecretInputString({
|
||||
value: cfg?.models?.providers?.google?.apiKey,
|
||||
path: "models.providers.google.apiKey",
|
||||
});
|
||||
}
|
||||
|
||||
function resolveGoogleTtsApiKey(params: {
|
||||
cfg?: OpenClawConfig;
|
||||
providerConfig: SpeechProviderConfig;
|
||||
}): string | undefined {
|
||||
return (
|
||||
readGoogleTtsProviderConfig(params.providerConfig).apiKey ??
|
||||
resolveGoogleTtsModelProviderApiKey(params.cfg) ??
|
||||
resolveGoogleTtsEnvApiKey()
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsBaseUrl(params: {
|
||||
cfg?: OpenClawConfig;
|
||||
providerConfig: GoogleTtsProviderConfig;
|
||||
}): string | undefined {
|
||||
return (
|
||||
params.providerConfig.baseUrl ?? trimToUndefined(params.cfg?.models?.providers?.google?.baseUrl)
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsConfigRecord(
|
||||
rawConfig: Record<string, unknown>,
|
||||
): Record<string, unknown> | undefined {
|
||||
const providers = asObject(rawConfig.providers);
|
||||
return asObject(providers?.google) ?? asObject(rawConfig.google);
|
||||
}
|
||||
|
||||
function normalizeGoogleTtsProviderConfig(
|
||||
rawConfig: Record<string, unknown>,
|
||||
): GoogleTtsProviderConfig {
|
||||
const raw = resolveGoogleTtsConfigRecord(rawConfig);
|
||||
return {
|
||||
apiKey: normalizeResolvedSecretInputString({
|
||||
value: raw?.apiKey,
|
||||
path: "messages.tts.providers.google.apiKey",
|
||||
}),
|
||||
baseUrl: trimToUndefined(raw?.baseUrl),
|
||||
model: normalizeGoogleTtsModel(raw?.model),
|
||||
voiceName: normalizeGoogleTtsVoiceName(raw?.voiceName ?? raw?.voice),
|
||||
};
|
||||
}
|
||||
|
||||
function readGoogleTtsProviderConfig(config: SpeechProviderConfig): GoogleTtsProviderConfig {
|
||||
const normalized = normalizeGoogleTtsProviderConfig({});
|
||||
return {
|
||||
apiKey: trimToUndefined(config.apiKey) ?? normalized.apiKey,
|
||||
baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
|
||||
model: normalizeGoogleTtsModel(config.model ?? normalized.model),
|
||||
voiceName: normalizeGoogleTtsVoiceName(
|
||||
config.voiceName ?? config.voice ?? normalized.voiceName,
|
||||
),
|
||||
};
|
||||
}
|
||||
|
||||
function readGoogleTtsOverrides(
|
||||
overrides: Maybe<SpeechProviderOverrides>,
|
||||
): GoogleTtsProviderOverrides {
|
||||
if (!overrides) {
|
||||
return {};
|
||||
}
|
||||
return {
|
||||
model: normalizeOptionalString(overrides.model),
|
||||
voiceName: normalizeOptionalString(overrides.voiceName ?? overrides.voice),
|
||||
};
|
||||
}
|
||||
|
||||
function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
|
||||
handled: boolean;
|
||||
overrides?: SpeechProviderOverrides;
|
||||
warnings?: string[];
|
||||
} {
|
||||
switch (ctx.key) {
|
||||
case "voicename":
|
||||
case "voice_name":
|
||||
case "google_voice":
|
||||
case "googlevoice":
|
||||
if (!ctx.policy.allowVoice) {
|
||||
return { handled: true };
|
||||
}
|
||||
return { handled: true, overrides: { voiceName: ctx.value } };
|
||||
case "google_model":
|
||||
case "googlemodel":
|
||||
if (!ctx.policy.allowModelId) {
|
||||
return { handled: true };
|
||||
}
|
||||
return { handled: true, overrides: { model: ctx.value } };
|
||||
default:
|
||||
return { handled: false };
|
||||
}
|
||||
}
|
||||
|
||||
function extractGoogleSpeechPcm(payload: GoogleGenerateSpeechResponse): Buffer {
|
||||
for (const candidate of payload.candidates ?? []) {
|
||||
for (const part of candidate.content?.parts ?? []) {
|
||||
const inline = part.inlineData ?? part.inline_data;
|
||||
const data = normalizeOptionalString(inline?.data);
|
||||
if (!data) {
|
||||
continue;
|
||||
}
|
||||
return Buffer.from(data, "base64");
|
||||
}
|
||||
}
|
||||
throw new Error("Google TTS response missing audio data");
|
||||
}
|
||||
|
||||
function wrapPcm16MonoToWav(pcm: Buffer, sampleRate = GOOGLE_TTS_SAMPLE_RATE): Buffer {
|
||||
const byteRate = sampleRate * GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
|
||||
const blockAlign = GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
|
||||
const header = Buffer.alloc(44);
|
||||
|
||||
header.write("RIFF", 0, "ascii");
|
||||
header.writeUInt32LE(36 + pcm.length, 4);
|
||||
header.write("WAVE", 8, "ascii");
|
||||
header.write("fmt ", 12, "ascii");
|
||||
header.writeUInt32LE(16, 16);
|
||||
header.writeUInt16LE(1, 20);
|
||||
header.writeUInt16LE(GOOGLE_TTS_CHANNELS, 22);
|
||||
header.writeUInt32LE(sampleRate, 24);
|
||||
header.writeUInt32LE(byteRate, 28);
|
||||
header.writeUInt16LE(blockAlign, 32);
|
||||
header.writeUInt16LE(GOOGLE_TTS_BITS_PER_SAMPLE, 34);
|
||||
header.write("data", 36, "ascii");
|
||||
header.writeUInt32LE(pcm.length, 40);
|
||||
|
||||
return Buffer.concat([header, pcm]);
|
||||
}
|
||||
|
||||
async function synthesizeGoogleTtsPcm(params: {
|
||||
text: string;
|
||||
apiKey: string;
|
||||
baseUrl?: string;
|
||||
model: string;
|
||||
voiceName: string;
|
||||
timeoutMs: number;
|
||||
}): Promise<Buffer> {
|
||||
const { baseUrl, allowPrivateNetwork, headers, dispatcherPolicy } =
|
||||
resolveGoogleGenerativeAiHttpRequestConfig({
|
||||
apiKey: params.apiKey,
|
||||
baseUrl: params.baseUrl,
|
||||
capability: "audio",
|
||||
transport: "http",
|
||||
});
|
||||
|
||||
const { response: res, release } = await postJsonRequest({
|
||||
url: `${baseUrl}/models/${params.model}:generateContent`,
|
||||
headers,
|
||||
body: {
|
||||
contents: [
|
||||
{
|
||||
role: "user",
|
||||
parts: [{ text: params.text }],
|
||||
},
|
||||
],
|
||||
generationConfig: {
|
||||
responseModalities: ["AUDIO"],
|
||||
speechConfig: {
|
||||
voiceConfig: {
|
||||
prebuiltVoiceConfig: {
|
||||
voiceName: params.voiceName,
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
timeoutMs: params.timeoutMs,
|
||||
fetchFn: fetch,
|
||||
pinDns: false,
|
||||
allowPrivateNetwork,
|
||||
dispatcherPolicy,
|
||||
});
|
||||
|
||||
try {
|
||||
await assertOkOrThrowHttpError(res, "Google TTS failed");
|
||||
return extractGoogleSpeechPcm((await res.json()) as GoogleGenerateSpeechResponse);
|
||||
} finally {
|
||||
await release();
|
||||
}
|
||||
}

export function buildGoogleSpeechProvider(): SpeechProviderPlugin {
  return {
    id: "google",
    label: "Google",
    autoSelectOrder: 50,
    models: [DEFAULT_GOOGLE_TTS_MODEL],
    voices: GOOGLE_TTS_VOICES,
    resolveConfig: ({ rawConfig }) => normalizeGoogleTtsProviderConfig(rawConfig),
    parseDirectiveToken,
    resolveTalkConfig: ({ baseTtsConfig, talkProviderConfig }) => {
      const base = normalizeGoogleTtsProviderConfig(baseTtsConfig);
      return {
        ...base,
        ...(talkProviderConfig.apiKey === undefined
          ? {}
          : {
              apiKey: normalizeResolvedSecretInputString({
                value: talkProviderConfig.apiKey,
                path: "talk.providers.google.apiKey",
              }),
            }),
        ...(trimToUndefined(talkProviderConfig.baseUrl) == null
          ? {}
          : { baseUrl: trimToUndefined(talkProviderConfig.baseUrl) }),
        ...(trimToUndefined(talkProviderConfig.modelId) == null
          ? {}
          : { model: normalizeGoogleTtsModel(talkProviderConfig.modelId) }),
        ...(trimToUndefined(talkProviderConfig.voiceId) == null
          ? {}
          : { voiceName: normalizeGoogleTtsVoiceName(talkProviderConfig.voiceId) }),
      };
    },
    resolveTalkOverrides: ({ params }) => ({
      ...(trimToUndefined(params.voiceId) == null
        ? {}
        : { voiceName: normalizeGoogleTtsVoiceName(params.voiceId) }),
      ...(trimToUndefined(params.modelId) == null
        ? {}
        : { model: normalizeGoogleTtsModel(params.modelId) }),
    }),
    listVoices: async () => GOOGLE_TTS_VOICES.map((voice) => ({ id: voice, name: voice })),
    isConfigured: ({ cfg, providerConfig }) =>
      Boolean(resolveGoogleTtsApiKey({ cfg, providerConfig })),
    synthesize: async (req) => {
      const config = readGoogleTtsProviderConfig(req.providerConfig);
      const overrides = readGoogleTtsOverrides(req.providerOverrides);
      const apiKey = resolveGoogleTtsApiKey({
        cfg: req.cfg,
        providerConfig: req.providerConfig,
      });
      if (!apiKey) {
        throw new Error("Google API key missing");
      }
      const pcm = await synthesizeGoogleTtsPcm({
        text: req.text,
        apiKey,
        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
        model: normalizeGoogleTtsModel(overrides.model ?? config.model),
        voiceName: normalizeGoogleTtsVoiceName(overrides.voiceName ?? config.voiceName),
        timeoutMs: req.timeoutMs,
      });
      return {
        audioBuffer: wrapPcm16MonoToWav(pcm),
        outputFormat: "wav",
        fileExtension: ".wav",
        voiceCompatible: false,
      };
    },
    synthesizeTelephony: async (req) => {
      const config = readGoogleTtsProviderConfig(req.providerConfig);
      const apiKey = resolveGoogleTtsApiKey({
        cfg: req.cfg,
        providerConfig: req.providerConfig,
      });
      if (!apiKey) {
        throw new Error("Google API key missing");
      }
      const pcm = await synthesizeGoogleTtsPcm({
        text: req.text,
        apiKey,
        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
        model: config.model,
        voiceName: config.voiceName,
        timeoutMs: req.timeoutMs,
      });
      return {
        audioBuffer: pcm,
        outputFormat: "pcm",
        sampleRate: GOOGLE_TTS_SAMPLE_RATE,
      };
    },
  };
}
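The conditional-spread pattern in `resolveTalkConfig` and `resolveTalkOverrides` layers settings so that later, more specific values win while unset fields leave earlier layers untouched. A simplified sketch of that layering, with illustrative field names rather than the repo's real config types:

```typescript
interface SketchTtsConfig {
  model: string;
  voiceName: string;
}

// Later spreads win: base config, then talk-level config, then per-call overrides.
// The `== null ? {} : {...}` guards keep unset fields from clobbering earlier layers.
function layerConfigSketch(
  base: SketchTtsConfig,
  talk: Partial<SketchTtsConfig>,
  overrides: Partial<SketchTtsConfig>,
): SketchTtsConfig {
  return {
    ...base,
    ...(talk.model == null ? {} : { model: talk.model }),
    ...(talk.voiceName == null ? {} : { voiceName: talk.voiceName }),
    ...(overrides.model == null ? {} : { model: overrides.model }),
    ...(overrides.voiceName == null ? {} : { voiceName: overrides.voiceName }),
  };
}
```

A plain `{ ...base, ...talk, ...overrides }` would not work here, because an explicitly `undefined` field in a later layer would overwrite a real value from an earlier one.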

export const __testing = {
  DEFAULT_GOOGLE_TTS_MODEL,
  DEFAULT_GOOGLE_TTS_VOICE,
  GOOGLE_TTS_SAMPLE_RATE,
  normalizeGoogleTtsModel,
  wrapPcm16MonoToWav,
};
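`wrapPcm16MonoToWav` is exported via `__testing` because WAV header math is easy to get wrong. A standalone sketch of wrapping 16-bit mono PCM in the standard 44-byte RIFF/WAVE header — not the repo's implementation, but the same technique, with the 24 kHz default mirroring `GOOGLE_TTS_SAMPLE_RATE`:

```typescript
// Minimal RIFF/WAVE header for 16-bit mono PCM; a sketch, not the repo's code.
function wrapPcm16MonoToWavSketch(pcm: Buffer, sampleRate = 24_000): Buffer {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * 2; // 1 channel * 16 bits / 8
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16); // fmt chunk size
  header.writeUInt16LE(1, 20); // audio format: PCM
  header.writeUInt16LE(1, 22); // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32); // block align: channels * bytes per sample
  header.writeUInt16LE(16, 34); // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40); // data chunk size
  return Buffer.concat([header, pcm]);
}
```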

@@ -1,5 +1,6 @@
 export { buildGoogleGeminiCliBackend } from "./cli-backend.js";
 export { buildGoogleImageGenerationProvider } from "./image-generation-provider.js";
 export { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
+export { buildGoogleSpeechProvider } from "./speech-provider.js";
 export { googleMediaUnderstandingProvider } from "./media-understanding-provider.js";
 export { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";

@@ -474,9 +474,11 @@ export function getTtsProvider(config: ResolvedTtsConfig, prefsPath: string): Tt
     return normalizeConfiguredSpeechProviderId(config.provider) ?? config.provider;
   }
 
-  for (const provider of sortSpeechProvidersForAutoSelection()) {
+  const effectiveCfg = config.sourceConfig;
+  for (const provider of sortSpeechProvidersForAutoSelection(effectiveCfg)) {
     if (
       provider.isConfigured({
+        cfg: effectiveCfg,
         providerConfig: config.providerConfigs[provider.id] ?? {},
         timeoutMs: config.timeoutMs,
       })
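This hunk is the core of the cfg-selection fix: auto-selection now threads the resolved config into each provider's `isConfigured`, so a provider like Google that reads a model-provider API key out of `cfg` can be selected. The selection rule itself is just "first configured provider in `autoSelectOrder`". A reduced sketch of that rule, with simplified types:

```typescript
interface SketchProvider {
  id: string;
  autoSelectOrder: number;
  isConfigured: (cfg: Record<string, unknown>) => boolean;
}

// Lower autoSelectOrder wins; the first provider that reports itself
// configured for this cfg is selected.
function autoSelectSketch(
  providers: SketchProvider[],
  cfg: Record<string, unknown>,
): string | undefined {
  return [...providers]
    .sort((a, b) => a.autoSelectOrder - b.autoSelectOrder)
    .find((p) => p.isConfigured(cfg))?.id;
}
```

Without the cfg parameter, a provider whose only credentials live in `cfg` would always report unconfigured and never be auto-selected, which is exactly the bug the test at the end of this diff pins down.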

@@ -55,6 +55,7 @@ export const pluginRegistrationContractCases = {
   pluginId: "google",
   providerIds: ["google", "google-gemini-cli"],
   webSearchProviderIds: ["gemini"],
+  speechProviderIds: ["google"],
   mediaUnderstandingProviderIds: ["google"],
   imageGenerationProviderIds: ["google"],
   requireDescribeImages: true,

@@ -307,7 +307,8 @@ function buildTestMicrosoftSpeechProvider(): SpeechProviderPlugin {
       outputFormat: edgeConfig.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3",
     };
   },
-  isConfigured: () => true,
+  isConfigured: ({ providerConfig }) =>
+    (providerConfig as Record<string, unknown> | undefined)?.enabled !== false,
   synthesize: async () => ({
     audioBuffer: createAudioBuffer(),
     outputFormat: "mp3",

@@ -368,6 +369,32 @@ function buildTestElevenLabsSpeechProvider(): SpeechProviderPlugin {
   };
 }
 
+function buildTestGoogleSpeechProvider(): SpeechProviderPlugin {
+  return {
+    id: "google",
+    label: "Google",
+    autoSelectOrder: 50,
+    resolveConfig: ({ rawConfig }) => resolveTestProviderConfig(rawConfig, "google"),
+    isConfigured: ({ cfg, providerConfig }) =>
+      typeof (providerConfig as Record<string, unknown> | undefined)?.apiKey === "string" ||
+      typeof cfg?.models?.providers?.google?.apiKey === "string" ||
+      typeof process.env.GEMINI_API_KEY === "string" ||
+      typeof process.env.GOOGLE_API_KEY === "string",
+    synthesize: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "wav",
+      fileExtension: ".wav",
+      voiceCompatible: false,
+    }),
+    synthesizeTelephony: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "pcm",
+      sampleRate: 24_000,
+    }),
+    listVoices: async () => [{ id: "Kore", label: "Kore" }],
+  };
+}
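The test double's `isConfigured` mirrors the key-lookup order the real provider uses: explicit TTS provider config first, then the model-provider entry in `cfg`, then the `GEMINI_API_KEY`/`GOOGLE_API_KEY` environment variables. A sketch of that fallback chain — the helper name and option shape here are illustrative, not the repo's `resolveGoogleTtsApiKey` signature:

```typescript
// Hypothetical helper showing the lookup order only; names are illustrative.
function resolveGoogleKeySketch(opts: {
  providerConfig?: { apiKey?: string };
  modelProviderApiKey?: string;
  env?: Record<string, string | undefined>;
}): string | undefined {
  const env = opts.env ?? {};
  return (
    opts.providerConfig?.apiKey ?? // 1. explicit TTS provider config
    opts.modelProviderApiKey ??    // 2. models.providers.google.apiKey
    env.GEMINI_API_KEY ??          // 3. Gemini-specific env var
    env.GOOGLE_API_KEY             // 4. generic Google env var
  );
}
```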
 
 async function loadTtsRuntime(): Promise<TtsRuntimeModule> {
   ttsRuntimePromise ??= import("../../../src/tts/tts.js");
   return await ttsRuntimePromise;

@@ -395,6 +422,7 @@ function setupTestSpeechProviderRegistry() {
     { pluginId: "openai", provider: buildTestOpenAISpeechProvider(), source: "test" },
     { pluginId: "microsoft", provider: buildTestMicrosoftSpeechProvider(), source: "test" },
     { pluginId: "elevenlabs", provider: buildTestElevenLabsSpeechProvider(), source: "test" },
+    { pluginId: "google", provider: buildTestGoogleSpeechProvider(), source: "test" },
   ];
   const { cacheKey } = pluginLoaderTesting.resolvePluginLoadCacheContext({ config: {} });
   setActivePluginRegistry(registry, cacheKey);

@@ -613,6 +641,32 @@ export function describeTtsConfigContract() {
       expect(provider).toBe(testCase.expected);
     });
   });
 
+  it("passes cfg into auto-selection so model-provider Google keys can configure TTS", () => {
+    const cfg = asLegacyOpenClawConfig({
+      agents: { defaults: { model: { primary: "openai/gpt-4o-mini" } } },
+      models: {
+        providers: {
+          google: {
+            apiKey: "model-provider-google-key",
+          },
+        },
+      },
+      messages: {
+        tts: {
+          providers: {
+            microsoft: {
+              enabled: false,
+            },
+          },
+        },
+      },
+    });
+    const config = resolveTtsConfig(cfg);
+    const prefsPath = `/tmp/tts-prefs-google-model-provider-${Date.now()}.json`;
+
+    expect(getTtsProvider(config, prefsPath)).toBe("google");
+  });
 });
 
 describe("resolveTtsConfig provider normalization", () => {