diff --git a/CHANGELOG.md b/CHANGELOG.md
index 702eb3f1c4a..7e4a3ba5318 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,8 @@ Docs: https://docs.openclaw.ai
 
 ### Changes
 
+- Google/TTS: add Gemini text-to-speech support to the bundled `google` plugin, including provider registration, voice selection, WAV reply output, PCM telephony output, and setup/docs guidance. (#67515) Thanks @barronlroth.
+
 ### Fixes
 
 - Gateway/tools: anchor trusted local `MEDIA:` tool-result passthrough on the exact raw name of this run's registered built-in tools, and reject client tool definitions whose names normalize-collide with a built-in or with another client tool in the same request (`400 invalid_request_error` on both JSON and SSE paths), so a client-supplied tool named like a built-in can no longer inherit its local-media trust. (#67303)
diff --git a/docs/providers/google.md b/docs/providers/google.md
index 70ee5d16693..7dabc43a100 100644
--- a/docs/providers/google.md
+++ b/docs/providers/google.md
@@ -1,6 +1,6 @@
 ---
 title: "Google (Gemini)"
-summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, web search)"
+summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, TTS, web search)"
 read_when:
   - You want to use Google Gemini models with OpenClaw
   - You need the API key or OAuth auth flow
@@ -9,7 +9,7 @@ read_when:
 # Google (Gemini)
 
 The Google plugin provides access to Gemini models through Google AI Studio, plus
-image generation, media understanding (image/audio/video), and web search via
+image generation, media understanding (image/audio/video), text-to-speech, and web search via
 Gemini Grounding.
 
 - Provider: `google`
@@ -133,6 +133,7 @@ Choose your preferred auth method and follow the setup steps.
 | Chat completions | Yes |
 | Image generation | Yes |
 | Music generation | Yes |
+| Text-to-speech | Yes |
 | Image understanding | Yes |
 | Audio transcription | Yes |
 | Video understanding | Yes |
@@ -233,6 +234,50 @@ To use Google as the default music provider:
 See [Music Generation](/tools/music-generation) for shared tool parameters, provider
 selection, and failover behavior.
 
+## Text-to-speech
+
+The bundled `google` speech provider uses the Gemini API TTS path with
+`gemini-3.1-flash-tts-preview`.
+
+- Default voice: `Kore`
+- Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
+- Output: WAV for regular TTS attachments, PCM for Talk/telephony
+- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus
+
+To use Google as the default TTS provider:
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "google",
+      providers: {
+        google: {
+          model: "gemini-3.1-flash-tts-preview",
+          voiceName: "Kore",
+        },
+      },
+    },
+  },
+}
+```
+
+Gemini API TTS accepts expressive square-bracket audio tags in the text, such as
+`[whispers]` or `[laughs]`. To keep tags out of the visible chat reply while
+sending them to TTS, put them inside a `[[tts:text]]...[[/tts:text]]` block:
+
+```text
+Here is the clean reply text.
+
+[[tts:text]][whispers] Here is the spoken version.[[/tts:text]]
+```
+
+
+A Google Cloud Console API key restricted to the Gemini API is valid for this
+provider. This is not the separate Cloud Text-to-Speech API path.
+
+
 ## Advanced configuration
diff --git a/docs/tools/tts.md b/docs/tools/tts.md
index 0f4a7075e3f..cdb59116720 100644
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -9,12 +9,13 @@ title: "Text-to-Speech"
 
 # Text-to-speech (TTS)
 
-OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
 It works anywhere OpenClaw can send audio.
 
 ## Supported services
 
 - **ElevenLabs** (primary or fallback provider)
+- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
 - **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
 - **MiniMax** (primary or fallback provider; uses the T2A v2 API)
 - **OpenAI** (primary or fallback provider; also used for summaries)
@@ -34,9 +35,10 @@ or ElevenLabs.
 
 ## Optional keys
 
-If you want OpenAI, ElevenLabs, or MiniMax:
+If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
 
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
+- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
 - `MINIMAX_API_KEY`
 - `OPENAI_API_KEY`
 
@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```
 
+### Google Gemini primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "google",
+      providers: {
+        google: {
+          apiKey: "gemini_api_key",
+          model: "gemini-3.1-flash-tts-preview",
+          voiceName: "Kore",
+        },
+      },
+    },
+  },
+}
+```
+
+Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key
+restricted to the Gemini API is valid here, and it is the same style of key used
+by the bundled Google image-generation provider. Resolution order is
+`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
+`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
+
 ### Disable Microsoft speech
 
 ```json5
@@ -238,7 +266,7 @@ Then run:
 - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` still works and is normalized to `microsoft`.
 - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -250,7 +278,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
   - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -268,6 +296,10 @@ Then run:
 - `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
 - `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
 - `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
+- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
+- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
+- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
+  - If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
 - `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
 - `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
 - `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -302,9 +334,9 @@ Here you go.
 
 Available directive keys (when enabled):
 
-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
-- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
-- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
+- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
+- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
 - `vol` / `volume` (MiniMax volume, 0-10)
 - `pitch` (MiniMax pitch, -12 to 12)
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
   - 44.1kHz / 128kbps is the default balance for speech clarity.
 - **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
+- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
 - **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
   - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
   - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
diff --git a/extensions/google/index.ts b/extensions/google/index.ts
index 697abbe8cb5..3f8b36eea9b 100644
--- a/extensions/google/index.ts
+++ b/extensions/google/index.ts
@@ -5,18 +5,19 @@ import { buildGoogleGeminiCliBackend } from "./cli-backend.js";
 import { registerGoogleGeminiCliProvider } from "./gemini-cli-provider.js";
 import { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
 import { registerGoogleProvider } from "./provider-registration.js";
+import { buildGoogleSpeechProvider } from "./speech-provider.js";
 import { createGeminiWebSearchProvider } from "./src/gemini-web-search-provider.js";
 import { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";
 
 let googleImageGenerationProviderPromise: Promise | null = null;
 let googleMediaUnderstandingProviderPromise: Promise | null = null;
 
-type GoogleMediaUnderstandingProvider = MediaUnderstandingProvider & {
-  describeImage: NonNullable<MediaUnderstandingProvider["describeImage"]>;
-  describeImages: NonNullable<MediaUnderstandingProvider["describeImages"]>;
-  transcribeAudio: NonNullable<MediaUnderstandingProvider["transcribeAudio"]>;
-  describeVideo: NonNullable<MediaUnderstandingProvider["describeVideo"]>;
-};
+type GoogleMediaUnderstandingProvider = Required<
+  Pick<
+    MediaUnderstandingProvider,
+    "describeImage" | "describeImages" | "transcribeAudio" | "describeVideo"
+  >
+>;
 
 async function loadGoogleImageGenerationProvider(): Promise {
   if (!googleImageGenerationProviderPromise) {
@@ -113,6 +114,7 @@ export default definePluginEntry({
     api.registerImageGenerationProvider(createLazyGoogleImageGenerationProvider());
     api.registerMediaUnderstandingProvider(createLazyGoogleMediaUnderstandingProvider());
     api.registerMusicGenerationProvider(buildGoogleMusicGenerationProvider());
+    api.registerSpeechProvider(buildGoogleSpeechProvider());
     api.registerVideoGenerationProvider(buildGoogleVideoGenerationProvider());
     api.registerWebSearchProvider(createGeminiWebSearchProvider());
   },
diff --git a/extensions/google/openclaw.plugin.json b/extensions/google/openclaw.plugin.json
index 7eea69423ef..40f0ad25e4d 100644
--- a/extensions/google/openclaw.plugin.json
+++ b/extensions/google/openclaw.plugin.json
@@ -48,6 +48,7 @@
     "mediaUnderstandingProviders": ["google"],
     "imageGenerationProviders": ["google"],
     "musicGenerationProviders": ["google"],
+    "speechProviders": ["google"],
     "videoGenerationProviders": ["google"],
     "webSearchProviders": ["gemini"]
   },
diff --git a/extensions/google/plugin-registration.contract.test.ts b/extensions/google/plugin-registration.contract.test.ts
index 0b8fbe52bbf..3c18525e463 100644
--- a/extensions/google/plugin-registration.contract.test.ts
+++ b/extensions/google/plugin-registration.contract.test.ts
@@ -3,6 +3,7 @@ import { describePluginRegistrationContract } from "../../test/helpers/plugins/p
 
 describePluginRegistrationContract({
   ...pluginRegistrationContractCases.google,
+  speechProviderIds: ["google"],
  videoGenerationProviderIds: ["google"],
  webSearchProviderIds: ["gemini"],
  requireDescribeImages: true,
diff --git a/extensions/google/speech-provider.test.ts b/extensions/google/speech-provider.test.ts
new file mode 100644
index 00000000000..29ae0d57d9b
--- /dev/null
+++ b/extensions/google/speech-provider.test.ts
@@ -0,0 +1,248 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import { buildGoogleSpeechProvider, __testing } from "./speech-provider.js";
+
+function installGoogleTtsFetchMock(pcm = Buffer.from([1, 0, 2, 0])) {
+  const fetchMock = vi.fn().mockResolvedValue({
+    ok: true,
+    json: async () => ({
+      candidates: [
+        {
+          content: {
+            parts: [
+              {
+                inlineData: {
+                  mimeType: "audio/L16;codec=pcm;rate=24000",
+                  data: pcm.toString("base64"),
+                },
+              },
+            ],
+          },
+        },
+      ],
+    }),
+  });
+  vi.stubGlobal("fetch", fetchMock);
+  return fetchMock;
+}
+
+describe("Google speech provider", () => {
+  afterEach(() => {
+    vi.restoreAllMocks();
+    vi.unstubAllGlobals();
+    vi.unstubAllEnvs();
+  });
+
+  it("synthesizes Gemini PCM as WAV and preserves audio tags in the request text", async () => {
+    const fetchMock = installGoogleTtsFetchMock();
+    const provider = buildGoogleSpeechProvider();
+
+    const result = await provider.synthesize({
+      text: "[whispers] The door is open.",
+      cfg: {},
+      providerConfig: {
+        apiKey: "google-test-key",
+        model: "google/gemini-3.1-flash-tts",
+        voiceName: "Puck",
+      },
+      target: "audio-file",
+      timeoutMs: 12_345,
+    });
+
+    expect(fetchMock).toHaveBeenCalledWith(
+      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
+      expect.objectContaining({
+        method: "POST",
+        body: JSON.stringify({
+          contents: [
+            {
+              role: "user",
+              parts: [{ text: "[whispers] The door is open." }],
+            },
+          ],
+          generationConfig: {
+            responseModalities: ["AUDIO"],
+            speechConfig: {
+              voiceConfig: {
+                prebuiltVoiceConfig: {
+                  voiceName: "Puck",
+                },
+              },
+            },
+          },
+        }),
+      }),
+    );
+    const [, init] = fetchMock.mock.calls[0];
+    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("google-test-key");
+    expect(result.outputFormat).toBe("wav");
+    expect(result.fileExtension).toBe(".wav");
+    expect(result.voiceCompatible).toBe(false);
+    expect(result.audioBuffer.subarray(0, 4).toString("ascii")).toBe("RIFF");
+    expect(result.audioBuffer.subarray(8, 12).toString("ascii")).toBe("WAVE");
+    expect(result.audioBuffer.readUInt32LE(24)).toBe(__testing.GOOGLE_TTS_SAMPLE_RATE);
+    expect(result.audioBuffer.subarray(44)).toEqual(Buffer.from([1, 0, 2, 0]));
+  });
+
+  it("falls back to GEMINI_API_KEY and configured Google API base URL", async () => {
+    vi.stubEnv("GEMINI_API_KEY", "env-google-key");
+    const fetchMock = installGoogleTtsFetchMock();
+    const provider = buildGoogleSpeechProvider();
+
+    expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 1 })).toBe(true);
+
+    await provider.synthesize({
+      text: "Read this plainly.",
+      cfg: {
+        models: {
+          providers: {
+            google: {
+              baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
+              models: [],
+            },
+          },
+        },
+      },
+      providerConfig: {},
+      target: "voice-note",
+      timeoutMs: 10_000,
+    });
+
+    expect(fetchMock).toHaveBeenCalledWith(
+      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
+      expect.any(Object),
+    );
+    const [, init] = fetchMock.mock.calls[0];
+    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("env-google-key");
+  });
+
+  it("can reuse a configured Google model-provider API key without auth profiles", async () => {
+    const fetchMock = installGoogleTtsFetchMock();
+    const provider = buildGoogleSpeechProvider();
+    const cfg = {
+      models: {
+        providers: {
+          google: {
+            apiKey: "model-provider-google-key",
+            baseUrl: "https://generativelanguage.googleapis.com",
+            models: [],
+          },
+        },
+      },
+    };
+
+    expect(provider.isConfigured({ cfg, providerConfig: {}, timeoutMs: 1 })).toBe(true);
+
+    await provider.synthesize({
+      text: "Use the configured model provider key.",
+      cfg,
+      providerConfig: {},
+      target: "audio-file",
+      timeoutMs: 10_000,
+    });
+
+    const [, init] = fetchMock.mock.calls[0];
+    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("model-provider-google-key");
+  });
+
+  it("returns Gemini PCM directly for telephony synthesis", async () => {
+    const pcm = Buffer.from([3, 0, 4, 0]);
+    installGoogleTtsFetchMock(pcm);
+    const provider = buildGoogleSpeechProvider();
+
+    const result = await provider.synthesizeTelephony?.({
+      text: "Phone call audio.",
+      cfg: {},
+      providerConfig: {
+        apiKey: "google-test-key",
+        voice: "Kore",
+      },
+      timeoutMs: 5_000,
+    });
+
+    expect(result).toEqual({
+      audioBuffer: pcm,
+      outputFormat: "pcm",
+      sampleRate: 24_000,
+    });
+  });
+
+  it("resolves provider config and directive overrides", () => {
+    const provider = buildGoogleSpeechProvider();
+
+    expect(
+      provider.resolveConfig?.({
+        cfg: {},
+        rawConfig: {
+          providers: {
+            google: {
+              apiKey: "configured-key",
+              model: "google/gemini-3.1-flash-tts-preview",
+              voice: "Leda",
+            },
+          },
+        },
+        timeoutMs: 1,
+      }),
+    ).toEqual({
+      apiKey: "configured-key",
+      baseUrl: undefined,
+      model: "gemini-3.1-flash-tts-preview",
+      voiceName: "Leda",
+    });
+
+    expect(
+      provider.parseDirectiveToken?.({
+        key: "google_voice",
+        value: "Aoede",
+        policy: {
+          enabled: true,
+          allowText: true,
+          allowProvider: true,
+          allowVoice: true,
+          allowModelId: true,
+          allowVoiceSettings: true,
+          allowNormalization: true,
+          allowSeed: true,
+        },
+      }),
+    ).toEqual({
+      handled: true,
+      overrides: {
+        voiceName: "Aoede",
+      },
+    });
+
+    expect(
+      provider.parseDirectiveToken?.({
+        key: "google_model",
+        value: "gemini-3.1-flash-tts-preview",
+        policy: {
+          enabled: true,
+          allowText: true,
+          allowProvider: true,
+          allowVoice: true,
+          allowModelId: true,
+          allowVoiceSettings: true,
+          allowNormalization: true,
+          allowSeed: true,
+        },
+      }),
+    ).toEqual({
+      handled: true,
+      overrides: {
+        model: "gemini-3.1-flash-tts-preview",
+      },
+    });
+  });
+
+  it("lists Gemini prebuilt TTS voices", async () => {
+    const provider = buildGoogleSpeechProvider();
+
+    await expect(provider.listVoices?.({ providerConfig: {} })).resolves.toEqual(
+      expect.arrayContaining([
+        { id: "Kore", name: "Kore" },
+        { id: "Puck", name: "Puck" },
+      ]),
+    );
+  });
+});
diff --git a/extensions/google/speech-provider.ts b/extensions/google/speech-provider.ts
new file mode 100644
index 00000000000..0c22fb18f95
--- /dev/null
+++ b/extensions/google/speech-provider.ts
@@ -0,0 +1,391 @@
+import { assertOkOrThrowHttpError, postJsonRequest } from "openclaw/plugin-sdk/provider-http";
+import type { OpenClawConfig } from "openclaw/plugin-sdk/provider-onboard";
+import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
+import type {
+  SpeechDirectiveTokenParseContext,
+  SpeechProviderConfig,
+  SpeechProviderOverrides,
+  SpeechProviderPlugin,
+} from "openclaw/plugin-sdk/speech-core";
+import { asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
+import { normalizeOptionalString } from "openclaw/plugin-sdk/text-runtime";
+import { resolveGoogleGenerativeAiHttpRequestConfig } from "./api.js";
+
+const DEFAULT_GOOGLE_TTS_MODEL = "gemini-3.1-flash-tts-preview";
+const DEFAULT_GOOGLE_TTS_VOICE = "Kore";
+const GOOGLE_TTS_SAMPLE_RATE = 24_000;
+const GOOGLE_TTS_CHANNELS = 1;
+const GOOGLE_TTS_BITS_PER_SAMPLE = 16;
+
+const GOOGLE_TTS_VOICES = [
+  "Zephyr",
+  "Puck",
+  "Charon",
+  "Kore",
+  "Fenrir",
+  "Leda",
+  "Orus",
+  "Aoede",
+  "Callirrhoe",
+  "Autonoe",
+  "Enceladus",
+  "Iapetus",
+  "Umbriel",
+  "Algieba",
+  "Despina",
+  "Erinome",
+  "Algenib",
+  "Rasalgethi",
+  "Laomedeia",
+  "Achernar",
+  "Alnilam",
+  "Schedar",
+  "Gacrux",
+  "Pulcherrima",
+  "Achird",
+  "Zubenelgenubi",
+  "Vindemiatrix",
+  "Sadachbia",
+  "Sadaltager",
+  "Sulafat",
+] as const;
+
+type GoogleTtsProviderConfig = {
+  apiKey?: string;
+  baseUrl?: string;
+  model: string;
+  voiceName: string;
+};
+
+type GoogleTtsProviderOverrides = {
+  model?: string;
+  voiceName?: string;
+};
+
+type Maybe<T> = T | undefined;
+
+type GoogleInlineDataPart = {
+  mimeType?: string;
+  mime_type?: string;
+  data?: string;
+};
+
+type GoogleGenerateSpeechResponse = {
+  candidates?: Array<{
+    content?: {
+      parts?: Array<{
+        text?: string;
+        inlineData?: GoogleInlineDataPart;
+        inline_data?: GoogleInlineDataPart;
+      }>;
+    };
+  }>;
+};
+
+function normalizeGoogleTtsModel(model: unknown): string {
+  const trimmed = normalizeOptionalString(model);
+  if (!trimmed) {
+    return DEFAULT_GOOGLE_TTS_MODEL;
+  }
+  const withoutProvider = trimmed.startsWith("google/") ? trimmed.slice("google/".length) : trimmed;
+  return withoutProvider === "gemini-3.1-flash-tts" ? DEFAULT_GOOGLE_TTS_MODEL : withoutProvider;
+}
+
+function normalizeGoogleTtsVoiceName(voiceName: unknown): string {
+  return normalizeOptionalString(voiceName) ?? DEFAULT_GOOGLE_TTS_VOICE;
+}
+
+function resolveGoogleTtsEnvApiKey(): string | undefined {
+  return (
+    normalizeOptionalString(process.env.GEMINI_API_KEY) ??
+    normalizeOptionalString(process.env.GOOGLE_API_KEY)
+  );
+}
+
+function resolveGoogleTtsModelProviderApiKey(cfg?: OpenClawConfig): string | undefined {
+  return normalizeResolvedSecretInputString({
+    value: cfg?.models?.providers?.google?.apiKey,
+    path: "models.providers.google.apiKey",
+  });
+}
+
+function resolveGoogleTtsApiKey(params: {
+  cfg?: OpenClawConfig;
+  providerConfig: SpeechProviderConfig;
+}): string | undefined {
+  return (
+    readGoogleTtsProviderConfig(params.providerConfig).apiKey ??
+    resolveGoogleTtsModelProviderApiKey(params.cfg) ??
+    resolveGoogleTtsEnvApiKey()
+  );
+}
+
+function resolveGoogleTtsBaseUrl(params: {
+  cfg?: OpenClawConfig;
+  providerConfig: GoogleTtsProviderConfig;
+}): string | undefined {
+  return (
+    params.providerConfig.baseUrl ?? trimToUndefined(params.cfg?.models?.providers?.google?.baseUrl)
+  );
+}
+
+function resolveGoogleTtsConfigRecord(
+  rawConfig: Record<string, unknown>,
+): Record<string, unknown> | undefined {
+  const providers = asObject(rawConfig.providers);
+  return asObject(providers?.google) ?? asObject(rawConfig.google);
+}
+
+function normalizeGoogleTtsProviderConfig(
+  rawConfig: Record<string, unknown>,
+): GoogleTtsProviderConfig {
+  const raw = resolveGoogleTtsConfigRecord(rawConfig);
+  return {
+    apiKey: normalizeResolvedSecretInputString({
+      value: raw?.apiKey,
+      path: "messages.tts.providers.google.apiKey",
+    }),
+    baseUrl: trimToUndefined(raw?.baseUrl),
+    model: normalizeGoogleTtsModel(raw?.model),
+    voiceName: normalizeGoogleTtsVoiceName(raw?.voiceName ?? raw?.voice),
+  };
+}
+
+function readGoogleTtsProviderConfig(config: SpeechProviderConfig): GoogleTtsProviderConfig {
+  const normalized = normalizeGoogleTtsProviderConfig({});
+  return {
+    apiKey: trimToUndefined(config.apiKey) ?? normalized.apiKey,
+    baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
+    model: normalizeGoogleTtsModel(config.model ?? normalized.model),
+    voiceName: normalizeGoogleTtsVoiceName(
+      config.voiceName ?? config.voice ?? normalized.voiceName,
+    ),
+  };
+}
+
+function readGoogleTtsOverrides(
+  overrides: Maybe<SpeechProviderOverrides>,
+): GoogleTtsProviderOverrides {
+  if (!overrides) {
+    return {};
+  }
+  return {
+    model: normalizeOptionalString(overrides.model),
+    voiceName: normalizeOptionalString(overrides.voiceName ?? overrides.voice),
+  };
+}
+
+function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
+  handled: boolean;
+  overrides?: SpeechProviderOverrides;
+  warnings?: string[];
+} {
+  switch (ctx.key) {
+    case "voicename":
+    case "voice_name":
+    case "google_voice":
+    case "googlevoice":
+      if (!ctx.policy.allowVoice) {
+        return { handled: true };
+      }
+      return { handled: true, overrides: { voiceName: ctx.value } };
+    case "google_model":
+    case "googlemodel":
+      if (!ctx.policy.allowModelId) {
+        return { handled: true };
+      }
+      return { handled: true, overrides: { model: ctx.value } };
+    default:
+      return { handled: false };
+  }
+}
+
+function extractGoogleSpeechPcm(payload: GoogleGenerateSpeechResponse): Buffer {
+  for (const candidate of payload.candidates ?? []) {
+    for (const part of candidate.content?.parts ?? []) {
+      const inline = part.inlineData ?? part.inline_data;
+      const data = normalizeOptionalString(inline?.data);
+      if (!data) {
+        continue;
+      }
+      return Buffer.from(data, "base64");
+    }
+  }
+  throw new Error("Google TTS response missing audio data");
+}
+
+function wrapPcm16MonoToWav(pcm: Buffer, sampleRate = GOOGLE_TTS_SAMPLE_RATE): Buffer {
+  const byteRate = sampleRate * GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
+  const blockAlign = GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
+  const header = Buffer.alloc(44);
+
+  header.write("RIFF", 0, "ascii");
+  header.writeUInt32LE(36 + pcm.length, 4);
+  header.write("WAVE", 8, "ascii");
+  header.write("fmt ", 12, "ascii");
+  header.writeUInt32LE(16, 16);
+  header.writeUInt16LE(1, 20);
+  header.writeUInt16LE(GOOGLE_TTS_CHANNELS, 22);
+  header.writeUInt32LE(sampleRate, 24);
+  header.writeUInt32LE(byteRate, 28);
+  header.writeUInt16LE(blockAlign, 32);
+  header.writeUInt16LE(GOOGLE_TTS_BITS_PER_SAMPLE, 34);
+  header.write("data", 36, "ascii");
+  header.writeUInt32LE(pcm.length, 40);
+
+  return Buffer.concat([header, pcm]);
+}
+
+async function synthesizeGoogleTtsPcm(params: {
+  text: string;
+  apiKey: string;
+  baseUrl?: string;
+  model: string;
+  voiceName: string;
+  timeoutMs: number;
+}): Promise<Buffer> {
+  const { baseUrl, allowPrivateNetwork, headers, dispatcherPolicy } =
+    resolveGoogleGenerativeAiHttpRequestConfig({
+      apiKey: params.apiKey,
+      baseUrl: params.baseUrl,
+      capability: "audio",
+      transport: "http",
+    });
+
+  const { response: res, release } = await postJsonRequest({
+    url: `${baseUrl}/models/${params.model}:generateContent`,
+    headers,
+    body: {
+      contents: [
+        {
+          role: "user",
+          parts: [{ text: params.text }],
+        },
+      ],
+      generationConfig: {
+        responseModalities: ["AUDIO"],
+        speechConfig: {
+          voiceConfig: {
+            prebuiltVoiceConfig: {
+              voiceName: params.voiceName,
+            },
+          },
+        },
+      },
+    },
+    timeoutMs: params.timeoutMs,
+    fetchFn: fetch,
+    pinDns: false,
+    allowPrivateNetwork,
+    dispatcherPolicy,
+  });
+
+  try {
+    await assertOkOrThrowHttpError(res, "Google TTS failed");
+    return extractGoogleSpeechPcm((await res.json()) as GoogleGenerateSpeechResponse);
+  } finally {
+    await release();
+  }
+}
+
+export function buildGoogleSpeechProvider(): SpeechProviderPlugin {
+  return {
+    id: "google",
+    label: "Google",
+    autoSelectOrder: 50,
+    models: [DEFAULT_GOOGLE_TTS_MODEL],
+    voices: GOOGLE_TTS_VOICES,
+    resolveConfig: ({ rawConfig }) => normalizeGoogleTtsProviderConfig(rawConfig),
+    parseDirectiveToken,
+    resolveTalkConfig: ({ baseTtsConfig, talkProviderConfig }) => {
+      const base = normalizeGoogleTtsProviderConfig(baseTtsConfig);
+      return {
+        ...base,
+        ...(talkProviderConfig.apiKey === undefined
+          ? {}
+          : {
+              apiKey: normalizeResolvedSecretInputString({
+                value: talkProviderConfig.apiKey,
+                path: "talk.providers.google.apiKey",
+              }),
+            }),
+        ...(trimToUndefined(talkProviderConfig.baseUrl) == null
+          ? {}
+          : { baseUrl: trimToUndefined(talkProviderConfig.baseUrl) }),
+        ...(trimToUndefined(talkProviderConfig.modelId) == null
+          ? {}
+          : { model: normalizeGoogleTtsModel(talkProviderConfig.modelId) }),
+        ...(trimToUndefined(talkProviderConfig.voiceId) == null
+          ? {}
+          : { voiceName: normalizeGoogleTtsVoiceName(talkProviderConfig.voiceId) }),
+      };
+    },
+    resolveTalkOverrides: ({ params }) => ({
+      ...(trimToUndefined(params.voiceId) == null
+        ? {}
+        : { voiceName: normalizeGoogleTtsVoiceName(params.voiceId) }),
+      ...(trimToUndefined(params.modelId) == null
+        ? {}
+        : { model: normalizeGoogleTtsModel(params.modelId) }),
+    }),
+    listVoices: async () => GOOGLE_TTS_VOICES.map((voice) => ({ id: voice, name: voice })),
+    isConfigured: ({ cfg, providerConfig }) =>
+      Boolean(resolveGoogleTtsApiKey({ cfg, providerConfig })),
+    synthesize: async (req) => {
+      const config = readGoogleTtsProviderConfig(req.providerConfig);
+      const overrides = readGoogleTtsOverrides(req.providerOverrides);
+      const apiKey = resolveGoogleTtsApiKey({
+        cfg: req.cfg,
+        providerConfig: req.providerConfig,
+      });
+      if (!apiKey) {
+        throw new Error("Google API key missing");
+      }
+      const pcm = await synthesizeGoogleTtsPcm({
+        text: req.text,
+        apiKey,
+        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
+        model: normalizeGoogleTtsModel(overrides.model ?? config.model),
+        voiceName: normalizeGoogleTtsVoiceName(overrides.voiceName ?? config.voiceName),
+        timeoutMs: req.timeoutMs,
+      });
+      return {
+        audioBuffer: wrapPcm16MonoToWav(pcm),
+        outputFormat: "wav",
+        fileExtension: ".wav",
+        voiceCompatible: false,
+      };
+    },
+    synthesizeTelephony: async (req) => {
+      const config = readGoogleTtsProviderConfig(req.providerConfig);
+      const apiKey = resolveGoogleTtsApiKey({
+        cfg: req.cfg,
+        providerConfig: req.providerConfig,
+      });
+      if (!apiKey) {
+        throw new Error("Google API key missing");
+      }
+      const pcm = await synthesizeGoogleTtsPcm({
+        text: req.text,
+        apiKey,
+        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
+        model: config.model,
+        voiceName: config.voiceName,
+        timeoutMs: req.timeoutMs,
+      });
+      return {
+        audioBuffer: pcm,
+        outputFormat: "pcm",
+        sampleRate: GOOGLE_TTS_SAMPLE_RATE,
+      };
+    },
+  };
+}
+
+export const __testing = {
+  DEFAULT_GOOGLE_TTS_MODEL,
+  DEFAULT_GOOGLE_TTS_VOICE,
+  GOOGLE_TTS_SAMPLE_RATE,
+  normalizeGoogleTtsModel,
+  wrapPcm16MonoToWav,
+};
diff --git a/extensions/google/test-api.ts b/extensions/google/test-api.ts
index 2ae0aa66c1a..44dfec0dddd 100644
--- a/extensions/google/test-api.ts
+++ b/extensions/google/test-api.ts
@@ -1,5 +1,6 @@
 export { buildGoogleGeminiCliBackend } from "./cli-backend.js";
 export { buildGoogleImageGenerationProvider } from "./image-generation-provider.js";
 export { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
+export { buildGoogleSpeechProvider } from "./speech-provider.js";
 export { googleMediaUnderstandingProvider } from "./media-understanding-provider.js";
 export { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";
diff --git a/extensions/speech-core/src/tts.ts b/extensions/speech-core/src/tts.ts
index f95fc5cc43c..2e3ec1d6321 100644
--- a/extensions/speech-core/src/tts.ts
+++ b/extensions/speech-core/src/tts.ts
@@ -474,9 +474,11 @@ export function getTtsProvider(config: ResolvedTtsConfig, prefsPath: string): Tt
     return normalizeConfiguredSpeechProviderId(config.provider) ?? config.provider;
   }
 
-  for (const provider of sortSpeechProvidersForAutoSelection()) {
+  const effectiveCfg = config.sourceConfig;
+  for (const provider of sortSpeechProvidersForAutoSelection(effectiveCfg)) {
     if (
       provider.isConfigured({
+        cfg: effectiveCfg,
        providerConfig: config.providerConfigs[provider.id] ?? {},
        timeoutMs: config.timeoutMs,
      })
diff --git a/test/helpers/plugins/plugin-registration-contract-cases.ts b/test/helpers/plugins/plugin-registration-contract-cases.ts
index 44e32e5af5a..7f912b6b8ba 100644
--- a/test/helpers/plugins/plugin-registration-contract-cases.ts
+++ b/test/helpers/plugins/plugin-registration-contract-cases.ts
@@ -55,6 +55,7 @@ export const pluginRegistrationContractCases = {
     pluginId: "google",
     providerIds: ["google", "google-gemini-cli"],
     webSearchProviderIds: ["gemini"],
+    speechProviderIds: ["google"],
     mediaUnderstandingProviderIds: ["google"],
     imageGenerationProviderIds: ["google"],
     requireDescribeImages: true,
diff --git a/test/helpers/plugins/tts-contract-suites.ts b/test/helpers/plugins/tts-contract-suites.ts
index 0d959bed618..4ebc8318397 100644
--- a/test/helpers/plugins/tts-contract-suites.ts
+++ b/test/helpers/plugins/tts-contract-suites.ts
@@ -307,7 +307,8 @@ function buildTestMicrosoftSpeechProvider(): SpeechProviderPlugin {
        outputFormat: edgeConfig.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3",
      };
    },
-    isConfigured: () => true,
+    isConfigured: ({ providerConfig }) =>
+      (providerConfig as Record<string, unknown> | undefined)?.enabled !== false,
    synthesize: async () => ({
      audioBuffer: createAudioBuffer(),
      outputFormat: "mp3",
@@ -368,6 +369,32 @@ function buildTestElevenLabsSpeechProvider(): SpeechProviderPlugin {
   };
 }
 
+function buildTestGoogleSpeechProvider(): SpeechProviderPlugin {
+  return {
+    id: "google",
+    label: "Google",
+    autoSelectOrder: 50,
+    resolveConfig: ({ rawConfig }) => resolveTestProviderConfig(rawConfig, "google"),
+    isConfigured: ({ cfg, providerConfig }) =>
+      typeof (providerConfig as Record<string, unknown> | undefined)?.apiKey === "string" ||
+      typeof cfg?.models?.providers?.google?.apiKey === "string" ||
+      typeof process.env.GEMINI_API_KEY === "string" ||
+      typeof process.env.GOOGLE_API_KEY === "string",
+    synthesize: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "wav",
+      fileExtension: ".wav",
+      voiceCompatible: false,
+    }),
+    synthesizeTelephony: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "pcm",
+      sampleRate: 24_000,
+    }),
+    listVoices: async () => [{ id: "Kore", label: "Kore" }],
+  };
+}
+
 async function loadTtsRuntime(): Promise {
   ttsRuntimePromise ??= import("../../../src/tts/tts.js");
   return await ttsRuntimePromise;
 }
@@ -395,6 +422,7 @@ function setupTestSpeechProviderRegistry() {
     { pluginId: "openai", provider: buildTestOpenAISpeechProvider(), source: "test" },
     { pluginId: "microsoft", provider: buildTestMicrosoftSpeechProvider(), source: "test" },
     { pluginId: "elevenlabs", provider: buildTestElevenLabsSpeechProvider(), source: "test" },
+    { pluginId: "google", provider: buildTestGoogleSpeechProvider(), source: "test" },
   ];
   const { cacheKey } = pluginLoaderTesting.resolvePluginLoadCacheContext({ config: {} });
   setActivePluginRegistry(registry, cacheKey);
@@ -613,6 +641,32 @@ export function describeTtsConfigContract() {
       expect(provider).toBe(testCase.expected);
     });
   });
+
+  it("passes cfg into auto-selection so model-provider Google keys can configure TTS", () => {
+    const cfg = asLegacyOpenClawConfig({
+      agents: { defaults: { model: { primary: "openai/gpt-4o-mini" } } },
+      models: {
+        providers: {
+          google: {
+            apiKey: "model-provider-google-key",
+          },
+        },
+      },
+      messages: {
+        tts: {
+          providers: {
+            microsoft: {
+              enabled: false,
+            },
+          },
+        },
+      },
+    });
+    const config = resolveTtsConfig(cfg);
+    const prefsPath = `/tmp/tts-prefs-google-model-provider-${Date.now()}.json`;
+
+    expect(getTtsProvider(config, prefsPath)).toBe("google");
+  });
 });
 
 describe("resolveTtsConfig provider normalization", () => {