fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)

* Add Google Gemini TTS provider

* Remove committed planning artifact

* Explain Google media provider type shape

* google: distill Gemini TTS provider

* fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)

* fix: honor cfg-backed Google TTS selection (#67515) (thanks @barronlroth)

* fix: narrow Google TTS directive aliases (#67515) (thanks @barronlroth)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
This commit is contained in:
Barron Roth
2026-04-15 23:24:35 -07:00
committed by GitHub
parent b10ae0bf13
commit bf59917cd1
12 changed files with 798 additions and 17 deletions


@@ -6,6 +6,8 @@ Docs: https://docs.openclaw.ai
### Changes
- Google/TTS: add Gemini text-to-speech support to the bundled `google` plugin, including provider registration, voice selection, WAV reply output, PCM telephony output, and setup/docs guidance. (#67515) Thanks @barronlroth.
### Fixes
- Gateway/tools: anchor trusted local `MEDIA:` tool-result passthrough on the exact raw name of this run's registered built-in tools, and reject client tool definitions whose names normalize-collide with a built-in or with another client tool in the same request (`400 invalid_request_error` on both JSON and SSE paths), so a client-supplied tool named like a built-in can no longer inherit its local-media trust. (#67303)


@@ -1,6 +1,6 @@
---
title: "Google (Gemini)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, web search)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, TTS, web search)"
read_when:
- You want to use Google Gemini models with OpenClaw
- You need the API key or OAuth auth flow
@@ -9,7 +9,7 @@ read_when:
# Google (Gemini)
The Google plugin provides access to Gemini models through Google AI Studio, plus
image generation, media understanding (image/audio/video), and web search via
image generation, media understanding (image/audio/video), text-to-speech, and web search via
Gemini Grounding.
- Provider: `google`
@@ -133,6 +133,7 @@ Choose your preferred auth method and follow the setup steps.
| Chat completions | Yes |
| Image generation | Yes |
| Music generation | Yes |
| Text-to-speech | Yes |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -233,6 +234,50 @@ To use Google as the default music provider:
See [Music Generation](/tools/music-generation) for shared tool parameters, provider selection, and failover behavior.
</Note>
## Text-to-speech
The bundled `google` speech provider uses the Gemini API TTS path with
`gemini-3.1-flash-tts-preview`.
- Default voice: `Kore`
- Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
- Output: WAV for regular TTS attachments, PCM for Talk/telephony
- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus
To use Google as the default TTS provider:
```json5
{
messages: {
tts: {
auto: "always",
provider: "google",
providers: {
google: {
model: "gemini-3.1-flash-tts-preview",
voiceName: "Kore",
},
},
},
},
}
```
Gemini API TTS accepts expressive square-bracket audio tags in the text, such as
`[whispers]` or `[laughs]`. To keep tags out of the visible chat reply while
sending them to TTS, put them inside a `[[tts:text]]...[[/tts:text]]` block:
```text
Here is the clean reply text.
[[tts:text]][whispers] Here is the spoken version.[[/tts:text]]
```
<Note>
A Google Cloud Console API key restricted to the Gemini API is valid for this
provider, which calls the Gemini API rather than the separate Google Cloud
Text-to-Speech API.
</Note>
## Advanced configuration
<AccordionGroup>


@@ -9,12 +9,13 @@ title: "Text-to-Speech"
# Text-to-speech (TTS)
OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
It works anywhere OpenClaw can send audio.
## Supported services
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
@@ -34,9 +35,10 @@ or ElevenLabs.
## Optional keys
If you want OpenAI, ElevenLabs, or MiniMax:
If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `MINIMAX_API_KEY`
- `OPENAI_API_KEY`
@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```
### Google Gemini primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "google",
providers: {
google: {
apiKey: "gemini_api_key",
model: "gemini-3.1-flash-tts-preview",
voiceName: "Kore",
},
},
},
},
}
```
Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key
restricted to the Gemini API is valid here, and it is the same style of key used
by the bundled Google image-generation provider. Resolution order is
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
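The resolution order above can be sketched as a nullish-coalescing chain. This is an illustrative standalone helper with hypothetical names, not the plugin's actual export:

```typescript
// Candidate key sources, in the documented priority order.
type GoogleTtsKeySources = {
  ttsProviderApiKey?: string; // messages.tts.providers.google.apiKey
  modelProviderApiKey?: string; // models.providers.google.apiKey
  env?: { GEMINI_API_KEY?: string; GOOGLE_API_KEY?: string };
};

// First defined source wins; returns undefined if no key is configured.
function resolveGoogleTtsKey(src: GoogleTtsKeySources): string | undefined {
  return (
    src.ttsProviderApiKey ??
    src.modelProviderApiKey ??
    src.env?.GEMINI_API_KEY ??
    src.env?.GOOGLE_API_KEY
  );
}
```

A TTS-specific key always beats the shared model-provider key, which in turn beats either environment variable.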
### Disable Microsoft speech
```json5
@@ -238,7 +266,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -250,7 +278,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -268,6 +296,10 @@ Then run:
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -302,9 +334,9 @@ Here you go.
Available directive keys (when enabled):
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
- `pitch` (MiniMax pitch, -12 to 12)
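The Google-specific directive keys above can be illustrated with a small parser sketch. This is a hypothetical standalone version of the provider's token handling (the real implementation also consults the directive policy before applying overrides):

```typescript
type GoogleDirectiveResult = {
  handled: boolean;
  overrides?: { voiceName?: string; model?: string };
};

// Map a normalized directive key/value pair to Google TTS overrides.
function parseGoogleDirective(key: string, value: string): GoogleDirectiveResult {
  switch (key) {
    case "voicename":
    case "voice_name":
    case "google_voice":
    case "googlevoice":
      return { handled: true, overrides: { voiceName: value } };
    case "google_model":
    case "googlemodel":
      return { handled: true, overrides: { model: value } };
    default:
      // Unrecognized keys are left for other providers to claim.
      return { handled: false };
  }
}
```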
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
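The Google PCM-to-WAV wrapping described above can be sketched as prepending a standard 44-byte RIFF header to the raw 16-bit mono 24kHz PCM. This is an illustrative sketch, not the plugin's exported helper:

```typescript
// Wrap raw 16-bit little-endian mono PCM in a minimal WAV container.
function pcm16MonoToWav(pcm: Uint8Array, sampleRate = 24_000): Uint8Array {
  const channels = 1;
  const bytesPerSample = 2; // 16-bit samples
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // chunk size = file size - 8
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  ascii(36, "data");
  view.setUint32(40, pcm.length, true);
  out.set(pcm, 44);
  return out;
}
```

For Talk/telephony the header step is skipped and the raw PCM is returned with its sample rate.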


@@ -5,18 +5,19 @@ import { buildGoogleGeminiCliBackend } from "./cli-backend.js";
import { registerGoogleGeminiCliProvider } from "./gemini-cli-provider.js";
import { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
import { registerGoogleProvider } from "./provider-registration.js";
import { buildGoogleSpeechProvider } from "./speech-provider.js";
import { createGeminiWebSearchProvider } from "./src/gemini-web-search-provider.js";
import { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";
let googleImageGenerationProviderPromise: Promise<ImageGenerationProvider> | null = null;
let googleMediaUnderstandingProviderPromise: Promise<MediaUnderstandingProvider> | null = null;
type GoogleMediaUnderstandingProvider = MediaUnderstandingProvider & {
describeImage: NonNullable<MediaUnderstandingProvider["describeImage"]>;
describeImages: NonNullable<MediaUnderstandingProvider["describeImages"]>;
transcribeAudio: NonNullable<MediaUnderstandingProvider["transcribeAudio"]>;
describeVideo: NonNullable<MediaUnderstandingProvider["describeVideo"]>;
};
type GoogleMediaUnderstandingProvider = Required<
Pick<
MediaUnderstandingProvider,
"describeImage" | "describeImages" | "transcribeAudio" | "describeVideo"
>
>;
async function loadGoogleImageGenerationProvider(): Promise<ImageGenerationProvider> {
if (!googleImageGenerationProviderPromise) {
@@ -113,6 +114,7 @@ export default definePluginEntry({
api.registerImageGenerationProvider(createLazyGoogleImageGenerationProvider());
api.registerMediaUnderstandingProvider(createLazyGoogleMediaUnderstandingProvider());
api.registerMusicGenerationProvider(buildGoogleMusicGenerationProvider());
api.registerSpeechProvider(buildGoogleSpeechProvider());
api.registerVideoGenerationProvider(buildGoogleVideoGenerationProvider());
api.registerWebSearchProvider(createGeminiWebSearchProvider());
},


@@ -48,6 +48,7 @@
"mediaUnderstandingProviders": ["google"],
"imageGenerationProviders": ["google"],
"musicGenerationProviders": ["google"],
"speechProviders": ["google"],
"videoGenerationProviders": ["google"],
"webSearchProviders": ["gemini"]
},


@@ -3,6 +3,7 @@ import { describePluginRegistrationContract } from "../../test/helpers/plugins/p
describePluginRegistrationContract({
...pluginRegistrationContractCases.google,
speechProviderIds: ["google"],
videoGenerationProviderIds: ["google"],
webSearchProviderIds: ["gemini"],
requireDescribeImages: true,


@@ -0,0 +1,248 @@
import { afterEach, describe, expect, it, vi } from "vitest";
import { buildGoogleSpeechProvider, __testing } from "./speech-provider.js";
function installGoogleTtsFetchMock(pcm = Buffer.from([1, 0, 2, 0])) {
const fetchMock = vi.fn().mockResolvedValue({
ok: true,
json: async () => ({
candidates: [
{
content: {
parts: [
{
inlineData: {
mimeType: "audio/L16;codec=pcm;rate=24000",
data: pcm.toString("base64"),
},
},
],
},
},
],
}),
});
vi.stubGlobal("fetch", fetchMock);
return fetchMock;
}
describe("Google speech provider", () => {
afterEach(() => {
vi.restoreAllMocks();
vi.unstubAllGlobals();
vi.unstubAllEnvs();
});
it("synthesizes Gemini PCM as WAV and preserves audio tags in the request text", async () => {
const fetchMock = installGoogleTtsFetchMock();
const provider = buildGoogleSpeechProvider();
const result = await provider.synthesize({
text: "[whispers] The door is open.",
cfg: {},
providerConfig: {
apiKey: "google-test-key",
model: "google/gemini-3.1-flash-tts",
voiceName: "Puck",
},
target: "audio-file",
timeoutMs: 12_345,
});
expect(fetchMock).toHaveBeenCalledWith(
"https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
expect.objectContaining({
method: "POST",
body: JSON.stringify({
contents: [
{
role: "user",
parts: [{ text: "[whispers] The door is open." }],
},
],
generationConfig: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: {
voiceName: "Puck",
},
},
},
},
}),
}),
);
const [, init] = fetchMock.mock.calls[0];
expect(new Headers(init.headers).get("x-goog-api-key")).toBe("google-test-key");
expect(result.outputFormat).toBe("wav");
expect(result.fileExtension).toBe(".wav");
expect(result.voiceCompatible).toBe(false);
expect(result.audioBuffer.subarray(0, 4).toString("ascii")).toBe("RIFF");
expect(result.audioBuffer.subarray(8, 12).toString("ascii")).toBe("WAVE");
expect(result.audioBuffer.readUInt32LE(24)).toBe(__testing.GOOGLE_TTS_SAMPLE_RATE);
expect(result.audioBuffer.subarray(44)).toEqual(Buffer.from([1, 0, 2, 0]));
});
it("falls back to GEMINI_API_KEY and configured Google API base URL", async () => {
vi.stubEnv("GEMINI_API_KEY", "env-google-key");
const fetchMock = installGoogleTtsFetchMock();
const provider = buildGoogleSpeechProvider();
expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 1 })).toBe(true);
await provider.synthesize({
text: "Read this plainly.",
cfg: {
models: {
providers: {
google: {
baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
models: [],
},
},
},
},
providerConfig: {},
target: "voice-note",
timeoutMs: 10_000,
});
expect(fetchMock).toHaveBeenCalledWith(
"https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
expect.any(Object),
);
const [, init] = fetchMock.mock.calls[0];
expect(new Headers(init.headers).get("x-goog-api-key")).toBe("env-google-key");
});
it("can reuse a configured Google model-provider API key without auth profiles", async () => {
const fetchMock = installGoogleTtsFetchMock();
const provider = buildGoogleSpeechProvider();
const cfg = {
models: {
providers: {
google: {
apiKey: "model-provider-google-key",
baseUrl: "https://generativelanguage.googleapis.com",
models: [],
},
},
},
};
expect(provider.isConfigured({ cfg, providerConfig: {}, timeoutMs: 1 })).toBe(true);
await provider.synthesize({
text: "Use the configured model provider key.",
cfg,
providerConfig: {},
target: "audio-file",
timeoutMs: 10_000,
});
const [, init] = fetchMock.mock.calls[0];
expect(new Headers(init.headers).get("x-goog-api-key")).toBe("model-provider-google-key");
});
it("returns Gemini PCM directly for telephony synthesis", async () => {
const pcm = Buffer.from([3, 0, 4, 0]);
installGoogleTtsFetchMock(pcm);
const provider = buildGoogleSpeechProvider();
const result = await provider.synthesizeTelephony?.({
text: "Phone call audio.",
cfg: {},
providerConfig: {
apiKey: "google-test-key",
voice: "Kore",
},
timeoutMs: 5_000,
});
expect(result).toEqual({
audioBuffer: pcm,
outputFormat: "pcm",
sampleRate: 24_000,
});
});
it("resolves provider config and directive overrides", () => {
const provider = buildGoogleSpeechProvider();
expect(
provider.resolveConfig?.({
cfg: {},
rawConfig: {
providers: {
google: {
apiKey: "configured-key",
model: "google/gemini-3.1-flash-tts-preview",
voice: "Leda",
},
},
},
timeoutMs: 1,
}),
).toEqual({
apiKey: "configured-key",
baseUrl: undefined,
model: "gemini-3.1-flash-tts-preview",
voiceName: "Leda",
});
expect(
provider.parseDirectiveToken?.({
key: "google_voice",
value: "Aoede",
policy: {
enabled: true,
allowText: true,
allowProvider: true,
allowVoice: true,
allowModelId: true,
allowVoiceSettings: true,
allowNormalization: true,
allowSeed: true,
},
}),
).toEqual({
handled: true,
overrides: {
voiceName: "Aoede",
},
});
expect(
provider.parseDirectiveToken?.({
key: "google_model",
value: "gemini-3.1-flash-tts-preview",
policy: {
enabled: true,
allowText: true,
allowProvider: true,
allowVoice: true,
allowModelId: true,
allowVoiceSettings: true,
allowNormalization: true,
allowSeed: true,
},
}),
).toEqual({
handled: true,
overrides: {
model: "gemini-3.1-flash-tts-preview",
},
});
});
it("lists Gemini prebuilt TTS voices", async () => {
const provider = buildGoogleSpeechProvider();
await expect(provider.listVoices?.({ providerConfig: {} })).resolves.toEqual(
expect.arrayContaining([
{ id: "Kore", name: "Kore" },
{ id: "Puck", name: "Puck" },
]),
);
});
});


@@ -0,0 +1,391 @@
import { assertOkOrThrowHttpError, postJsonRequest } from "openclaw/plugin-sdk/provider-http";
import type { OpenClawConfig } from "openclaw/plugin-sdk/provider-onboard";
import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
import type {
SpeechDirectiveTokenParseContext,
SpeechProviderConfig,
SpeechProviderOverrides,
SpeechProviderPlugin,
} from "openclaw/plugin-sdk/speech-core";
import { asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
import { normalizeOptionalString } from "openclaw/plugin-sdk/text-runtime";
import { resolveGoogleGenerativeAiHttpRequestConfig } from "./api.js";
const DEFAULT_GOOGLE_TTS_MODEL = "gemini-3.1-flash-tts-preview";
const DEFAULT_GOOGLE_TTS_VOICE = "Kore";
const GOOGLE_TTS_SAMPLE_RATE = 24_000;
const GOOGLE_TTS_CHANNELS = 1;
const GOOGLE_TTS_BITS_PER_SAMPLE = 16;
const GOOGLE_TTS_VOICES = [
"Zephyr",
"Puck",
"Charon",
"Kore",
"Fenrir",
"Leda",
"Orus",
"Aoede",
"Callirrhoe",
"Autonoe",
"Enceladus",
"Iapetus",
"Umbriel",
"Algieba",
"Despina",
"Erinome",
"Algenib",
"Rasalgethi",
"Laomedeia",
"Achernar",
"Alnilam",
"Schedar",
"Gacrux",
"Pulcherrima",
"Achird",
"Zubenelgenubi",
"Vindemiatrix",
"Sadachbia",
"Sadaltager",
"Sulafat",
] as const;
type GoogleTtsProviderConfig = {
apiKey?: string;
baseUrl?: string;
model: string;
voiceName: string;
};
type GoogleTtsProviderOverrides = {
model?: string;
voiceName?: string;
};
type Maybe<T> = T | undefined;
type GoogleInlineDataPart = {
mimeType?: string;
mime_type?: string;
data?: string;
};
type GoogleGenerateSpeechResponse = {
candidates?: Array<{
content?: {
parts?: Array<{
text?: string;
inlineData?: GoogleInlineDataPart;
inline_data?: GoogleInlineDataPart;
}>;
};
}>;
};
function normalizeGoogleTtsModel(model: unknown): string {
const trimmed = normalizeOptionalString(model);
if (!trimmed) {
return DEFAULT_GOOGLE_TTS_MODEL;
}
const withoutProvider = trimmed.startsWith("google/") ? trimmed.slice("google/".length) : trimmed;
return withoutProvider === "gemini-3.1-flash-tts" ? DEFAULT_GOOGLE_TTS_MODEL : withoutProvider;
}
function normalizeGoogleTtsVoiceName(voiceName: unknown): string {
return normalizeOptionalString(voiceName) ?? DEFAULT_GOOGLE_TTS_VOICE;
}
function resolveGoogleTtsEnvApiKey(): string | undefined {
return (
normalizeOptionalString(process.env.GEMINI_API_KEY) ??
normalizeOptionalString(process.env.GOOGLE_API_KEY)
);
}
function resolveGoogleTtsModelProviderApiKey(cfg?: OpenClawConfig): string | undefined {
return normalizeResolvedSecretInputString({
value: cfg?.models?.providers?.google?.apiKey,
path: "models.providers.google.apiKey",
});
}
function resolveGoogleTtsApiKey(params: {
cfg?: OpenClawConfig;
providerConfig: SpeechProviderConfig;
}): string | undefined {
return (
readGoogleTtsProviderConfig(params.providerConfig).apiKey ??
resolveGoogleTtsModelProviderApiKey(params.cfg) ??
resolveGoogleTtsEnvApiKey()
);
}
function resolveGoogleTtsBaseUrl(params: {
cfg?: OpenClawConfig;
providerConfig: GoogleTtsProviderConfig;
}): string | undefined {
return (
params.providerConfig.baseUrl ?? trimToUndefined(params.cfg?.models?.providers?.google?.baseUrl)
);
}
function resolveGoogleTtsConfigRecord(
rawConfig: Record<string, unknown>,
): Record<string, unknown> | undefined {
const providers = asObject(rawConfig.providers);
return asObject(providers?.google) ?? asObject(rawConfig.google);
}
function normalizeGoogleTtsProviderConfig(
rawConfig: Record<string, unknown>,
): GoogleTtsProviderConfig {
const raw = resolveGoogleTtsConfigRecord(rawConfig);
return {
apiKey: normalizeResolvedSecretInputString({
value: raw?.apiKey,
path: "messages.tts.providers.google.apiKey",
}),
baseUrl: trimToUndefined(raw?.baseUrl),
model: normalizeGoogleTtsModel(raw?.model),
voiceName: normalizeGoogleTtsVoiceName(raw?.voiceName ?? raw?.voice),
};
}
function readGoogleTtsProviderConfig(config: SpeechProviderConfig): GoogleTtsProviderConfig {
const normalized = normalizeGoogleTtsProviderConfig({});
return {
apiKey: trimToUndefined(config.apiKey) ?? normalized.apiKey,
baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
model: normalizeGoogleTtsModel(config.model ?? normalized.model),
voiceName: normalizeGoogleTtsVoiceName(
config.voiceName ?? config.voice ?? normalized.voiceName,
),
};
}
function readGoogleTtsOverrides(
overrides: Maybe<SpeechProviderOverrides>,
): GoogleTtsProviderOverrides {
if (!overrides) {
return {};
}
return {
model: normalizeOptionalString(overrides.model),
voiceName: normalizeOptionalString(overrides.voiceName ?? overrides.voice),
};
}
function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
handled: boolean;
overrides?: SpeechProviderOverrides;
warnings?: string[];
} {
switch (ctx.key) {
case "voicename":
case "voice_name":
case "google_voice":
case "googlevoice":
if (!ctx.policy.allowVoice) {
return { handled: true };
}
return { handled: true, overrides: { voiceName: ctx.value } };
case "google_model":
case "googlemodel":
if (!ctx.policy.allowModelId) {
return { handled: true };
}
return { handled: true, overrides: { model: ctx.value } };
default:
return { handled: false };
}
}
function extractGoogleSpeechPcm(payload: GoogleGenerateSpeechResponse): Buffer {
for (const candidate of payload.candidates ?? []) {
for (const part of candidate.content?.parts ?? []) {
const inline = part.inlineData ?? part.inline_data;
const data = normalizeOptionalString(inline?.data);
if (!data) {
continue;
}
return Buffer.from(data, "base64");
}
}
throw new Error("Google TTS response missing audio data");
}
function wrapPcm16MonoToWav(pcm: Buffer, sampleRate = GOOGLE_TTS_SAMPLE_RATE): Buffer {
const byteRate = sampleRate * GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
const blockAlign = GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
const header = Buffer.alloc(44);
header.write("RIFF", 0, "ascii");
header.writeUInt32LE(36 + pcm.length, 4);
header.write("WAVE", 8, "ascii");
header.write("fmt ", 12, "ascii");
header.writeUInt32LE(16, 16);
header.writeUInt16LE(1, 20);
header.writeUInt16LE(GOOGLE_TTS_CHANNELS, 22);
header.writeUInt32LE(sampleRate, 24);
header.writeUInt32LE(byteRate, 28);
header.writeUInt16LE(blockAlign, 32);
header.writeUInt16LE(GOOGLE_TTS_BITS_PER_SAMPLE, 34);
header.write("data", 36, "ascii");
header.writeUInt32LE(pcm.length, 40);
return Buffer.concat([header, pcm]);
}
async function synthesizeGoogleTtsPcm(params: {
text: string;
apiKey: string;
baseUrl?: string;
model: string;
voiceName: string;
timeoutMs: number;
}): Promise<Buffer> {
const { baseUrl, allowPrivateNetwork, headers, dispatcherPolicy } =
resolveGoogleGenerativeAiHttpRequestConfig({
apiKey: params.apiKey,
baseUrl: params.baseUrl,
capability: "audio",
transport: "http",
});
const { response: res, release } = await postJsonRequest({
url: `${baseUrl}/models/${params.model}:generateContent`,
headers,
body: {
contents: [
{
role: "user",
parts: [{ text: params.text }],
},
],
generationConfig: {
responseModalities: ["AUDIO"],
speechConfig: {
voiceConfig: {
prebuiltVoiceConfig: {
voiceName: params.voiceName,
},
},
},
},
},
timeoutMs: params.timeoutMs,
fetchFn: fetch,
pinDns: false,
allowPrivateNetwork,
dispatcherPolicy,
});
try {
await assertOkOrThrowHttpError(res, "Google TTS failed");
return extractGoogleSpeechPcm((await res.json()) as GoogleGenerateSpeechResponse);
} finally {
await release();
}
}
export function buildGoogleSpeechProvider(): SpeechProviderPlugin {
return {
id: "google",
label: "Google",
autoSelectOrder: 50,
models: [DEFAULT_GOOGLE_TTS_MODEL],
voices: GOOGLE_TTS_VOICES,
resolveConfig: ({ rawConfig }) => normalizeGoogleTtsProviderConfig(rawConfig),
parseDirectiveToken,
resolveTalkConfig: ({ baseTtsConfig, talkProviderConfig }) => {
const base = normalizeGoogleTtsProviderConfig(baseTtsConfig);
return {
...base,
...(talkProviderConfig.apiKey === undefined
? {}
: {
apiKey: normalizeResolvedSecretInputString({
value: talkProviderConfig.apiKey,
path: "talk.providers.google.apiKey",
}),
}),
...(trimToUndefined(talkProviderConfig.baseUrl) == null
? {}
: { baseUrl: trimToUndefined(talkProviderConfig.baseUrl) }),
...(trimToUndefined(talkProviderConfig.modelId) == null
? {}
: { model: normalizeGoogleTtsModel(talkProviderConfig.modelId) }),
...(trimToUndefined(talkProviderConfig.voiceId) == null
? {}
: { voiceName: normalizeGoogleTtsVoiceName(talkProviderConfig.voiceId) }),
};
},
resolveTalkOverrides: ({ params }) => ({
...(trimToUndefined(params.voiceId) == null
? {}
: { voiceName: normalizeGoogleTtsVoiceName(params.voiceId) }),
...(trimToUndefined(params.modelId) == null
? {}
: { model: normalizeGoogleTtsModel(params.modelId) }),
}),
listVoices: async () => GOOGLE_TTS_VOICES.map((voice) => ({ id: voice, name: voice })),
isConfigured: ({ cfg, providerConfig }) =>
Boolean(resolveGoogleTtsApiKey({ cfg, providerConfig })),
synthesize: async (req) => {
const config = readGoogleTtsProviderConfig(req.providerConfig);
const overrides = readGoogleTtsOverrides(req.providerOverrides);
const apiKey = resolveGoogleTtsApiKey({
cfg: req.cfg,
providerConfig: req.providerConfig,
});
if (!apiKey) {
throw new Error("Google API key missing");
}
const pcm = await synthesizeGoogleTtsPcm({
text: req.text,
apiKey,
baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
model: normalizeGoogleTtsModel(overrides.model ?? config.model),
voiceName: normalizeGoogleTtsVoiceName(overrides.voiceName ?? config.voiceName),
timeoutMs: req.timeoutMs,
});
return {
audioBuffer: wrapPcm16MonoToWav(pcm),
outputFormat: "wav",
fileExtension: ".wav",
voiceCompatible: false,
};
},
synthesizeTelephony: async (req) => {
const config = readGoogleTtsProviderConfig(req.providerConfig);
const apiKey = resolveGoogleTtsApiKey({
cfg: req.cfg,
providerConfig: req.providerConfig,
});
if (!apiKey) {
throw new Error("Google API key missing");
}
const pcm = await synthesizeGoogleTtsPcm({
text: req.text,
apiKey,
baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
model: config.model,
voiceName: config.voiceName,
timeoutMs: req.timeoutMs,
});
return {
audioBuffer: pcm,
outputFormat: "pcm",
sampleRate: GOOGLE_TTS_SAMPLE_RATE,
};
},
};
}
export const __testing = {
DEFAULT_GOOGLE_TTS_MODEL,
DEFAULT_GOOGLE_TTS_VOICE,
GOOGLE_TTS_SAMPLE_RATE,
normalizeGoogleTtsModel,
wrapPcm16MonoToWav,
};


@@ -1,5 +1,6 @@
export { buildGoogleGeminiCliBackend } from "./cli-backend.js";
export { buildGoogleImageGenerationProvider } from "./image-generation-provider.js";
export { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
export { buildGoogleSpeechProvider } from "./speech-provider.js";
export { googleMediaUnderstandingProvider } from "./media-understanding-provider.js";
export { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";


@@ -474,9 +474,11 @@ export function getTtsProvider(config: ResolvedTtsConfig, prefsPath: string): Tt
return normalizeConfiguredSpeechProviderId(config.provider) ?? config.provider;
}
for (const provider of sortSpeechProvidersForAutoSelection()) {
const effectiveCfg = config.sourceConfig;
for (const provider of sortSpeechProvidersForAutoSelection(effectiveCfg)) {
if (
provider.isConfigured({
cfg: effectiveCfg,
providerConfig: config.providerConfigs[provider.id] ?? {},
timeoutMs: config.timeoutMs,
})


@@ -55,6 +55,7 @@ export const pluginRegistrationContractCases = {
pluginId: "google",
providerIds: ["google", "google-gemini-cli"],
webSearchProviderIds: ["gemini"],
speechProviderIds: ["google"],
mediaUnderstandingProviderIds: ["google"],
imageGenerationProviderIds: ["google"],
requireDescribeImages: true,


@@ -307,7 +307,8 @@ function buildTestMicrosoftSpeechProvider(): SpeechProviderPlugin {
outputFormat: edgeConfig.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3",
};
},
isConfigured: () => true,
isConfigured: ({ providerConfig }) =>
(providerConfig as Record<string, unknown> | undefined)?.enabled !== false,
synthesize: async () => ({
audioBuffer: createAudioBuffer(),
outputFormat: "mp3",
@@ -368,6 +369,32 @@ function buildTestElevenLabsSpeechProvider(): SpeechProviderPlugin {
};
}
function buildTestGoogleSpeechProvider(): SpeechProviderPlugin {
return {
id: "google",
label: "Google",
autoSelectOrder: 50,
resolveConfig: ({ rawConfig }) => resolveTestProviderConfig(rawConfig, "google"),
isConfigured: ({ cfg, providerConfig }) =>
typeof (providerConfig as Record<string, unknown> | undefined)?.apiKey === "string" ||
typeof cfg?.models?.providers?.google?.apiKey === "string" ||
typeof process.env.GEMINI_API_KEY === "string" ||
typeof process.env.GOOGLE_API_KEY === "string",
synthesize: async () => ({
audioBuffer: createAudioBuffer(),
outputFormat: "wav",
fileExtension: ".wav",
voiceCompatible: false,
}),
synthesizeTelephony: async () => ({
audioBuffer: createAudioBuffer(),
outputFormat: "pcm",
sampleRate: 24_000,
}),
listVoices: async () => [{ id: "Kore", label: "Kore" }],
};
}
async function loadTtsRuntime(): Promise<TtsRuntimeModule> {
ttsRuntimePromise ??= import("../../../src/tts/tts.js");
return await ttsRuntimePromise;
@@ -395,6 +422,7 @@ function setupTestSpeechProviderRegistry() {
{ pluginId: "openai", provider: buildTestOpenAISpeechProvider(), source: "test" },
{ pluginId: "microsoft", provider: buildTestMicrosoftSpeechProvider(), source: "test" },
{ pluginId: "elevenlabs", provider: buildTestElevenLabsSpeechProvider(), source: "test" },
{ pluginId: "google", provider: buildTestGoogleSpeechProvider(), source: "test" },
];
const { cacheKey } = pluginLoaderTesting.resolvePluginLoadCacheContext({ config: {} });
setActivePluginRegistry(registry, cacheKey);
@@ -613,6 +641,32 @@ export function describeTtsConfigContract() {
expect(provider).toBe(testCase.expected);
});
});
it("passes cfg into auto-selection so model-provider Google keys can configure TTS", () => {
const cfg = asLegacyOpenClawConfig({
agents: { defaults: { model: { primary: "openai/gpt-4o-mini" } } },
models: {
providers: {
google: {
apiKey: "model-provider-google-key",
},
},
},
messages: {
tts: {
providers: {
microsoft: {
enabled: false,
},
},
},
},
});
const config = resolveTtsConfig(cfg);
const prefsPath = `/tmp/tts-prefs-google-model-provider-${Date.now()}.json`;
expect(getTtsProvider(config, prefsPath)).toBe("google");
});
});
describe("resolveTtsConfig provider normalization", () => {