mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 05:10:44 +00:00
fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)
* Add Google Gemini TTS provider
* Remove committed planning artifact
* Explain Google media provider type shape
* google: distill Gemini TTS provider
* fix: add Google Gemini TTS provider (#67515) (thanks @barronlroth)
* fix: honor cfg-backed Google TTS selection (#67515) (thanks @barronlroth)
* fix: narrow Google TTS directive aliases (#67515) (thanks @barronlroth)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
@@ -6,6 +6,8 @@ Docs: https://docs.openclaw.ai

### Changes

- Google/TTS: add Gemini text-to-speech support to the bundled `google` plugin, including provider registration, voice selection, WAV reply output, PCM telephony output, and setup/docs guidance. (#67515) Thanks @barronlroth.

### Fixes

- Gateway/tools: anchor trusted local `MEDIA:` tool-result passthrough on the exact raw name of this run's registered built-in tools, and reject client tool definitions whose names normalize-collide with a built-in or with another client tool in the same request (`400 invalid_request_error` on both JSON and SSE paths), so a client-supplied tool named like a built-in can no longer inherit its local-media trust. (#67303)

@@ -1,6 +1,6 @@
---
title: "Google (Gemini)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, web search)"
summary: "Google Gemini setup (API key + OAuth, image generation, media understanding, TTS, web search)"
read_when:
- You want to use Google Gemini models with OpenClaw
- You need the API key or OAuth auth flow
@@ -9,7 +9,7 @@ read_when:
# Google (Gemini)

The Google plugin provides access to Gemini models through Google AI Studio, plus
image generation, media understanding (image/audio/video), and web search via
image generation, media understanding (image/audio/video), text-to-speech, and web search via
Gemini Grounding.

- Provider: `google`
@@ -133,6 +133,7 @@ Choose your preferred auth method and follow the setup steps.
| Chat completions | Yes |
| Image generation | Yes |
| Music generation | Yes |
| Text-to-speech | Yes |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -233,6 +234,50 @@ To use Google as the default music provider:
See [Music Generation](/tools/music-generation) for shared tool parameters, provider selection, and failover behavior.
</Note>

## Text-to-speech

The bundled `google` speech provider uses the Gemini API TTS path with
`gemini-3.1-flash-tts-preview`.

- Default voice: `Kore`
- Auth: `messages.tts.providers.google.apiKey`, `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`
- Output: WAV for regular TTS attachments, PCM for Talk/telephony
- Native voice-note output: not supported on this Gemini API path because the API returns PCM rather than Opus

To use Google as the default TTS provider:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
        },
      },
    },
  },
}
```

Gemini API TTS accepts expressive square-bracket audio tags in the text, such as
`[whispers]` or `[laughs]`. To keep tags out of the visible chat reply while
sending them to TTS, put them inside a `[[tts:text]]...[[/tts:text]]` block:

```text
Here is the clean reply text.

[[tts:text]][whispers] Here is the spoken version.[[/tts:text]]
```

<Note>
A Google Cloud Console API key restricted to the Gemini API is valid for this
provider. This is not the separate Cloud Text-to-Speech API path.
</Note>

## Advanced configuration

<AccordionGroup>

@@ -9,12 +9,13 @@ title: "Text-to-Speech"

# Text-to-speech (TTS)

OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
It works anywhere OpenClaw can send audio.

## Supported services

- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
@@ -34,9 +35,10 @@ or ElevenLabs.

## Optional keys

If you want OpenAI, ElevenLabs, or MiniMax:
If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:

- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `MINIMAX_API_KEY`
- `OPENAI_API_KEY`

@@ -170,6 +172,32 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```

### Google Gemini primary

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          apiKey: "gemini_api_key",
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
        },
      },
    },
  },
}
```

Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key
restricted to the Gemini API is valid here, and it is the same style of key used
by the bundled Google image-generation provider. Resolution order is
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.

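That first-defined-wins resolution order can be illustrated with a small standalone sketch. This is a hypothetical helper written for this doc, not the plugin's actual implementation; the option names here are illustrative only.

```typescript
// Hypothetical sketch of the documented Google TTS API-key resolution order.
// First defined value wins, mirroring the order listed above.
function resolveGoogleTtsApiKeySketch(opts: {
  ttsProviderKey?: string; // messages.tts.providers.google.apiKey
  modelProviderKey?: string; // models.providers.google.apiKey
  env?: { GEMINI_API_KEY?: string; GOOGLE_API_KEY?: string };
}): string | undefined {
  return (
    opts.ttsProviderKey ??
    opts.modelProviderKey ??
    opts.env?.GEMINI_API_KEY ??
    opts.env?.GOOGLE_API_KEY
  );
}
```

So a TTS-specific key always beats a shared model-provider key, which in turn beats either environment variable.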
### Disable Microsoft speech

```json5
@@ -238,7 +266,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -250,7 +278,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -268,6 +296,10 @@ Then run:
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -302,9 +334,9 @@ Here you go.

Available directive keys (when enabled):

- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
- `pitch` (MiniMax pitch, -12 to 12)
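For example, a reply could select the Google provider and voice per message via `[[tts:key=value]]` directives. The values below are hypothetical, this sketch assumes one key=value pair per directive token, and the corresponding `allowProvider` / `allowVoice` policy flags must be enabled:

```text
Here you go.

[[tts:provider=google]][[tts:google_voice=Puck]]
```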
@@ -364,6 +396,7 @@ These override `messages.tts.*` for that host.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).

@@ -5,18 +5,19 @@ import { buildGoogleGeminiCliBackend } from "./cli-backend.js";
import { registerGoogleGeminiCliProvider } from "./gemini-cli-provider.js";
import { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
import { registerGoogleProvider } from "./provider-registration.js";
import { buildGoogleSpeechProvider } from "./speech-provider.js";
import { createGeminiWebSearchProvider } from "./src/gemini-web-search-provider.js";
import { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";

let googleImageGenerationProviderPromise: Promise<ImageGenerationProvider> | null = null;
let googleMediaUnderstandingProviderPromise: Promise<MediaUnderstandingProvider> | null = null;

type GoogleMediaUnderstandingProvider = MediaUnderstandingProvider & {
  describeImage: NonNullable<MediaUnderstandingProvider["describeImage"]>;
  describeImages: NonNullable<MediaUnderstandingProvider["describeImages"]>;
  transcribeAudio: NonNullable<MediaUnderstandingProvider["transcribeAudio"]>;
  describeVideo: NonNullable<MediaUnderstandingProvider["describeVideo"]>;
};
type GoogleMediaUnderstandingProvider = Required<
  Pick<
    MediaUnderstandingProvider,
    "describeImage" | "describeImages" | "transcribeAudio" | "describeVideo"
  >
>;

async function loadGoogleImageGenerationProvider(): Promise<ImageGenerationProvider> {
  if (!googleImageGenerationProviderPromise) {
@@ -113,6 +114,7 @@ export default definePluginEntry({
    api.registerImageGenerationProvider(createLazyGoogleImageGenerationProvider());
    api.registerMediaUnderstandingProvider(createLazyGoogleMediaUnderstandingProvider());
    api.registerMusicGenerationProvider(buildGoogleMusicGenerationProvider());
    api.registerSpeechProvider(buildGoogleSpeechProvider());
    api.registerVideoGenerationProvider(buildGoogleVideoGenerationProvider());
    api.registerWebSearchProvider(createGeminiWebSearchProvider());
  },

@@ -48,6 +48,7 @@
    "mediaUnderstandingProviders": ["google"],
    "imageGenerationProviders": ["google"],
    "musicGenerationProviders": ["google"],
    "speechProviders": ["google"],
    "videoGenerationProviders": ["google"],
    "webSearchProviders": ["gemini"]
  },

@@ -3,6 +3,7 @@ import { describePluginRegistrationContract } from "../../test/helpers/plugins/p

describePluginRegistrationContract({
  ...pluginRegistrationContractCases.google,
  speechProviderIds: ["google"],
  videoGenerationProviderIds: ["google"],
  webSearchProviderIds: ["gemini"],
  requireDescribeImages: true,

extensions/google/speech-provider.test.ts — new file, 248 lines
@@ -0,0 +1,248 @@
import { afterEach, describe, expect, it, vi } from "vitest";
import { buildGoogleSpeechProvider, __testing } from "./speech-provider.js";

function installGoogleTtsFetchMock(pcm = Buffer.from([1, 0, 2, 0])) {
  const fetchMock = vi.fn().mockResolvedValue({
    ok: true,
    json: async () => ({
      candidates: [
        {
          content: {
            parts: [
              {
                inlineData: {
                  mimeType: "audio/L16;codec=pcm;rate=24000",
                  data: pcm.toString("base64"),
                },
              },
            ],
          },
        },
      ],
    }),
  });
  vi.stubGlobal("fetch", fetchMock);
  return fetchMock;
}

describe("Google speech provider", () => {
  afterEach(() => {
    vi.restoreAllMocks();
    vi.unstubAllGlobals();
    vi.unstubAllEnvs();
  });

  it("synthesizes Gemini PCM as WAV and preserves audio tags in the request text", async () => {
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();

    const result = await provider.synthesize({
      text: "[whispers] The door is open.",
      cfg: {},
      providerConfig: {
        apiKey: "google-test-key",
        model: "google/gemini-3.1-flash-tts",
        voiceName: "Puck",
      },
      target: "audio-file",
      timeoutMs: 12_345,
    });

    expect(fetchMock).toHaveBeenCalledWith(
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
      expect.objectContaining({
        method: "POST",
        body: JSON.stringify({
          contents: [
            {
              role: "user",
              parts: [{ text: "[whispers] The door is open." }],
            },
          ],
          generationConfig: {
            responseModalities: ["AUDIO"],
            speechConfig: {
              voiceConfig: {
                prebuiltVoiceConfig: {
                  voiceName: "Puck",
                },
              },
            },
          },
        }),
      }),
    );
    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("google-test-key");
    expect(result.outputFormat).toBe("wav");
    expect(result.fileExtension).toBe(".wav");
    expect(result.voiceCompatible).toBe(false);
    expect(result.audioBuffer.subarray(0, 4).toString("ascii")).toBe("RIFF");
    expect(result.audioBuffer.subarray(8, 12).toString("ascii")).toBe("WAVE");
    expect(result.audioBuffer.readUInt32LE(24)).toBe(__testing.GOOGLE_TTS_SAMPLE_RATE);
    expect(result.audioBuffer.subarray(44)).toEqual(Buffer.from([1, 0, 2, 0]));
  });

  it("falls back to GEMINI_API_KEY and configured Google API base URL", async () => {
    vi.stubEnv("GEMINI_API_KEY", "env-google-key");
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();

    expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 1 })).toBe(true);

    await provider.synthesize({
      text: "Read this plainly.",
      cfg: {
        models: {
          providers: {
            google: {
              baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
              models: [],
            },
          },
        },
      },
      providerConfig: {},
      target: "voice-note",
      timeoutMs: 10_000,
    });

    expect(fetchMock).toHaveBeenCalledWith(
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent",
      expect.any(Object),
    );
    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("env-google-key");
  });

  it("can reuse a configured Google model-provider API key without auth profiles", async () => {
    const fetchMock = installGoogleTtsFetchMock();
    const provider = buildGoogleSpeechProvider();
    const cfg = {
      models: {
        providers: {
          google: {
            apiKey: "model-provider-google-key",
            baseUrl: "https://generativelanguage.googleapis.com",
            models: [],
          },
        },
      },
    };

    expect(provider.isConfigured({ cfg, providerConfig: {}, timeoutMs: 1 })).toBe(true);

    await provider.synthesize({
      text: "Use the configured model provider key.",
      cfg,
      providerConfig: {},
      target: "audio-file",
      timeoutMs: 10_000,
    });

    const [, init] = fetchMock.mock.calls[0];
    expect(new Headers(init.headers).get("x-goog-api-key")).toBe("model-provider-google-key");
  });

  it("returns Gemini PCM directly for telephony synthesis", async () => {
    const pcm = Buffer.from([3, 0, 4, 0]);
    installGoogleTtsFetchMock(pcm);
    const provider = buildGoogleSpeechProvider();

    const result = await provider.synthesizeTelephony?.({
      text: "Phone call audio.",
      cfg: {},
      providerConfig: {
        apiKey: "google-test-key",
        voice: "Kore",
      },
      timeoutMs: 5_000,
    });

    expect(result).toEqual({
      audioBuffer: pcm,
      outputFormat: "pcm",
      sampleRate: 24_000,
    });
  });

  it("resolves provider config and directive overrides", () => {
    const provider = buildGoogleSpeechProvider();

    expect(
      provider.resolveConfig?.({
        cfg: {},
        rawConfig: {
          providers: {
            google: {
              apiKey: "configured-key",
              model: "google/gemini-3.1-flash-tts-preview",
              voice: "Leda",
            },
          },
        },
        timeoutMs: 1,
      }),
    ).toEqual({
      apiKey: "configured-key",
      baseUrl: undefined,
      model: "gemini-3.1-flash-tts-preview",
      voiceName: "Leda",
    });

    expect(
      provider.parseDirectiveToken?.({
        key: "google_voice",
        value: "Aoede",
        policy: {
          enabled: true,
          allowText: true,
          allowProvider: true,
          allowVoice: true,
          allowModelId: true,
          allowVoiceSettings: true,
          allowNormalization: true,
          allowSeed: true,
        },
      }),
    ).toEqual({
      handled: true,
      overrides: {
        voiceName: "Aoede",
      },
    });

    expect(
      provider.parseDirectiveToken?.({
        key: "google_model",
        value: "gemini-3.1-flash-tts-preview",
        policy: {
          enabled: true,
          allowText: true,
          allowProvider: true,
          allowVoice: true,
          allowModelId: true,
          allowVoiceSettings: true,
          allowNormalization: true,
          allowSeed: true,
        },
      }),
    ).toEqual({
      handled: true,
      overrides: {
        model: "gemini-3.1-flash-tts-preview",
      },
    });
  });

  it("lists Gemini prebuilt TTS voices", async () => {
    const provider = buildGoogleSpeechProvider();

    await expect(provider.listVoices?.({ providerConfig: {} })).resolves.toEqual(
      expect.arrayContaining([
        { id: "Kore", name: "Kore" },
        { id: "Puck", name: "Puck" },
      ]),
    );
  });
});
extensions/google/speech-provider.ts — new file, 391 lines
@@ -0,0 +1,391 @@
|
||||
import { assertOkOrThrowHttpError, postJsonRequest } from "openclaw/plugin-sdk/provider-http";
|
||||
import type { OpenClawConfig } from "openclaw/plugin-sdk/provider-onboard";
|
||||
import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
|
||||
import type {
|
||||
SpeechDirectiveTokenParseContext,
|
||||
SpeechProviderConfig,
|
||||
SpeechProviderOverrides,
|
||||
SpeechProviderPlugin,
|
||||
} from "openclaw/plugin-sdk/speech-core";
|
||||
import { asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
|
||||
import { normalizeOptionalString } from "openclaw/plugin-sdk/text-runtime";
|
||||
import { resolveGoogleGenerativeAiHttpRequestConfig } from "./api.js";
|
||||
|
||||
const DEFAULT_GOOGLE_TTS_MODEL = "gemini-3.1-flash-tts-preview";
|
||||
const DEFAULT_GOOGLE_TTS_VOICE = "Kore";
|
||||
const GOOGLE_TTS_SAMPLE_RATE = 24_000;
|
||||
const GOOGLE_TTS_CHANNELS = 1;
|
||||
const GOOGLE_TTS_BITS_PER_SAMPLE = 16;
|
||||
|
||||
const GOOGLE_TTS_VOICES = [
|
||||
"Zephyr",
|
||||
"Puck",
|
||||
"Charon",
|
||||
"Kore",
|
||||
"Fenrir",
|
||||
"Leda",
|
||||
"Orus",
|
||||
"Aoede",
|
||||
"Callirrhoe",
|
||||
"Autonoe",
|
||||
"Enceladus",
|
||||
"Iapetus",
|
||||
"Umbriel",
|
||||
"Algieba",
|
||||
"Despina",
|
||||
"Erinome",
|
||||
"Algenib",
|
||||
"Rasalgethi",
|
||||
"Laomedeia",
|
||||
"Achernar",
|
||||
"Alnilam",
|
||||
"Schedar",
|
||||
"Gacrux",
|
||||
"Pulcherrima",
|
||||
"Achird",
|
||||
"Zubenelgenubi",
|
||||
"Vindemiatrix",
|
||||
"Sadachbia",
|
||||
"Sadaltager",
|
||||
"Sulafat",
|
||||
] as const;
|
||||
|
||||
type GoogleTtsProviderConfig = {
|
||||
apiKey?: string;
|
||||
baseUrl?: string;
|
||||
model: string;
|
||||
voiceName: string;
|
||||
};
|
||||
|
||||
type GoogleTtsProviderOverrides = {
|
||||
model?: string;
|
||||
voiceName?: string;
|
||||
};
|
||||
|
||||
type Maybe<T> = T | undefined;
|
||||
|
||||
type GoogleInlineDataPart = {
|
||||
mimeType?: string;
|
||||
mime_type?: string;
|
||||
data?: string;
|
||||
};
|
||||
|
||||
type GoogleGenerateSpeechResponse = {
|
||||
candidates?: Array<{
|
||||
content?: {
|
||||
parts?: Array<{
|
||||
text?: string;
|
||||
inlineData?: GoogleInlineDataPart;
|
||||
inline_data?: GoogleInlineDataPart;
|
||||
}>;
|
||||
};
|
||||
}>;
|
||||
};
|
||||
|
||||
function normalizeGoogleTtsModel(model: unknown): string {
|
||||
const trimmed = normalizeOptionalString(model);
|
||||
if (!trimmed) {
|
||||
return DEFAULT_GOOGLE_TTS_MODEL;
|
||||
}
|
||||
const withoutProvider = trimmed.startsWith("google/") ? trimmed.slice("google/".length) : trimmed;
|
||||
return withoutProvider === "gemini-3.1-flash-tts" ? DEFAULT_GOOGLE_TTS_MODEL : withoutProvider;
|
||||
}
|
||||
|
||||
function normalizeGoogleTtsVoiceName(voiceName: unknown): string {
|
||||
return normalizeOptionalString(voiceName) ?? DEFAULT_GOOGLE_TTS_VOICE;
|
||||
}
|
||||
|
||||
function resolveGoogleTtsEnvApiKey(): string | undefined {
|
||||
return (
|
||||
normalizeOptionalString(process.env.GEMINI_API_KEY) ??
|
||||
normalizeOptionalString(process.env.GOOGLE_API_KEY)
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsModelProviderApiKey(cfg?: OpenClawConfig): string | undefined {
|
||||
return normalizeResolvedSecretInputString({
|
||||
value: cfg?.models?.providers?.google?.apiKey,
|
||||
path: "models.providers.google.apiKey",
|
||||
});
|
||||
}
|
||||
|
||||
function resolveGoogleTtsApiKey(params: {
|
||||
cfg?: OpenClawConfig;
|
||||
providerConfig: SpeechProviderConfig;
|
||||
}): string | undefined {
|
||||
return (
|
||||
readGoogleTtsProviderConfig(params.providerConfig).apiKey ??
|
||||
resolveGoogleTtsModelProviderApiKey(params.cfg) ??
|
||||
resolveGoogleTtsEnvApiKey()
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsBaseUrl(params: {
|
||||
cfg?: OpenClawConfig;
|
||||
providerConfig: GoogleTtsProviderConfig;
|
||||
}): string | undefined {
|
||||
return (
|
||||
params.providerConfig.baseUrl ?? trimToUndefined(params.cfg?.models?.providers?.google?.baseUrl)
|
||||
);
|
||||
}
|
||||
|
||||
function resolveGoogleTtsConfigRecord(
|
||||
rawConfig: Record<string, unknown>,
|
||||
): Record<string, unknown> | undefined {
|
||||
const providers = asObject(rawConfig.providers);
|
||||
return asObject(providers?.google) ?? asObject(rawConfig.google);
|
||||
}
|
||||
|
||||
function normalizeGoogleTtsProviderConfig(
|
||||
rawConfig: Record<string, unknown>,
|
||||
): GoogleTtsProviderConfig {
|
||||
const raw = resolveGoogleTtsConfigRecord(rawConfig);
|
||||
return {
|
||||
apiKey: normalizeResolvedSecretInputString({
|
||||
value: raw?.apiKey,
|
||||
path: "messages.tts.providers.google.apiKey",
|
||||
}),
|
||||
baseUrl: trimToUndefined(raw?.baseUrl),
|
||||
model: normalizeGoogleTtsModel(raw?.model),
|
||||
voiceName: normalizeGoogleTtsVoiceName(raw?.voiceName ?? raw?.voice),
|
||||
};
|
||||
}
|
||||
|
||||
function readGoogleTtsProviderConfig(config: SpeechProviderConfig): GoogleTtsProviderConfig {
|
||||
const normalized = normalizeGoogleTtsProviderConfig({});
|
||||
return {
|
||||
apiKey: trimToUndefined(config.apiKey) ?? normalized.apiKey,
|
||||
baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
|
||||
model: normalizeGoogleTtsModel(config.model ?? normalized.model),
|
||||
voiceName: normalizeGoogleTtsVoiceName(
|
||||
config.voiceName ?? config.voice ?? normalized.voiceName,
|
||||
),
|
||||
};
|
||||
}
|
||||
|
||||
function readGoogleTtsOverrides(
|
||||
overrides: Maybe<SpeechProviderOverrides>,
|
||||
): GoogleTtsProviderOverrides {
|
||||
if (!overrides) {
|
||||
return {};
|
||||
}
|
||||
return {
|
||||
model: normalizeOptionalString(overrides.model),
|
||||
voiceName: normalizeOptionalString(overrides.voiceName ?? overrides.voice),
|
||||
};
|
||||
}
|
||||
|
||||
function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
|
||||
handled: boolean;
|
||||
overrides?: SpeechProviderOverrides;
|
||||
warnings?: string[];
|
||||
} {
|
||||
switch (ctx.key) {
|
||||
case "voicename":
|
||||
case "voice_name":
|
||||
case "google_voice":
|
||||
case "googlevoice":
|
||||
if (!ctx.policy.allowVoice) {
|
||||
return { handled: true };
|
||||
}
|
||||
return { handled: true, overrides: { voiceName: ctx.value } };
|
||||
case "google_model":
|
||||
case "googlemodel":
|
||||
if (!ctx.policy.allowModelId) {
|
||||
return { handled: true };
|
||||
}
|
||||
return { handled: true, overrides: { model: ctx.value } };
|
||||
default:
|
||||
return { handled: false };
|
||||
}
|
||||
}
|
||||
|
||||
function extractGoogleSpeechPcm(payload: GoogleGenerateSpeechResponse): Buffer {
|
||||
for (const candidate of payload.candidates ?? []) {
|
||||
for (const part of candidate.content?.parts ?? []) {
|
||||
const inline = part.inlineData ?? part.inline_data;
|
||||
const data = normalizeOptionalString(inline?.data);
|
||||
if (!data) {
|
||||
continue;
|
||||
}
|
||||
return Buffer.from(data, "base64");
|
||||
}
|
||||
}
|
||||
throw new Error("Google TTS response missing audio data");
|
||||
}
|
||||
|
||||
function wrapPcm16MonoToWav(pcm: Buffer, sampleRate = GOOGLE_TTS_SAMPLE_RATE): Buffer {
|
||||
const byteRate = sampleRate * GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
|
||||
const blockAlign = GOOGLE_TTS_CHANNELS * (GOOGLE_TTS_BITS_PER_SAMPLE / 8);
|
||||
const header = Buffer.alloc(44);
|
||||
|
||||
header.write("RIFF", 0, "ascii");
|
||||
header.writeUInt32LE(36 + pcm.length, 4);
|
||||
header.write("WAVE", 8, "ascii");
|
||||
header.write("fmt ", 12, "ascii");
|
||||
header.writeUInt32LE(16, 16);
|
||||
header.writeUInt16LE(1, 20);
|
||||
header.writeUInt16LE(GOOGLE_TTS_CHANNELS, 22);
|
||||
header.writeUInt32LE(sampleRate, 24);
|
||||
header.writeUInt32LE(byteRate, 28);
|
||||
header.writeUInt16LE(blockAlign, 32);
|
||||
header.writeUInt16LE(GOOGLE_TTS_BITS_PER_SAMPLE, 34);
|
||||
header.write("data", 36, "ascii");
|
||||
header.writeUInt32LE(pcm.length, 40);
|
||||
|
||||
return Buffer.concat([header, pcm]);
|
||||
}
|
||||
|
||||
async function synthesizeGoogleTtsPcm(params: {
|
||||
text: string;
|
||||
apiKey: string;
|
||||
baseUrl?: string;
|
||||
model: string;
|
||||
voiceName: string;
|
||||
timeoutMs: number;
|
||||
}): Promise<Buffer> {
|
||||
const { baseUrl, allowPrivateNetwork, headers, dispatcherPolicy } =
|
||||
resolveGoogleGenerativeAiHttpRequestConfig({
|
||||
apiKey: params.apiKey,
|
||||
baseUrl: params.baseUrl,
|
||||
capability: "audio",
|
||||
transport: "http",
|
||||
});
|
||||
|
||||
const { response: res, release } = await postJsonRequest({
|
||||
url: `${baseUrl}/models/${params.model}:generateContent`,
|
||||
headers,
|
||||
body: {
|
||||
contents: [
|
||||
{
|
||||
role: "user",
|
||||
parts: [{ text: params.text }],
|
||||
},
|
||||
],
|
||||
generationConfig: {
|
||||
responseModalities: ["AUDIO"],
|
||||
speechConfig: {
|
||||
voiceConfig: {
|
||||
prebuiltVoiceConfig: {
|
||||
voiceName: params.voiceName,
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
timeoutMs: params.timeoutMs,
|
||||
fetchFn: fetch,
|
||||
pinDns: false,
|
||||
allowPrivateNetwork,
|
||||
dispatcherPolicy,
|
||||
});
|
||||
|
||||
try {
|
||||
await assertOkOrThrowHttpError(res, "Google TTS failed");
|
||||
return extractGoogleSpeechPcm((await res.json()) as GoogleGenerateSpeechResponse);
|
||||
} finally {
|
||||
await release();
|
||||
}
|
||||
}

export function buildGoogleSpeechProvider(): SpeechProviderPlugin {
  return {
    id: "google",
    label: "Google",
    autoSelectOrder: 50,
    models: [DEFAULT_GOOGLE_TTS_MODEL],
    voices: GOOGLE_TTS_VOICES,
    resolveConfig: ({ rawConfig }) => normalizeGoogleTtsProviderConfig(rawConfig),
    parseDirectiveToken,
    resolveTalkConfig: ({ baseTtsConfig, talkProviderConfig }) => {
      const base = normalizeGoogleTtsProviderConfig(baseTtsConfig);
      return {
        ...base,
        ...(talkProviderConfig.apiKey === undefined
          ? {}
          : {
              apiKey: normalizeResolvedSecretInputString({
                value: talkProviderConfig.apiKey,
                path: "talk.providers.google.apiKey",
              }),
            }),
        ...(trimToUndefined(talkProviderConfig.baseUrl) == null
          ? {}
          : { baseUrl: trimToUndefined(talkProviderConfig.baseUrl) }),
        ...(trimToUndefined(talkProviderConfig.modelId) == null
          ? {}
          : { model: normalizeGoogleTtsModel(talkProviderConfig.modelId) }),
        ...(trimToUndefined(talkProviderConfig.voiceId) == null
          ? {}
          : { voiceName: normalizeGoogleTtsVoiceName(talkProviderConfig.voiceId) }),
      };
    },
    resolveTalkOverrides: ({ params }) => ({
      ...(trimToUndefined(params.voiceId) == null
        ? {}
        : { voiceName: normalizeGoogleTtsVoiceName(params.voiceId) }),
      ...(trimToUndefined(params.modelId) == null
        ? {}
        : { model: normalizeGoogleTtsModel(params.modelId) }),
    }),
    listVoices: async () => GOOGLE_TTS_VOICES.map((voice) => ({ id: voice, name: voice })),
    isConfigured: ({ cfg, providerConfig }) =>
      Boolean(resolveGoogleTtsApiKey({ cfg, providerConfig })),
    synthesize: async (req) => {
      const config = readGoogleTtsProviderConfig(req.providerConfig);
      const overrides = readGoogleTtsOverrides(req.providerOverrides);
      const apiKey = resolveGoogleTtsApiKey({
        cfg: req.cfg,
        providerConfig: req.providerConfig,
      });
      if (!apiKey) {
        throw new Error("Google API key missing");
      }
      const pcm = await synthesizeGoogleTtsPcm({
        text: req.text,
        apiKey,
        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
        model: normalizeGoogleTtsModel(overrides.model ?? config.model),
        voiceName: normalizeGoogleTtsVoiceName(overrides.voiceName ?? config.voiceName),
        timeoutMs: req.timeoutMs,
      });
      return {
        audioBuffer: wrapPcm16MonoToWav(pcm),
        outputFormat: "wav",
        fileExtension: ".wav",
        voiceCompatible: false,
      };
    },
    synthesizeTelephony: async (req) => {
      const config = readGoogleTtsProviderConfig(req.providerConfig);
      const apiKey = resolveGoogleTtsApiKey({
        cfg: req.cfg,
        providerConfig: req.providerConfig,
      });
      if (!apiKey) {
        throw new Error("Google API key missing");
      }
      const pcm = await synthesizeGoogleTtsPcm({
        text: req.text,
        apiKey,
        baseUrl: resolveGoogleTtsBaseUrl({ cfg: req.cfg, providerConfig: config }),
        model: config.model,
        voiceName: config.voiceName,
        timeoutMs: req.timeoutMs,
      });
      return {
        audioBuffer: pcm,
        outputFormat: "pcm",
        sampleRate: GOOGLE_TTS_SAMPLE_RATE,
      };
    },
  };
}
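The conditional-spread pattern in `resolveTalkConfig` and `resolveTalkOverrides` layers settings so that later, more specific values win while unset fields leave earlier layers untouched. A simplified sketch of that layering, with illustrative field names rather than the repo's real config types:

```typescript
interface SketchTtsConfig {
  model: string;
  voiceName: string;
}

// Later spreads win: base config, then talk-level config, then per-call overrides.
// The `== null ? {} : {...}` guards keep unset fields from clobbering earlier layers.
function layerConfigSketch(
  base: SketchTtsConfig,
  talk: Partial<SketchTtsConfig>,
  overrides: Partial<SketchTtsConfig>,
): SketchTtsConfig {
  return {
    ...base,
    ...(talk.model == null ? {} : { model: talk.model }),
    ...(talk.voiceName == null ? {} : { voiceName: talk.voiceName }),
    ...(overrides.model == null ? {} : { model: overrides.model }),
    ...(overrides.voiceName == null ? {} : { voiceName: overrides.voiceName }),
  };
}
```

A plain `{ ...base, ...talk, ...overrides }` would not work here, because an explicitly `undefined` field in a later layer would overwrite a real value from an earlier one.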

export const __testing = {
  DEFAULT_GOOGLE_TTS_MODEL,
  DEFAULT_GOOGLE_TTS_VOICE,
  GOOGLE_TTS_SAMPLE_RATE,
  normalizeGoogleTtsModel,
  wrapPcm16MonoToWav,
};
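`wrapPcm16MonoToWav` is exported via `__testing` because WAV header math is easy to get wrong. A standalone sketch of wrapping 16-bit mono PCM in the standard 44-byte RIFF/WAVE header — not the repo's implementation, but the same technique, with the 24 kHz default mirroring `GOOGLE_TTS_SAMPLE_RATE`:

```typescript
// Minimal RIFF/WAVE header for 16-bit mono PCM; a sketch, not the repo's code.
function wrapPcm16MonoToWavSketch(pcm: Buffer, sampleRate = 24_000): Buffer {
  const header = Buffer.alloc(44);
  const byteRate = sampleRate * 2; // 1 channel * 16 bits / 8
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16); // fmt chunk size
  header.writeUInt16LE(1, 20); // audio format: PCM
  header.writeUInt16LE(1, 22); // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32); // block align: channels * bytes per sample
  header.writeUInt16LE(16, 34); // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40); // data chunk size
  return Buffer.concat([header, pcm]);
}
```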

@@ -1,5 +1,6 @@
 export { buildGoogleGeminiCliBackend } from "./cli-backend.js";
 export { buildGoogleImageGenerationProvider } from "./image-generation-provider.js";
 export { buildGoogleMusicGenerationProvider } from "./music-generation-provider.js";
+export { buildGoogleSpeechProvider } from "./speech-provider.js";
 export { googleMediaUnderstandingProvider } from "./media-understanding-provider.js";
 export { buildGoogleVideoGenerationProvider } from "./video-generation-provider.js";

@@ -474,9 +474,11 @@ export function getTtsProvider(config: ResolvedTtsConfig, prefsPath: string): Tt
     return normalizeConfiguredSpeechProviderId(config.provider) ?? config.provider;
   }
 
-  for (const provider of sortSpeechProvidersForAutoSelection()) {
+  const effectiveCfg = config.sourceConfig;
+  for (const provider of sortSpeechProvidersForAutoSelection(effectiveCfg)) {
     if (
       provider.isConfigured({
+        cfg: effectiveCfg,
         providerConfig: config.providerConfigs[provider.id] ?? {},
         timeoutMs: config.timeoutMs,
       })
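This hunk is the core of the cfg-selection fix: auto-selection now threads the resolved config into each provider's `isConfigured`, so a provider like Google that reads a model-provider API key out of `cfg` can be selected. The selection rule itself is just "first configured provider in `autoSelectOrder`". A reduced sketch of that rule, with simplified types:

```typescript
interface SketchProvider {
  id: string;
  autoSelectOrder: number;
  isConfigured: (cfg: Record<string, unknown>) => boolean;
}

// Lower autoSelectOrder wins; the first provider that reports itself
// configured for this cfg is selected.
function autoSelectSketch(
  providers: SketchProvider[],
  cfg: Record<string, unknown>,
): string | undefined {
  return [...providers]
    .sort((a, b) => a.autoSelectOrder - b.autoSelectOrder)
    .find((p) => p.isConfigured(cfg))?.id;
}
```

Without the cfg parameter, a provider whose only credentials live in `cfg` would always report unconfigured and never be auto-selected, which is exactly the bug the test at the end of this diff pins down.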

@@ -55,6 +55,7 @@ export const pluginRegistrationContractCases = {
   pluginId: "google",
   providerIds: ["google", "google-gemini-cli"],
   webSearchProviderIds: ["gemini"],
+  speechProviderIds: ["google"],
   mediaUnderstandingProviderIds: ["google"],
   imageGenerationProviderIds: ["google"],
   requireDescribeImages: true,

@@ -307,7 +307,8 @@ function buildTestMicrosoftSpeechProvider(): SpeechProviderPlugin {
       outputFormat: edgeConfig.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3",
     };
   },
-  isConfigured: () => true,
+  isConfigured: ({ providerConfig }) =>
+    (providerConfig as Record<string, unknown> | undefined)?.enabled !== false,
   synthesize: async () => ({
     audioBuffer: createAudioBuffer(),
     outputFormat: "mp3",

@@ -368,6 +369,32 @@ function buildTestElevenLabsSpeechProvider(): SpeechProviderPlugin {
   };
 }
 
+function buildTestGoogleSpeechProvider(): SpeechProviderPlugin {
+  return {
+    id: "google",
+    label: "Google",
+    autoSelectOrder: 50,
+    resolveConfig: ({ rawConfig }) => resolveTestProviderConfig(rawConfig, "google"),
+    isConfigured: ({ cfg, providerConfig }) =>
+      typeof (providerConfig as Record<string, unknown> | undefined)?.apiKey === "string" ||
+      typeof cfg?.models?.providers?.google?.apiKey === "string" ||
+      typeof process.env.GEMINI_API_KEY === "string" ||
+      typeof process.env.GOOGLE_API_KEY === "string",
+    synthesize: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "wav",
+      fileExtension: ".wav",
+      voiceCompatible: false,
+    }),
+    synthesizeTelephony: async () => ({
+      audioBuffer: createAudioBuffer(),
+      outputFormat: "pcm",
+      sampleRate: 24_000,
+    }),
+    listVoices: async () => [{ id: "Kore", label: "Kore" }],
+  };
+}
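The test double's `isConfigured` mirrors the key-lookup order the real provider uses: explicit TTS provider config first, then the model-provider entry in `cfg`, then the `GEMINI_API_KEY`/`GOOGLE_API_KEY` environment variables. A sketch of that fallback chain — the helper name and option shape here are illustrative, not the repo's `resolveGoogleTtsApiKey` signature:

```typescript
// Hypothetical helper showing the lookup order only; names are illustrative.
function resolveGoogleKeySketch(opts: {
  providerConfig?: { apiKey?: string };
  modelProviderApiKey?: string;
  env?: Record<string, string | undefined>;
}): string | undefined {
  const env = opts.env ?? {};
  return (
    opts.providerConfig?.apiKey ?? // 1. explicit TTS provider config
    opts.modelProviderApiKey ??    // 2. models.providers.google.apiKey
    env.GEMINI_API_KEY ??          // 3. Gemini-specific env var
    env.GOOGLE_API_KEY             // 4. generic Google env var
  );
}
```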
 
 async function loadTtsRuntime(): Promise<TtsRuntimeModule> {
   ttsRuntimePromise ??= import("../../../src/tts/tts.js");
   return await ttsRuntimePromise;

@@ -395,6 +422,7 @@ function setupTestSpeechProviderRegistry() {
     { pluginId: "openai", provider: buildTestOpenAISpeechProvider(), source: "test" },
     { pluginId: "microsoft", provider: buildTestMicrosoftSpeechProvider(), source: "test" },
     { pluginId: "elevenlabs", provider: buildTestElevenLabsSpeechProvider(), source: "test" },
+    { pluginId: "google", provider: buildTestGoogleSpeechProvider(), source: "test" },
   ];
   const { cacheKey } = pluginLoaderTesting.resolvePluginLoadCacheContext({ config: {} });
   setActivePluginRegistry(registry, cacheKey);

@@ -613,6 +641,32 @@ export function describeTtsConfigContract() {
       expect(provider).toBe(testCase.expected);
     });
   });
 
+  it("passes cfg into auto-selection so model-provider Google keys can configure TTS", () => {
+    const cfg = asLegacyOpenClawConfig({
+      agents: { defaults: { model: { primary: "openai/gpt-4o-mini" } } },
+      models: {
+        providers: {
+          google: {
+            apiKey: "model-provider-google-key",
+          },
+        },
+      },
+      messages: {
+        tts: {
+          providers: {
+            microsoft: {
+              enabled: false,
+            },
+          },
+        },
+      },
+    });
+    const config = resolveTtsConfig(cfg);
+    const prefsPath = `/tmp/tts-prefs-google-model-provider-${Date.now()}.json`;
+
+    expect(getTtsProvider(config, prefsPath)).toBe("google");
+  });
 });
 
 describe("resolveTtsConfig provider normalization", () => {