feat(google): add realtime voice provider

Author: Peter Steinberger
Date: 2026-04-24 09:08:09 +01:00
Parent: c138368040
Commit: b5e5f2cede
13 changed files with 1127 additions and 141 deletions


@@ -132,6 +132,7 @@ Choose your preferred auth method and follow the setup steps.
| Image generation | Yes |
| Music generation | Yes |
| Text-to-speech | Yes |
| Realtime voice | Yes (Google Live API) |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -281,6 +282,63 @@ A Google Cloud Console API key restricted to the Gemini API is valid for this
provider. This is not the separate Cloud Text-to-Speech API path.
</Note>
## Realtime voice

The bundled `google` plugin registers a realtime voice provider backed by the
Gemini Live API for backend audio bridges such as Voice Call and Google Meet.

| Setting | Config path | Default |
| --------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| Model | `plugins.entries.voice-call.config.realtime.providers.google.model` | `gemini-2.5-flash-native-audio-preview-12-2025` |
| Voice | `...google.voice` | `Kore` |
| Temperature | `...google.temperature` | (unset) |
| VAD start sensitivity | `...google.startSensitivity` | (unset) |
| VAD end sensitivity | `...google.endSensitivity` | (unset) |
| Silence duration | `...google.silenceDurationMs` | (unset) |
| API key | `...google.apiKey` | Falls back to `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY` |

Example Voice Call realtime config:

```json5
{
plugins: {
entries: {
"voice-call": {
enabled: true,
config: {
realtime: {
enabled: true,
provider: "google",
providers: {
google: {
model: "gemini-2.5-flash-native-audio-preview-12-2025",
voice: "Kore",
},
},
},
},
},
},
},
}
```
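
If `...google.apiKey` is unset, the provider falls back to the shared Google
model key before checking the environment variables. A minimal sketch of that
fallback, setting the key once at `models.providers.google.apiKey` (the key
value shown is an illustrative placeholder):

```json5
{
  // Shared Google key: the realtime voice provider falls back to this when
  // plugins.entries.voice-call.config.realtime.providers.google.apiKey is
  // unset, and then to GEMINI_API_KEY / GOOGLE_API_KEY.
  models: {
    providers: {
      google: {
        apiKey: "AIza-example-key",
      },
    },
  },
}
```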
<Note>
Google Live API uses bidirectional audio and function calling over a WebSocket.
OpenClaw adapts telephony/Meet bridge audio to Gemini's PCM Live API stream and
keeps tool calls on the shared realtime voice contract. Leave `temperature`
unset unless you need sampling changes; OpenClaw omits non-positive values
because Google Live can return transcripts without audio for `temperature: 0`.
Gemini API transcription is enabled without `languageCodes`; the current Google
SDK rejects language-code hints on this API path.
</Note>
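
For reference, a hedged sketch of the optional tuning fields from the table
above, layered into the same `voice-call` entry. The sensitivity values shown
assume the Gemini Live API enum strings (`START_SENSITIVITY_HIGH`,
`END_SENSITIVITY_LOW`); check the accepted values for your plugin version
before relying on them:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          realtime: {
            providers: {
              google: {
                // Optional sampling override; leave unset in most cases
                // (non-positive values are omitted, per the note above).
                temperature: 0.7,
                // VAD tuning -- the enum values here are an assumption based
                // on the Gemini Live API; silenceDurationMs is milliseconds.
                startSensitivity: "START_SENSITIVITY_HIGH",
                endSensitivity: "END_SENSITIVITY_LOW",
                silenceDurationMs: 800,
              },
            },
          },
        },
      },
    },
  },
}
```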
<Note>
Control UI Talk browser sessions still require a realtime voice provider with a
browser WebRTC session implementation. Today that path is OpenAI Realtime; the
Google provider is for backend realtime bridges.
</Note>

## Advanced configuration

<AccordionGroup>


@@ -156,12 +156,14 @@ Cron jobs panel notes:
- `chat.history` also strips display-only inline directive tags from visible assistant text (for example `[[reply_to_*]]` and `[[audio_as_voice]]`), plain-text tool-call XML payloads (including `<tool_call>...</tool_call>`, `<function_call>...</function_call>`, `<tool_calls>...</tool_calls>`, `<function_calls>...</function_calls>`, and truncated tool-call blocks), and leaked ASCII/full-width model control tokens, and omits assistant entries whose whole visible text is only the exact silent token `NO_REPLY` / `no_reply`.
- `chat.inject` appends an assistant note to the session transcript and broadcasts a `chat` event for UI-only updates (no agent run, no channel delivery).
- The chat header model and thinking pickers patch the active session immediately through `sessions.patch`; they are persistent session overrides, not one-turn-only send options.
- Talk mode uses a registered realtime voice provider that supports browser
WebRTC sessions. Configure OpenAI with `talk.provider: "openai"` plus
`talk.providers.openai.apiKey`, or reuse the Voice Call realtime provider
config. The browser never receives the standard OpenAI API key; it receives
only the ephemeral Realtime client secret. Google Live realtime voice is
supported for backend Voice Call and Google Meet bridges, but not this browser
WebRTC path yet. The Realtime session prompt is assembled by the Gateway;
`talk.realtime.session` does not accept caller-provided instruction overrides.
- In the Chat composer, the Talk control is the waves button next to the
microphone dictation button. When Talk starts, the composer status row shows
`Connecting Talk...`, then `Talk live` while audio is connected, or