feat(google): add realtime voice provider

Author: Peter Steinberger
Date: 2026-04-24 09:08:09 +01:00
Parent: c138368040
Commit: b5e5f2cede
13 changed files with 1127 additions and 141 deletions


@@ -132,6 +132,7 @@ Choose your preferred auth method and follow the setup steps.
| Image generation | Yes |
| Music generation | Yes |
| Text-to-speech | Yes |
| Realtime voice | Yes (Google Live API) |
| Image understanding | Yes |
| Audio transcription | Yes |
| Video understanding | Yes |
@@ -281,6 +282,63 @@ A Google Cloud Console API key restricted to the Gemini API is valid for this
provider. This is not the separate Cloud Text-to-Speech API path.
</Note>
## Realtime voice

The bundled `google` plugin registers a realtime voice provider backed by the
Gemini Live API for backend audio bridges such as Voice Call and Google Meet.

| Setting | Config path | Default |
| --------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| Model | `plugins.entries.voice-call.config.realtime.providers.google.model` | `gemini-2.5-flash-native-audio-preview-12-2025` |
| Voice | `...google.voice` | `Kore` |
| Temperature | `...google.temperature` | (unset) |
| VAD start sensitivity | `...google.startSensitivity` | (unset) |
| VAD end sensitivity | `...google.endSensitivity` | (unset) |
| Silence duration | `...google.silenceDurationMs` | (unset) |
| API key | `...google.apiKey` | Falls back to `models.providers.google.apiKey`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY` |

Example Voice Call realtime config:

```json5
{
plugins: {
entries: {
"voice-call": {
enabled: true,
config: {
realtime: {
enabled: true,
provider: "google",
providers: {
google: {
model: "gemini-2.5-flash-native-audio-preview-12-2025",
voice: "Kore",
},
},
},
},
},
},
},
}
```
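
If `...google.apiKey` is unset, the provider falls back to the shared Google
model key before checking the environment variables. A minimal sketch of that
fallback, setting the key once at `models.providers.google.apiKey` (the key
value shown is an illustrative placeholder):

```json5
{
  // Shared Google key: the realtime voice provider falls back to this when
  // plugins.entries.voice-call.config.realtime.providers.google.apiKey is
  // unset, and then to GEMINI_API_KEY / GOOGLE_API_KEY.
  models: {
    providers: {
      google: {
        apiKey: "AIza-example-key",
      },
    },
  },
}
```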
<Note>
Google Live API uses bidirectional audio and function calling over a WebSocket.
OpenClaw adapts telephony/Meet bridge audio to Gemini's PCM Live API stream and
keeps tool calls on the shared realtime voice contract. Leave `temperature`
unset unless you need sampling changes; OpenClaw omits non-positive values
because Google Live can return transcripts without audio for `temperature: 0`.
Gemini API transcription is enabled without `languageCodes`; the current Google
SDK rejects language-code hints on this API path.
</Note>
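
For reference, a hedged sketch of the optional tuning fields from the table
above, layered into the same `voice-call` entry. The sensitivity values shown
assume the Gemini Live API enum strings (`START_SENSITIVITY_HIGH`,
`END_SENSITIVITY_LOW`); check the accepted values for your plugin version
before relying on them:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          realtime: {
            providers: {
              google: {
                // Optional sampling override; leave unset in most cases
                // (non-positive values are omitted, per the note above).
                temperature: 0.7,
                // VAD tuning -- the enum values here are an assumption based
                // on the Gemini Live API; silenceDurationMs is milliseconds.
                startSensitivity: "START_SENSITIVITY_HIGH",
                endSensitivity: "END_SENSITIVITY_LOW",
                silenceDurationMs: 800,
              },
            },
          },
        },
      },
    },
  },
}
```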
<Note>
Control UI Talk browser sessions still require a realtime voice provider with a
browser WebRTC session implementation. Today that path is OpenAI Realtime; the
Google provider is for backend realtime bridges.
</Note>

## Advanced configuration

<AccordionGroup>


@@ -156,12 +156,14 @@ Cron jobs panel notes:
- `chat.history` also strips display-only inline directive tags from visible assistant text (for example `[[reply_to_*]]` and `[[audio_as_voice]]`), plain-text tool-call XML payloads (including `<tool_call>...</tool_call>`, `<function_call>...</function_call>`, `<tool_calls>...</tool_calls>`, `<function_calls>...</function_calls>`, and truncated tool-call blocks), and leaked ASCII/full-width model control tokens, and omits assistant entries whose whole visible text is only the exact silent token `NO_REPLY` / `no_reply`.
- `chat.inject` appends an assistant note to the session transcript and broadcasts a `chat` event for UI-only updates (no agent run, no channel delivery).
- The chat header model and thinking pickers patch the active session immediately through `sessions.patch`; they are persistent session overrides, not one-turn-only send options.
- Talk mode uses a registered realtime voice provider that supports browser
WebRTC sessions. Configure OpenAI with `talk.provider: "openai"` plus
`talk.providers.openai.apiKey`, or reuse the Voice Call realtime provider
config. The browser never receives the standard OpenAI API key; it receives
only the ephemeral Realtime client secret. Google Live realtime voice is
supported for backend Voice Call and Google Meet bridges, but not this browser
WebRTC path yet. The Realtime session prompt is assembled by the Gateway;
`talk.realtime.session` does not accept caller-provided instruction overrides.
- In the Chat composer, the Talk control is the waves button next to the
microphone dictation button. When Talk starts, the composer status row shows
`Connecting Talk...`, then `Talk live` while audio is connected, or