From e5f55dd024fc4823bc7b673b20fe55a5115ae6f6 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Fri, 24 Apr 2026 10:14:19 +0100 Subject: [PATCH] docs: document Google realtime voice support --- docs/plugins/google-meet.md | 19 ++++++-- docs/plugins/voice-call.md | 95 ++++++++++++++++++++++++++++++++++++ docs/tools/media-overview.md | 52 ++++++++++---------- 3 files changed, 138 insertions(+), 28 deletions(-) diff --git a/docs/plugins/google-meet.md b/docs/plugins/google-meet.md index 96a9a9b8e4d..d653966a5bd 100644 --- a/docs/plugins/google-meet.md +++ b/docs/plugins/google-meet.md @@ -26,12 +26,15 @@ The plugin is explicit by design: ## Quick start -Install the local audio dependencies and make sure the realtime provider can use -OpenAI: +Install the local audio dependencies and configure a backend realtime voice +provider. OpenAI is the default; Google Gemini Live also works with +`realtime.provider: "google"`: ```bash brew install blackhole-2ch sox export OPENAI_API_KEY=sk-... +# or +export GEMINI_API_KEY=... ``` `blackhole-2ch` installs the `BlackHole 2ch` virtual audio device. Homebrew's @@ -319,11 +322,14 @@ Workspace Developer Preview Program for Meet media APIs. ## Config The common Chrome realtime path only needs the plugin enabled, BlackHole, SoX, -and an OpenAI key: +and a backend realtime voice provider key. OpenAI is the default; set +`realtime.provider: "google"` to use Google Gemini Live: ```bash brew install blackhole-2ch sox export OPENAI_API_KEY=sk-... +# or +export GEMINI_API_KEY=... 
``` Set the plugin config under `plugins.entries.google-meet.config`: @@ -372,8 +378,15 @@ Optional overrides: node: "parallels-macos", }, realtime: { + provider: "google", toolPolicy: "owner", introMessage: "Say exactly: I'm here.", + providers: { + google: { + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, }, } ``` diff --git a/docs/plugins/voice-call.md b/docs/plugins/voice-call.md index 70fd8fd2518..dc686b5137e 100644 --- a/docs/plugins/voice-call.md +++ b/docs/plugins/voice-call.md @@ -122,6 +122,17 @@ Set config under `plugins.entries.voice-call.config`: maxPendingConnectionsPerIp: 4, maxConnections: 128, }, + + realtime: { + enabled: false, + provider: "google", // optional; first registered realtime voice provider when unset + providers: { + google: { + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, + }, }, }, }, @@ -140,6 +151,7 @@ Notes: - If you use ngrok free tier, set `publicUrl` to the exact ngrok URL; signature verification is always enforced. - `tunnel.allowNgrokFreeTierLoopbackBypass: true` allows Twilio webhooks with invalid signatures **only** when `tunnel.provider="ngrok"` and `serve.bind` is loopback (ngrok local agent). Use for local dev only. - Ngrok free tier URLs can change or add interstitial behavior; if `publicUrl` drifts, Twilio signatures will fail. For production, prefer a stable domain or Tailscale funnel. +- `realtime.enabled` starts full voice-to-voice conversations; do not enable it together with `streaming.enabled`. - Streaming security defaults: - `streaming.preStartTimeoutMs` closes sockets that never send a valid `start` frame. - `streaming.maxPendingConnections` caps total unauthenticated pre-start sockets. @@ -147,6 +159,89 @@ Notes: - `streaming.maxConnections` caps total open media stream sockets (pending + active). 
 - Runtime fallback still accepts those old voice-call keys for now, but the rewrite path is `openclaw doctor --fix` and the compat shim is temporary.
+## Realtime voice conversations
+
+`realtime` selects a full-duplex realtime voice provider for live call audio.
+It is separate from `streaming`, which only forwards audio to realtime
+transcription providers.
+
+Current runtime behavior:
+
+- `realtime.enabled` is supported for Twilio Media Streams.
+- `realtime.enabled` cannot be combined with `streaming.enabled`.
+- `realtime.provider` is optional. If unset, Voice Call uses the first
+  registered realtime voice provider.
+- Bundled realtime voice providers include Google Gemini Live (`google`) and
+  OpenAI (`openai`), registered by their provider plugins.
+- Provider-owned raw config lives under `realtime.providers.<provider>`.
+- If `realtime.provider` points at an unregistered provider, or no realtime
+  voice provider is registered at all, Voice Call logs a warning and skips
+  realtime media instead of failing the whole plugin.
+ +Google Gemini Live realtime defaults: + +- API key: `realtime.providers.google.apiKey`, `GEMINI_API_KEY`, or + `GOOGLE_GENERATIVE_AI_API_KEY` +- model: `gemini-2.5-flash-native-audio-preview-12-2025` +- voice: `Kore` + +Example: + +```json5 +{ + plugins: { + entries: { + "voice-call": { + config: { + provider: "twilio", + inboundPolicy: "allowlist", + allowFrom: ["+15550005678"], + realtime: { + enabled: true, + provider: "google", + instructions: "Speak briefly and ask before using tools.", + providers: { + google: { + apiKey: "${GEMINI_API_KEY}", + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, + }, + }, + }, + }, + }, +} +``` + +Use OpenAI instead: + +```json5 +{ + plugins: { + entries: { + "voice-call": { + config: { + realtime: { + enabled: true, + provider: "openai", + providers: { + openai: { + apiKey: "${OPENAI_API_KEY}", + }, + }, + }, + }, + }, + }, + }, +} +``` + +See [Google provider](/providers/google) and [OpenAI provider](/providers/openai) +for provider-specific realtime voice options. + ## Streaming transcription `streaming` selects a realtime transcription provider for live call audio. 
diff --git a/docs/tools/media-overview.md b/docs/tools/media-overview.md index d53d67b2482..ffbb5784ecc 100644 --- a/docs/tools/media-overview.md +++ b/docs/tools/media-overview.md @@ -18,31 +18,31 @@ OpenClaw generates images, videos, and music, understands inbound media (images, | Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references | | Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos | | Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts | -| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio | +| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio | | Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video | ## Provider capability matrix This table shows which providers support which media capabilities across the platform. 
-| Provider | Image | Video | Music | TTS | STT / Transcription | Media Understanding | -| ---------- | ----- | ----- | ----- | --- | ------------------- | ------------------- | -| Alibaba | | Yes | | | | | -| BytePlus | | Yes | | | | | -| ComfyUI | Yes | Yes | Yes | | | | -| Deepgram | | | | | Yes | | -| ElevenLabs | | | | Yes | Yes | | -| fal | Yes | Yes | | | | | -| Google | Yes | Yes | Yes | | | Yes | -| Microsoft | | | | Yes | | | -| MiniMax | Yes | Yes | Yes | Yes | | | -| Mistral | | | | | Yes | | -| OpenAI | Yes | Yes | | Yes | Yes | Yes | -| Qwen | | Yes | | | | | -| Runway | | Yes | | | | | -| Together | | Yes | | | | | -| Vydra | Yes | Yes | | | | | -| xAI | Yes | Yes | | Yes | Yes | Yes | +| Provider | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding | +| ---------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- | +| Alibaba | | Yes | | | | | | +| BytePlus | | Yes | | | | | | +| ComfyUI | Yes | Yes | Yes | | | | | +| Deepgram | | | | | Yes | | | +| ElevenLabs | | | | Yes | Yes | | | +| fal | Yes | Yes | | | | | | +| Google | Yes | Yes | Yes | Yes | | Yes | Yes | +| Microsoft | | | | Yes | | | | +| MiniMax | Yes | Yes | Yes | Yes | | | | +| Mistral | | | | | Yes | | | +| OpenAI | Yes | Yes | | Yes | Yes | Yes | Yes | +| Qwen | | Yes | | | | | | +| Runway | | Yes | | | | | | +| Together | | Yes | | | | | | +| Vydra | Yes | Yes | | | | | | +| xAI | Yes | Yes | | Yes | Yes | | Yes | Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model. 
@@ -58,12 +58,14 @@ ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording. -OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call -streaming STT, realtime voice, and memory embedding surfaces. xAI currently -maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT, -and Voice Call streaming STT surfaces. xAI Realtime voice is an upstream -capability, but it is not registered in OpenClaw until the shared realtime -voice contract can represent it. +Google maps to OpenClaw's image, video, music, batch TTS, backend realtime +voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image, +video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice, +and memory embedding surfaces. xAI currently maps to OpenClaw's image, video, +search, code-execution, batch TTS, batch STT, and Voice Call streaming STT +surfaces. xAI Realtime voice is an upstream capability, but it is not +registered in OpenClaw until the shared realtime voice contract can represent +it. ## Quick links