From e5f55dd024fc4823bc7b673b20fe55a5115ae6f6 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Fri, 24 Apr 2026 10:14:19 +0100 Subject: [PATCH] docs: document Google realtime voice support --- docs/plugins/google-meet.md | 19 ++++++-- docs/plugins/voice-call.md | 95 ++++++++++++++++++++++++++++++++++++ docs/tools/media-overview.md | 52 ++++++++++---------- 3 files changed, 138 insertions(+), 28 deletions(-) diff --git a/docs/plugins/google-meet.md b/docs/plugins/google-meet.md index 96a9a9b8e4d..d653966a5bd 100644 --- a/docs/plugins/google-meet.md +++ b/docs/plugins/google-meet.md @@ -26,12 +26,15 @@ The plugin is explicit by design: ## Quick start -Install the local audio dependencies and make sure the realtime provider can use -OpenAI: +Install the local audio dependencies and configure a backend realtime voice +provider. OpenAI is the default; Google Gemini Live also works with +`realtime.provider: "google"`: ```bash brew install blackhole-2ch sox export OPENAI_API_KEY=sk-... +# or +export GEMINI_API_KEY=... ``` `blackhole-2ch` installs the `BlackHole 2ch` virtual audio device. Homebrew's @@ -319,11 +322,14 @@ Workspace Developer Preview Program for Meet media APIs. ## Config The common Chrome realtime path only needs the plugin enabled, BlackHole, SoX, -and an OpenAI key: +and a backend realtime voice provider key. OpenAI is the default; set +`realtime.provider: "google"` to use Google Gemini Live: ```bash brew install blackhole-2ch sox export OPENAI_API_KEY=sk-... +# or +export GEMINI_API_KEY=... 
``` Set the plugin config under `plugins.entries.google-meet.config`: @@ -372,8 +378,15 @@ Optional overrides: node: "parallels-macos", }, realtime: { + provider: "google", toolPolicy: "owner", introMessage: "Say exactly: I'm here.", + providers: { + google: { + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, }, } ``` diff --git a/docs/plugins/voice-call.md b/docs/plugins/voice-call.md index 70fd8fd2518..dc686b5137e 100644 --- a/docs/plugins/voice-call.md +++ b/docs/plugins/voice-call.md @@ -122,6 +122,17 @@ Set config under `plugins.entries.voice-call.config`: maxPendingConnectionsPerIp: 4, maxConnections: 128, }, + + realtime: { + enabled: false, + provider: "google", // optional; first registered realtime voice provider when unset + providers: { + google: { + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, + }, }, }, }, @@ -140,6 +151,7 @@ Notes: - If you use ngrok free tier, set `publicUrl` to the exact ngrok URL; signature verification is always enforced. - `tunnel.allowNgrokFreeTierLoopbackBypass: true` allows Twilio webhooks with invalid signatures **only** when `tunnel.provider="ngrok"` and `serve.bind` is loopback (ngrok local agent). Use for local dev only. - Ngrok free tier URLs can change or add interstitial behavior; if `publicUrl` drifts, Twilio signatures will fail. For production, prefer a stable domain or Tailscale funnel. +- `realtime.enabled` starts full voice-to-voice conversations; do not enable it together with `streaming.enabled`. - Streaming security defaults: - `streaming.preStartTimeoutMs` closes sockets that never send a valid `start` frame. - `streaming.maxPendingConnections` caps total unauthenticated pre-start sockets. @@ -147,6 +159,89 @@ Notes: - `streaming.maxConnections` caps total open media stream sockets (pending + active). 
 - Runtime fallback still accepts those old voice-call keys for now, but the rewrite path is `openclaw doctor --fix` and the compat shim is temporary.
+## Realtime voice conversations
+
+`realtime` selects a full-duplex realtime voice provider for live call audio.
+It is separate from `streaming`, which only forwards audio to realtime
+transcription providers.
+
+Current runtime behavior:
+
+- `realtime.enabled` is supported for Twilio Media Streams.
+- `realtime.enabled` cannot be combined with `streaming.enabled`.
+- `realtime.provider` is optional. If unset, Voice Call uses the first
+  registered realtime voice provider.
+- Bundled realtime voice providers include Google Gemini Live (`google`) and
+  OpenAI (`openai`), registered by their provider plugins.
+- Provider-owned raw config lives under `realtime.providers.<provider>`.
+- If `realtime.provider` points at an unregistered provider, or no realtime
+  voice provider is registered at all, Voice Call logs a warning and skips
+  realtime media instead of failing the whole plugin.
+ +Google Gemini Live realtime defaults: + +- API key: `realtime.providers.google.apiKey`, `GEMINI_API_KEY`, or + `GOOGLE_GENERATIVE_AI_API_KEY` +- model: `gemini-2.5-flash-native-audio-preview-12-2025` +- voice: `Kore` + +Example: + +```json5 +{ + plugins: { + entries: { + "voice-call": { + config: { + provider: "twilio", + inboundPolicy: "allowlist", + allowFrom: ["+15550005678"], + realtime: { + enabled: true, + provider: "google", + instructions: "Speak briefly and ask before using tools.", + providers: { + google: { + apiKey: "${GEMINI_API_KEY}", + model: "gemini-2.5-flash-native-audio-preview-12-2025", + voice: "Kore", + }, + }, + }, + }, + }, + }, + }, +} +``` + +Use OpenAI instead: + +```json5 +{ + plugins: { + entries: { + "voice-call": { + config: { + realtime: { + enabled: true, + provider: "openai", + providers: { + openai: { + apiKey: "${OPENAI_API_KEY}", + }, + }, + }, + }, + }, + }, + }, +} +``` + +See [Google provider](/providers/google) and [OpenAI provider](/providers/openai) +for provider-specific realtime voice options. + ## Streaming transcription `streaming` selects a realtime transcription provider for live call audio. 
diff --git a/docs/tools/media-overview.md b/docs/tools/media-overview.md index d53d67b2482..ffbb5784ecc 100644 --- a/docs/tools/media-overview.md +++ b/docs/tools/media-overview.md @@ -18,31 +18,31 @@ OpenClaw generates images, videos, and music, understands inbound media (images, | Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references | | Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos | | Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts | -| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio | +| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio | | Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video | ## Provider capability matrix This table shows which providers support which media capabilities across the platform. 
-| Provider | Image | Video | Music | TTS | STT / Transcription | Media Understanding | -| ---------- | ----- | ----- | ----- | --- | ------------------- | ------------------- | -| Alibaba | | Yes | | | | | -| BytePlus | | Yes | | | | | -| ComfyUI | Yes | Yes | Yes | | | | -| Deepgram | | | | | Yes | | -| ElevenLabs | | | | Yes | Yes | | -| fal | Yes | Yes | | | | | -| Google | Yes | Yes | Yes | | | Yes | -| Microsoft | | | | Yes | | | -| MiniMax | Yes | Yes | Yes | Yes | | | -| Mistral | | | | | Yes | | -| OpenAI | Yes | Yes | | Yes | Yes | Yes | -| Qwen | | Yes | | | | | -| Runway | | Yes | | | | | -| Together | | Yes | | | | | -| Vydra | Yes | Yes | | | | | -| xAI | Yes | Yes | | Yes | Yes | Yes | +| Provider | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding | +| ---------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- | +| Alibaba | | Yes | | | | | | +| BytePlus | | Yes | | | | | | +| ComfyUI | Yes | Yes | Yes | | | | | +| Deepgram | | | | | Yes | | | +| ElevenLabs | | | | Yes | Yes | | | +| fal | Yes | Yes | | | | | | +| Google | Yes | Yes | Yes | Yes | | Yes | Yes | +| Microsoft | | | | Yes | | | | +| MiniMax | Yes | Yes | Yes | Yes | | | | +| Mistral | | | | | Yes | | | +| OpenAI | Yes | Yes | | Yes | Yes | Yes | Yes | +| Qwen | | Yes | | | | | | +| Runway | | Yes | | | | | | +| Together | | Yes | | | | | | +| Vydra | Yes | Yes | | | | | | +| xAI | Yes | Yes | | Yes | Yes | | Yes | Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model. 
@@ -58,12 +58,14 @@ ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording. -OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call -streaming STT, realtime voice, and memory embedding surfaces. xAI currently -maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT, -and Voice Call streaming STT surfaces. xAI Realtime voice is an upstream -capability, but it is not registered in OpenClaw until the shared realtime -voice contract can represent it. +Google maps to OpenClaw's image, video, music, batch TTS, backend realtime +voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image, +video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice, +and memory embedding surfaces. xAI currently maps to OpenClaw's image, video, +search, code-execution, batch TTS, batch STT, and Voice Call streaming STT +surfaces. xAI Realtime voice is an upstream capability, but it is not +registered in OpenClaw until the shared realtime voice contract can represent +it. ## Quick links