docs: document Google realtime voice support

Peter Steinberger
2026-04-24 10:14:19 +01:00
parent 6831579267
commit e5f55dd024
3 changed files with 138 additions and 28 deletions


@@ -26,12 +26,15 @@ The plugin is explicit by design:
## Quick start
Install the local audio dependencies and configure a backend realtime voice
provider. OpenAI is the default; Google Gemini Live also works with
`realtime.provider: "google"`:
```bash
brew install blackhole-2ch sox
export OPENAI_API_KEY=sk-...
# or
export GEMINI_API_KEY=...
```
`blackhole-2ch` installs the `BlackHole 2ch` virtual audio device. Homebrew's
@@ -319,11 +322,14 @@ Workspace Developer Preview Program for Meet media APIs.
## Config
The common Chrome realtime path only needs the plugin enabled, BlackHole, SoX,
and a backend realtime voice provider key. OpenAI is the default; set
`realtime.provider: "google"` to use Google Gemini Live:
```bash
brew install blackhole-2ch sox
export OPENAI_API_KEY=sk-...
# or
export GEMINI_API_KEY=...
```
Set the plugin config under `plugins.entries.google-meet.config`:
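For instance, a minimal sketch that only switches the realtime provider to Google (all other settings omitted; OpenAI is used when `provider` is unset):
```json5
{
  plugins: {
    entries: {
      "google-meet": {
        config: {
          realtime: {
            provider: "google", // OpenAI is the default when unset
          },
        },
      },
    },
  },
}
```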
@@ -372,8 +378,15 @@ Optional overrides:
node: "parallels-macos",
},
realtime: {
provider: "google",
toolPolicy: "owner",
introMessage: "Say exactly: I'm here.",
providers: {
google: {
model: "gemini-2.5-flash-native-audio-preview-12-2025",
voice: "Kore",
},
},
},
}
```


@@ -122,6 +122,17 @@ Set config under `plugins.entries.voice-call.config`:
maxPendingConnectionsPerIp: 4,
maxConnections: 128,
},
realtime: {
enabled: false,
provider: "google", // optional; defaults to the first registered realtime voice provider
providers: {
google: {
model: "gemini-2.5-flash-native-audio-preview-12-2025",
voice: "Kore",
},
},
},
},
},
},
@@ -140,6 +151,7 @@ Notes:
- If you use ngrok free tier, set `publicUrl` to the exact ngrok URL; signature verification is always enforced.
- `tunnel.allowNgrokFreeTierLoopbackBypass: true` allows Twilio webhooks with invalid signatures **only** when `tunnel.provider="ngrok"` and `serve.bind` is loopback (ngrok local agent). Use for local dev only.
- Ngrok free tier URLs can change or add interstitial behavior; if `publicUrl` drifts, Twilio signatures will fail. For production, prefer a stable domain or Tailscale funnel.
- `realtime.enabled` starts full voice-to-voice conversations; do not enable it together with `streaming.enabled`.
- Streaming security defaults:
- `streaming.preStartTimeoutMs` closes sockets that never send a valid `start` frame.
- `streaming.maxPendingConnections` caps total unauthenticated pre-start sockets.
@@ -147,6 +159,89 @@ Notes:
- `streaming.maxConnections` caps total open media stream sockets (pending + active).
- Runtime fallback still accepts those old voice-call keys for now, but the rewrite path is `openclaw doctor --fix` and the compat shim is temporary.
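The mutual exclusivity noted above can be sketched as a minimal config fragment (other keys omitted):
```json5
{
  realtime: { enabled: true },   // full voice-to-voice conversations
  streaming: { enabled: false }, // must stay off while realtime is enabled
}
```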
## Realtime voice conversations
`realtime` selects a full-duplex realtime voice provider for live call audio.
It is separate from `streaming`, which only forwards audio to realtime
transcription providers.
Current runtime behavior:
- `realtime.enabled` is supported for Twilio Media Streams.
- `realtime.enabled` cannot be combined with `streaming.enabled`.
- `realtime.provider` is optional. If unset, Voice Call uses the first
registered realtime voice provider.
- Bundled realtime voice providers include Google Gemini Live (`google`) and
OpenAI (`openai`), registered by their provider plugins.
- Provider-owned raw config lives under `realtime.providers.<providerId>`.
- If `realtime.provider` points at an unregistered provider, or no realtime
voice provider is registered at all, Voice Call logs a warning and skips
realtime media instead of failing the whole plugin.
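As a sketch of the fallback behavior, omitting `provider` selects the first registered realtime voice provider (fragment only; surrounding plugin config omitted):
```json5
realtime: {
  enabled: true,
  // provider omitted: the first registered realtime voice provider is used
},
```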
Google Gemini Live realtime defaults:
- API key: `realtime.providers.google.apiKey`, `GEMINI_API_KEY`, or
`GOOGLE_GENERATIVE_AI_API_KEY`
- model: `gemini-2.5-flash-native-audio-preview-12-2025`
- voice: `Kore`
Example:
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
provider: "twilio",
inboundPolicy: "allowlist",
allowFrom: ["+15550005678"],
realtime: {
enabled: true,
provider: "google",
instructions: "Speak briefly and ask before using tools.",
providers: {
google: {
apiKey: "${GEMINI_API_KEY}",
model: "gemini-2.5-flash-native-audio-preview-12-2025",
voice: "Kore",
},
},
},
},
},
},
},
}
```
Use OpenAI instead:
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
realtime: {
enabled: true,
provider: "openai",
providers: {
openai: {
apiKey: "${OPENAI_API_KEY}",
},
},
},
},
},
},
},
}
```
See [Google provider](/providers/google) and [OpenAI provider](/providers/openai)
for provider-specific realtime voice options.
## Streaming transcription
`streaming` selects a realtime transcription provider for live call audio.


@@ -18,31 +18,31 @@ OpenClaw generates images, videos, and music, understands inbound media (images,
| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |
## Provider capability matrix
This table shows which providers support which media capabilities across the platform.
| Provider | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding |
| ---------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- |
| Alibaba | | Yes | | | | | |
| BytePlus | | Yes | | | | | |
| ComfyUI | Yes | Yes | Yes | | | | |
| Deepgram | | | | | Yes | | |
| ElevenLabs | | | | Yes | Yes | | |
| fal | Yes | Yes | | | | | |
| Google | Yes | Yes | Yes | Yes | | Yes | Yes |
| Microsoft | | | | Yes | | | |
| MiniMax | Yes | Yes | Yes | Yes | | | |
| Mistral | | | | | Yes | | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes | Yes |
| Qwen | | Yes | | | | | |
| Runway | | Yes | | | | | |
| Together | | Yes | | | | | |
| Vydra | Yes | Yes | | | | | |
| xAI | Yes | Yes | | Yes | Yes | | Yes |
<Note>
Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
@@ -58,12 +58,14 @@ ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT
providers, so live phone audio can be forwarded to the selected vendor
without waiting for a completed recording.
Google maps to OpenClaw's image, video, music, batch TTS, backend realtime
voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image,
video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice,
and memory embedding surfaces. xAI currently maps to OpenClaw's image, video,
search, code-execution, batch TTS, batch STT, and Voice Call streaming STT
surfaces. xAI Realtime voice is an upstream capability, but it is not
registered in OpenClaw until the shared realtime voice contract can represent
it.
## Quick links