mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 06:30:42 +00:00
docs: document Google realtime voice support
@@ -26,12 +26,15 @@ The plugin is explicit by design:

## Quick start

Install the local audio dependencies and configure a backend realtime voice
provider. OpenAI is the default; Google Gemini Live also works with
`realtime.provider: "google"`:

```bash
brew install blackhole-2ch sox
export OPENAI_API_KEY=sk-...
# or
export GEMINI_API_KEY=...
```

`blackhole-2ch` installs the `BlackHole 2ch` virtual audio device. Homebrew's

@@ -319,11 +322,14 @@ Workspace Developer Preview Program for Meet media APIs.

## Config

The common Chrome realtime path only needs the plugin enabled, BlackHole, SoX,
and a backend realtime voice provider key. OpenAI is the default; set
`realtime.provider: "google"` to use Google Gemini Live:

```bash
brew install blackhole-2ch sox
export OPENAI_API_KEY=sk-...
# or
export GEMINI_API_KEY=...
```

Set the plugin config under `plugins.entries.google-meet.config`:

@@ -372,8 +378,15 @@ Optional overrides:

    node: "parallels-macos",
  },
  realtime: {
    provider: "google",
    toolPolicy: "owner",
    introMessage: "Say exactly: I'm here.",
    providers: {
      google: {
        model: "gemini-2.5-flash-native-audio-preview-12-2025",
        voice: "Kore",
      },
    },
  },
}
```

@@ -122,6 +122,17 @@ Set config under `plugins.entries.voice-call.config`:

        maxPendingConnectionsPerIp: 4,
        maxConnections: 128,
      },

      realtime: {
        enabled: false,
        provider: "google", // optional; defaults to the first registered realtime voice provider
        providers: {
          google: {
            model: "gemini-2.5-flash-native-audio-preview-12-2025",
            voice: "Kore",
          },
        },
      },
    },
  },
},

@@ -140,6 +151,7 @@ Notes:

- If you use the ngrok free tier, set `publicUrl` to the exact ngrok URL; signature verification is always enforced.
- `tunnel.allowNgrokFreeTierLoopbackBypass: true` allows Twilio webhooks with invalid signatures **only** when `tunnel.provider="ngrok"` and `serve.bind` is loopback (ngrok local agent). Use for local dev only.
- Ngrok free-tier URLs can change or add interstitial behavior; if `publicUrl` drifts, Twilio signature verification will fail. For production, prefer a stable domain or a Tailscale funnel.
- `realtime.enabled` starts full voice-to-voice conversations; do not enable it together with `streaming.enabled`.
- Streaming security defaults:
  - `streaming.preStartTimeoutMs` closes sockets that never send a valid `start` frame.
  - `streaming.maxPendingConnections` caps total unauthenticated pre-start sockets.

@@ -147,6 +159,89 @@ Notes:

  - `streaming.maxConnections` caps total open media stream sockets (pending + active).
- Runtime fallback still accepts the old voice-call keys for now, but the rewrite path is `openclaw doctor --fix` and the compat shim is temporary.

## Realtime voice conversations

`realtime` selects a full-duplex realtime voice provider for live call audio.
It is separate from `streaming`, which only forwards audio to realtime
transcription providers.
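
Because the two features are mutually exclusive, the split shows up directly in the voice-call config; a minimal sketch (values illustrative, keys as used elsewhere on this page):

```json5
// Full voice-to-voice conversation:
realtime: { enabled: true },
// Transcription-only forwarding; never enable both at once:
streaming: { enabled: false },
```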

Current runtime behavior:

- `realtime.enabled` is supported for Twilio Media Streams.
- `realtime.enabled` cannot be combined with `streaming.enabled`.
- `realtime.provider` is optional. If unset, Voice Call uses the first
  registered realtime voice provider.
- Bundled realtime voice providers include Google Gemini Live (`google`) and
  OpenAI (`openai`), registered by their provider plugins.
- Provider-owned raw config lives under `realtime.providers.<providerId>`.
- If `realtime.provider` points at an unregistered provider, or no realtime
  voice provider is registered at all, Voice Call logs a warning and skips
  realtime media instead of failing the whole plugin.
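
Those fallback rules mean the smallest valid setup omits the provider entirely; a sketch relying on the documented defaults:

```json5
realtime: {
  enabled: true,
  // provider omitted: the first registered realtime voice provider is used
},
```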

Google Gemini Live realtime defaults:

- API key: `realtime.providers.google.apiKey`, `GEMINI_API_KEY`, or
  `GOOGLE_GENERATIVE_AI_API_KEY`
- model: `gemini-2.5-flash-native-audio-preview-12-2025`
- voice: `Kore`
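
The API key lookup order above (config value first, then the two environment variables) can be sketched as a small shell helper; `resolve_gemini_key` is a hypothetical name for illustration, not part of OpenClaw:

```shell
# Hypothetical helper mirroring the documented lookup order:
# config apiKey, then GEMINI_API_KEY, then GOOGLE_GENERATIVE_AI_API_KEY.
resolve_gemini_key() {
  config_key="$1"   # value of realtime.providers.google.apiKey, if any
  if [ -n "$config_key" ]; then
    echo "$config_key"
  elif [ -n "$GEMINI_API_KEY" ]; then
    echo "$GEMINI_API_KEY"
  else
    echo "$GOOGLE_GENERATIVE_AI_API_KEY"
  fi
}

GEMINI_API_KEY="env-key"
resolve_gemini_key "config-key"   # prints config-key
resolve_gemini_key ""             # prints env-key
```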

Example:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          provider: "twilio",
          inboundPolicy: "allowlist",
          allowFrom: ["+15550005678"],
          realtime: {
            enabled: true,
            provider: "google",
            instructions: "Speak briefly and ask before using tools.",
            providers: {
              google: {
                apiKey: "${GEMINI_API_KEY}",
                model: "gemini-2.5-flash-native-audio-preview-12-2025",
                voice: "Kore",
              },
            },
          },
        },
      },
    },
  },
}
```

Use OpenAI instead:

```json5
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          realtime: {
            enabled: true,
            provider: "openai",
            providers: {
              openai: {
                apiKey: "${OPENAI_API_KEY}",
              },
            },
          },
        },
      },
    },
  },
}
```

See [Google provider](/providers/google) and [OpenAI provider](/providers/openai)
for provider-specific realtime voice options.

## Streaming transcription

`streaming` selects a realtime transcription provider for live call audio.

@@ -18,31 +18,31 @@ OpenClaw generates images, videos, and music, understands inbound media (images,

| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |

## Provider capability matrix

This table shows which providers support which media capabilities across the platform.

| Provider   | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding |
| ---------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- |
| Alibaba    |       | Yes   |       |     |                     |                |                     |
| BytePlus   |       | Yes   |       |     |                     |                |                     |
| ComfyUI    | Yes   | Yes   | Yes   |     |                     |                |                     |
| Deepgram   |       |       |       |     | Yes                 |                |                     |
| ElevenLabs |       |       |       | Yes | Yes                 |                |                     |
| fal        | Yes   | Yes   |       |     |                     |                |                     |
| Google     | Yes   | Yes   | Yes   | Yes |                     | Yes            | Yes                 |
| Microsoft  |       |       |       | Yes |                     |                |                     |
| MiniMax    | Yes   | Yes   | Yes   | Yes |                     |                |                     |
| Mistral    |       |       |       |     | Yes                 |                |                     |
| OpenAI     | Yes   | Yes   |       | Yes | Yes                 | Yes            | Yes                 |
| Qwen       |       | Yes   |       |     |                     |                |                     |
| Runway     |       | Yes   |       |     |                     |                |                     |
| Together   |       | Yes   |       |     |                     |                |                     |
| Vydra      | Yes   | Yes   |       |     |                     |                |                     |
| xAI        | Yes   | Yes   |       | Yes | Yes                 |                | Yes                 |

<Note>
Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.

@@ -58,12 +58,14 @@ ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT
providers, so live phone audio can be forwarded to the selected vendor
without waiting for a completed recording.

Google maps to OpenClaw's image, video, music, batch TTS, backend realtime
voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image,
video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice,
and memory embedding surfaces. xAI currently maps to OpenClaw's image, video,
search, code-execution, batch TTS, batch STT, and Voice Call streaming STT
surfaces. xAI realtime voice is an upstream capability, but it is not
registered in OpenClaw until the shared realtime voice contract can represent
it.
## Quick links