docs: outline unified talk API

Peter Steinberger
2026-05-05 20:59:13 +01:00
parent 1f7d0ef310
commit 24853ced11
13 changed files with 625 additions and 13 deletions


@@ -197,6 +197,9 @@ Node commands must pass two gates before they can be invoked:
Windows and macOS companion nodes allow safe declared commands such as
`canvas.*`, `camera.list`, `location.get`, and `screen.snapshot` by default.
+Trusted nodes that advertise the `talk` capability or declare `talk.*` commands
+also allow declared push-to-talk commands (`talk.ptt.start`, `talk.ptt.stop`,
+`talk.ptt.cancel`, `talk.ptt.once`) by default, independent of platform label.
Dangerous or privacy-heavy commands such as `camera.snap`, `camera.clip`, and
`screen.record` still require explicit opt-in with
`gateway.nodes.allowCommands`. `gateway.nodes.denyCommands` always wins over
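The opt-in and deny precedence described above might look like this (the `gateway.nodes.allowCommands` / `denyCommands` key names and command names are from this page; the list-of-command-names value shape is an assumption):

```json5
{
  gateway: {
    nodes: {
      // explicit opt-in for dangerous or privacy-heavy commands
      allowCommands: ["camera.snap", "screen.record"],
      // denyCommands always wins over allowCommands
      denyCommands: ["camera.clip"],
    },
  },
}
```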


@@ -1,18 +1,28 @@
---
-summary: "Talk mode: continuous speech conversations with configured TTS providers"
+summary: "Talk mode: continuous speech conversations across local STT/TTS and realtime voice"
read_when:
- Implementing Talk mode on macOS/iOS/Android
- Changing voice/TTS/interrupt behavior
title: "Talk mode"
---
-Talk mode is a continuous voice conversation loop:
+Talk mode has two runtime shapes:
+- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
+- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms.
+- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response.
+Native Talk is a continuous voice conversation loop:
1. Listen for speech
-2. Send transcript to the model (main session, chat.send)
+2. Send transcript to the model through the active session
3. Wait for the response
4. Speak it via the configured Talk provider (`talk.speak`)
+Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
+Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
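The native four-step loop can be sketched as a small driver. This is a sketch only: `TalkHooks` and its members are hypothetical stand-ins for the platform speech recognizer, the active Gateway session round-trip, and the `talk.speak` RPC, not real client APIs.

```typescript
// Hypothetical hooks standing in for platform speech APIs and Gateway RPCs.
interface TalkHooks {
  listen(): Promise<string | null>;             // resolve with a transcript, or null to stop
  sendToSession(text: string): Promise<string>; // round-trip through the active session
  speak(text: string): Promise<void>;           // playback via talk.speak
}

// One continuous pass of the loop: listen → send → wait → speak.
async function talkLoop(hooks: TalkHooks): Promise<void> {
  for (;;) {
    const transcript = await hooks.listen();               // 1. listen for speech
    if (transcript === null) return;                       // Talk mode disabled
    const reply = await hooks.sendToSession(transcript);   // 2–3. send and wait
    await hooks.speak(reply);                              // 4. speak the response
  }
}
```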
## Behavior (macOS)
- **Always-on overlay** while Talk mode is enabled.
@@ -66,6 +76,19 @@ Supported keys:
speechLocale: "ru-RU",
silenceTimeoutMs: 1500,
interruptOnSpeech: true,
+  realtime: {
+    provider: "openai",
+    providers: {
+      openai: {
+        apiKey: "openai_api_key",
+        model: "gpt-realtime",
+        voice: "alloy",
+      },
+    },
+    mode: "realtime",
+    transport: "webrtc",
+    brain: "agent-consult",
+  },
},
}
```
@@ -79,6 +102,11 @@ Defaults:
- `providers.elevenlabs.modelId`: defaults to `eleven_v3` when unset.
- `providers.mlx.modelId`: defaults to `mlx-community/Soprano-80M-bf16` when unset.
- `providers.elevenlabs.apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available).
+- `realtime.provider`: selects the active browser/server realtime voice provider. Use `openai` for WebRTC, `google` for provider WebSocket, or a bridge-only provider through Gateway relay.
+- `realtime.providers.<provider>` stores provider-owned realtime config. The browser receives only ephemeral or constrained session credentials, never a standard API key.
+- `realtime.brain`: `agent-consult` routes realtime tool calls through Gateway policy; `direct-tools` is owner-only compatibility behavior; `none` is for transcription or external orchestration.
+- `talk.catalog` exposes each provider's valid modes, transports, brain strategies, realtime audio formats, and capability flags so first-party Talk clients can avoid unsupported combinations.
+- Streaming transcription providers are discovered through `talk.catalog.transcription`. The current Gateway relay uses the Voice Call streaming provider config until the dedicated Talk transcription config surface is added.
- `speechLocale`: optional BCP 47 locale id for on-device Talk speech recognition on iOS/macOS. Leave unset to use the device default.
- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)
@@ -103,7 +131,9 @@ Defaults:
## Notes
- Requires Speech + Microphone permissions.
-- Uses `chat.send` against session key `main`.
+- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
+- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
+- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates.
- The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
- macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
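The transcription-only command sequence above can be sketched against a generic RPC client. The command and event names are from this page; the `call`/`subscribe` client shape, the `sessionId` parameter, and the event field names (`final`, `text`) are assumptions for illustration.

```typescript
type Call = (method: string, params?: object) => Promise<any>;
type Subscribe = (event: string, handler: (e: any) => void) => void;

// Open a transcription session, stream audio chunks, stop, and collect
// final transcripts from talk.transcription.relay events.
async function transcribeOnce(
  call: Call,
  subscribe: Subscribe,
  chunks: Uint8Array[],
): Promise<string[]> {
  const finals: string[] = [];
  subscribe("talk.transcription.relay", (e) => {
    if (e.final) finals.push(e.text); // partial updates arrive with final: false
  });
  const session = await call("talk.transcription.session", {
    mode: "transcription",
    brain: "none",
  });
  for (const chunk of chunks) {
    await call("talk.transcription.relayAudio", { sessionId: session.id, chunk });
  }
  await call("talk.transcription.relayStop", { sessionId: session.id });
  return finals;
}
```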