From 24853ced114ac2612d654255cc85549c579c09bc Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Tue, 5 May 2026 20:59:13 +0100 Subject: [PATCH] docs: outline unified talk API --- docs/.i18n/glossary.zh-CN.json | 4 + docs/gateway/config-agents.md | 12 + docs/gateway/doctor.md | 3 +- docs/gateway/protocol.md | 12 +- docs/nodes/index.md | 3 + docs/nodes/talk.md | 38 ++- docs/platforms/ios.md | 4 + docs/plugins/sdk-migration.md | 93 +++++- docs/plugins/sdk-provider-plugins.md | 15 +- docs/refactor/talk.md | 426 +++++++++++++++++++++++++++ docs/tools/media-overview.md | 12 + docs/tools/tts.md | 10 + docs/web/control-ui.md | 6 +- 13 files changed, 625 insertions(+), 13 deletions(-) create mode 100644 docs/refactor/talk.md diff --git a/docs/.i18n/glossary.zh-CN.json b/docs/.i18n/glossary.zh-CN.json index 4090e11e413..05284fc3854 100644 --- a/docs/.i18n/glossary.zh-CN.json +++ b/docs/.i18n/glossary.zh-CN.json @@ -35,6 +35,10 @@ "source": "Channel message API", "target": "频道消息 API" }, + { + "source": "Talk mode", + "target": "Talk 模式" + }, { "source": "Azure Speech", "target": "Azure Speech" diff --git a/docs/gateway/config-agents.md b/docs/gateway/config-agents.md index e2b7f347332..c42bce27e21 100644 --- a/docs/gateway/config-agents.md +++ b/docs/gateway/config-agents.md @@ -1384,6 +1384,18 @@ Defaults for Talk mode (macOS/iOS/Android). 
speechLocale: "ru-RU", silenceTimeoutMs: 1500, interruptOnSpeech: true, + realtime: { + provider: "openai", + providers: { + openai: { + model: "gpt-realtime", + voice: "alloy", + }, + }, + mode: "realtime", + transport: "webrtc", + brain: "agent-consult", + }, }, } ``` diff --git a/docs/gateway/doctor.md b/docs/gateway/doctor.md index 50751246f48..82cb183db6e 100644 --- a/docs/gateway/doctor.md +++ b/docs/gateway/doctor.md @@ -166,7 +166,7 @@ That stages grounded durable candidates into the short-term dreaming store while If the config contains legacy value shapes (for example `messages.ackReaction` without a channel-specific override), doctor normalizes them into the current schema. - That includes legacy Talk flat fields. Current public Talk config is `talk.provider` + `talk.providers.`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map. + That includes legacy Talk flat fields. Current public Talk speech config is `talk.provider` + `talk.providers.`, and realtime voice config is `talk.realtime.*`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map, and rewrites legacy top-level realtime selectors (`talk.mode`, `talk.transport`, `talk.brain`, `talk.model`, `talk.voice`) into `talk.realtime`. Doctor also warns when `plugins.allow` is non-empty and tool policy uses wildcard or plugin-owned tool entries. 
`tools.allow: ["*"]` only matches tools @@ -199,6 +199,7 @@ That stages grounded durable candidates into the short-term dreaming store while - `routing.bindings` → top-level `bindings` - `routing.agents`/`routing.defaultAgentId` → `agents.list` + `agents.list[].default` - legacy `talk.voiceId`/`talk.voiceAliases`/`talk.modelId`/`talk.outputFormat`/`talk.apiKey` → `talk.provider` + `talk.providers.` + - legacy top-level realtime Talk selectors (`talk.mode`/`talk.transport`/`talk.brain`/`talk.model`/`talk.voice`) + `talk.provider`/`talk.providers` → `talk.realtime` - `routing.agentToAgent` → `tools.agentToAgent` - `routing.transcribeAudio` → `tools.media.audio.models` - `messages.tts.` (`openai`/`elevenlabs`/`microsoft`/`edge`) → `messages.tts.providers.` diff --git a/docs/gateway/protocol.md b/docs/gateway/protocol.md index 04db84450c7..20089794c80 100644 --- a/docs/gateway/protocol.md +++ b/docs/gateway/protocol.md @@ -253,7 +253,8 @@ base method scope: Nodes declare capability claims at connect time: -- `caps`: high-level capability categories. +- `caps`: high-level capability categories such as `camera`, `canvas`, `screen`, + `location`, `voice`, and `talk`. - `commands`: command allowlist for invoke. - `permissions`: granular toggles (e.g. `screen.record`, `camera.capture`). @@ -361,8 +362,17 @@ enumeration of `src/gateway/server-methods/*.ts`. + - `talk.catalog` returns the read-only Talk provider catalog for speech, streaming transcription, and realtime voice. It includes provider ids, labels, configured state, exposed model/voice ids, canonical modes, transports, brain strategies, and realtime audio/capability flags without returning provider secrets or mutating global config. - `talk.config` returns the effective Talk config payload; `includeSecrets` requires `operator.talk.secrets` (or `operator.admin`). + - `talk.handoff.create` creates an expiring managed-room handoff for an existing session key. 
The result contains a room id, room URL, bearer token, optional session-scoped provider/model/voice selection, mode, transport, brain strategy, and expiry for a first-party walkie-talkie client. `brain: "direct-tools"` requires `operator.admin`. + - `talk.handoff.join` validates a handoff id plus bearer token, emits `session.ready` or `session.replaced` room events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash. + - `talk.handoff.turnStart`, `talk.handoff.turnEnd`, and `talk.handoff.turnCancel` let a first-party managed-room client drive the room turn lifecycle with `turn.started`, `turn.ended`, and `turn.cancelled` Talk events. + - `talk.handoff.revoke` invalidates an unexpired handoff, emits `session.closed`, and makes later joins fail. - `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients. + - `talk.realtime.session` creates a browser realtime session using canonical transports (`webrtc`, `provider-websocket`, or `gateway-relay`). It accepts optional `mode`, `transport`, and `brain` selectors, but currently only public browser `mode: "realtime"` plus `brain: "agent-consult"` is supported; `managed-room` remains reserved for handoff clients until the browser owns a real room client. + - `talk.realtime.relayAudio`, `talk.realtime.relayCancel`, `talk.realtime.relayMark`, `talk.realtime.relayStop`, and `talk.realtime.relayToolResult` control Gateway-owned realtime relay sessions. Relay cancellation clears provider output and aborts any linked agent consult run. + - `talk.realtime.toolCall` lets browser-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result. Gateway relay clients include `relaySessionId` so turn cancellation can abort the consult. 
+ - `talk.transcription.session` creates a transcription-only Gateway relay over the configured streaming STT provider. Clients send PCM frames through `talk.transcription.relayAudio`, cancel an active turn with `talk.transcription.relayCancel`, receive `talk.transcription.relay` events with common Talk envelopes, and close with `talk.transcription.relayStop`. - `talk.speak` synthesizes speech through the active Talk speech provider. - `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state. - `tts.providers` returns the visible TTS provider inventory. diff --git a/docs/nodes/index.md b/docs/nodes/index.md index e5ce896fbac..20f4a754051 100644 --- a/docs/nodes/index.md +++ b/docs/nodes/index.md @@ -197,6 +197,9 @@ Node commands must pass two gates before they can be invoked: Windows and macOS companion nodes allow safe declared commands such as `canvas.*`, `camera.list`, `location.get`, and `screen.snapshot` by default. +Trusted nodes that advertise the `talk` capability or declare `talk.*` commands +also allow declared push-to-talk commands (`talk.ptt.start`, `talk.ptt.stop`, +`talk.ptt.cancel`, `talk.ptt.once`) by default, independent of platform label. Dangerous or privacy-heavy commands such as `camera.snap`, `camera.clip`, and `screen.record` still require explicit opt-in with `gateway.nodes.allowCommands`. 
`gateway.nodes.denyCommands` always wins over diff --git a/docs/nodes/talk.md b/docs/nodes/talk.md index fac21310050..88fd1c1824a 100644 --- a/docs/nodes/talk.md +++ b/docs/nodes/talk.md @@ -1,18 +1,28 @@ --- -summary: "Talk mode: continuous speech conversations with configured TTS providers" +summary: "Talk mode: continuous speech conversations across local STT/TTS and realtime voice" read_when: - Implementing Talk mode on macOS/iOS/Android - Changing voice/TTS/interrupt behavior title: "Talk mode" --- -Talk mode is a continuous voice conversation loop: +Talk mode has two runtime shapes: + +- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support. +- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms. +- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response. + +Native Talk is a continuous voice conversation loop: 1. Listen for speech -2. Send transcript to the model (main session, chat.send) +2. Send transcript to the model through the active session 3. Wait for the response 4. Speak it via the configured Talk provider (`talk.speak`) +Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults. + +Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path. 
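The transcription-only flow above can be sketched as a planned RPC sequence. This is a hedged illustration, not SDK code: the method names (`talk.transcription.relayAudio`, `talk.transcription.relayStop`) come from this doc, while the planner function, the `RpcCall` shape, and the exact param field names are assumptions.

```typescript
// Illustrative sketch: plan the Gateway RPC calls for one transcription-only
// relay turn. A real client would first call `talk.transcription.session` to
// obtain the session id, then issue these calls over its Gateway connection
// and listen for `talk.transcription.relay` events for partial/final text.
type RpcCall = { method: string; params: Record<string, unknown> };

function planTranscriptionTurn(sessionId: string, pcmFramesBase64: string[]): RpcCall[] {
  const calls: RpcCall[] = [];
  // Stream each base64 PCM frame into the relay session.
  for (const audioBase64 of pcmFramesBase64) {
    calls.push({ method: "talk.transcription.relayAudio", params: { sessionId, audioBase64 } });
  }
  // Close the relay when capture ends; no assistant voice response follows,
  // because transcription sessions run with brain "none".
  calls.push({ method: "talk.transcription.relayStop", params: { sessionId } });
  return calls;
}
```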
+ ## Behavior (macOS) - **Always-on overlay** while Talk mode is enabled. @@ -66,6 +76,19 @@ Supported keys: speechLocale: "ru-RU", silenceTimeoutMs: 1500, interruptOnSpeech: true, + realtime: { + provider: "openai", + providers: { + openai: { + apiKey: "openai_api_key", + model: "gpt-realtime", + voice: "alloy", + }, + }, + mode: "realtime", + transport: "webrtc", + brain: "agent-consult", + }, }, } ``` @@ -79,6 +102,11 @@ Defaults: - `providers.elevenlabs.modelId`: defaults to `eleven_v3` when unset. - `providers.mlx.modelId`: defaults to `mlx-community/Soprano-80M-bf16` when unset. - `providers.elevenlabs.apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available). +- `realtime.provider`: selects the active browser/server realtime voice provider. Use `openai` for WebRTC, `google` for provider WebSocket, or a bridge-only provider through Gateway relay. +- `realtime.providers.` stores provider-owned realtime config. The browser receives only ephemeral or constrained session credentials, never a standard API key. +- `realtime.brain`: `agent-consult` routes realtime tool calls through Gateway policy; `direct-tools` is owner-only compatibility behavior; `none` is for transcription or external orchestration. +- `talk.catalog` exposes each provider's valid modes, transports, brain strategies, realtime audio formats, and capability flags so first-party Talk clients can avoid unsupported combinations. +- Streaming transcription providers are discovered through `talk.catalog.transcription`. The current Gateway relay uses the Voice Call streaming provider config until the dedicated Talk transcription config surface is added. - `speechLocale`: optional BCP 47 locale id for on-device Talk speech recognition on iOS/macOS. Leave unset to use the device default. 
- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming) @@ -103,7 +131,9 @@ Defaults: ## Notes - Requires Speech + Microphone permissions. -- Uses `chat.send` against session key `main`. +- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable. +- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions. +- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates. - The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable. - macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development. - `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`. diff --git a/docs/platforms/ios.md b/docs/platforms/ios.md index 7ea4153cdc7..2b8d01c86a3 100644 --- a/docs/platforms/ios.md +++ b/docs/platforms/ios.md @@ -263,6 +263,10 @@ openclaw nodes invoke --node "iOS Node" --command canvas.snapshot --params '{"ma ## Voice wake + talk mode - Voice wake and talk mode are available in Settings. +- Talk-capable iOS nodes advertise the `talk` capability and can declare + `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once`; + the Gateway allows those push-to-talk commands by default for trusted + Talk-capable nodes. - iOS may suspend background audio; treat voice features as best-effort when the app is not active. 
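The default push-to-talk allowance described above can be sketched as a pure policy check. This is an assumption-labeled illustration of the stated rules (trusted + `talk` capability or declared `talk.*` commands ⇒ default-allow declared `talk.ptt.*`; `gateway.nodes.denyCommands` always wins; everything else needs explicit `gateway.nodes.allowCommands`), not the real Gateway gate, and the function and field names are hypothetical.

```typescript
// Hypothetical sketch of the node-command gate for push-to-talk commands.
const DEFAULT_PTT_COMMANDS = ["talk.ptt.start", "talk.ptt.stop", "talk.ptt.cancel", "talk.ptt.once"];

interface NodeClaims {
  trusted: boolean;
  caps: string[]; // e.g. ["talk", "camera"]
  commands: string[]; // commands the node declared at connect time
}

function isTalkCommandAllowed(
  node: NodeClaims,
  command: string,
  allowCommands: string[] = [],
  denyCommands: string[] = [],
): boolean {
  if (denyCommands.includes(command)) return false; // deny always wins
  if (allowCommands.includes(command)) return true; // explicit opt-in
  const talkCapable =
    node.caps.includes("talk") || node.commands.some((c) => c.startsWith("talk."));
  // The default allowance only covers declared push-to-talk commands on
  // trusted, talk-capable nodes, independent of platform label.
  return (
    node.trusted &&
    talkCapable &&
    DEFAULT_PTT_COMMANDS.includes(command) &&
    node.commands.includes(command)
  );
}
```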
## Common errors diff --git a/docs/plugins/sdk-migration.md b/docs/plugins/sdk-migration.md index 0db0a8de0bc..e66882ce8b3 100644 --- a/docs/plugins/sdk-migration.md +++ b/docs/plugins/sdk-migration.md @@ -77,6 +77,97 @@ Current bundled provider examples: - OpenRouter keeps provider builder and onboarding/config helpers in its own `api.ts` +## Talk and realtime voice migration plan + +Realtime voice, telephony, meeting, and browser Talk code is moving from +surface-local turn bookkeeping to a shared Talk session controller exported by +`openclaw/plugin-sdk/realtime-voice`. The new controller owns the common Talk +event envelope, active turn state, capture state, output-audio state, recent +event history, and stale-turn rejection. Provider plugins should keep owning +vendor-specific realtime sessions; surface plugins should keep owning capture, +playback, telephony, and meeting quirks. + +This migration is intentionally adapter-first: + +1. Add shared controller/runtime primitives to `plugin-sdk/realtime-voice`. +2. Keep existing public Gateway RPCs such as `talk.realtime.session`, + `talk.realtime.relayAudio`, `talk.transcription.session`, and + `talk.handoff.*` as compatibility adapters. +3. Move bundled surfaces onto the shared controller: browser relay, managed-room + handoff, voice-call realtime, voice-call streaming STT, Google Meet realtime, + and VoiceClaw realtime. +4. Advertise all Talk event channels in Gateway `hello-ok.features.events` so + clients can discover `talk.event`, `talk.realtime.relay`, and + `talk.transcription.relay`. +5. Expose the versioned `talk.session.*` API for Gateway-managed Talk sessions + after the adapters are internally backed by the same controller. + +New code should not call `createTalkEventSequencer(...)` directly unless it is +implementing a low-level adapter or test fixture. 
Prefer the shared controller +so turn-scoped events cannot be emitted without a turn id, stale `turnEnd` / +`turnCancel` calls cannot clear a newer active turn, and output-audio lifecycle +events stay consistent across telephony, meetings, browser relay, managed-room +handoff, and native Talk clients. + +The target public API shape is: + +```typescript +// Versioned Gateway-managed Talk session API. +await gateway.request("talk.session.create", { + mode: "realtime", + transport: "gateway-relay", + brain: "agent-consult", + sessionKey: "main", +}); +await gateway.request("talk.session.inputAudio", { sessionId, audioBase64 }); +await gateway.request("talk.session.control", { sessionId, type: "turn.cancel" }); +await gateway.request("talk.session.toolResult", { sessionId, callId, result }); +await gateway.request("talk.session.close", { sessionId }); +``` + +Browser-owned WebRTC/provider-websocket sessions stay on +`talk.realtime.session`, because the browser owns the provider negotiation and +media transport. `talk.session.*` is the common Gateway-managed surface for +gateway-relay realtime, gateway-relay transcription, and managed-room native +STT/TTS sessions. + +Legacy configs that placed realtime selectors beside `talk.provider` / +`talk.providers` should be repaired with `openclaw doctor --fix`; runtime Talk +does not reinterpret speech/TTS provider config as realtime provider config. + +The supported `talk.session.create` combinations are intentionally small: + +| Mode | Transport | Brain | Owner | Notes | +| --------------- | --------------- | --------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------ | +| `realtime` | `gateway-relay` | `agent-consult` | Gateway | Full-duplex provider audio bridged through the Gateway; tool calls are routed through the agent-consult tool. 
| +| `transcription` | `gateway-relay` | `none` | Gateway | Streaming STT only; callers send input audio and receive transcript events. | +| `stt-tts` | `managed-room` | `agent-consult` | Native/client room | Push-to-talk and walkie-talkie style rooms where the client owns capture/playback and the Gateway owns turn state. | +| `stt-tts` | `managed-room` | `direct-tools` | Native/client room | Admin-only room mode for trusted first-party surfaces that execute Gateway tool actions directly. | + +Everything else should stay on the existing owner-specific adapter until there +is a real Gateway-managed transport for it: + +| Existing adapter | Keep using it for | +| ----------------------- | ---------------------------------------------------------------------------------------- | +| `talk.realtime.session` | Browser-owned WebRTC and provider-websocket realtime sessions. | +| `talk.realtime.relay*` | Compatibility for existing browser relay clients while they migrate to `talk.session.*`. | +| `talk.transcription.*` | Compatibility for existing streaming STT clients while they migrate to `talk.session.*`. | +| `talk.handoff.*` | Compatibility for room-style native clients; internally this is the managed-room shape. | + +The unified control vocabulary is also deliberately narrow: + +| Method | Applies to | Contract | +| ------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- | +| `talk.session.inputAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. | +| `talk.session.control` | all unified sessions | `turn.cancel` for relay sessions; `turn.start`, `turn.end`, and `turn.cancel` for managed-room sessions. | +| `talk.session.toolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. 
| +| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room handoff state, then forget the unified session id. | + +Do not introduce provider or platform special cases in core to make this work. +Core owns Talk session semantics. Provider plugins own vendor session setup. +Voice-call and Google Meet own telephony/meeting adapters. Browser and native +apps own device capture/playback UX. + ## Compatibility policy For external plugins, compatibility work follows this order: @@ -497,7 +588,7 @@ releases. | `plugin-sdk/speech` | Speech helpers | Speech provider types plus provider-facing directive, registry, validation helpers, and OpenAI-compatible TTS builder | | `plugin-sdk/speech-core` | Shared speech core | Speech provider types, registry, directives, normalization | | `plugin-sdk/realtime-transcription` | Realtime transcription helpers | Provider types, registry helpers, and shared WebSocket session helper | - | `plugin-sdk/realtime-voice` | Realtime voice helpers | Provider types, registry/resolution helpers, and bridge session helpers | + | `plugin-sdk/realtime-voice` | Realtime voice helpers | Provider types, registry/resolution helpers, bridge session helpers, shared agent talk-back queues, transcript/event health, echo suppression, and fast context consult helpers | | `plugin-sdk/image-generation` | Image-generation helpers | Image generation provider types plus image asset/data URL helpers and the OpenAI-compatible image provider builder | | `plugin-sdk/image-generation-core` | Shared image-generation core | Image-generation types, failover, auth, and registry helpers | | `plugin-sdk/music-generation` | Music-generation helpers | Music-generation provider/request/result types | diff --git a/docs/plugins/sdk-provider-plugins.md b/docs/plugins/sdk-provider-plugins.md index cfe7cb2099c..2a4c058f741 100644 --- a/docs/plugins/sdk-provider-plugins.md +++ b/docs/plugins/sdk-provider-plugins.md @@ -588,6 +588,13 @@ API key auth, 
and dynamic model resolution. api.registerRealtimeVoiceProvider({ id: "acme-ai", label: "Acme Realtime Voice", + capabilities: { + transports: ["gateway-relay"], + inputAudioFormats: [{ encoding: "pcm16", sampleRateHz: 24000, channels: 1 }], + outputAudioFormats: [{ encoding: "pcm16", sampleRateHz: 24000, channels: 1 }], + supportsBargeIn: true, + supportsToolCalls: true, + }, isConfigured: ({ providerConfig }) => Boolean(providerConfig.apiKey), createBridge: (req) => ({ // Set this only if the provider accepts multiple tool responses for @@ -606,9 +613,11 @@ API key auth, and dynamic model resolution. }); ``` - Implement `handleBargeIn` when a transport can detect that a human is - interrupting assistant playback and the provider supports truncating or - clearing the active audio response. + Declare `capabilities` so `talk.catalog` can expose valid modes, + transports, audio formats, and feature flags to browser and native Talk + clients. Implement `handleBargeIn` when a transport can detect that a + human is interrupting assistant playback and the provider supports + truncating or clearing the active audio response. 
```typescript diff --git a/docs/refactor/talk.md b/docs/refactor/talk.md new file mode 100644 index 00000000000..4f9f137a2b8 --- /dev/null +++ b/docs/refactor/talk.md @@ -0,0 +1,426 @@ +--- +summary: "Grand unification plan for Talk mode, realtime voice, voice-call, Google Meet, and VoiceClaw realtime" +read_when: + - Refactoring Talk mode, realtime voice, voice-call, Google Meet, or VoiceClaw realtime + - Changing Talk protocol, provider contracts, browser realtime, or native voice behavior + - Deciding whether a voice feature belongs in core, a provider plugin, or a surface adapter +title: "Talk unification plan" +--- + +# Talk Unification Plan + +OpenClaw has several voice loops that grew from different product surfaces: native Talk mode, browser realtime Talk, Voice Call realtime, Google Meet realtime, streaming STT, TTS reply playback, and `/voiceclaw/realtime`. The goal is not to force all of them into one implementation. The goal is one session contract, one event vocabulary, one policy boundary, and small adapters for each surface. + +Core should know conversation modes, byte transports, audio formats, tool policy, and client capabilities. Core should not know platform product names such as iOS, Android, or macOS except as optional telemetry emitted by an edge client. + +## Goals + +- Make browser Talk, native Talk, telephony, meetings, and VoiceClaw realtime share the same session semantics. +- Keep provider-specific realtime behavior in provider plugins. +- Keep telephony and meeting quirks in their owning plugins. +- Move browser realtime agent consult out of browser-owned `chat.send`. +- Keep existing public entry points only as migration adapters while the runtime converges. +- Keep local STT/TTS as a first-class fallback, not a deprecated path. +- Support a first-party walkie-talkie client that can hand off an existing OpenClaw session into voice without becoming a separate assistant. 
+- Make event logs, latency, usage, tool calls, cancellation, and interruption observable in the same shape everywhere. + +## Non Goals + +- Do not make core branch on app platforms. +- Do not move OpenAI, Google, Twilio, or meeting-specific behavior into core. +- Do not merge one-shot inbound audio attachments with live Talk sessions beyond sharing STT provider contracts where useful. +- Do not remove `/voiceclaw/realtime` or existing Talk RPC entry points during the first migration; they may reject retired fields instead of preserving every old request shape. +- Do not allow request-time instruction overrides for realtime sessions. +- Do not copy VoiceClaw names or request fields into shared APIs; preserve the realtime runtime capabilities through the shared Talk contract, except request-time instruction overrides. + +## Current Surfaces + +| Surface | Current shape | Keep | Refactor target | +| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- | +| Browser Talk | `talk.realtime.session` returns WebRTC, provider WebSocket, or Gateway relay. Tool calls go through `talk.realtime.toolCall`. | Browser audio capture/playback and WebRTC data-channel handling. | Keep browser media ownership while Gateway owns realtime tool policy. | +| Native Talk | Local STT, Gateway `chat.send`, response event or `chat.history` polling, then local or Gateway TTS. | Local STT/TTS fallback and native audio controls. | Event-driven success path with shared Talk events. | +| Voice Call realtime | Telephony WebSocket with G.711 u-law, marks, interruption, and realtime voice bridge. | Telephony adapter ownership. | Adapter over shared Talk session contract. 
| +| Voice Call streaming STT | Telephony stream through realtime transcription provider, then TTS playback. | STT/TTS pipeline mode. | Explicit `stt-tts` mode adapter. | +| Google Meet realtime | Meeting participant context, echo suppression, realtime provider bridge, fast context. | Meeting adapter ownership. | Adapter over shared Talk session contract and metrics. | +| VoiceClaw realtime | Separate WebSocket endpoint with Gemini Live, direct tools, audio/video frames, interruption, cancellation, session rotation/resumption, and metrics. | Migration endpoint; realtime runtime primitives except overrides. | Shared Talk contract; server-owned instructions; no request-time override. | +| TTS | `talk.speak` and provider TTS config. | Speech provider abstraction. | Cleanly separated from realtime provider config. | +| STT | Batch audio and streaming transcription providers. | Provider contracts. | Streaming STT is an input strategy for `stt-tts`; batch voice notes stay outside live Talk. | +| Walkie-talkie handoff | Prototype pattern: existing session, phone capture, push-to-talk turn, STT, agent turn, TTS playback, and transcript mirror. | One-button voice handoff UX and long-form PTT. | Gateway-backed handoff room using shared Talk events, provider catalogs, and existing session delivery. | + +## Core Model + +Separate the dimensions. Mode is how the conversation runs. Transport is how bytes move. Brain is who handles tools and agent reasoning. Surface is edge-owned and should not drive core branching. + +```ts +type TalkMode = "realtime" | "stt-tts" | "transcription"; + +type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room"; + +type TalkBrain = "agent-consult" | "direct-tools" | "none"; +``` + +### Modes + +`realtime` is a provider-native live session. Audio goes in, audio comes out, interruptions and tool calls happen inside one low-latency session. OpenAI Realtime and Google Live fit here. 
WebRTC and provider WebSockets are transports for this mode, not separate modes. + +`stt-tts` is the classic pipeline: speech-to-text, agent text turn, text-to-speech. It is higher latency, but it works with local native speech, streaming STT providers, low-cost fallback providers, offline-ish native paths, and providers that do not support realtime voice. + +`transcription` is speech-to-text without an assistant speech response. It covers dictation, captions, meeting transcript capture, and voice-note style ingestion when the live session layer is useful. Gateway-owned transcription relay sessions use `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`. One-shot batch audio attachments can still use the existing media path without becoming Talk sessions. + +### Transports + +`webrtc` is browser or WebRTC-capable client transport using SDP and media/data channels. It is the best fit for direct OpenAI Realtime browser sessions with ephemeral credentials. + +`provider-websocket` is a constrained provider WebSocket carrying JSON control messages and PCM audio. It fits Google Live-style browser or server streams where WebRTC is not the provider contract. + +`gateway-relay` keeps the vendor session on the Gateway. Clients send authenticated audio frames to Gateway and receive audio/events back. This is the secure default for providers without browser-safe tokens and for server-owned tool policy. + +`managed-room` is a Gateway-owned room/session where one or more clients join a managed Talk handoff. It is the primitive for first-party walkie-talkie clients: Gateway owns rendezvous, expiry, replacement, turn lifecycle events, and provider credentials while the edge client owns capture and playback. + +Telephony, meetings, and native apps are not core transports. 
They are surface adapters that choose one of the transports above or implement local `stt-tts` before handing text/audio events into the shared session contract. + +Canonical transport names are the names above. Legacy browser-session transport names should be normalized at adapter boundaries (`webrtc-sdp` to `webrtc`, `json-pcm-websocket` to `provider-websocket`) so mixed-version clients and external providers keep working. Do not keep the legacy names as a second internal vocabulary. When a versioned creation RPC exists, freeze the old RPC shape and delete the aliases only after the announced compatibility window. + +### Brain Strategies + +`agent-consult` means the realtime model asks Gateway to consult an OpenClaw agent. Gateway applies tool policy, chooses fork or isolated context, runs the agent, and returns a concise result to the realtime provider. + +`direct-tools` means the realtime provider receives a direct OpenClaw tool declaration and calls Gateway-owned tools. This is the VoiceClaw-style brain and should require owner-level authorization. + +`none` means the session is pure transcription, external orchestration, or client-managed speech without OpenClaw tool access. + +## Shared Talk Session Runtime + +The next cleanup layer is a shared Talk session controller. It should be the only code that owns event sequencing, active turn state, capture state, output audio state, recent event retention, and stale-turn rejection. Surface adapters may decide when to call it, but they should not each reimplement turn bookkeeping. 
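The adapter-boundary normalization of legacy transport names described in the Transports section can be sketched as a single canonicalization helper. The alias pairs (`webrtc-sdp` → `webrtc`, `json-pcm-websocket` → `provider-websocket`) are from this doc; the helper name and its reject-unknown behavior are illustrative assumptions.

```typescript
// Illustrative sketch: normalize legacy browser-session transport names at
// adapter boundaries so no second internal vocabulary survives.
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";

const CANONICAL_TRANSPORTS = new Set<string>([
  "webrtc",
  "provider-websocket",
  "gateway-relay",
  "managed-room",
]);

const LEGACY_TRANSPORT_ALIASES: Record<string, TalkTransport> = {
  "webrtc-sdp": "webrtc",
  "json-pcm-websocket": "provider-websocket",
};

function normalizeTalkTransport(name: string): TalkTransport {
  // Canonical names pass through unchanged.
  if (CANONICAL_TRANSPORTS.has(name)) return name as TalkTransport;
  // Legacy names from mixed-version clients are mapped, not kept.
  const alias = LEGACY_TRANSPORT_ALIASES[name];
  if (alias) return alias;
  // Anything else is rejected rather than silently passed through.
  throw new Error(`unknown talk transport: ${name}`);
}
```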
+ +The controller contract should cover: + +- `emit(...)` for session, health, usage, latency, and tool events that do not mutate turn state +- `startTurn(...)` and `ensureTurn(...)` for capture, STT, realtime provider, telephony, and meeting adapters +- `endTurn(...)` and `cancelTurn(...)` with stale `turnId` rejection before clearing the active turn +- `startOutputAudio(...)`, `emitOutputAudioDelta(...)`, and `finishOutputAudio(...)` for playback, marks, relay clear, and barge-in +- recent event retention for reconnect, diagnostics, hello/event discovery tests, and native UI replay +- compatibility normalization for legacy transport result names at adapter boundaries + +The public API migration is adapter-first. Keep existing RPCs such as `talk.realtime.session`, `talk.realtime.relayAudio`, `talk.transcription.session`, `talk.transcription.relayAudio`, and `talk.handoff.*` while moving their internals onto the shared controller. Gateway-managed sessions expose the common model directly: + +```ts +talk.session.create; +talk.session.inputAudio; +talk.session.control; +talk.session.toolResult; +talk.session.close; +``` + +The old RPCs stay as compatibility adapters while new clients use `talk.session.*` for gateway-relay realtime, gateway-relay transcription, and managed-room native STT/TTS sessions. Browser-owned WebRTC/provider-websocket sessions remain on `talk.realtime.session` because the browser owns provider negotiation and media transport there. The internal controller must be provider-agnostic and platform-agnostic: provider plugins own vendor sessions, voice-call owns telephony, Google Meet owns meeting details, and browser/native clients own capture and playback UX. + +## VoiceClaw Runtime Scope + +VoiceClaw is an adapter target, not a feature template for the unified runtime. We do not need every VoiceClaw product or API feature. 
We do want the useful realtime runtime primitives: live provider sessions, audio and optional video frames, interruption, cancellation, session lifecycle, rotation/resumption, metrics, latency reporting, and direct tools when explicitly authorized. Those should arrive as shared Talk primitives instead of VoiceClaw-only knobs. + +The deliberate feature removal is request-time instruction override. Unified Talk instructions must be server-owned. If a capability depends on provider support, owner-scoped auth, or the selected brain strategy, the adapter should gate it through shared Talk capability metadata rather than deleting it. Do not preserve `instructionsOverride`; it is intentionally outside the unified Talk contract. Everything else in the existing realtime runtime is presumed in scope unless a later implementation review proves that it is dead, unsafe, or impossible to express as a shared Talk primitive. + +Keep: + +- `/voiceclaw/realtime` endpoint shape during migration +- existing auth expectations where they remain owner-scoped +- Gemini Live provider bridge +- audio input and output frames +- video frames when the selected provider supports them +- interruption and response cancellation +- session rotation and resumption where the provider supports them +- metrics and latency reporting +- direct tool calls behind the explicit `direct-tools` brain + +Do not keep: + +- request-time `instructionsOverride` +- VoiceClaw-only request fields that duplicate server-owned instructions, tool policy, provider selection, or session policy +- VoiceClaw-specific configuration names in new shared Talk APIs + +Realtime instruction policy must come from server-side config, agent identity, selected brain strategy, or another owner-controlled policy surface. If a client sends `instructionsOverride`, the compatibility adapter should reject the request rather than silently applying, partially honoring, or translating it. 
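A compatibility-adapter gate for the retired field might look like the sketch below. The request shape and helper name are assumptions; the policy it encodes (reject, never silently drop or translate) is the one stated above.

```typescript
// Hypothetical legacy request shape for /voiceclaw/realtime.
type VoiceClawRealtimeRequest = {
  sessionId: string;
  // Retired request-time field; may still appear in old client payloads.
  instructionsOverride?: string;
  [key: string]: unknown;
};

// Reject outright rather than silently applying, partially honoring,
// or translating the retired field.
function assertNoInstructionOverride(req: VoiceClawRealtimeRequest): void {
  if (req.instructionsOverride !== undefined) {
    throw new Error(
      "unsupported field: instructionsOverride (realtime instructions are server-owned)",
    );
  }
}
```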
Everything in the Keep list remains in scope and should migrate onto shared Talk primitives.
+
+Compatibility here means "old entry point can route to the new runtime," not "old clients can keep every old knob forever." `/voiceclaw/realtime` should be allowed to return a clear unsupported-field error for retired request fields, especially `instructionsOverride`, while preserving the runtime behavior that still belongs in Talk.
+
+## Event Vocabulary
+
+All Talk sessions should emit a common event stream:
+
+- `session.started`, `session.ready`, `session.replaced`, `session.closed`, `session.error`
+- `turn.started`, `turn.ended`, `turn.cancelled`
+- `capture.started`, `capture.stopped`, `capture.cancelled`, `capture.once`
+- `input.audio.delta`, `input.audio.committed`
+- `transcript.delta`, `transcript.done`
+- `output.text.delta`, `output.text.done`
+- `output.audio.started`, `output.audio.delta`, `output.audio.done`
+- `tool.call`, `tool.progress`, `tool.result`, `tool.error`
+- `usage.metrics`
+- `latency.metrics`
+- `health.changed`
+
+Adapters may add vendor or surface metadata, but the common event names should be enough for UI, native clients, logs, tests, and metrics.
+
+Every common event must use the same envelope:
+
+```ts
+type TalkEvent<TPayload = unknown> = {
+  id: string;
+  type: TalkEventType;
+  sessionId: string;
+  turnId?: string;
+  captureId?: string;
+  seq: number;
+  timestamp: string;
+  mode: TalkMode;
+  transport: TalkTransport;
+  brain: TalkBrain;
+  provider?: string;
+  final?: boolean;
+  callId?: string;
+  itemId?: string;
+  parentId?: string;
+  payload: TPayload;
+};
+```
+
+`sessionId` is required for every event. `turnId` is required for every event tied to one user/assistant turn. `captureId` is required while push-to-talk capture is active. `seq` is monotonically increasing within a session. `callId`, `itemId`, and `parentId` correlate provider tool calls, realtime response items, TTS jobs, and relay frames.
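Stale-output suppression can be implemented from envelope fields alone. A minimal sketch, where the `shouldDeliver` helper and the session-state shape are assumptions and only the envelope fields come from the contract:

```typescript
// Just the envelope fields the check needs.
type EnvelopeFields = { sessionId: string; turnId?: string; seq: number };

// Hypothetical per-session consumer state.
type SessionState = { sessionId: string; activeTurnId?: string; lastSeq: number };

function shouldDeliver(state: SessionState, ev: EnvelopeFields): boolean {
  if (ev.sessionId !== state.sessionId) return false; // different session
  if (ev.seq <= state.lastSeq) return false; // replayed or out-of-order frame
  if (
    ev.turnId !== undefined &&
    state.activeTurnId !== undefined &&
    ev.turnId !== state.activeTurnId
  ) {
    return false; // output from a cancelled or replaced turn
  }
  return true;
}
```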
Replay, stale-output suppression, metrics, and tests should rely on these envelope fields rather than vendor-specific payload shapes. + +Walkie-talkie clients need one extra timing rule: text-ready is not audio-ready. A client may show transcript text after `output.text.done`, but it should not transition from "thinking" to "speaking" until `output.audio.delta` or an explicit `output.audio.started` event arrives. That keeps hold music, waveform, replay, and barge-in UX honest when the agent turn finishes before TTS is ready. + +## Walkie-Talkie App Primitives + +The app should be buildable from the same primitives, not a parallel voice stack. + +### Session Handoff + +Voice handoff starts from an existing OpenClaw session. The handoff primitive should carry: + +- canonical session id +- optional session key for human-readable thread lookup +- delivery route, such as channel and target +- caller identity and scope +- selected `TalkMode`, `TalkTransport`, and `TalkBrain` +- optional session-scoped provider, model, and voice ids +- expiration, revocation, and replacement policy + +The existing Gateway session APIs and `chat.send`/agent delivery paths already cover the canonical conversation side. First-class Talk handoff RPCs provide the rendezvous primitive: `talk.handoff.create` returns an ephemeral room token or join URL, `talk.handoff.join` validates the later voice join without exposing stored token hashes, `talk.handoff.turnStart`/`turnEnd`/`turnCancel` drive the room turn lifecycle, and `talk.handoff.revoke` invalidates stale or replaced handoffs. + +### Room and Rendezvous + +The room model must allow one device or browser client to host multiple active voice handoffs for different sessions without cross-talk. A deterministic room key is fine for local or development flows, but the product path should prefer Gateway-owned room creation with caller auth, expiry, and revoke semantics. 
+ +The minimum room events are: + +- `session.ready` +- `session.replaced` +- `turn.started` +- `turn.ended` +- `turn.cancelled` +- `session.closed` +- `session.error` + +`managed-room` is public only through handoff clients. Browser `talk.realtime.session` should keep rejecting `managed-room` until the browser owns a real room client instead of treating it as a browser-session result shape. + +### Push-To-Talk + +Push-to-talk is a turn-control primitive, not a platform primitive. It should map to browser capture, native local capture, or node commands: + +- `capture.started` +- `capture.stopped` +- `capture.cancelled` +- `capture.once` + +Native node support has `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` command handlers. The Gateway policy treats them as first-class defaults only for trusted Talk-capable nodes: a node must advertise the `talk` capability or declare `talk.*` command support, and the command must still be present in the paired command snapshot. + +### Provider Catalogs and Settings + +Walkie-talkie settings should be per session or per device. The client should request STT, TTS, and realtime catalogs through Gateway, store only provider ids, model ids, voice ids, and locales, and never receive provider API keys or mutate global Talk provider defaults as a side effect of opening the app. + +The catalog contract should describe which combinations are valid: + +- local STT plus local TTS +- streaming STT plus provider TTS +- realtime provider with provider-native output audio +- Gateway relay when browser-safe credentials are not available +- managed room when the Gateway owns the session + +### Canonical Transcript + +The OpenClaw session is the source of truth. A walkie-talkie app may keep a local transcript cache for replay, export, reconnect, or offline UX, but the agent turn and durable transcript should go through the existing session delivery route. 
Transcript mirroring should be best effort and must not block the voice turn. + +### Connectivity and Backgrounding + +Native apps can use node pairing, `node.invoke`, and platform wake mechanisms when available. Browser or standalone web clients need either Gateway relay, a managed room, or hosted WebRTC signaling with ICE/TURN. Background continuous audio remains platform-limited; the product should promise foreground push-to-talk first and treat background capture as best effort. + +### Cancellation and Replacement + +Every turn should carry a turn token or capture id. Stale STT finals, stale agent replies, and stale TTS output must be ignored after `turn.cancelled` or `session.replaced`. This is required for "tap again to interrupt", reconnect replacement, and multi-session isolation. + +Cancellation must also abort underlying work, not only hide stale output. A cancelled or replaced turn must: + +- cancel provider responses or realtime sessions when the provider supports it +- abort agent consult and tool runtime work through an `AbortSignal` +- prevent newly queued side-effecting tools from starting after cancellation +- let already-started side-effecting tools report cancellation status instead of inventing success +- drain pending TTS jobs and stop audio playback/relay writes +- close or reset relay and managed-room streams tied to the stale turn +- emit one terminal cancellation event with the final abort reason + +## Config Direction + +The current public Talk config is speech-provider oriented. Keep it as the speech config and add realtime config beside it. Do not introduce a second `talk.speech` namespace during this refactor. 
+
+```ts
+type TalkConfig = {
+  provider?: string;
+  providers?: Record<string, unknown>;
+  realtime?: {
+    provider?: string;
+    providers?: Record<string, unknown>;
+    model?: string;
+    voice?: string;
+    mode?: TalkMode;
+    transport?: TalkTransport;
+    brain?: TalkBrain;
+  };
+  input?: {
+    interruptOnSpeech?: boolean;
+    silenceTimeoutMs?: number;
+  };
+};
+```
+
+Rule: `talk.provider` and `talk.providers.*` continue to mean speech, STT, and TTS provider configuration. Realtime provider selection uses `talk.realtime.provider`, then registered realtime capabilities. Voice Call fallback inference should be deleted once the realtime config exists in schema, docs, forms, and doctor repair.
+
+## Provider Contracts
+
+Provider plugins should declare capabilities, not force core to infer behavior from ids:
+
+```ts
+type RealtimeVoiceProviderCapabilities = {
+  transports: TalkTransport[];
+  inputAudioFormats: AudioFormat[];
+  outputAudioFormats: AudioFormat[];
+  supportsBrowserSession?: boolean;
+  supportsBargeIn?: boolean;
+  supportsToolCalls?: boolean;
+  supportsVideoFrames?: boolean;
+  supportsSessionResumption?: boolean;
+};
+```
+
+OpenAI owns OpenAI Realtime details. Google owns Gemini Live details, continuation, compression, and session resumption. STT plugins own streaming transcription. TTS plugins own synthesis and telephony-compatible output formats.
+
+## Gateway Policy Boundary
+
+Browser realtime should not run agent consult by calling `chat.send` directly. The browser may own the media connection when a provider requires it, but Gateway should own the consult/tool policy.
+
+Target flow for browser-owned provider sessions:
+
+1. Provider emits a tool call to the browser.
+2. Browser forwards the structured tool call to Gateway with the session id.
+3. Gateway validates the session, caller, tool policy, brain strategy, and owner permissions.
+4. Gateway runs `agent-consult`, `direct-tools`, or rejects the call.
+5. Browser submits the provider-specific tool result back to the provider.
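On the client, the browser-owned flow reduces to forwarding the structured call and returning the Gateway's verdict. A sketch, assuming a generic `rpc` helper and hypothetical request/response shapes; only the `talk.realtime.toolCall` method name comes from this plan:

```typescript
type ProviderToolCall = { callId: string; name: string; args: unknown };
type Rpc = (method: string, params: unknown) => Promise<unknown>;

async function forwardToolCall(
  rpc: Rpc,
  sessionId: string,
  call: ProviderToolCall,
): Promise<unknown> {
  // Steps 2-4: Gateway validates the session, caller, tool policy, brain
  // strategy, and owner permissions, then runs or rejects the call.
  const result = await rpc("talk.realtime.toolCall", {
    sessionId,
    callId: call.callId,
    name: call.name,
    arguments: call.args,
  });
  // Step 5 happens in the caller: submit `result` back to the provider
  // in the provider-specific tool-result format.
  return result;
}
```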
+ +Target flow for Gateway-owned sessions: + +1. Provider emits a tool call to Gateway. +2. Gateway runs policy and tool handling directly. +3. Client only receives status, transcript, audio, and visible tool progress events. + +## Surface Adapters + +Adapters convert surface-specific IO into the shared model. + +Browser adapter handles microphone capture, playback, WebRTC SDP, data channels, provider WebSocket framing, relay RPCs, and provider-specific tool result submission. + +Native adapter handles local STT/TTS, push-to-talk, continuous listening, local interruption, audio session lifecycles, and optional Gateway realtime or managed-room clients. Core sees capabilities such as PCM input support, local TTS fallback, and barge-in support, not platform names. + +Telephony adapter handles Twilio or Plivo media streams, G.711 u-law, stream ids, marks, clear events, backpressure, call lifecycle, and phone-specific interruption behavior. + +Meeting adapter handles room lifecycle, participant context, echo suppression, meeting transcript context, and meeting-specific authorization. + +VoiceClaw adapter handles `/voiceclaw/realtime`, auth expectations that remain owner-scoped, Gemini Live compatibility, audio/video frames, interruption, response cancellation, session rotation/resumption, metrics, latency reporting, and the `direct-tools` brain while using common Talk events internally. It must reject request-time `instructionsOverride` and must not introduce VoiceClaw-only policy fields into the shared Talk API. + +## Migration Phases + +### Phase 1: Contracts + +- Add shared Talk mode, transport, brain, capabilities, command, and event types. +- Add a config resolver that preserves legacy `talk.provider`. +- Keep existing `RealtimeVoiceProvider` APIs while introducing capability metadata. +- Add handoff, room, capture, provider catalog, cancellation, and replacement event contracts. 
+- Make `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` explicit safe commands for Talk-capable nodes. +- Add protocol tests for no request-time instruction override. + +### Phase 2: Gateway Tool Policy + +- Add Gateway RPC for realtime tool calls from browser-owned provider sessions. +- Add Gateway RPCs for `talk.handoff.create`, `talk.handoff.join`, `talk.handoff.revoke`, and explicit handoff turn start/end/cancel, with session identity, expiry, revocation, join authorization, and event replay. +- Add session-scoped STT, TTS, and realtime provider catalog RPCs. +- Keep browser `openclaw_agent_consult` handling on `talk.realtime.toolCall`, not browser-side `chat.send`. +- Reuse existing agent consult runtime and tool allow policy. +- Add owner-only gate for `direct-tools`. + +### Phase 3: Browser Runtime + +- Normalize browser WebRTC, provider WebSocket, and relay adapters behind common Talk events. +- Keep `managed-room` scoped to handoff clients until the browser has a real room client. +- Add a walkie-talkie browser client path over Gateway relay or managed room. +- Keep provider credentials on Gateway; browser receives only ephemeral room/session credentials. +- Add browser tests proving realtime consult does not call `chat.send`. + +### Phase 4: Native Runtime + +- Make native Talk consume response events in the success path. +- Remove normal-path `chat.history` polling and keep history polling only as a degraded fallback if needed. +- Preserve local STT and local TTS fallback. +- Route native push-to-talk through the shared capture and turn events. +- Verify node command policy allows `talk.ptt.*` for trusted Talk-capable native nodes. +- Align native emitted state with common Talk events. + +### Phase 5: VoiceClaw Runtime + +- Rebase `/voiceclaw/realtime` onto the shared Talk session runtime. +- Keep the endpoint as a thin migration adapter and preserve auth expectations only where they map cleanly to the shared Talk contract. 
+- Remove request-time `instructionsOverride`; owner policy must come from server-side config, agent identity, or the selected brain strategy. +- Map Gemini Live metrics, latency reporting, rotation, resumption, interruption, cancellation, audio, video, and tool events into the common event stream. +- Keep `direct-tools` separate from `agent-consult`. +- Do not add VoiceClaw-specific config names, override fields, or client policy knobs to new Talk contracts. + +### Phase 6: Voice Call and Meetings + +- Convert Voice Call realtime into a telephony adapter over shared Talk sessions. +- Convert Voice Call streaming STT into explicit `stt-tts`. +- Convert Google Meet realtime into a meeting adapter over shared Talk sessions. +- Keep telephony marks, u-law, backpressure, participant context, and echo suppression in their owning adapters. + +### Phase 7: Docs and Cleanup + +- Update [Talk mode](/nodes/talk), [Control UI](/web/control-ui), [Gateway protocol](/gateway/protocol), [Media overview](/tools/media-overview), [Text-to-speech](/tools/tts), and plugin SDK docs. +- Retire duplicate event names after compatibility windows. +- Remove browser-side consult-through-chat code after all supported providers use Gateway tool policy. + +## Test Matrix + +- WebRTC plus `agent-consult`. +- Provider WebSocket plus `agent-consult`. +- Gateway relay plus `agent-consult`. +- Public clients updated to canonical transport names, or a versioned RPC proves old result names stay isolated until deletion. +- VoiceClaw compatibility plus `direct-tools`, without request-time `instructionsOverride`. +- Telephony WebSocket with marks, clear, interruption, and u-law. +- Meeting adapter with participant context and echo suppression. +- Native `stt-tts` with no `chat.history` polling in the normal success path. +- Transcription-only Gateway relay session with partial/final transcript Talk events and no assistant brain. +- TTS-only `talk.speak`. 
+- Walkie-talkie handoff from an existing session into a voice room. +- Two simultaneous walkie-talkie handoffs for the same host but different sessions with no transcript, audio, or turn-token cross-talk. +- Push-to-talk start, stop, cancel, and once through `node.invoke` on a trusted talk-capable node. +- Text-ready before TTS-ready, proving the client does not enter playback until audio starts. +- Session-scoped provider catalog selection that does not mutate global Talk config. +- Cancellation aborts provider work, agent consult, queued tools, TTS, and relay/room streams. +- Security checks for no instruction override, no browser standard API keys, owner-only direct tools, and session-scoped tool calls. + +## End State + +OpenClaw has one Talk architecture with three execution modes, four core transports, explicit brain strategies, provider-owned vendor logic, Gateway-owned tool policy, and adapters for browser, native, telephony, meetings, and VoiceClaw compatibility. Users get better Talk mode. Maintainers get one place to reason about sessions, events, policy, metrics, and tests. diff --git a/docs/tools/media-overview.md b/docs/tools/media-overview.md index b1bea44b68f..3b4e5df800a 100644 --- a/docs/tools/media-overview.md +++ b/docs/tools/media-overview.md @@ -14,6 +14,12 @@ media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured. +Live speech uses the Talk session contract instead of the one-shot media tool +path. Talk has three modes: provider-native `realtime`, local or streaming +`stt-tts`, and `transcription` for observe-only speech capture. Those modes +share provider catalogs, event envelopes, and cancellation semantics with +telephony, meetings, browser realtime, and native push-to-talk clients. 
+ ## Capabilities @@ -110,6 +116,11 @@ Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording. +For live user conversations, prefer [Talk mode](/nodes/talk). Batch audio +attachments stay on the media path; browser realtime, native push-to-talk, +telephony, and meeting audio should use Talk events and the session-scoped +catalogs returned by the Gateway. + ## Provider mappings (how vendors split across surfaces) @@ -144,3 +155,4 @@ vendor without waiting for a completed recording. - [Text-to-speech](/tools/tts) - [Media understanding](/nodes/media-understanding) - [Audio nodes](/nodes/audio) +- [Talk mode](/nodes/talk) diff --git a/docs/tools/tts.md b/docs/tools/tts.md index 78bfc10bb6d..1aafb2ebeac 100644 --- a/docs/tools/tts.md +++ b/docs/tools/tts.md @@ -12,6 +12,11 @@ OpenClaw can convert outbound replies into audio across **14 speech providers** and deliver native voice messages on Feishu, Matrix, Telegram, and WhatsApp, audio attachments everywhere else, and PCM/Ulaw streams for telephony and Talk. +TTS is the speech-output half of Talk's `stt-tts` mode. Provider-native +`realtime` Talk sessions synthesize speech inside the realtime provider instead +of calling this TTS path, while `transcription` sessions do not synthesize an +assistant voice response. + ## Quick start @@ -586,6 +591,11 @@ attempted provider: The whole TTS request only fails when **every** attempted provider is skipped or fails. +Talk session provider selection is session-scoped. A Talk client should choose +provider ids, model ids, voice ids, and locales from `talk.catalog` and pass +them through the Talk session or handoff request. Opening a voice session should +not mutate `messages.tts` or global Talk provider defaults. 
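A client-side sketch of that rule, assuming a hypothetical catalog entry shape and `rpc` helper; `talk.catalog` is the only name taken from these docs:

```typescript
// Assumed catalog entry shape: only ids and locales, never API keys.
type CatalogVoice = {
  providerId: string;
  modelId?: string;
  voiceId: string;
  locale?: string;
};

async function pickVoice(
  rpc: (method: string) => Promise<CatalogVoice[]>,
  locale: string,
): Promise<CatalogVoice | undefined> {
  const voices = await rpc("talk.catalog");
  // Keep the choice session-scoped: pass the ids through the Talk session
  // or handoff request, and never write them into messages.tts or global
  // Talk provider defaults.
  return voices.find((v) => v.locale === locale) ?? voices[0];
}
```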
+ ## Model-driven directives By default, the assistant **can** emit `[[tts:...]]` directives to override diff --git a/docs/web/control-ui.md b/docs/web/control-ui.md index d3a1e9032fc..b7a8d2e4271 100644 --- a/docs/web/control-ui.md +++ b/docs/web/control-ui.md @@ -96,7 +96,7 @@ Imported themes are stored only in the current browser profile. They are not wri - Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`). - - Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and sends `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model. + - Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and forwards `openclaw_agent_consult` provider tool calls through `talk.realtime.toolCall` for Gateway policy and the larger configured OpenClaw model. - Stream tool calls + live tool output cards in Chat (agent events). @@ -168,9 +168,9 @@ Imported themes are stored only in the current browser profile. They are not wri - Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.provider: "openai"` plus `talk.providers.openai.apiKey`, or configure Google with `talk.provider: "google"` plus `talk.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. 
Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides. + Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.realtime.provider: "openai"` plus `talk.realtime.providers.openai.apiKey`, or configure Google with `talk.realtime.provider: "google"` plus `talk.realtime.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides. - In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `chat.send`. + In the Chat composer, the Talk control is the waves button next to the microphone dictation button. 
When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `talk.realtime.toolCall`. Maintainer live smoke: `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts` verifies the OpenAI browser WebRTC SDP exchange, Google Live constrained-token browser WebSocket setup, and the Gateway relay browser adapter with fake microphone media. The command prints provider status only and does not log secrets.