feat: add browser realtime talk transports

2026-05-06 15:40:44 +00:00 · 2026-04-27 14:21:38 +01:00
parent 5dd1e264eb
commit 93bbbe5e37
26 changed files with 2607 additions and 319 deletions
--- a/docs/.generated/plugin-sdk-api-baseline.sha256
+++ b/docs/.generated/plugin-sdk-api-baseline.sha256
@@ -1,2 +1,2 @@
-b81647828ee6599cdd1d76d96ea02c92ccdebb8c1b3b443cefe10ca8bd2ddbfe  plugin-sdk-api-baseline.json
-ca9f3569352522621857b51872f30b3c31881505fd9eff2451b1b46d77670726  plugin-sdk-api-baseline.jsonl
+7178659d932136074130426d08e596738a991c6812b2494149427d1f822f1be8  plugin-sdk-api-baseline.json
+fc1e3ab9f21b6f7b6a55498cf5ee322d62dccf4c23322f0ba27559e55a59f901  plugin-sdk-api-baseline.jsonl
--- a/docs/providers/google.md
+++ b/docs/providers/google.md
@@ -352,11 +352,17 @@ SDK rejects language-code hints on this API path.
 </Note>

 <Note>
-Control UI Talk browser sessions still require a realtime voice provider with a
-browser WebRTC session implementation. Today that path is OpenAI Realtime; the
-Google provider is for backend realtime bridges.
+Control UI Talk supports Google Live browser sessions with constrained one-use
+tokens. Backend-only realtime voice providers can also run through the generic
+Gateway relay transport, which keeps provider credentials on the Gateway.
 </Note>

+For maintainer live verification, run
+`OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts`.
+The Google leg mints the same constrained Live API token shape used by Control
+UI Talk, opens the browser WebSocket endpoint, sends the initial setup payload,
+and waits for `setupComplete`.
+
 ## Advanced configuration

 <AccordionGroup>
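
Review sketch for the hunk above: the last step of the Google leg, waiting for `setupComplete`, comes down to recognizing one server message on the WebSocket. The helper below is a hypothetical illustration, not code from this commit. The name `isSetupComplete` is invented here, and it assumes each frame has already been decoded to JSON text carrying a top-level `setupComplete` field; the smoke script may handle frames differently.

```typescript
// Hypothetical helper (not from the commit): returns true when a decoded
// WebSocket frame is the acknowledgement that the setup payload was accepted.
// Assumption: frames are JSON text with a top-level `setupComplete` field.
function isSetupComplete(frame: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(frame);
  } catch {
    // Non-JSON frames are never a setup acknowledgement.
    return false;
  }
  return (
    typeof parsed === "object" &&
    parsed !== null &&
    !Array.isArray(parsed) &&
    "setupComplete" in parsed
  );
}
```

A caller would typically run this check on each incoming message and resolve a "ready" promise on the first match, with a timeout so a rejected token fails the smoke run instead of hanging.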
--- a/docs/providers/openai.md
+++ b/docs/providers/openai.md
@@ -546,7 +546,17 @@ Legacy `plugins.entries.openai.config.personality` is still read as a compatibil
    | API key | `...openai.apiKey` | Falls back to `OPENAI_API_KEY` |

    <Note>
-    Supports Azure OpenAI via `azureEndpoint` and `azureDeployment` config keys. Supports bidirectional tool calling. Uses G.711 u-law audio format.
+    Supports Azure OpenAI via `azureEndpoint` and `azureDeployment` config keys for backend realtime bridges. Supports bidirectional tool calling. Uses G.711 u-law audio format.
+    </Note>
+
+    <Note>
+    Control UI Talk uses OpenAI browser realtime sessions with a Gateway-minted
+    ephemeral client secret and a direct browser WebRTC SDP exchange against the
+    OpenAI Realtime API. Maintainer live verification is available with
+    `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts`;
+    the OpenAI leg mints a client secret in Node, generates a browser SDP offer
+    with fake microphone media, posts it to OpenAI, and applies the SDP answer
+    without logging secrets.
    </Note>

  </Accordion>
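
Review sketch for the hunk above: the OpenAI leg posts a browser SDP offer using the Gateway-minted ephemeral client secret and applies the SDP answer. The helpers below are hypothetical and only show the request shape. They assume the `Authorization: Bearer` header and `application/sdp` content type used in OpenAI's public WebRTC examples. The endpoint URL is deliberately left out because the diff does not state it.

```typescript
interface SdpOfferRequest {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper (not from the commit): builds the HTTP request that
// carries the browser's SDP offer, authenticated with the ephemeral secret.
function buildSdpOfferRequest(offerSdp: string, clientSecret: string): SdpOfferRequest {
  if (!offerSdp.startsWith("v=0")) {
    throw new Error("expected an SDP offer starting with v=0");
  }
  if (clientSecret.length === 0) {
    throw new Error("missing ephemeral client secret");
  }
  return {
    method: "POST",
    headers: {
      // Only the short-lived client secret reaches the browser, never the API key.
      Authorization: `Bearer ${clientSecret}`,
      "Content-Type": "application/sdp",
    },
    body: offerSdp,
  };
}

// Wraps the raw SDP text from the response body as a remote answer description.
function toRemoteAnswer(answerSdp: string): { type: "answer"; sdp: string } {
  return { type: "answer", sdp: answerSdp };
}
```

In a browser, the result of `buildSdpOfferRequest` would be passed to `fetch`, and the value from `toRemoteAnswer` to `RTCPeerConnection.setRemoteDescription`.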
--- a/docs/web/control-ui.md
+++ b/docs/web/control-ui.md
@@ -87,7 +87,7 @@ The Control UI can localize itself on first load based on your browser locale. T
 <AccordionGroup>
  <Accordion title="Chat and Talk">
    - Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
-    - Talk to OpenAI Realtime directly from the browser via WebRTC. The Gateway mints a short-lived Realtime client secret with `talk.realtime.session`; the browser sends microphone audio directly to OpenAI and relays `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model.
+    - Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and sends `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model.
    - Stream tool calls + live tool output cards in Chat (agent events).
  </Accordion>
  <Accordion title="Channels, instances, sessions, dreams">
@@ -144,11 +144,13 @@ The Control UI can localize itself on first load based on your browser locale. T
    - The chat header model and thinking pickers patch the active session immediately through `sessions.patch`; they are persistent session overrides, not one-turn-only send options.
    - When fresh Gateway session usage reports show high context pressure, the chat composer area shows a context notice and, at recommended compaction levels, a compact button that runs the normal session compaction path. Stale token snapshots are hidden until the Gateway reports fresh usage again.
  </Accordion>
-  <Accordion title="Talk mode (browser WebRTC)">
-    Talk mode uses a registered realtime voice provider that supports browser WebRTC sessions. Configure OpenAI with `talk.provider: "openai"` plus `talk.providers.openai.apiKey`, or reuse the Voice Call realtime provider config. The browser never receives the standard OpenAI API key; it receives only the ephemeral Realtime client secret. Google Live realtime voice is supported for backend Voice Call and Google Meet bridges, but not this browser WebRTC path yet. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.
+  <Accordion title="Talk mode (browser realtime)">
+    Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.provider: "openai"` plus `talk.providers.openai.apiKey`, or configure Google with `talk.provider: "google"` plus `talk.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. With OpenAI, it receives an ephemeral Realtime client secret for WebRTC. With Google Live, it receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.

    In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `chat.send`.

+    Maintainer live smoke: `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts` verifies the OpenAI browser WebRTC SDP exchange, Google Live constrained-token browser WebSocket setup, and the Gateway relay browser adapter with fake microphone media. The command prints provider status only and does not log secrets.
+
  </Accordion>
  <Accordion title="Stop and abort">
    - Click **Stop** (calls `chat.abort`).
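
Review sketch for the relay path described in this hunk: before microphone audio can move through the `talk.realtime.relay*` RPCs as PCM, the browser has to turn Web Audio float samples into integer PCM. The helper below is a hypothetical illustration of that conversion. The diff does not state the sample width, sample rate, or frame encoding the relay expects, so signed 16-bit output is an assumption here.

```typescript
// Hypothetical helper (not from the commit): converts Web Audio float samples
// in [-1, 1] to signed 16-bit PCM. Assumption: the relay accepts 16-bit PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input saturates instead of wrapping.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive halves use different scales because the
    // 16-bit range is asymmetric (-32768..32767).
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

Clamping before scaling matters for fake or synthetic microphone input, which can exceed the nominal range and would otherwise wrap around into loud clicks.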