From 40d36b5bbca7a2643a19db91cdc0404dfa3c292e Mon Sep 17 00:00:00 2001
From: Peter Steinberger
Date: Sun, 24 May 2026 00:34:35 +0100
Subject: [PATCH] docs(talk): document realtime active-run control

Co-authored-by: Colin
---
 CHANGELOG.md                  | 1 +
 docs/channels/discord.md      | 2 ++
 docs/gateway/protocol.md      | 2 ++
 docs/nodes/talk.md            | 5 +++++
 docs/plugins/sdk-migration.md | 4 +++-
 docs/web/control-ui.md        | 2 +-
 6 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index fcb90ea4815..02641518c19 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ Docs: https://docs.openclaw.ai
 - Gateway/perf: reuse process-stable channel catalog reads, avoid repeated bundled-channel boundary checks, and rotate gateway watch CPU profiles so benchmark runs do not accumulate unbounded artifacts.
 - Gateway/perf: reuse immutable plugin metadata snapshots across startup, config, model, channel, setup, and secret metadata readers so hot paths avoid repeated plugin file stats and manifest registry reloads.
+- Talk/realtime: let WebUI and Discord voice callers ask for active OpenClaw run status, cancel, steer, or queue follow-up work while a consult is still running. (#84231) Thanks @Solvely-Colin.
 - Gateway/perf: lazy-load startup-idle plugin work, core gateway method handlers, and the embedded ACPX runtime so Gateway health and ready signals no longer wait on unused handler trees or ACPX probes.
 - Gateway/perf: cache plugin SDK public-surface alias maps and skip irrelevant macOS Linuxbrew PATH probes so Gateway startup avoids repeated filesystem walks and slow missing-directory stats.
 - Image tool: add adaptive model-aware image compression with an `agents.defaults.imageQuality` preference for choosing token-efficient, balanced, or high-detail media handling.
diff --git a/docs/channels/discord.md b/docs/channels/discord.md
index 096cf1a60d4..fd87d07fd60 100644
--- a/docs/channels/discord.md
+++ b/docs/channels/discord.md
@@ -1422,6 +1422,8 @@ Voice as an extension of an existing Discord channel session:
 In `agent-proxy` mode the bot joins the configured voice channel, but OpenClaw agent turns use the target channel's normal routed session and agent. The realtime voice session speaks the returned result back into the voice channel. The supervisor agent can still use normal message tools according to its tool policy, including sending a separate Discord message if that is the right action.
+While a delegated OpenClaw run is active, new Discord voice transcripts are treated as live run control before starting another agent turn. Phrases such as "status", "cancel that", "use the smaller fix", or "when you're done also check tests" are classified as status, cancel, steering, or follow-up input for the active session. Status, cancel, accepted steering, and follow-up outcomes are spoken back into the voice channel so the caller knows whether OpenClaw handled the request.
+
 Useful target forms:
 - `target: "channel:123456789012345678"` routes through a Discord text channel session.
diff --git a/docs/gateway/protocol.md b/docs/gateway/protocol.md
index bc4ef64e0f1..8986d5adf46 100644
--- a/docs/gateway/protocol.md
+++ b/docs/gateway/protocol.md
@@ -379,10 +379,12 @@ enumeration of `src/gateway/server-methods/*.ts`.
  - `talk.session.startTurn`, `talk.session.endTurn`, and `talk.session.cancelTurn` drive managed-room turn lifecycle with stale-turn rejection before state is cleared.
  - `talk.session.cancelOutput` stops assistant audio output, primarily for VAD-gated barge-in in Gateway relay sessions.
  - `talk.session.submitToolResult` completes a provider tool call emitted by a Gateway-owned realtime relay session. Pass `options: { willContinue: true }` for interim tool output when a final result will follow, or `options: { suppressResponse: true }` when the tool result should satisfy the provider call without starting another realtime assistant response.
+ - `talk.session.steer` sends active-run voice control into a Gateway-owned agent-backed Talk session. It accepts `{ sessionId, text, mode? }`, where `mode` is `status`, `steer`, `cancel`, or `followup`; omitted mode is classified from the spoken text.
  - `talk.session.close` closes a Gateway-owned relay, transcription, or managed-room session and emits terminal Talk events.
  - `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients.
  - `talk.client.create` creates a client-owned realtime provider session using `webrtc` or `provider-websocket` while the Gateway owns config, credentials, instructions, and tool policy.
  - `talk.client.toolCall` lets client-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result.
+ - `talk.client.steer` sends active-run voice control for client-owned realtime transports. The Gateway resolves the active embedded run from `sessionKey` and returns a structured accepted/rejected result instead of silently dropping steering.
  - `talk.event` is the single Talk event channel for realtime, transcription, STT/TTS, managed-room, telephony, and meeting adapters.
  - `talk.speak` synthesizes speech through the active Talk speech provider.
  - `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state.
diff --git a/docs/nodes/talk.md b/docs/nodes/talk.md
index 1c3109a9e26..c9972db6254 100644
--- a/docs/nodes/talk.md
+++ b/docs/nodes/talk.md
@@ -21,6 +21,11 @@ Native Talk is a continuous voice conversation loop:
 4. Speak it via the configured Talk provider (`talk.speak`)
 Browser realtime Talk forwards provider tool calls through `talk.client.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
+While a realtime consult is active, Talk clients can use `talk.client.steer` or
+`talk.session.steer` to classify spoken input as `status`, `steer`, `cancel`, or
+`followup`. Accepted steering is queued into the active embedded run; rejected
+steering returns a structured reason such as `no_active_run`, `not_streaming`,
+or `compacting`.
 Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
diff --git a/docs/plugins/sdk-migration.md b/docs/plugins/sdk-migration.md
index 5f825dbee82..b1f4e00945e 100644
--- a/docs/plugins/sdk-migration.md
+++ b/docs/plugins/sdk-migration.md
@@ -143,6 +143,7 @@ await gateway.request("talk.client.create", {
   sessionKey: "main",
 });
 await gateway.request("talk.client.toolCall", { sessionKey, callId, name, args });
+await gateway.request("talk.client.steer", { sessionKey, text, mode: "steer" });
 ```
 Browser-owned WebRTC/provider-websocket sessions use `talk.client.create`,
@@ -192,6 +193,7 @@ The unified control vocabulary is also deliberately narrow:
 | `talk.session.cancelTurn` | all Gateway-owned sessions | Cancel active capture/provider/agent/TTS work for a turn. |
 | `talk.session.cancelOutput` | `realtime/gateway-relay` | Stop assistant audio output without necessarily ending the user turn. |
 | `talk.session.submitToolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay; pass `options.willContinue` for interim output or `options.suppressResponse` to satisfy the call without another assistant response. |
+| `talk.session.steer` | agent-backed Talk sessions | Send spoken `status`, `steer`, `cancel`, or `followup` control to the active embedded run resolved from the Talk session. |
 | `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room state, then forget the unified session id. |
 Do not introduce provider or platform special cases in core to make this work.
@@ -624,7 +626,7 @@ releases.
  | `plugin-sdk/speech` | Speech helpers | Speech provider types plus provider-facing directive, registry, validation helpers, and OpenAI-compatible TTS builder |
  | `plugin-sdk/speech-core` | Shared speech core | Speech provider types, registry, directives, normalization |
  | `plugin-sdk/realtime-transcription` | Realtime transcription helpers | Provider types, registry helpers, and shared WebSocket session helper |
- | `plugin-sdk/realtime-voice` | Realtime voice helpers | Provider types, registry/resolution helpers, bridge session helpers, shared agent talk-back queues, transcript/event health, echo suppression, and fast context consult helpers |
+ | `plugin-sdk/realtime-voice` | Realtime voice helpers | Provider types, registry/resolution helpers, bridge session helpers, shared agent talk-back queues, active-run voice control, transcript/event health, echo suppression, and fast context consult helpers |
  | `plugin-sdk/image-generation` | Image-generation helpers | Image generation provider types plus image asset/data URL helpers and the OpenAI-compatible image provider builder |
  | `plugin-sdk/image-generation-core` | Shared image-generation core | Image-generation types, failover, auth, and registry helpers |
  | `plugin-sdk/music-generation` | Music-generation helpers | Music-generation provider/request/result types |
diff --git a/docs/web/control-ui.md b/docs/web/control-ui.md
index 6e6e4e2e7f9..50e3df6511a 100644
--- a/docs/web/control-ui.md
+++ b/docs/web/control-ui.md
@@ -101,7 +101,7 @@ Imported themes are stored only in the current browser profile. They are not wri
  - Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
  - Chat history refreshes request a bounded recent window with per-message text caps so large sessions do not force the browser to render a full transcript payload before the chat becomes usable.
- - Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. Client-owned provider sessions start with `talk.client.create`; Gateway relay sessions start with `talk.session.create`. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.session.appendAudio` and forwards `openclaw_agent_consult` provider tool calls through `talk.client.toolCall` for Gateway policy and the larger configured OpenClaw model.
+ - Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. Client-owned provider sessions start with `talk.client.create`; Gateway relay sessions start with `talk.session.create`. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.session.appendAudio`, forwards `openclaw_agent_consult` provider tool calls through `talk.client.toolCall` for Gateway policy and the larger configured OpenClaw model, and routes active-run voice steering through `talk.client.steer` or `talk.session.steer`.
  - Stream tool calls + live tool output cards in Chat (agent events).
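
The `talk.session.steer` contract described in this patch (`{ sessionId, text, mode? }`, four modes, and a structured accepted/rejected result) could be modelled roughly as below. This is only an illustrative sketch: the type names, the `RunState` values, the keyword classifier, and `resolveSteer` are all hypothetical and are not taken from the OpenClaw codebase. Only the mode names and the three rejection reasons come from the docs changed above.

```typescript
// Hypothetical types mirroring the documented `{ sessionId, text, mode? }` shape.
type SteerMode = "status" | "steer" | "cancel" | "followup";

interface SteerRequest {
  sessionId: string;
  text: string;
  mode?: SteerMode; // when omitted, the mode is classified from the spoken text
}

// Reasons named in docs/nodes/talk.md; the real list may be longer.
type SteerRejection = "no_active_run" | "not_streaming" | "compacting";

type SteerResult =
  | { accepted: true; mode: SteerMode }
  | { accepted: false; reason: SteerRejection };

// Assumed run states, invented here purely to drive the example.
type RunState = "none" | "idle" | "streaming" | "compacting";

// Toy keyword classifier (assumption): a stand-in for the Gateway's own logic.
function classifyVoiceControl(text: string): SteerMode {
  const t = text.trim().toLowerCase();
  if (/\b(status|progress)\b/.test(t)) return "status";
  if (/\b(cancel|stop|abort)\b/.test(t)) return "cancel";
  if (/\bwhen you['’]re done\b|\bafter that\b/.test(t)) return "followup";
  return "steer"; // anything else is treated as steering for the active run
}

// In-memory stand-in for the Gateway side: reject with a structured reason
// instead of silently dropping the input, otherwise accept with a mode.
function resolveSteer(req: SteerRequest, run: RunState): SteerResult {
  if (run === "none") return { accepted: false, reason: "no_active_run" };
  if (run === "compacting") return { accepted: false, reason: "compacting" };
  if (run === "idle") return { accepted: false, reason: "not_streaming" };
  return { accepted: true, mode: req.mode ?? classifyVoiceControl(req.text) };
}

console.log(resolveSteer({ sessionId: "talk-1", text: "cancel that" }, "streaming"));
```

With the example phrases from `docs/channels/discord.md`, this toy classifier maps "status" to `status`, "cancel that" to `cancel`, "use the smaller fix" to `steer`, and "when you're done also check tests" to `followup`. Whether status or cancel requests are really subject to the same rejection reasons as steering is not stated in the docs, so that part of the sketch is a guess.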