docs: detail talk refactor plan

Peter Steinberger
2026-05-06 00:37:03 +01:00
parent 7760edc68e
commit 7431cb8def
9 changed files with 1150 additions and 413 deletions

@@ -184,42 +184,6 @@ OPENCLAW_CONFIG_PATH=~/.openclaw/b.json OPENCLAW_STATE_DIR=~/.openclaw-b opencla
Detailed setup: [/gateway/multiple-gateways](/gateway/multiple-gateways).
## VoiceClaw real-time brain endpoint
OpenClaw exposes a VoiceClaw-compatible real-time WebSocket endpoint at
`/voiceclaw/realtime`. Use it when a VoiceClaw desktop client should talk
directly to a real-time OpenClaw brain instead of going through a separate relay
process.
The endpoint uses Gemini Live for real-time audio and calls OpenClaw as the
brain by exposing OpenClaw tools directly to Gemini Live. Tool calls return an
immediate `working` result to keep the voice turn responsive, then OpenClaw
executes the actual tool asynchronously and injects the result back into the
live session. Set `GEMINI_API_KEY` in the gateway process environment. If
gateway auth is enabled, the desktop client sends the gateway token or password
in its first `session.config` message.
Real-time brain access runs owner-authorized OpenClaw agent commands. Keep
`gateway.auth.mode: "none"` limited to loopback-only test instances. Non-local
real-time brain connections require gateway auth.
For an isolated test gateway, run a separate instance with its own port, config,
and state:
```bash
OPENCLAW_CONFIG_PATH=/path/to/openclaw-realtime/openclaw.json \
OPENCLAW_STATE_DIR=/path/to/openclaw-realtime/state \
OPENCLAW_SKIP_CHANNELS=1 \
GEMINI_API_KEY=... \
openclaw gateway --port 19789
```
Then configure VoiceClaw to use:
```text
ws://127.0.0.1:19789/voiceclaw/realtime
```
## Remote access
Preferred: Tailscale/VPN.

@@ -364,15 +364,17 @@ enumeration of `src/gateway/server-methods/*.ts`.
<Accordion title="Talk and TTS">
- `talk.catalog` returns the read-only Talk provider catalog for speech, streaming transcription, and realtime voice. It includes provider ids, labels, configured state, exposed model/voice ids, canonical modes, transports, brain strategies, and realtime audio/capability flags without returning provider secrets or mutating global config.
- `talk.config` returns the effective Talk config payload; `includeSecrets` requires `operator.talk.secrets` (or `operator.admin`).
- `talk.handoff.create` creates an expiring managed-room handoff for an existing session key. The result contains a room id, room URL, bearer token, optional session-scoped provider/model/voice selection, mode, transport, brain strategy, and expiry for a first-party walkie-talkie client. `brain: "direct-tools"` requires `operator.admin`.
- `talk.handoff.join` validates a handoff id plus bearer token, emits `session.ready` or `session.replaced` room events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
- `talk.handoff.turnStart`, `talk.handoff.turnEnd`, and `talk.handoff.turnCancel` let a first-party managed-room client drive the room turn lifecycle with `turn.started`, `turn.ended`, and `turn.cancelled` Talk events.
- `talk.handoff.revoke` invalidates an unexpired handoff, emits `session.closed`, and makes later joins fail.
- `talk.session.create` creates a Gateway-owned Talk session for `realtime/gateway-relay`, `transcription/gateway-relay`, or `stt-tts/managed-room`. `brain: "direct-tools"` requires `operator.admin`.
- `talk.session.join` validates a managed-room session token, emits `session.ready` or `session.replaced` events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
- `talk.session.appendAudio` appends base64 PCM input audio to Gateway-owned realtime relay and transcription sessions.
- `talk.session.startTurn`, `talk.session.endTurn`, and `talk.session.cancelTurn` drive managed-room turn lifecycle with stale-turn rejection before state is cleared.
- `talk.session.cancelOutput` stops assistant audio output, primarily for VAD-gated barge-in in Gateway relay sessions.
- `talk.session.submitToolResult` completes a provider tool call emitted by a Gateway-owned realtime relay session.
- `talk.session.close` closes a Gateway-owned relay, transcription, or managed-room session and emits terminal Talk events.
- `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients.
- `talk.realtime.session` creates a browser realtime session using canonical transports (`webrtc`, `provider-websocket`, or `gateway-relay`). It accepts optional `mode`, `transport`, and `brain` selectors, but currently only public browser `mode: "realtime"` plus `brain: "agent-consult"` is supported; `managed-room` remains reserved for handoff clients until the browser owns a real room client.
- `talk.realtime.relayAudio`, `talk.realtime.relayCancel`, `talk.realtime.relayMark`, `talk.realtime.relayStop`, and `talk.realtime.relayToolResult` control Gateway-owned realtime relay sessions. Relay cancellation clears provider output and aborts any linked agent consult run.
- `talk.realtime.toolCall` lets browser-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result. Gateway relay clients include `relaySessionId` so turn cancellation can abort the consult.
- `talk.transcription.session` creates a transcription-only Gateway relay over the configured streaming STT provider. Clients send PCM frames through `talk.transcription.relayAudio`, cancel an active turn with `talk.transcription.relayCancel`, receive `talk.transcription.relay` events with common Talk envelopes, and close with `talk.transcription.relayStop`.
- `talk.client.create` creates a client-owned realtime provider session using `webrtc` or `provider-websocket` while the Gateway owns config, credentials, instructions, and tool policy.
- `talk.client.toolCall` lets client-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result.
- `talk.event` is the single Talk event channel for realtime, transcription, STT/TTS, managed-room, telephony, and meeting adapters.
- `talk.speak` synthesizes speech through the active Talk speech provider.
- `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state.
- `tts.providers` returns the visible TTS provider inventory.

@@ -9,8 +9,8 @@ title: "Talk mode"
Talk mode has two runtime shapes:
- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms.
- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response.
- Browser Talk uses `talk.client.create` for client-owned `webrtc` and `provider-websocket` sessions, or `talk.session.create` for Gateway-owned `gateway-relay` sessions. `managed-room` is reserved for Gateway handoff and walkie-talkie rooms.
- Transcription-only clients use `talk.session.create({ mode: "transcription", transport: "gateway-relay", brain: "none" })`, then `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close` when they need captions or dictation without an assistant voice response.
Native Talk is a continuous voice conversation loop:
@@ -19,7 +19,7 @@ Native Talk is a continuous voice conversation loop:
3. Wait for the response
4. Speak it via the configured Talk provider (`talk.speak`)
Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
Browser realtime Talk forwards provider tool calls through `talk.client.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
@@ -132,8 +132,8 @@ Defaults:
- Requires Speech + Microphone permissions.
- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates.
- Browser realtime Talk uses `talk.client.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
- Transcription-only Talk uses `talk.session.create`, `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close`; clients subscribe to `talk.event` for partial/final transcript updates.
- The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
- macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
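
The `stability` rule in the last bullet can be written as a small predicate. This is only a sketch: `isValidStability` is a hypothetical helper name, and it assumes invalid values are rejected rather than snapped to the nearest allowed value, which this page does not specify.

```typescript
// Hypothetical helper; not part of the documented Talk API.
// Assumption: invalid values are rejected instead of being rounded.
function isValidStability(modelId: string, stability: number): boolean {
  if (!Number.isFinite(stability)) {
    return false;
  }
  if (modelId === "eleven_v3") {
    // eleven_v3 only accepts the three discrete values listed above.
    return stability === 0 || stability === 0.5 || stability === 1;
  }
  // Every other model accepts the closed range 0..1.
  return stability >= 0 && stability <= 1;
}
```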

@@ -87,20 +87,19 @@ event history, and stale-turn rejection. Provider plugins should keep owning
vendor-specific realtime sessions; surface plugins should keep owning capture,
playback, telephony, and meeting quirks.
This migration is intentionally adapter-first:
This Talk migration is intentionally breaking-clean:
1. Add shared controller/runtime primitives to `plugin-sdk/realtime-voice`.
2. Keep existing public Gateway RPCs such as `talk.realtime.session`,
`talk.realtime.relayAudio`, `talk.transcription.session`, and
`talk.handoff.*` as compatibility adapters.
3. Move bundled surfaces onto the shared controller: browser relay, managed-room
handoff, voice-call realtime, voice-call streaming STT, Google Meet realtime,
and VoiceClaw realtime.
4. Advertise all Talk event channels in Gateway `hello-ok.features.events` so
clients can discover `talk.event`, `talk.realtime.relay`, and
`talk.transcription.relay`.
5. Expose the versioned `talk.session.*` API for Gateway-managed Talk sessions
after the adapters are internally backed by the same controller.
1. Keep the shared controller/runtime primitives in
`plugin-sdk/realtime-voice`.
2. Move bundled surfaces onto the shared controller: browser relay,
managed-room handoff, voice-call realtime, voice-call streaming STT, Google
Meet realtime, and native push-to-talk.
3. Replace old Talk RPC families with the final `talk.session.*` and
`talk.client.*` API.
4. Advertise one live Talk event channel in Gateway
`hello-ok.features.events`: `talk.event`.
5. Delete the old realtime HTTP endpoint and any request-time instruction
override path.
New code should not call `createTalkEventSequencer(...)` directly unless it is
implementing a low-level adapter or test fixture. Prefer the shared controller
@@ -112,24 +111,33 @@ handoff, and native Talk clients.
The target public API shape is:
```typescript
// Versioned Gateway-managed Talk session API.
// Gateway-owned Talk session API.
await gateway.request("talk.session.create", {
mode: "realtime",
transport: "gateway-relay",
brain: "agent-consult",
sessionKey: "main",
});
await gateway.request("talk.session.inputAudio", { sessionId, audioBase64 });
await gateway.request("talk.session.control", { sessionId, type: "turn.cancel" });
await gateway.request("talk.session.toolResult", { sessionId, callId, result });
await gateway.request("talk.session.appendAudio", { sessionId, audioBase64 });
await gateway.request("talk.session.cancelOutput", { sessionId, reason: "barge-in" });
await gateway.request("talk.session.submitToolResult", { sessionId, callId, result });
await gateway.request("talk.session.close", { sessionId });
// Client-owned provider session API.
await gateway.request("talk.client.create", {
mode: "realtime",
transport: "webrtc",
brain: "agent-consult",
sessionKey: "main",
});
await gateway.request("talk.client.toolCall", { sessionKey, callId, name, args });
```
Browser-owned WebRTC/provider-websocket sessions stay on
`talk.realtime.session`, because the browser owns the provider negotiation and
media transport. `talk.session.*` is the common Gateway-managed surface for
gateway-relay realtime, gateway-relay transcription, and managed-room native
STT/TTS sessions.
Browser-owned WebRTC/provider-websocket sessions use `talk.client.create`,
because the browser owns the provider negotiation and media transport while the
Gateway owns credentials, instructions, and tool policy. `talk.session.*` is the
common Gateway-managed surface for gateway-relay realtime, gateway-relay
transcription, and managed-room native STT/TTS sessions.
Legacy configs that placed realtime selectors beside `talk.provider` /
`talk.providers` should be repaired with `openclaw doctor --fix`; runtime Talk
@@ -144,30 +152,43 @@ The supported `talk.session.create` combinations are intentionally small:
| `stt-tts` | `managed-room` | `agent-consult` | Native/client room | Push-to-talk and walkie-talkie style rooms where the client owns capture/playback and the Gateway owns turn state. |
| `stt-tts` | `managed-room` | `direct-tools` | Native/client room | Admin-only room mode for trusted first-party surfaces that execute Gateway tool actions directly. |
Everything else should stay on the existing owner-specific adapter until there
is a real Gateway-managed transport for it:
Removed method map:
| Existing adapter | Keep using it for |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| `talk.realtime.session` | Browser-owned WebRTC and provider-websocket realtime sessions. |
| `talk.realtime.relay*` | Compatibility for existing browser relay clients while they migrate to `talk.session.*`. |
| `talk.transcription.*` | Compatibility for existing streaming STT clients while they migrate to `talk.session.*`. |
| `talk.handoff.*` | Compatibility for room-style native clients; internally this is the managed-room shape. |
| Old | New |
| -------------------------------- | -------------------------------------------------------- |
| `talk.realtime.session` | `talk.client.create` |
| `talk.realtime.toolCall` | `talk.client.toolCall` |
| `talk.realtime.relayAudio` | `talk.session.appendAudio` |
| `talk.realtime.relayCancel` | `talk.session.cancelOutput` or `talk.session.cancelTurn` |
| `talk.realtime.relayToolResult` | `talk.session.submitToolResult` |
| `talk.realtime.relayStop` | `talk.session.close` |
| `talk.transcription.session` | `talk.session.create({ mode: "transcription" })` |
| `talk.transcription.relayAudio` | `talk.session.appendAudio` |
| `talk.transcription.relayCancel` | `talk.session.cancelTurn` |
| `talk.transcription.relayStop` | `talk.session.close` |
| `talk.handoff.create` | `talk.session.create({ transport: "managed-room" })` |
| `talk.handoff.join` | `talk.session.join` |
| `talk.handoff.revoke` | `talk.session.close` |
The unified control vocabulary is also deliberately narrow:
| Method | Applies to | Contract |
| ------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `talk.session.inputAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. |
| `talk.session.control` | all unified sessions | `turn.cancel` for relay sessions; `turn.start`, `turn.end`, and `turn.cancel` for managed-room sessions. |
| `talk.session.toolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. |
| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room handoff state, then forget the unified session id. |
| Method | Applies to | Contract |
| ------------------------------- | ------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `talk.session.appendAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. |
| `talk.session.startTurn` | `stt-tts/managed-room` | Start a managed-room user turn. |
| `talk.session.endTurn` | `stt-tts/managed-room` | End the active turn after stale-turn validation. |
| `talk.session.cancelTurn` | all Gateway-owned sessions | Cancel active capture/provider/agent/TTS work for a turn. |
| `talk.session.cancelOutput` | `realtime/gateway-relay` | Stop assistant audio output without necessarily ending the user turn. |
| `talk.session.submitToolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. |
| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room state, then forget the unified session id. |
Do not introduce provider or platform special cases in core to make this work.
Core owns Talk session semantics. Provider plugins own vendor session setup.
Voice-call and Google Meet own telephony/meeting adapters. Browser and native
apps own device capture/playback UX.
The detailed implementation plan lives in [Talk refactor plan](/refactor/talk).
## Compatibility policy
For external plugins, compatibility work follows this order:

@@ -0,0 +1,320 @@
---
summary: "Detailed API, event, runtime, cancellation, and tool-policy contract for the Talk refactor"
read_when:
- Implementing Talk Gateway methods or protocol schemas
- Changing Talk config, events, cancellation, or provider tool policy
- Reviewing whether a Talk behavior belongs in core or an adapter
title: "Talk API and runtime contract"
---
# Talk API And Runtime Contract
This is the detailed contract for [Talk refactor plan](/refactor/talk).
## Config Contract
Config stays under the existing `talk` object. Do not add `talk.speech` in this
refactor.
```ts
type TalkConfig = {
provider?: string;
providers?: Record<string, unknown>;
realtime?: {
provider?: string;
model?: string;
voice?: string;
mode?: TalkMode;
transport?: TalkTransport;
brain?: TalkBrain;
providers?: Record<string, unknown>;
};
input?: {
interruptOnSpeech?: boolean;
silenceTimeoutMs?: number;
};
};
```
Rules:
- `talk.provider` and `talk.providers.*` remain speech/STT/TTS provider config.
- `talk.realtime.provider` and `talk.realtime.providers.*` are realtime voice provider config.
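- `talk.config` returns the effective config; secrets are included only when the caller sets `includeSecrets` and holds `operator.talk.secrets` (or `operator.admin`).
- `talk.catalog` returns declared capabilities; it never infers support from provider ids.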
- Doctor migrates old realtime selectors into `talk.realtime`.
- Runtime does not silently reinterpret Voice Call or TTS config as realtime config.
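
As an illustration of the doctor rule above, the following sketch moves realtime selectors found directly under `talk` into `talk.realtime`. It is not the actual `openclaw doctor --fix` implementation. The function name `migrateRealtimeSelectors` is hypothetical, and treating `model`, `voice`, `mode`, `transport`, and `brain` as top-level legacy keys is an assumption. `provider` is deliberately left alone because `talk.provider` remains valid speech config.

```typescript
// Assumed legacy selector keys (illustrative; the real legacy shape is not listed here).
const REALTIME_SELECTOR_KEYS = new Set<string>(["model", "voice", "mode", "transport", "brain"]);

// Returns a new `talk` object with misplaced selectors moved under `realtime`.
// Values already present under `talk.realtime` win over legacy values.
function migrateRealtimeSelectors(talk: Record<string, unknown>) {
  const existing = (talk.realtime as Record<string, unknown> | undefined) ?? {};
  const realtime: Record<string, unknown> = { ...existing };
  const rest: Record<string, unknown> = {};
  const moved: string[] = [];
  for (const [key, value] of Object.entries(talk)) {
    if (key === "realtime") {
      continue;
    }
    if (REALTIME_SELECTOR_KEYS.has(key)) {
      if (realtime[key] === undefined) {
        realtime[key] = value;
      }
      moved.push(key);
    } else {
      rest[key] = value;
    }
  }
  return { config: { ...rest, realtime }, moved };
}
```

Returning the list of moved keys keeps the rewrite explicit, in line with the rule that runtime does not silently reinterpret config.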
## Method Semantics
### `talk.catalog`
Returns effective Talk capabilities:
- modes
- transports
- brain strategies
- providers
- models
- voices
- input audio formats
- output audio formats
- browser-safe client session support
- Gateway relay support
- managed-room support
- local STT/TTS support
Provider capability declarations drive this. Core must not infer support from
provider ids.
### `talk.speak`
One-shot TTS:
```ts
await gateway.request("talk.speak", {
text: "Ready.",
voice: "alloy",
});
```
`talk.speak` does not create live session state, turn state, transcript state,
barge-in state, or provider realtime state.
### `talk.client.create`
Creates a client-owned provider session while Gateway still owns config,
instructions, credentials, and tool policy.
Use it for browser WebRTC, browser provider WebSocket, and native provider media
sessions that require client-owned sockets. Reject `gateway-relay` and
`managed-room`; the error points clients to `talk.session.create`.
### `talk.client.toolCall`
Forwards provider tool calls from client-owned provider sessions to Gateway
policy:
```ts
await gateway.request("talk.client.toolCall", {
sessionId,
callId,
name,
argumentsJson,
});
```
Validate session identity, caller ownership, brain strategy, and policy. Pass an
`AbortSignal` into agent/tool runtime, reject stale or closed sessions, and never
accept request-time instructions.
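
The validation order described above could look roughly like this. Everything here is a sketch under assumptions: `ClientTalkSession`, `ownerConnId`, and `authorizeClientToolCall` are hypothetical names, and the policy step is reduced to allowing only `openclaw_agent_consult` under the `agent-consult` brain.

```typescript
// Hypothetical session record; field names are illustrative only.
type ClientTalkSession = {
  sessionId: string;
  ownerConnId: string;
  brain: "agent-consult" | "direct-tools" | "none";
  closed: boolean;
};

type ClientToolCallRequest = {
  sessionId: string;
  callId: string;
  name: string;
  argumentsJson: string;
  instructions?: unknown; // modelled only to show the rejection path
};

type ToolCallDecision = { allowed: true } | { allowed: false; reason: string };

function authorizeClientToolCall(
  sessions: Map<string, ClientTalkSession>,
  callerConnId: string,
  request: ClientToolCallRequest,
): ToolCallDecision {
  if (request.instructions !== undefined) {
    return { allowed: false, reason: "request-time instructions are not accepted" };
  }
  const session = sessions.get(request.sessionId);
  if (!session || session.closed) {
    return { allowed: false, reason: "unknown or closed session" };
  }
  if (session.ownerConnId !== callerConnId) {
    return { allowed: false, reason: "caller does not own this session" };
  }
  // Simplified policy (assumption): only the consult tool under agent-consult.
  if (session.brain !== "agent-consult" || request.name !== "openclaw_agent_consult") {
    return { allowed: false, reason: "tool not permitted for this brain strategy" };
  }
  return { allowed: true };
}
```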
### `talk.session.create`
Creates a Gateway-owned live Talk session.
| Mode | Transport | Brain | Owner |
| --------------- | --------------- | --------------- | ------------------- |
| `realtime` | `gateway-relay` | `agent-consult` | Gateway |
| `transcription` | `gateway-relay` | `none` | Gateway |
| `stt-tts` | `managed-room` | `agent-consult` | Gateway/client room |
| `stt-tts` | `managed-room` | `direct-tools` | trusted room |
Reject `webrtc` and `provider-websocket`; the error points clients to
`talk.client.create`.
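
A minimal sketch of how the table above could be enforced. The union types are inferred from the values used in this document, and `checkSessionCreate` plus its `scopes` parameter are hypothetical. The `operator.admin` requirement for `direct-tools` comes from the Gateway protocol notes.

```typescript
type TalkMode = "realtime" | "transcription" | "stt-tts";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";

type SessionCreateParams = { mode: TalkMode; transport: TalkTransport; brain: TalkBrain };
type CreateCheck = { ok: true } | { ok: false; error: string };

// Rows of the supported-combination table, encoded as "mode/transport/brain".
const SUPPORTED_SESSION_COMBOS: ReadonlySet<string> = new Set([
  "realtime/gateway-relay/agent-consult",
  "transcription/gateway-relay/none",
  "stt-tts/managed-room/agent-consult",
  "stt-tts/managed-room/direct-tools",
]);

// Hypothetical validator; the real Gateway error shape is not specified here.
function checkSessionCreate(params: SessionCreateParams, scopes: ReadonlySet<string>): CreateCheck {
  if (params.transport === "webrtc" || params.transport === "provider-websocket") {
    return { ok: false, error: `transport "${params.transport}" is client-owned; use talk.client.create` };
  }
  const combo = `${params.mode}/${params.transport}/${params.brain}`;
  if (!SUPPORTED_SESSION_COMBOS.has(combo)) {
    return { ok: false, error: `unsupported combination: ${combo}` };
  }
  if (params.brain === "direct-tools" && !scopes.has("operator.admin")) {
    return { ok: false, error: "direct-tools requires operator.admin" };
  }
  return { ok: true };
}
```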
### `talk.session.join`
Joins or reconnects to a Gateway-owned managed room. Validate session id and
token, never expose token hashes, emit `session.replaced` to the displaced
client, and emit `session.ready` to the new owner.
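
The join behaviour can be sketched as follows. `ManagedRoomRecord`, `RoomNotice`, and `joinManagedRoom` are hypothetical names, and token checking is delegated to a caller-supplied `tokenMatches` callback because the real hashing scheme is not described here.

```typescript
// Hypothetical room record; the real Gateway types are not shown in this document.
type ManagedRoomRecord = { sessionId: string; ownerConnId?: string };
type RoomNotice = { type: "session.ready" | "session.replaced"; sessionId: string; toConnId: string };
type JoinOutcome =
  | { ok: true; sessionId: string; notices: RoomNotice[] }
  | { ok: false; error: string };

function joinManagedRoom(
  room: ManagedRoomRecord,
  request: { sessionId: string; token: string; connId: string },
  tokenMatches: (token: string) => boolean, // assumed to compare against a stored hash
): JoinOutcome {
  if (request.sessionId !== room.sessionId || !tokenMatches(request.token)) {
    return { ok: false, error: "invalid session id or token" };
  }
  const notices: RoomNotice[] = [];
  const previousOwner = room.ownerConnId;
  if (previousOwner !== undefined && previousOwner !== request.connId) {
    // Tell the displaced client first, then hand ownership over.
    notices.push({ type: "session.replaced", sessionId: room.sessionId, toConnId: previousOwner });
  }
  room.ownerConnId = request.connId;
  notices.push({ type: "session.ready", sessionId: room.sessionId, toConnId: request.connId });
  // The result carries metadata only: no plaintext token and no token hash.
  return { ok: true, sessionId: room.sessionId, notices };
}
```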
### `talk.session.appendAudio`
Appends an input audio frame to a Gateway-owned relay session:
```ts
await gateway.request("talk.session.appendAudio", {
sessionId,
audioBase64,
timestamp,
});
```
Use for realtime Gateway relay and streaming transcription. Do not use this for
managed-room native push-to-talk when the native node captures audio locally and
returns transcript/output through node command results.
### Turn Verbs
Use explicit verbs instead of generic controls:
```ts
await gateway.request("talk.session.startTurn", { sessionId });
await gateway.request("talk.session.endTurn", { sessionId, turnId });
await gateway.request("talk.session.cancelTurn", { sessionId, turnId, reason });
await gateway.request("talk.session.cancelOutput", { sessionId, turnId, reason });
```
`endTurn` rejects stale `turnId` before clearing active state. `cancelTurn`
aborts capture, STT, provider response, agent consult, tools, TTS, relay output,
and room streams tied to that turn. `cancelOutput` stops assistant audio without
necessarily ending the user turn. Barge-in must be speech/VAD gated.
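
Stale-turn rejection is easy to get subtly wrong, so here is a minimal sketch of the `endTurn` rule. `TurnState`, `beginTurn`, and `finishTurn` are hypothetical names, and letting a new turn supersede an active one is an assumption of this sketch rather than documented behaviour.

```typescript
// Hypothetical per-session turn state.
type TurnState = { sessionId: string; activeTurnId?: string; nextTurnNumber: number };
type TurnResult = { ok: true } | { ok: false; error: "stale_turn" | "no_active_turn" };

// Assumption: starting a turn while another is active supersedes the old one.
function beginTurn(state: TurnState): string {
  const turnId = `${state.sessionId}:turn-${state.nextTurnNumber}`;
  state.nextTurnNumber += 1;
  state.activeTurnId = turnId;
  return turnId;
}

function finishTurn(state: TurnState, turnId: string): TurnResult {
  if (state.activeTurnId === undefined) {
    return { ok: false, error: "no_active_turn" };
  }
  if (state.activeTurnId !== turnId) {
    // Reject before touching state, so a late request cannot clear the live turn.
    return { ok: false, error: "stale_turn" };
  }
  state.activeTurnId = undefined;
  return { ok: true };
}
```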
### `talk.session.submitToolResult`
Completes a provider tool call emitted inside a Gateway-owned relay session:
```ts
await gateway.request("talk.session.submitToolResult", {
sessionId,
callId,
output,
});
```
### `talk.session.close`
Closes a Gateway-owned session. Close emits one terminal event, stops capture and
playback, aborts provider and agent work, drains TTS, revokes room join state,
and removes retained state after its replay/debug window.
## Event Contract
All live Talk paths emit one public event channel:
```ts
talk.event;
```
Every event uses this envelope:
```ts
type TalkEvent<TPayload = unknown> = {
id: string;
type: TalkEventType;
sessionId: string;
turnId?: string;
captureId?: string;
seq: number;
timestamp: string;
mode: TalkMode;
transport: TalkTransport;
brain: TalkBrain;
provider?: string;
final?: boolean;
callId?: string;
itemId?: string;
parentId?: string;
source?: string;
payload: TPayload;
};
```
Core event types include `session.*`, `turn.*`, `capture.*`, `input.audio.*`,
`transcript.*`, `output.text.*`, `output.audio.*`, `tool.*`, `usage.metrics`,
`latency.metrics`, and `health.changed`.
Rules:
- `sessionId` is required for every event.
- `turnId` is required for turn-bound input, output, transcript, tool, and cancellation events.
- `captureId` is required while capture is active.
- `seq` monotonically increases per session.
- `timestamp` uses ISO 8601 UTC.
- `callId`, `itemId`, and `parentId` correlate provider responses, tool calls, TTS jobs, and relay frames.
- payloads must not duplicate large raw audio frames when transport already carries them.
- consumers should rely on envelope fields instead of provider-specific payloads.
Text-ready is not audio-ready. Clients may show text after `output.text.done`,
but should not enter speaking/playback state until `output.audio.started` or
`output.audio.delta`.
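
One way a client might honour the text-ready versus audio-ready distinction is a small reducer over event types. `ClientPhase` and `nextClientPhase` are hypothetical, and returning to `idle` on `turn.ended`, `turn.cancelled`, or `session.closed` is an assumption of this sketch.

```typescript
// Hypothetical client-side UI phase.
type ClientPhase = "idle" | "text-ready" | "speaking";

function nextClientPhase(phase: ClientPhase, eventType: string): ClientPhase {
  switch (eventType) {
    case "output.text.done":
      // Text may be shown, but this alone never starts the speaking state.
      return phase === "speaking" ? phase : "text-ready";
    case "output.audio.started":
    case "output.audio.delta":
      return "speaking";
    case "turn.ended":
    case "turn.cancelled":
    case "session.closed":
      return "idle";
    default:
      return phase;
  }
}
```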
## Shared Runtime Target
Keep one provider-agnostic runtime under `src/talk`. The first pass keeps names
close to the old runtime modules so the move stays reviewable:
```text
src/talk/
audio-codec.ts
agent-consult-runtime.ts
agent-consult-tool.ts
agent-talkback-runtime.ts
fast-context-runtime.ts
provider-registry.ts
provider-resolver.ts
provider-types.ts
session-log-runtime.ts
session-runtime.ts
talk-events.ts
talk-session-controller.ts
```
New code should import the shared runtime from `src/talk` inside core. Plugins
that already use the stable SDK subpath keep importing
`openclaw/plugin-sdk/realtime-voice`; that facade re-exports the Talk runtime
contract without exposing core file layout.
Responsibilities:
- normalize modes, transports, brains, codecs, and audio metadata
- create, close, and replace session records
- allocate turn ids and capture ids
- reject stale turn ids before mutation
- sequence events
- retain recent events for replay, reconnect, and diagnostics
- track active input capture and assistant output
- coordinate barge-in and output cancellation
- propagate abort signals
- register provider tool calls and bind tool results
- expose test builders for session/event assertions
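
To make the sequencing and retention responsibilities concrete, here is a minimal sketch. `SessionEventLog` is a hypothetical stand-in and is not the real shared controller or `createTalkEventSequencer(...)`. The retention size and id format are arbitrary choices for illustration.

```typescript
// Reduced envelope for the sketch; the full TalkEvent has more fields.
type SequencedTalkEvent = {
  id: string;
  sessionId: string;
  seq: number;
  timestamp: string;
  type: string;
  payload: unknown;
};

class SessionEventLog {
  private seq = 0;
  private recent: SequencedTalkEvent[] = [];
  private readonly sessionId: string;
  private readonly retain: number;

  constructor(sessionId: string, retain = 50) {
    this.sessionId = sessionId;
    this.retain = retain;
  }

  // Assigns the next per-session sequence number and an ISO 8601 UTC timestamp.
  emit(type: string, payload: unknown, now: Date = new Date()): SequencedTalkEvent {
    this.seq += 1;
    const entry: SequencedTalkEvent = {
      id: `${this.sessionId}:${this.seq}`,
      sessionId: this.sessionId,
      seq: this.seq,
      timestamp: now.toISOString(),
      type,
      payload,
    };
    this.recent.push(entry);
    if (this.recent.length > this.retain) {
      this.recent.shift(); // drop the oldest retained event
    }
    return entry;
  }

  // Events a reconnecting client has not seen yet, limited to what is retained.
  replayAfter(lastSeenSeq: number): SequencedTalkEvent[] {
    return this.recent.filter((entry) => entry.seq > lastSeenSeq);
  }
}
```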
Gateway method files should become thin adapters:
```text
src/gateway/server-methods/
talk.ts
talk-client.ts
talk-session.ts
```
Internal Gateway helpers may exist only as staging files while code moves to
`src/talk`.
## Cancellation Contract
Cancellation must abort underlying work, not only ignore stale output.
When a turn or session is cancelled:
- provider realtime response is cancelled when supported
- provider session is closed or reset when cancellation cannot be scoped
- streaming STT receives abort
- agent consult receives abort
- queued tools do not start after abort
- already-started side-effecting tools receive abort and report cancellation
- pending TTS jobs are drained
- playback sources are stopped
- relay streams are cleared
- managed-room capture and output state reset
- stale finals and stale audio deltas are ignored
- one terminal cancellation event is emitted
Barge-in uses VAD or provider speech-started signals, ignores silence and echo,
cancels output only after real user speech, and starts or ensures a turn before
emitting `turn.cancelled`.
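
Two of the rules above, "queued tools do not start after abort" and "one terminal cancellation event is emitted", can be sketched with the standard `AbortController`. `TurnWork`, `cancelTurnWork`, and `runQueuedTools` are hypothetical names. A real runtime would also abort providers, STT, and TTS, which this sketch leaves out.

```typescript
// Hypothetical bookkeeping for the work attached to one turn.
type TurnWork = { turnId: string; controller: AbortController; terminalEmitted: boolean };
type QueuedTool = (signal: AbortSignal) => Promise<string>;

// Aborts the turn's work and emits exactly one terminal event, even if called twice.
function cancelTurnWork(work: TurnWork, emit: (type: string) => void): void {
  if (work.terminalEmitted) {
    return;
  }
  work.terminalEmitted = true;
  work.controller.abort();
  emit("turn.cancelled");
}

// Runs tools in order. Tools that have not started yet are skipped after abort.
// Already-started tools receive the signal and are expected to honour it themselves.
async function runQueuedTools(tools: QueuedTool[], signal: AbortSignal): Promise<string[]> {
  const outputs: string[] = [];
  for (const tool of tools) {
    if (signal.aborted) {
      break;
    }
    outputs.push(await tool(signal));
  }
  return outputs;
}
```

Passing the same `AbortSignal` into every tool is what lets cancellation stop underlying work instead of merely discarding stale output.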
## Tool Policy Contract
Gateway owns Talk tool policy.
Client-owned flow:
1. `talk.client.create`
2. provider tool call to client
3. `talk.client.toolCall`
4. Gateway policy validation
5. agent/direct-tool execution
6. client result submission to provider
Gateway-owned flow:
1. `talk.session.create`
2. provider tool call to Gateway
3. Gateway policy validation
4. agent/direct-tool execution
5. provider result submission
6. `talk.event` emission
No Talk path accepts caller-provided instructions. Gateway builds instructions
from trusted config and session context.

@@ -0,0 +1,229 @@
---
summary: "Implementation packages, deletion checklist, test matrix, and verification commands for the Talk refactor"
read_when:
- Implementing the Talk refactor plan
- Deleting legacy Talk RPCs, event channels, or realtime endpoint code
- Verifying browser, native, telephony, meeting, STT, or TTS Talk behavior after refactor work
title: "Talk refactor execution checklist"
---
# Talk Refactor Execution Checklist
Use this as the PR tracker for [Talk refactor plan](/refactor/talk).
## Implementation Packages
### Package 1: Protocol
- update `src/gateway/protocol/schema/channels.ts`
- update `src/gateway/protocol/schema/protocol-schemas.ts`
- update `src/gateway/protocol/schema/types.ts`
- update `src/gateway/protocol/index.ts`
- regenerate generated protocol clients
- remove old schemas from generated metadata
- update protocol tests
Done when old RPC/event names are absent from generated protocol output.
### Package 2: Gateway Methods
- split client-owned methods into `talk-client.ts`
- keep session-owned methods in `talk-session.ts`
- keep catalog/config/speak/mode in `talk.ts`
- classify every new method in method scopes
- advertise only `talk.event` in hello event features
- remove old method list entries
- update authorization tests
Done when every public Talk method has an explicit scope.
### Package 3: Session Runtime
- add `src/talk` primitives
- move event sequencing into shared runtime
- move stale-turn rejection into shared runtime
- move active output state into shared runtime
- move cancellation bookkeeping into shared runtime
- expose small test helpers
Done when relay, transcription, handoff, telephony, and meetings no longer each
implement their own event and turn bookkeeping.
### Package 4: Browser UI
- update realtime startup to `talk.client.create`
- update realtime tool consult to `talk.client.toolCall`
- update relay startup to `talk.session.create`
- update relay audio to `talk.session.appendAudio`
- update relay tool result to `talk.session.submitToolResult`
- update relay output cancel to `talk.session.cancelOutput`
- update relay close to `talk.session.close`
- listen only to `talk.event`
- remove relay mark RPC
Done when UI tests prove no removed RPC names remain.
### Package 5: Native And Nodes
- route native Talk through session events
- map push-to-talk commands to managed-room turn lifecycle
- clean capture state on failed start
- keep local STT/TTS as adapter behavior
- remove chat-history polling from the success path
- keep fallback polling only if explicitly needed
Done when native voice success path is event-driven.
### Package 6: Voice Call
- map telephony realtime events into `talk.event`
- map local speech detection to `startTurn`, `cancelOutput`, and `cancelTurn`
- pass abort through agent consult and tools
- keep marks, clear, u-law, and call lifecycle in the plugin
- add tests for early speech before provider speech-started
Done when Voice Call shares event and cancellation semantics without leaking
telephony into core.
### Package 7: Meetings
- map meeting speech and transcript state into `talk.event`
- keep participant and room state in meeting adapter
- add echo-suppression aware barge-in tests
- ensure meeting adapters can choose realtime, transcription, or `stt-tts`
Done when meeting behavior is an adapter over Talk, not a parallel realtime loop.
### Package 8: Doctor And Migration
- detect old realtime selectors outside `talk.realtime`
- write explicit `talk.realtime.provider`, `model`, `voice`, `transport`, and `brain`
- report removed RPC names when logs show old clients
- keep startup free of hidden config rewrites
- update SDK migration, Gateway protocol, Talk node, Control UI, and TTS docs
Done when runtime config is explicit and docs mention removed APIs only in
migration notes.
## Deletion Checklist
Delete or prove absent:
- `src/gateway/voiceclaw-realtime/`
- `/voiceclaw/realtime`
- `instructionsOverride`
- `talk.realtime.*` public RPCs
- `talk.transcription.*` public RPCs
- `talk.handoff.*` public RPCs
- `talk.session.inputAudio`
- `talk.session.control`
- `talk.session.toolResult`
- `talk.realtime.relay`
- `talk.transcription.relay`
- old generated protocol models
- old UI relay method calls
Keep these old names only in explicit migration tables.
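One way to "prove absent" is a small check over the advertised method and event names. The sketch below is illustrative only: `findRemovedNames` and the two constant lists are hypothetical helpers, not existing OpenClaw code. It expects RPC and event names, not config keys, because `talk.realtime.provider` and its sibling config keys are still written by doctor.

```ts
// Hypothetical helper: flags removed public RPC/event names from the checklist above.
// Pass method and event names only; config keys under `talk.realtime` are still valid.
const REMOVED_PREFIXES: readonly string[] = [
  "talk.realtime.",
  "talk.transcription.",
  "talk.handoff.",
];

const REMOVED_EXACT: readonly string[] = [
  "talk.session.inputAudio",
  "talk.session.control",
  "talk.session.toolResult",
];

function isRemovedName(name: string): boolean {
  return (
    REMOVED_EXACT.includes(name) ||
    REMOVED_PREFIXES.some((prefix) => name.startsWith(prefix))
  );
}

// Returns every removed name found in a list of advertised methods or events.
function findRemovedNames(names: readonly string[]): string[] {
  return names.filter(isRemovedName);
}
```

A protocol test could assert that `findRemovedNames(advertisedMethods)` is empty, with the migration tables excluded from the scan.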
## Test Matrix
Protocol:
- final methods exist in protocol schemas
- removed methods are absent from protocol schemas
- final event is advertised in hello features
- removed events are absent from broadcast guards
- generated clients match schema
- request-time instruction override is rejected or impossible by schema
Gateway:
- `talk.client.create` creates WebRTC session result
- `talk.client.create` creates provider WebSocket session result
- `talk.client.create` rejects Gateway-owned transports
- `talk.client.toolCall` validates caller, session, brain, and policy
- `talk.session.create` creates realtime Gateway relay
- `talk.session.create` creates transcription relay
- `talk.session.create` creates STT/TTS managed room
- `talk.session.create` rejects client-owned transports
- `talk.session.join` replacement notifies displaced client
- `talk.session.appendAudio` routes to relay/transcription session
- `talk.session.startTurn` starts managed-room turn
- `talk.session.endTurn` rejects stale turn ids
- `talk.session.cancelTurn` aborts provider, agent, tools, TTS, and streams
- `talk.session.cancelOutput` cancels playback only
- `talk.session.submitToolResult` binds to provider call id
- `talk.session.close` emits terminal event and releases resources
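The stale-turn case in the list above can be pictured with a minimal tracker. This is a sketch under stated assumptions, not the Gateway implementation: `TurnTracker`, its id format, and the choice that a new `startTurn` supersedes the active turn are all hypothetical.

```ts
type EndTurnResult =
  | { ok: true }
  | { ok: false; reason: "no-active-turn" | "stale-turn" };

// Hypothetical tracker: one active turn per session, stale ids rejected.
class TurnTracker {
  private activeTurnId: string | undefined = undefined;
  private counter = 0;

  // Assumption: starting a new turn supersedes any turn that is still active.
  startTurn(): string {
    this.counter += 1;
    this.activeTurnId = `turn-${this.counter}`;
    return this.activeTurnId;
  }

  // Reject before clearing, so a stale caller cannot end the current turn.
  endTurn(turnId: string): EndTurnResult {
    if (this.activeTurnId === undefined) {
      return { ok: false, reason: "no-active-turn" };
    }
    if (turnId !== this.activeTurnId) {
      return { ok: false, reason: "stale-turn" };
    }
    this.activeTurnId = undefined;
    return { ok: true };
  }
}
```

A `talk.session.endTurn` test could start two turns, end the first id, and expect a `stale-turn` rejection while the second turn stays active.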
Browser:
- WebRTC path calls `talk.client.create`
- provider WebSocket path calls `talk.client.create`
- provider tool calls use `talk.client.toolCall`
- Gateway relay uses only `talk.session.*`
- Gateway relay listens only to `talk.event`
- barge-in requires speech/VAD
- relay close rejects or aborts pending consult runs
- no removed RPC names in UI tests
Native:
- push-to-talk start emits capture/turn events
- failed push-to-talk start cleans capture state
- cancel clears capture and output state
- STT/TTS success path is event-driven
- fallback polling is explicit and tested if kept
- node policy rejects untrusted Talk commands
Telephony:
- speech detected before the provider's speech-started event creates or guards a turn before any cancellation
- marks and clear events map to output state
- u-law codec stays adapter-owned
- cancellation aborts consult run
- closed call prevents stale tool result submission
Meetings:
- participant context appears as metadata, not core branching
- echo suppression prevents false barge-in
- transcript events use common envelope
- meeting close aborts active work
Architecture:
- no removed public RPC names in protocol metadata
- no retired realtime endpoint route
- no retired realtime folder
- no request-time instruction override field
- no core branches on app platform names
- provider behavior comes from capabilities
## Verification Commands
Focused local loop:
```sh
pnpm test src/gateway/protocol/index.test.ts
pnpm test src/gateway/server-methods/talk.test.ts
pnpm test src/gateway/method-scopes.test.ts src/gateway/server-methods-list.test.ts
pnpm test src/gateway/talk-realtime-relay.test.ts src/gateway/talk-transcription-relay.test.ts
pnpm test ui/src/ui/realtime-talk.test.ts ui/src/ui/realtime-talk-gateway-relay.test.ts ui/src/ui/realtime-talk-webrtc.test.ts ui/src/ui/realtime-talk-google-live.test.ts
pnpm exec oxfmt --check --threads=1 docs/refactor/talk.md docs/refactor/talk-execution.md
```
Generation and docs:
```sh
pnpm protocol:gen && pnpm protocol:gen:swift
pnpm docs:check-mdx
pnpm plugin-sdk:api:check
```
Broad gate before push:
```sh
pnpm check:changed
```
Use Testbox for broad gates on maintainer machines.


@@ -0,0 +1,128 @@
---
summary: "Surface adapter plan for browser, native, walkie-talkie, telephony, and meeting Talk refactor work"
read_when:
- Updating browser realtime Talk, native Talk, walkie-talkie handoff, Voice Call, or meeting voice code
- Deciding whether a Talk behavior belongs in an adapter or shared runtime
title: "Talk surface mapping"
---
# Talk Surface Mapping
This page maps product surfaces onto [Talk refactor plan](/refactor/talk) primitives.
## Browser
WebRTC:
- call `talk.client.create`
- open provider media connection in browser
- forward provider tool calls through `talk.client.toolCall`
- receive provider audio through provider media/data channel
Provider WebSocket:
- call `talk.client.create`
- connect using constrained provider result
- keep provider-specific framing in the browser adapter
- forward tool calls through `talk.client.toolCall`
Gateway relay:
- call `talk.session.create`
- send PCM frames with `talk.session.appendAudio`
- listen only to `talk.event`
- submit tool results with `talk.session.submitToolResult`
- barge-in with `talk.session.cancelOutput`
- close with `talk.session.close`
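The Gateway relay steps above could be expressed as a pure mapping from browser adapter actions to Gateway requests. This is a rough sketch: the method names come from the list above, while `RelayAction`, `toGatewayRequest`, and every parameter name (`sessionId`, `audio`, `callId`, `result`) are assumptions for illustration, not the real protocol schema.

```ts
// Hypothetical adapter actions; parameter names are illustrative, not the schema.
type RelayAction =
  | { kind: "audio"; sessionId: string; pcmBase64: string }
  | { kind: "toolResult"; sessionId: string; callId: string; result: unknown }
  | { kind: "bargeIn"; sessionId: string }
  | { kind: "close"; sessionId: string };

type GatewayRequest = { method: string; params: Record<string, unknown> };

// Maps each relay action to the single `talk.session.*` method named in the list.
function toGatewayRequest(action: RelayAction): GatewayRequest {
  switch (action.kind) {
    case "audio":
      return {
        method: "talk.session.appendAudio",
        params: { sessionId: action.sessionId, audio: action.pcmBase64 },
      };
    case "toolResult":
      return {
        method: "talk.session.submitToolResult",
        params: {
          sessionId: action.sessionId,
          callId: action.callId,
          result: action.result,
        },
      };
    case "bargeIn":
      // Barge-in cancels playback only; the turn itself stays alive.
      return {
        method: "talk.session.cancelOutput",
        params: { sessionId: action.sessionId },
      };
    case "close":
      return {
        method: "talk.session.close",
        params: { sessionId: action.sessionId },
      };
  }
}
```

Keeping the mapping pure makes the "Gateway relay uses only `talk.session.*`" check a plain unit test over the returned method names.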
## Native And Nodes
Native apps map local audio lifecycle into Talk primitives.
Native realtime:
- use `talk.client.create` when the app owns provider media
- use `talk.session.create` when Gateway owns provider relay
Native STT/TTS:
- use `talk.session.create({ mode: "stt-tts", transport: "managed-room" })`
- keep local STT and local TTS behind native adapters
- drive success path from Talk events
- keep history polling only as a degraded fallback if explicitly tested
Native push-to-talk:
- press maps to `talk.session.startTurn`
- release maps to `talk.session.endTurn`
- cancel maps to `talk.session.cancelTurn`
- node capture commands emit capture events
- failed start cleans capture state
- opening voice UI never mutates global Talk config
Trusted node command adapters may remain:
```ts
talk.ptt.start;
talk.ptt.stop;
talk.ptt.cancel;
talk.ptt.once;
```
## Walkie-Talkie
Walkie-talkie is managed-room Talk:
```ts
await gateway.request("talk.session.create", {
mode: "stt-tts",
transport: "managed-room",
brain: "agent-consult",
sessionKey,
});
```
Then:
- client joins with `talk.session.join`
- press calls `talk.session.startTurn`
- release calls `talk.session.endTurn`
- cancel calls `talk.session.cancelTurn`
- assistant speech emits `output.text.*` and `output.audio.*`
- replacement emits `session.replaced` to old owner
- close calls `talk.session.close`
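The press, release, and cancel mapping above can be sketched as a tiny button state machine. The state names and the `pttStep` function are hypothetical; only the three method names come from the list. Returning from `waiting` to `idle` when the reply finishes is deliberately left out.

```ts
type PttState = "idle" | "capturing" | "waiting";
type PttInput = "press" | "release" | "cancel";
type PttStep = { next: PttState; method?: string };

// Hypothetical reducer: returns the next button state and the RPC to call, if any.
function pttStep(state: PttState, input: PttInput): PttStep {
  if (state === "idle" && input === "press") {
    return { next: "capturing", method: "talk.session.startTurn" };
  }
  if (state === "capturing" && input === "release") {
    return { next: "waiting", method: "talk.session.endTurn" };
  }
  if (state !== "idle" && input === "cancel") {
    return { next: "idle", method: "talk.session.cancelTurn" };
  }
  // Anything else (for example a second press while capturing) is ignored.
  return { next: state };
}
```

Ignoring out-of-order inputs in the client keeps duplicate presses from producing extra `startTurn` calls; the Gateway would still need its own stale-turn checks.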
Room state includes canonical session id, route/channel target, caller identity,
mode, transport, brain, provider, model, voice, locale, expiry, token hash,
active client id, active turn id, and replacement state.
Two simultaneous rooms must not share turn ids, transcripts, audio output, or
cancellation tokens.
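A minimal sketch of the isolation rule, assuming turn ids are namespaced by room and cancellation is checked against the room's own active turn. `RoomRegistry` and its id format are hypothetical, and real room state carries far more fields than shown here.

```ts
type Room = { activeTurnId: string | undefined; turnCount: number };

// Hypothetical registry: each room mints and validates only its own turn ids.
class RoomRegistry {
  private readonly rooms = new Map<string, Room>();

  create(roomId: string): void {
    if (this.rooms.has(roomId)) throw new Error(`room ${roomId} already exists`);
    this.rooms.set(roomId, { activeTurnId: undefined, turnCount: 0 });
  }

  startTurn(roomId: string): string {
    const room = this.mustGet(roomId);
    room.turnCount += 1;
    // Prefixing with the room id keeps turn ids unique across rooms.
    room.activeTurnId = `${roomId}:turn-${room.turnCount}`;
    return room.activeTurnId;
  }

  // Cancels only when the turn id belongs to this room's active turn.
  cancelTurn(roomId: string, turnId: string): boolean {
    const room = this.mustGet(roomId);
    if (room.activeTurnId !== turnId) return false;
    room.activeTurnId = undefined;
    return true;
  }

  activeTurn(roomId: string): string | undefined {
    return this.mustGet(roomId).activeTurnId;
  }

  private mustGet(roomId: string): Room {
    const room = this.rooms.get(roomId);
    if (room === undefined) throw new Error(`unknown room ${roomId}`);
    return room;
  }
}
```

With this shape, a cancel aimed at one room cannot clear another room's turn even when both rooms are on their first turn.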
## Telephony
Voice Call becomes a telephony adapter over Talk semantics.
Keep telephony-owned: Twilio/Plivo WebSocket contracts, stream ids, call ids,
G.711 u-law, marks, clear events, backpressure, phone call lifecycle, and inbound
speech detection quirks.
Move shared behavior to Talk: event envelope, turn ids, cancellation, agent
consult abort, tool policy, usage and latency metrics, and output state.
Telephony should emit `talk.event` for observability, even if phone media
remains plugin-owned.
## Meetings
Google Meet and future meeting integrations become meeting adapters over Talk
semantics.
Keep meeting-owned: meeting join/leave, participant identity, room permissions,
echo suppression, transcript context, and meeting-specific mute/deafen behavior.
Move shared behavior to Talk: turn lifecycle, transcript events, assistant output
events, tool policy, cancellation, and metrics.
Meeting adapters may run `transcription`, `stt-tts`, or `realtime` depending on
provider support.


@@ -1,55 +1,68 @@
---
summary: "Grand unification plan for Talk mode, realtime voice, voice-call, Google Meet, and VoiceClaw realtime"
summary: "Breaking refactor plan for one Talk architecture across realtime voice, STT/TTS, browser, native, telephony, meetings, and walkie-talkie handoff"
read_when:
- Refactoring Talk mode, realtime voice, voice-call, Google Meet, or VoiceClaw realtime
- Changing Talk protocol, provider contracts, browser realtime, or native voice behavior
- Deciding whether a voice feature belongs in core, a provider plugin, or a surface adapter
title: "Talk unification plan"
- Refactoring Talk mode, realtime voice, voice-call, Google Meet, browser realtime voice, native push-to-talk, STT, or TTS
- Changing Talk Gateway protocol, provider contracts, realtime transports, managed rooms, audio events, cancellation, or tool policy
- Deciding whether a voice feature belongs in core, a provider plugin, a native app, a meeting adapter, or a telephony adapter
title: "Talk refactor plan"
---
# Talk Unification Plan
# Talk Refactor Plan
OpenClaw has several voice loops that grew from different product surfaces: native Talk mode, browser realtime Talk, Voice Call realtime, Google Meet realtime, streaming STT, TTS reply playback, and `/voiceclaw/realtime`. The goal is not to force all of them into one implementation. The goal is one session contract, one event vocabulary, one policy boundary, and small adapters for each surface.
This is the breaking-clean plan for unifying every live voice path behind one
Talk architecture.
Core should know conversation modes, byte transports, audio formats, tool policy, and client capabilities. Core should not know platform product names such as iOS, Android, or macOS except as optional telemetry emitted by an edge client.
The old architecture grew by product surface: browser realtime, Gateway relay,
managed native handoff, streaming transcription, Voice Call, Google Meet, local
STT/TTS, one-shot TTS, and a retired realtime WebSocket endpoint each learned
their own names for sessions, turns, capture, output, barge-in, tool calls,
cancellation, and transcript events.
## Goals
The new architecture grows by primitive. There is one public Talk API, one
event envelope, one turn model, one cancellation contract, one provider policy
boundary, and one place for shared runtime state. Browser, native, telephony,
meetings, and walkie-talkie become adapters over those primitives.
- Make browser Talk, native Talk, telephony, meetings, and VoiceClaw realtime share the same session semantics.
- Keep provider-specific realtime behavior in provider plugins.
- Keep telephony and meeting quirks in their owning plugins.
- Move browser realtime agent consult out of browser-owned `chat.send`.
- Keep existing public entry points only as migration adapters while the runtime converges.
- Keep local STT/TTS as a first-class fallback, not a deprecated path.
- Support a first-party walkie-talkie client that can hand off an existing OpenClaw session into voice without becoming a separate assistant.
- Make event logs, latency, usage, tool calls, cancellation, and interruption observable in the same shape everywhere.
## Product Target
## Non Goals
OpenClaw supports three Talk products:
- Do not make core branch on app platforms.
- Do not move OpenAI, Google, Twilio, or meeting-specific behavior into core.
- Do not merge one-shot inbound audio attachments with live Talk sessions beyond sharing STT provider contracts where useful.
- Do not remove `/voiceclaw/realtime` or existing Talk RPC entry points during the first migration; they may reject retired fields instead of preserving every old request shape.
- Do not allow request-time instruction overrides for realtime sessions.
- Do not copy VoiceClaw names or request fields into shared APIs; preserve the realtime runtime capabilities through the shared Talk contract, except request-time instruction overrides.
| Product | User experience | Mode |
| --------------------- | ----------------------------------------------------------------------- | --------------- |
| Realtime conversation | Low-latency duplex speech with interruption and provider tool calls | `realtime` |
| Walkie-talkie | Press or hold to speak, release, then hear OpenClaw answer | `stt-tts` |
| Transcription | Live captions, dictation, notes, meeting transcript, no assistant audio | `transcription` |
## Current Surfaces
All three products share session identity, join/reconnect state, turn and
capture ids, input audio metadata, output text/audio state, transcript finality,
tool-call correlation, cancellation, replay, provider capabilities, policy,
auth, and observability.
| Surface | Current shape | Keep | Refactor target |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Browser Talk | `talk.realtime.session` returns WebRTC, provider WebSocket, or Gateway relay. Tool calls go through `talk.realtime.toolCall`. | Browser audio capture/playback and WebRTC data-channel handling. | Keep browser media ownership while Gateway owns realtime tool policy. |
| Native Talk | Local STT, Gateway `chat.send`, response event or `chat.history` polling, then local or Gateway TTS. | Local STT/TTS fallback and native audio controls. | Event-driven success path with shared Talk events. |
| Voice Call realtime | Telephony WebSocket with G.711 u-law, marks, interruption, and realtime voice bridge. | Telephony adapter ownership. | Adapter over shared Talk session contract. |
| Voice Call streaming STT | Telephony stream through realtime transcription provider, then TTS playback. | STT/TTS pipeline mode. | Explicit `stt-tts` mode adapter. |
| Google Meet realtime | Meeting participant context, echo suppression, realtime provider bridge, fast context. | Meeting adapter ownership. | Adapter over shared Talk session contract and metrics. |
| VoiceClaw realtime | Separate WebSocket endpoint with Gemini Live, direct tools, audio/video frames, interruption, cancellation, session rotation/resumption, and metrics. | Migration endpoint; realtime runtime primitives except overrides. | Shared Talk contract; server-owned instructions; no request-time override. |
| TTS | `talk.speak` and provider TTS config. | Speech provider abstraction. | Cleanly separated from realtime provider config. |
| STT | Batch audio and streaming transcription providers. | Provider contracts. | Streaming STT is an input strategy for `stt-tts`; batch voice notes stay outside live Talk. |
| Walkie-talkie handoff | Prototype pattern: existing session, phone capture, push-to-talk turn, STT, agent turn, TTS playback, and transcript mirror. | One-button voice handoff UX and long-form PTT. | Gateway-backed handoff room using shared Talk events, provider catalogs, and existing session delivery. |
One-shot uploaded audio and one-shot TTS do not need live Talk session state
unless they participate in live capture, turns, interruption, replay, or
cancellation.
## Core Model
## Hard Decisions
Separate the dimensions. Mode is how the conversation runs. Transport is how bytes move. Brain is who handles tools and agent reasoning. Surface is edge-owned and should not drive core branching.
This refactor intentionally removes compatibility that would keep the design
muddy:
- remove public `talk.realtime.*` RPCs
- remove public `talk.transcription.*` RPCs
- remove public `talk.handoff.*` RPCs
- remove generic `talk.session.inputAudio`, `talk.session.control`, and
`talk.session.toolResult`
- remove old relay event channels
- remove `/voiceclaw/realtime`
- remove `src/gateway/voiceclaw-realtime/`
- remove request-time instruction overrides
- keep `talk.speak` as one-shot TTS, not a live session API
- keep legacy realtime config repair in doctor, not startup
- keep platform and product names out of core branching
## Vocabulary
Keep mode, transport, brain, and surface separate.
```ts
type TalkMode = "realtime" | "stt-tts" | "transcription";
@@ -61,366 +74,426 @@ type TalkBrain = "agent-consult" | "direct-tools" | "none";
### Modes
`realtime` is a provider-native live session. Audio goes in, audio comes out, interruptions and tool calls happen inside one low-latency session. OpenAI Realtime and Google Live fit here. WebRTC and provider WebSockets are transports for this mode, not separate modes.
`realtime` means a provider owns a live voice session. Audio goes in, audio
comes out, interruptions are possible, and provider tool calls may happen during
one provider session.
`stt-tts` is the classic pipeline: speech-to-text, agent text turn, text-to-speech. It is higher latency, but it works with local native speech, streaming STT providers, low-cost fallback providers, offline-ish native paths, and providers that do not support realtime voice.
`stt-tts` means input speech is transcribed, OpenClaw answers as text, and TTS
renders the answer. This is the native Talk and walkie-talkie path when a full
duplex provider session is not the right shape.
`transcription` is speech-to-text without an assistant speech response. It covers dictation, captions, meeting transcript capture, and voice-note style ingestion when the live session layer is useful. Gateway-owned transcription relay sessions use `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`. One-shot batch audio attachments can still use the existing media path without becoming Talk sessions.
`transcription` means speech-to-text without assistant audio output. It covers
captions, dictation, notes, meeting transcript capture, and live voice-note
ingestion.
### Transports
`webrtc` is browser or WebRTC-capable client transport using SDP and media/data channels. It is the best fit for direct OpenAI Realtime browser sessions with ephemeral credentials.
`webrtc` is client-owned SDP/media/data-channel transport. It fits browser-owned
OpenAI Realtime sessions with ephemeral credentials.
`provider-websocket` is a constrained provider WebSocket carrying JSON control messages and PCM audio. It fits Google Live-style browser or server streams where WebRTC is not the provider contract.
`provider-websocket` is client-owned provider JSON and audio framing. It fits
browser-owned Google Live-style sessions.
`gateway-relay` keeps the vendor session on the Gateway. Clients send authenticated audio frames to Gateway and receive audio/events back. This is the secure default for providers without browser-safe tokens and for server-owned tool policy.
`gateway-relay` means the Gateway owns the provider connection. The client sends
authenticated audio frames to the Gateway and receives `talk.event` plus audio
output through Gateway-managed relay state.
`managed-room` is a Gateway-owned room/session where one or more clients join a managed Talk handoff. It is the primitive for first-party walkie-talkie clients: Gateway owns rendezvous, expiry, replacement, turn lifecycle events, and provider credentials while the edge client owns capture and playback.
`managed-room` means the Gateway owns a room-like session that clients can join,
replace, and drive with explicit turn verbs. It is the primitive for
walkie-talkie and native handoff.
Telephony, meetings, and native apps are not core transports. They are surface adapters that choose one of the transports above or implement local `stt-tts` before handing text/audio events into the shared session contract.
Canonical transport names are the names above. Legacy browser-session transport names should be normalized at adapter boundaries (`webrtc-sdp` to `webrtc`, `json-pcm-websocket` to `provider-websocket`) so mixed-version clients and external providers keep working. Do not keep the legacy names as a second internal vocabulary. When a versioned creation RPC exists, freeze the old RPC shape and delete the aliases only after the announced compatibility window.
Telephony and meetings are not core transports. They are adapters that map
phone or meeting media into `gateway-relay`, `managed-room`, or `stt-tts` while
keeping call and meeting lifecycle outside core.
### Brain Strategies
`agent-consult` means the realtime model asks Gateway to consult an OpenClaw agent. Gateway applies tool policy, chooses fork or isolated context, runs the agent, and returns a concise result to the realtime provider.
`agent-consult` means provider tool calls or session turns consult an OpenClaw
agent. Gateway owns prompt construction, context selection, authorization, abort
signals, and final result delivery.
`direct-tools` means the realtime provider receives a direct OpenClaw tool declaration and calls Gateway-owned tools. This is the VoiceClaw-style brain and should require owner-level authorization.
`direct-tools` means a trusted first-party surface can call selected OpenClaw
tools directly through Gateway policy. Keep this privileged.
`none` means the session is pure transcription, external orchestration, or client-managed speech without OpenClaw tool access.
`none` means transcription-only, external orchestration, or no OpenClaw tool
access.
## Shared Talk Session Runtime
## Ownership Boundaries
The next cleanup layer is a shared Talk session controller. It should be the only code that owns event sequencing, active turn state, capture state, output audio state, recent event retention, and stale-turn rejection. Surface adapters may decide when to call it, but they should not each reimplement turn bookkeeping.
Core owns generic Talk semantics:
The controller contract should cover:
- mode, transport, brain, codec, and audio descriptors
- session records and session ownership
- turn ids and capture ids
- event envelope, sequencing, replay, and stale-output suppression
- active capture state
- active assistant output state
- replacement and reconnect state
- cancellation propagation
- tool policy and tool-call correlation
- usage, latency, and health events
- `emit(...)` for session, health, usage, latency, and tool events that do not mutate turn state
- `startTurn(...)` and `ensureTurn(...)` for capture, STT, realtime provider, telephony, and meeting adapters
- `endTurn(...)` and `cancelTurn(...)` with stale `turnId` rejection before clearing the active turn
- `startOutputAudio(...)`, `emitOutputAudioDelta(...)`, and `finishOutputAudio(...)` for playback, marks, relay clear, and barge-in
- recent event retention for reconnect, diagnostics, hello/event discovery tests, and native UI replay
- compatibility normalization for legacy transport result names at adapter boundaries
Provider plugins own vendor behavior:
The public API migration is adapter-first. Keep existing RPCs such as `talk.realtime.session`, `talk.realtime.relayAudio`, `talk.transcription.session`, `talk.transcription.relayAudio`, and `talk.handoff.*` while moving their internals onto the shared controller. Gateway-managed sessions expose the common model directly:
- OpenAI Realtime SDP and data-channel details
- Google Live WebSocket framing
- streaming STT provider details
- TTS provider details
- provider auth, model, voice, codec, and resume quirks
- provider capability declarations
Surface adapters own IO and product quirks:
- browser capture and playback
- native audio sessions, local speech engines, and foreground Talk UX
- node command dispatch
- telephony media streams, marks, clear events, u-law, and call lifecycle
- meeting join/leave, participants, echo suppression, and authorization
Core may store optional surface metadata for diagnostics. Core must not branch
on browser, iOS, Android, macOS, Google Meet, Voice Call, or any retired product
name.
## Final Gateway API
The public Gateway surface is deliberately small:
```ts
// Discovery and configuration.
talk.catalog;
talk.config;
// One-shot speech output.
talk.speak;
// Client-owned provider sessions.
talk.client.create;
talk.client.toolCall;
// Gateway-owned live sessions.
talk.session.create;
talk.session.inputAudio;
talk.session.control;
talk.session.toolResult;
talk.session.join;
talk.session.appendAudio;
talk.session.startTurn;
talk.session.endTurn;
talk.session.cancelTurn;
talk.session.cancelOutput;
talk.session.submitToolResult;
talk.session.close;
// Events and foreground node mode.
talk.event;
talk.mode;
```
The old RPCs stay as compatibility adapters while new clients use `talk.session.*` for gateway-relay realtime, gateway-relay transcription, and managed-room native STT/TTS sessions. Browser-owned WebRTC/provider-websocket sessions remain on `talk.realtime.session` because the browser owns provider negotiation and media transport there. The internal controller must be provider-agnostic and platform-agnostic: provider plugins own vendor sessions, voice-call owns telephony, Google Meet owns meeting details, and browser/native clients own capture and playback UX.
Use `talk.client.*` when the client owns provider media transport. Use
`talk.session.*` when the Gateway owns live session state.
## VoiceClaw Runtime Scope
`talk.mode` is the existing foreground node mode broadcast. It can stay, but it
is not part of the Talk session control API.
VoiceClaw is an adapter target, not a feature template for the unified runtime. We do not need every VoiceClaw product or API feature. We do want the useful realtime runtime primitives: live provider sessions, audio and optional video frames, interruption, cancellation, session lifecycle, rotation/resumption, metrics, latency reporting, and direct tools when explicitly authorized. Those should arrive as shared Talk primitives instead of VoiceClaw-only knobs.
### Supported Creation Matrix
The deliberate feature removal is request-time instruction override. Unified Talk instructions must be server-owned. If a capability depends on provider support, owner-scoped auth, or the selected brain strategy, the adapter should gate it through shared Talk capability metadata rather than deleting it. Do not preserve `instructionsOverride`; it is intentionally outside the unified Talk contract. Everything else in the existing realtime runtime is presumed in scope unless a later implementation review proves that it is dead, unsafe, or impossible to express as a shared Talk primitive.
| Method | Mode | Transport | Brain | Owner |
| --------------------- | --------------- | -------------------- | --------------- | ------- |
| `talk.client.create` | `realtime` | `webrtc` | `agent-consult` | client |
| `talk.client.create` | `realtime` | `provider-websocket` | `agent-consult` | client |
| `talk.session.create` | `realtime` | `gateway-relay` | `agent-consult` | Gateway |
| `talk.session.create` | `transcription` | `gateway-relay` | `none` | Gateway |
| `talk.session.create` | `stt-tts` | `managed-room` | `agent-consult` | Gateway |
| `talk.session.create` | `stt-tts` | `managed-room` | `direct-tools` | Gateway |
Keep:
Reject combinations that blur ownership. `talk.client.create` must reject
Gateway-owned transports. `talk.session.create` must reject client-owned
transports.
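The supported creation matrix and the ownership rule could be checked with a lookup like the one below. It is a sketch, not the Gateway validator: the rows are copied from the matrix, `TalkTransport` is reconstructed from the four names in the Transports section, and `isSupportedCreation` and `transportOwner` are hypothetical helpers.

```ts
type TalkMode = "realtime" | "stt-tts" | "transcription";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";
type CreateMethod = "talk.client.create" | "talk.session.create";

type CreationRow = {
  method: CreateMethod;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
};

// Rows copied from the supported creation matrix.
const SUPPORTED: readonly CreationRow[] = [
  { method: "talk.client.create", mode: "realtime", transport: "webrtc", brain: "agent-consult" },
  { method: "talk.client.create", mode: "realtime", transport: "provider-websocket", brain: "agent-consult" },
  { method: "talk.session.create", mode: "realtime", transport: "gateway-relay", brain: "agent-consult" },
  { method: "talk.session.create", mode: "transcription", transport: "gateway-relay", brain: "none" },
  { method: "talk.session.create", mode: "stt-tts", transport: "managed-room", brain: "agent-consult" },
  { method: "talk.session.create", mode: "stt-tts", transport: "managed-room", brain: "direct-tools" },
];

// Client-owned transports belong to talk.client.*, the rest to talk.session.*.
function transportOwner(transport: TalkTransport): "client" | "gateway" {
  return transport === "webrtc" || transport === "provider-websocket" ? "client" : "gateway";
}

function isSupportedCreation(req: CreationRow): boolean {
  return SUPPORTED.some(
    (row) =>
      row.method === req.method &&
      row.mode === req.mode &&
      row.transport === req.transport &&
      row.brain === req.brain,
  );
}
```

Treating the matrix as an allow-list means a new combination fails closed until someone adds a row deliberately.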
- `/voiceclaw/realtime` endpoint shape during migration
- existing auth expectations where they remain owner-scoped
- Gemini Live provider bridge
- audio input and output frames
- video frames when the selected provider supports them
- interruption and response cancellation
- session rotation and resumption where the provider supports them
- metrics and latency reporting
- direct tool calls behind the explicit `direct-tools` brain
## Removed API
Do not keep:
Remove these names from handlers, method lists, scopes, protocol schemas,
generated clients, broadcast guards, tests, and docs except explicit migration
tables:
- request-time `instructionsOverride`
- VoiceClaw-only request fields that duplicate server-owned instructions, tool policy, provider selection, or session policy
- VoiceClaw-specific configuration names in new shared Talk APIs
| Removed | Replacement |
| ------------------------------- | -------------------------------------------------------- |
| `talk.realtime.session` | `talk.client.create` |
| `talk.realtime.toolCall` | `talk.client.toolCall` |
| `talk.realtime.relayAudio` | `talk.session.appendAudio` |
| `talk.realtime.relayCancel` | `talk.session.cancelOutput` or `talk.session.cancelTurn` |
| `talk.realtime.relayMark` | internal relay output state |
| `talk.realtime.relayToolResult` | `talk.session.submitToolResult` |
| `talk.realtime.relayClose` | `talk.session.close` |
| `talk.realtime.relay` | `talk.event` |
| `talk.transcription.session` | `talk.session.create({ mode: "transcription" })` |
| `talk.transcription.audio` | `talk.session.appendAudio` |
| `talk.transcription.cancel` | `talk.session.cancelTurn` |
| `talk.transcription.close` | `talk.session.close` |
| `talk.transcription.relay` | `talk.event` |
| `talk.handoff.create` | `talk.session.create({ transport: "managed-room" })` |
| `talk.handoff.join` | `talk.session.join` |
| `talk.handoff.revoke` | `talk.session.close` |
| `talk.session.inputAudio` | `talk.session.appendAudio` |
| `talk.session.control` | explicit turn/output verbs |
| `talk.session.toolResult` | `talk.session.submitToolResult` |
Realtime instruction policy must come from server-side config, agent identity, selected brain strategy, or another owner-controlled policy surface. If a client sends `instructionsOverride`, the compatibility adapter should reject the request rather than silently applying, partially honoring, or translating it. Everything in the Keep list remains in scope and should migrate onto shared Talk primitives.
Delete this endpoint:
Compatibility here means "old entry point can route to the new runtime," not "old clients can keep every old knob forever." `/voiceclaw/realtime` should be allowed to return a clear unsupported-field error for retired request fields, especially `instructionsOverride`, while preserving the runtime behavior that still belongs in Talk.
## Event Vocabulary
All Talk sessions should emit a common event stream:
- `session.started`, `session.ready`, `session.replaced`, `session.closed`, `session.error`
- `turn.started`, `turn.ended`, `turn.cancelled`
- `capture.started`, `capture.stopped`, `capture.cancelled`, `capture.once`
- `input.audio.delta`, `input.audio.committed`
- `transcript.delta`, `transcript.done`
- `output.text.delta`, `output.text.done`
- `output.audio.started`, `output.audio.delta`, `output.audio.done`
- `tool.call`, `tool.progress`, `tool.result`, `tool.error`
- `usage.metrics`
- `latency.metrics`
- `health.changed`
Adapters may add vendor or surface metadata, but the common event names should be enough for UI, native clients, logs, tests, and metrics.
Every common event must use the same envelope:
```ts
type TalkEvent<TPayload = unknown> = {
id: string;
type: TalkEventType;
sessionId: string;
turnId?: string;
captureId?: string;
seq: number;
timestamp: string;
mode: TalkMode;
transport: TalkTransport;
brain: TalkBrain;
provider?: string;
final?: boolean;
callId?: string;
itemId?: string;
parentId?: string;
payload: TPayload;
};
```text
/voiceclaw/realtime
```
`sessionId` is required for every event. `turnId` is required for every event tied to one user/assistant turn. `captureId` is required while push-to-talk capture is active. `seq` is monotonically increasing within a session. `callId`, `itemId`, and `parentId` correlate provider tool calls, realtime response items, TTS jobs, and relay frames. Replay, stale-output suppression, metrics, and tests should rely on these envelope fields rather than vendor-specific payload shapes.
Delete this folder:
Walkie-talkie clients need one extra timing rule: text-ready is not audio-ready. A client may show transcript text after `output.text.done`, but it should not transition from "thinking" to "speaking" until `output.audio.delta` or an explicit `output.audio.started` event arrives. That keeps hold music, waveform, replay, and barge-in UX honest when the agent turn finishes before TTS is ready.
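The timing rule can be sketched as a small client-side reducer. `UiPhase` and `nextPhase` are hypothetical names; only the event names come from the event vocabulary, and what happens after speaking ends is left out.

```ts
type UiPhase = "thinking" | "speaking";

// Hypothetical reducer: only audio events move the UI from thinking to speaking.
function nextPhase(phase: UiPhase, eventType: string): UiPhase {
  const audioReady =
    eventType === "output.audio.started" || eventType === "output.audio.delta";
  if (phase === "thinking" && audioReady) {
    return "speaking";
  }
  // `output.text.done` may update the transcript view, but it does not change phase.
  return phase;
}
```

For example, `output.text.done` followed by `output.audio.delta` leaves the client in "thinking" after the first event and moves it to "speaking" only on the second.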
## Walkie-Talkie App Primitives
The app should be buildable from the same primitives, not a parallel voice stack.
### Session Handoff
Voice handoff starts from an existing OpenClaw session. The handoff primitive should carry:
- canonical session id
- optional session key for human-readable thread lookup
- delivery route, such as channel and target
- caller identity and scope
- selected `TalkMode`, `TalkTransport`, and `TalkBrain`
- optional session-scoped provider, model, and voice ids
- expiration, revocation, and replacement policy
The existing Gateway session APIs and `chat.send`/agent delivery paths already cover the canonical conversation side. First-class Talk handoff RPCs provide the rendezvous primitive: `talk.handoff.create` returns an ephemeral room token or join URL, `talk.handoff.join` validates the later voice join without exposing stored token hashes, `talk.handoff.turnStart`/`turnEnd`/`turnCancel` drive the room turn lifecycle, and `talk.handoff.revoke` invalidates stale or replaced handoffs.
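A minimal in-memory sketch of the create, join, and revoke semantics follows. It is not the Gateway implementation: `HandoffStore`, its method signatures, and the rejection reasons are all hypothetical. The point it illustrates is that only a token hash is stored and that a join result never echoes the hash.

```ts
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

type HandoffRecord = {
  sessionId: string;
  tokenHash: Buffer; // only the hash is kept; the raw token is returned once
  expiresAt: number;
  revoked: boolean;
};

type JoinResult = { ok: true; sessionId: string } | { ok: false; reason: string };

function hashToken(token: string): Buffer {
  return createHash("sha256").update(token).digest();
}

// Hypothetical store standing in for Gateway-owned handoff state.
class HandoffStore {
  private records = new Map<string, HandoffRecord>();

  create(sessionId: string, ttlMs: number, now: number = Date.now()): { handoffId: string; token: string } {
    const handoffId = randomBytes(8).toString("hex");
    const token = randomBytes(32).toString("base64url");
    this.records.set(handoffId, {
      sessionId,
      tokenHash: hashToken(token),
      expiresAt: now + ttlMs,
      revoked: false,
    });
    return { handoffId, token };
  }

  join(handoffId: string, token: string, now: number = Date.now()): JoinResult {
    const record = this.records.get(handoffId);
    if (!record) return { ok: false, reason: "unknown" };
    if (record.revoked) return { ok: false, reason: "revoked" };
    if (now >= record.expiresAt) return { ok: false, reason: "expired" };
    // Both buffers are SHA-256 digests, so their lengths always match.
    if (!timingSafeEqual(record.tokenHash, hashToken(token))) {
      return { ok: false, reason: "invalid-token" };
    }
    return { ok: true, sessionId: record.sessionId };
  }

  revoke(handoffId: string): void {
    const record = this.records.get(handoffId);
    if (record) record.revoked = true;
  }
}
```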
### Room and Rendezvous
The room model must allow one device or browser client to host multiple active voice handoffs for different sessions without cross-talk. A deterministic room key is fine for local or development flows, but the product path should prefer Gateway-owned room creation with caller auth, expiry, and revoke semantics.
The minimum room events are:
- `session.ready`
- `session.replaced`
- `turn.started`
- `turn.ended`
- `turn.cancelled`
- `session.closed`
- `session.error`
`managed-room` is public only through handoff clients. Browser `talk.realtime.session` should keep rejecting `managed-room` until the browser owns a real room client instead of treating it as a browser-session result shape.
### Push-To-Talk
Push-to-talk is a turn-control primitive, not a platform primitive. It should map to browser capture, native local capture, or node commands:
- `capture.started`
- `capture.stopped`
- `capture.cancelled`
- `capture.once`
Native node support has `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` command handlers. The Gateway policy treats them as first-class defaults only for trusted Talk-capable nodes: a node must advertise the `talk` capability or declare `talk.*` command support, and the command must still be present in the paired command snapshot.
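The policy above might be expressed as a single predicate. The `NodeInfo` shape and `isPttCommandAllowed` are hypothetical, and the sketch assumes that "declares `talk.*` command support" means any declared command in the `talk.` namespace.

```ts
// Hypothetical view of a paired node; field names are assumptions.
type NodeInfo = {
  capabilities: string[];     // advertised capabilities, e.g. "talk"
  declaredCommands: string[]; // commands the node declares support for
  pairedCommands: string[];   // commands present in the paired command snapshot
};

const PTT_COMMANDS = ["talk.ptt.start", "talk.ptt.stop", "talk.ptt.cancel", "talk.ptt.once"];

function isPttCommandAllowed(node: NodeInfo, command: string): boolean {
  if (!PTT_COMMANDS.includes(command)) return false;
  const talkCapable =
    node.capabilities.includes("talk") ||
    node.declaredCommands.some((c) => c === "talk.*" || c.startsWith("talk."));
  // Talk capability alone is not enough: the command must also be paired.
  return talkCapable && node.pairedCommands.includes(command);
}
```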
### Provider Catalogs and Settings
Walkie-talkie settings should be per session or per device. The client should request STT, TTS, and realtime catalogs through Gateway, store only provider ids, model ids, voice ids, and locales, and never receive provider API keys or mutate global Talk provider defaults as a side effect of opening the app.
The catalog contract should describe which combinations are valid:
- local STT plus local TTS
- streaming STT plus provider TTS
- realtime provider with provider-native output audio
- Gateway relay when browser-safe credentials are not available
- managed room when the Gateway owns the session
### Canonical Transcript
The OpenClaw session is the source of truth. A walkie-talkie app may keep a local transcript cache for replay, export, reconnect, or offline UX, but the agent turn and durable transcript should go through the existing session delivery route. Transcript mirroring should be best effort and must not block the voice turn.
### Connectivity and Backgrounding
Native apps can use node pairing, `node.invoke`, and platform wake mechanisms when available. Browser or standalone web clients need either Gateway relay, a managed room, or hosted WebRTC signaling with ICE/TURN. Background continuous audio remains platform-limited; the product should promise foreground push-to-talk first and treat background capture as best effort.
### Cancellation and Replacement
Every turn should carry a turn token or capture id. Stale STT finals, stale agent replies, and stale TTS output must be ignored after `turn.cancelled` or `session.replaced`. This is required for "tap again to interrupt", reconnect replacement, and multi-session isolation.
Cancellation must also abort underlying work, not only hide stale output. A cancelled or replaced turn must:
- cancel provider responses or realtime sessions when the provider supports it
- abort agent consult and tool runtime work through an `AbortSignal`
- prevent newly queued side-effecting tools from starting after cancellation
- let already-started side-effecting tools report cancellation status instead of inventing success
- drain pending TTS jobs and stop audio playback/relay writes
- close or reset relay and managed-room streams tied to the stale turn
- emit one terminal cancellation event with the final abort reason
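One way to tie these requirements together is a per-turn controller that owns an `AbortController`. The sketch below is illustrative only: `TurnController` and the event shape are hypothetical, and provider, TTS, and relay teardown are assumed to listen on the exposed `AbortSignal`.

```ts
type TurnCancelledEvent = { type: "turn.cancelled"; turnId: string; reason: string };

// Hypothetical per-turn controller; real runtime wiring will differ.
class TurnController {
  private readonly controller = new AbortController();
  private readonly turnId: string;
  private readonly emit: (event: TurnCancelledEvent) => void;
  private terminalEmitted = false;

  constructor(turnId: string, emit: (event: TurnCancelledEvent) => void) {
    this.turnId = turnId;
    this.emit = emit;
  }

  // Passed to agent consult, tool runtime, TTS jobs, and relay writers.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Queued side-effecting tools check this before they start.
  canStartTool(): boolean {
    return !this.controller.signal.aborted;
  }

  cancel(reason: string): void {
    if (this.terminalEmitted) return; // exactly one terminal event per turn
    this.terminalEmitted = true;
    this.controller.abort();
    this.emit({ type: "turn.cancelled", turnId: this.turnId, reason });
  }
}
```

Repeated `cancel` calls are no-ops, so "tap again to interrupt" and reconnect replacement can both call it without emitting duplicate terminal events.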
## Config Direction
The current public Talk config is speech-provider oriented. Keep it as the speech config and add realtime config beside it. Do not introduce a second `talk.speech` namespace during this refactor.
```ts
type TalkConfig = {
provider?: string;
providers?: Record<string, unknown>;
realtime?: {
provider?: string;
model?: string;
voice?: string;
mode?: TalkMode;
transport?: TalkTransport;
brain?: TalkBrain;
};
input?: {
interruptOnSpeech?: boolean;
silenceTimeoutMs?: number;
};
};
```

```text
src/gateway/voiceclaw-realtime/
```
Rule: `talk.provider` and `talk.providers.*` continue to mean speech, STT, and TTS provider configuration. Realtime provider selection uses `talk.realtime.provider`, then registered realtime capabilities. Voice Call fallback inference should be deleted once the realtime config exists in schema, docs, forms, and doctor repair.
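A minimal sketch of that rule, assuming a hypothetical `resolveRealtimeProvider` helper and a list of registered realtime provider ids. Returning `undefined` for an unregistered explicit provider, and falling back to the first registered provider, are assumptions rather than decided behavior.

```ts
// Only the fields this sketch needs from TalkConfig.
type TalkConfigLike = {
  provider?: string; // speech provider; intentionally ignored below
  realtime?: { provider?: string };
};

// Hypothetical resolver: never reads talk.provider for realtime selection.
function resolveRealtimeProvider(
  config: TalkConfigLike,
  registeredRealtimeProviders: string[],
): string | undefined {
  const explicit = config.realtime?.provider;
  if (explicit !== undefined) {
    // Assumption: an unregistered explicit provider is an error, not a silent switch.
    return registeredRealtimeProviders.includes(explicit) ? explicit : undefined;
  }
  // Assumption: fall back to the first provider with registered realtime capabilities.
  return registeredRealtimeProviders[0];
}
```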
Do not leave a compatibility namespace around retired code.
## Provider Contracts
## Target Source Layout
Provider plugins should declare capabilities, not force core to infer behavior from ids:
Shared runtime:
```ts
type RealtimeVoiceProviderCapabilities = {
transports: TalkTransport[];
inputAudioFormats: AudioFormat[];
outputAudioFormats: AudioFormat[];
supportsBrowserSession?: boolean;
supportsBargeIn?: boolean;
supportsToolCalls?: boolean;
supportsVideoFrames?: boolean;
supportsSessionResumption?: boolean;
};
```

```text
src/talk/
audio-codec.ts
agent-consult-runtime.ts
agent-consult-tool.ts
agent-talkback-runtime.ts
fast-context-runtime.ts
provider-registry.ts
provider-resolver.ts
provider-types.ts
session-log-runtime.ts
session-runtime.ts
talk-events.ts
talk-session-controller.ts
```
OpenAI owns OpenAI Realtime details. Google owns Gemini Live details, continuation, compression, and session resumption. STT plugins own streaming transcription. TTS plugins own synthesis and telephony-compatible output formats.
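To illustrate capability-driven decisions, the sketch below picks a transport and checks barge-in support from declared capabilities alone. `TalkTransport` and `AudioFormat` are reduced to `string` placeholders here, and `pickTransport` and `canBargeIn` are hypothetical helpers.

```ts
type TalkTransport = string; // placeholder; the canonical union lives in the shared contract
type AudioFormat = string;   // placeholder

type RealtimeVoiceProviderCapabilities = {
  transports: TalkTransport[];
  inputAudioFormats: AudioFormat[];
  outputAudioFormats: AudioFormat[];
  supportsBrowserSession?: boolean;
  supportsBargeIn?: boolean;
  supportsToolCalls?: boolean;
  supportsVideoFrames?: boolean;
  supportsSessionResumption?: boolean;
};

// Returns the first caller-preferred transport the provider declares.
function pickTransport(
  caps: RealtimeVoiceProviderCapabilities,
  preferred: TalkTransport[],
): TalkTransport | undefined {
  return preferred.find((transport) => caps.transports.includes(transport));
}

// Optional flags default to "not supported" when a plugin omits them.
function canBargeIn(caps: RealtimeVoiceProviderCapabilities): boolean {
  return caps.supportsBargeIn === true;
}
```

Neither helper looks at a provider id, which is the property the contract is meant to guarantee.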
Gateway adapters:
## Gateway Policy Boundary
```text
src/gateway/server-methods/
talk.ts # catalog, config, speak, mode, composition
talk-client.ts # client-owned provider sessions
talk-session.ts # Gateway-owned live sessions
```
Browser realtime should not run agent consult by calling `chat.send` directly. The browser may own the media connection when a provider requires it, but Gateway should own the consult/tool policy.
Gateway relay helpers can exist while the code moves, but the long-term shape
is that relay, transcription, and handoff state use `src/talk` primitives
instead of each reimplementing turns and events.
Target flow for browser-owned provider sessions:
Public SDK:
1. Provider emits a tool call to the browser.
2. Browser forwards the structured tool call to Gateway with the session id.
3. Gateway validates the session, caller, tool policy, brain strategy, and owner permissions.
4. Gateway runs `agent-consult`, `direct-tools`, or rejects the call.
5. Browser submits the provider-specific tool result back to the provider.
```text
src/plugin-sdk/realtime-voice.ts
```
Target flow for Gateway-owned sessions:
Keep this SDK subpath as the stable plugin import facade. It may re-export
Talk runtime contracts, but plugin authors should not import core file layout.
1. Provider emits a tool call to Gateway.
2. Gateway runs policy and tool handling directly.
3. Client only receives status, transcript, audio, and visible tool progress events.
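Steps 3 and 4 of the browser-owned flow could look roughly like the decision function below. All type and function names are hypothetical, the two brain values are the ones named in this plan (others may exist), and the owner-only rule for `direct-tools` follows the Phase 2 gate.

```ts
type BrainStrategy = "agent-consult" | "direct-tools";

// Hypothetical Gateway-side session record.
type LiveSession = {
  ownerId: string;
  brain: BrainStrategy;
  allowedTools: string[];
};

type ToolCall = { sessionId: string; callerId: string; toolName: string };

type ToolDecision =
  | { action: "run"; brain: BrainStrategy }
  | { action: "reject"; reason: string };

function decideToolCall(sessions: Map<string, LiveSession>, call: ToolCall): ToolDecision {
  const session = sessions.get(call.sessionId);
  if (!session) return { action: "reject", reason: "unknown-session" };
  if (!session.allowedTools.includes(call.toolName)) {
    return { action: "reject", reason: "tool-not-allowed" };
  }
  // direct-tools runs owner-authorized tools, so only the owner may trigger it.
  if (session.brain === "direct-tools" && call.callerId !== session.ownerId) {
    return { action: "reject", reason: "owner-only" };
  }
  return { action: "run", brain: session.brain };
}
```

In the Gateway-owned flow the same decision would run without the browser hop, since the provider emits the tool call to Gateway directly.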
## Event Contract
## Surface Adapters
All live paths emit `talk.event` with the envelope defined in
[Talk API and runtime contract](/refactor/talk-api-contract). The required
shape is: `id`, `type`, `sessionId`, `seq`, `timestamp`, `mode`, `transport`,
`brain`, and `payload`, with `turnId`, `captureId`, `callId`, `itemId`, and
`parentId` when the event is tied to turn, capture, provider item, tool call, or
TTS output.
Adapters convert surface-specific IO into the shared model.
Core event families are `session.*`, `turn.*`, `capture.*`, `input.audio.*`,
`transcript.*`, `output.text.*`, `output.audio.*`, `tool.*`, `usage.metrics`,
`latency.metrics`, and `health.changed`. Payloads must not duplicate large raw
audio frames when the transport already carries them. Text-ready is not
audio-ready; clients enter playback state only on audio events.
Browser adapter handles microphone capture, playback, WebRTC SDP, data channels, provider WebSocket framing, relay RPCs, and provider-specific tool result submission.
## Cancellation Contract
Native adapter handles local STT/TTS, push-to-talk, continuous listening, local interruption, audio session lifecycles, and optional Gateway realtime or managed-room clients. Core sees capabilities such as PCM input support, local TTS fallback, and barge-in support, not platform names.
Cancellation must abort underlying work, not only ignore stale output.
Telephony adapter handles Twilio or Plivo media streams, G.711 u-law, stream ids, marks, clear events, backpressure, call lifecycle, and phone-specific interruption behavior.
When a turn or session is cancelled:
Meeting adapter handles room lifecycle, participant context, echo suppression, meeting transcript context, and meeting-specific authorization.
- provider realtime response is cancelled when supported
- provider session is closed or reset when cancellation cannot be scoped
- streaming STT receives abort
- agent consult receives abort
- queued tools do not start after abort
- already-started side-effecting tools receive abort and report cancellation
- pending TTS jobs are drained
- playback sources are stopped
- relay streams are cleared
- managed-room capture and output state reset
- stale finals and stale audio deltas are ignored
- one terminal cancellation event is emitted
VoiceClaw adapter handles `/voiceclaw/realtime`, auth expectations that remain owner-scoped, Gemini Live compatibility, audio/video frames, interruption, response cancellation, session rotation/resumption, metrics, latency reporting, and the `direct-tools` brain while using common Talk events internally. It must reject request-time `instructionsOverride` and must not introduce VoiceClaw-only policy fields into the shared Talk API.
Barge-in requires real speech: provider speech-started, local VAD, or an
adapter-owned speech detector. Silence, echo, or microphone buffers alone must
not cancel assistant output.
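The barge-in rule can be sketched as a predicate over the signal source. The `InterruptSignal` union and `shouldBargeIn` are hypothetical names chosen for illustration.

```ts
// Hypothetical signal shapes an adapter might feed into the gate.
type InterruptSignal =
  | { source: "provider-speech-started" }
  | { source: "local-vad"; speech: boolean }
  | { source: "adapter-detector"; speech: boolean }
  | { source: "mic-buffer" }
  | { source: "echo" }
  | { source: "silence" };

function shouldBargeIn(signal: InterruptSignal, assistantSpeaking: boolean): boolean {
  if (!assistantSpeaking) return false; // nothing to interrupt
  switch (signal.source) {
    case "provider-speech-started":
      return true;
    case "local-vad":
    case "adapter-detector":
      return signal.speech; // the detector must actually report speech
    default:
      return false; // raw buffers, echo, and silence never cancel output
  }
}
```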
## Migration Phases
## Config Contract
### Phase 1: Contracts
Config stays under `talk`; do not add `talk.speech`. `talk.provider` and
`talk.providers.*` remain speech/STT/TTS provider config. Realtime selectors
live under `talk.realtime.provider`, `talk.realtime.providers.*`, `model`,
`voice`, `mode`, `transport`, and `brain`.
- Add shared Talk mode, transport, brain, capabilities, command, and event types.
- Add a config resolver that preserves legacy `talk.provider`.
- Keep existing `RealtimeVoiceProvider` APIs while introducing capability metadata.
- Add handoff, room, capture, provider catalog, cancellation, and replacement event contracts.
- Make `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` explicit safe commands for Talk-capable nodes.
- Add protocol tests for no request-time instruction override.
`talk.config` returns effective config without secrets unless privileged.
`talk.catalog` returns provider capabilities, not inferred provider-id guesses.
Doctor migrates old realtime placement into `talk.realtime`; runtime startup
does not reinterpret Voice Call, STT, or TTS config as realtime config.
### Phase 2: Gateway Tool Policy
## Surface Mapping
- Add Gateway RPC for realtime tool calls from browser-owned provider sessions.
- Add Gateway RPCs for `talk.handoff.create`, `talk.handoff.join`, `talk.handoff.revoke`, and explicit handoff turn start/end/cancel, with session identity, expiry, revocation, join authorization, and event replay.
- Add session-scoped STT, TTS, and realtime provider catalog RPCs.
- Keep browser `openclaw_agent_consult` handling on `talk.realtime.toolCall`, not browser-side `chat.send`.
- Reuse existing agent consult runtime and tool allow policy.
- Add owner-only gate for `direct-tools`.
| Surface | Talk mapping |
| ------------------------------- | ----------------------------------------------------------------------------------------------------- |
| Browser WebRTC | `talk.client.create`, client-owned provider media, `talk.client.toolCall` for provider tool calls |
| Browser provider WebSocket | `talk.client.create`, browser-owned provider framing, Gateway-owned credentials and policy |
| Browser Gateway relay | `talk.session.create`, `appendAudio`, `submitToolResult`, `cancelOutput`, `close`, and `talk.event` |
| Native push-to-talk | `stt-tts` plus `managed-room`; press/startTurn, release/endTurn, cancel/cancelTurn |
| Walkie-talkie | managed-room join/replacement plus shared turn/output events |
| Voice Call | telephony adapter over Talk events; call ids, stream ids, u-law, marks, clear events stay plugin side |
| Google Meet and future meetings | meeting adapter over Talk events; participant state, permissions, mute, and echo suppression stay out |
### Phase 3: Browser Runtime
See [Talk surface mapping](/refactor/talk-surfaces) for the adapter-level
rules.
- Normalize browser WebRTC, provider WebSocket, and relay adapters behind common Talk events.
- Keep `managed-room` scoped to handoff clients until the browser has a real room client.
- Add a walkie-talkie browser client path over Gateway relay or managed room.
- Keep provider credentials on Gateway; browser receives only ephemeral room/session credentials.
- Add browser tests proving realtime consult does not call `chat.send`.
## Detailed Refactor Phases
### Phase 4: Native Runtime
### Phase 1: Protocol Is The Source Of Truth
- Make native Talk consume response events in the success path.
- Remove normal-path `chat.history` polling and keep history polling only as a degraded fallback if needed.
- Preserve local STT and local TTS fallback.
- Route native push-to-talk through the shared capture and turn events.
- Verify node command policy allows `talk.ptt.*` for trusted Talk-capable native nodes.
- Align native emitted state with common Talk events.
- define final `talk.client.*`, `talk.session.*`, `talk.event`, `talk.catalog`, `talk.config`, `talk.speak`, and `talk.mode`
- delete removed RPCs from method lists and generated metadata
- delete removed event channels from hello feature advertising
- classify every final method in `METHOD_SCOPE_GROUPS`
- regenerate TypeScript and Swift protocol clients
- add protocol tests proving removed names are absent
### Phase 5: VoiceClaw Runtime
Exit criteria: generated clients expose only the final public Talk API.
- Rebase `/voiceclaw/realtime` onto the shared Talk session runtime.
- Keep the endpoint as a thin migration adapter and preserve auth expectations only where they map cleanly to the shared Talk contract.
- Remove request-time `instructionsOverride`; owner policy must come from server-side config, agent identity, or the selected brain strategy.
- Map Gemini Live metrics, latency reporting, rotation, resumption, interruption, cancellation, audio, video, and tool events into the common event stream.
- Keep `direct-tools` separate from `agent-consult`.
- Do not add VoiceClaw-specific config names, override fields, or client policy knobs to new Talk contracts.
### Phase 2: Shared Runtime Becomes `src/talk`
### Phase 6: Voice Call and Meetings
- move provider-agnostic realtime voice modules into `src/talk`
- keep the plugin SDK facade at `openclaw/plugin-sdk/realtime-voice`
- rename logs and tests from realtime-voice wording to Talk wording where that improves clarity
- centralize event sequencing, active turn state, capture state, output state, stale-turn rejection, and replay history
- keep provider adapters out of this folder
- Convert Voice Call realtime into a telephony adapter over shared Talk sessions.
- Convert Voice Call streaming STT into explicit `stt-tts`.
- Convert Google Meet realtime into a meeting adapter over shared Talk sessions.
- Keep telephony marks, u-law, backpressure, participant context, and echo suppression in their owning adapters.
Exit criteria: core and bundled surfaces import shared semantics from `src/talk`
or the SDK facade, not from surface-local helpers.
### Phase 7: Docs and Cleanup
### Phase 3: Gateway Method Split
- Update [Talk mode](/nodes/talk), [Control UI](/web/control-ui), [Gateway protocol](/gateway/protocol), [Media overview](/tools/media-overview), [Text-to-speech](/tools/tts), and plugin SDK docs.
- Retire duplicate event names after compatibility windows.
- Remove browser-side consult-through-chat code after all supported providers use Gateway tool policy.
- make `talk.ts` a composition point for catalog, config, speak, mode, client, and session handlers
- put client-owned provider session methods in `talk-client.ts`
- put Gateway-owned session methods in `talk-session.ts`
- make relay, transcription, and managed-room handlers thin adapters over shared runtime primitives
- route session replacement notifications to the displaced connection
- reject stale turn completion before mutating active room state
## Test Matrix
Exit criteria: public RPC handlers read like API adapters, not separate Talk
implementations.
- WebRTC plus `agent-consult`.
- Provider WebSocket plus `agent-consult`.
- Gateway relay plus `agent-consult`.
- Public clients updated to canonical transport names, or a versioned RPC proves old result names stay isolated until deletion.
- VoiceClaw compatibility plus `direct-tools`, without request-time `instructionsOverride`.
- Telephony WebSocket with marks, clear, interruption, and u-law.
- Meeting adapter with participant context and echo suppression.
- Native `stt-tts` with no `chat.history` polling in the normal success path.
- Transcription-only Gateway relay session with partial/final transcript Talk events and no assistant brain.
- TTS-only `talk.speak`.
- Walkie-talkie handoff from an existing session into a voice room.
- Two simultaneous walkie-talkie handoffs for the same host but different sessions with no transcript, audio, or turn-token cross-talk.
- Push-to-talk start, stop, cancel, and once through `node.invoke` on a trusted Talk-capable node.
- Text-ready before TTS-ready, proving the client does not enter playback until audio starts.
- Session-scoped provider catalog selection that does not mutate global Talk config.
- Cancellation aborts provider work, agent consult, queued tools, TTS, and relay/room streams.
- Security checks: no request-time instruction override, no standard provider API keys in the browser, owner-only `direct-tools`, and session-scoped tool calls.
### Phase 4: Browser UI Uses The Final API
## End State
- update WebRTC and provider WebSocket startup to `talk.client.create`
- update browser provider tool calls to `talk.client.toolCall`
- update Gateway relay startup to `talk.session.create`
- update relay audio to `talk.session.appendAudio`
- update relay tool result submission to `talk.session.submitToolResult`
- update relay close to `talk.session.close`
- listen only to `talk.event`
- handle aborted consult runs immediately instead of timing out
- gate relay barge-in on speech or VAD
OpenClaw has one Talk architecture with three execution modes, four core transports, explicit brain strategies, provider-owned vendor logic, Gateway-owned tool policy, and adapters for browser, native, telephony, meetings, and VoiceClaw compatibility. Users get a better Talk mode. Maintainers get one place to reason about sessions, events, policy, metrics, and tests.
Exit criteria: UI tests contain no calls to removed Talk RPC names.
### Phase 5: Native And Nodes Become Event-Driven
- map native push-to-talk into managed-room sessions
- start, end, cancel, and replace turns through explicit session verbs
- clean capture state when push-to-talk start fails
- keep local STT and TTS as native adapter behavior
- remove chat-history polling from the success path
- keep fallback polling only if there is an explicit degraded-mode test
Exit criteria: native Talk success path is driven by `talk.event`, not hidden
chat side effects.
### Phase 6: Telephony And Meetings Become Adapters
- map Voice Call realtime and streaming STT into Talk event/cancellation semantics
- create or guard a turn before early speech cancellation events
- keep telephony codec, marks, clear events, and call lifecycle outside core
- map Google Meet transcript and assistant output into `talk.event`
- keep participant and echo-suppression behavior in the meeting adapter
- pass abort signals into agent consult and tool runtime
Exit criteria: Voice Call and meetings share event and cancellation semantics
without introducing telephony or meeting branches in core.
### Phase 7: Config And Doctor Cleanup
- keep `talk.provider` and `talk.providers.*` as speech/STT/TTS config
- keep realtime voice selectors under `talk.realtime`
- make `talk.config` return only resolved effective provider data
- repair legacy realtime placement in doctor
- document that runtime startup does not guess or rewrite config
- update SDK migration, Gateway protocol, Talk node, Control UI, and TTS docs
Exit criteria: no second speech namespace, no startup migrations, and no
ambiguous active provider in `talk.config`.
### Phase 8: Delete The Retired Stack
- remove `/voiceclaw/realtime`
- delete `src/gateway/voiceclaw-realtime/`
- remove request-time `instructionsOverride`
- remove old RPC handlers, scopes, broadcast guards, protocol schemas, generated clients, docs, and UI calls
- keep old names only in explicit migration tables and negative tests
Exit criteria: repository search finds removed public names only in migration
notes or tests that assert absence.
## Test And Verification Plan
The full matrix lives in
[Talk refactor execution checklist](/refactor/talk-execution). The required
proof areas are:
- protocol and generated clients expose only the final Talk API
- Gateway tests cover every `talk.client.*` and `talk.session.*` method
- UI tests prove browser WebRTC, provider WebSocket, and relay paths use the final API
- native tests prove managed-room push-to-talk cleanup, replacement, and event flow
- Voice Call and meeting tests prove early speech, barge-in, output state, and cancellation behavior
- config tests prove `talk.config` reports only resolved effective provider data
- architecture searches prove removed RPCs, events, endpoint, folder, and instruction override stay gone
- docs, protocol generation, SDK API checks, Android tests, build, and `pnpm check:changed` pass before push
## Definition Of Done
The refactor is complete when:
- final API is the only advertised public API
- removed RPCs are gone from handlers, scopes, method lists, schemas, generated clients, docs, and UI
- removed event channels are gone
- retired realtime HTTP endpoint is gone
- retired realtime folder is gone
- browser Talk works through `talk.client.*` or `talk.session.*`
- native Talk works through session events
- streaming STT works through `talk.session.*`
- TTS one-shot remains `talk.speak`
- walkie-talkie works through managed-room sessions
- Voice Call and meetings use shared events and cancellation semantics
- cancellation aborts underlying work
- event envelopes are consistent
- config migration is handled by doctor
- tests prove the deleted API cannot accidentally return
Supporting details:
- [Talk API and runtime contract](/refactor/talk-api-contract)
- [Talk surface mapping](/refactor/talk-surfaces)
- [Talk refactor execution checklist](/refactor/talk-execution)
The end state: one Talk system, a small public API, provider-owned vendor
logic, surface-owned IO, and a Gateway core that owns policy, events, sessions,
turns, cancellation, and observability.

View File

@@ -96,7 +96,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Chat and Talk">
- Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and forwards `openclaw_agent_consult` provider tool calls through `talk.realtime.toolCall` for Gateway policy and the larger configured OpenClaw model.
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. Client-owned provider sessions start with `talk.client.create`; Gateway relay sessions start with `talk.session.create`. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.session.appendAudio` and forwards `openclaw_agent_consult` provider tool calls through `talk.client.toolCall` for Gateway policy and the larger configured OpenClaw model.
- Stream tool calls + live tool output cards in Chat (agent events).
</Accordion>
@@ -168,9 +168,9 @@ Imported themes are stored only in the current browser profile. They are not wri
</Accordion>
<Accordion title="Talk mode (browser realtime)">
Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.realtime.provider: "openai"` plus `talk.realtime.providers.openai.apiKey`, or configure Google with `talk.realtime.provider: "google"` plus `talk.realtime.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.
Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.realtime.provider: "openai"` plus `talk.realtime.providers.openai.apiKey`, or configure Google with `talk.realtime.provider: "google"` plus `talk.realtime.providers.google.apiKey`. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.client.create` does not accept caller-provided instruction overrides.
In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `talk.realtime.toolCall`.
In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `talk.client.toolCall`.
Maintainer live smoke: `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts` verifies the OpenAI browser WebRTC SDP exchange, Google Live constrained-token browser WebSocket setup, and the Gateway relay browser adapter with fake microphone media. The command prints provider status only and does not log secrets.