mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 05:30:42 +00:00
docs: detail talk refactor plan
@@ -184,42 +184,6 @@ OPENCLAW_CONFIG_PATH=~/.openclaw/b.json OPENCLAW_STATE_DIR=~/.openclaw-b opencla
|
||||
|
||||
Detailed setup: [/gateway/multiple-gateways](/gateway/multiple-gateways).
|
||||
|
||||
## VoiceClaw real-time brain endpoint
|
||||
|
||||
OpenClaw exposes a VoiceClaw-compatible real-time WebSocket endpoint at
|
||||
`/voiceclaw/realtime`. Use it when a VoiceClaw desktop client should talk
|
||||
directly to a real-time OpenClaw brain instead of going through a separate relay
|
||||
process.
|
||||
|
||||
The endpoint uses Gemini Live for real-time audio and calls OpenClaw as the
|
||||
brain by exposing OpenClaw tools directly to Gemini Live. Tool calls return an
|
||||
immediate `working` result to keep the voice turn responsive, then OpenClaw
|
||||
executes the actual tool asynchronously and injects the result back into the
|
||||
live session. Set `GEMINI_API_KEY` in the gateway process environment. If
|
||||
gateway auth is enabled, the desktop client sends the gateway token or password
|
||||
in its first `session.config` message.
|
||||
|
||||
Real-time brain access runs owner-authorized OpenClaw agent commands. Keep
|
||||
`gateway.auth.mode: "none"` limited to loopback-only test instances. Non-local
|
||||
real-time brain connections require gateway auth.
|
||||
|
||||
For an isolated test gateway, run a separate instance with its own port, config,
|
||||
and state:
|
||||
|
||||
```bash
OPENCLAW_CONFIG_PATH=/path/to/openclaw-realtime/openclaw.json \
OPENCLAW_STATE_DIR=/path/to/openclaw-realtime/state \
OPENCLAW_SKIP_CHANNELS=1 \
GEMINI_API_KEY=... \
openclaw gateway --port 19789
```
|
||||
|
||||
Then configure VoiceClaw to use:
|
||||
|
||||
```text
ws://127.0.0.1:19789/voiceclaw/realtime
```
|
||||
|
||||
## Remote access
|
||||
|
||||
Preferred: Tailscale/VPN.
|
||||
|
||||
@@ -364,15 +364,17 @@ enumeration of `src/gateway/server-methods/*.ts`.
|
||||
<Accordion title="Talk and TTS">
|
||||
- `talk.catalog` returns the read-only Talk provider catalog for speech, streaming transcription, and realtime voice. It includes provider ids, labels, configured state, exposed model/voice ids, canonical modes, transports, brain strategies, and realtime audio/capability flags without returning provider secrets or mutating global config.
|
||||
- `talk.config` returns the effective Talk config payload; `includeSecrets` requires `operator.talk.secrets` (or `operator.admin`).
|
||||
- `talk.handoff.create` creates an expiring managed-room handoff for an existing session key. The result contains a room id, room URL, bearer token, optional session-scoped provider/model/voice selection, mode, transport, brain strategy, and expiry for a first-party walkie-talkie client. `brain: "direct-tools"` requires `operator.admin`.
|
||||
- `talk.handoff.join` validates a handoff id plus bearer token, emits `session.ready` or `session.replaced` room events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
|
||||
- `talk.handoff.turnStart`, `talk.handoff.turnEnd`, and `talk.handoff.turnCancel` let a first-party managed-room client drive the room turn lifecycle with `turn.started`, `turn.ended`, and `turn.cancelled` Talk events.
|
||||
- `talk.handoff.revoke` invalidates an unexpired handoff, emits `session.closed`, and makes later joins fail.
|
||||
- `talk.session.create` creates a Gateway-owned Talk session for `realtime/gateway-relay`, `transcription/gateway-relay`, or `stt-tts/managed-room`. `brain: "direct-tools"` requires `operator.admin`.
|
||||
- `talk.session.join` validates a managed-room session token, emits `session.ready` or `session.replaced` events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
|
||||
- `talk.session.appendAudio` appends base64 PCM input audio to Gateway-owned realtime relay and transcription sessions.
|
||||
- `talk.session.startTurn`, `talk.session.endTurn`, and `talk.session.cancelTurn` drive managed-room turn lifecycle with stale-turn rejection before state is cleared.
|
||||
- `talk.session.cancelOutput` stops assistant audio output, primarily for VAD-gated barge-in in Gateway relay sessions.
|
||||
- `talk.session.submitToolResult` completes a provider tool call emitted by a Gateway-owned realtime relay session.
|
||||
- `talk.session.close` closes a Gateway-owned relay, transcription, or managed-room session and emits terminal Talk events.
|
||||
- `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients.
|
||||
- `talk.realtime.session` creates a browser realtime session using canonical transports (`webrtc`, `provider-websocket`, or `gateway-relay`). It accepts optional `mode`, `transport`, and `brain` selectors, but currently only public browser `mode: "realtime"` plus `brain: "agent-consult"` is supported; `managed-room` remains reserved for handoff clients until the browser owns a real room client.
|
||||
- `talk.realtime.relayAudio`, `talk.realtime.relayCancel`, `talk.realtime.relayMark`, `talk.realtime.relayStop`, and `talk.realtime.relayToolResult` control Gateway-owned realtime relay sessions. Relay cancellation clears provider output and aborts any linked agent consult run.
|
||||
- `talk.realtime.toolCall` lets browser-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result. Gateway relay clients include `relaySessionId` so turn cancellation can abort the consult.
|
||||
- `talk.transcription.session` creates a transcription-only Gateway relay over the configured streaming STT provider. Clients send PCM frames through `talk.transcription.relayAudio`, cancel an active turn with `talk.transcription.relayCancel`, receive `talk.transcription.relay` events with common Talk envelopes, and close with `talk.transcription.relayStop`.
|
||||
- `talk.client.create` creates a client-owned realtime provider session using `webrtc` or `provider-websocket` while the Gateway owns config, credentials, instructions, and tool policy.
|
||||
- `talk.client.toolCall` lets client-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result.
|
||||
- `talk.event` is the single Talk event channel for realtime, transcription, STT/TTS, managed-room, telephony, and meeting adapters.
|
||||
- `talk.speak` synthesizes speech through the active Talk speech provider.
|
||||
- `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state.
|
||||
- `tts.providers` returns the visible TTS provider inventory.
|
||||
|
||||
@@ -9,8 +9,8 @@ title: "Talk mode"
|
||||
Talk mode has two runtime shapes:
|
||||
|
||||
- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
|
||||
- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms.
|
||||
- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response.
|
||||
- Browser Talk uses `talk.client.create` for client-owned `webrtc` and `provider-websocket` sessions, or `talk.session.create` for Gateway-owned `gateway-relay` sessions. `managed-room` is reserved for Gateway handoff and walkie-talkie rooms.
|
||||
- Transcription-only clients use `talk.session.create({ mode: "transcription", transport: "gateway-relay", brain: "none" })`, then `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close` when they need captions or dictation without an assistant voice response.
|
||||
|
||||
Native Talk is a continuous voice conversation loop:
|
||||
|
||||
@@ -19,7 +19,7 @@ Native Talk is a continuous voice conversation loop:
|
||||
3. Wait for the response
|
||||
4. Speak it via the configured Talk provider (`talk.speak`)
|
||||
|
||||
Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
|
||||
Browser realtime Talk forwards provider tool calls through `talk.client.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
|
||||
|
||||
Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
|
||||
|
||||
@@ -132,8 +132,8 @@ Defaults:
|
||||
|
||||
- Requires Speech + Microphone permissions.
|
||||
- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
|
||||
- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
|
||||
- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates.
|
||||
- Browser realtime Talk uses `talk.client.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
|
||||
- Transcription-only Talk uses `talk.session.create`, `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close`; clients subscribe to `talk.event` for partial/final transcript updates.
|
||||
- The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
|
||||
- macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
|
||||
- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
|
||||
|
||||
@@ -87,20 +87,19 @@ event history, and stale-turn rejection. Provider plugins should keep owning
|
||||
vendor-specific realtime sessions; surface plugins should keep owning capture,
|
||||
playback, telephony, and meeting quirks.
|
||||
|
||||
This migration is intentionally adapter-first:
|
||||
This Talk migration is intentionally breaking-clean:
|
||||
|
||||
1. Add shared controller/runtime primitives to `plugin-sdk/realtime-voice`.
|
||||
2. Keep existing public Gateway RPCs such as `talk.realtime.session`,
|
||||
`talk.realtime.relayAudio`, `talk.transcription.session`, and
|
||||
`talk.handoff.*` as compatibility adapters.
|
||||
3. Move bundled surfaces onto the shared controller: browser relay, managed-room
|
||||
handoff, voice-call realtime, voice-call streaming STT, Google Meet realtime,
|
||||
and VoiceClaw realtime.
|
||||
4. Advertise all Talk event channels in Gateway `hello-ok.features.events` so
|
||||
clients can discover `talk.event`, `talk.realtime.relay`, and
|
||||
`talk.transcription.relay`.
|
||||
5. Expose the versioned `talk.session.*` API for Gateway-managed Talk sessions
|
||||
after the adapters are internally backed by the same controller.
|
||||
1. Keep the shared controller/runtime primitives in
|
||||
`plugin-sdk/realtime-voice`.
|
||||
2. Move bundled surfaces onto the shared controller: browser relay,
|
||||
managed-room handoff, voice-call realtime, voice-call streaming STT, Google
|
||||
Meet realtime, and native push-to-talk.
|
||||
3. Replace old Talk RPC families with the final `talk.session.*` and
|
||||
`talk.client.*` API.
|
||||
4. Advertise one live Talk event channel in Gateway
|
||||
`hello-ok.features.events`: `talk.event`.
|
||||
5. Delete the old realtime HTTP endpoint and any request-time instruction
|
||||
override path.
|
||||
|
||||
New code should not call `createTalkEventSequencer(...)` directly unless it is
|
||||
implementing a low-level adapter or test fixture. Prefer the shared controller
|
||||
@@ -112,24 +111,33 @@ handoff, and native Talk clients.
|
||||
The target public API shape is:
|
||||
|
||||
```typescript
|
||||
// Versioned Gateway-managed Talk session API.
|
||||
// Gateway-owned Talk session API.
|
||||
await gateway.request("talk.session.create", {
|
||||
mode: "realtime",
|
||||
transport: "gateway-relay",
|
||||
brain: "agent-consult",
|
||||
sessionKey: "main",
|
||||
});
|
||||
await gateway.request("talk.session.inputAudio", { sessionId, audioBase64 });
|
||||
await gateway.request("talk.session.control", { sessionId, type: "turn.cancel" });
|
||||
await gateway.request("talk.session.toolResult", { sessionId, callId, result });
|
||||
await gateway.request("talk.session.appendAudio", { sessionId, audioBase64 });
|
||||
await gateway.request("talk.session.cancelOutput", { sessionId, reason: "barge-in" });
|
||||
await gateway.request("talk.session.submitToolResult", { sessionId, callId, result });
|
||||
await gateway.request("talk.session.close", { sessionId });
|
||||
|
||||
// Client-owned provider session API.
|
||||
await gateway.request("talk.client.create", {
|
||||
mode: "realtime",
|
||||
transport: "webrtc",
|
||||
brain: "agent-consult",
|
||||
sessionKey: "main",
|
||||
});
|
||||
await gateway.request("talk.client.toolCall", { sessionKey, callId, name, args });
|
||||
```
|
||||
|
||||
Browser-owned WebRTC/provider-websocket sessions stay on
|
||||
`talk.realtime.session`, because the browser owns the provider negotiation and
|
||||
media transport. `talk.session.*` is the common Gateway-managed surface for
|
||||
gateway-relay realtime, gateway-relay transcription, and managed-room native
|
||||
STT/TTS sessions.
|
||||
Browser-owned WebRTC/provider-websocket sessions use `talk.client.create`,
|
||||
because the browser owns the provider negotiation and media transport while the
|
||||
Gateway owns credentials, instructions, and tool policy. `talk.session.*` is the
|
||||
common Gateway-managed surface for gateway-relay realtime, gateway-relay
|
||||
transcription, and managed-room native STT/TTS sessions.
|
||||
|
||||
Legacy configs that placed realtime selectors beside `talk.provider` /
|
||||
`talk.providers` should be repaired with `openclaw doctor --fix`; runtime Talk
|
||||
@@ -144,30 +152,43 @@ The supported `talk.session.create` combinations are intentionally small:
|
||||
| `stt-tts` | `managed-room` | `agent-consult` | Native/client room | Push-to-talk and walkie-talkie style rooms where the client owns capture/playback and the Gateway owns turn state. |
|
||||
| `stt-tts` | `managed-room` | `direct-tools` | Native/client room | Admin-only room mode for trusted first-party surfaces that execute Gateway tool actions directly. |
|
||||
|
||||
Everything else should stay on the existing owner-specific adapter until there
|
||||
is a real Gateway-managed transport for it:
|
||||
Removed method map:
|
||||
|
||||
| Existing adapter | Keep using it for |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| `talk.realtime.session` | Browser-owned WebRTC and provider-websocket realtime sessions. |
| `talk.realtime.relay*` | Compatibility for existing browser relay clients while they migrate to `talk.session.*`. |
| `talk.transcription.*` | Compatibility for existing streaming STT clients while they migrate to `talk.session.*`. |
| `talk.handoff.*` | Compatibility for room-style native clients; internally this is the managed-room shape. |
|
||||
| Old | New |
| -------------------------------- | -------------------------------------------------------- |
| `talk.realtime.session` | `talk.client.create` |
| `talk.realtime.toolCall` | `talk.client.toolCall` |
| `talk.realtime.relayAudio` | `talk.session.appendAudio` |
| `talk.realtime.relayCancel` | `talk.session.cancelOutput` or `talk.session.cancelTurn` |
| `talk.realtime.relayToolResult` | `talk.session.submitToolResult` |
| `talk.realtime.relayStop` | `talk.session.close` |
| `talk.transcription.session` | `talk.session.create({ mode: "transcription" })` |
| `talk.transcription.relayAudio` | `talk.session.appendAudio` |
| `talk.transcription.relayCancel` | `talk.session.cancelTurn` |
| `talk.transcription.relayStop` | `talk.session.close` |
| `talk.handoff.create` | `talk.session.create({ transport: "managed-room" })` |
| `talk.handoff.join` | `talk.session.join` |
| `talk.handoff.revoke` | `talk.session.close` |

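
When auditing client call sites, the removed method map can be encoded as a lookup. The following is a minimal sketch only: `REMOVED_TALK_METHODS` and `migrateTalkMethod` are hypothetical names for illustration, not part of OpenClaw, and the map covers only the rows listed in the table.

```typescript
// Hypothetical audit helper; not an OpenClaw API.
// Each removed method maps to the replacement method name(s) from the table.
const REMOVED_TALK_METHODS = new Map<string, readonly string[]>([
  ["talk.realtime.session", ["talk.client.create"]],
  ["talk.realtime.toolCall", ["talk.client.toolCall"]],
  ["talk.realtime.relayAudio", ["talk.session.appendAudio"]],
  // relayCancel splits into two verbs depending on what the caller meant.
  ["talk.realtime.relayCancel", ["talk.session.cancelOutput", "talk.session.cancelTurn"]],
  ["talk.realtime.relayToolResult", ["talk.session.submitToolResult"]],
  ["talk.realtime.relayStop", ["talk.session.close"]],
  ["talk.transcription.session", ["talk.session.create"]],
  ["talk.transcription.relayAudio", ["talk.session.appendAudio"]],
  ["talk.transcription.relayCancel", ["talk.session.cancelTurn"]],
  ["talk.transcription.relayStop", ["talk.session.close"]],
  ["talk.handoff.create", ["talk.session.create"]],
  ["talk.handoff.join", ["talk.session.join"]],
  ["talk.handoff.revoke", ["talk.session.close"]],
]);

// Returns the replacement candidates, or the method itself when it was not removed.
function migrateTalkMethod(method: string): readonly string[] {
  return REMOVED_TALK_METHODS.get(method) ?? [method];
}
```

Methods with two candidates (such as `talk.realtime.relayCancel`) still need a human decision about which verb the call site intended.
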
|
||||
|
||||
The unified control vocabulary is also deliberately narrow:
|
||||
|
||||
| Method | Applies to | Contract |
| ------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `talk.session.inputAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. |
| `talk.session.control` | all unified sessions | `turn.cancel` for relay sessions; `turn.start`, `turn.end`, and `turn.cancel` for managed-room sessions. |
| `talk.session.toolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. |
| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room handoff state, then forget the unified session id. |
|
||||
| Method | Applies to | Contract |
| ------------------------------- | ------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `talk.session.appendAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. |
| `talk.session.startTurn` | `stt-tts/managed-room` | Start a managed-room user turn. |
| `talk.session.endTurn` | `stt-tts/managed-room` | End the active turn after stale-turn validation. |
| `talk.session.cancelTurn` | all Gateway-owned sessions | Cancel active capture/provider/agent/TTS work for a turn. |
| `talk.session.cancelOutput` | `realtime/gateway-relay` | Stop assistant audio output without necessarily ending the user turn. |
| `talk.session.submitToolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. |
| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room state, then forget the unified session id. |
|
||||
|
||||
Do not introduce provider or platform special cases in core to make this work.
|
||||
Core owns Talk session semantics. Provider plugins own vendor session setup.
|
||||
Voice-call and Google Meet own telephony/meeting adapters. Browser and native
|
||||
apps own device capture/playback UX.
|
||||
|
||||
The detailed implementation plan lives in [Talk refactor plan](/refactor/talk).
|
||||
|
||||
## Compatibility policy
|
||||
|
||||
For external plugins, compatibility work follows this order:
|
||||
|
||||
320
docs/refactor/talk-api-contract.md
Normal file
@@ -0,0 +1,320 @@
---
summary: "Detailed API, event, runtime, cancellation, and tool-policy contract for the Talk refactor"
read_when:
- Implementing Talk Gateway methods or protocol schemas
- Changing Talk config, events, cancellation, or provider tool policy
- Reviewing whether a Talk behavior belongs in core or an adapter
title: "Talk API and runtime contract"
---

# Talk API And Runtime Contract

This is the detailed contract for [Talk refactor plan](/refactor/talk).

## Config Contract

Config stays under the existing `talk` object. Do not add `talk.speech` in this
refactor.

```ts
type TalkConfig = {
  provider?: string;
  providers?: Record<string, unknown>;
  realtime?: {
    provider?: string;
    model?: string;
    voice?: string;
    mode?: TalkMode;
    transport?: TalkTransport;
    brain?: TalkBrain;
    providers?: Record<string, unknown>;
  };
  input?: {
    interruptOnSpeech?: boolean;
    silenceTimeoutMs?: number;
  };
};
```

Rules:

- `talk.provider` and `talk.providers.*` remain speech/STT/TTS provider config.
- `talk.realtime.provider` and `talk.realtime.providers.*` are realtime voice provider config.
- `talk.config` returns effective config without secrets unless privileged.
- `talk.catalog` returns capabilities, not inferred provider-id guesses.
- Doctor migrates old realtime selectors into `talk.realtime`.
- Runtime does not silently reinterpret Voice Call or TTS config as realtime config.

## Method Semantics

### `talk.catalog`

Returns effective Talk capabilities:

- modes
- transports
- brain strategies
- providers
- models
- voices
- input audio formats
- output audio formats
- browser-safe client session support
- Gateway relay support
- managed-room support
- local STT/TTS support

Provider capability declarations drive this. Core must not infer support from
provider ids.

### `talk.speak`

One-shot TTS:

```ts
await gateway.request("talk.speak", {
  text: "Ready.",
  voice: "alloy",
});
```

`talk.speak` does not create live session state, turn state, transcript state,
barge-in state, or provider realtime state.

### `talk.client.create`

Creates a client-owned provider session while Gateway still owns config,
instructions, credentials, and tool policy.

Use it for browser WebRTC, browser provider WebSocket, and native provider media
sessions that require client-owned sockets. Reject `gateway-relay` and
`managed-room`; the error points clients to `talk.session.create`.

### `talk.client.toolCall`

Forwards provider tool calls from client-owned provider sessions to Gateway
policy:

```ts
await gateway.request("talk.client.toolCall", {
  sessionId,
  callId,
  name,
  argumentsJson,
});
```

Validate session identity, caller ownership, brain strategy, and policy. Pass an
`AbortSignal` into agent/tool runtime, reject stale or closed sessions, and never
accept request-time instructions.

### `talk.session.create`

Creates a Gateway-owned live Talk session.

| Mode | Transport | Brain | Owner |
| --------------- | --------------- | --------------- | ------------------- |
| `realtime` | `gateway-relay` | `agent-consult` | Gateway |
| `transcription` | `gateway-relay` | `none` | Gateway |
| `stt-tts` | `managed-room` | `agent-consult` | Gateway/client room |
| `stt-tts` | `managed-room` | `direct-tools` | trusted room |

Reject `webrtc` and `provider-websocket`; the error points clients to
`talk.client.create`.

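
The combination table and the rejection rule can be sketched as a small validator. This is an illustration under stated assumptions, not the Gateway implementation: the union types below list only the values named in this document, and `checkSessionCreate`, `SUPPORTED_COMBINATIONS`, and the error strings are hypothetical.

```typescript
// Assumed unions: only the values named in this document, not the real schema.
type TalkMode = "realtime" | "transcription" | "stt-tts";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";

interface SessionCreateParams {
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
}

type CheckResult = { ok: true } | { ok: false; error: string };

// The four rows of the talk.session.create table.
const SUPPORTED_COMBINATIONS: readonly SessionCreateParams[] = [
  { mode: "realtime", transport: "gateway-relay", brain: "agent-consult" },
  { mode: "transcription", transport: "gateway-relay", brain: "none" },
  { mode: "stt-tts", transport: "managed-room", brain: "agent-consult" },
  { mode: "stt-tts", transport: "managed-room", brain: "direct-tools" },
];

// Hypothetical validator: client-owned transports are redirected first,
// then the remaining request is matched against the supported rows.
function checkSessionCreate(params: SessionCreateParams): CheckResult {
  if (params.transport === "webrtc" || params.transport === "provider-websocket") {
    return {
      ok: false,
      error: `transport "${params.transport}" is client-owned; use talk.client.create`,
    };
  }
  const supported = SUPPORTED_COMBINATIONS.some(
    (row) =>
      row.mode === params.mode &&
      row.transport === params.transport &&
      row.brain === params.brain,
  );
  return supported
    ? { ok: true }
    : {
        ok: false,
        error: `unsupported combination ${params.mode}/${params.transport}/${params.brain}`,
      };
}
```

Scope checks (for example, any admin requirement for `direct-tools`) are deliberately left out of this sketch.
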
### `talk.session.join`

Joins or reconnects to a Gateway-owned managed room. Validate session id and
token, never expose token hashes, emit `session.replaced` to the displaced
client, and emit `session.ready` to the new owner.

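
The join ordering can be sketched as follows. This is a simplified illustration: `ManagedRoomSketch` and `RoomEvent` are hypothetical, and the plaintext token comparison stands in for whatever hashed, constant-time check the real Gateway uses.

```typescript
// Hypothetical event shape: `clientId` names the recipient of the event.
type RoomEvent = {
  type: "session.ready" | "session.replaced";
  sessionId: string;
  clientId: string;
};

class ManagedRoomSketch {
  private readonly sessionId: string;
  // Simplification: a real implementation would store only a token hash.
  private readonly token: string;
  private ownerClientId: string | undefined;

  constructor(sessionId: string, token: string) {
    this.sessionId = sessionId;
    this.token = token;
  }

  // Validates first, then returns the events to emit, in order.
  join(sessionId: string, token: string, clientId: string): RoomEvent[] {
    if (sessionId !== this.sessionId || token !== this.token) {
      throw new Error("invalid session id or token");
    }
    const events: RoomEvent[] = [];
    const displaced = this.ownerClientId;
    if (displaced !== undefined && displaced !== clientId) {
      // The previous owner learns it was displaced before the new owner is ready.
      events.push({ type: "session.replaced", sessionId, clientId: displaced });
    }
    this.ownerClientId = clientId;
    events.push({ type: "session.ready", sessionId, clientId });
    return events;
  }
}
```

Note that the returned events carry no token material, in line with the rule that token hashes are never exposed.
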
### `talk.session.appendAudio`

Appends an input audio frame to a Gateway-owned relay session:

```ts
await gateway.request("talk.session.appendAudio", {
  sessionId,
  audioBase64,
  timestamp,
});
```

Use for realtime Gateway relay and streaming transcription. Do not use this for
managed-room native push-to-talk when the native node captures audio locally and
returns transcript/output through node command results.

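
A client needs to turn captured samples into the `audioBase64` string. The sketch below assumes 16-bit PCM in platform byte order (little-endian on common hardware) and a runtime that provides the global `btoa`; the sample format, rate, and channel layout the Gateway expects are not specified here, and `pcm16ToBase64` is a hypothetical helper.

```typescript
// Hypothetical helper: encodes 16-bit PCM samples as base64.
// Assumption: the bytes are sent in platform byte order, which is
// little-endian on virtually all current hardware.
function pcm16ToBase64(samples: Int16Array): string {
  // View the same memory as bytes without copying.
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  // btoa is available in browsers and in recent Node.js versions.
  return btoa(binary);
}
```

The result would then be passed as `audioBase64` in a `talk.session.appendAudio` request like the one above.
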
### Turn Verbs

Use explicit verbs instead of generic controls:

```ts
await gateway.request("talk.session.startTurn", { sessionId });
await gateway.request("talk.session.endTurn", { sessionId, turnId });
await gateway.request("talk.session.cancelTurn", { sessionId, turnId, reason });
await gateway.request("talk.session.cancelOutput", { sessionId, turnId, reason });
```

`endTurn` rejects stale `turnId` before clearing active state. `cancelTurn`
aborts capture, STT, provider response, agent consult, tools, TTS, relay output,
and room streams tied to that turn. `cancelOutput` stops assistant audio without
necessarily ending the user turn. Barge-in must be speech/VAD gated.

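
The "reject stale `turnId` before clearing active state" rule is the part most easily gotten wrong, so here is a minimal sketch of it. `TurnTracker` is hypothetical, and the choice that a new `startTurn` supersedes an unfinished turn is an assumption of this sketch rather than documented behavior.

```typescript
// Hypothetical tracker for one session's active turn.
class TurnTracker {
  private activeTurnId: string | undefined;
  private nextTurnNumber = 1;

  // Assumption: starting a turn supersedes any turn that is still active.
  startTurn(): string {
    const turnId = `turn-${this.nextTurnNumber++}`;
    this.activeTurnId = turnId;
    return turnId;
  }

  // Validate first, mutate second: a stale id must not clear the active turn.
  endTurn(turnId: string): void {
    if (turnId !== this.activeTurnId) {
      throw new Error(`stale turn: ${turnId}`);
    }
    this.activeTurnId = undefined;
  }

  get active(): string | undefined {
    return this.activeTurnId;
  }
}
```

If the check and the mutation were swapped, a late `endTurn` for an old turn would silently end the current one.
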
### `talk.session.submitToolResult`

Completes a provider tool call emitted inside a Gateway-owned relay session:

```ts
await gateway.request("talk.session.submitToolResult", {
  sessionId,
  callId,
  output,
});
```

### `talk.session.close`

Closes a Gateway-owned session. Close emits one terminal event, stops capture and
playback, aborts provider and agent work, drains TTS, revokes room join state,
and removes retained state after its replay/debug window.

## Event Contract

All live Talk paths emit one public event channel:

```ts
talk.event;
```

Every event uses this envelope:

```ts
type TalkEvent<TPayload = unknown> = {
  id: string;
  type: TalkEventType;
  sessionId: string;
  turnId?: string;
  captureId?: string;
  seq: number;
  timestamp: string;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
  provider?: string;
  final?: boolean;
  callId?: string;
  itemId?: string;
  parentId?: string;
  source?: string;
  payload: TPayload;
};
```

Core event types include `session.*`, `turn.*`, `capture.*`, `input.audio.*`,
`transcript.*`, `output.text.*`, `output.audio.*`, `tool.*`, `usage.metrics`,
`latency.metrics`, and `health.changed`.

Rules:

- `sessionId` is required for every event.
- `turnId` is required for turn-bound input, output, transcript, tool, and cancellation events.
- `captureId` is required while capture is active.
- `seq` monotonically increases per session.
- `timestamp` uses ISO 8601 UTC.
- `callId`, `itemId`, and `parentId` correlate provider responses, tool calls, TTS jobs, and relay frames.
- payloads must not duplicate large raw audio frames when transport already carries them.
- consumers should rely on envelope fields instead of provider-specific payloads.

Text-ready is not audio-ready. Clients may show text after `output.text.done`,
but should not enter speaking/playback state until `output.audio.started` or
`output.audio.delta`.

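
On the consumer side, the `seq` rule and the text-ready versus audio-ready distinction can be sketched as a small reducer. `TalkEventConsumer`, `TalkEventLike`, and `PlaybackState` are hypothetical, and dropping events whose `seq` does not advance is an assumed client policy, not something this contract mandates.

```typescript
// Minimal slice of the envelope that this sketch needs.
interface TalkEventLike {
  sessionId: string;
  seq: number;
  type: string;
}

type PlaybackState = "idle" | "text-ready" | "speaking";

class TalkEventConsumer {
  private readonly lastSeq = new Map<string, number>();
  state: PlaybackState = "idle";

  // Returns false when the event is dropped as a duplicate or out of order.
  accept(event: TalkEventLike): boolean {
    const last = this.lastSeq.get(event.sessionId);
    if (last !== undefined && event.seq <= last) {
      return false; // seq must strictly increase per session
    }
    this.lastSeq.set(event.sessionId, event.seq);

    if (event.type === "output.text.done" && this.state === "idle") {
      // Text can be shown, but playback has not started.
      this.state = "text-ready";
    } else if (event.type === "output.audio.started" || event.type === "output.audio.delta") {
      this.state = "speaking";
    }
    return true;
  }
}
```

Keeping `text-ready` separate from `speaking` avoids showing a speaking indicator before any audio has arrived.
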
## Shared Runtime Target

Keep one provider-agnostic runtime under `src/talk`. The first pass keeps names
close to the old runtime modules so the move stays reviewable:

```text
src/talk/
  audio-codec.ts
  agent-consult-runtime.ts
  agent-consult-tool.ts
  agent-talkback-runtime.ts
  fast-context-runtime.ts
  provider-registry.ts
  provider-resolver.ts
  provider-types.ts
  session-log-runtime.ts
  session-runtime.ts
  talk-events.ts
  talk-session-controller.ts
```

New code should import the shared runtime from `src/talk` inside core. Plugins
that already use the stable SDK subpath keep importing
`openclaw/plugin-sdk/realtime-voice`; that facade re-exports the Talk runtime
contract without exposing core file layout.

Responsibilities:

- normalize modes, transports, brains, codecs, and audio metadata
- create, close, and replace session records
- allocate turn ids and capture ids
- reject stale turn ids before mutation
- sequence events
- retain recent events for replay, reconnect, and diagnostics
- track active input capture and assistant output
- coordinate barge-in and output cancellation
- propagate abort signals
- register provider tool calls and bind tool results
- expose test builders for session/event assertions

Gateway method files should become thin adapters:

```text
src/gateway/server-methods/
  talk.ts
  talk-client.ts
  talk-session.ts
```

Internal Gateway helpers may exist only as staging files while code moves to
`src/talk`.

## Cancellation Contract

Cancellation must abort underlying work, not only ignore stale output.

When a turn or session is cancelled:

- provider realtime response is cancelled when supported
- provider session is closed or reset when cancellation cannot be scoped
- streaming STT receives abort
- agent consult receives abort
- queued tools do not start after abort
- already-started side-effecting tools receive abort and report cancellation
- pending TTS jobs are drained
- playback sources are stopped
- relay streams are cleared
- managed-room capture and output state reset
- stale finals and stale audio deltas are ignored
- one terminal cancellation event is emitted

Barge-in uses VAD or provider speech-started signals, ignores silence and echo,
cancels output only after real user speech, and starts or ensures a turn before
emitting `turn.cancelled`.

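
Three of the rules above (queued tools do not start, started tools receive abort, exactly one terminal event) can be sketched with the standard `AbortController`. `TurnWork` is a hypothetical class for illustration; the real runtime also covers provider, STT, TTS, and relay state, which this sketch omits.

```typescript
type ToolRunner = (signal: AbortSignal) => void;

// Hypothetical per-turn work tracker; not the OpenClaw runtime.
class TurnWork {
  private readonly controller = new AbortController();
  private readonly queued: ToolRunner[] = [];
  private cancelled = false;
  // Event type names emitted for this turn, recorded for inspection.
  readonly events: string[] = [];

  enqueue(tool: ToolRunner): void {
    this.queued.push(tool);
  }

  // Starts the next queued tool; returns false when nothing was started.
  runNext(): boolean {
    if (this.controller.signal.aborted) return false; // no starts after abort
    const tool = this.queued.shift();
    if (tool === undefined) return false;
    tool(this.controller.signal); // started tools observe the shared signal
    return true;
  }

  cancel(): void {
    if (this.cancelled) return; // emit the terminal event only once
    this.cancelled = true;
    this.controller.abort(); // notifies already-started tools
    this.queued.length = 0; // drop tools that never started
    this.events.push("turn.cancelled");
  }
}
```

Because every tool shares one signal, aborting the underlying work and suppressing stale output are driven from the same place.
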
## Tool Policy Contract

Gateway owns Talk tool policy.

Client-owned flow: `talk.client.create`, provider tool call to client,
`talk.client.toolCall`, Gateway policy validation, agent/direct-tool execution,
client result submission to provider.

Gateway-owned flow: `talk.session.create`, provider tool call to Gateway,
Gateway policy validation, agent/direct-tool execution, provider result
submission, `talk.event` emission.

No Talk path accepts caller-provided instructions. Gateway builds instructions
from trusted config and session context.

229
docs/refactor/talk-execution.md
Normal file
@@ -0,0 +1,229 @@
---
summary: "Implementation packages, deletion checklist, test matrix, and verification commands for the Talk refactor"
read_when:
  - Implementing the Talk refactor plan
  - Deleting legacy Talk RPCs, event channels, or realtime endpoint code
  - Verifying browser, native, telephony, meeting, STT, or TTS Talk behavior after refactor work
title: "Talk refactor execution checklist"
---

# Talk Refactor Execution Checklist

Use this as the PR tracker for [Talk refactor plan](/refactor/talk).

## Implementation Packages

### Package 1: Protocol

- update `src/gateway/protocol/schema/channels.ts`
- update `src/gateway/protocol/schema/protocol-schemas.ts`
- update `src/gateway/protocol/schema/types.ts`
- update `src/gateway/protocol/index.ts`
- regenerate generated protocol clients
- remove old schemas from generated metadata
- update protocol tests

Done when old RPC/event names are absent from generated protocol output.

### Package 2: Gateway Methods

- split client-owned methods into `talk-client.ts`
- keep session-owned methods in `talk-session.ts`
- keep catalog/config/speak/mode in `talk.ts`
- classify every new method in method scopes
- advertise only `talk.event` in hello event features
- remove old method list entries
- update authorization tests

Done when every public Talk method has an explicit scope.

### Package 3: Session Runtime

- add `src/talk` primitives
- move event sequencing into shared runtime
- move stale-turn rejection into shared runtime
- move active output state into shared runtime
- move cancellation bookkeeping into shared runtime
- expose small test helpers

Done when relay, transcription, handoff, telephony, and meetings do not each
invent event and turn bookkeeping.
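
A shared runtime for event sequencing and stale-turn rejection might look like the sketch below. It is illustrative only: `TalkSessionRuntime`, the envelope fields, and the `turn.started` / `turn.ended` event names are assumptions, not the shapes `src/talk` will actually export.

```ts
// Hypothetical sketch: class, envelope fields, and event names are illustrative.
type TalkEventEnvelope = {
  seq: number;
  sessionId: string;
  turnId: string | undefined;
  type: string;
};

class TalkSessionRuntime {
  readonly events: TalkEventEnvelope[] = [];
  private readonly sessionId: string;
  private seq = 0;
  private turnCounter = 0;
  private activeTurnId: string | undefined = undefined;

  constructor(sessionId: string) {
    this.sessionId = sessionId;
  }

  /** Every event gets the next sequence number for this session. */
  private emit(type: string, turnId: string | undefined): void {
    this.seq += 1;
    this.events.push({ seq: this.seq, sessionId: this.sessionId, turnId, type });
  }

  /** Start a new turn; any previously active turn id becomes stale. */
  startTurn(): string {
    this.turnCounter += 1;
    const turnId = `${this.sessionId}:turn-${this.turnCounter}`;
    this.activeTurnId = turnId;
    this.emit("turn.started", turnId);
    return turnId;
  }

  /** Return the active turn, starting one if none exists. */
  ensureTurn(): string {
    return this.activeTurnId ?? this.startTurn();
  }

  /** Reject stale turn ids before clearing the active turn. */
  endTurn(turnId: string): boolean {
    if (turnId !== this.activeTurnId) return false;
    this.activeTurnId = undefined;
    this.emit("turn.ended", turnId);
    return true;
  }
}
```

Because turn ids embed the session id, two sessions cannot collide on turn ids, and adapters only decide when to call the runtime rather than keeping their own counters.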

### Package 4: Browser UI

- update realtime startup to `talk.client.create`
- update realtime tool consult to `talk.client.toolCall`
- update relay startup to `talk.session.create`
- update relay audio to `talk.session.appendAudio`
- update relay tool result to `talk.session.submitToolResult`
- update relay output cancel to `talk.session.cancelOutput`
- update relay close to `talk.session.close`
- listen only to `talk.event`
- remove relay mark RPC

Done when UI tests prove no removed RPC names remain.

### Package 5: Native And Nodes

- route native Talk through session events
- map push-to-talk commands to managed-room turn lifecycle
- clean capture state on failed start
- keep local STT/TTS as adapter behavior
- remove chat-history polling from the success path
- keep fallback polling only if explicitly needed

Done when native voice success path is event-driven.

### Package 6: Voice Call

- map telephony realtime events into `talk.event`
- map local speech detection to `startTurn`, `cancelOutput`, and `cancelTurn`
- pass abort through agent consult and tools
- keep marks, clear, u-law, and call lifecycle in the plugin
- add tests for early speech before provider speech-started

Done when Voice Call shares event and cancellation semantics without leaking
telephony into core.

### Package 7: Meetings

- map meeting speech and transcript state into `talk.event`
- keep participant and room state in meeting adapter
- add echo-suppression aware barge-in tests
- ensure meeting adapters can choose realtime, transcription, or `stt-tts`

Done when meeting behavior is an adapter over Talk, not a parallel realtime loop.

### Package 8: Doctor And Migration

- detect old realtime selectors outside `talk.realtime`
- write explicit `talk.realtime.provider`, `model`, `voice`, `transport`, and `brain`
- report removed RPC names when logs show old clients
- keep startup free of hidden config rewrites
- update SDK migration, Gateway protocol, Talk node, Control UI, and TTS docs

Done when runtime config is explicit and docs mention removed API only in
migration notes.

## Deletion Checklist

Delete or prove absent:

- `src/gateway/voiceclaw-realtime/`
- `/voiceclaw/realtime`
- `instructionsOverride`
- `talk.realtime.*` public RPCs
- `talk.transcription.*` public RPCs
- `talk.handoff.*` public RPCs
- `talk.session.inputAudio`
- `talk.session.control`
- `talk.session.toolResult`
- `talk.realtime.relay`
- `talk.transcription.relay`
- old generated protocol models
- old UI relay method calls

Keep only these old names in explicit migration tables.
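
One way to "prove absent" is a small check over the list of RPC and event names that the generated protocol exposes. The sketch below is illustrative: `findRemovedNames` and its constants are hypothetical helpers, not part of the repository. It is meant for RPC and event names only, since `talk.realtime` remains a config namespace (see Package 8) and would otherwise produce false positives.

```ts
// Hypothetical sketch: helper and constant names are assumptions.
// Prefixes cover the removed public RPC families and old relay event channels.
const REMOVED_PREFIXES = ["talk.realtime.", "talk.transcription.", "talk.handoff."];

// Exact names removed from the talk.session.* family.
const REMOVED_EXACT = [
  "talk.session.inputAudio",
  "talk.session.control",
  "talk.session.toolResult",
];

/** Return every name in the list that the deletion checklist says must be gone. */
function findRemovedNames(names: readonly string[]): string[] {
  return names.filter(
    (name) =>
      REMOVED_EXACT.includes(name) ||
      REMOVED_PREFIXES.some((prefix) => name.startsWith(prefix)),
  );
}
```

A protocol test could feed this the generated method list and assert that the result is empty.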

## Test Matrix

Protocol:

- final methods exist in protocol schemas
- removed methods are absent from protocol schemas
- final event is advertised in hello features
- removed events are absent from broadcast guards
- generated clients match schema
- request-time instruction override is rejected or impossible by schema

Gateway:

- `talk.client.create` creates WebRTC session result
- `talk.client.create` creates provider WebSocket session result
- `talk.client.create` rejects Gateway-owned transports
- `talk.client.toolCall` validates caller, session, brain, and policy
- `talk.session.create` creates realtime Gateway relay
- `talk.session.create` creates transcription relay
- `talk.session.create` creates STT/TTS managed room
- `talk.session.create` rejects client-owned transports
- `talk.session.join` replacement notifies displaced client
- `talk.session.appendAudio` routes to relay/transcription session
- `talk.session.startTurn` starts managed-room turn
- `talk.session.endTurn` rejects stale turn ids
- `talk.session.cancelTurn` aborts provider, agent, tools, TTS, and streams
- `talk.session.cancelOutput` cancels playback only
- `talk.session.submitToolResult` binds to provider call id
- `talk.session.close` emits terminal event and releases resources

Browser:

- WebRTC path calls `talk.client.create`
- provider WebSocket path calls `talk.client.create`
- provider tool calls use `talk.client.toolCall`
- Gateway relay uses only `talk.session.*`
- Gateway relay listens only to `talk.event`
- barge-in requires speech/VAD
- relay close rejects or aborts pending consult runs
- no removed RPC names in UI tests

Native:

- push-to-talk start emits capture/turn events
- failed push-to-talk start cleans capture state
- cancel clears capture and output state
- STT/TTS success path is event-driven
- fallback polling is explicit and tested if kept
- node policy rejects untrusted Talk commands

Telephony:

- early speech before provider speech-started creates or guards turn before cancellation
- marks and clear events map to output state
- u-law codec stays adapter-owned
- cancellation aborts consult run
- closed call prevents stale tool result submission

Meetings:

- participant context appears as metadata, not core branching
- echo suppression prevents false barge-in
- transcript events use common envelope
- meeting close aborts active work

Architecture:

- no removed public RPC names in protocol metadata
- no retired realtime endpoint route
- no retired realtime folder
- no request-time instruction override field
- no core branches on app platform names
- provider behavior comes from capabilities

## Verification Commands

Focused local loop:

```sh
pnpm test src/gateway/protocol/index.test.ts
pnpm test src/gateway/server-methods/talk.test.ts
pnpm test src/gateway/method-scopes.test.ts src/gateway/server-methods-list.test.ts
pnpm test src/gateway/talk-realtime-relay.test.ts src/gateway/talk-transcription-relay.test.ts
pnpm test ui/src/ui/realtime-talk.test.ts ui/src/ui/realtime-talk-gateway-relay.test.ts ui/src/ui/realtime-talk-webrtc.test.ts ui/src/ui/realtime-talk-google-live.test.ts
pnpm exec oxfmt --check --threads=1 docs/refactor/talk.md docs/refactor/talk-execution.md
```

Generation and docs:

```sh
pnpm protocol:gen && pnpm protocol:gen:swift
pnpm docs:check-mdx
pnpm plugin-sdk:api:check
```

Broad gate before push:

```sh
pnpm check:changed
```

Use Testbox for broad gates on maintainer machines.

128
docs/refactor/talk-surfaces.md
Normal file
@@ -0,0 +1,128 @@
---
summary: "Surface adapter plan for browser, native, walkie-talkie, telephony, and meeting Talk refactor work"
read_when:
  - Updating browser realtime Talk, native Talk, walkie-talkie handoff, Voice Call, or meeting voice code
  - Deciding whether a Talk behavior belongs in an adapter or shared runtime
title: "Talk surface mapping"
---

# Talk Surface Mapping

This maps product surfaces into [Talk refactor plan](/refactor/talk) primitives.

## Browser

WebRTC:

- call `talk.client.create`
- open provider media connection in browser
- forward provider tool calls through `talk.client.toolCall`
- receive provider audio through provider media/data channel

Provider WebSocket:

- call `talk.client.create`
- connect using constrained provider result
- keep provider-specific framing in the browser adapter
- forward tool calls through `talk.client.toolCall`

Gateway relay:

- call `talk.session.create`
- send PCM frames with `talk.session.appendAudio`
- listen only to `talk.event`
- submit tool results with `talk.session.submitToolResult`
- barge-in with `talk.session.cancelOutput`
- close with `talk.session.close`

## Native And Nodes

Native apps map local audio lifecycle into Talk primitives.

Native realtime:

- use `talk.client.create` when the app owns provider media
- use `talk.session.create` when Gateway owns provider relay

Native STT/TTS:

- use `talk.session.create({ mode: "stt-tts", transport: "managed-room" })`
- keep local STT and local TTS behind native adapters
- drive success path from Talk events
- keep history polling only as a degraded fallback if explicitly tested

Native push-to-talk:

- press maps to `talk.session.startTurn`
- release maps to `talk.session.endTurn`
- cancel maps to `talk.session.cancelTurn`
- node capture commands emit capture events
- failed start cleans capture state
- opening voice UI never mutates global Talk config

Trusted node command adapters may remain:

```ts
talk.ptt.start;
talk.ptt.stop;
talk.ptt.cancel;
talk.ptt.once;
```
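
A node command adapter could translate push-to-talk commands into turn verbs roughly as follows. This is a sketch under assumptions: pairing `talk.ptt.start` with press, `talk.ptt.stop` with release, and `talk.ptt.cancel` with cancel is inferred from the names, the `params` shape is hypothetical, and `talk.ptt.once` is left out because its mapping is not stated here.

```ts
// Hypothetical adapter sketch: the pairing and the params shape are assumptions.
type PttCommand = "talk.ptt.start" | "talk.ptt.stop" | "talk.ptt.cancel";
type TurnMethod =
  | "talk.session.startTurn"
  | "talk.session.endTurn"
  | "talk.session.cancelTurn";

const PTT_TO_TURN: Record<PttCommand, TurnMethod> = {
  "talk.ptt.start": "talk.session.startTurn", // press
  "talk.ptt.stop": "talk.session.endTurn", // release
  "talk.ptt.cancel": "talk.session.cancelTurn", // cancel
};

/** Build the Gateway request a node adapter would send for a PTT command. */
function toTurnRequest(command: PttCommand, sessionId: string) {
  return { method: PTT_TO_TURN[command], params: { sessionId } };
}
```

Keeping the mapping in a table means the adapter holds no turn state of its own; turn ids and stale-turn checks stay in the shared runtime.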

## Walkie-Talkie

Walkie-talkie is managed-room Talk:

```ts
await gateway.request("talk.session.create", {
  mode: "stt-tts",
  transport: "managed-room",
  brain: "agent-consult",
  sessionKey,
});
```

Then:

- client joins with `talk.session.join`
- press calls `talk.session.startTurn`
- release calls `talk.session.endTurn`
- cancel calls `talk.session.cancelTurn`
- assistant speech emits `output.text.*` and `output.audio.*`
- replacement emits `session.replaced` to old owner
- close calls `talk.session.close`

Room state includes canonical session id, route/channel target, caller identity,
mode, transport, brain, provider, model, voice, locale, expiry, token hash,
active client id, active turn id, and replacement state.

Two simultaneous rooms must not share turn ids, transcripts, audio output, or
cancellation tokens.

## Telephony

Voice Call becomes a telephony adapter over Talk semantics.

Keep telephony-owned: Twilio/Plivo WebSocket contracts, stream ids, call ids,
G.711 u-law, marks, clear events, backpressure, phone call lifecycle, and inbound
speech detection quirks.

Move shared behavior to Talk: event envelope, turn ids, cancellation, agent
consult abort, tool policy, usage and latency metrics, and output state.

Telephony should emit `talk.event` for observability, even if phone media
remains plugin-owned.

## Meetings

Google Meet and future meeting integrations become meeting adapters over Talk
semantics.

Keep meeting-owned: meeting join/leave, participant identity, room permissions,
echo suppression, transcript context, and meeting-specific mute/deafen behavior.

Move shared behavior to Talk: turn lifecycle, transcript events, assistant output
events, tool policy, cancellation, and metrics.

Meeting adapters may run `transcription`, `stt-tts`, or `realtime` depending on
provider support.
@@ -1,55 +1,68 @@
---
summary: "Grand unification plan for Talk mode, realtime voice, voice-call, Google Meet, and VoiceClaw realtime"
summary: "Breaking refactor plan for one Talk architecture across realtime voice, STT/TTS, browser, native, telephony, meetings, and walkie-talkie handoff"
read_when:
  - Refactoring Talk mode, realtime voice, voice-call, Google Meet, or VoiceClaw realtime
  - Changing Talk protocol, provider contracts, browser realtime, or native voice behavior
  - Deciding whether a voice feature belongs in core, a provider plugin, or a surface adapter
title: "Talk unification plan"
  - Refactoring Talk mode, realtime voice, voice-call, Google Meet, browser realtime voice, native push-to-talk, STT, or TTS
  - Changing Talk Gateway protocol, provider contracts, realtime transports, managed rooms, audio events, cancellation, or tool policy
  - Deciding whether a voice feature belongs in core, a provider plugin, a native app, a meeting adapter, or a telephony adapter
title: "Talk refactor plan"
---

# Talk Unification Plan
# Talk Refactor Plan

OpenClaw has several voice loops that grew from different product surfaces: native Talk mode, browser realtime Talk, Voice Call realtime, Google Meet realtime, streaming STT, TTS reply playback, and `/voiceclaw/realtime`. The goal is not to force all of them into one implementation. The goal is one session contract, one event vocabulary, one policy boundary, and small adapters for each surface.
This is the breaking-clean plan for unifying every live voice path behind one
Talk architecture.

Core should know conversation modes, byte transports, audio formats, tool policy, and client capabilities. Core should not know platform product names such as iOS, Android, or macOS except as optional telemetry emitted by an edge client.
The old architecture grew by product surface: browser realtime, Gateway relay,
managed native handoff, streaming transcription, Voice Call, Google Meet, local
STT/TTS, one-shot TTS, and a retired realtime WebSocket endpoint each learned
their own names for sessions, turns, capture, output, barge-in, tool calls,
cancellation, and transcript events.

## Goals
The new architecture grows by primitive. There is one public Talk API, one
event envelope, one turn model, one cancellation contract, one provider policy
boundary, and one place for shared runtime state. Browser, native, telephony,
meetings, and walkie-talkie become adapters over those primitives.

- Make browser Talk, native Talk, telephony, meetings, and VoiceClaw realtime share the same session semantics.
- Keep provider-specific realtime behavior in provider plugins.
- Keep telephony and meeting quirks in their owning plugins.
- Move browser realtime agent consult out of browser-owned `chat.send`.
- Keep existing public entry points only as migration adapters while the runtime converges.
- Keep local STT/TTS as a first-class fallback, not a deprecated path.
- Support a first-party walkie-talkie client that can hand off an existing OpenClaw session into voice without becoming a separate assistant.
- Make event logs, latency, usage, tool calls, cancellation, and interruption observable in the same shape everywhere.
## Product Target

## Non Goals
OpenClaw supports three Talk products:

- Do not make core branch on app platforms.
- Do not move OpenAI, Google, Twilio, or meeting-specific behavior into core.
- Do not merge one-shot inbound audio attachments with live Talk sessions beyond sharing STT provider contracts where useful.
- Do not remove `/voiceclaw/realtime` or existing Talk RPC entry points during the first migration; they may reject retired fields instead of preserving every old request shape.
- Do not allow request-time instruction overrides for realtime sessions.
- Do not copy VoiceClaw names or request fields into shared APIs; preserve the realtime runtime capabilities through the shared Talk contract, except request-time instruction overrides.
| Product | User experience | Mode |
| --------------------- | ----------------------------------------------------------------------- | --------------- |
| Realtime conversation | Low-latency duplex speech with interruption and provider tool calls | `realtime` |
| Walkie-talkie | Press or hold to speak, release, then hear OpenClaw answer | `stt-tts` |
| Transcription | Live captions, dictation, notes, meeting transcript, no assistant audio | `transcription` |

## Current Surfaces
All three products share session identity, join/reconnect state, turn and
capture ids, input audio metadata, output text/audio state, transcript finality,
tool-call correlation, cancellation, replay, provider capabilities, policy,
auth, and observability.

| Surface | Current shape | Keep | Refactor target |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| Browser Talk | `talk.realtime.session` returns WebRTC, provider WebSocket, or Gateway relay. Tool calls go through `talk.realtime.toolCall`. | Browser audio capture/playback and WebRTC data-channel handling. | Keep browser media ownership while Gateway owns realtime tool policy. |
| Native Talk | Local STT, Gateway `chat.send`, response event or `chat.history` polling, then local or Gateway TTS. | Local STT/TTS fallback and native audio controls. | Event-driven success path with shared Talk events. |
| Voice Call realtime | Telephony WebSocket with G.711 u-law, marks, interruption, and realtime voice bridge. | Telephony adapter ownership. | Adapter over shared Talk session contract. |
| Voice Call streaming STT | Telephony stream through realtime transcription provider, then TTS playback. | STT/TTS pipeline mode. | Explicit `stt-tts` mode adapter. |
| Google Meet realtime | Meeting participant context, echo suppression, realtime provider bridge, fast context. | Meeting adapter ownership. | Adapter over shared Talk session contract and metrics. |
| VoiceClaw realtime | Separate WebSocket endpoint with Gemini Live, direct tools, audio/video frames, interruption, cancellation, session rotation/resumption, and metrics. | Migration endpoint; realtime runtime primitives except overrides. | Shared Talk contract; server-owned instructions; no request-time override. |
| TTS | `talk.speak` and provider TTS config. | Speech provider abstraction. | Cleanly separated from realtime provider config. |
| STT | Batch audio and streaming transcription providers. | Provider contracts. | Streaming STT is an input strategy for `stt-tts`; batch voice notes stay outside live Talk. |
| Walkie-talkie handoff | Prototype pattern: existing session, phone capture, push-to-talk turn, STT, agent turn, TTS playback, and transcript mirror. | One-button voice handoff UX and long-form PTT. | Gateway-backed handoff room using shared Talk events, provider catalogs, and existing session delivery. |
One-shot uploaded audio and one-shot TTS do not need live Talk session state
unless they participate in live capture, turns, interruption, replay, or
cancellation.

## Core Model
## Hard Decisions

Separate the dimensions. Mode is how the conversation runs. Transport is how bytes move. Brain is who handles tools and agent reasoning. Surface is edge-owned and should not drive core branching.
This refactor intentionally removes compatibility that would keep the design
muddy:

- remove public `talk.realtime.*` RPCs
- remove public `talk.transcription.*` RPCs
- remove public `talk.handoff.*` RPCs
- remove generic `talk.session.inputAudio`, `talk.session.control`, and
  `talk.session.toolResult`
- remove old relay event channels
- remove `/voiceclaw/realtime`
- remove `src/gateway/voiceclaw-realtime/`
- remove request-time instruction overrides
- keep `talk.speak` as one-shot TTS, not a live session API
- keep legacy realtime config repair in doctor, not startup
- keep platform and product names out of core branching

## Vocabulary

Keep mode, transport, brain, and surface separate.

```ts
type TalkMode = "realtime" | "stt-tts" | "transcription";
@@ -61,366 +74,426 @@ type TalkBrain = "agent-consult" | "direct-tools" | "none";

### Modes

`realtime` is a provider-native live session. Audio goes in, audio comes out, interruptions and tool calls happen inside one low-latency session. OpenAI Realtime and Google Live fit here. WebRTC and provider WebSockets are transports for this mode, not separate modes.
`realtime` means a provider owns a live voice session. Audio goes in, audio
comes out, interruptions are possible, and provider tool calls may happen during
one provider session.

`stt-tts` is the classic pipeline: speech-to-text, agent text turn, text-to-speech. It is higher latency, but it works with local native speech, streaming STT providers, low-cost fallback providers, offline-ish native paths, and providers that do not support realtime voice.
`stt-tts` means input speech is transcribed, OpenClaw answers as text, and TTS
renders the answer. This is the native Talk and walkie-talkie path when a full
duplex provider session is not the right shape.

`transcription` is speech-to-text without an assistant speech response. It covers dictation, captions, meeting transcript capture, and voice-note style ingestion when the live session layer is useful. Gateway-owned transcription relay sessions use `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`. One-shot batch audio attachments can still use the existing media path without becoming Talk sessions.
`transcription` means speech-to-text without assistant audio output. It covers
captions, dictation, notes, meeting transcript capture, and live voice-note
ingestion.

### Transports

`webrtc` is browser or WebRTC-capable client transport using SDP and media/data channels. It is the best fit for direct OpenAI Realtime browser sessions with ephemeral credentials.
`webrtc` is client-owned SDP/media/data-channel transport. It fits browser-owned
OpenAI Realtime sessions with ephemeral credentials.

`provider-websocket` is a constrained provider WebSocket carrying JSON control messages and PCM audio. It fits Google Live-style browser or server streams where WebRTC is not the provider contract.
`provider-websocket` is client-owned provider JSON and audio framing. It fits
browser-owned Google Live style sessions.

`gateway-relay` keeps the vendor session on the Gateway. Clients send authenticated audio frames to Gateway and receive audio/events back. This is the secure default for providers without browser-safe tokens and for server-owned tool policy.
`gateway-relay` means the Gateway owns the provider connection. The client sends
authenticated audio frames to the Gateway and receives `talk.event` plus audio
output through Gateway-managed relay state.

`managed-room` is a Gateway-owned room/session where one or more clients join a managed Talk handoff. It is the primitive for first-party walkie-talkie clients: Gateway owns rendezvous, expiry, replacement, turn lifecycle events, and provider credentials while the edge client owns capture and playback.
`managed-room` means the Gateway owns a room-like session that clients can join,
replace, and drive with explicit turn verbs. It is the primitive for
walkie-talkie and native handoff.

Telephony, meetings, and native apps are not core transports. They are surface adapters that choose one of the transports above or implement local `stt-tts` before handing text/audio events into the shared session contract.

Canonical transport names are the names above. Legacy browser-session transport names should be normalized at adapter boundaries (`webrtc-sdp` to `webrtc`, `json-pcm-websocket` to `provider-websocket`) so mixed-version clients and external providers keep working. Do not keep the legacy names as a second internal vocabulary. When a versioned creation RPC exists, freeze the old RPC shape and delete the aliases only after the announced compatibility window.
Telephony and meetings are not core transports. They are adapters that map
phone or meeting media into `gateway-relay`, `managed-room`, or `stt-tts` while
keeping call and meeting lifecycle outside core.

### Brain Strategies

`agent-consult` means the realtime model asks Gateway to consult an OpenClaw agent. Gateway applies tool policy, chooses fork or isolated context, runs the agent, and returns a concise result to the realtime provider.
`agent-consult` means provider tool calls or session turns consult an OpenClaw
agent. Gateway owns prompt construction, context selection, authorization, abort
signals, and final result delivery.

`direct-tools` means the realtime provider receives a direct OpenClaw tool declaration and calls Gateway-owned tools. This is the VoiceClaw-style brain and should require owner-level authorization.
`direct-tools` means a trusted first-party surface can call selected OpenClaw
tools directly through Gateway policy. Keep this privileged.

`none` means the session is pure transcription, external orchestration, or client-managed speech without OpenClaw tool access.
`none` means transcription-only, external orchestration, or no OpenClaw tool
access.

## Shared Talk Session Runtime
## Ownership Boundaries

The next cleanup layer is a shared Talk session controller. It should be the only code that owns event sequencing, active turn state, capture state, output audio state, recent event retention, and stale-turn rejection. Surface adapters may decide when to call it, but they should not each reimplement turn bookkeeping.
Core owns generic Talk semantics:

The controller contract should cover:
- mode, transport, brain, codec, and audio descriptors
- session records and session ownership
- turn ids and capture ids
- event envelope, sequencing, replay, and stale-output suppression
- active capture state
- active assistant output state
- replacement and reconnect state
- cancellation propagation
- tool policy and tool-call correlation
- usage, latency, and health events

- `emit(...)` for session, health, usage, latency, and tool events that do not mutate turn state
- `startTurn(...)` and `ensureTurn(...)` for capture, STT, realtime provider, telephony, and meeting adapters
- `endTurn(...)` and `cancelTurn(...)` with stale `turnId` rejection before clearing the active turn
- `startOutputAudio(...)`, `emitOutputAudioDelta(...)`, and `finishOutputAudio(...)` for playback, marks, relay clear, and barge-in
- recent event retention for reconnect, diagnostics, hello/event discovery tests, and native UI replay
- compatibility normalization for legacy transport result names at adapter boundaries
Provider plugins own vendor behavior:

The public API migration is adapter-first. Keep existing RPCs such as `talk.realtime.session`, `talk.realtime.relayAudio`, `talk.transcription.session`, `talk.transcription.relayAudio`, and `talk.handoff.*` while moving their internals onto the shared controller. Gateway-managed sessions expose the common model directly:
- OpenAI Realtime SDP and data-channel details
- Google Live WebSocket framing
- streaming STT provider details
- TTS provider details
- provider auth, model, voice, codec, and resume quirks
- provider capability declarations

Surface adapters own IO and product quirks:

- browser capture and playback
- native audio sessions, local speech engines, and foreground Talk UX
- node command dispatch
- telephony media streams, marks, clear events, u-law, and call lifecycle
- meeting join/leave, participants, echo suppression, and authorization

Core may store optional surface metadata for diagnostics. Core must not branch
on browser, iOS, Android, macOS, Google Meet, Voice Call, or any retired product
name.

## Final Gateway API

The public Gateway surface is deliberately small:

```ts
// Discovery and configuration.
talk.catalog;
talk.config;

// One-shot speech output.
talk.speak;

// Client-owned provider sessions.
talk.client.create;
talk.client.toolCall;

// Gateway-owned live sessions.
talk.session.create;
talk.session.inputAudio;
talk.session.control;
talk.session.toolResult;
talk.session.join;
talk.session.appendAudio;
talk.session.startTurn;
talk.session.endTurn;
talk.session.cancelTurn;
talk.session.cancelOutput;
talk.session.submitToolResult;
talk.session.close;

// Events and foreground node mode.
talk.event;
talk.mode;
```

The old RPCs stay as compatibility adapters while new clients use `talk.session.*` for gateway-relay realtime, gateway-relay transcription, and managed-room native STT/TTS sessions. Browser-owned WebRTC/provider-websocket sessions remain on `talk.realtime.session` because the browser owns provider negotiation and media transport there. The internal controller must be provider-agnostic and platform-agnostic: provider plugins own vendor sessions, voice-call owns telephony, Google Meet owns meeting details, and browser/native clients own capture and playback UX.
Use `talk.client.*` when the client owns provider media transport. Use
`talk.session.*` when the Gateway owns live session state.

## VoiceClaw Runtime Scope
`talk.mode` is the existing foreground node mode broadcast. It can stay, but it
is not part of the Talk session control API.
|
||||
|
||||
VoiceClaw is an adapter target, not a feature template for the unified runtime. We do not need every VoiceClaw product or API feature. We do want the useful realtime runtime primitives: live provider sessions, audio and optional video frames, interruption, cancellation, session lifecycle, rotation/resumption, metrics, latency reporting, and direct tools when explicitly authorized. Those should arrive as shared Talk primitives instead of VoiceClaw-only knobs.
|
||||
### Supported Creation Matrix
|
||||
|
||||
The deliberate feature removal is request-time instruction override. Unified Talk instructions must be server-owned. If a capability depends on provider support, owner-scoped auth, or the selected brain strategy, the adapter should gate it through shared Talk capability metadata rather than deleting it. Do not preserve `instructionsOverride`; it is intentionally outside the unified Talk contract. Everything else in the existing realtime runtime is presumed in scope unless a later implementation review proves that it is dead, unsafe, or impossible to express as a shared Talk primitive.
|
||||
| Method                | Mode            | Transport            | Brain           | Owner   |
| --------------------- | --------------- | -------------------- | --------------- | ------- |
| `talk.client.create`  | `realtime`      | `webrtc`             | `agent-consult` | client  |
| `talk.client.create`  | `realtime`      | `provider-websocket` | `agent-consult` | client  |
| `talk.session.create` | `realtime`      | `gateway-relay`      | `agent-consult` | Gateway |
| `talk.session.create` | `transcription` | `gateway-relay`      | `none`          | Gateway |
| `talk.session.create` | `stt-tts`       | `managed-room`       | `agent-consult` | Gateway |
| `talk.session.create` | `stt-tts`       | `managed-room`       | `direct-tools`  | Gateway |
|
||||
|
||||
Keep:
|
||||
Reject combinations that blur ownership. `talk.client.create` must reject
|
||||
Gateway-owned transports. `talk.session.create` must reject client-owned
|
||||
transports.
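The ownership rule above can be read as a plain lookup over the rows of the Supported Creation Matrix. The following is a minimal sketch under that reading; `CreationRequest`, `SUPPORTED_CREATIONS`, and `isSupportedCreation` are hypothetical names used only for illustration and are not part of the documented API.

```typescript
type TalkMode = "realtime" | "transcription" | "stt-tts";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";
type CreateMethod = "talk.client.create" | "talk.session.create";

// Hypothetical request shape, used only for this illustration.
interface CreationRequest {
  method: CreateMethod;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
}

// One entry per row of the Supported Creation Matrix.
const SUPPORTED_CREATIONS: ReadonlySet<string> = new Set([
  "talk.client.create|realtime|webrtc|agent-consult",
  "talk.client.create|realtime|provider-websocket|agent-consult",
  "talk.session.create|realtime|gateway-relay|agent-consult",
  "talk.session.create|transcription|gateway-relay|none",
  "talk.session.create|stt-tts|managed-room|agent-consult",
  "talk.session.create|stt-tts|managed-room|direct-tools",
]);

// Returns true only for combinations listed in the matrix, so a
// client-owned method with a Gateway-owned transport (or the reverse)
// is rejected without any special-case branches.
function isSupportedCreation(req: CreationRequest): boolean {
  const key = [req.method, req.mode, req.transport, req.brain].join("|");
  return SUPPORTED_CREATIONS.has(key);
}
```

Keeping the matrix as data rather than nested conditionals means a new row is a one-line change and the rejection path stays the default.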
|
||||
|
||||
- `/voiceclaw/realtime` endpoint shape during migration
|
||||
- existing auth expectations where they remain owner-scoped
|
||||
- Gemini Live provider bridge
|
||||
- audio input and output frames
|
||||
- video frames when the selected provider supports them
|
||||
- interruption and response cancellation
|
||||
- session rotation and resumption where the provider supports them
|
||||
- metrics and latency reporting
|
||||
- direct tool calls behind the explicit `direct-tools` brain
|
||||
## Removed API
|
||||
|
||||
Do not keep:
|
||||
Remove these names from handlers, method lists, scopes, protocol schemas,
|
||||
generated clients, broadcast guards, tests, and docs except explicit migration
|
||||
tables:
|
||||
|
||||
- request-time `instructionsOverride`
|
||||
- VoiceClaw-only request fields that duplicate server-owned instructions, tool policy, provider selection, or session policy
|
||||
- VoiceClaw-specific configuration names in new shared Talk APIs
|
||||
| Removed                         | Replacement                                              |
| ------------------------------- | -------------------------------------------------------- |
| `talk.realtime.session`         | `talk.client.create`                                     |
| `talk.realtime.toolCall`        | `talk.client.toolCall`                                   |
| `talk.realtime.relayAudio`      | `talk.session.appendAudio`                               |
| `talk.realtime.relayCancel`     | `talk.session.cancelOutput` or `talk.session.cancelTurn` |
| `talk.realtime.relayMark`       | internal relay output state                              |
| `talk.realtime.relayToolResult` | `talk.session.submitToolResult`                          |
| `talk.realtime.relayClose`      | `talk.session.close`                                     |
| `talk.realtime.relay`           | `talk.event`                                             |
| `talk.transcription.session`    | `talk.session.create({ mode: "transcription" })`         |
| `talk.transcription.audio`      | `talk.session.appendAudio`                               |
| `talk.transcription.cancel`     | `talk.session.cancelTurn`                                |
| `talk.transcription.close`      | `talk.session.close`                                     |
| `talk.transcription.relay`      | `talk.event`                                             |
| `talk.handoff.create`           | `talk.session.create({ transport: "managed-room" })`     |
| `talk.handoff.join`             | `talk.session.join`                                      |
| `talk.handoff.revoke`           | `talk.session.close`                                     |
| `talk.session.inputAudio`       | `talk.session.appendAudio`                               |
| `talk.session.control`          | explicit turn/output verbs                               |
| `talk.session.toolResult`       | `talk.session.submitToolResult`                          |
|
||||
|
||||
Realtime instruction policy must come from server-side config, agent identity, selected brain strategy, or another owner-controlled policy surface. If a client sends `instructionsOverride`, the compatibility adapter should reject the request rather than silently applying, partially honoring, or translating it. Everything in the Keep list remains in scope and should migrate onto shared Talk primitives.
|
||||
Delete this endpoint:
|
||||
|
||||
Compatibility here means "old entry point can route to the new runtime," not "old clients can keep every old knob forever." `/voiceclaw/realtime` should be allowed to return a clear unsupported-field error for retired request fields, especially `instructionsOverride`, while preserving the runtime behavior that still belongs in Talk.
|
||||
|
||||
## Event Vocabulary
|
||||
|
||||
All Talk sessions should emit a common event stream:
|
||||
|
||||
- `session.started`, `session.ready`, `session.replaced`, `session.closed`, `session.error`
- `turn.started`, `turn.ended`, `turn.cancelled`
- `capture.started`, `capture.stopped`, `capture.cancelled`, `capture.once`
- `input.audio.delta`, `input.audio.committed`
- `transcript.delta`, `transcript.done`
- `output.text.delta`, `output.text.done`
- `output.audio.started`, `output.audio.delta`, `output.audio.done`
- `tool.call`, `tool.progress`, `tool.result`, `tool.error`
- `usage.metrics`
- `latency.metrics`
- `health.changed`
|
||||
|
||||
Adapters may add vendor or surface metadata, but the common event names should be enough for UI, native clients, logs, tests, and metrics.
|
||||
|
||||
Every common event must use the same envelope:
|
||||
|
||||
```ts
type TalkEvent<TPayload = unknown> = {
  id: string;
  type: TalkEventType;
  sessionId: string;
  turnId?: string;
  captureId?: string;
  seq: number;
  timestamp: string;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
  provider?: string;
  final?: boolean;
  callId?: string;
  itemId?: string;
  parentId?: string;
  payload: TPayload;
};
```

```text
/voiceclaw/realtime
```
|
||||
|
||||
`sessionId` is required for every event. `turnId` is required for every event tied to one user/assistant turn. `captureId` is required while push-to-talk capture is active. `seq` is monotonically increasing within a session. `callId`, `itemId`, and `parentId` correlate provider tool calls, realtime response items, TTS jobs, and relay frames. Replay, stale-output suppression, metrics, and tests should rely on these envelope fields rather than vendor-specific payload shapes.
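As an illustration of relying on envelope fields rather than payload shapes, a consumer could drop replayed or out-of-order events using only `sessionId` and `seq`. This is a minimal sketch; `createSeqGuard` and `EnvelopeLike` are hypothetical names, and it assumes `seq` is strictly increasing within a session, which is a stronger reading than the text states.

```typescript
// Only the envelope fields this sketch needs; hypothetical helper type.
interface EnvelopeLike {
  sessionId: string;
  seq: number;
}

// Returns an accept() function that remembers the highest seq seen per
// session and rejects anything at or below it.
function createSeqGuard(): (event: EnvelopeLike) => boolean {
  const lastSeqBySession = new Map<string, number>();
  return (event: EnvelopeLike): boolean => {
    const previous = lastSeqBySession.get(event.sessionId);
    if (previous !== undefined && event.seq <= previous) {
      return false; // duplicate or stale event for this session
    }
    lastSeqBySession.set(event.sessionId, event.seq);
    return true;
  };
}
```

Because the guard is keyed by `sessionId`, two concurrent sessions on one client keep independent sequence counters.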
|
||||
Delete this folder:
|
||||
|
||||
Walkie-talkie clients need one extra timing rule: text-ready is not audio-ready. A client may show transcript text after `output.text.done`, but it should not transition from "thinking" to "speaking" until `output.audio.delta` or an explicit `output.audio.started` event arrives. That keeps hold music, waveform, replay, and barge-in UX honest when the agent turn finishes before TTS is ready.
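The timing rule can be sketched as a small state transition function in which text events never move the client into playback. `PlaybackState` and `nextPlaybackState` are hypothetical names; returning to `idle` on `output.audio.done` and `turn.cancelled` is an assumption made for the example, not something the contract spells out.

```typescript
type PlaybackState = "idle" | "thinking" | "speaking";

// Computes the next UI playback state from a Talk event type.
function nextPlaybackState(state: PlaybackState, eventType: string): PlaybackState {
  switch (eventType) {
    case "output.audio.started":
    case "output.audio.delta":
      // Only audio events start the "speaking" state.
      return "speaking";
    case "output.audio.done":
    case "turn.cancelled":
      // Assumption: playback ends when audio finishes or the turn is cancelled.
      return "idle";
    default:
      // Includes output.text.delta and output.text.done: text-ready is
      // not audio-ready, so the state is left unchanged.
      return state;
  }
}
```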
|
||||
|
||||
## Walkie-Talkie App Primitives
|
||||
|
||||
The app should be buildable from the same primitives, not a parallel voice stack.
|
||||
|
||||
### Session Handoff
|
||||
|
||||
Voice handoff starts from an existing OpenClaw session. The handoff primitive should carry:
|
||||
|
||||
- canonical session id
- optional session key for human-readable thread lookup
- delivery route, such as channel and target
- caller identity and scope
- selected `TalkMode`, `TalkTransport`, and `TalkBrain`
- optional session-scoped provider, model, and voice ids
- expiration, revocation, and replacement policy
|
||||
|
||||
The existing Gateway session APIs and `chat.send`/agent delivery paths already cover the canonical conversation side. First-class Talk handoff RPCs provide the rendezvous primitive: `talk.handoff.create` returns an ephemeral room token or join URL, `talk.handoff.join` validates the later voice join without exposing stored token hashes, `talk.handoff.turnStart`/`turnEnd`/`turnCancel` drive the room turn lifecycle, and `talk.handoff.revoke` invalidates stale or replaced handoffs.
|
||||
|
||||
### Room and Rendezvous
|
||||
|
||||
The room model must allow one device or browser client to host multiple active voice handoffs for different sessions without cross-talk. A deterministic room key is fine for local or development flows, but the product path should prefer Gateway-owned room creation with caller auth, expiry, and revoke semantics.
|
||||
|
||||
The minimum room events are:
|
||||
|
||||
- `session.ready`
- `session.replaced`
- `turn.started`
- `turn.ended`
- `turn.cancelled`
- `session.closed`
- `session.error`
|
||||
|
||||
`managed-room` is public only through handoff clients. Browser `talk.realtime.session` should keep rejecting `managed-room` until the browser owns a real room client instead of treating it as a browser-session result shape.
|
||||
|
||||
### Push-To-Talk
|
||||
|
||||
Push-to-talk is a turn-control primitive, not a platform primitive. It should map to browser capture, native local capture, or node commands:
|
||||
|
||||
- `capture.started`
- `capture.stopped`
- `capture.cancelled`
- `capture.once`
|
||||
|
||||
Native node support has `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` command handlers. The Gateway policy treats them as first-class defaults only for trusted Talk-capable nodes: a node must advertise the `talk` capability or declare `talk.*` command support, and the command must still be present in the paired command snapshot.
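The policy described above might be expressed as a single predicate. This is a sketch under assumptions: `NodeDescriptor` and `isPttCommandAllowed` are hypothetical names, and "declares `talk.*` command support" is interpreted here as declaring at least one command whose name starts with `talk.`.

```typescript
// Hypothetical view of what the Gateway knows about a paired node.
interface NodeDescriptor {
  capabilities: string[];          // advertised capabilities, e.g. "talk"
  declaredCommands: string[];      // commands the node says it supports
  pairedCommandSnapshot: string[]; // commands recorded at pairing time
}

const PTT_COMMANDS: readonly string[] = [
  "talk.ptt.start",
  "talk.ptt.stop",
  "talk.ptt.cancel",
  "talk.ptt.once",
];

// Allows a push-to-talk command only when the node is Talk-capable and
// the command is still present in the paired command snapshot.
function isPttCommandAllowed(node: NodeDescriptor, command: string): boolean {
  if (!PTT_COMMANDS.includes(command)) return false;
  const talkCapable =
    node.capabilities.includes("talk") ||
    node.declaredCommands.some((name) => name.startsWith("talk."));
  return talkCapable && node.pairedCommandSnapshot.includes(command);
}
```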
|
||||
|
||||
### Provider Catalogs and Settings
|
||||
|
||||
Walkie-talkie settings should be per session or per device. The client should request STT, TTS, and realtime catalogs through Gateway, store only provider ids, model ids, voice ids, and locales, and never receive provider API keys or mutate global Talk provider defaults as a side effect of opening the app.
|
||||
|
||||
The catalog contract should describe which combinations are valid:
|
||||
|
||||
- local STT plus local TTS
- streaming STT plus provider TTS
- realtime provider with provider-native output audio
- Gateway relay when browser-safe credentials are not available
- managed room when the Gateway owns the session
|
||||
|
||||
### Canonical Transcript
|
||||
|
||||
The OpenClaw session is the source of truth. A walkie-talkie app may keep a local transcript cache for replay, export, reconnect, or offline UX, but the agent turn and durable transcript should go through the existing session delivery route. Transcript mirroring should be best effort and must not block the voice turn.
|
||||
|
||||
### Connectivity and Backgrounding
|
||||
|
||||
Native apps can use node pairing, `node.invoke`, and platform wake mechanisms when available. Browser or standalone web clients need either Gateway relay, a managed room, or hosted WebRTC signaling with ICE/TURN. Background continuous audio remains platform-limited; the product should promise foreground push-to-talk first and treat background capture as best effort.
|
||||
|
||||
### Cancellation and Replacement
|
||||
|
||||
Every turn should carry a turn token or capture id. Stale STT finals, stale agent replies, and stale TTS output must be ignored after `turn.cancelled` or `session.replaced`. This is required for "tap again to interrupt", reconnect replacement, and multi-session isolation.
|
||||
|
||||
Cancellation must also abort underlying work, not only hide stale output. A cancelled or replaced turn must:
|
||||
|
||||
- cancel provider responses or realtime sessions when the provider supports it
- abort agent consult and tool runtime work through an `AbortSignal`
- prevent newly queued side-effecting tools from starting after cancellation
- let already-started side-effecting tools report cancellation status instead of inventing success
- drain pending TTS jobs and stop audio playback/relay writes
- close or reset relay and managed-room streams tied to the stale turn
- emit one terminal cancellation event with the final abort reason
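A few of the requirements above (abort through an `AbortSignal`, block newly queued tools, emit exactly one terminal event) can be sketched with the standard `AbortController`. `TurnHandle` and `TurnCancelledEvent` are hypothetical names for illustration; provider, TTS, and relay teardown are left out and would subscribe to the same signal.

```typescript
// Hypothetical terminal event shape for this sketch.
type TurnCancelledEvent = { type: "turn.cancelled"; turnId: string; reason: string };

class TurnHandle {
  private readonly controller = new AbortController();
  private cancelled = false;
  private readonly turnId: string;
  private readonly emit: (event: TurnCancelledEvent) => void;

  constructor(turnId: string, emit: (event: TurnCancelledEvent) => void) {
    this.turnId = turnId;
    this.emit = emit;
  }

  // Passed to agent consult and tool runtime work so it can abort.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Queued side-effecting tools check this before starting.
  canStartTool(): boolean {
    return !this.controller.signal.aborted;
  }

  // Idempotent: aborts underlying work once and emits one terminal event.
  cancel(reason: string): void {
    if (this.cancelled) return;
    this.cancelled = true;
    this.controller.abort();
    this.emit({ type: "turn.cancelled", turnId: this.turnId, reason });
  }
}
```

Making `cancel` idempotent is what keeps "tap again to interrupt" and session replacement from producing duplicate terminal events when both fire for the same turn.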
|
||||
|
||||
## Config Direction
|
||||
|
||||
The current public Talk config is speech-provider oriented. Keep it as the speech config and add realtime config beside it. Do not introduce a second `talk.speech` namespace during this refactor.
|
||||
|
||||
```ts
type TalkConfig = {
  provider?: string;
  providers?: Record<string, unknown>;
  realtime?: {
    provider?: string;
    model?: string;
    voice?: string;
    mode?: TalkMode;
    transport?: TalkTransport;
    brain?: TalkBrain;
  };
  input?: {
    interruptOnSpeech?: boolean;
    silenceTimeoutMs?: number;
  };
};
```

```text
src/gateway/voiceclaw-realtime/
```
|
||||
|
||||
Rule: `talk.provider` and `talk.providers.*` continue to mean speech, STT, and TTS provider configuration. Realtime provider selection uses `talk.realtime.provider`, then registered realtime capabilities. Voice Call fallback inference should be deleted once the realtime config exists in schema, docs, forms, and doctor repair.
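The selection order in this rule could look like the sketch below. It is illustrative only: `resolveRealtimeProvider` and `TalkConfigLike` are hypothetical names, the provider ids in any usage are placeholders, and picking the first registered realtime provider is an assumed reading of "then registered realtime capabilities". The point it shows is that `talk.provider` is never consulted for realtime selection.

```typescript
// Only the fields this sketch needs from the Talk config.
interface TalkConfigLike {
  provider?: string; // speech/STT/TTS provider; ignored for realtime
  realtime?: { provider?: string };
}

// Resolves the realtime provider id: explicit talk.realtime.provider
// first, then the first registered realtime-capable provider (assumed
// tie-break), otherwise undefined. No fallback to talk.provider.
function resolveRealtimeProvider(
  config: TalkConfigLike,
  registeredRealtimeProviders: readonly string[],
): string | undefined {
  if (config.realtime?.provider) {
    return config.realtime.provider;
  }
  return registeredRealtimeProviders[0];
}
```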
|
||||
Do not leave a compatibility namespace around retired code.
|
||||
|
||||
## Provider Contracts
|
||||
## Target Source Layout
|
||||
|
||||
Provider plugins should declare capabilities, not force core to infer behavior from ids:
|
||||
Shared runtime:
|
||||
|
||||
```ts
type RealtimeVoiceProviderCapabilities = {
  transports: TalkTransport[];
  inputAudioFormats: AudioFormat[];
  outputAudioFormats: AudioFormat[];
  supportsBrowserSession?: boolean;
  supportsBargeIn?: boolean;
  supportsToolCalls?: boolean;
  supportsVideoFrames?: boolean;
  supportsSessionResumption?: boolean;
};
```

```text
src/talk/
  audio-codec.ts
  agent-consult-runtime.ts
  agent-consult-tool.ts
  agent-talkback-runtime.ts
  fast-context-runtime.ts
  provider-registry.ts
  provider-resolver.ts
  provider-types.ts
  session-log-runtime.ts
  session-runtime.ts
  talk-events.ts
  talk-session-controller.ts
```
|
||||
|
||||
OpenAI owns OpenAI Realtime details. Google owns Gemini Live details, continuation, compression, and session resumption. STT plugins own streaming transcription. TTS plugins own synthesis and telephony-compatible output formats.
|
||||
Gateway adapters:
|
||||
|
||||
## Gateway Policy Boundary
|
||||
```text
src/gateway/server-methods/
  talk.ts          # catalog, config, speak, mode, composition
  talk-client.ts   # client-owned provider sessions
  talk-session.ts  # Gateway-owned live sessions
```
|
||||
|
||||
Browser realtime should not run agent consult by calling `chat.send` directly. The browser may own the media connection when a provider requires it, but Gateway should own the consult/tool policy.
|
||||
Gateway relay helpers can exist while the code moves, but the long-term shape
|
||||
is that relay, transcription, and handoff state use `src/talk` primitives
|
||||
instead of each reimplementing turns and events.
|
||||
|
||||
Target flow for browser-owned provider sessions:
|
||||
Public SDK:
|
||||
|
||||
1. Provider emits a tool call to the browser.
2. Browser forwards the structured tool call to Gateway with the session id.
3. Gateway validates the session, caller, tool policy, brain strategy, and owner permissions.
4. Gateway runs `agent-consult`, `direct-tools`, or rejects the call.
5. Browser submits the provider-specific tool result back to the provider.
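Steps 3 and 4 of the browser-owned flow can be sketched as one decision function on the Gateway side. The names `ToolCallContext`, `ToolCallDecision`, and `decideToolCall` are hypothetical, and the boolean inputs stand in for whatever session, caller, and policy lookups the real Gateway performs.

```typescript
type TalkBrain = "agent-consult" | "direct-tools" | "none";

// Hypothetical pre-computed validation results for one forwarded tool call.
interface ToolCallContext {
  sessionValid: boolean;         // session id is known and not expired
  callerMatchesSession: boolean; // caller is bound to that session
  toolAllowedByPolicy: boolean;  // tool passes the allow policy
  callerIsOwner: boolean;        // owner-scoped auth
  brain: TalkBrain;              // brain strategy selected for the session
}

type ToolCallDecision =
  | { action: "agent-consult" }
  | { action: "direct-tools" }
  | { action: "reject"; reason: string };

function decideToolCall(ctx: ToolCallContext): ToolCallDecision {
  if (!ctx.sessionValid) return { action: "reject", reason: "invalid session" };
  if (!ctx.callerMatchesSession) return { action: "reject", reason: "caller not bound to session" };
  if (!ctx.toolAllowedByPolicy) return { action: "reject", reason: "tool not allowed" };
  switch (ctx.brain) {
    case "agent-consult":
      return { action: "agent-consult" };
    case "direct-tools":
      // direct-tools is owner-only.
      return ctx.callerIsOwner
        ? { action: "direct-tools" }
        : { action: "reject", reason: "direct-tools requires owner" };
    case "none":
      return { action: "reject", reason: "session has no brain" };
  }
}
```

The browser only forwards the call and later relays the result; every branch that can run tools lives behind this Gateway-side check.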
|
||||
```text
src/plugin-sdk/realtime-voice.ts
```
|
||||
|
||||
Target flow for Gateway-owned sessions:
|
||||
Keep this SDK subpath as the stable plugin import facade. It may re-export
|
||||
Talk runtime contracts, but plugin authors should not import core file layout.
|
||||
|
||||
1. Provider emits a tool call to Gateway.
2. Gateway runs policy and tool handling directly.
3. Client only receives status, transcript, audio, and visible tool progress events.
|
||||
## Event Contract
|
||||
|
||||
## Surface Adapters
|
||||
All live paths emit `talk.event` with the envelope defined in
|
||||
[Talk API and runtime contract](/refactor/talk-api-contract). The required
|
||||
shape is: `id`, `type`, `sessionId`, `seq`, `timestamp`, `mode`, `transport`,
|
||||
`brain`, and `payload`, with `turnId`, `captureId`, `callId`, `itemId`, and
|
||||
`parentId` when the event is tied to turn, capture, provider item, tool call, or
|
||||
TTS output.
|
||||
|
||||
Adapters convert surface-specific IO into the shared model.
|
||||
Core event families are `session.*`, `turn.*`, `capture.*`, `input.audio.*`,
|
||||
`transcript.*`, `output.text.*`, `output.audio.*`, `tool.*`, `usage.metrics`,
|
||||
`latency.metrics`, and `health.changed`. Payloads must not duplicate large raw
|
||||
audio frames when the transport already carries them. Text-ready is not
|
||||
audio-ready; clients enter playback state only on audio events.
|
||||
|
||||
Browser adapter handles microphone capture, playback, WebRTC SDP, data channels, provider WebSocket framing, relay RPCs, and provider-specific tool result submission.
|
||||
## Cancellation Contract
|
||||
|
||||
Native adapter handles local STT/TTS, push-to-talk, continuous listening, local interruption, audio session lifecycles, and optional Gateway realtime or managed-room clients. Core sees capabilities such as PCM input support, local TTS fallback, and barge-in support, not platform names.
|
||||
Cancellation must abort underlying work, not only ignore stale output.
|
||||
|
||||
Telephony adapter handles Twilio or Plivo media streams, G.711 u-law, stream ids, marks, clear events, backpressure, call lifecycle, and phone-specific interruption behavior.
|
||||
When a turn or session is cancelled:
|
||||
|
||||
Meeting adapter handles room lifecycle, participant context, echo suppression, meeting transcript context, and meeting-specific authorization.
|
||||
- provider realtime response is cancelled when supported
- provider session is closed or reset when cancellation cannot be scoped
- streaming STT receives abort
- agent consult receives abort
- queued tools do not start after abort
- already-started side-effecting tools receive abort and report cancellation
- pending TTS jobs are drained
- playback sources are stopped
- relay streams are cleared
- managed-room capture and output state reset
- stale finals and stale audio deltas are ignored
- one terminal cancellation event is emitted
|
||||
|
||||
VoiceClaw adapter handles `/voiceclaw/realtime`, auth expectations that remain owner-scoped, Gemini Live compatibility, audio/video frames, interruption, response cancellation, session rotation/resumption, metrics, latency reporting, and the `direct-tools` brain while using common Talk events internally. It must reject request-time `instructionsOverride` and must not introduce VoiceClaw-only policy fields into the shared Talk API.
|
||||
Barge-in requires real speech: provider speech-started, local VAD, or an
|
||||
adapter-owned speech detector. Silence, echo, or microphone buffers alone must
|
||||
not cancel assistant output.
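The barge-in rule can be sketched as a gate that cancels assistant output only on real speech evidence. The signal labels and the `shouldCancelAssistantOutput` function are hypothetical; they name the three speech sources and three non-speech inputs listed in the rule.

```typescript
type SpeechEvidence = "provider-speech-started" | "local-vad" | "adapter-speech-detector";
type NonSpeechInput = "silence" | "echo" | "mic-buffer";

// Returns true only when the assistant is currently producing output and
// the input is one of the recognised speech-evidence sources.
function shouldCancelAssistantOutput(
  input: SpeechEvidence | NonSpeechInput,
  assistantSpeaking: boolean,
): boolean {
  if (!assistantSpeaking) return false; // nothing to interrupt
  return (
    input === "provider-speech-started" ||
    input === "local-vad" ||
    input === "adapter-speech-detector"
  );
}
```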
|
||||
|
||||
## Migration Phases
|
||||
## Config Contract
|
||||
|
||||
### Phase 1: Contracts
|
||||
Config stays under `talk`; do not add `talk.speech`. `talk.provider` and
|
||||
`talk.providers.*` remain speech/STT/TTS provider config. Realtime selectors
|
||||
live under `talk.realtime.provider`, `talk.realtime.providers.*`, `model`,
|
||||
`voice`, `mode`, `transport`, and `brain`.
|
||||
|
||||
- Add shared Talk mode, transport, brain, capabilities, command, and event types.
|
||||
- Add a config resolver that preserves legacy `talk.provider`.
|
||||
- Keep existing `RealtimeVoiceProvider` APIs while introducing capability metadata.
|
||||
- Add handoff, room, capture, provider catalog, cancellation, and replacement event contracts.
|
||||
- Make `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once` explicit safe commands for Talk-capable nodes.
|
||||
- Add protocol tests for no request-time instruction override.
|
||||
`talk.config` returns effective config without secrets unless privileged.
|
||||
`talk.catalog` returns provider capabilities, not inferred provider-id guesses.
|
||||
Doctor migrates old realtime placement into `talk.realtime`; runtime startup
|
||||
does not reinterpret Voice Call, STT, or TTS config as realtime config.
|
||||
|
||||
### Phase 2: Gateway Tool Policy
|
||||
## Surface Mapping
|
||||
|
||||
- Add Gateway RPC for realtime tool calls from browser-owned provider sessions.
|
||||
- Add Gateway RPCs for `talk.handoff.create`, `talk.handoff.join`, `talk.handoff.revoke`, and explicit handoff turn start/end/cancel, with session identity, expiry, revocation, join authorization, and event replay.
|
||||
- Add session-scoped STT, TTS, and realtime provider catalog RPCs.
|
||||
- Keep browser `openclaw_agent_consult` handling on `talk.realtime.toolCall`, not browser-side `chat.send`.
|
||||
- Reuse existing agent consult runtime and tool allow policy.
|
||||
- Add owner-only gate for `direct-tools`.
|
||||
| Surface                         | Talk mapping                                                                                          |
| ------------------------------- | ----------------------------------------------------------------------------------------------------- |
| Browser WebRTC                  | `talk.client.create`, client-owned provider media, `talk.client.toolCall` for provider tool calls     |
| Browser provider WebSocket      | `talk.client.create`, browser-owned provider framing, Gateway-owned credentials and policy            |
| Browser Gateway relay           | `talk.session.create`, `appendAudio`, `submitToolResult`, `cancelOutput`, `close`, and `talk.event`   |
| Native push-to-talk             | `stt-tts` plus `managed-room`; press/startTurn, release/endTurn, cancel/cancelTurn                    |
| Walkie-talkie                   | managed-room join/replacement plus shared turn/output events                                          |
| Voice Call                      | telephony adapter over Talk events; call ids, stream ids, u-law, marks, clear events stay plugin side |
| Google Meet and future meetings | meeting adapter over Talk events; participant state, permissions, mute, and echo suppression stay out |
|
||||
|
||||
### Phase 3: Browser Runtime
|
||||
See [Talk surface mapping](/refactor/talk-surfaces) for the adapter-level
|
||||
rules.
|
||||
|
||||
- Normalize browser WebRTC, provider WebSocket, and relay adapters behind common Talk events.
|
||||
- Keep `managed-room` scoped to handoff clients until the browser has a real room client.
|
||||
- Add a walkie-talkie browser client path over Gateway relay or managed room.
|
||||
- Keep provider credentials on Gateway; browser receives only ephemeral room/session credentials.
|
||||
- Add browser tests proving realtime consult does not call `chat.send`.
|
||||
## Detailed Refactor Phases
|
||||
|
||||
### Phase 4: Native Runtime
|
||||
### Phase 1: Protocol Is The Source Of Truth
|
||||
|
||||
- Make native Talk consume response events in the success path.
|
||||
- Remove normal-path `chat.history` polling and keep history polling only as a degraded fallback if needed.
|
||||
- Preserve local STT and local TTS fallback.
|
||||
- Route native push-to-talk through the shared capture and turn events.
|
||||
- Verify node command policy allows `talk.ptt.*` for trusted Talk-capable native nodes.
|
||||
- Align native emitted state with common Talk events.
|
||||
- define final `talk.client.*`, `talk.session.*`, `talk.event`, `talk.catalog`, `talk.config`, `talk.speak`, and `talk.mode`
|
||||
- delete removed RPCs from method lists and generated metadata
|
||||
- delete removed event channels from hello feature advertising
|
||||
- classify every final method in `METHOD_SCOPE_GROUPS`
|
||||
- regenerate TypeScript and Swift protocol clients
|
||||
- add protocol tests proving removed names are absent
|
||||
|
||||
### Phase 5: VoiceClaw Runtime
|
||||
Exit criteria: generated clients expose only the final public Talk API.
|
||||
|
||||
- Rebase `/voiceclaw/realtime` onto the shared Talk session runtime.
|
||||
- Keep the endpoint as a thin migration adapter and preserve auth expectations only where they map cleanly to the shared Talk contract.
|
||||
- Remove request-time `instructionsOverride`; owner policy must come from server-side config, agent identity, or the selected brain strategy.
|
||||
- Map Gemini Live metrics, latency reporting, rotation, resumption, interruption, cancellation, audio, video, and tool events into the common event stream.
|
||||
- Keep `direct-tools` separate from `agent-consult`.
|
||||
- Do not add VoiceClaw-specific config names, override fields, or client policy knobs to new Talk contracts.
|
||||
### Phase 2: Shared Runtime Becomes `src/talk`
|
||||
|
||||
### Phase 6: Voice Call and Meetings
|
||||
- move provider-agnostic realtime voice modules into `src/talk`
|
||||
- keep the plugin SDK facade at `openclaw/plugin-sdk/realtime-voice`
|
||||
- rename logs and tests from realtime-voice wording to Talk wording where that improves clarity
|
||||
- centralize event sequencing, active turn state, capture state, output state, stale-turn rejection, and replay history
|
||||
- keep provider adapters out of this folder
|
||||
|
||||
- Convert Voice Call realtime into a telephony adapter over shared Talk sessions.
|
||||
- Convert Voice Call streaming STT into explicit `stt-tts`.
|
||||
- Convert Google Meet realtime into a meeting adapter over shared Talk sessions.
|
||||
- Keep telephony marks, u-law, backpressure, participant context, and echo suppression in their owning adapters.
|
||||
Exit criteria: core and bundled surfaces import shared semantics from `src/talk`
|
||||
or the SDK facade, not from surface-local helpers.
|
||||
|
||||
### Phase 7: Docs and Cleanup
|
||||
### Phase 3: Gateway Method Split
|
||||
|
||||
- Update [Talk mode](/nodes/talk), [Control UI](/web/control-ui), [Gateway protocol](/gateway/protocol), [Media overview](/tools/media-overview), [Text-to-speech](/tools/tts), and plugin SDK docs.
|
||||
- Retire duplicate event names after compatibility windows.
|
||||
- Remove browser-side consult-through-chat code after all supported providers use Gateway tool policy.
|
||||
- make `talk.ts` a composition point for catalog, config, speak, mode, client, and session handlers
|
||||
- put client-owned provider session methods in `talk-client.ts`
|
||||
- put Gateway-owned session methods in `talk-session.ts`
|
||||
- make relay, transcription, and managed-room handlers thin adapters over shared runtime primitives
|
||||
- route session replacement notifications to the displaced connection
|
||||
- reject stale turn completion before mutating active room state
|
||||
|
||||
## Test Matrix
|
||||
Exit criteria: public RPC handlers read like API adapters, not separate Talk
|
||||
implementations.
|
||||
|
||||
- WebRTC plus `agent-consult`.
|
||||
- Provider WebSocket plus `agent-consult`.
|
||||
- Gateway relay plus `agent-consult`.
|
||||
- Public clients updated to canonical transport names, or a versioned RPC proves old result names stay isolated until deletion.
|
||||
- VoiceClaw compatibility plus `direct-tools`, without request-time `instructionsOverride`.
|
||||
- Telephony WebSocket with marks, clear, interruption, and u-law.
|
||||
- Meeting adapter with participant context and echo suppression.
|
||||
- Native `stt-tts` with no `chat.history` polling in the normal success path.
|
||||
- Transcription-only Gateway relay session with partial/final transcript Talk events and no assistant brain.
|
||||
- TTS-only `talk.speak`.
|
||||
- Walkie-talkie handoff from an existing session into a voice room.
|
||||
- Two simultaneous walkie-talkie handoffs for the same host but different sessions with no transcript, audio, or turn-token cross-talk.
|
||||
- Push-to-talk start, stop, cancel, and once through `node.invoke` on a trusted talk-capable node.
|
||||
- Text-ready before TTS-ready, proving the client does not enter playback until audio starts.
|
||||
- Session-scoped provider catalog selection that does not mutate global Talk config.
|
||||
- Cancellation aborts provider work, agent consult, queued tools, TTS, and relay/room streams.
|
||||
- Security checks for no instruction override, no browser standard API keys, owner-only direct tools, and session-scoped tool calls.
|
||||
### Phase 4: Browser UI Uses The Final API
|
||||
|
||||
## End State
|
||||
- update WebRTC and provider WebSocket startup to `talk.client.create`
|
||||
- update browser provider tool calls to `talk.client.toolCall`
|
||||
- update Gateway relay startup to `talk.session.create`
|
||||
- update relay audio to `talk.session.appendAudio`
|
||||
- update relay tool result submission to `talk.session.submitToolResult`
|
||||
- update relay close to `talk.session.close`
|
||||
- listen only to `talk.event`
|
||||
- handle aborted consult runs immediately instead of timing out
|
||||
- gate relay barge-in on speech or VAD
|
||||
|
||||
OpenClaw has one Talk architecture with three execution modes, four core transports, explicit brain strategies, provider-owned vendor logic, Gateway-owned tool policy, and adapters for browser, native, telephony, meetings, and VoiceClaw compatibility. Users get better Talk mode. Maintainers get one place to reason about sessions, events, policy, metrics, and tests.
|
||||
Exit criteria: UI tests contain no calls to removed Talk RPC names.
|
||||
|
||||
### Phase 5: Native And Nodes Become Event-Driven

- map native push-to-talk into managed-room sessions
- start, end, cancel, and replace turns through explicit session verbs
- clean capture state when push-to-talk start fails
- keep local STT and TTS as native adapter behavior
- remove chat-history polling from the success path
- keep fallback polling only if there is an explicit degraded-mode test

Exit criteria: native Talk success path is driven by `talk.event`, not hidden
chat side effects.

### Phase 6: Telephony And Meetings Become Adapters

- map Voice Call realtime and streaming STT into Talk event/cancellation semantics
- create or guard a turn before early speech cancellation events
- keep telephony codec, marks, clear events, and call lifecycle outside core
- map Google Meet transcript and assistant output into `talk.event`
- keep participant and echo-suppression behavior in the meeting adapter
- pass abort signals into agent consult and tool runtime

Exit criteria: Voice Call and meetings share event and cancellation semantics
without introducing telephony or meeting branches in core.
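
The abort-signal requirement in this phase can be illustrated with a minimal TypeScript sketch. It assumes one `AbortController` per turn whose signal is shared by every unit of work; `startTurn`, `TurnWork`, `Turn`, and the work labels are hypothetical names chosen for illustration and are not OpenClaw APIs.

```typescript
// Hypothetical sketch: every name below is illustrative, not an OpenClaw API.
type TurnWork = { name: string; cancelled: boolean };

type Turn = {
  work: TurnWork[];
  signal: AbortSignal;
  cancel: () => void;
};

function startTurn(workNames: string[]): Turn {
  // One controller per turn: every unit of work observes the same signal,
  // so a single cancel reaches provider work, consults, tools, and TTS alike.
  const controller = new AbortController();
  const work = workNames.map((name) => {
    const item: TurnWork = { name, cancelled: false };
    controller.signal.addEventListener(
      "abort",
      () => {
        // Real work would stop a provider stream, tool run, or TTS job here.
        item.cancelled = true;
      },
      { once: true },
    );
    return item;
  });
  return { work, signal: controller.signal, cancel: () => controller.abort() };
}

// Usage: cancelling the turn marks every registered unit of work as aborted.
const turn = startTurn(["provider", "agent-consult", "queued-tools", "tts", "relay-stream"]);
turn.cancel();
console.log(turn.work.map((w) => `${w.name}: ${w.cancelled ? "aborted" : "running"}`).join("\n"));
```

Sharing one signal keeps cancellation logic out of the adapters: a telephony or meeting adapter only needs to call `cancel`, and the work that received the signal is responsible for stopping itself.
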

### Phase 7: Config And Doctor Cleanup

- keep `talk.provider` and `talk.providers.*` as speech/STT/TTS config
- keep realtime voice selectors under `talk.realtime`
- make `talk.config` return only resolved effective provider data
- repair legacy realtime placement in doctor
- document that runtime startup does not guess or rewrite config
- update SDK migration, Gateway protocol, Talk node, Control UI, and TTS docs

Exit criteria: no second speech namespace, no startup migrations, and no
ambiguous active provider in `talk.config`.
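
One way to read "resolved effective provider data" is sketched below in TypeScript. Only the `talk.realtime.provider` and `talk.realtime.providers.<id>.apiKey` keys come from the docs in this change; the `RealtimeConfig` and `ResolvedProvider` shapes, the `resolveRealtimeProvider` function, and the choice to report `hasApiKey` instead of the key itself are assumptions made for illustration.

```typescript
// Hypothetical shapes: only `provider` and `providers.<id>.apiKey` appear in the docs.
type RealtimeConfig = {
  provider?: string;
  providers?: Record<string, { apiKey?: string }>;
};

// Assumed result shape: exactly one provider, and no secret material.
type ResolvedProvider = { provider: string; hasApiKey: boolean };

function resolveRealtimeProvider(realtime: RealtimeConfig): ResolvedProvider {
  const id = realtime.provider;
  if (!id) {
    // No guessing at startup: a missing selector is an error, not a fallback.
    throw new Error("talk.realtime.provider is not set");
  }
  const entry = realtime.providers?.[id];
  if (!entry) {
    throw new Error(`talk.realtime.providers.${id} is missing`);
  }
  // Entries for providers that were not selected are deliberately dropped.
  return { provider: id, hasApiKey: Boolean(entry.apiKey) };
}

// Usage: two providers are configured, but only the selected one is reported.
const resolved = resolveRealtimeProvider({
  provider: "google",
  providers: { openai: { apiKey: "example" }, google: { apiKey: "example" } },
});
console.log(resolved);
```

Throwing on a missing selector mirrors the rule that runtime startup does not guess or rewrite config; in this reading, repairing the config is left to doctor.
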

### Phase 8: Delete The Retired Stack

- remove `/voiceclaw/realtime`
- delete `src/gateway/voiceclaw-realtime/`
- remove request-time `instructionsOverride`
- remove old RPC handlers, scopes, broadcast guards, protocol schemas, generated clients, docs, and UI calls
- keep old names only in explicit migration tables and negative tests

Exit criteria: repository search finds removed public names only in migration
notes or tests that assert absence.
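
The exit criterion above amounts to a negative search, which can be sketched in TypeScript as follows. The removed names listed are the ones mentioned in this change (`talk.realtime.session`, `talk.realtime.toolCall`, the `talk.realtime.relay*` prefix, `/voiceclaw/realtime`, and `instructionsOverride`) and are probably not the complete set; `findRemovedNames`, `Hit`, and the in-memory `files` map are hypothetical helpers for illustration, not part of the repository.

```typescript
// Removed names as mentioned in this change; the real list is likely longer.
// "talk.realtime.relay" is a prefix match for the `talk.realtime.relay*` RPCs.
// `talk.realtime.provider` stays valid config, so the bare `talk.realtime.`
// prefix is intentionally not flagged.
const REMOVED_TALK_NAMES = [
  "talk.realtime.session",
  "talk.realtime.toolCall",
  "talk.realtime.relay",
  "/voiceclaw/realtime",
  "instructionsOverride",
];

type Hit = { file: string; line: number; name: string };

// Hypothetical helper: `files` maps a path to its text; `allowed` marks
// migration notes and negative tests, where old names may still appear.
function findRemovedNames(
  files: Record<string, string>,
  allowed: (file: string) => boolean,
): Hit[] {
  const hits: Hit[] = [];
  for (const [file, text] of Object.entries(files)) {
    if (allowed(file)) continue;
    text.split("\n").forEach((lineText, index) => {
      for (const name of REMOVED_TALK_NAMES) {
        if (lineText.includes(name)) hits.push({ file, line: index + 1, name });
      }
    });
  }
  return hits;
}

// Usage with made-up file contents: only the UI file should be reported.
const hits = findRemovedNames(
  {
    "ui/talk.ts": "rpc('talk.client.create')\nrpc('talk.realtime.toolCall')",
    "docs/migration.md": "talk.realtime.session -> talk.client.create",
    "src/config.ts": "talk.realtime.provider",
  },
  (file) => file.startsWith("docs/migration"),
);
console.log(hits);
```

A real check would read files from disk and decide the allow-list from the repository layout; the sketch keeps both as inputs so the matching rule can be tested on its own.
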

## Test And Verification Plan

The full matrix lives in
[Talk refactor execution checklist](/refactor/talk-execution). The required
proof areas are:

- protocol and generated clients expose only the final Talk API
- Gateway tests cover every `talk.client.*` and `talk.session.*` method
- UI tests prove browser WebRTC, provider WebSocket, and relay paths use the final API
- native tests prove managed-room push-to-talk cleanup, replacement, and event flow
- Voice Call and meeting tests prove early speech, barge-in, output state, and cancellation behavior
- config tests prove `talk.config` reports only resolved effective provider data
- architecture searches prove removed RPCs, events, endpoint, folder, and instruction override stay gone
- docs, protocol generation, SDK API checks, Android tests, build, and `pnpm check:changed` pass before push

## Definition Of Done

The refactor is complete when:

- final API is the only advertised public API
- removed RPCs are gone from handlers, scopes, method lists, schemas, generated clients, docs, and UI
- removed event channels are gone
- retired realtime HTTP endpoint is gone
- retired realtime folder is gone
- browser Talk works through `talk.client.*` or `talk.session.*`
- native Talk works through session events
- streaming STT works through `talk.session.*`
- TTS one-shot remains `talk.speak`
- walkie-talkie works through managed-room sessions
- Voice Call and meetings use shared events and cancellation semantics
- cancellation aborts underlying work
- event envelopes are consistent
- config migration is handled by doctor
- tests prove the deleted API cannot accidentally return

Supporting details:

- [Talk API and runtime contract](/refactor/talk-api-contract)
- [Talk surface mapping](/refactor/talk-surfaces)
- [Talk refactor execution checklist](/refactor/talk-execution)

The end state: one Talk system, a small public API, provider-owned vendor
logic, surface-owned IO, and a Gateway core that owns policy, events, sessions,
turns, cancellation, and observability.

@@ -96,7 +96,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Chat and Talk">
- Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and forwards `openclaw_agent_consult` provider tool calls through `talk.realtime.toolCall` for Gateway policy and the larger configured OpenClaw model.
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. Client-owned provider sessions start with `talk.client.create`; Gateway relay sessions start with `talk.session.create`. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.session.appendAudio` and forwards `openclaw_agent_consult` provider tool calls through `talk.client.toolCall` for Gateway policy and the larger configured OpenClaw model.
- Stream tool calls + live tool output cards in Chat (agent events).

</Accordion>
@@ -168,9 +168,9 @@ Imported themes are stored only in the current browser profile. They are not wri

</Accordion>
<Accordion title="Talk mode (browser realtime)">
Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.realtime.provider: "openai"` plus `talk.realtime.providers.openai.apiKey`, or configure Google with `talk.realtime.provider: "google"` plus `talk.realtime.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.
Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.realtime.provider: "openai"` plus `talk.realtime.providers.openai.apiKey`, or configure Google with `talk.realtime.provider: "google"` plus `talk.realtime.providers.google.apiKey`. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.client.create` does not accept caller-provided instruction overrides.

In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `talk.realtime.toolCall`.
In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `talk.client.toolCall`.

Maintainer live smoke: `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts` verifies the OpenAI browser WebRTC SDP exchange, Google Live constrained-token browser WebSocket setup, and the Gateway relay browser adapter with fake microphone media. The command prints provider status only and does not log secrets.
