docs: outline unified talk API

Peter Steinberger
2026-05-05 20:59:13 +01:00
parent 1f7d0ef310
commit 24853ced11
13 changed files with 625 additions and 13 deletions


@@ -77,6 +77,97 @@ Current bundled provider examples:
- OpenRouter keeps provider builder and onboarding/config helpers in its own
`api.ts`
## Talk and realtime voice migration plan
Realtime voice, telephony, meeting, and browser Talk code is moving from
surface-local turn bookkeeping to a shared Talk session controller exported by
`openclaw/plugin-sdk/realtime-voice`. The new controller owns the common Talk
event envelope, active turn state, capture state, output-audio state, recent
event history, and stale-turn rejection. Provider plugins should keep owning
vendor-specific realtime sessions; surface plugins should keep owning capture,
playback, telephony, and meeting quirks.
This migration is intentionally adapter-first:
1. Add shared controller/runtime primitives to `plugin-sdk/realtime-voice`.
2. Keep existing public Gateway RPCs such as `talk.realtime.session`,
`talk.realtime.relayAudio`, `talk.transcription.session`, and
`talk.handoff.*` as compatibility adapters.
3. Move bundled surfaces onto the shared controller: browser relay, managed-room
handoff, voice-call realtime, voice-call streaming STT, Google Meet realtime,
and VoiceClaw realtime.
4. Advertise all Talk event channels in Gateway `hello-ok.features.events` so
clients can discover `talk.event`, `talk.realtime.relay`, and
`talk.transcription.relay`.
5. Expose the versioned `talk.session.*` API for Gateway-managed Talk sessions
after the adapters are internally backed by the same controller.
New code should not call `createTalkEventSequencer(...)` directly unless it is
implementing a low-level adapter or test fixture. Prefer the shared controller
so turn-scoped events cannot be emitted without a turn id, stale `turnEnd` /
`turnCancel` calls cannot clear a newer active turn, and output-audio lifecycle
events stay consistent across telephony, meetings, browser relay, managed-room
handoff, and native Talk clients.
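The stale-turn invariant above can be sketched as a tiny state machine (names here are illustrative, not the real `plugin-sdk/realtime-voice` API): a `turnEnd` or `turnCancel` carrying an old turn id must never clear a newer active turn.

```typescript
// Sketch of the stale-turn rejection invariant. The class and event names are
// hypothetical; the real shared controller lives in
// openclaw/plugin-sdk/realtime-voice.
type TurnEvent = { type: "turnStart" | "turnEnd" | "turnCancel"; turnId: string };

class TalkTurnState {
  private activeTurnId: string | null = null;

  handle(event: TurnEvent): boolean {
    switch (event.type) {
      case "turnStart":
        this.activeTurnId = event.turnId;
        return true;
      case "turnEnd":
      case "turnCancel":
        // Reject stale events: only the currently active turn may be cleared.
        if (this.activeTurnId !== event.turnId) return false;
        this.activeTurnId = null;
        return true;
    }
  }

  get active(): string | null {
    return this.activeTurnId;
  }
}

const state = new TalkTurnState();
state.handle({ type: "turnStart", turnId: "t1" });
state.handle({ type: "turnStart", turnId: "t2" }); // newer turn supersedes t1
const staleAccepted = state.handle({ type: "turnEnd", turnId: "t1" }); // stale
```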
The target public API shape is:
```typescript
// Versioned Gateway-managed Talk session API.
await gateway.request("talk.session.create", {
mode: "realtime",
transport: "gateway-relay",
brain: "agent-consult",
sessionKey: "main",
});
await gateway.request("talk.session.inputAudio", { sessionId, audioBase64 });
await gateway.request("talk.session.control", { sessionId, type: "turn.cancel" });
await gateway.request("talk.session.toolResult", { sessionId, callId, result });
await gateway.request("talk.session.close", { sessionId });
```
Browser-owned WebRTC/provider-websocket sessions stay on
`talk.realtime.session`, because the browser owns the provider negotiation and
media transport. `talk.session.*` is the common Gateway-managed surface for
gateway-relay realtime, gateway-relay transcription, and managed-room native
STT/TTS sessions.
Legacy configs that placed realtime selectors beside `talk.provider` /
`talk.providers` should be repaired with `openclaw doctor --fix`; runtime Talk
does not reinterpret speech/TTS provider config as realtime provider config.
The supported `talk.session.create` combinations are intentionally small:
| Mode | Transport | Brain | Owner | Notes |
| --------------- | --------------- | --------------- | ------------------ | ------------------------------------------------------------------------------------------------------------------ |
| `realtime` | `gateway-relay` | `agent-consult` | Gateway | Full-duplex provider audio bridged through the Gateway; tool calls are routed through the agent-consult tool. |
| `transcription` | `gateway-relay` | `none` | Gateway | Streaming STT only; callers send input audio and receive transcript events. |
| `stt-tts` | `managed-room` | `agent-consult` | Native/client room | Push-to-talk and walkie-talkie style rooms where the client owns capture/playback and the Gateway owns turn state. |
| `stt-tts` | `managed-room` | `direct-tools` | Native/client room | Admin-only room mode for trusted first-party surfaces that execute Gateway tool actions directly. |
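As a sketch of the `transcription`/`gateway-relay` row, a client drives the whole session through `talk.session.*` requests. The stub gateway below only records calls so the flow is self-contained; the method names come from the tables above, while the payload details and returned `sessionId` shape are assumptions.

```typescript
// Stub gateway that records requests. A real client would hold a Gateway RPC
// connection; everything beyond the documented method names is illustrative.
type RecordedRequest = { method: string; params: Record<string, unknown> };

const calls: RecordedRequest[] = [];
const gateway = {
  async request(method: string, params: Record<string, unknown>) {
    calls.push({ method, params });
    return method === "talk.session.create" ? { sessionId: "s1" } : {};
  },
};

async function streamTranscription(chunks: string[]): Promise<void> {
  const { sessionId } = (await gateway.request("talk.session.create", {
    mode: "transcription",
    transport: "gateway-relay",
    brain: "none", // per the table: streaming STT only, no agent in the loop
  })) as { sessionId: string };
  for (const audioBase64 of chunks) {
    await gateway.request("talk.session.inputAudio", { sessionId, audioBase64 });
  }
  await gateway.request("talk.session.close", { sessionId });
}

const done = streamTranscription(["AAAA", "BBBB"]);
```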
Everything else should stay on the existing owner-specific adapter until there
is a real Gateway-managed transport for it:
| Existing adapter | Keep using it for |
| ----------------------- | ---------------------------------------------------------------------------------------- |
| `talk.realtime.session` | Browser-owned WebRTC and provider-websocket realtime sessions. |
| `talk.realtime.relay*` | Compatibility for existing browser relay clients while they migrate to `talk.session.*`. |
| `talk.transcription.*` | Compatibility for existing streaming STT clients while they migrate to `talk.session.*`. |
| `talk.handoff.*` | Compatibility for room-style native clients; internally this is the managed-room shape. |
The unified control vocabulary is also deliberately narrow:
| Method | Applies to | Contract |
| ------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `talk.session.inputAudio` | `realtime/gateway-relay`, `transcription/gateway-relay` | Append a base64 PCM audio chunk to the provider session owned by the same Gateway connection. |
| `talk.session.control` | all unified sessions | `turn.cancel` for relay sessions; `turn.start`, `turn.end`, and `turn.cancel` for managed-room sessions. |
| `talk.session.toolResult` | `realtime/gateway-relay` | Complete a provider tool call emitted by the relay. |
| `talk.session.close` | all unified sessions | Stop relay sessions or revoke managed-room handoff state, then forget the unified session id. |
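The per-shape contract in the `talk.session.control` row can be captured as a small validation map. This is a sketch: the mode/transport strings and control types come from the tables above, but the helper itself is hypothetical.

```typescript
// Which talk.session.control types each unified session shape accepts,
// following the contract column above. Helper name is illustrative.
const allowedControls: Record<string, Set<string>> = {
  "realtime/gateway-relay": new Set(["turn.cancel"]),
  "transcription/gateway-relay": new Set(["turn.cancel"]),
  "stt-tts/managed-room": new Set(["turn.start", "turn.end", "turn.cancel"]),
};

function canControl(mode: string, transport: string, type: string): boolean {
  return allowedControls[`${mode}/${transport}`]?.has(type) ?? false;
}

// Relay sessions may only cancel; managed-room sessions own the full turn lifecycle.
const relayStart = canControl("realtime", "gateway-relay", "turn.start");
const roomStart = canControl("stt-tts", "managed-room", "turn.start");
```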
Do not introduce provider or platform special cases in core to make this work.
Core owns Talk session semantics. Provider plugins own vendor session setup.
Voice-call and Google Meet own telephony/meeting adapters. Browser and native
apps own device capture/playback UX.
## Compatibility policy
For external plugins, compatibility work follows this order:
@@ -497,7 +588,7 @@ releases.
| `plugin-sdk/speech` | Speech helpers | Speech provider types plus provider-facing directive, registry, validation helpers, and OpenAI-compatible TTS builder |
| `plugin-sdk/speech-core` | Shared speech core | Speech provider types, registry, directives, normalization |
| `plugin-sdk/realtime-transcription` | Realtime transcription helpers | Provider types, registry helpers, and shared WebSocket session helper |
| `plugin-sdk/realtime-voice` | Realtime voice helpers | Provider types, registry/resolution helpers, bridge session helpers, shared agent talk-back queues, transcript/event health, echo suppression, and fast context consult helpers |
| `plugin-sdk/image-generation` | Image-generation helpers | Image generation provider types plus image asset/data URL helpers and the OpenAI-compatible image provider builder |
| `plugin-sdk/image-generation-core` | Shared image-generation core | Image-generation types, failover, auth, and registry helpers |
| `plugin-sdk/music-generation` | Music-generation helpers | Music-generation provider/request/result types |


@@ -588,6 +588,13 @@ API key auth, and dynamic model resolution.
api.registerRealtimeVoiceProvider({
id: "acme-ai",
label: "Acme Realtime Voice",
capabilities: {
transports: ["gateway-relay"],
inputAudioFormats: [{ encoding: "pcm16", sampleRateHz: 24000, channels: 1 }],
outputAudioFormats: [{ encoding: "pcm16", sampleRateHz: 24000, channels: 1 }],
supportsBargeIn: true,
supportsToolCalls: true,
},
isConfigured: ({ providerConfig }) => Boolean(providerConfig.apiKey),
createBridge: (req) => ({
// Set this only if the provider accepts multiple tool responses for
@@ -606,9 +613,11 @@ API key auth, and dynamic model resolution.
});
```
Declare `capabilities` so `talk.catalog` can expose valid modes,
transports, audio formats, and feature flags to browser and native Talk
clients. Implement `handleBargeIn` when a transport can detect that a
human is interrupting assistant playback and the provider supports
truncating or clearing the active audio response.
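A barge-in handler on the bridge might look like this sketch: drop assistant audio that has not reached the client yet, then tell the provider to stop the active response. The queue and the provider message shape are invented for illustration; a real bridge would send its vendor-specific truncate/clear message.

```typescript
// Hypothetical bridge-side barge-in handling. "response.cancel" stands in for
// whatever truncate/clear message the vendor protocol actually defines.
type ProviderMessage = { type: string };

function createBridgeState() {
  const sent: ProviderMessage[] = [];
  const outputQueue: string[] = ["audio-chunk-1", "audio-chunk-2"];
  return {
    sent,
    outputQueue,
    handleBargeIn() {
      // Clear queued assistant audio that has not reached the client...
      outputQueue.length = 0;
      // ...and ask the provider to stop generating the current response.
      sent.push({ type: "response.cancel" });
    },
  };
}

const bridge = createBridgeState();
bridge.handleBargeIn();
```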
</Tab>
<Tab title="Media understanding">
```typescript