refactor: move voice-call realtime providers into extensions

2026-04-06 06:41:08 +00:00 · 2026-04-04 12:04:37 +09:00
parent 61f93540b2
commit a23ab9b906
90 changed files with 3134 additions and 792 deletions
--- a/docs/plugins/architecture.md
+++ b/docs/plugins/architecture.md
@@ -32,6 +32,7 @@ native OpenClaw plugin registers against one or more capability types:
 | Text inference        | `api.registerProvider(...)`                   | `openai`, `anthropic`     |
 | CLI inference backend | `api.registerCliBackend(...)`                 | `openai`, `anthropic`     |
 | Speech                | `api.registerSpeechProvider(...)`             | `elevenlabs`, `microsoft` |
+| Realtime voice        | `api.registerRealtimeVoiceProvider(...)`      | `openai`                  |
 | Media understanding   | `api.registerMediaUnderstandingProvider(...)` | `openai`, `google`        |
 | Image generation      | `api.registerImageGenerationProvider(...)`    | `openai`, `google`        |
 | Web search            | `api.registerWebSearchProvider(...)`          | `google`                  |
@@ -239,8 +240,9 @@ Examples:
 - the bundled `minimax`, `mistral`, `moonshot`, and `zai` plugins own their
  media-understanding backends
 - the `voice-call` plugin is a feature plugin: it owns call transport, tools,
-  CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
-  inventing a second speech stack
+  CLI, routes, and Twilio media-stream bridging, but it consumes shared speech
+  plus realtime-transcription and realtime-voice capabilities instead of
+  importing vendor plugins directly

 The intended end state is:

--- a/docs/plugins/building-plugins.md
+++ b/docs/plugins/building-plugins.md
@@ -146,6 +146,7 @@ A single plugin can register any number of capabilities via the `api` object:
 | CLI inference backend | `api.registerCliBackend(...)`                 | [CLI Backends](/gateway/cli-backends)                                           |
 | Channel / messaging   | `api.registerChannel(...)`                    | [Channel Plugins](/plugins/sdk-channel-plugins)                                 |
 | Speech (TTS/STT)      | `api.registerSpeechProvider(...)`             | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
+| Realtime voice        | `api.registerRealtimeVoiceProvider(...)`      | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
 | Media understanding   | `api.registerMediaUnderstandingProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
 | Image generation      | `api.registerImageGenerationProvider(...)`    | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
 | Web search            | `api.registerWebSearchProvider(...)`          | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
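
Reviewer note: the capability table above can be read against a small stand-in for the `api` object. The sketch below is illustrative only. `FakePluginApi`, the simplified `SpeechProvider` and `RealtimeVoiceProvider` shapes, and the `"test-key"` value are assumptions made for this example and are not taken from the OpenClaw SDK. Only the method names `registerSpeechProvider` and `registerRealtimeVoiceProvider`, and the `isConfigured({ providerConfig })` callback form, appear in this diff.

```typescript
// Hypothetical, simplified shapes; the real SDK types are not shown in this diff.
type SpeechProvider = { id: string; label: string };

type RealtimeVoiceProvider = {
  id: string;
  label: string;
  isConfigured: (ctx: { providerConfig: Record<string, unknown> }) => boolean;
};

// Stand-in for the `api` object: it only records what a plugin registers.
class FakePluginApi {
  readonly speech = new Map<string, SpeechProvider>();
  readonly realtimeVoice = new Map<string, RealtimeVoiceProvider>();

  registerSpeechProvider(provider: SpeechProvider): void {
    this.speech.set(provider.id, provider);
  }

  registerRealtimeVoiceProvider(provider: RealtimeVoiceProvider): void {
    this.realtimeVoice.set(provider.id, provider);
  }
}

// One plugin registering two capabilities under the same provider id.
function register(api: FakePluginApi): void {
  api.registerSpeechProvider({ id: "acme-ai", label: "Acme Speech" });
  api.registerRealtimeVoiceProvider({
    id: "acme-ai",
    label: "Acme Realtime Voice",
    isConfigured: ({ providerConfig }) => Boolean(providerConfig.apiKey),
  });
}

const api = new FakePluginApi();
register(api);

// A consumer resolves the provider by id rather than importing a vendor plugin.
const voice = api.realtimeVoice.get("acme-ai");
const ready =
  voice?.isConfigured({ providerConfig: { apiKey: "test-key" } }) ?? false;
```

Keeping each capability in its own map mirrors the idea that a feature plugin such as `voice-call` consumes a capability by id, so the vendor plugin behind it can change without touching the consumer.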
--- a/docs/plugins/manifest.md
+++ b/docs/plugins/manifest.md
@@ -196,6 +196,8 @@ read without importing the plugin runtime.
 {
  "contracts": {
    "speechProviders": ["openai"],
+    "realtimeTranscriptionProviders": ["openai"],
+    "realtimeVoiceProviders": ["openai"],
    "mediaUnderstandingProviders": ["openai", "openai-codex"],
    "imageGenerationProviders": ["openai"],
    "webSearchProviders": ["gemini"],
@@ -206,13 +208,15 @@ read without importing the plugin runtime.

 Each list is optional:

-| Field                         | Type       | What it means                                                  |
-| ----------------------------- | ---------- | -------------------------------------------------------------- |
-| `speechProviders`             | `string[]` | Speech provider ids this plugin owns.                          |
-| `mediaUnderstandingProviders` | `string[]` | Media-understanding provider ids this plugin owns.             |
-| `imageGenerationProviders`    | `string[]` | Image-generation provider ids this plugin owns.                |
-| `webSearchProviders`          | `string[]` | Web-search provider ids this plugin owns.                      |
-| `tools`                       | `string[]` | Agent tool names this plugin owns for bundled contract checks. |
+| Field                            | Type       | What it means                                                  |
+| -------------------------------- | ---------- | -------------------------------------------------------------- |
+| `speechProviders`                | `string[]` | Speech provider ids this plugin owns.                          |
+| `realtimeTranscriptionProviders` | `string[]` | Realtime-transcription provider ids this plugin owns.          |
+| `realtimeVoiceProviders`         | `string[]` | Realtime-voice provider ids this plugin owns.                  |
+| `mediaUnderstandingProviders`    | `string[]` | Media-understanding provider ids this plugin owns.             |
+| `imageGenerationProviders`       | `string[]` | Image-generation provider ids this plugin owns.                |
+| `webSearchProviders`             | `string[]` | Web-search provider ids this plugin owns.                      |
+| `tools`                          | `string[]` | Agent tool names this plugin owns for bundled contract checks. |

 Legacy top-level `speechProviders`, `mediaUnderstandingProviders`, and
 `imageGenerationProviders` are deprecated. Use `openclaw doctor --fix` to move
--- a/docs/plugins/sdk-overview.md
+++ b/docs/plugins/sdk-overview.md
@@ -128,15 +128,17 @@ methods:

 ### Capability registration

-| Method                                        | What it registers              |
-| --------------------------------------------- | ------------------------------ |
-| `api.registerProvider(...)`                   | Text inference (LLM)           |
-| `api.registerCliBackend(...)`                 | Local CLI inference backend    |
-| `api.registerChannel(...)`                    | Messaging channel              |
-| `api.registerSpeechProvider(...)`             | Text-to-speech / STT synthesis |
-| `api.registerMediaUnderstandingProvider(...)` | Image/audio/video analysis     |
-| `api.registerImageGenerationProvider(...)`    | Image generation               |
-| `api.registerWebSearchProvider(...)`          | Web search                     |
+| Method                                           | What it registers                |
+| ------------------------------------------------ | -------------------------------- |
+| `api.registerProvider(...)`                      | Text inference (LLM)             |
+| `api.registerCliBackend(...)`                    | Local CLI inference backend      |
+| `api.registerChannel(...)`                       | Messaging channel                |
+| `api.registerSpeechProvider(...)`                | Text-to-speech / speech-to-text  |
+| `api.registerRealtimeTranscriptionProvider(...)` | Streaming realtime transcription |
+| `api.registerRealtimeVoiceProvider(...)`         | Duplex realtime voice sessions   |
+| `api.registerMediaUnderstandingProvider(...)`    | Image/audio/video analysis       |
+| `api.registerImageGenerationProvider(...)`       | Image generation                 |
+| `api.registerWebSearchProvider(...)`             | Web search                       |

 ### Tools and commands

--- a/docs/plugins/sdk-provider-plugins.md
+++ b/docs/plugins/sdk-provider-plugins.md
@@ -324,8 +324,8 @@ API key auth, and dynamic model resolution.

  <Step title="Add extra capabilities (optional)">
    <a id="step-5-add-extra-capabilities"></a>
-    A provider plugin can register speech, media understanding, image
-    generation, and web search alongside text inference:
+    A provider plugin can register speech, realtime transcription, realtime voice, media
+    understanding, image generation, and web search alongside text inference:

    ```typescript
    register(api) {
@@ -343,6 +343,33 @@ API key auth, and dynamic model resolution.
        }),
      });

+      api.registerRealtimeTranscriptionProvider({
+        id: "acme-ai",
+        label: "Acme Realtime Transcription",
+        isConfigured: () => true,
+        createSession: (req) => ({
+          connect: async () => {},
+          sendAudio: () => {},
+          close: () => {},
+          isConnected: () => true,
+        }),
+      });
+
+      api.registerRealtimeVoiceProvider({
+        id: "acme-ai",
+        label: "Acme Realtime Voice",
+        isConfigured: ({ providerConfig }) => Boolean(providerConfig.apiKey),
+        createBridge: (req) => ({
+          connect: async () => {},
+          sendAudio: () => {},
+          setMediaTimestamp: () => {},
+          submitToolResult: () => {},
+          acknowledgeMark: () => {},
+          close: () => {},
+          isConnected: () => true,
+        }),
+      });
+
      api.registerMediaUnderstandingProvider({
        id: "acme-ai",
        capabilities: ["image", "audio"],