refactor: move voice-call realtime providers into extensions

This commit is contained in:
Peter Steinberger
2026-04-04 12:04:37 +09:00
parent 61f93540b2
commit a23ab9b906
90 changed files with 3134 additions and 792 deletions

View File

@@ -32,6 +32,7 @@ native OpenClaw plugin registers against one or more capability types:
| Text inference | `api.registerProvider(...)` | `openai`, `anthropic` |
| CLI inference backend | `api.registerCliBackend(...)` | `openai`, `anthropic` |
| Speech | `api.registerSpeechProvider(...)` | `elevenlabs`, `microsoft` |
| Realtime voice | `api.registerRealtimeVoiceProvider(...)` | `openai` |
| Media understanding | `api.registerMediaUnderstandingProvider(...)` | `openai`, `google` |
| Image generation | `api.registerImageGenerationProvider(...)` | `openai`, `google` |
| Web search | `api.registerWebSearchProvider(...)` | `google` |
@@ -239,8 +240,9 @@ Examples:
- the bundled `minimax`, `mistral`, `moonshot`, and `zai` plugins own their
media-understanding backends
- the `voice-call` plugin is a feature plugin: it owns call transport, tools,
CLI, routes, and runtime, but it consumes core TTS/STT capability instead of
inventing a second speech stack
CLI, routes, and Twilio media-stream bridging, but it consumes shared speech
plus realtime-transcription and realtime-voice capabilities instead of
importing vendor plugins directly
The intended end state is:

View File

@@ -146,6 +146,7 @@ A single plugin can register any number of capabilities via the `api` object:
| CLI inference backend | `api.registerCliBackend(...)` | [CLI Backends](/gateway/cli-backends) |
| Channel / messaging | `api.registerChannel(...)` | [Channel Plugins](/plugins/sdk-channel-plugins) |
| Speech (TTS/STT) | `api.registerSpeechProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
| Realtime voice | `api.registerRealtimeVoiceProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
| Media understanding | `api.registerMediaUnderstandingProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
| Image generation | `api.registerImageGenerationProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |
| Web search | `api.registerWebSearchProvider(...)` | [Provider Plugins](/plugins/sdk-provider-plugins#step-5-add-extra-capabilities) |

View File

@@ -196,6 +196,8 @@ read without importing the plugin runtime.
{
"contracts": {
"speechProviders": ["openai"],
"realtimeTranscriptionProviders": ["openai"],
"realtimeVoiceProviders": ["openai"],
"mediaUnderstandingProviders": ["openai", "openai-codex"],
"imageGenerationProviders": ["openai"],
"webSearchProviders": ["gemini"],
@@ -206,13 +208,15 @@ read without importing the plugin runtime.
Each list is optional:
| Field | Type | What it means |
| ----------------------------- | ---------- | -------------------------------------------------------------- |
| `speechProviders` | `string[]` | Speech provider ids this plugin owns. |
| `mediaUnderstandingProviders` | `string[]` | Media-understanding provider ids this plugin owns. |
| `imageGenerationProviders` | `string[]` | Image-generation provider ids this plugin owns. |
| `webSearchProviders` | `string[]` | Web-search provider ids this plugin owns. |
| `tools` | `string[]` | Agent tool names this plugin owns for bundled contract checks. |
| Field | Type | What it means |
| -------------------------------- | ---------- | -------------------------------------------------------------- |
| `speechProviders` | `string[]` | Speech provider ids this plugin owns. |
| `realtimeTranscriptionProviders` | `string[]` | Realtime-transcription provider ids this plugin owns. |
| `realtimeVoiceProviders` | `string[]` | Realtime-voice provider ids this plugin owns. |
| `mediaUnderstandingProviders` | `string[]` | Media-understanding provider ids this plugin owns. |
| `imageGenerationProviders` | `string[]` | Image-generation provider ids this plugin owns. |
| `webSearchProviders` | `string[]` | Web-search provider ids this plugin owns. |
| `tools` | `string[]` | Agent tool names this plugin owns for bundled contract checks. |
Legacy top-level `speechProviders`, `mediaUnderstandingProviders`, and
`imageGenerationProviders` are deprecated. Use `openclaw doctor --fix` to move

View File

@@ -128,15 +128,17 @@ methods:
### Capability registration
| Method | What it registers |
| --------------------------------------------- | ------------------------------ |
| `api.registerProvider(...)` | Text inference (LLM) |
| `api.registerCliBackend(...)` | Local CLI inference backend |
| `api.registerChannel(...)` | Messaging channel |
| `api.registerSpeechProvider(...)` | Text-to-speech / STT synthesis |
| `api.registerMediaUnderstandingProvider(...)` | Image/audio/video analysis |
| `api.registerImageGenerationProvider(...)` | Image generation |
| `api.registerWebSearchProvider(...)` | Web search |
| Method | What it registers |
| ------------------------------------------------ | -------------------------------- |
| `api.registerProvider(...)` | Text inference (LLM) |
| `api.registerCliBackend(...)` | Local CLI inference backend |
| `api.registerChannel(...)` | Messaging channel |
| `api.registerSpeechProvider(...)` | Text-to-speech / STT synthesis |
| `api.registerRealtimeTranscriptionProvider(...)` | Streaming realtime transcription |
| `api.registerRealtimeVoiceProvider(...)` | Duplex realtime voice sessions |
| `api.registerMediaUnderstandingProvider(...)` | Image/audio/video analysis |
| `api.registerImageGenerationProvider(...)` | Image generation |
| `api.registerWebSearchProvider(...)` | Web search |
### Tools and commands

View File

@@ -324,8 +324,8 @@ API key auth, and dynamic model resolution.
<Step title="Add extra capabilities (optional)">
<a id="step-5-add-extra-capabilities"></a>
A provider plugin can register speech, media understanding, image
generation, and web search alongside text inference:
A provider plugin can register speech, realtime transcription, realtime voice, media
understanding, image generation, and web search alongside text inference:
```typescript
register(api) {
@@ -343,6 +343,33 @@ API key auth, and dynamic model resolution.
}),
});
api.registerRealtimeTranscriptionProvider({
id: "acme-ai",
label: "Acme Realtime Transcription",
isConfigured: () => true,
createSession: (req) => ({
connect: async () => {},
sendAudio: () => {},
close: () => {},
isConnected: () => true,
}),
});
api.registerRealtimeVoiceProvider({
id: "acme-ai",
label: "Acme Realtime Voice",
isConfigured: ({ providerConfig }) => Boolean(providerConfig.apiKey),
createBridge: (req) => ({
connect: async () => {},
sendAudio: () => {},
setMediaTimestamp: () => {},
submitToolResult: () => {},
acknowledgeMark: () => {},
close: () => {},
isConnected: () => true,
}),
});
api.registerMediaUnderstandingProvider({
id: "acme-ai",
capabilities: ["image", "audio"],