feat: add xai realtime transcription

Author: Peter Steinberger
Date: 2026-04-23 01:35:08 +01:00
Parent: d4c171f594
Commit: 67f09ea87a
12 changed files with 899 additions and 25 deletions


@@ -155,8 +155,8 @@ Current runtime behavior:
- `streaming.provider` is optional. If unset, Voice Call uses the first
registered realtime transcription provider.
- - Today the bundled provider is OpenAI, registered by the bundled `openai`
-   plugin.
+ - Bundled realtime transcription providers include OpenAI (`openai`) and xAI
+   (`xai`), registered by their provider plugins.
- Provider-owned raw config lives under `streaming.providers.<providerId>`.
- If `streaming.provider` points at an unregistered provider, or no realtime
transcription provider is registered at all, Voice Call logs a warning and
@@ -169,6 +169,15 @@ OpenAI streaming transcription defaults:
- `silenceDurationMs`: `800`
- `vadThreshold`: `0.5`
xAI streaming transcription defaults:
- API key: `streaming.providers.xai.apiKey` or `XAI_API_KEY`
- endpoint: `wss://api.x.ai/v1/stt`
- `encoding`: `mulaw`
- `sampleRate`: `8000`
- `endpointingMs`: `800`
- `interimResults`: `true`
Example:
```json5
@@ -197,6 +206,33 @@ Example:
}
```
Use xAI instead:
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "xai",
streamPath: "/voice/stream",
providers: {
xai: {
apiKey: "${XAI_API_KEY}", // optional if XAI_API_KEY is set
endpointingMs: 800,
language: "en",
},
},
},
},
},
},
},
}
```
Legacy keys are still auto-migrated by `openclaw doctor --fix`:
- `streaming.sttProvider` → `streaming.provider`


@@ -79,16 +79,17 @@ provider and tool contracts where the behavior fits cleanly.
| Batch text-to-speech | `messages.tts.provider: "xai"` / `tts` | Yes |
| Streaming TTS | — | Not exposed; OpenClaw's TTS contract returns complete audio buffers |
| Batch speech-to-text | `tools.media.audio` / media understanding | Yes |
- | Streaming speech-to-text | — | Not exposed; needs streaming transcription contract mapping |
+ | Streaming speech-to-text | Voice Call `streaming.provider: "xai"` | Yes |
| Realtime voice | — | Not exposed yet; different session/WebSocket contract |
| Files / batches | Generic model API compatibility only | Not a first-class OpenClaw tool |
<Note>
OpenClaw uses xAI's REST image/video/TTS/STT APIs for media generation,
- speech, and transcription, and the Responses API for model, search, and
- code-execution tools. Features that need new OpenClaw contracts, such as
- streaming STT or Realtime voice sessions, are documented here as upstream
- capabilities rather than hidden plugin behavior.
+ speech, and batch transcription, xAI's streaming STT WebSocket for live
+ voice-call transcription, and the Responses API for model, search, and
+ code-execution tools. Features that need different OpenClaw contracts, such as
+ Realtime voice sessions, are documented here as upstream capabilities rather
+ than hidden plugin behavior.
</Note>
### Fast-mode mappings
@@ -277,10 +278,54 @@ Legacy aliases still normalize to the canonical bundled ids:
surface, but the xAI REST STT integration only forwards file, model, and
language because those map cleanly to the current public xAI endpoint.
</Accordion>
<Accordion title="Streaming speech-to-text">
The bundled `xai` plugin also registers a realtime transcription provider
for live voice-call audio.
- Endpoint: xAI WebSocket `wss://api.x.ai/v1/stt`
- Default encoding: `mulaw`
- Default sample rate: `8000`
- Default endpointing: `800ms`
- Interim transcripts: enabled by default
Voice Call's Twilio media stream sends G.711 µ-law audio frames, so the
xAI provider can forward those frames directly without transcoding:
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "xai",
providers: {
xai: {
apiKey: "${XAI_API_KEY}",
endpointingMs: 800,
language: "en",
},
},
},
},
},
},
},
}
```
Provider-owned config lives under
`plugins.entries.voice-call.config.streaming.providers.xai`. Supported
keys are `apiKey`, `baseUrl`, `sampleRate`, `encoding` (`pcm`, `mulaw`, or
`alaw`), `interimResults`, `endpointingMs`, and `language`.
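The no-transcode pass-through described above can be sketched as follows. The Twilio frame shape follows Twilio's documented media-stream format, but the function name and forwarding detail are illustrative assumptions, not the plugin's actual code:

```typescript
// Twilio media-stream frames arrive as JSON text messages over the stream
// WebSocket; for "media" events, `media.payload` is base64-encoded G.711
// mu-law audio at 8000 Hz.
interface TwilioStreamMessage {
  event: string; // "connected" | "start" | "media" | "stop" | "mark"
  media?: { payload: string };
}

// Extract the base64 mu-law payload from a raw frame, or null for
// non-media events. Because the xAI provider defaults already match the
// wire format (encoding: "mulaw", sampleRate: 8000), this payload can be
// forwarded to the STT socket without decoding or resampling.
function extractMulawPayload(raw: string): string | null {
  const msg = JSON.parse(raw) as TwilioStreamMessage;
  return msg.event === "media" && msg.media ? msg.media.payload : null;
}
```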
<Note>
- xAI also offers streaming STT over `wss://api.x.ai/v1/stt`. OpenClaw's
- bundled xAI plugin does not expose that yet; the current provider is batch
- STT for file/segment transcription.
+ This streaming provider is for Voice Call's realtime transcription path.
+ Discord voice currently records short segments and uses the batch
+ `tools.media.audio` transcription path instead.
</Note>
</Accordion>
@@ -362,9 +407,9 @@ Legacy aliases still normalize to the canonical bundled ids:
- `grok-4.20-multi-agent-experimental-beta-0304` is not supported on the
normal xAI provider path because it requires a different upstream API
surface than the standard OpenClaw xAI transport.
- - xAI streaming STT and Realtime voice are not registered as OpenClaw
-   providers yet. Batch xAI STT is registered through media understanding.
-   Streaming STT and Realtime voice need WebSocket/session contract mapping.
+ - xAI Realtime voice is not registered as an OpenClaw provider yet. It
+   needs a different bidirectional voice session contract than batch STT or
+   streaming transcription.
- xAI image `quality`, image `mask`, and extra native-only aspect ratios are
not exposed until the shared `image_generate` tool has corresponding
cross-provider controls.
@@ -401,10 +446,10 @@ OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_TEST_QUIET=1 OPENCLAW_LIVE_IMAGE_GENERATION_P
```
The provider-specific live file synthesizes normal TTS, telephony-friendly PCM
- TTS, transcribes audio through xAI STT, generates text-to-image output, and
- edits a reference image. The shared image live file verifies the same xAI
- provider through OpenClaw's runtime selection, fallback, normalization, and
- media attachment path.
+ TTS, transcribes audio through xAI batch STT, streams the same PCM through xAI
+ realtime STT, generates text-to-image output, and edits a reference image. The
+ shared image live file verifies the same xAI provider through OpenClaw's
+ runtime selection, fallback, normalization, and media attachment path.
## Related


@@ -52,9 +52,9 @@ Media understanding uses any vision-capable or audio-capable model registered in
Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
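The submit-and-track flow described above can be sketched as follows; the ledger class, method names, and completion callback are hypothetical illustrations, not OpenClaw's actual internals:

```typescript
// Hypothetical sketch of the background-task pattern: submit returns a task
// id immediately, the job is tracked in a ledger, and completion records the
// finished media so the agent can be woken to post it back.
type TaskStatus = "running" | "done";

interface TaskLedgerEntry {
  id: string;
  status: TaskStatus;
  resultUrl?: string;
}

class TaskLedger {
  private tasks = new Map<string, TaskLedgerEntry>();

  // Register a submitted provider job; the agent keeps responding to other
  // messages while the job runs in the background.
  submit(id: string): string {
    this.tasks.set(id, { id, status: "running" });
    return id;
  }

  // Called when the provider finishes; marks the job done so the finished
  // media can be posted into the original channel.
  complete(id: string, resultUrl: string): void {
    const entry = this.tasks.get(id);
    if (entry) {
      entry.status = "done";
      entry.resultUrl = resultUrl;
    }
  }

  get(id: string): TaskLedgerEntry | undefined {
    return this.tasks.get(id);
  }
}
```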
xAI currently maps to OpenClaw's image, video, search, code-execution, batch
- TTS, and batch STT surfaces. xAI streaming STT and Realtime voice are upstream
- capabilities, but they are not registered in OpenClaw until the shared
- streaming transcription and realtime voice contracts can represent them.
+ TTS, batch STT, and Voice Call streaming STT surfaces. xAI Realtime voice is
+ an upstream capability, but it is not registered in OpenClaw until the shared
+ realtime voice contract can represent it.
## Quick links