feat(providers): add streaming stt providers

This commit is contained in:
Peter Steinberger
2026-04-23 03:05:44 +01:00
parent 5b68092351
commit 51ed22e608
32 changed files with 2399 additions and 16 deletions


@@ -2,18 +2,22 @@
summary: "Deepgram transcription for inbound voice notes"
read_when:
- You want Deepgram speech-to-text for audio attachments
- You want Deepgram streaming transcription for Voice Call
- You need a quick Deepgram config example
title: "Deepgram"
---
# Deepgram (Audio Transcription)
Deepgram is a speech-to-text API. In OpenClaw it is used for inbound
audio/voice-note transcription through `tools.media.audio` and for Voice Call
streaming STT through `plugins.entries.voice-call.config.streaming`.
For batch transcription, OpenClaw uploads the complete audio file to Deepgram
and injects the transcript into the reply pipeline (`{{Transcript}}` +
`[Audio]` block). For Voice Call streaming, OpenClaw forwards live G.711
u-law frames to Deepgram's WebSocket `listen` endpoint and emits partial or
final transcripts as Deepgram returns them.
| Detail | Value |
| ------------- | ---------------------------------------------------------- |
@@ -101,6 +105,52 @@ it uses the pre-recorded transcription endpoint.
</Tab>
</Tabs>
## Voice Call streaming STT
The bundled `deepgram` plugin also registers a realtime transcription provider
for the Voice Call plugin.
| Setting | Config path | Default |
| --------------- | ----------------------------------------------------------------------- | -------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.deepgram.apiKey` | Falls back to `DEEPGRAM_API_KEY` |
| Model | `...deepgram.model` | `nova-3` |
| Language | `...deepgram.language` | (unset) |
| Encoding | `...deepgram.encoding` | `mulaw` |
| Sample rate | `...deepgram.sampleRate` | `8000` |
| Endpointing | `...deepgram.endpointingMs` | `800` |
| Interim results | `...deepgram.interimResults` | `true` |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "deepgram",
providers: {
deepgram: {
apiKey: "${DEEPGRAM_API_KEY}",
model: "nova-3",
endpointingMs: 800,
language: "en-US",
},
},
},
},
},
},
},
}
```
<Note>
Voice Call receives telephony audio as 8 kHz G.711 u-law. The Deepgram
streaming provider defaults to `encoding: "mulaw"` and `sampleRate: 8000`, so
Twilio media frames can be forwarded directly.
</Note>
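Because the defaults already match Twilio's media format, the bridge between a Twilio media frame and the Deepgram socket is thin. A minimal sketch, assuming Twilio's standard media-stream message shape; the helper names are illustrative, not OpenClaw internals:

```typescript
// Illustrative sketch, not OpenClaw source code.
interface DeepgramStreamingConfig {
  apiKey: string;
  model?: string;
  language?: string;
  encoding?: string;
  sampleRate?: number;
  interimResults?: boolean;
}

// Build the streaming listen URL, applying the defaults from the table above.
export function buildListenUrl(cfg: DeepgramStreamingConfig): string {
  const params = new URLSearchParams({
    model: cfg.model ?? "nova-3",
    encoding: cfg.encoding ?? "mulaw",
    sample_rate: String(cfg.sampleRate ?? 8000),
    interim_results: String(cfg.interimResults ?? true),
  });
  if (cfg.language) params.set("language", cfg.language);
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

// Twilio media frames carry base64-encoded mu-law audio. The decoded bytes
// can be written to the Deepgram socket as-is, because the encoding and
// sample rate already match.
interface AudioSink {
  send(data: Buffer): void;
}

export function forwardTwilioFrame(
  sink: AudioSink,
  frame: { event: string; media?: { payload: string } },
): void {
  if (frame.event === "media" && frame.media) {
    sink.send(Buffer.from(frame.media.payload, "base64"));
  }
}
```

The URL query keys (`model`, `encoding`, `sample_rate`, `interim_results`, `language`) follow Deepgram's documented `listen` parameters; everything else here is a simplified stand-in for the plugin's internals.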
## Notes
<AccordionGroup>
@@ -118,12 +168,6 @@ it uses the pre-recorded transcription endpoint.
</Accordion>
</AccordionGroup>
## Related
<CardGroup cols={2}>


@@ -0,0 +1,111 @@
---
summary: "Use ElevenLabs speech, Scribe STT, and realtime transcription with OpenClaw"
read_when:
- You want ElevenLabs text-to-speech in OpenClaw
- You want ElevenLabs Scribe speech-to-text for audio attachments
- You want ElevenLabs realtime transcription for Voice Call
title: "ElevenLabs"
---
# ElevenLabs
OpenClaw uses ElevenLabs for text-to-speech, batch speech-to-text with Scribe
v2, and Voice Call streaming STT with Scribe v2 Realtime.
| Capability | OpenClaw surface | Default |
| ------------------------ | --------------------------------------------- | ------------------------ |
| Text-to-speech | `messages.tts` / `talk` | `eleven_multilingual_v2` |
| Batch speech-to-text | `tools.media.audio` | `scribe_v2` |
| Streaming speech-to-text | Voice Call `streaming.provider: "elevenlabs"` | `scribe_v2_realtime` |
## Authentication
Set `ELEVENLABS_API_KEY` in the environment. `XI_API_KEY` is also accepted for
compatibility with existing ElevenLabs tooling.
```bash
export ELEVENLABS_API_KEY="..."
```
## Text-to-speech
```json5
{
messages: {
tts: {
providers: {
elevenlabs: {
apiKey: "${ELEVENLABS_API_KEY}",
voiceId: "pMsXgVXv3BLzUgSXRplE",
modelId: "eleven_multilingual_v2",
},
},
},
},
}
```
## Speech-to-text
Use Scribe v2 for inbound audio attachments and short recorded voice segments:
```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "elevenlabs", model: "scribe_v2" }],
},
},
},
}
```
OpenClaw sends multipart audio to ElevenLabs `/v1/speech-to-text` with
`model_id: "scribe_v2"`. Language hints map to `language_code` when present.
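The shape of that request can be sketched as follows; `buildScribeRequest` and its option names are illustrative, not OpenClaw internals, though the endpoint, `xi-api-key` header, and `model_id`/`language_code` fields match ElevenLabs' speech-to-text API:

```typescript
// Illustrative sketch, not OpenClaw source code: assemble the multipart
// request for ElevenLabs batch speech-to-text.
export function buildScribeRequest(
  audio: Blob,
  opts: { apiKey: string; modelId?: string; languageCode?: string },
): { url: string; headers: Record<string, string>; body: FormData } {
  const body = new FormData();
  body.set("model_id", opts.modelId ?? "scribe_v2");
  body.set("file", audio, "audio.ogg");
  // Language hints from the pipeline map to language_code when present.
  if (opts.languageCode) body.set("language_code", opts.languageCode);
  return {
    url: "https://api.elevenlabs.io/v1/speech-to-text",
    headers: { "xi-api-key": opts.apiKey },
    body,
  };
}
```

A caller would then send it with `fetch(url, { method: "POST", headers, body })`.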
## Voice Call streaming STT
The bundled `elevenlabs` plugin registers Scribe v2 Realtime for Voice Call
streaming transcription.
| Setting | Config path | Default |
| --------------- | ------------------------------------------------------------------------- | ------------------------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.elevenlabs.apiKey` | Falls back to `ELEVENLABS_API_KEY` / `XI_API_KEY` |
| Model | `...elevenlabs.modelId` | `scribe_v2_realtime` |
| Audio format | `...elevenlabs.audioFormat` | `ulaw_8000` |
| Sample rate | `...elevenlabs.sampleRate` | `8000` |
| Commit strategy | `...elevenlabs.commitStrategy` | `vad` |
| Language | `...elevenlabs.languageCode` | (unset) |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "elevenlabs",
providers: {
elevenlabs: {
apiKey: "${ELEVENLABS_API_KEY}",
audioFormat: "ulaw_8000",
commitStrategy: "vad",
languageCode: "en",
},
},
},
},
},
},
},
}
```
<Note>
Voice Call receives Twilio media as 8 kHz G.711 u-law. The ElevenLabs realtime
provider defaults to `ulaw_8000`, so telephony frames can be forwarded without
transcoding.
</Note>


@@ -82,6 +82,9 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
## Transcription providers
- [Deepgram (audio transcription)](/providers/deepgram)
- [ElevenLabs](/providers/elevenlabs#speech-to-text)
- [Mistral](/providers/mistral#audio-transcription-voxtral)
- [OpenAI](/providers/openai#speech-to-text)
- [xAI](/providers/xai#speech-to-text)
## Community tools


@@ -2,6 +2,7 @@
summary: "Use Mistral models and Voxtral transcription with OpenClaw"
read_when:
- You want to use Mistral models in OpenClaw
- You want Voxtral realtime transcription for Voice Call
- You need Mistral API key onboarding and model refs
title: "Mistral"
---
@@ -65,7 +66,8 @@ OpenClaw currently ships this bundled Mistral catalog:
## Audio transcription (Voxtral)
Use Voxtral for batch audio transcription through the media understanding
pipeline.
```json5
{
@@ -84,6 +86,48 @@ Use Voxtral for audio transcription through the media understanding pipeline.
The media transcription path uses `/v1/audio/transcriptions`. The default audio model for Mistral is `voxtral-mini-latest`.
</Tip>
## Voice Call streaming STT
The bundled `mistral` plugin registers Voxtral Realtime as a Voice Call
streaming STT provider.
| Setting | Config path | Default |
| ------------ | ---------------------------------------------------------------------- | --------------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.mistral.apiKey` | Falls back to `MISTRAL_API_KEY` |
| Model | `...mistral.model` | `voxtral-mini-transcribe-realtime-2602` |
| Encoding | `...mistral.encoding` | `pcm_mulaw` |
| Sample rate | `...mistral.sampleRate` | `8000` |
| Target delay | `...mistral.targetStreamingDelayMs` | `800` |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "mistral",
providers: {
mistral: {
apiKey: "${MISTRAL_API_KEY}",
targetStreamingDelayMs: 800,
},
},
},
},
},
},
},
}
```
<Note>
OpenClaw defaults Mistral realtime STT to `pcm_mulaw` at 8 kHz so Voice Call
can forward Twilio media frames directly. Use `encoding: "pcm_s16le"` and a
matching `sampleRate` only if your upstream stream is already raw PCM.
</Note>
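The defaults in the table above resolve in a straightforward way; a minimal sketch, assuming a simplified config shape (the function name and env-lookup signature are illustrative, not OpenClaw internals):

```typescript
// Illustrative sketch, not OpenClaw source code: resolve the Mistral
// streaming STT settings documented above, with their defaults.
export interface MistralStreamingConfig {
  apiKey?: string;
  model?: string;
  encoding?: "pcm_mulaw" | "pcm_s16le";
  sampleRate?: number;
  targetStreamingDelayMs?: number;
}

export function resolveMistralStreaming(
  cfg: MistralStreamingConfig,
  env: Record<string, string | undefined> = process.env,
) {
  // The API key falls back to MISTRAL_API_KEY, per the settings table.
  const apiKey = cfg.apiKey ?? env.MISTRAL_API_KEY;
  if (!apiKey) throw new Error("Mistral streaming STT requires an API key");
  return {
    apiKey,
    model: cfg.model ?? "voxtral-mini-transcribe-realtime-2602",
    encoding: cfg.encoding ?? "pcm_mulaw",
    sampleRate: cfg.sampleRate ?? 8000,
    targetStreamingDelayMs: cfg.targetStreamingDelayMs ?? 800,
  };
}
```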
## Advanced configuration
<AccordionGroup>


@@ -31,11 +31,12 @@ This table shows which providers support which media capabilities across the pla
| BytePlus | | Yes | | | | |
| ComfyUI | Yes | Yes | Yes | | | |
| Deepgram | | | | | Yes | |
| ElevenLabs | | | | Yes | Yes | |
| fal | Yes | Yes | | | | |
| Google | Yes | Yes | Yes | | | Yes |
| Microsoft | | | | Yes | | |
| MiniMax | Yes | Yes | Yes | Yes | | |
| Mistral | | | | | Yes | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes |
| Qwen | | Yes | | | | |
| Runway | | Yes | | | | |
@@ -51,6 +52,12 @@ Media understanding uses any vision-capable or audio-capable model registered in
Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI can all transcribe inbound
audio through the batch `tools.media.audio` path when configured. Deepgram,
ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT
providers, so live phone audio can be forwarded to the selected vendor
without waiting for a completed recording.
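Under the config layout shown on the provider pages, the streaming path is only taken when `streaming.enabled` is set and the named provider has a settings block. A minimal sketch of that selection, assuming a simplified config shape (the function name is illustrative, not OpenClaw internals):

```typescript
// Illustrative sketch, not OpenClaw source code: pick the active Voice Call
// streaming STT provider from the streaming config block.
export interface StreamingSttConfig {
  enabled?: boolean;
  provider?: string;
  providers?: Record<string, unknown>;
}

export function selectStreamingProvider(
  cfg: StreamingSttConfig,
): { name: string; settings: unknown } | null {
  // Streaming disabled or no provider named: Voice Call stays on batch STT.
  if (!cfg.enabled || !cfg.provider) return null;
  const settings = cfg.providers?.[cfg.provider];
  if (settings === undefined) {
    throw new Error(`No settings for streaming provider "${cfg.provider}"`);
  }
  return { name: cfg.provider, settings };
}
```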
OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call
streaming STT, realtime voice, and memory embedding surfaces. xAI currently
maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT,