feat(providers): add streaming stt providers

This commit is contained in:
Peter Steinberger
2026-04-23 03:05:44 +01:00
parent 5b68092351
commit 51ed22e608
32 changed files with 2399 additions and 16 deletions


@@ -2,18 +2,22 @@
summary: "Deepgram transcription for inbound voice notes"
read_when:
- You want Deepgram speech-to-text for audio attachments
- You want Deepgram streaming transcription for Voice Call
- You need a quick Deepgram config example
title: "Deepgram"
---
# Deepgram (Audio Transcription)
Deepgram is a speech-to-text API. In OpenClaw it is used for inbound
audio/voice-note transcription through `tools.media.audio` and for Voice Call
streaming STT through `plugins.entries.voice-call.config.streaming`.
For batch transcription, OpenClaw uploads the complete audio file to Deepgram
and injects the transcript into the reply pipeline (`{{Transcript}}` +
`[Audio]` block). For Voice Call streaming, OpenClaw forwards live G.711
u-law frames to Deepgram's WebSocket `listen` endpoint and emits partial or
final transcripts as Deepgram returns them.
| Detail | Value |
| ------------- | ---------------------------------------------------------- |
@@ -101,6 +105,52 @@ it uses the pre-recorded transcription endpoint.
</Tab>
</Tabs>
## Voice Call streaming STT
The bundled `deepgram` plugin also registers a realtime transcription provider
for the Voice Call plugin.
| Setting | Config path | Default |
| --------------- | ----------------------------------------------------------------------- | -------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.deepgram.apiKey` | Falls back to `DEEPGRAM_API_KEY` |
| Model | `...deepgram.model` | `nova-3` |
| Language | `...deepgram.language` | (unset) |
| Encoding | `...deepgram.encoding` | `mulaw` |
| Sample rate | `...deepgram.sampleRate` | `8000` |
| Endpointing | `...deepgram.endpointingMs` | `800` |
| Interim results | `...deepgram.interimResults` | `true` |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "deepgram",
providers: {
deepgram: {
apiKey: "${DEEPGRAM_API_KEY}",
model: "nova-3",
endpointingMs: 800,
language: "en-US",
},
},
},
},
},
},
},
}
```
<Note>
Voice Call receives telephony audio as 8 kHz G.711 u-law. The Deepgram
streaming provider defaults to `encoding: "mulaw"` and `sampleRate: 8000`, so
Twilio media frames can be forwarded directly.
</Note>
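Because the defaults already match Twilio's media format, the bridge between a Twilio media frame and the Deepgram socket is thin. A minimal sketch, assuming Twilio's standard media-stream message shape; the helper names are illustrative, not OpenClaw internals:

```typescript
// Illustrative sketch, not OpenClaw source code.
interface DeepgramStreamingConfig {
  apiKey: string;
  model?: string;
  language?: string;
  encoding?: string;
  sampleRate?: number;
  interimResults?: boolean;
}

// Build the streaming listen URL, applying the defaults from the table above.
export function buildListenUrl(cfg: DeepgramStreamingConfig): string {
  const params = new URLSearchParams({
    model: cfg.model ?? "nova-3",
    encoding: cfg.encoding ?? "mulaw",
    sample_rate: String(cfg.sampleRate ?? 8000),
    interim_results: String(cfg.interimResults ?? true),
  });
  if (cfg.language) params.set("language", cfg.language);
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

// Twilio media frames carry base64-encoded mu-law audio. The decoded bytes
// can be written to the Deepgram socket as-is, because the encoding and
// sample rate already match.
interface AudioSink {
  send(data: Buffer): void;
}

export function forwardTwilioFrame(
  sink: AudioSink,
  frame: { event: string; media?: { payload: string } },
): void {
  if (frame.event === "media" && frame.media) {
    sink.send(Buffer.from(frame.media.payload, "base64"));
  }
}
```

The URL query keys (`model`, `encoding`, `sample_rate`, `interim_results`, `language`) follow Deepgram's documented `listen` parameters; everything else here is a simplified stand-in for the plugin's internals.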
## Notes
<AccordionGroup>
@@ -118,12 +168,6 @@ it uses the pre-recorded transcription endpoint.
</Accordion>
</AccordionGroup>
## Related
<CardGroup cols={2}>


@@ -0,0 +1,111 @@
---
summary: "Use ElevenLabs speech, Scribe STT, and realtime transcription with OpenClaw"
read_when:
- You want ElevenLabs text-to-speech in OpenClaw
- You want ElevenLabs Scribe speech-to-text for audio attachments
- You want ElevenLabs realtime transcription for Voice Call
title: "ElevenLabs"
---
# ElevenLabs
OpenClaw uses ElevenLabs for text-to-speech, batch speech-to-text with Scribe
v2, and Voice Call streaming STT with Scribe v2 Realtime.
| Capability | OpenClaw surface | Default |
| ------------------------ | --------------------------------------------- | ------------------------ |
| Text-to-speech | `messages.tts` / `talk` | `eleven_multilingual_v2` |
| Batch speech-to-text | `tools.media.audio` | `scribe_v2` |
| Streaming speech-to-text | Voice Call `streaming.provider: "elevenlabs"` | `scribe_v2_realtime` |
## Authentication
Set `ELEVENLABS_API_KEY` in the environment. `XI_API_KEY` is also accepted for
compatibility with existing ElevenLabs tooling.
```bash
export ELEVENLABS_API_KEY="..."
```
## Text-to-speech
```json5
{
messages: {
tts: {
providers: {
elevenlabs: {
apiKey: "${ELEVENLABS_API_KEY}",
voiceId: "pMsXgVXv3BLzUgSXRplE",
modelId: "eleven_multilingual_v2",
},
},
},
},
}
```
## Speech-to-text
Use Scribe v2 for inbound audio attachments and short recorded voice segments:
```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "elevenlabs", model: "scribe_v2" }],
},
},
},
}
```
OpenClaw sends multipart audio to ElevenLabs `/v1/speech-to-text` with
`model_id: "scribe_v2"`. Language hints map to `language_code` when present.
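The shape of that request can be sketched as follows; `buildScribeRequest` and its option names are illustrative, not OpenClaw internals, though the endpoint, `xi-api-key` header, and `model_id`/`language_code` fields match ElevenLabs' speech-to-text API:

```typescript
// Illustrative sketch, not OpenClaw source code: assemble the multipart
// request for ElevenLabs batch speech-to-text.
export function buildScribeRequest(
  audio: Blob,
  opts: { apiKey: string; modelId?: string; languageCode?: string },
): { url: string; headers: Record<string, string>; body: FormData } {
  const body = new FormData();
  body.set("model_id", opts.modelId ?? "scribe_v2");
  body.set("file", audio, "audio.ogg");
  // Language hints from the pipeline map to language_code when present.
  if (opts.languageCode) body.set("language_code", opts.languageCode);
  return {
    url: "https://api.elevenlabs.io/v1/speech-to-text",
    headers: { "xi-api-key": opts.apiKey },
    body,
  };
}
```

A caller would then send it with `fetch(url, { method: "POST", headers, body })`.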
## Voice Call streaming STT
The bundled `elevenlabs` plugin registers Scribe v2 Realtime for Voice Call
streaming transcription.
| Setting | Config path | Default |
| --------------- | ------------------------------------------------------------------------- | ------------------------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.elevenlabs.apiKey` | Falls back to `ELEVENLABS_API_KEY` / `XI_API_KEY` |
| Model | `...elevenlabs.modelId` | `scribe_v2_realtime` |
| Audio format | `...elevenlabs.audioFormat` | `ulaw_8000` |
| Sample rate | `...elevenlabs.sampleRate` | `8000` |
| Commit strategy | `...elevenlabs.commitStrategy` | `vad` |
| Language | `...elevenlabs.languageCode` | (unset) |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "elevenlabs",
providers: {
elevenlabs: {
apiKey: "${ELEVENLABS_API_KEY}",
audioFormat: "ulaw_8000",
commitStrategy: "vad",
languageCode: "en",
},
},
},
},
},
},
},
}
```
<Note>
Voice Call receives Twilio media as 8 kHz G.711 u-law. The ElevenLabs realtime
provider defaults to `ulaw_8000`, so telephony frames can be forwarded without
transcoding.
</Note>


@@ -82,6 +82,9 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
## Transcription providers
- [Deepgram (audio transcription)](/providers/deepgram)
- [ElevenLabs](/providers/elevenlabs#speech-to-text)
- [Mistral](/providers/mistral#audio-transcription-voxtral)
- [OpenAI](/providers/openai#speech-to-text)
- [xAI](/providers/xai#speech-to-text)
## Community tools


@@ -2,6 +2,7 @@
summary: "Use Mistral models and Voxtral transcription with OpenClaw"
read_when:
- You want to use Mistral models in OpenClaw
- You want Voxtral realtime transcription for Voice Call
- You need Mistral API key onboarding and model refs
title: "Mistral"
---
@@ -65,7 +66,8 @@ OpenClaw currently ships this bundled Mistral catalog:
## Audio transcription (Voxtral)
Use Voxtral for batch audio transcription through the media understanding
pipeline.
```json5
{
@@ -84,6 +86,48 @@ Use Voxtral for audio transcription through the media understanding pipeline.
The media transcription path uses `/v1/audio/transcriptions`. The default audio model for Mistral is `voxtral-mini-latest`.
</Tip>
## Voice Call streaming STT
The bundled `mistral` plugin registers Voxtral Realtime as a Voice Call
streaming STT provider.
| Setting | Config path | Default |
| ------------ | ---------------------------------------------------------------------- | --------------------------------------- |
| API key | `plugins.entries.voice-call.config.streaming.providers.mistral.apiKey` | Falls back to `MISTRAL_API_KEY` |
| Model | `...mistral.model` | `voxtral-mini-transcribe-realtime-2602` |
| Encoding | `...mistral.encoding` | `pcm_mulaw` |
| Sample rate | `...mistral.sampleRate` | `8000` |
| Target delay | `...mistral.targetStreamingDelayMs` | `800` |
```json5
{
plugins: {
entries: {
"voice-call": {
config: {
streaming: {
enabled: true,
provider: "mistral",
providers: {
mistral: {
apiKey: "${MISTRAL_API_KEY}",
targetStreamingDelayMs: 800,
},
},
},
},
},
},
},
}
```
<Note>
OpenClaw defaults Mistral realtime STT to `pcm_mulaw` at 8 kHz so Voice Call
can forward Twilio media frames directly. Use `encoding: "pcm_s16le"` and a
matching `sampleRate` only if your upstream stream is already raw PCM.
</Note>
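The defaults in the table above resolve in a straightforward way; a minimal sketch, assuming a simplified config shape (the function name and env-lookup signature are illustrative, not OpenClaw internals):

```typescript
// Illustrative sketch, not OpenClaw source code: resolve the Mistral
// streaming STT settings documented above, with their defaults.
export interface MistralStreamingConfig {
  apiKey?: string;
  model?: string;
  encoding?: "pcm_mulaw" | "pcm_s16le";
  sampleRate?: number;
  targetStreamingDelayMs?: number;
}

export function resolveMistralStreaming(
  cfg: MistralStreamingConfig,
  env: Record<string, string | undefined> = process.env,
) {
  // The API key falls back to MISTRAL_API_KEY, per the settings table.
  const apiKey = cfg.apiKey ?? env.MISTRAL_API_KEY;
  if (!apiKey) throw new Error("Mistral streaming STT requires an API key");
  return {
    apiKey,
    model: cfg.model ?? "voxtral-mini-transcribe-realtime-2602",
    encoding: cfg.encoding ?? "pcm_mulaw",
    sampleRate: cfg.sampleRate ?? 8000,
    targetStreamingDelayMs: cfg.targetStreamingDelayMs ?? 800,
  };
}
```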
## Advanced configuration
<AccordionGroup>


@@ -31,11 +31,12 @@ This table shows which providers support which media capabilities across the pla
| BytePlus | | Yes | | | | |
| ComfyUI | Yes | Yes | Yes | | | |
| Deepgram | | | | | Yes | |
| ElevenLabs | | | | Yes | Yes | |
| fal | Yes | Yes | | | | |
| Google | Yes | Yes | Yes | | | Yes |
| Microsoft | | | | Yes | | |
| MiniMax | Yes | Yes | Yes | Yes | | |
| Mistral | | | | | Yes | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes |
| Qwen | | Yes | | | | |
| Runway | | Yes | | | | |
@@ -51,6 +52,12 @@ Media understanding uses any vision-capable or audio-capable model registered in
Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI can all transcribe inbound
audio through the batch `tools.media.audio` path when configured. Deepgram,
ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT
providers, so live phone audio can be forwarded to the selected vendor
without waiting for a completed recording.
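Under the config layout shown on the provider pages, the streaming path is only taken when `streaming.enabled` is set and the named provider has a settings block. A minimal sketch of that selection, assuming a simplified config shape (the function name is illustrative, not OpenClaw internals):

```typescript
// Illustrative sketch, not OpenClaw source code: pick the active Voice Call
// streaming STT provider from the streaming config block.
export interface StreamingSttConfig {
  enabled?: boolean;
  provider?: string;
  providers?: Record<string, unknown>;
}

export function selectStreamingProvider(
  cfg: StreamingSttConfig,
): { name: string; settings: unknown } | null {
  // Streaming disabled or no provider named: Voice Call stays on batch STT.
  if (!cfg.enabled || !cfg.provider) return null;
  const settings = cfg.providers?.[cfg.provider];
  if (settings === undefined) {
    throw new Error(`No settings for streaming provider "${cfg.provider}"`);
  }
  return { name: cfg.provider, settings };
}
```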
OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call
streaming STT, realtime voice, and memory embedding surfaces. xAI currently
maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT,