feat(providers): add streaming stt providers

2026-05-06 17:10:49 +00:00 · 2026-04-23 03:05:44 +01:00
parent 5b68092351
commit 51ed22e608
32 changed files with 2399 additions and 16 deletions
--- a/docs/tools/media-overview.md
+++ b/docs/tools/media-overview.md
@@ -31,11 +31,12 @@ This table shows which providers support which media capabilities across the pla
 | BytePlus   |       | Yes   |       |     |                     |                     |
 | ComfyUI    | Yes   | Yes   | Yes   |     |                     |                     |
 | Deepgram   |       |       |       |     | Yes                 |                     |
-| ElevenLabs |       |       |       | Yes |                     |                     |
+| ElevenLabs |       |       |       | Yes | Yes                 |                     |
 | fal        | Yes   | Yes   |       |     |                     |                     |
 | Google     | Yes   | Yes   | Yes   |     |                     | Yes                 |
 | Microsoft  |       |       |       | Yes |                     |                     |
 | MiniMax    | Yes   | Yes   | Yes   | Yes |                     |                     |
+| Mistral    |       |       |       |     | Yes                 |                     |
 | OpenAI     | Yes   | Yes   |       | Yes | Yes                 | Yes                 |
 | Qwen       |       | Yes   |       |     |                     |                     |
 | Runway     |       | Yes   |       |     |                     |                     |
@@ -51,6 +52,12 @@ Media understanding uses any vision-capable or audio-capable model registered in

 Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.

+Deepgram, ElevenLabs, Mistral, OpenAI, and xAI can all transcribe inbound
+audio through the batch `tools.media.audio` path when configured. The same
+five vendors also register Voice Call streaming STT providers, so live
+phone audio can be forwarded to the selected vendor without waiting for a
+completed recording.
+
 OpenAI maps to OpenClaw's image, video, batch TTS, batch STT, Voice Call
 streaming STT, realtime voice, and memory embedding surfaces. xAI currently
 maps to OpenClaw's image, video, search, code-execution, batch TTS, batch STT,