feat: add xai speech-to-text support

2026-05-06 16:30:57 +00:00 · 2026-04-23 00:46:19 +01:00
parent 2bec189174
commit 012841816d
14 changed files with 307 additions and 30 deletions
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -164,7 +164,7 @@ working option**:
     example through `agents.defaults.imageModel` or
     `openclaw infer image describe --model ollama/<vision-model>`.
   - Bundled fallback order:
-     - Audio: OpenAI → Groq → Deepgram → Google → Mistral
+     - Audio: OpenAI → Groq → xAI → Deepgram → Google → Mistral
     - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
     - Video: Google → Qwen → Moonshot

@@ -212,6 +212,7 @@ lists, OpenClaw can infer defaults:
 - `mistral`: **audio**
 - `zai`: **image**
 - `groq`: **audio**
+- `xai`: **audio**
 - `deepgram`: **audio**
 - Any `models.providers.<id>.models[]` catalog with an image-capable model:
  **image**
--- a/docs/providers/index.md
+++ b/docs/providers/index.md
@@ -82,6 +82,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
 ## Transcription providers

 - [Deepgram (audio transcription)](/providers/deepgram)
+- [xAI](/providers/xai#speech-to-text)

 ## Community tools

--- a/docs/providers/xai.md
+++ b/docs/providers/xai.md
@@ -68,25 +68,27 @@ current image-capable Grok refs in the bundled catalog.
 The bundled plugin maps xAI's current public API surface onto OpenClaw's shared
 provider and tool contracts where the behavior fits cleanly.

-| xAI capability             | OpenClaw surface                       | Status                                                              |
-| -------------------------- | -------------------------------------- | ------------------------------------------------------------------- |
-| Chat / Responses           | `xai/<model>` model provider           | Yes                                                                 |
-| Server-side web search     | `web_search` provider `grok`           | Yes                                                                 |
-| Server-side X search       | `x_search` tool                        | Yes                                                                 |
-| Server-side code execution | `code_execution` tool                  | Yes                                                                 |
-| Images                     | `image_generate`                       | Yes                                                                 |
-| Videos                     | `video_generate`                       | Yes                                                                 |
-| Batch text-to-speech       | `messages.tts.provider: "xai"` / `tts` | Yes                                                                 |
-| Streaming TTS              | —                                      | Not exposed; OpenClaw's TTS contract returns complete audio buffers |
-| Speech-to-text             | —                                      | Not exposed yet; needs a transcription provider surface             |
-| Realtime voice             | —                                      | Not exposed yet; different session/WebSocket contract               |
-| Files / batches            | Generic model API compatibility only   | Not a first-class OpenClaw tool                                     |
+| xAI capability             | OpenClaw surface                          | Status                                                              |
+| -------------------------- | ----------------------------------------- | ------------------------------------------------------------------- |
+| Chat / Responses           | `xai/<model>` model provider              | Yes                                                                 |
+| Server-side web search     | `web_search` provider `grok`              | Yes                                                                 |
+| Server-side X search       | `x_search` tool                           | Yes                                                                 |
+| Server-side code execution | `code_execution` tool                     | Yes                                                                 |
+| Images                     | `image_generate`                          | Yes                                                                 |
+| Videos                     | `video_generate`                          | Yes                                                                 |
+| Batch text-to-speech       | `messages.tts.provider: "xai"` / `tts`    | Yes                                                                 |
+| Streaming TTS              | —                                         | Not exposed; OpenClaw's TTS contract returns complete audio buffers |
+| Batch speech-to-text       | `tools.media.audio` / media understanding | Yes                                                                 |
+| Streaming speech-to-text   | —                                         | Not exposed; needs streaming transcription contract mapping         |
+| Realtime voice             | —                                         | Not exposed yet; different session/WebSocket contract               |
+| Files / batches            | Generic model API compatibility only      | Not a first-class OpenClaw tool                                     |

 <Note>
-OpenClaw uses xAI's REST image/video/TTS APIs for media generation and the
-Responses API for model, search, and code-execution tools. Features that need
-new OpenClaw contracts, such as streaming STT or Realtime voice sessions, are
-documented here as upstream capabilities rather than hidden plugin behavior.
+OpenClaw uses xAI's REST image/video/TTS/STT APIs for media generation,
+speech, and transcription, and the Responses API for model, search, and
+code-execution tools. Features that need new OpenClaw contracts, such as
+streaming STT or Realtime voice sessions, are documented here as upstream
+capabilities rather than hidden plugin behavior.
 </Note>

 ### Fast-mode mappings
@@ -239,6 +241,50 @@ Legacy aliases still normalize to the canonical bundled ids:

  </Accordion>

+  <Accordion title="Speech-to-text">
+    The bundled `xai` plugin registers batch speech-to-text through OpenClaw's
+    media-understanding transcription surface.
+
+    - Default model: `grok-stt`
+    - Endpoint: xAI REST `/v1/stt`
+    - Input path: multipart audio file upload
+    - Supported by OpenClaw wherever inbound audio transcription uses
+      `tools.media.audio`, including Discord voice-channel segments and
+      channel audio attachments
+
+    To force xAI for inbound audio transcription:
+
+    ```json5
+    {
+      tools: {
+        media: {
+          audio: {
+            models: [
+              {
+                type: "provider",
+                provider: "xai",
+                model: "grok-stt",
+              },
+            ],
+          },
+        },
+      },
+    }
+    ```
+
+    Language can be supplied through the shared audio media config or a per-call
+    transcription request. Prompt hints are accepted by the shared OpenClaw
+    surface, but the xAI REST STT integration only forwards file, model, and
+    language because those map cleanly to the current public xAI endpoint.
+
+    <Note>
+    xAI also offers streaming STT over `wss://api.x.ai/v1/stt`. OpenClaw's
+    bundled xAI plugin does not expose that yet; the current provider is batch
+    STT for file/segment transcription.
+    </Note>
+
+  </Accordion>
+
  <Accordion title="x_search configuration">
    The bundled xAI plugin exposes `x_search` as an OpenClaw tool for searching
    X (formerly Twitter) content via Grok.
@@ -316,9 +362,9 @@ Legacy aliases still normalize to the canonical bundled ids:
    - `grok-4.20-multi-agent-experimental-beta-0304` is not supported on the
      normal xAI provider path because it requires a different upstream API
      surface than the standard OpenClaw xAI transport.
-    - xAI STT and Realtime voice are not registered as OpenClaw providers yet.
-      They require transcription/session contracts rather than the existing
-      batch TTS provider shape.
+    - xAI streaming STT and Realtime voice are not registered as OpenClaw
+      providers yet. Batch xAI STT is registered through media understanding.
+      Streaming STT and Realtime voice need WebSocket/session contract mapping.
    - xAI image `quality`, image `mask`, and extra native-only aspect ratios are
      not exposed until the shared `image_generate` tool has corresponding
      cross-provider controls.
@@ -355,9 +401,10 @@ OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_TEST_QUIET=1 OPENCLAW_LIVE_IMAGE_GENERATION_P
 ```

 The provider-specific live file synthesizes normal TTS, telephony-friendly PCM
-TTS, text-to-image generation, and reference-image editing. The shared image
-live file verifies the same xAI provider through OpenClaw's runtime selection,
-fallback, normalization, and media attachment path.
+TTS, transcribes audio through xAI STT, generates text-to-image output, and
+edits a reference image. The shared image live file verifies the same xAI
+provider through OpenClaw's runtime selection, fallback, normalization, and
+media attachment path.

 ## Related

--- a/docs/tools/media-overview.md
+++ b/docs/tools/media-overview.md
@@ -41,7 +41,7 @@ This table shows which providers support which media capabilities across the pla
 | Runway     |       | Yes   |       |     |                     |                     |
 | Together   |       | Yes   |       |     |                     |                     |
 | Vydra      | Yes   | Yes   |       |     |                     |                     |
-| xAI        | Yes   | Yes   |       | Yes |                     |                     |
+| xAI        | Yes   | Yes   |       | Yes | Yes                 | Yes                 |

 <Note>
 Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
@@ -51,10 +51,10 @@ Media understanding uses any vision-capable or audio-capable model registered in

 Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.

-xAI currently maps to OpenClaw's image, video, search, code-execution, and
-batch TTS surfaces. xAI STT and Realtime voice are upstream capabilities, but
-they are not registered in OpenClaw until the shared transcription and realtime
-voice contracts can represent them.
+xAI currently maps to OpenClaw's image, video, search, code-execution, batch
+TTS, and batch STT surfaces. xAI streaming STT and Realtime voice are upstream
+capabilities, but they are not registered in OpenClaw until the shared
+streaming transcription and realtime voice contracts can represent them.

 ## Quick links