mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 15:40:44 +00:00
feat: add xai media providers
Add xAI image generation and text-to-speech provider support with docs, live tests, and guarded provider HTTP handling.

Thanks @KateWilkins.
@@ -1,5 +1,5 @@
 ---
-summary: "Generate and edit images using configured providers (OpenAI, Google Gemini, fal, MiniMax, ComfyUI, Vydra)"
+summary: "Generate and edit images using configured providers (OpenAI, Google Gemini, fal, MiniMax, ComfyUI, Vydra, xAI)"
 read_when:
   - Generating images via the agent
   - Configuring image generation providers and models
@@ -46,6 +46,7 @@ The agent calls `image_generate` automatically. No tool allow-listing needed —
 | MiniMax | `image-01` | Yes (subject reference) | `MINIMAX_API_KEY` or MiniMax OAuth (`minimax-portal`) |
 | ComfyUI | `workflow` | Yes (1 image, workflow-configured) | `COMFY_API_KEY` or `COMFY_CLOUD_API_KEY` for cloud |
 | Vydra | `grok-imagine` | No | `VYDRA_API_KEY` |
+| xAI | `grok-imagine-image` | Yes (up to 5 images) | `XAI_API_KEY` |

 Use `action: "list"` to inspect available providers and models at runtime:
@@ -115,13 +116,13 @@ Notes:

 ### Image editing

-OpenAI, Google, fal, MiniMax, and ComfyUI support editing reference images. Pass a reference image path or URL:
+OpenAI, Google, fal, MiniMax, ComfyUI, and xAI support editing reference images. Pass a reference image path or URL:

 ```
 "Generate a watercolor version of this photo" + image: "/path/to/photo.jpg"
 ```

-OpenAI and Google support up to 5 reference images via the `images` parameter. fal, MiniMax, and ComfyUI support 1.
+OpenAI, Google, and xAI support up to 5 reference images via the `images` parameter. fal, MiniMax, and ComfyUI support 1.

 ### OpenAI `gpt-image-2`
@@ -166,13 +167,29 @@ MiniMax image generation is available through both bundled MiniMax auth paths:

 ## Provider capabilities

-| Capability | OpenAI | Google | fal | MiniMax | ComfyUI | Vydra |
-| --------------------- | -------------------- | -------------------- | ------------------- | -------------------------- | ---------------------------------- | ------- |
-| Generate | Yes (up to 4) | Yes (up to 4) | Yes (up to 4) | Yes (up to 9) | Yes (workflow-defined outputs) | Yes (1) |
-| Edit/reference | Yes (up to 5 images) | Yes (up to 5 images) | Yes (1 image) | Yes (1 image, subject ref) | Yes (1 image, workflow-configured) | No |
-| Size control | Yes (up to 4K) | Yes | Yes | No | No | No |
-| Aspect ratio | No | Yes | Yes (generate only) | Yes | No | No |
-| Resolution (1K/2K/4K) | No | Yes | Yes | No | No | No |
+| Capability | OpenAI | Google | fal | MiniMax | ComfyUI | Vydra | xAI |
+| --------------------- | -------------------- | -------------------- | ------------------- | -------------------------- | ---------------------------------- | ------- | -------------------- |
+| Generate | Yes (up to 4) | Yes (up to 4) | Yes (up to 4) | Yes (up to 9) | Yes (workflow-defined outputs) | Yes (1) | Yes (up to 4) |
+| Edit/reference | Yes (up to 5 images) | Yes (up to 5 images) | Yes (1 image) | Yes (1 image, subject ref) | Yes (1 image, workflow-configured) | No | Yes (up to 5 images) |
+| Size control | Yes (up to 4K) | Yes | Yes | No | No | No | No |
+| Aspect ratio | No | Yes | Yes (generate only) | Yes | No | No | Yes |
+| Resolution (1K/2K/4K) | No | Yes | Yes | No | No | No | Yes (1K/2K) |
+
+### xAI `grok-imagine-image`
+
+The bundled xAI provider uses `/v1/images/generations` for prompt-only requests
+and `/v1/images/edits` when `image` or `images` is present.
+
+- Models: `xai/grok-imagine-image`, `xai/grok-imagine-image-pro`
+- Count: up to 4
+- References: one `image` or up to five `images`
+- Aspect ratios: `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `2:3`, `3:2`
+- Resolutions: `1K`, `2K`
+- Outputs: returned as OpenClaw-managed image attachments
+
+OpenClaw intentionally does not expose xAI-native `quality`, `mask`, `user`, or
+extra native-only aspect ratios until those controls exist in the shared
+cross-provider `image_generate` contract.
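For illustration, the generate-vs-edit routing described above can be sketched as follows. This is a minimal sketch only; the type and function names are assumptions for this example, not OpenClaw's internals.

```typescript
// Sketch of the documented routing rule: prompt-only requests go to
// /v1/images/generations, requests with any reference image go to
// /v1/images/edits. Names here are illustrative.
type XaiImageRequest = {
  prompt: string;
  n?: number;        // up to 4 outputs per the capability table
  image?: string;    // single reference image
  images?: string[]; // up to 5 reference images
};

function xaiImageEndpoint(
  req: XaiImageRequest,
  baseUrl = "https://api.x.ai/v1",
): string {
  // Any reference image routes the request to the edits endpoint.
  const hasReference = Boolean(req.image) || (req.images?.length ?? 0) > 0;
  return `${baseUrl}/images/${hasReference ? "edits" : "generations"}`;
}
```
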
## Related
@@ -183,5 +200,6 @@ MiniMax image generation is available through both bundled MiniMax auth paths:
|
||||
- [MiniMax](/providers/minimax) — MiniMax image provider setup
|
||||
- [OpenAI](/providers/openai) — OpenAI Images provider setup
|
||||
- [Vydra](/providers/vydra) — Vydra image, video, and speech setup
|
||||
- [xAI](/providers/xai) — Grok image, video, search, code execution, and TTS setup
|
||||
- [Configuration Reference](/gateway/configuration-reference#agent-defaults) — `imageGenerationModel` config
|
||||
- [Models](/concepts/models) — model configuration and failover
|
||||
|
||||
@@ -15,10 +15,10 @@ OpenClaw generates images, videos, and music, understands inbound media (images,

 | Capability | Tool | Providers | What it does |
 | -------------------- | ---------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
-| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra | Creates or edits images from text prompts or references |
+| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
 | Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
 | Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
-| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI | Converts outbound replies to spoken audio |
+| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
 | Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |

 ## Provider capability matrix
@@ -41,7 +41,7 @@ This table shows which providers support which media capabilities across the pla
 | Runway | | Yes | | | | |
 | Together | | Yes | | | | |
 | Vydra | Yes | Yes | | | | |
-| xAI | | Yes | | | | |
+| xAI | Yes | Yes | | Yes | | |

 <Note>
 Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
@@ -51,6 +51,11 @@ Media understanding uses any vision-capable or audio-capable model registered in

 Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.

+xAI currently maps to OpenClaw's image, video, search, code-execution, and
+batch TTS surfaces. xAI STT and Realtime voice are upstream capabilities, but
+they are not registered in OpenClaw until the shared transcription and realtime
+voice contracts can represent them.
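The background-task flow described above can be sketched roughly like this. The `TaskLedger` names are hypothetical, chosen for illustration; OpenClaw's real task-ledger API is not shown in this diff.

```typescript
// Illustrative sketch: submit returns a task immediately so the agent keeps
// replying; completion marks the task done so the agent can be woken to post
// the finished media back into the originating channel.
type MediaTask = {
  id: string;
  status: "running" | "done";
  resultUrl?: string;
};

class TaskLedger {
  private tasks = new Map<string, MediaTask>();

  // Called when video_generate / music_generate is submitted to a provider.
  submit(id: string): MediaTask {
    const task: MediaTask = { id, status: "running" };
    this.tasks.set(id, task);
    return task;
  }

  // Called when the provider finishes processing.
  complete(id: string, resultUrl: string): MediaTask | undefined {
    const task = this.tasks.get(id);
    if (task) {
      task.status = "done";
      task.resultUrl = resultUrl;
    }
    return task;
  }
}
```
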

 ## Quick links

 - [Image Generation](/tools/image-generation) -- generating and editing images
@@ -9,7 +9,7 @@ title: "Text-to-Speech"

 # Text-to-speech (TTS)

-OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, OpenAI, or xAI.
 It works anywhere OpenClaw can send audio.

 ## Supported services
@@ -19,6 +19,7 @@ It works anywhere OpenClaw can send audio.
 - **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
 - **MiniMax** (primary or fallback provider; uses the T2A v2 API)
 - **OpenAI** (primary or fallback provider; also used for summaries)
+- **xAI** (primary or fallback provider; uses the xAI TTS API)

 ### Microsoft speech notes
@@ -35,12 +36,13 @@ or ElevenLabs.

 ## Optional keys

-If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
+If you want OpenAI, ElevenLabs, Google Gemini, MiniMax, or xAI:

 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
 - `MINIMAX_API_KEY`
 - `OPENAI_API_KEY`
+- `XAI_API_KEY`

 Microsoft speech does **not** require an API key.
@@ -57,6 +59,7 @@ so that provider must also be authenticated if you enable summaries.
|
||||
- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
|
||||
- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
|
||||
- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
|
||||
- [xAI Text to Speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest)
|
||||
|
||||
## Is it enabled by default?
|
||||
|
||||
@@ -198,6 +201,33 @@ by the bundled Google image-generation provider. Resolution order is
 `messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
 `GEMINI_API_KEY` -> `GOOGLE_API_KEY`.

+### xAI primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "xai",
+      providers: {
+        xai: {
+          apiKey: "xai_api_key",
+          voiceId: "eve",
+          language: "en",
+          responseFormat: "mp3",
+          speed: 1.0,
+        },
+      },
+    },
+  },
+}
+```
+
+xAI TTS uses the same `XAI_API_KEY` path as the bundled Grok model provider.
+Resolution order is `messages.tts.providers.xai.apiKey` -> `XAI_API_KEY`.
+Current live voices are `ara`, `eve`, `leo`, `rex`, `sal`, and `una`; `eve` is
+the default. `language` accepts a BCP-47 tag or `auto`.
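A minimal sketch of that documented resolution order, config key first and then the environment variable. The helper name is an assumption for this example, not OpenClaw's actual code.

```typescript
// Resolution order per the docs above:
// messages.tts.providers.xai.apiKey -> XAI_API_KEY.
function resolveXaiTtsKey(
  providerConfig: { apiKey?: string },
  env: Record<string, string | undefined>,
): string | undefined {
  // Config-level key wins; the environment variable is the fallback.
  return providerConfig.apiKey ?? env.XAI_API_KEY;
}
```
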
 ### Disable Microsoft speech

 ```json5
@@ -300,6 +330,12 @@ Then run:
|
||||
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
|
||||
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
|
||||
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
|
||||
- `providers.xai.apiKey`: xAI TTS API key (env: `XAI_API_KEY`).
|
||||
- `providers.xai.baseUrl`: override the xAI TTS base URL (default `https://api.x.ai/v1`, env: `XAI_BASE_URL`).
|
||||
- `providers.xai.voiceId`: xAI voice id (default `eve`; current live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`).
|
||||
- `providers.xai.language`: BCP-47 language code or `auto` (default `en`).
|
||||
- `providers.xai.responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw` (default `mp3`).
|
||||
- `providers.xai.speed`: provider-native speed override.
|
||||
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
|
||||
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
|
||||
- `providers.microsoft.lang`: language code (e.g. `en-US`).
|
||||
@@ -335,7 +371,7 @@ Here you go.
 Available directive keys (when enabled):

 - `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
-- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
+- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax / xAI)
 - `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
 - `vol` / `volume` (MiniMax volume, 0-10)
@@ -397,6 +433,7 @@ These override `messages.tts.*` for that host.
 - 44.1kHz / 128kbps is the default balance for speech clarity.
 - **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
 - **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
+- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
 - **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
   - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
   - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
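The `responseFormat` handling described above can be sketched as a small guard that falls back to the documented `mp3` default. This is illustrative only, not OpenClaw's actual validation code.

```typescript
// Documented xAI responseFormat values; "mp3" is the default.
const XAI_RESPONSE_FORMATS = ["mp3", "wav", "pcm", "mulaw", "alaw"] as const;
type XaiResponseFormat = (typeof XAI_RESPONSE_FORMATS)[number];

function normalizeXaiResponseFormat(value?: string): XaiResponseFormat {
  // Unknown or missing values fall back to the documented default.
  const match = XAI_RESPONSE_FORMATS.find((f) => f === value);
  return match ?? "mp3";
}
```
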