feat: add xai media providers

Add xAI image generation and text-to-speech provider support with docs, live tests, and guarded provider HTTP handling.

Thanks @KateWilkins.
This commit is contained in:
KateWilkins
2026-04-23 00:07:39 +01:00
committed by GitHub
parent 386a0884d7
commit f342da5fcc
21 changed files with 1369 additions and 26 deletions


@@ -781,10 +781,11 @@ If you want to rely on env keys (e.g. exported in your `~/.profile`), run local
- Current bundled providers covered:
- `openai`
- `google`
- `xai`
- Optional narrowing:
- `OPENCLAW_LIVE_IMAGE_GENERATION_PROVIDERS="openai,google"`
- `OPENCLAW_LIVE_IMAGE_GENERATION_MODELS="openai/gpt-image-2,google/gemini-3.1-flash-image-preview"`
- `OPENCLAW_LIVE_IMAGE_GENERATION_CASES="google:flash-generate,google:pro-edit"`
- `OPENCLAW_LIVE_IMAGE_GENERATION_PROVIDERS="openai,google,xai"`
- `OPENCLAW_LIVE_IMAGE_GENERATION_MODELS="openai/gpt-image-2,google/gemini-3.1-flash-image-preview,xai/grok-imagine-image"`
- `OPENCLAW_LIVE_IMAGE_GENERATION_CASES="google:flash-generate,google:pro-edit,xai:default-generate,xai:default-edit"`
- Optional auth behavior:
- `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to force profile-store auth and ignore env-only overrides


@@ -63,6 +63,32 @@ they follow the same API shape.
current image-capable Grok refs in the bundled catalog.
</Tip>
## OpenClaw feature coverage
The bundled plugin maps xAI's current public API surface onto OpenClaw's shared
provider and tool contracts where the behavior fits cleanly.
| xAI capability | OpenClaw surface | Status |
| -------------------------- | -------------------------------------- | ------------------------------------------------------------------- |
| Chat / Responses | `xai/<model>` model provider | Yes |
| Server-side web search | `web_search` provider `grok` | Yes |
| Server-side X search | `x_search` tool | Yes |
| Server-side code execution | `code_execution` tool | Yes |
| Images | `image_generate` | Yes |
| Videos | `video_generate` | Yes |
| Batch text-to-speech | `messages.tts.provider: "xai"` / `tts` | Yes |
| Streaming TTS | — | Not exposed; OpenClaw's TTS contract returns complete audio buffers |
| Speech-to-text | — | Not exposed yet; needs a transcription provider surface |
| Realtime voice | — | Not exposed yet; different session/WebSocket contract |
| Files / batches | Generic model API compatibility only | Not a first-class OpenClaw tool |
<Note>
OpenClaw uses xAI's REST image/video/TTS APIs for media generation and the
Responses API for model, search, and code-execution tools. Features that need
new OpenClaw contracts, such as streaming STT or Realtime voice sessions, are
documented here as upstream capabilities rather than hidden plugin behavior.
</Note>
### Fast-mode mappings
`/fast on` or `agents.defaults.models["xai/<model>"].params.fastMode: true`
@@ -103,12 +129,17 @@ Legacy aliases still normalize to the canonical bundled ids:
`video_generate` tool.
- Default video model: `xai/grok-imagine-video`
- Modes: text-to-video, image-to-video, and remote video edit/extend flows
- Supports `aspectRatio` and `resolution`
- Modes: text-to-video, image-to-video, remote video edit, and remote video
extension
- Aspect ratios: `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `3:2`, `2:3`
- Resolutions: `480P`, `720P`
- Duration: 1-15 seconds for generation/image-to-video, 2-10 seconds for
extension
<Warning>
Local video buffers are not accepted. Use remote `http(s)` URLs for
video-reference and edit inputs.
video edit/extend inputs. Image-to-video accepts local image buffers because
OpenClaw can encode those as data URLs for xAI.
</Warning>
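The guard described in the warning can be sketched as a small input check. The function name below is illustrative, not OpenClaw's actual implementation:

```typescript
// Sketch of the video-input guard: edit/extend inputs must be remote
// http(s) URLs; local buffers and filesystem paths are rejected.
function assertVideoReference(ref: string): string {
  if (/^https?:\/\//i.test(ref)) {
    return ref; // remote reference: accepted as-is
  }
  throw new Error(
    "xAI video edit/extend requires a remote http(s) URL; local buffers and paths are rejected",
  );
}
```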
To use xAI as the default video provider:
@@ -132,6 +163,82 @@ Legacy aliases still normalize to the canonical bundled ids:
</Accordion>
<Accordion title="Image generation">
The bundled `xai` plugin registers image generation through the shared
`image_generate` tool.
- Default image model: `xai/grok-imagine-image`
- Additional model: `xai/grok-imagine-image-pro`
- Modes: text-to-image and reference-image edit
- Reference inputs: one `image` or up to five `images`
- Aspect ratios: `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `2:3`, `3:2`
- Resolutions: `1K`, `2K`
- Count: up to 4 images
OpenClaw asks xAI for `b64_json` image responses so generated media can be
stored and delivered through the normal channel attachment path. Local
reference images are converted to data URLs; remote `http(s)` references are
passed through.
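The reference handling above can be sketched as follows. This is an illustrative sketch, not OpenClaw's actual implementation; a real provider would read local file paths from disk before encoding:

```typescript
// Remote http(s) references pass through untouched; local bytes are
// converted to base64 data URLs for the xAI edits endpoint.
function normalizeImageReference(ref: string | Buffer, mime = "image/png"): string {
  if (typeof ref === "string" && /^https?:\/\//i.test(ref)) {
    return ref; // remote reference: forwarded as-is
  }
  // Assumes `ref` already holds the raw image bytes for this sketch.
  const bytes = typeof ref === "string" ? Buffer.from(ref) : ref;
  return `data:${mime};base64,${bytes.toString("base64")}`;
}
```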
To use xAI as the default image provider:
```json5
{
agents: {
defaults: {
imageGenerationModel: {
primary: "xai/grok-imagine-image",
},
},
},
}
```
<Note>
xAI also documents `quality`, `mask`, `user`, and additional native ratios
such as `1:2`, `2:1`, `9:20`, and `20:9`. OpenClaw forwards only the
shared cross-provider image controls today; unsupported native-only knobs
are intentionally not exposed through `image_generate`.
</Note>
</Accordion>
<Accordion title="Text-to-speech">
The bundled `xai` plugin registers text-to-speech through the shared `tts`
provider surface.
- Voices: `eve`, `ara`, `rex`, `sal`, `leo`, `una`
- Default voice: `eve`
- Formats: `mp3`, `wav`, `pcm`, `mulaw`, `alaw`
- Language: BCP-47 code or `auto`
- Speed: provider-native speed override
- Native Opus voice-note format is not supported
To use xAI as the default TTS provider:
```json5
{
messages: {
tts: {
provider: "xai",
providers: {
xai: {
voiceId: "eve",
},
},
},
},
}
```
<Note>
OpenClaw uses xAI's batch `/v1/tts` endpoint. xAI also offers streaming TTS
over WebSocket, but the OpenClaw speech provider contract currently expects
a complete audio buffer before reply delivery.
</Note>
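A batch TTS request built from the options above might look like this sketch. The request-body field names are assumptions for illustration; consult xAI's REST reference for the authoritative shape:

```typescript
interface XaiTtsOptions {
  voiceId?: string;        // eve, ara, rex, sal, leo, una
  language?: string;       // BCP-47 tag or "auto"
  responseFormat?: string; // mp3, wav, pcm, mulaw, alaw
  speed?: number;          // provider-native speed override
}

// Builds a request for the batch /v1/tts endpoint. Body field names
// are assumed, not confirmed against xAI's documentation.
function buildXaiTtsRequest(text: string, opts: XaiTtsOptions = {}) {
  return {
    url: "https://api.x.ai/v1/tts",
    body: {
      input: text,
      voice: opts.voiceId ?? "eve",              // default voice
      language: opts.language ?? "auto",
      response_format: opts.responseFormat ?? "mp3",
      ...(opts.speed !== undefined ? { speed: opts.speed } : {}),
    },
  };
}
```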
</Accordion>
<Accordion title="x_search configuration">
The bundled xAI plugin exposes `x_search` as an OpenClaw tool for searching
X (formerly Twitter) content via Grok.
@@ -209,6 +316,12 @@ Legacy aliases still normalize to the canonical bundled ids:
- `grok-4.20-multi-agent-experimental-beta-0304` is not supported on the
normal xAI provider path because it requires a different upstream API
surface than the standard OpenClaw xAI transport.
- xAI STT and Realtime voice are not registered as OpenClaw providers yet.
They require transcription/session contracts rather than the existing
batch TTS provider shape.
- xAI image `quality`, image `mask`, and extra native-only aspect ratios are
not exposed until the shared `image_generate` tool has corresponding
cross-provider controls.
</Accordion>
<Accordion title="Advanced notes">
@@ -229,6 +342,23 @@ Legacy aliases still normalize to the canonical bundled ids:
</Accordion>
</AccordionGroup>
## Live testing
The xAI media paths are covered by unit tests and opt-in live suites. The live
commands load secrets from your login shell, including `~/.profile`, before
probing `XAI_API_KEY`.
```bash
pnpm test extensions/xai
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_TEST_QUIET=1 pnpm test:live -- extensions/xai/xai.live.test.ts
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_TEST_QUIET=1 OPENCLAW_LIVE_IMAGE_GENERATION_PROVIDERS=xai pnpm test:live -- test/image-generation.runtime.live.test.ts
```
The provider-specific live file exercises normal TTS, telephony-friendly PCM
TTS, text-to-image generation, and reference-image editing. The shared image
live file verifies the same xAI provider through OpenClaw's runtime selection,
fallback, normalization, and media attachment path.
## Related
<CardGroup cols={2}>


@@ -1,5 +1,5 @@
---
summary: "Generate and edit images using configured providers (OpenAI, Google Gemini, fal, MiniMax, ComfyUI, Vydra)"
summary: "Generate and edit images using configured providers (OpenAI, Google Gemini, fal, MiniMax, ComfyUI, Vydra, xAI)"
read_when:
- Generating images via the agent
- Configuring image generation providers and models
@@ -46,6 +46,7 @@ The agent calls `image_generate` automatically. No tool allow-listing needed —
| MiniMax | `image-01` | Yes (subject reference) | `MINIMAX_API_KEY` or MiniMax OAuth (`minimax-portal`) |
| ComfyUI | `workflow` | Yes (1 image, workflow-configured) | `COMFY_API_KEY` or `COMFY_CLOUD_API_KEY` for cloud |
| Vydra | `grok-imagine` | No | `VYDRA_API_KEY` |
| xAI | `grok-imagine-image` | Yes (up to 5 images) | `XAI_API_KEY` |
Use `action: "list"` to inspect available providers and models at runtime:
@@ -115,13 +116,13 @@ Notes:
### Image editing
OpenAI, Google, fal, MiniMax, and ComfyUI support editing reference images. Pass a reference image path or URL:
OpenAI, Google, fal, MiniMax, ComfyUI, and xAI support editing reference images. Pass a reference image path or URL:
```
"Generate a watercolor version of this photo" + image: "/path/to/photo.jpg"
```
OpenAI and Google support up to 5 reference images via the `images` parameter. fal, MiniMax, and ComfyUI support 1.
OpenAI, Google, and xAI support up to 5 reference images via the `images` parameter. fal, MiniMax, and ComfyUI support 1.
### OpenAI `gpt-image-2`
@@ -166,13 +167,29 @@ MiniMax image generation is available through both bundled MiniMax auth paths:
## Provider capabilities
| Capability | OpenAI | Google | fal | MiniMax | ComfyUI | Vydra |
| --------------------- | -------------------- | -------------------- | ------------------- | -------------------------- | ---------------------------------- | ------- |
| Generate | Yes (up to 4) | Yes (up to 4) | Yes (up to 4) | Yes (up to 9) | Yes (workflow-defined outputs) | Yes (1) |
| Edit/reference | Yes (up to 5 images) | Yes (up to 5 images) | Yes (1 image) | Yes (1 image, subject ref) | Yes (1 image, workflow-configured) | No |
| Size control | Yes (up to 4K) | Yes | Yes | No | No | No |
| Aspect ratio | No | Yes | Yes (generate only) | Yes | No | No |
| Resolution (1K/2K/4K) | No | Yes | Yes | No | No | No |
| Capability | OpenAI | Google | fal | MiniMax | ComfyUI | Vydra | xAI |
| --------------------- | -------------------- | -------------------- | ------------------- | -------------------------- | ---------------------------------- | ------- | -------------------- |
| Generate | Yes (up to 4) | Yes (up to 4) | Yes (up to 4) | Yes (up to 9) | Yes (workflow-defined outputs) | Yes (1) | Yes (up to 4) |
| Edit/reference | Yes (up to 5 images) | Yes (up to 5 images) | Yes (1 image) | Yes (1 image, subject ref) | Yes (1 image, workflow-configured) | No | Yes (up to 5 images) |
| Size control | Yes (up to 4K) | Yes | Yes | No | No | No | No |
| Aspect ratio | No | Yes | Yes (generate only) | Yes | No | No | Yes |
| Resolution (1K/2K/4K) | No | Yes | Yes | No | No | No | Yes (1K/2K) |
### xAI `grok-imagine-image`
The bundled xAI provider uses `/v1/images/generations` for prompt-only requests
and `/v1/images/edits` when `image` or `images` is present.
- Models: `xai/grok-imagine-image`, `xai/grok-imagine-image-pro`
- Count: up to 4
- References: one `image` or up to five `images`
- Aspect ratios: `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `2:3`, `3:2`
- Resolutions: `1K`, `2K`
- Outputs: returned as OpenClaw-managed image attachments
OpenClaw intentionally does not expose xAI-native `quality`, `mask`, `user`, or
extra native-only aspect ratios until those controls exist in the shared
cross-provider `image_generate` contract.
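The endpoint routing described above reduces to a simple presence check. A minimal sketch, with an illustrative function name:

```typescript
// Prompt-only requests go to /v1/images/generations; any reference
// image (single `image` or an `images` array) routes to /v1/images/edits.
function pickXaiImageEndpoint(params: { image?: string; images?: string[] }): string {
  const hasRefs = Boolean(params.image) || (params.images?.length ?? 0) > 0;
  return hasRefs ? "/v1/images/edits" : "/v1/images/generations";
}
```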
## Related
@@ -183,5 +200,6 @@ MiniMax image generation is available through both bundled MiniMax auth paths:
- [MiniMax](/providers/minimax) — MiniMax image provider setup
- [OpenAI](/providers/openai) — OpenAI Images provider setup
- [Vydra](/providers/vydra) — Vydra image, video, and speech setup
- [xAI](/providers/xai) — Grok image, video, search, code execution, and TTS setup
- [Configuration Reference](/gateway/configuration-reference#agent-defaults) — `imageGenerationModel` config
- [Models](/concepts/models) — model configuration and failover


@@ -15,10 +15,10 @@ OpenClaw generates images, videos, and music, understands inbound media (images,
| Capability | Tool | Providers | What it does |
| -------------------- | ---------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra | Creates or edits images from text prompts or references |
| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI | Converts outbound replies to spoken audio |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |
## Provider capability matrix
@@ -41,7 +41,7 @@ This table shows which providers support which media capabilities across the pla
| Runway | | Yes | | | | |
| Together | | Yes | | | | |
| Vydra | Yes | Yes | | | | |
| xAI | | Yes | | | | |
| xAI | Yes | Yes | | Yes | | |
<Note>
Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
@@ -51,6 +51,11 @@ Media understanding uses any vision-capable or audio-capable model registered in
Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
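The submit-then-wake pattern above can be sketched as follows. The ledger and callback names are illustrative, not OpenClaw internals:

```typescript
type JobStatus = "running" | "done";

// Registers the job in the ledger and returns a task ID immediately;
// the provider's completion callback later marks the job done (which
// is where OpenClaw would wake the agent to deliver the media).
function submitMediaJob(
  ledger: Map<string, JobStatus>,
  run: (onDone: () => void) => void,
): string {
  const taskId = `task-${ledger.size + 1}`;
  ledger.set(taskId, "running");
  run(() => ledger.set(taskId, "done"));
  return taskId;
}
```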
xAI currently maps to OpenClaw's image, video, search, code-execution, and
batch TTS surfaces. xAI STT and Realtime voice are upstream capabilities, but
they are not registered in OpenClaw until the shared transcription and realtime
voice contracts can represent them.
## Quick links
- [Image Generation](/tools/image-generation) -- generating and editing images


@@ -9,7 +9,7 @@ title: "Text-to-Speech"
# Text-to-speech (TTS)
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, or OpenAI.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Microsoft, MiniMax, OpenAI, or xAI.
It works anywhere OpenClaw can send audio.
## Supported services
@@ -19,6 +19,7 @@ It works anywhere OpenClaw can send audio.
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
- **xAI** (primary or fallback provider; uses the xAI TTS API)
### Microsoft speech notes
@@ -35,12 +36,13 @@ or ElevenLabs.
## Optional keys
If you want OpenAI, ElevenLabs, Google Gemini, or MiniMax:
If you want OpenAI, ElevenLabs, Google Gemini, MiniMax, or xAI:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `MINIMAX_API_KEY`
- `OPENAI_API_KEY`
- `XAI_API_KEY`
Microsoft speech does **not** require an API key.
@@ -57,6 +59,7 @@ so that provider must also be authenticated if you enable summaries.
- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
- [xAI Text to Speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest)
## Is it enabled by default?
@@ -198,6 +201,33 @@ by the bundled Google image-generation provider. Resolution order is
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
### xAI primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "xai",
providers: {
xai: {
apiKey: "xai_api_key",
voiceId: "eve",
language: "en",
responseFormat: "mp3",
speed: 1.0,
},
},
},
},
}
```
xAI TTS uses the same `XAI_API_KEY` path as the bundled Grok model provider.
Resolution order is `messages.tts.providers.xai.apiKey` -> `XAI_API_KEY`.
Current live voices are `ara`, `eve`, `leo`, `rex`, `sal`, and `una`; `eve` is
the default. `language` accepts a BCP-47 tag or `auto`.
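The resolution order stated above is a two-step fallback. A minimal sketch, assuming the config value and environment map are already in hand:

```typescript
// messages.tts.providers.xai.apiKey wins; otherwise fall back to the
// XAI_API_KEY environment variable shared with the Grok model provider.
function resolveXaiTtsKey(
  configKey: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  return configKey ?? env.XAI_API_KEY;
}
```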
### Disable Microsoft speech
```json5
@@ -300,6 +330,12 @@ Then run:
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted.
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.xai.apiKey`: xAI TTS API key (env: `XAI_API_KEY`).
- `providers.xai.baseUrl`: override the xAI TTS base URL (default `https://api.x.ai/v1`, env: `XAI_BASE_URL`).
- `providers.xai.voiceId`: xAI voice id (default `eve`; current live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`).
- `providers.xai.language`: BCP-47 language code or `auto` (default `en`).
- `providers.xai.responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw` (default `mp3`).
- `providers.xai.speed`: provider-native speed override.
- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
- `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -335,7 +371,7 @@ Here you go.
Available directive keys (when enabled):
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `minimax`, or `microsoft`; requires `allowProvider: true`)
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax)
- `voice` (OpenAI voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / MiniMax / xAI)
- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
@@ -397,6 +433,7 @@ These override `messages.tts.*` for that host.
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).