| summary | read_when | title |
|---|---|---|
| Unified landing page for media generation, understanding, and speech capabilities | | Media Overview |
Media Generation and Understanding
OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
Capabilities at a glance
| Capability | Tool | Providers | What it does |
|---|---|---|---|
| Image generation | image_generate | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | video_generate | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | music_generate | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | tts | ElevenLabs, Microsoft, MiniMax, OpenAI, xAI | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |
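
The rule that a tool appears only when at least one backing provider is configured can be sketched roughly as follows. This is an illustration, not OpenClaw's actual code: `MediaTool`, `TOOL_PROVIDERS`, and `availableTools` are hypothetical names, and the provider lists are copied from the table above.

```typescript
// Hypothetical names throughout; a sketch of the gating rule, not OpenClaw's API.
type MediaTool = "image_generate" | "video_generate" | "music_generate" | "tts";

// Backing providers per tool, as listed in the capabilities table.
const TOOL_PROVIDERS: Record<MediaTool, readonly string[]> = {
  image_generate: ["ComfyUI", "fal", "Google", "MiniMax", "OpenAI", "Vydra", "xAI"],
  video_generate: [
    "Alibaba", "BytePlus", "ComfyUI", "fal", "Google", "MiniMax",
    "OpenAI", "Qwen", "Runway", "Together", "Vydra", "xAI",
  ],
  music_generate: ["ComfyUI", "Google", "MiniMax"],
  tts: ["ElevenLabs", "Microsoft", "MiniMax", "OpenAI", "xAI"],
};

// A tool is exposed only if at least one of its providers is configured.
function availableTools(configured: ReadonlySet<string>): MediaTool[] {
  return (Object.keys(TOOL_PROVIDERS) as MediaTool[]).filter((tool) =>
    TOOL_PROVIDERS[tool].some((provider) => configured.has(provider)),
  );
}
```

Under these assumptions, configuring only ElevenLabs would expose just `tts`, while configuring only ComfyUI would expose the image, video, and music tools but not `tts`.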
Provider capability matrix
This table shows which providers support which media capabilities across the platform.
| Provider | Image | Video | Music | TTS | STT / Transcription | Media Understanding |
|---|---|---|---|---|---|---|
| Alibaba | | Yes | | | | |
| BytePlus | | Yes | | | | |
| ComfyUI | Yes | Yes | Yes | | | |
| Deepgram | | | | | Yes | |
| ElevenLabs | | | | Yes | | |
| fal | Yes | Yes | | | | |
| Google | Yes | Yes | Yes | | | Yes |
| Microsoft | | | | Yes | | |
| MiniMax | Yes | Yes | Yes | Yes | | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes |
| Qwen | | Yes | | | | |
| Runway | | Yes | | | | |
| Together | | Yes | | | | |
| Vydra | Yes | Yes | | | | |
| xAI | Yes | Yes | | Yes | | |
How async generation works
Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls video_generate or music_generate, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
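
The submit, track, and wake flow described above can be modeled with a small in-memory ledger. This is a minimal sketch under stated assumptions: `TaskLedger`, `MediaTask`, `submit`, `complete`, and the `wakeAgent` callback are hypothetical names, and the real task ledger and provider callbacks may look quite different.

```typescript
// Hypothetical names throughout; a sketch of the flow, not OpenClaw's implementation.
type TaskStatus = "running" | "done";

interface MediaTask {
  id: string;
  tool: "video_generate" | "music_generate";
  channel: string; // channel where the finished media should be posted
  status: TaskStatus;
  resultUrl?: string;
}

class TaskLedger {
  private tasks = new Map<string, MediaTask>();
  private nextId = 1;
  private wakeAgent: (task: MediaTask) => void;

  constructor(wakeAgent: (task: MediaTask) => void) {
    this.wakeAgent = wakeAgent;
  }

  // Record the job and return its task ID immediately; the provider keeps working.
  submit(tool: MediaTask["tool"], channel: string): string {
    const id = `task-${this.nextId++}`;
    this.tasks.set(id, { id, tool, channel, status: "running" });
    return id;
  }

  // Called when the provider reports completion; wakes the agent exactly once.
  complete(id: string, resultUrl: string): void {
    const task = this.tasks.get(id);
    if (!task || task.status === "done") return;
    task.status = "done";
    task.resultUrl = resultUrl;
    this.wakeAgent(task);
  }

  status(id: string): TaskStatus | undefined {
    return this.tasks.get(id)?.status;
  }
}

// Usage: the agent gets a task ID right away and is woken later with the result.
const posted: string[] = [];
const ledger = new TaskLedger((task) => {
  posted.push(`${task.channel}: ${task.resultUrl}`);
});
const taskId = ledger.submit("video_generate", "#general");
// ... the agent keeps handling other messages while the job runs ...
ledger.complete(taskId, "https://example.invalid/clip.mp4");
```

Image generation and TTS would not pass through a ledger like this, since they complete inline with the reply.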
xAI currently maps to OpenClaw's image, video, search, code-execution, and batch TTS surfaces. xAI STT and Realtime voice are upstream capabilities, but they are not registered in OpenClaw until the shared transcription and realtime voice contracts can represent them.
Quick links
- Image Generation -- generating and editing images
- Video Generation -- text-to-video, image-to-video, and video-to-video
- Music Generation -- creating music and audio tracks
- Text-to-Speech -- converting replies to spoken audio
- Media Understanding -- understanding inbound images, audio, and video