openclaw/docs/tools/media-overview.md

---
summary: "Image, video, music, speech, and media-understanding capabilities at a glance"
read_when:
  - Looking for an overview of OpenClaw's media capabilities
  - Deciding which media provider to configure
  - Understanding how async media generation works
title: "Media overview"
sidebarTitle: "Media overview"
---

OpenClaw generates images, videos, and music, understands inbound media
(images, audio, video), and speaks replies aloud with text-to-speech. All
media capabilities are tool-driven: the agent decides when to use them based
on the conversation, and each tool only appears when at least one backing
provider is configured.

## Capabilities

<CardGroup cols={2}>
  <Card title="Image generation" href="/tools/image-generation" icon="image">
    Create and edit images from text prompts or reference images via
    `image_generate`. Synchronous — completes inline with the reply.
  </Card>
  <Card title="Video generation" href="/tools/video-generation" icon="video">
    Text-to-video, image-to-video, and video-to-video via `video_generate`.
    Async — runs in the background and posts the result when ready.
  </Card>
  <Card title="Music generation" href="/tools/music-generation" icon="music">
    Generate music or audio tracks via `music_generate`. Async on shared
    providers; ComfyUI workflow path runs synchronously.
  </Card>
  <Card title="Text-to-speech" href="/tools/tts" icon="microphone">
    Convert outbound replies to spoken audio via the `tts` tool plus
    `messages.tts` config. Synchronous.
  </Card>
  <Card title="Media understanding" href="/nodes/media-understanding" icon="eye">
    Summarize inbound images, audio, and video using vision-capable model
    providers and dedicated media-understanding plugins.
  </Card>
  <Card title="Speech-to-text" href="/nodes/audio" icon="ear-listen">
    Transcribe inbound voice messages through batch STT or Voice Call
    streaming STT providers.
  </Card>
</CardGroup>

## Provider capability matrix

| Provider    | Image | Video | Music | TTS | STT | Realtime voice | Media understanding |
| ----------- | :---: | :---: | :---: | :-: | :-: | :------------: | :-----------------: |
| Alibaba     |       |   ✓   |       |     |     |                |                     |
| BytePlus    |       |   ✓   |       |     |     |                |                     |
| ComfyUI     |   ✓   |   ✓   |   ✓   |     |     |                |                     |
| DeepInfra   |   ✓   |   ✓   |       |  ✓  |  ✓  |                |          ✓          |
| Deepgram    |       |       |       |     |  ✓  |       ✓        |                     |
| ElevenLabs  |       |       |       |  ✓  |  ✓  |                |                     |
| fal         |   ✓   |   ✓   |       |     |     |                |                     |
| Google      |   ✓   |   ✓   |   ✓   |  ✓  |     |       ✓        |          ✓          |
| Gradium     |       |       |       |  ✓  |     |                |                     |
| Local CLI   |       |       |       |  ✓  |     |                |                     |
| Microsoft   |       |       |       |  ✓  |     |                |                     |
| MiniMax     |   ✓   |   ✓   |   ✓   |  ✓  |     |                |                     |
| Mistral     |       |       |       |     |  ✓  |                |                     |
| OpenAI      |   ✓   |   ✓   |       |  ✓  |  ✓  |       ✓        |          ✓          |
| OpenRouter  |   ✓   |   ✓   |       |  ✓  |     |                |          ✓          |
| Qwen        |       |   ✓   |       |     |     |                |                     |
| Runway      |       |   ✓   |       |     |     |                |                     |
| SenseAudio  |       |       |       |     |  ✓  |                |                     |
| Together    |       |   ✓   |       |     |     |                |                     |
| Vydra       |   ✓   |   ✓   |       |  ✓  |     |                |                     |
| xAI         |   ✓   |   ✓   |       |  ✓  |  ✓  |                |          ✓          |
| Xiaomi MiMo |   ✓   |       |       |  ✓  |     |                |          ✓          |

<Note>
Media understanding uses any vision-capable or audio-capable model registered
in your provider config. The matrix above lists providers with dedicated
media-understanding support; most multimodal LLM providers (Anthropic, Google,
OpenAI, etc.) can also understand inbound media when configured as the active
reply model.
</Note>

## Async vs synchronous

| Capability      | Mode         | Why                                                                |
| --------------- | ------------ | ------------------------------------------------------------------ |
| Image           | Synchronous  | Provider responses return in seconds; completes inline with reply. |
| Text-to-speech  | Synchronous  | Provider responses return in seconds; attached to the reply audio. |
| Video           | Asynchronous | Provider processing takes 30 s to several minutes.                 |
| Music (shared)  | Asynchronous | Same provider-processing characteristic as video.                  |
| Music (ComfyUI) | Synchronous  | Local workflow runs inline against the configured ComfyUI server.  |

For async tools, OpenClaw submits the request to the provider, returns a task
id immediately, and tracks the job in the task ledger. The agent continues
responding to other messages while the job runs. When the provider finishes,
OpenClaw wakes the agent so it can post the finished media back into the
original channel.

## Speech-to-text and Voice Call

Deepgram, DeepInfra, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
inbound audio through the batch `tools.media.audio` path when configured.
Channel plugins that preflight a voice note for mention gating or command
parsing mark the transcribed attachment on the inbound context, so the shared
media-understanding pass reuses that transcript instead of making a second
STT call for the same audio.

Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.

## Provider mappings (how vendors split across surfaces)

<AccordionGroup>
  <Accordion title="Google">
    Image, video, music, batch TTS, backend realtime voice, and
    media-understanding surfaces.
  </Accordion>
  <Accordion title="OpenAI">
    Image, video, batch TTS, batch STT, Voice Call streaming STT, backend
    realtime voice, and memory-embedding surfaces.
  </Accordion>
  <Accordion title="DeepInfra">
    Chat/model routing, image generation/editing, text-to-video, batch TTS,
    batch STT, image media understanding, and memory-embedding surfaces.
    DeepInfra-native rerank/classification/object-detection models are not
    registered until OpenClaw has dedicated provider contracts for those
    categories.
  </Accordion>
  <Accordion title="xAI">
    Image, video, search, code-execution, batch TTS, batch STT, and Voice
    Call streaming STT. xAI Realtime voice is an upstream capability but is
    not registered in OpenClaw until the shared realtime-voice contract can
    represent it.
  </Accordion>
</AccordionGroup>

## Related

- [Image generation](/tools/image-generation)
- [Video generation](/tools/video-generation)
- [Music generation](/tools/music-generation)
- [Text-to-speech](/tools/tts)
- [Media understanding](/nodes/media-understanding)
- [Audio nodes](/nodes/audio)