openclaw/docs/tools/media-overview.md at v2026.4.25-beta.9

mirror of https://github.com/openclaw/openclaw.git synced 2026-05-06 14:30:45 +00:00

Files

Vincent Koc d1502c2ba1 docs(media-overview): rewrite around CardGroup, sync/async split, and AZ providers

The media overview was a 91-line page that opened with a redundant
Title-Case body H1 ('# Media Generation and Understanding'), then
mixed a capability table, a Yes/Yes/Yes provider matrix, dense prose
about async behaviour and STT/Voice Call surfaces, plus duplicate
'Quick links' and 'Related' sections at the end.

Restructure for scan-first reading without losing any content:

- Drop the redundant body H1; lead with a one-paragraph summary.
- Replace the 'Capabilities at a glance' table with a CardGroup of six
  entry cards (Image / Video / Music / TTS / Media understanding / STT)
  each linking directly to its dedicated page. Mode (sync/async) is
  noted on the card so readers see latency expectations up front.
- Convert the provider matrix to checkmarks for readability and align
  the column header names. Provider rows already alphabetized.
- Pull async vs synchronous behaviour into a 5-row table that names
  why each capability is sync or async, then keep the operator-facing
  paragraph that explains task-id handoff.
- Move the long 'Google maps to ... OpenAI maps to ... xAI maps to ...'
  paragraph into a per-vendor AccordionGroup so each mapping is a
  collapsible panel instead of one large prose block.
- Drop duplicate 'Quick links' section in favour of a single Related
  list, sentence-cased to match the rest of the docs.

2026-04-25 22:20:35 -07:00

6.9 KiB

Raw Permalink Blame History

summary, read_when, title, sidebarTitle

summary

read_when

title

sidebarTitle

Image, video, music, speech, and media-understanding capabilities at a glance

Looking for an overview of OpenClaw's media capabilities

Deciding which media provider to configure

Understanding how async media generation works

Media overview

OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.

Capabilities

Create and edit images from text prompts or reference images via `image_generate`. Synchronous — completes inline with the reply. Text-to-video, image-to-video, and video-to-video via `video_generate`. Async — runs in the background and posts the result when ready. Generate music or audio tracks via `music_generate`. Async on shared providers; ComfyUI workflow path runs synchronously. Convert outbound replies to spoken audio via the `tts` tool plus `messages.tts` config. Synchronous. Summarize inbound images, audio, and video using vision-capable model providers and dedicated media-understanding plugins. Transcribe inbound voice messages through batch STT or Voice Call streaming STT providers.

Provider capability matrix

Provider	Image	Video	Music	TTS	STT	Realtime voice	Media understanding
Alibaba		✓
BytePlus		✓
ComfyUI	✓	✓	✓
Deepgram					✓	✓
ElevenLabs				✓	✓
fal	✓	✓
Google	✓	✓	✓	✓		✓	✓
Gradium				✓
Local CLI				✓
Microsoft				✓
MiniMax	✓	✓	✓	✓
Mistral					✓
OpenAI	✓	✓		✓	✓	✓	✓
Qwen		✓
Runway		✓
SenseAudio					✓
Together		✓
Vydra	✓	✓		✓
xAI	✓	✓		✓	✓		✓
Xiaomi MiMo	✓			✓			✓

Media understanding uses any vision-capable or audio-capable model registered in your provider config. The matrix above lists providers with dedicated media-understanding support; most multimodal LLM providers (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.

Async vs synchronous

Capability	Mode	Why
Image	Synchronous	Provider responses return in seconds; completes inline with reply.
Text-to-speech	Synchronous	Provider responses return in seconds; attached to the reply audio.
Video	Asynchronous	Provider processing takes 30 s to several minutes.
Music (shared)	Asynchronous	Same provider-processing characteristic as video.
Music (ComfyUI)	Synchronous	Local workflow runs inline against the configured ComfyUI server.

For async tools, OpenClaw submits the request to the provider, returns a task id immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel.

Speech-to-text and Voice Call

Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe inbound audio through the batch tools.media.audio path when configured. Channel plugins that preflight a voice note for mention gating or command parsing mark the transcribed attachment on the inbound context, so the shared media-understanding pass reuses that transcript instead of making a second STT call for the same audio.

Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording.

Provider mappings (how vendors split across surfaces)

Image, video, music, batch TTS, backend realtime voice, and media-understanding surfaces. Image, video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice, and memory-embedding surfaces. Image, video, search, code-execution, batch TTS, batch STT, and Voice Call streaming STT. xAI Realtime voice is an upstream capability but is not registered in OpenClaw until the shared realtime-voice contract can represent it.

6.9 KiB Raw Permalink Blame History