From d1502c2ba18acfe187e63bb59f13f8849d5a6c49 Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Sat, 25 Apr 2026 22:20:28 -0700
Subject: [PATCH] docs(media-overview): rewrite around CardGroup, sync/async
 split, and A-Z providers

The media overview was a 91-line page that opened with a redundant Title-Case
body H1 ('# Media Generation and Understanding'), then mixed a capability
table, a Yes/Yes/Yes provider matrix, dense prose about async behaviour and
STT/Voice Call surfaces, plus duplicate 'Quick links' and 'Related' sections
at the end.

Restructure for scan-first reading without losing any content:

- Drop the redundant body H1; lead with a one-paragraph summary.
- Replace the 'Capabilities at a glance' table with a CardGroup of six entry
  cards (Image / Video / Music / TTS / Media understanding / STT), each
  linking directly to its dedicated page. Mode (sync/async) is noted on the
  card so readers see latency expectations up front.
- Convert the provider matrix to checkmarks for readability and align the
  column header names. Provider rows already alphabetized.
- Pull async vs synchronous behaviour into a 5-row table that names why each
  capability is sync or async, then keep the operator-facing paragraph that
  explains task-id handoff.
- Move the long 'Google maps to ... OpenAI maps to ... xAI maps to ...'
  paragraph into a per-vendor AccordionGroup so each mapping is a collapsible
  panel instead of one large prose block.
- Drop the duplicate 'Quick links' section in favour of a single Related
  list, sentence-cased to match the rest of the docs.
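For reviewers unfamiliar with the task-id handoff the kept paragraph
describes, the flow reduces to: submit to the provider, get a task id back
immediately, track it in a task ledger, and post the media when the provider
finishes. A minimal sketch of that flow (all names hypothetical, not
OpenClaw's actual API):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class TaskLedger:
    """Tracks in-flight async media jobs by task id (illustrative only)."""
    _next_id: int = 1
    _jobs: Dict[str, Callable[[str], None]] = field(default_factory=dict)

    def submit(self, kind: str, on_done: Callable[[str], None]) -> str:
        # Submit the job to the provider, record a completion callback,
        # and hand a task id back to the agent immediately.
        task_id = f"{kind}-{self._next_id}"
        self._next_id += 1
        self._jobs[task_id] = on_done
        return task_id

    def provider_finished(self, task_id: str, media_url: str) -> None:
        # Provider signals completion: pop the job and wake the agent so it
        # can post the finished media back into the original channel.
        self._jobs.pop(task_id)(media_url)

posted = []
ledger = TaskLedger()
tid = ledger.submit("video", lambda url: posted.append(url))
# The agent keeps replying to other messages while the job runs...
ledger.provider_finished(tid, "https://cdn.example/video.mp4")
```

The sketch only illustrates why video/music are async while image/TTS stay
inline: the submit call returns before the provider does any work.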
---
 docs/tools/media-overview.md | 157 ++++++++++++++++++++++-------------
 1 file changed, 100 insertions(+), 57 deletions(-)

diff --git a/docs/tools/media-overview.md b/docs/tools/media-overview.md
index 1e29a92e5fa..1c3f69683dc 100644
--- a/docs/tools/media-overview.md
+++ b/docs/tools/media-overview.md
@@ -1,87 +1,128 @@
 ---
-summary: "Unified landing page for media generation, understanding, and speech capabilities"
+summary: "Image, video, music, speech, and media-understanding capabilities at a glance"
 read_when:
-  - Looking for an overview of media capabilities
+  - Looking for an overview of OpenClaw's media capabilities
   - Deciding which media provider to configure
   - Understanding how async media generation works
 title: "Media overview"
+sidebarTitle: "Media overview"
 ---

-# Media Generation and Understanding
+OpenClaw generates images, videos, and music, understands inbound media
+(images, audio, video), and speaks replies aloud with text-to-speech. All
+media capabilities are tool-driven: the agent decides when to use them based
+on the conversation, and each tool only appears when at least one backing
+provider is configured.
-OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
+## Capabilities
-## Capabilities at a glance
-
-| Capability           | Tool             | Providers                                                                                    | What it does                                            |
-| -------------------- | ---------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
-| Image generation     | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI                                            | Creates or edits images from text prompts or references |
-| Video generation     | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos    |
-| Music generation     | `music_generate` | ComfyUI, Google, MiniMax                                                                     | Creates music or audio tracks from text prompts         |
-| Text-to-speech (TTS) | `tts`            | ElevenLabs, Google, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, Xiaomi MiMo  | Converts outbound replies to spoken audio               |
-| Media understanding  | (automatic)      | Any vision/audio-capable model provider, plus CLI fallbacks                                  | Summarizes inbound images, audio, and video             |
+<CardGroup>
+  <Card title="Image generation" href="/tools/image-generation">
+    Create and edit images from text prompts or reference images via
+    `image_generate`. Synchronous — completes inline with the reply.
+  </Card>
+  <Card title="Video generation" href="/tools/video-generation">
+    Text-to-video, image-to-video, and video-to-video via `video_generate`.
+    Async — runs in the background and posts the result when ready.
+  </Card>
+  <Card title="Music generation" href="/tools/music-generation">
+    Generate music or audio tracks via `music_generate`. Async on shared
+    providers; ComfyUI workflow path runs synchronously.
+  </Card>
+  <Card title="Text-to-speech" href="/tools/tts">
+    Convert outbound replies to spoken audio via the `tts` tool plus
+    `messages.tts` config. Synchronous.
+  </Card>
+  <Card title="Media understanding" href="/nodes/media-understanding">
+    Summarize inbound images, audio, and video using vision-capable model
+    providers and dedicated media-understanding plugins.
+  </Card>
+  <Card title="Speech-to-text" href="/nodes/audio">
+    Transcribe inbound voice messages through batch STT or Voice Call
+    streaming STT providers.
+  </Card>
+</CardGroup>

 ## Provider capability matrix

-This table shows which providers support which media capabilities across the platform.
-
-| Provider    | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding |
-| ----------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- |
-| Alibaba     |       | Yes   |       |     |                     |                |                     |
-| BytePlus    |       | Yes   |       |     |                     |                |                     |
-| ComfyUI     | Yes   | Yes   | Yes   |     |                     |                |                     |
-| Deepgram    |       |       |       |     | Yes                 | Yes            |                     |
-| ElevenLabs  |       |       |       | Yes | Yes                 |                |                     |
-| fal         | Yes   | Yes   |       |     |                     |                |                     |
-| Google      | Yes   | Yes   | Yes   | Yes |                     | Yes            | Yes                 |
-| Gradium     |       |       |       | Yes |                     |                |                     |
-| Local CLI   |       |       |       | Yes |                     |                |                     |
-| Microsoft   |       |       |       | Yes |                     |                |                     |
-| MiniMax     | Yes   | Yes   | Yes   | Yes |                     |                |                     |
-| Mistral     |       |       |       |     | Yes                 |                |                     |
-| OpenAI      | Yes   | Yes   |       | Yes | Yes                 | Yes            | Yes                 |
-| Qwen        |       | Yes   |       |     |                     |                |                     |
-| Runway      |       | Yes   |       |     |                     |                |                     |
-| SenseAudio  |       |       |       |     | Yes                 |                |                     |
-| Together    |       | Yes   |       |     |                     |                |                     |
-| Vydra       | Yes   | Yes   |       | Yes |                     |                |                     |
-| xAI         | Yes   | Yes   |       | Yes | Yes                 |                | Yes                 |
-| Xiaomi MiMo | Yes   |       |       | Yes |                     |                | Yes                 |
+| Provider    | Image | Video | Music | TTS | STT | Realtime voice | Media understanding |
+| ----------- | :---: | :---: | :---: | :-: | :-: | :------------: | :-----------------: |
+| Alibaba     |       |   ✓   |       |     |     |                |                     |
+| BytePlus    |       |   ✓   |       |     |     |                |                     |
+| ComfyUI     |   ✓   |   ✓   |   ✓   |     |     |                |                     |
+| Deepgram    |       |       |       |     |  ✓  |       ✓        |                     |
+| ElevenLabs  |       |       |       |  ✓  |  ✓  |                |                     |
+| fal         |   ✓   |   ✓   |       |     |     |                |                     |
+| Google      |   ✓   |   ✓   |   ✓   |  ✓  |     |       ✓        |          ✓          |
+| Gradium     |       |       |       |  ✓  |     |                |                     |
+| Local CLI   |       |       |       |  ✓  |     |                |                     |
+| Microsoft   |       |       |       |  ✓  |     |                |                     |
+| MiniMax     |   ✓   |   ✓   |   ✓   |  ✓  |     |                |                     |
+| Mistral     |       |       |       |     |  ✓  |                |                     |
+| OpenAI      |   ✓   |   ✓   |       |  ✓  |  ✓  |       ✓        |          ✓          |
+| Qwen        |       |   ✓   |       |     |     |                |                     |
+| Runway      |       |   ✓   |       |     |     |                |                     |
+| SenseAudio  |       |       |       |     |  ✓  |                |                     |
+| Together    |       |   ✓   |       |     |     |                |                     |
+| Vydra       |   ✓   |   ✓   |       |  ✓  |     |                |                     |
+| xAI         |   ✓   |   ✓   |       |  ✓  |  ✓  |                |          ✓          |
+| Xiaomi MiMo |   ✓   |       |       |  ✓  |     |                |          ✓          |

-Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
+Media understanding uses any vision-capable or audio-capable model registered
+in your provider config. The matrix above lists providers with dedicated
+media-understanding support; most multimodal LLM providers (Anthropic, Google,
+OpenAI, etc.) can also understand inbound media when configured as the active
+reply model.

-## How async generation works
+## Async vs synchronous

-Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
+| Capability      | Mode         | Why                                                                |
+| --------------- | ------------ | ------------------------------------------------------------------ |
+| Image           | Synchronous  | Provider responses return in seconds; completes inline with reply. |
+| Text-to-speech  | Synchronous  | Provider responses return in seconds; attached to the reply audio. |
+| Video           | Asynchronous | Provider processing takes 30 s to several minutes.                 |
+| Music (shared)  | Asynchronous | Same provider-processing characteristic as video.                  |
+| Music (ComfyUI) | Synchronous  | Local workflow runs inline against the configured ComfyUI server.  |
+
+For async tools, OpenClaw submits the request to the provider, returns a task
+id immediately, and tracks the job in the task ledger. The agent continues
+responding to other messages while the job runs. When the provider finishes,
+OpenClaw wakes the agent so it can post the finished media back into the
+original channel.
+
+## Speech-to-text and Voice Call

 Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
 inbound audio through the batch `tools.media.audio` path when configured.
 Channel plugins that preflight a voice note for mention gating or command
 parsing mark the transcribed attachment on the inbound context, so the shared
-media-understanding pass reuses that transcript instead of making a second STT
-call for the same audio.
+media-understanding pass reuses that transcript instead of making a second
+STT call for the same audio.

 Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
 streaming STT providers, so live phone audio can be forwarded to the selected
 vendor without waiting for a completed recording.

-Google maps to OpenClaw's image, video, music, batch TTS, backend realtime
-voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image,
-video, batch TTS, batch STT, Voice Call streaming STT, backend realtime voice,
-and memory embedding surfaces. xAI currently maps to OpenClaw's image, video,
-search, code-execution, batch TTS, batch STT, and Voice Call streaming STT
-surfaces. xAI Realtime voice is an upstream capability, but it is not
-registered in OpenClaw until the shared realtime voice contract can represent
-it.
+## Provider mappings (how vendors split across surfaces)

-## Quick links
-
-- [Image Generation](/tools/image-generation) -- generating and editing images
-- [Video Generation](/tools/video-generation) -- text-to-video, image-to-video, and video-to-video
-- [Music Generation](/tools/music-generation) -- creating music and audio tracks
-- [Text-to-Speech](/tools/tts) -- converting replies to spoken audio
-- [Media Understanding](/nodes/media-understanding) -- understanding inbound images, audio, and video
+<AccordionGroup>
+  <Accordion title="Google">
+    Image, video, music, batch TTS, backend realtime voice, and
+    media-understanding surfaces.
+  </Accordion>
+  <Accordion title="OpenAI">
+    Image, video, batch TTS, batch STT, Voice Call streaming STT, backend
+    realtime voice, and memory-embedding surfaces.
+  </Accordion>
+  <Accordion title="xAI">
+    Image, video, search, code-execution, batch TTS, batch STT, and Voice
+    Call streaming STT. xAI Realtime voice is an upstream capability but is
+    not registered in OpenClaw until the shared realtime-voice contract can
+    represent it.
+  </Accordion>
+</AccordionGroup>

 ## Related
@@ -89,3 +130,5 @@ it.
 - [Video generation](/tools/video-generation)
 - [Music generation](/tools/music-generation)
 - [Text-to-speech](/tools/tts)
+- [Media understanding](/nodes/media-understanding)
+- [Audio nodes](/nodes/audio)