| summary | read_when | title | sidebarTitle |
|---|---|---|---|
| Image, video, music, speech, and media-understanding capabilities at a glance | | Media overview | Media overview |

OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
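
The provider-gated tool visibility described above can be pictured with a small sketch. The `ProviderEntry` shape and the `availableMediaTools` function are hypothetical names invented for illustration, not OpenClaw's actual API; only the tool names come from this page.

```typescript
// Hypothetical types for illustration only; not OpenClaw's real registry.
type MediaTool = "image_generate" | "video_generate" | "music_generate" | "tts";

interface ProviderEntry {
  id: string;
  configured: boolean;      // assumed flag: credentials/config are present
  supports: MediaTool[];    // assumed list of tools this provider can back
}

const ALL_MEDIA_TOOLS: MediaTool[] = [
  "image_generate",
  "video_generate",
  "music_generate",
  "tts",
];

// A tool is exposed only if at least one configured provider backs it.
function availableMediaTools(providers: ProviderEntry[]): MediaTool[] {
  return ALL_MEDIA_TOOLS.filter((tool) =>
    providers.some(
      (p) => p.configured && p.supports.some((t) => t === tool),
    ),
  );
}
```

With one configured image provider and one unconfigured video provider, this sketch would expose `image_generate` only.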
Live speech uses the Talk session contract instead of the one-shot media tool
path. Talk has three modes: provider-native `realtime`, local or streaming
`stt-tts`, and `transcription` for observe-only speech capture. Those modes
share provider catalogs, event envelopes, and cancellation semantics with
telephony, meetings, browser realtime, and native push-to-talk clients.
Capabilities
- Create and edit images from text prompts or reference images via `image_generate`. Async in chat sessions: runs in the background and posts the result when ready.
- Text-to-video, image-to-video, and video-to-video via `video_generate`. Async: runs in the background and posts the result when ready.
- Generate music or audio tracks via `music_generate`. Async in chat sessions on the shared media-generation task lifecycle.
- Convert outbound replies to spoken audio via the `tts` tool plus `messages.tts` config. Synchronous.
- Summarize inbound images, audio, and video using vision-capable model providers and dedicated media-understanding plugins.
- Transcribe inbound voice messages through batch STT or Voice Call streaming STT providers.

Provider capability matrix
| Provider | Image | Video | Music | TTS | STT | Realtime voice | Media understanding |
|---|---|---|---|---|---|---|---|
| Alibaba | ✓ | ||||||
| BytePlus | ✓ | ||||||
| ComfyUI | ✓ | ✓ | ✓ | ||||
| DeepInfra | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Deepgram | ✓ | ✓ | |||||
| ElevenLabs | ✓ | ✓ | |||||
| fal | ✓ | ✓ | ✓ | ||||
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Gradium | ✓ | ||||||
| Local CLI | ✓ | ||||||
| Microsoft | ✓ | ||||||
| MiniMax | ✓ | ✓ | ✓ | ✓ | |||
| Mistral | ✓ | ||||||
| OpenAI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| OpenRouter | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Qwen | ✓ | ||||||
| Runway | ✓ | ||||||
| SenseAudio | ✓ | ||||||
| Together | ✓ | ||||||
| Vydra | ✓ | ✓ | ✓ | ||||
| xAI | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Xiaomi MiMo | ✓ | ✓ | ✓ |
Async vs synchronous
| Capability | Mode | Why |
|---|---|---|
| Image | Asynchronous | Provider processing can outlive a chat turn; generated attachments use the shared completion path. |
| Text-to-speech | Synchronous | Provider responses return in seconds, so the audio is attached directly to the reply. |
| Video | Asynchronous | Provider processing takes 30 s to several minutes; slow queues can run up to the configured timeout. |
| Music | Asynchronous | Same provider-processing characteristic as video. |
For async tools, OpenClaw submits the request to the provider, returns a task id immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent with the generated media paths so it can tell the user and relay the result through the message tool. If the requester session is inactive and some generated media is still missing from message-tool delivery, OpenClaw sends an idempotent direct fallback with only the missing media. Media already delivered through the message tool is not posted again.
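
The fallback rule above (send only media that the message tool has not already delivered, and never send it twice) can be sketched as follows. This is a minimal illustration under stated assumptions: `CompletionState`, `missingMedia`, and `FallbackSender` are hypothetical names, and the per-task, per-path idempotency key is one possible way to make the fallback idempotent, not necessarily how OpenClaw does it.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's task-ledger types.
interface CompletionState {
  taskId: string;
  generatedPaths: string[];            // media produced by the provider
  deliveredViaMessageTool: string[];   // media the agent already relayed
  requesterSessionActive: boolean;
}

// Media that was generated but not yet delivered through the message tool.
function missingMedia(state: CompletionState): string[] {
  const delivered = new Set(state.deliveredViaMessageTool);
  return state.generatedPaths.filter((p) => !delivered.has(p));
}

class FallbackSender {
  // Assumed idempotency record: one key per (task, media path).
  private sent = new Set<string>();

  // Returns the paths posted by this call; empty when nothing is needed.
  deliver(state: CompletionState): string[] {
    // The fallback only applies when the requester session is inactive.
    if (state.requesterSessionActive) return [];
    const toSend = missingMedia(state).filter(
      (p) => !this.sent.has(`${state.taskId}:${p}`),
    );
    for (const p of toSend) this.sent.add(`${state.taskId}:${p}`);
    return toSend;
  }
}
```

Calling `deliver` repeatedly with the same state posts the missing media once; later calls return an empty list, which is the duplicate-free behavior the paragraph describes.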
Speech-to-text and Voice Call
Deepgram, DeepInfra, ElevenLabs, Mistral, OpenAI, OpenRouter, SenseAudio, and xAI can all transcribe
inbound audio through the batch `tools.media.audio` path when configured.
Channel plugins that preflight a voice note for mention gating or command
parsing mark the transcribed attachment on the inbound context, so the shared
media-understanding pass reuses that transcript instead of making a second
STT call for the same audio.
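
The transcript reuse described above amounts to caching the transcript on the attachment itself. The sketch below assumes a hypothetical `AudioAttachment` shape and `ensureTranscript` helper, and models the STT call as a synchronous function to keep it short; a real provider call would be asynchronous.

```typescript
// Hypothetical attachment shape; the real inbound context may differ.
interface AudioAttachment {
  path: string;
  transcript?: string; // set by whichever pass transcribes first
}

// Stand-in for a batch STT provider call (assumed signature).
type Transcriber = (path: string) => string;

// Returns the existing transcript if one is marked on the attachment;
// otherwise transcribes once and records the result for later passes.
function ensureTranscript(att: AudioAttachment, stt: Transcriber): string {
  if (att.transcript !== undefined) return att.transcript;
  const text = stt(att.path);
  att.transcript = text;
  return text;
}
```

A channel plugin's preflight and the shared media-understanding pass could both call `ensureTranscript` on the same attachment; only the first call would reach the STT provider.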
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording.
For live user conversations, prefer Talk mode. Batch audio attachments stay on the media path; browser realtime, native push-to-talk, telephony, and meeting audio should use Talk events and the session-scoped catalogs returned by the Gateway.