openclaw/docs/tools/video-generation.md
xieyongliang 2c57ec7b5f video_generate: add providerOptions, inputAudios, and imageRoles (#61987)
* video_generate: add providerOptions, inputAudios, and imageRoles

- VideoGenerationSourceAsset gains an optional `role` field (e.g.
  "first_frame", "last_frame"); core treats it as opaque and forwards it
  to the provider unchanged.

- VideoGenerationRequest gains `inputAudios` (reference audio assets,
  e.g. background music) and `providerOptions` (arbitrary
  provider-specific key/value pairs forwarded as-is).

- VideoGenerationProviderCapabilities gains `maxInputAudios`.

- video_generate tool schema adds:
  - `imageRoles` array (parallel to `images`, sets role per asset)
  - `audioRef` / `audioRefs` (single/multi reference audio inputs)
  - `providerOptions` (JSON object passed through to the provider)
  - `MAX_INPUT_IMAGES` bumped 5 → 9; `MAX_INPUT_AUDIOS` = 3

- Capability validation extended to gate on `maxInputAudios`.

- runtime.ts threads `inputAudios` and `providerOptions` through to
  `provider.generateVideo`.

- Docs and runtime tests updated.

Made-with: Cursor

* docs: fix BytePlus Seedance capability table — split 1.5 and 2.0 rows

1.5 Pro supports at most 2 input images (first_frame + last_frame);
2.0 supports up to 9 reference images, 3 videos, and 3 audios.
Provider notes section updated accordingly.

Made-with: Cursor

* docs: list all Seedance 1.0 models in video-generation provider table

- Default model updated to seedance-1-0-pro-250528 (was the T2V lite)
- Provider notes now enumerate all five 1.0 model IDs with T2V/I2V capability notes

Made-with: Cursor

* video_generate: address review feedback (P1/P2)

P1: Add "adaptive" to SUPPORTED_ASPECT_RATIOS so provider-specific ratio
passthrough (used by Seedance 1.5/2.0) is accepted instead of throwing.
Update error message to include "adaptive" in the allowed list.

P1: Fix audio input capability default — when a provider does not declare
maxInputAudios, default to 0 (no audio support) instead of MAX_INPUT_AUDIOS.
Providers must explicitly opt in via maxInputAudios to accept audio inputs.

P2: Remove unnecessary type cast in imageRoles assignment; VideoGenerationSourceAsset
already declares role?: string so a non-null assertion suffices.

P2: Add videoRoles and audioRoles tool parameters, parallel to imageRoles,
so callers can assign semantic role hints to reference video and audio assets
(e.g. "reference_video", "reference_audio" for Seedance 2.0).

Made-with: Cursor

* video_generate: fix check-docs formatting and snake_case param reading

Made-with: Cursor

* video_generate: clarify *Roles are parallel to combined input list (P2)

Made-with: Cursor

* video_generate: add missing duration import; fix corrupted docs section

Made-with: Cursor

* video_generate: pass mode inputs to duration resolver; note plugin requirement (P2)

Made-with: Cursor

* plugin-sdk: sync new video-gen fields — role, inputAudios, providerOptions, maxInputAudios

Add fields introduced by core in the PR1 batch to the public plugin-sdk
mirror so TypeScript provider plugins can declare and consume them
without type assertions:
- VideoGenerationSourceAsset.role?: string
- VideoGenerationRequest.inputAudios and .providerOptions
- VideoGenerationModeCapabilities.maxInputAudios

The AssertAssignable bidirectional checks still pass because all new
fields are optional; this change makes the SDK surface complete.

Made-with: Cursor

* video-gen runtime: skip failover candidates lacking audio capability

Made-with: Cursor

* video-gen: fall back to flat capabilities.maxInputAudios in failover and tool validation

Made-with: Cursor

* video-gen: defer audio-count check to runtime, enabling fallback for audio-capable candidates

Made-with: Cursor

* video-gen: defer maxDurationSeconds check to runtime, enabling fallback for higher-cap candidates

Made-with: Cursor

* video-gen: add VideoGenerationAssetRole union and typed providerOptions capability

Introduces a canonical VideoGenerationAssetRole union (first_frame,
last_frame, reference_image, reference_video, reference_audio) for the
source-asset role hint, and a VideoGenerationProviderOptionType tag
('number' | 'boolean' | 'string') plus a new capabilities.providerOptions
schema that providers use to declare which opaque providerOptions keys
they accept and with what primitive type.

Types are additive and backwards compatible. The role field accepts both
canonical union values and arbitrary provider-specific strings via a
`VideoGenerationAssetRole | (string & {})` union, so autocomplete works
for the common case without blocking provider-specific extensions.

Runtime enforcement of providerOptions (skip-in-fallback, unknown key
and type mismatch) lands in a follow-up commit.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: enforce typed providerOptions schema via skip-in-fallback

Adds `validateProviderOptionsAgainstDeclaration` in the video-generation
runtime and wires it into the `generateVideo` candidate loop alongside
the existing audio-count and duration-cap skip guards.

Behavior:
  - Candidates with no declared `capabilities.providerOptions` skip any
    non-empty providerOptions payload with a clear skip reason, so a
    provider that would ignore `{seed: 42}` and succeed without the
    caller's intent never gets reached.
  - Candidates that declare a schema reject unknown keys with the list
    of accepted keys in the error.
  - Candidates that declare a schema reject type mismatches (expected
    number/boolean/string) with the declared type in the error.
  - All skip reasons push into `attempts` so the aggregated failure
    message at the end of the fallback chain explains exactly why each
    candidate was rejected.

Also hardens the tool boundary: `providerOptions` that is not a plain
JSON object (including bogus arrays like `["seed", 42]`) now throws a
`ToolInputError` up front instead of being cast to `Record` and
forwarded with numeric-string keys.

Consistent with the audio/duration skip-in-fallback pattern introduced
by yongliang.xie in earlier commits on this branch.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: harden *Roles parity + document canonical role values

Replaces the inline `parseRolesArg` lambda with a dedicated
`parseRoleArray` helper that throws a ToolInputError when the caller
supplies more roles than assets. Off-by-one alignment mistakes in
`imageRoles` / `videoRoles` / `audioRoles` now fail loudly at the tool
boundary instead of silently dropping trailing roles.

Also tightens the schema descriptions to document the canonical
VideoGenerationAssetRole values (first_frame, last_frame, reference_*)
and the skip-in-fallback contract on providerOptions, and rejects
non-array inputs to any `*Roles` field early rather than coercing them
to an empty list.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: surface dropped aspectRatio sentinels in ignoredOverrides

"adaptive" and other provider-specific sentinel aspect ratios are
unparseable as numeric ratios, so when the active provider does not
declare the sentinel in caps.aspectRatios, `resolveClosestAspectRatio`
returns undefined and the previous code silently nulled out
`aspectRatio` without surfacing a warning.

Push the dropped value into `ignoredOverrides` so the tool result
warning path ("Ignored unsupported overrides for …") picks it up, and
the caller gets visible feedback that the request was dropped instead
of a silent no-op. Also corrects the tool-side comment on
SUPPORTED_ASPECT_RATIOS to describe actual behavior.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: surface declared providerOptions + maxInputAudios in action=list

`video_generate action=list` now includes the declared providerOptions
schema (key:type) per provider, so agents can discover which opaque
keys each provider accepts without trial and error. Both mode-level and
flat-provider providerOptions declarations are merged, matching the
runtime lookup order in `generateVideo`.

Also surfaces `maxInputAudios` alongside the other max-input counts for
completeness — previously the list output did not expose the audio cap
at all, even though the tool validates against it.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: warn once per request when runtime skips a fallback candidate

The skip-in-fallback guards (audio cap, duration cap, providerOptions)
all logged at debug level, which meant operators had no visible signal
when the primary provider was silently passed over in favor of a
fallback. Add a first-skip log.warn in the runtime loop so the reason
for the first rejection is surfaced once per request, and leave the
rest of the skip events at debug to avoid flooding on long chains.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: cover new tool-level behavior with regression tests

Adds regression tests for:
  - providerOptions shape rejection (arrays, strings)
  - providerOptions happy-path forwarding to runtime
  - imageRoles length-parity guard
  - *Roles non-array rejection
  - positional role attachment to loaded reference images
  - audio data: URL templated rejection branch
  - aspectRatio='adaptive' acceptance and forwarding
  - unsupported aspectRatio rejection (mentions 'adaptive' in the error)

All eight new cases run in the existing video-generate-tool suite and
use the same provider-mock pattern already established in the file.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* video-gen: cover runtime providerOptions skip-in-fallback branches

Adds runtime regression tests for the new typed-providerOptions guard:
  - candidates without a declared providerOptions schema are skipped
    when any providerOptions is supplied (prevents silent drop)
  - candidates that declare a schema skip on unknown keys with the
    accepted-key list surfaced in the error
  - candidates that declare a schema skip on type mismatches with the
    declared type surfaced in the error
  - end-to-end fallback: openai (no providerOptions) is skipped and
    byteplus (declared schema) accepts the same request, with an
    attempt entry recording the first skip reason

Also updates the existing 'forwards providerOptions to the provider
unchanged' case so the destination provider declares the matching
typed schema, and wires a `warn` stub into the hoisted logger mock
so the new first-skip log.warn call path does not blow up.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* changelog: note video_generate providerOptions / inputAudios / role hints

Adds an Unreleased Changes entry describing the user-visible surface
expansion for video_generate: typed providerOptions capability,
inputAudios reference audio, per-asset role hints via the canonical
VideoGenerationAssetRole union, the 'adaptive' aspect-ratio sentinel,
maxInputAudios capability, and the relaxed 9-image cap.

Credits the original PR author.

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>

* byteplus: declare providerOptions schema (seed, draft, camerafixed) and forward to API

Made-with: Cursor

* byteplus: fix camera_fixed body field (API uses underscore, not camerafixed)

Made-with: Cursor

* fix(byteplus): normalize resolution to lowercase before API call

The Seedance API rejects resolution values with uppercase letters —
"480P", "720P" etc return InvalidParameter, while "480p", "720p"
are accepted. This was breaking the video generation live test
(resolveLiveVideoResolution returns "480P").

Normalize req.resolution to lowercase at the provider layer before
setting body.resolution, so any caller-supplied casing is corrected
without requiring changes to the VideoGenerationResolution type or
live-test helpers.

Verified via direct API call:
  body.resolution = "480P" → HTTP 400 InvalidParameter
  body.resolution = "480p" → task created successfully
  body.resolution = "720p" → task created successfully (t2v, i2v, 1.5-pro)
  body.resolution = "1080p" → task created successfully

Made-with: Cursor

* video-gen/byteplus: auto-select i2v model when input images provided with t2v model

Seedance 1.0 uses separate model IDs for T2V (seedance-1-0-lite-t2v-250428)
and I2V (seedance-1-0-lite-i2v-250428). When the caller requests a T2V model
but also provides inputImages, the API rejects with task_type i2v not supported
on t2v model.

Fix: when inputImages are present and the requested model contains "-t2v-",
auto-substitute "-i2v-" so the API receives the correct model. Seedance 1.5 Pro
uses a single model ID for both modes and is unaffected by this substitution.

Verified via live test: both mode=generate and mode=imageToVideo pass for
byteplus/seedance-1-0-lite-t2v-250428 with no failures.

Co-authored-by: odysseus0 <odysseus0@example.com>
Made-with: Cursor

* video-gen: fix duration rounding + align BytePlus (1.0) docs (P2)

Made-with: Cursor

* video-gen: relax providerOptions gate for undeclared-schema providers (P1)

Distinguish undefined (not declared = backward-compat pass-through) from
{} (explicitly declared empty = no options accepted) in
validateProviderOptionsAgainstDeclaration. Providers without a declared
schema receive providerOptions as-is; providers with an explicit empty
schema still skip. Typed schemas continue to validate key names and types.

Also: restore camera_fixed (underscore) in BytePlus provider schema and
body key (regression from earlier rebase), remove duplicate local
readBooleanToolParam definition now imported from media-tool-shared,
update tests and docs accordingly.

Made-with: Cursor

* video_generate: add landing follow-up coverage

* video_generate: finalize plugin-sdk baseline (#61987) (thanks @xieyongliang)

---------

Co-authored-by: yongliang.xie <yongliang.xie@bytedance.com>
Co-authored-by: George Zhang <georgezhangtj97@gmail.com>
Co-authored-by: odysseus0 <odysseus0@example.com>
2026-04-11 02:23:14 -07:00


---
summary: Generate videos from text, images, or existing videos using 14 provider backends
read_when:
  - Generating videos via the agent
  - Configuring video generation providers and models
  - Understanding the video_generate tool parameters
title: Video Generation
---

Video Generation

OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Fourteen provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.

The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.

OpenClaw treats video generation as three runtime modes:

  • generate for text-to-video requests with no reference media
  • imageToVideo when the request includes one or more reference images
  • videoToVideo when the request includes one or more reference videos

Providers can support any subset of those modes. The tool validates the active mode before submission and reports supported modes in action=list.

Quick start

  1. Set an API key for any supported provider:

     export GEMINI_API_KEY="your-key"

  2. Optionally pin a default model:

     openclaw config set agents.defaults.videoGenerationModel.primary "google/veo-3.1-fast-generate-preview"

  3. Ask the agent:

     Generate a 5-second cinematic video of a friendly lobster surfing at sunset.

The agent calls video_generate automatically. No tool allowlisting is needed.

What happens when you generate a video

Video generation is asynchronous. When the agent calls video_generate in a session:

  1. OpenClaw submits the request to the provider and immediately returns a task ID.
  2. The provider processes the job in the background (typically 30 seconds to 5 minutes depending on the provider and resolution).
  3. When the video is ready, OpenClaw wakes the same session with an internal completion event.
  4. The agent posts the finished video back into the original conversation.

While a job is in flight, duplicate video_generate calls in the same session return the current task status instead of starting another generation. Use openclaw tasks list or openclaw tasks show <taskId> to check progress from the CLI.

Outside of session-backed agent runs (for example, direct tool invocations), the tool falls back to inline generation and returns the final media path in the same turn.

Task lifecycle

Each video_generate request moves through four states:

  1. queued -- task created, waiting for the provider to accept it.
  2. running -- provider is processing (typically 30 seconds to 5 minutes depending on provider and resolution).
  3. succeeded -- video ready; the agent wakes and posts it to the conversation.
  4. failed -- provider error or timeout; the agent wakes with error details.

Check status from the CLI:

openclaw tasks list
openclaw tasks show <taskId>
openclaw tasks cancel <taskId>

Duplicate prevention: if a video task is already queued or running for the current session, video_generate returns the existing task status instead of starting a new one. Use action: "status" to check explicitly without triggering a new generation.

Supported providers

| Provider | Default model | Text | Image ref | Video ref | API key |
|---|---|---|---|---|---|
| Alibaba | wan2.6-t2v | Yes | Yes (remote URL) | Yes (remote URL) | MODELSTUDIO_API_KEY |
| BytePlus (1.0) | seedance-1-0-pro-250528 | Yes | Up to 2 images (I2V models only; first + last frame) | No | BYTEPLUS_API_KEY |
| BytePlus Seedance 1.5 | seedance-1-5-pro-251215 | Yes | Up to 2 images (first + last frame via role) | No | BYTEPLUS_API_KEY |
| BytePlus Seedance 2.0 | dreamina-seedance-2-0-260128 | Yes | Up to 9 reference images | Up to 3 videos | BYTEPLUS_API_KEY |
| ComfyUI | workflow | Yes | 1 image | No | COMFY_API_KEY or COMFY_CLOUD_API_KEY |
| fal | fal-ai/minimax/video-01-live | Yes | 1 image | No | FAL_KEY |
| Google | veo-3.1-fast-generate-preview | Yes | 1 image | 1 video | GEMINI_API_KEY |
| MiniMax | MiniMax-Hailuo-2.3 | Yes | 1 image | No | MINIMAX_API_KEY |
| OpenAI | sora-2 | Yes | 1 image | 1 video | OPENAI_API_KEY |
| Qwen | wan2.6-t2v | Yes | Yes (remote URL) | Yes (remote URL) | QWEN_API_KEY |
| Runway | gen4.5 | Yes | 1 image | 1 video | RUNWAYML_API_SECRET |
| Together | Wan-AI/Wan2.2-T2V-A14B | Yes | 1 image | No | TOGETHER_API_KEY |
| Vydra | veo3 | Yes | 1 image (kling) | No | VYDRA_API_KEY |
| xAI | grok-imagine-video | Yes | 1 image | 1 video | XAI_API_KEY |

Some providers accept additional or alternate API key env vars. See individual provider pages for details.

Run video_generate action=list to inspect the available providers, models, and supported runtime modes.

Declared capability matrix

This is the explicit mode contract used by video_generate, contract tests, and the shared live sweep.

| Provider | generate | imageToVideo | videoToVideo | Shared live lanes today |
|---|---|---|---|---|
| Alibaba | Yes | Yes | Yes | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| BytePlus | Yes | Yes | No | generate, imageToVideo |
| ComfyUI | Yes | Yes | No | Not in the shared sweep; workflow-specific coverage lives with Comfy tests |
| fal | Yes | Yes | No | generate, imageToVideo |
| Google | Yes | Yes | Yes | generate, imageToVideo; shared videoToVideo skipped because the current buffer-backed Gemini/Veo sweep does not accept that input |
| MiniMax | Yes | Yes | No | generate, imageToVideo |
| OpenAI | Yes | Yes | Yes | generate, imageToVideo; shared videoToVideo skipped because this org/input path currently needs provider-side inpaint/remix access |
| Qwen | Yes | Yes | Yes | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| Runway | Yes | Yes | Yes | generate, imageToVideo; videoToVideo runs only when the selected model is runway/gen4_aleph |
| Together | Yes | Yes | No | generate, imageToVideo |
| Vydra | Yes | Yes | No | generate; shared imageToVideo skipped because bundled veo3 is text-only and bundled kling requires a remote image URL |
| xAI | Yes | Yes | Yes | generate, imageToVideo; videoToVideo skipped because this provider currently needs a remote MP4 URL |

Tool parameters

Required

| Parameter | Type | Description |
|---|---|---|
| prompt | string | Text description of the video to generate (required for action: "generate") |

Content inputs

| Parameter | Type | Description |
|---|---|---|
| image | string | Single reference image (path or URL) |
| images | string[] | Multiple reference images (up to 9) |
| imageRoles | string[] | Optional per-position role hints parallel to the combined image list. Canonical values: first_frame, last_frame, reference_image |
| video | string | Single reference video (path or URL) |
| videos | string[] | Multiple reference videos (up to 4) |
| videoRoles | string[] | Optional per-position role hints parallel to the combined video list. Canonical value: reference_video |
| audioRef | string | Single reference audio (path or URL), e.g. background music or a voice reference, when the provider supports audio inputs |
| audioRefs | string[] | Multiple reference audios (up to 3) |
| audioRoles | string[] | Optional per-position role hints parallel to the combined audio list. Canonical value: reference_audio |

Role hints are forwarded to the provider as-is. Canonical values come from the VideoGenerationAssetRole union but providers may accept additional role strings. *Roles arrays must not have more entries than the corresponding reference list; off-by-one mistakes fail with a clear error. Use an empty string to leave a slot unset.
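The parity rule can be sketched as a small validator. The commit history mentions a `parseRoleArray` helper by this name, but the shape below is an illustrative assumption, not the actual openclaw code:

```typescript
// Sketch of the *Roles parity guard. A generic Error stands in for the
// tool's ToolInputError; names and signature are illustrative.
function parseRoleArray(
  roles: unknown,
  assets: string[],
  field: string,
): (string | undefined)[] {
  if (roles === undefined) return assets.map(() => undefined);
  if (!Array.isArray(roles)) {
    throw new Error(`${field} must be an array of strings`);
  }
  if (roles.length > assets.length) {
    // More roles than assets is an alignment mistake: fail loudly
    // instead of silently dropping trailing roles.
    throw new Error(
      `${field} has ${roles.length} entries but only ${assets.length} assets`,
    );
  }
  // An empty string leaves a slot unset; missing trailing slots default to unset.
  return assets.map((_, i) => (roles[i] ? String(roles[i]) : undefined));
}
```

Fewer roles than assets is allowed (trailing assets simply get no hint); more roles than assets throws at the tool boundary.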

Style controls

| Parameter | Type | Description |
|---|---|---|
| aspectRatio | string | 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9, or adaptive |
| resolution | string | 480P, 720P, 768P, or 1080P |
| durationSeconds | number | Target duration in seconds (rounded to the nearest provider-supported value) |
| size | string | Size hint when the provider supports it |
| audio | boolean | Enable generated audio in the output when supported. Distinct from audioRef* (inputs) |
| watermark | boolean | Toggle provider watermarking when supported |

adaptive is a provider-specific sentinel: it is forwarded as-is to providers that declare adaptive in their capabilities (e.g. BytePlus Seedance uses it to auto-detect the ratio from the input image dimensions). Providers that do not declare it surface the value via details.ignoredOverrides in the tool result so the drop is visible.
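A minimal sketch of how an undeclared sentinel gets dropped and surfaced, assuming a resolver that matches declared ratio strings first and falls back to nearest-numeric matching. All names here are illustrative, not the actual openclaw resolver:

```typescript
// Sketch: declared exact match (including sentinels) wins; unparseable,
// undeclared values are recorded in ignoredOverrides instead of vanishing.
function resolveAspectRatio(
  requested: string,
  declared: string[],
  ignoredOverrides: string[],
): string | undefined {
  if (declared.includes(requested)) return requested; // sentinel or exact match
  const parsed = parseRatio(requested);
  if (parsed === undefined) {
    // "adaptive" etc. is not a numeric ratio: drop it, but visibly.
    ignoredOverrides.push(`aspectRatio=${requested}`);
    return undefined;
  }
  // Otherwise pick the declared numeric ratio closest to the request.
  let best: string | undefined;
  let bestDiff = Infinity;
  for (const d of declared) {
    const v = parseRatio(d);
    if (v === undefined) continue;
    const diff = Math.abs(v - parsed);
    if (diff < bestDiff) { bestDiff = diff; best = d; }
  }
  return best;
}

function parseRatio(s: string): number | undefined {
  const m = /^(\d+):(\d+)$/.exec(s);
  return m ? Number(m[1]) / Number(m[2]) : undefined;
}
```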

Advanced

| Parameter | Type | Description |
|---|---|---|
| action | string | "generate" (default), "status", or "list" |
| model | string | Provider/model override (e.g. runway/gen4.5) |
| filename | string | Output filename hint |
| providerOptions | object | Provider-specific options as a JSON object (e.g. {"seed": 42, "draft": true}). Providers that declare a typed schema validate the keys and types; unknown keys or mismatches skip the candidate during fallback. Providers without a declared schema receive the options as-is. Run video_generate action=list to see what each provider accepts |

Not all providers support every parameter. OpenClaw normalizes duration to the closest provider-supported value and translates geometry hints (for example, size to aspect ratio) when a fallback provider exposes a different control surface. Truly unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission.

Tool results report the applied settings. When OpenClaw remaps duration or geometry during provider fallback, the returned durationSeconds, size, aspectRatio, and resolution values reflect what was submitted, and details.normalization captures the requested-to-applied translation.
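The duration side of that translation can be sketched as follows, assuming a provider declares either a supportedDurationSeconds list or a maxDurationSeconds cap. This is an illustrative helper, not the actual resolver:

```typescript
// Sketch: snap to the nearest declared duration, else clamp to the cap,
// else pass the request through untouched.
function normalizeDuration(
  requested: number,
  supported?: number[],
  maxSeconds?: number,
): number {
  if (supported && supported.length > 0) {
    // Nearest declared value wins (e.g. 7s against [5, 10] becomes 5).
    return supported.reduce((best, v) =>
      Math.abs(v - requested) < Math.abs(best - requested) ? v : best);
  }
  if (maxSeconds !== undefined) return Math.min(requested, maxSeconds);
  return requested;
}
```

The returned value is what the tool result's durationSeconds would reflect, with the requested-to-applied translation recorded in details.normalization.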

Reference inputs also select the runtime mode:

  • No reference media: generate
  • Any image reference: imageToVideo
  • Any video reference: videoToVideo
  • Reference audio inputs do not change the resolved mode; they apply on top of whatever mode the image/video references select, and only work with providers that declare maxInputAudios

Mixed image and video references are not a stable shared capability surface. Prefer one reference type per request.
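The selection above can be sketched as a tiny resolver. Since mixed image+video requests are not a stable surface, this sketch simply prefers videoToVideo when both are present (an assumption, not documented behavior):

```typescript
// Sketch of runtime-mode resolution from reference inputs. Audio
// references are deliberately absent: they never change the mode.
type VideoMode = "generate" | "imageToVideo" | "videoToVideo";

function resolveMode(images: string[], videos: string[]): VideoMode {
  if (videos.length > 0) return "videoToVideo";
  if (images.length > 0) return "imageToVideo";
  return "generate";
}
```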

Fallback and typed options

Some capability checks are applied at the fallback layer rather than the tool boundary so that a request that exceeds the primary provider's limits can still run on a capable fallback:

  • If the active candidate declares no maxInputAudios (or declares it as 0), it is skipped when the request contains audio references, and the next candidate is tried.
  • If the active candidate's maxDurationSeconds is below the requested durationSeconds and the candidate does not declare a supportedDurationSeconds list, it is skipped.
  • If the request contains providerOptions and the active candidate explicitly declares a typed providerOptions schema, the candidate is skipped when the supplied keys are not in the schema or the value types do not match. Providers that have not yet declared a schema receive the options as-is (backward-compatible pass-through). A provider can explicitly opt out of all provider options by declaring an empty schema (capabilities.providerOptions: {}), which causes the same skip as a type mismatch.

The first skip reason in a request is logged at warn so operators see when their primary provider was passed over; subsequent skips log at debug to keep long fallback chains quiet. If every candidate is skipped, the aggregated error includes the skip reason for each.
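The providerOptions gate can be sketched as a pure check that returns a skip reason, mirroring the documented semantics. This is illustrative, not the actual validateProviderOptionsAgainstDeclaration implementation:

```typescript
// Sketch: undefined schema = never declared = pass-through;
// {} = explicitly declared empty = skip on any supplied option;
// a typed schema validates key names and primitive types.
type OptionType = "number" | "boolean" | "string";

function checkProviderOptions(
  declared: Record<string, OptionType> | undefined,
  supplied: Record<string, unknown>,
): string | undefined {
  if (declared === undefined) return undefined; // backward-compat pass-through
  for (const [key, value] of Object.entries(supplied)) {
    if (!(key in declared)) {
      const accepted = Object.keys(declared).join(", ") || "(none)";
      return `unknown providerOptions key "${key}"; accepted: ${accepted}`;
    }
    if (typeof value !== declared[key]) {
      return `providerOptions.${key} must be ${declared[key]}`;
    }
  }
  return undefined; // no skip reason: this candidate may run
}
```

A non-undefined return value is the skip reason pushed into the attempts list before the next fallback candidate is tried.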

Actions

  • generate (default) -- create a video from the given prompt and optional reference inputs.
  • status -- check the state of the in-flight video task for the current session without starting another generation.
  • list -- show available providers, models, and their capabilities.

Model selection

When generating a video, OpenClaw resolves the model in this order:

  1. model tool parameter -- if the agent specifies one in the call.
  2. videoGenerationModel.primary -- from config.
  3. videoGenerationModel.fallbacks -- tried in order.
  4. Auto-detection -- uses providers that have valid auth, starting with the current default provider, then remaining providers in alphabetical order.

If a provider fails, the next candidate is tried automatically. If all candidates fail, the error includes details from each attempt.

Set agents.defaults.mediaGenerationAutoProviderFallback: false if you want video generation to use only the explicit model, primary, and fallbacks entries.

{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "google/veo-3.1-fast-generate-preview",
        fallbacks: ["runway/gen4.5", "qwen/wan2.6-t2v"],
      },
    },
  },
}
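The resolution order can be sketched as a candidate-list builder. The shape is illustrative, and the auto-detection step is simplified (the real resolver starts from the current default provider before going alphabetical):

```typescript
// Sketch: tool parameter, then primary, then fallbacks, then (optionally)
// auto-detected authed providers, de-duplicated in order.
function resolveCandidates(opts: {
  toolModel?: string;
  primary?: string;
  fallbacks?: string[];
  authedProviders?: string[]; // providers with valid auth
  autoFallback?: boolean;     // mediaGenerationAutoProviderFallback
}): string[] {
  const out: string[] = [];
  const push = (m?: string) => { if (m && !out.includes(m)) out.push(m); };
  push(opts.toolModel);
  push(opts.primary);
  for (const f of opts.fallbacks ?? []) push(f);
  if (opts.autoFallback !== false) {
    for (const p of [...(opts.authedProviders ?? [])].sort()) push(p);
  }
  return out;
}
```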

Provider notes

| Provider | Notes |
|---|---|
| Alibaba | Uses DashScope/Model Studio async endpoint. Reference images and videos must be remote http(s) URLs. |
| BytePlus (1.0) | Provider id byteplus. Models: seedance-1-0-pro-250528 (default), seedance-1-0-pro-t2v-250528, seedance-1-0-pro-fast-251015, seedance-1-0-lite-t2v-250428, seedance-1-0-lite-i2v-250428. T2V models (*-t2v-*) do not accept image inputs; I2V models and general *-pro-* models support a single reference image (first frame). Pass the image positionally or set role: "first_frame". T2V model IDs are automatically switched to the corresponding I2V variant when an image is provided. Supported providerOptions keys: seed (number), draft (boolean, forces 480p), camera_fixed (boolean). |
| BytePlus Seedance 1.5 | Requires the @openclaw/byteplus-modelark plugin. Provider id byteplus-seedance15. Model: seedance-1-5-pro-251215. Uses the unified content[] API. Supports at most 2 input images (first_frame + last_frame). All inputs must be remote https:// URLs. Set role: "first_frame" / "last_frame" on each image, or pass images positionally. aspectRatio: "adaptive" auto-detects ratio from the input image. audio: true maps to generate_audio. providerOptions.seed (number) is forwarded. |
| BytePlus Seedance 2.0 | Requires the @openclaw/byteplus-modelark plugin. Provider id byteplus-seedance2. Models: dreamina-seedance-2-0-260128, dreamina-seedance-2-0-fast-260128. Uses the unified content[] API. Supports up to 9 reference images, 3 reference videos, and 3 reference audios. All inputs must be remote https:// URLs. Set role on each asset — supported values: "first_frame", "last_frame", "reference_image", "reference_video", "reference_audio". aspectRatio: "adaptive" auto-detects ratio from the input image. audio: true maps to generate_audio. providerOptions.seed (number) is forwarded. |
| ComfyUI | Workflow-driven local or cloud execution. Supports text-to-video and image-to-video through the configured graph. |
| fal | Uses queue-backed flow for long-running jobs. Single image reference only. |
| Google | Uses Gemini/Veo. Supports one image or one video reference. |
| MiniMax | Single image reference only. |
| OpenAI | Only size override is forwarded. Other style overrides (aspectRatio, resolution, audio, watermark) are ignored with a warning. |
| Qwen | Same DashScope backend as Alibaba. Reference inputs must be remote http(s) URLs; local files are rejected upfront. |
| Runway | Supports local files via data URIs. Video-to-video requires runway/gen4_aleph. Text-only runs expose 16:9 and 9:16 aspect ratios. |
| Together | Single image reference only. |
| Vydra | Uses https://www.vydra.ai/api/v1 directly to avoid auth-dropping redirects. veo3 is bundled as text-to-video only; kling requires a remote image URL. |
| xAI | Supports text-to-video, image-to-video, and remote video edit/extend flows. |

Provider capability modes

The shared video-generation contract now lets providers declare mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks:

capabilities: {
  generate: {
    maxVideos: 1,
    maxDurationSeconds: 10,
    supportsResolution: true,
  },
  imageToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputImages: 1,
    maxDurationSeconds: 5,
  },
  videoToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputVideos: 1,
    maxDurationSeconds: 5,
  },
}

Flat aggregate fields such as maxInputImages and maxInputVideos are not enough to advertise transform-mode support. Providers should declare generate, imageToVideo, and videoToVideo explicitly so live tests, contract tests, and the shared video_generate tool can validate mode support deterministically.
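Mode-first lookup with a flat-field fallback can be sketched for the audio cap, which defaults to 0 because audio support is explicit opt-in. The types are illustrative, not the actual openclaw capability interfaces:

```typescript
// Sketch: mode-level maxInputAudios wins, then the legacy flat aggregate,
// then 0 — providers must explicitly opt in to accept audio inputs.
type Mode = "generate" | "imageToVideo" | "videoToVideo";

interface Capabilities {
  generate?: { maxInputAudios?: number };
  imageToVideo?: { maxInputAudios?: number };
  videoToVideo?: { maxInputAudios?: number };
  maxInputAudios?: number; // legacy flat aggregate
}

function maxInputAudiosFor(caps: Capabilities, mode: Mode): number {
  return caps[mode]?.maxInputAudios ?? caps.maxInputAudios ?? 0;
}
```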

Live tests

Opt-in live coverage for the shared bundled providers:

OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/video-generation-providers.live.test.ts

Repo wrapper:

pnpm test:live:media video

This live file loads missing provider env vars from ~/.profile, prefers live/env API keys ahead of stored auth profiles by default, and runs the declared modes it can exercise safely with local media:

  • generate for every provider in the sweep
  • imageToVideo when capabilities.imageToVideo.enabled
  • videoToVideo when capabilities.videoToVideo.enabled and the provider/model accepts buffer-backed local video input in the shared sweep

Today the shared videoToVideo live lane covers:

  • runway only when you select runway/gen4_aleph

Configuration

Set the default video generation model in your OpenClaw config:

{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "qwen/wan2.6-t2v",
        fallbacks: ["qwen/wan2.6-r2v-flash"],
      },
    },
  },
}

Or via the CLI:

openclaw config set agents.defaults.videoGenerationModel.primary "qwen/wan2.6-t2v"