From 4ef8f4f53c2afec642db0b0eceb8e1e85b0dbddf Mon Sep 17 00:00:00 2001
From: Vincent Koc
Date: Mon, 6 Apr 2026 16:18:22 +0100
Subject: [PATCH] docs: add media overview page and consolidate TTS duplicate

---
 docs/docs.json               |   1 +
 docs/tools/media-overview.md |  60 +++++
 docs/tts.md                  | 452 +----------------------------------
 3 files changed, 64 insertions(+), 449 deletions(-)
 create mode 100644 docs/tools/media-overview.md

diff --git a/docs/docs.json b/docs/docs.json
index 4d0caa3959a..7bfd3e84a17 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -1158,6 +1158,7 @@
     {
       "group": "Tools",
       "pages": [
+        "tools/media-overview",
         "tools/apply-patch",
         {
           "group": "Web Browser",
diff --git a/docs/tools/media-overview.md b/docs/tools/media-overview.md
new file mode 100644
index 00000000000..e5ca76f8dd2
--- /dev/null
+++ b/docs/tools/media-overview.md
@@ -0,0 +1,60 @@
+---
+summary: "Unified landing page for media generation, understanding, and speech capabilities"
+read_when:
+  - Looking for an overview of media capabilities
+  - Deciding which media provider to configure
+  - Understanding how async media generation works
+title: "Media Overview"
+---
+
+# Media Generation and Understanding
+
+OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
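+For example (a minimal sketch reusing the `messages.tts` shape documented on the Text-to-Speech page), configuring a speech provider is what makes the `tts` tool available; the other media tools are gated the same way by their own provider configuration:
+
+```json5
+// Sketch: with a speech provider configured, the agent can use the `tts` tool.
+{
+  messages: {
+    tts: {
+      provider: "openai",
+      providers: {
+        openai: {
+          apiKey: "openai_api_key",
+        },
+      },
+    },
+  },
+}
+```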
+
+## Capabilities at a glance
+
+| Capability           | Tool             | Providers                                                                                    | What it does                                            |
+| -------------------- | ---------------- | -------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| Image generation     | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra                                                 | Creates or edits images from text prompts or references |
+| Video generation     | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos    |
+| Music generation     | `music_generate` | ComfyUI, Google, MiniMax                                                                     | Creates music or audio tracks from text prompts         |
+| Text-to-speech (TTS) | `tts`            | ElevenLabs, Microsoft, MiniMax, OpenAI                                                       | Converts outbound replies to spoken audio               |
+| Media understanding  | (automatic)      | Any vision/audio-capable model provider, plus CLI fallbacks                                  | Summarizes inbound images, audio, and video             |
+
+## Provider capability matrix
+
+This table shows which providers support which media capabilities across the platform.
+
+| Provider   | Image | Video | Music | TTS | STT / Transcription | Media Understanding |
+| ---------- | ----- | ----- | ----- | --- | ------------------- | ------------------- |
+| Alibaba    |       | Yes   |       |     |                     |                     |
+| BytePlus   |       | Yes   |       |     |                     |                     |
+| ComfyUI    | Yes   | Yes   | Yes   |     |                     |                     |
+| Deepgram   |       |       |       |     | Yes                 |                     |
+| ElevenLabs |       |       |       | Yes |                     |                     |
+| fal        | Yes   | Yes   |       |     |                     |                     |
+| Google     | Yes   | Yes   | Yes   |     |                     | Yes                 |
+| Microsoft  |       |       |       | Yes |                     |                     |
+| MiniMax    | Yes   | Yes   | Yes   | Yes |                     |                     |
+| OpenAI     | Yes   | Yes   |       | Yes | Yes                 | Yes                 |
+| Qwen       |       | Yes   |       |     |                     |                     |
+| Runway     |       | Yes   |       |     |                     |                     |
+| Together   |       | Yes   |       |     |                     |                     |
+| Vydra      | Yes   | Yes   |       |     |                     |                     |
+| xAI        |       | Yes   |       |     |                     |                     |
+
+Media understanding uses any vision-capable or audio-capable model registered in your provider config.
+The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.
+
+## How async generation works
+
+Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.
+
+## Quick links
+
+- [Image Generation](/tools/image-generation) -- generating and editing images
+- [Video Generation](/tools/video-generation) -- text-to-video, image-to-video, and video-to-video
+- [Music Generation](/tools/music-generation) -- creating music and audio tracks
+- [Text-to-Speech](/tools/tts) -- converting replies to spoken audio
+- [Media Understanding](/nodes/media-understanding) -- understanding inbound images, audio, and video
diff --git a/docs/tts.md b/docs/tts.md
index e7f9a51ac04..e212b3cb32a 100644
--- a/docs/tts.md
+++ b/docs/tts.md
@@ -1,452 +1,6 @@
 ---
-summary: "Text-to-speech (TTS) for outbound replies"
-read_when:
-  - Enabling text-to-speech for replies
-  - Configuring TTS providers or limits
-  - Using /tts commands
-title: "Text-to-Speech (legacy path)"
+title: "Text-to-Speech"
+redirect: /tools/tts
 ---
-
-# Text-to-speech (TTS)
-
-OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
-It works anywhere OpenClaw can send audio.
-
-## Supported services
-
-- **ElevenLabs** (primary or fallback provider)
-- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
-- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
-- **OpenAI** (primary or fallback provider; also used for summaries)
-
-### Microsoft speech notes
-
-The bundled Microsoft speech provider currently uses Microsoft Edge's online
-neural TTS service via the `node-edge-tts` library. It's a hosted service (not
-local), uses Microsoft endpoints, and does not require an API key.
-`node-edge-tts` exposes speech configuration options and output formats, but
-not all options are supported by the service. Legacy config and directive input
-using `edge` still works and is normalized to `microsoft`.
-
-Because this path is a public web service without a published SLA or quota,
-treat it as best-effort. If you need guaranteed limits and support, use OpenAI
-or ElevenLabs.
-
-## Optional keys
-
-If you want OpenAI, ElevenLabs, or MiniMax:
-
-- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
-- `MINIMAX_API_KEY`
-- `OPENAI_API_KEY`
-
-Microsoft speech does **not** require an API key.
-
-If multiple providers are configured, the selected provider is used first and the others are fallback options.
-Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
-so that provider must also be authenticated if you enable summaries.
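-
-Keys can also be set directly on the provider blocks (placeholder values shown; the env vars above are used as fallback when `apiKey` is unset):
-
-```json5
-{
-  messages: {
-    tts: {
-      providers: {
-        elevenlabs: {
-          apiKey: "elevenlabs_api_key",
-        },
-        openai: {
-          apiKey: "openai_api_key",
-        },
-      },
-    },
-  },
-}
-```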
-
-## Service links
-
-- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
-- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
-- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
-- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
-- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
-- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
-- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
-
-## Is it enabled by default?
-
-No. Auto‑TTS is **off** by default. Enable it in config with
-`messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).
-
-When `messages.tts.provider` is unset, OpenClaw picks the first configured
-speech provider in registry auto-select order.
-
-## Config
-
-TTS config lives under `messages.tts` in `openclaw.json`.
-Full schema is in [Gateway configuration](/gateway/configuration).
-
-### Minimal config (enable + provider)
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-      provider: "elevenlabs",
-    },
-  },
-}
-```
-
-### OpenAI primary with ElevenLabs fallback
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-      provider: "openai",
-      summaryModel: "openai/gpt-4.1-mini",
-      modelOverrides: {
-        enabled: true,
-      },
-      providers: {
-        openai: {
-          apiKey: "openai_api_key",
-          baseUrl: "https://api.openai.com/v1",
-          model: "gpt-4o-mini-tts",
-          voice: "alloy",
-        },
-        elevenlabs: {
-          apiKey: "elevenlabs_api_key",
-          baseUrl: "https://api.elevenlabs.io",
-          voiceId: "voice_id",
-          modelId: "eleven_multilingual_v2",
-          seed: 42,
-          applyTextNormalization: "auto",
-          languageCode: "en",
-          voiceSettings: {
-            stability: 0.5,
-            similarityBoost: 0.75,
-            style: 0.0,
-            useSpeakerBoost: true,
-            speed: 1.0,
-          },
-        },
-      },
-    },
-  },
-}
-```
-
-### Microsoft primary (no API key)
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-      provider: "microsoft",
-      providers: {
-        microsoft: {
-          enabled: true,
-          voice: "en-US-MichelleNeural",
-          lang: "en-US",
-          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
-          rate: "+10%",
-          pitch: "-5%",
-        },
-      },
-    },
-  },
-}
-```
-
-### MiniMax primary
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-      provider: "minimax",
-      providers: {
-        minimax: {
-          apiKey: "minimax_api_key",
-          baseUrl: "https://api.minimax.io",
-          model: "speech-2.8-hd",
-          voiceId: "English_expressive_narrator",
-          speed: 1.0,
-          vol: 1.0,
-          pitch: 0,
-        },
-      },
-    },
-  },
-}
-```
-
-### Disable Microsoft speech
-
-```json5
-{
-  messages: {
-    tts: {
-      providers: {
-        microsoft: {
-          enabled: false,
-        },
-      },
-    },
-  },
-}
-```
-
-### Custom limits + prefs path
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-      maxTextLength: 4000,
-      timeoutMs: 30000,
-      prefsPath: "~/.openclaw/settings/tts.json",
-    },
-  },
-}
-```
-
-### Only reply with audio after an inbound voice message
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "inbound",
-    },
-  },
-}
-```
-
-### Disable auto-summary for long replies
-
-(The summary toggle is stored in local prefs, not the main config, so it is turned off with a command.)
-
-```json5
-{
-  messages: {
-    tts: {
-      auto: "always",
-    },
-  },
-}
-```
-
-Then run:
-
-```
-/tts summary off
-```
-
-### Notes on fields
-
-- `auto`: auto‑TTS mode (`off`, `always`, `inbound`, `tagged`).
-  - `inbound` only sends audio after an inbound voice message.
-  - `tagged` only sends audio when the reply includes `[[tts]]` tags.
-- `enabled`: legacy toggle (doctor migrates this to `auto`).
-- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
-- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
-- Legacy `provider: "edge"` still works and is normalized to `microsoft`.
-- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
-  - Accepts `provider/model` or a configured model alias.
-- `modelOverrides`: allow the model to emit TTS directives (on by default).
-  - `allowProvider` defaults to `false` (provider switching is opt-in).
-- `providers.<provider>`: provider-owned settings keyed by speech provider id.
-- Legacy direct provider blocks (`messages.tts.openai`, `messages.tts.elevenlabs`, `messages.tts.microsoft`, `messages.tts.edge`) are auto-migrated to `messages.tts.providers.<provider>` on load.
-- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
-- `timeoutMs`: request timeout (ms).
-- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
-- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
-- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
-  - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
-  - Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
-- `providers.elevenlabs.voiceSettings`:
-  - `stability`, `similarityBoost`, `style`: `0..1`
-  - `useSpeakerBoost`: `true|false`
-  - `speed`: `0.5..2.0` (1.0 = normal)
-- `providers.elevenlabs.applyTextNormalization`: `auto|on|off`
-- `providers.elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
-- `providers.elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
-- `providers.minimax.baseUrl`: override MiniMax API base URL (default `https://api.minimax.io`, env: `MINIMAX_API_HOST`).
-- `providers.minimax.model`: TTS model (default `speech-2.8-hd`, env: `MINIMAX_TTS_MODEL`).
-- `providers.minimax.voiceId`: voice identifier (default `English_expressive_narrator`, env: `MINIMAX_TTS_VOICE_ID`).
-- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
-- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
-- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
-- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
-- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
-- `providers.microsoft.lang`: language code (e.g. `en-US`).
-- `providers.microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
-  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
-- `providers.microsoft.rate` / `providers.microsoft.pitch` / `providers.microsoft.volume`: percent strings (e.g. `+10%`, `-5%`).
-- `providers.microsoft.saveSubtitles`: write JSON subtitles alongside the audio file.
-- `providers.microsoft.proxy`: proxy URL for Microsoft speech requests.
-- `providers.microsoft.timeoutMs`: request timeout override (ms).
-- `edge.*`: legacy alias for the same Microsoft settings.
-
-## Model-driven overrides (default on)
-
-By default, the model **can** emit TTS directives for a single reply.
-When `messages.tts.auto` is `tagged`, these directives are required to trigger audio.
-
-When enabled, the model can emit `[[tts:...]]` directives to override the voice
-for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
-provide expressive tags (laughter, singing cues, etc.) that should only appear in
-the audio.
-
-`provider=...` directives are ignored unless `modelOverrides.allowProvider: true`.
-
-Example reply payload:
-
-```
-Here you go.
-
-[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
-[[tts:text]](laughs) Read the song once more.[[/tts:text]]
-```
-
-Available directive keys (when enabled):
-
-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
-- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
-- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
-- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
-- `vol` / `volume` (MiniMax volume, 0-10)
-- `pitch` (MiniMax pitch, -12 to 12)
-- `applyTextNormalization` (`auto|on|off`)
-- `languageCode` (ISO 639-1)
-- `seed`
-
-Disable all model overrides:
-
-```json5
-{
-  messages: {
-    tts: {
-      modelOverrides: {
-        enabled: false,
-      },
-    },
-  },
-}
-```
-
-Optional allowlist (enable provider switching while keeping other knobs configurable):
-
-```json5
-{
-  messages: {
-    tts: {
-      modelOverrides: {
-        enabled: true,
-        allowProvider: true,
-        allowSeed: false,
-      },
-    },
-  },
-}
-```
-
-## Per-user preferences
-
-Slash commands write local overrides to `prefsPath` (default:
-`~/.openclaw/settings/tts.json`, override with `OPENCLAW_TTS_PREFS` or
-`messages.tts.prefsPath`).
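-
-For illustration, after `/tts provider openai`, `/tts limit 2000`, and `/tts summary off`, the prefs file might contain (field names as stored; exact shape may vary):
-
-```json
-{
-  "enabled": true,
-  "provider": "openai",
-  "maxLength": 2000,
-  "summarize": false
-}
-```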
-
-Stored fields:
-
-- `enabled`
-- `provider`
-- `maxLength` (summary threshold; default 1500 chars)
-- `summarize` (default `true`)
-
-These override `messages.tts.*` for that host.
-
-## Output formats (fixed)
-
-- **Feishu / Matrix / Telegram / WhatsApp**: Opus voice message (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
-  - 48kHz / 64kbps is a good voice message tradeoff.
-- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
-  - 44.1kHz / 128kbps is the default balance for speech clarity.
-- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
-- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
-  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
-  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
-  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
-    guaranteed Opus voice messages.
-  - If the configured Microsoft output format fails, OpenClaw retries with MP3.
-
-OpenAI/ElevenLabs output formats are fixed per channel (see above).
-
-## Auto-TTS behavior
-
-When enabled, OpenClaw:
-
-- skips TTS if the reply already contains media or a `MEDIA:` directive.
-- skips very short replies (< 10 chars).
-- summarizes long replies when enabled using `agents.defaults.model.primary` (or `summaryModel`).
-- attaches the generated audio to the reply.
-
-If the reply exceeds `maxLength` and summary is off (or no API key for the
-summary model), audio is skipped and the normal text reply is sent.
-
-## Flow diagram
-
-```
-Reply -> TTS enabled?
-  no  -> send text
-  yes -> has media / MEDIA: / short?
-    yes -> send text
-    no  -> length > limit?
-      no  -> TTS -> attach audio
-      yes -> summary enabled?
-        no  -> send text
-        yes -> summarize (summaryModel or agents.defaults.model.primary)
-             -> TTS -> attach audio
-```
-
-## Slash command usage
-
-There is a single command: `/tts`.
-See [Slash commands](/tools/slash-commands) for enablement details.
-
-Discord note: `/tts` is a built-in Discord command, so OpenClaw registers
-`/voice` as the native command there. Text `/tts ...` still works.
-
-```
-/tts off
-/tts always
-/tts inbound
-/tts tagged
-/tts status
-/tts provider openai
-/tts limit 2000
-/tts summary off
-/tts audio Hello from OpenClaw
-```
-
-Notes:
-
-- Commands require an authorized sender (allowlist/owner rules still apply).
-- `commands.text` or native command registration must be enabled.
-- `off|always|inbound|tagged` are per‑session toggles (`/tts on` is an alias for `/tts always`).
-- `limit` and `summary` are stored in local prefs, not the main config.
-- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
-- `/tts status` includes fallback visibility for the latest attempt:
-  - success fallback: `Fallback: <from> -> <to>` plus `Attempts: ...`
-  - failure: `Error: ...` plus `Attempts: ...`
-  - detailed diagnostics: `Attempt details: provider:outcome(reasonCode) latency`
-- OpenAI and ElevenLabs API failures now include parsed provider error detail and request id (when returned by the provider), which are surfaced in TTS errors/logs.
-
-## Agent tool
-
-The `tts` tool converts text to speech and returns an audio attachment for
-reply delivery. When the channel is Feishu, Matrix, Telegram, or WhatsApp,
-the audio is delivered as a voice message rather than a file attachment.
-
-## Gateway RPC
-
-Gateway methods:
-
-- `tts.status`
-- `tts.enable`
-- `tts.disable`
-- `tts.convert`
-- `tts.setProvider`
-- `tts.providers`
+This page has moved to [Text-to-Speech](/tools/tts).