diff --git a/docs/tools/tts.md b/docs/tools/tts.md index 12bd04d7069..cb17494312d 100644 --- a/docs/tools/tts.md +++ b/docs/tools/tts.md @@ -1,120 +1,122 @@ --- -summary: "Text-to-speech (TTS) for outbound replies" +summary: "Text-to-speech for outbound replies — providers, personas, slash commands, and per-channel output" read_when: - Enabling text-to-speech for replies - - Configuring TTS providers or limits - - Using /tts commands + - Configuring a TTS provider, fallback chain, or persona + - Using /tts commands or directives title: "Text-to-speech" --- -OpenClaw can convert outbound replies into audio using Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo. -It works anywhere OpenClaw can send audio. +OpenClaw can convert outbound replies into audio across **14 speech providers** +and deliver native voice messages on Feishu, Matrix, Telegram, and WhatsApp, +audio attachments everywhere else, and PCM/Ulaw streams for telephony and Talk. 
-## Supported services +## Quick start -- **Azure Speech** (primary or fallback provider; uses the Azure AI Speech REST API) -- **ElevenLabs** (primary or fallback provider) -- **Google Gemini** (primary or fallback provider; uses Gemini API TTS) -- **Gradium** (primary or fallback provider; supports voice-note and telephony output) -- **Inworld** (primary or fallback provider; uses the Inworld streaming TTS API) -- **Local CLI** (primary or fallback provider; runs a configured local TTS command) -- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`) -- **MiniMax** (primary or fallback provider; uses the T2A v2 API) -- **OpenAI** (primary or fallback provider; also used for summaries) -- **Volcengine** (primary or fallback provider; uses the BytePlus Seed Speech HTTP API) -- **Vydra** (primary or fallback provider; shared image, video, and speech provider) -- **xAI** (primary or fallback provider; uses the xAI TTS API) -- **Xiaomi MiMo** (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions) + + + OpenAI and ElevenLabs are the most reliable hosted options. Microsoft and + Local CLI work without an API key. See the [provider matrix](#supported-providers) + for the full list. + + + Export the env var for your provider (for example `OPENAI_API_KEY`, + `ELEVENLABS_API_KEY`). Microsoft and Local CLI need no key. + + + Set `messages.tts.auto: "always"` and `messages.tts.provider`: -### Microsoft speech notes + ```json5 + { + messages: { + tts: { + auto: "always", + provider: "elevenlabs", + }, + }, + } + ``` -The bundled Microsoft speech provider currently uses Microsoft Edge's online -neural TTS service via the `node-edge-tts` library. It's a hosted service (not -local), uses Microsoft endpoints, and does not require an API key. -`node-edge-tts` exposes speech configuration options and output formats, but -not all options are supported by the service. 
Legacy config and directive input -using `edge` still works and is normalized to `microsoft`. + + + `/tts status` shows the current state. `/tts audio Hello from OpenClaw` + sends a one-off audio reply. + + -Because this path is a public web service without a published SLA or quota, -treat it as best-effort. If you need guaranteed limits and support, use OpenAI -or ElevenLabs. + +Auto-TTS is **off** by default. When `messages.tts.provider` is unset, +OpenClaw picks the first configured provider in registry auto-select order. + -## Optional keys +## Supported providers -If you want Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo: +| Provider | Auth | Notes | +| ----------------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | +| **OpenAI** | `OPENAI_API_KEY` | Also used for auto-summary; supports persona `instructions`. | +| **ElevenLabs** | `ELEVENLABS_API_KEY` or `XI_API_KEY` | Voice cloning, multilingual, deterministic via `seed`. | +| **Google Gemini** | `GEMINI_API_KEY` or `GOOGLE_API_KEY` | Gemini API TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. | +| **Azure Speech** | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`) | Native Ogg/Opus voice-note output and telephony. | +| **Microsoft** | none | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA. | +| **MiniMax** | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`) | T2A v2 API. Defaults to `speech-2.8-hd`. | +| **Inworld** | `INWORLD_API_KEY` | Streaming TTS API. Native Opus voice-note and PCM telephony. | +| **xAI** | `XAI_API_KEY` | xAI batch TTS. Native Opus voice-note is **not** supported. 
| +| **Volcengine** | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API. | +| **Xiaomi MiMo** | `XIAOMI_API_KEY` | MiMo TTS through Xiaomi chat completions. | +| **OpenRouter** | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`) | Default model `hexgrad/kokoro-82m`. | +| **Gradium** | `GRADIUM_API_KEY` | Voice-note and telephony output. | +| **Vydra** | `VYDRA_API_KEY` | Shared image, video, and speech provider. | +| **Local CLI** | none | Runs a configured local TTS command. | -- `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` (also accepts - `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, and `SPEECH_REGION`) -- `ELEVENLABS_API_KEY` (or `XI_API_KEY`) -- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) -- `GRADIUM_API_KEY` -- `INWORLD_API_KEY` -- `MINIMAX_API_KEY`; MiniMax TTS also accepts Token Plan auth via - `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or - `MINIMAX_CODING_API_KEY` -- `OPENAI_API_KEY` -- `VOLCENGINE_TTS_API_KEY` (or `BYTEPLUS_SEED_SPEECH_API_KEY`); - legacy AppID/token auth also accepts `VOLCENGINE_TTS_APPID` and - `VOLCENGINE_TTS_TOKEN` -- `VYDRA_API_KEY` -- `XAI_API_KEY` -- `XIAOMI_API_KEY` +If multiple providers are configured, the selected one is used first and the +others are fallback options. Auto-summary uses `summaryModel` (or +`agents.defaults.model.primary`), so that provider must also be authenticated +if you keep summaries enabled. -Local CLI and Microsoft speech do **not** require an API key. + +The bundled **Microsoft** provider uses Microsoft Edge's online neural TTS +service via `node-edge-tts`. It is a public web service without a published +SLA or quota — treat it as best-effort. The legacy provider id `edge` is +normalized to `microsoft` and `openclaw doctor --fix` rewrites persisted +config; new configs should always use `microsoft`. 
+ -If multiple providers are configured, the selected provider is used first and the others are fallback options. -Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`), -so that provider must also be authenticated if you enable summaries. +## Configuration -## Service links - -- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech) -- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio) -- [Azure Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech) -- [Azure Speech provider](/providers/azure-speech) -- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech) -- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication) -- [Gradium](/providers/gradium) -- [Inworld TTS API](https://docs.inworld.ai/tts/tts) -- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2) -- [Volcengine TTS HTTP API](/providers/volcengine#text-to-speech) -- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech) -- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts) -- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs) -- [xAI Text to Speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest) - -## Is it enabled by default? - -No. Auto‑TTS is **off** by default. Enable it in config with -`messages.tts.auto` or locally with `/tts on`. - -When `messages.tts.provider` is unset, OpenClaw picks the first configured -speech provider in registry auto-select order. - -## Config - -TTS config lives under `messages.tts` in `openclaw.json`. -Full schema is in [Gateway configuration](/gateway/configuration). - -### Minimal config (enable + provider) +TTS config lives under `messages.tts` in `~/.openclaw/openclaw.json`. 
Pick a +preset and adapt the provider block: + + ```json5 { messages: { tts: { auto: "always", - provider: "elevenlabs", + provider: "openai", + summaryModel: "openai/gpt-4.1-mini", + modelOverrides: { enabled: true }, + providers: { + openai: { + apiKey: "${OPENAI_API_KEY}", + model: "gpt-4o-mini-tts", + voice: "alloy", + }, + elevenlabs: { + apiKey: "${ELEVENLABS_API_KEY}", + model: "eleven_multilingual_v2", + voiceId: "EXAVITQu4vr4xnSDxMaL", + voiceSettings: { stability: 0.5, similarityBoost: 0.75, style: 0.0, useSpeakerBoost: true, speed: 1.0 }, + applyTextNormalization: "auto", + languageCode: "en", + }, + }, }, }, } ``` - -### Per-agent voice overrides - -Use `agents.list[].tts` when one agent should speak with a different provider, -voice, model, style, or auto-TTS mode. The agent block deep-merges over -`messages.tts`, so provider credentials can stay in the global provider config. - + + ```json5 { messages: { @@ -125,78 +127,37 @@ voice, model, style, or auto-TTS mode. The agent block deep-merges over elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2", + voiceId: "EXAVITQu4vr4xnSDxMaL", }, }, }, }, - agents: { - list: [ - { - id: "reader", - tts: { - providers: { - elevenlabs: { - voiceId: "EXAVITQu4vr4xnSDxMaL", - }, - }, - }, - }, - ], - }, } ``` - -Precedence for automatic replies, `/tts audio`, `/tts status`, and the `tts` -agent tool is: - -1. `messages.tts` -2. active `agents.list[].tts` -3. local `/tts` preferences for this host -4. 
inline `[[tts:...]]` directives when model overrides are enabled - -### OpenAI primary with ElevenLabs fallback - + + ```json5 { messages: { tts: { auto: "always", - provider: "openai", - summaryModel: "openai/gpt-4.1-mini", - modelOverrides: { - enabled: true, - }, + provider: "google", providers: { - openai: { - apiKey: "openai_api_key", - baseUrl: "https://api.openai.com/v1", - model: "gpt-4o-mini-tts", - voice: "alloy", - }, - elevenlabs: { - apiKey: "elevenlabs_api_key", - baseUrl: "https://api.elevenlabs.io", - voiceId: "voice_id", - modelId: "eleven_multilingual_v2", - seed: 42, - applyTextNormalization: "auto", - languageCode: "en", - voiceSettings: { - stability: 0.5, - similarityBoost: 0.75, - style: 0.0, - useSpeakerBoost: true, - speed: 1.0, - }, + google: { + apiKey: "${GEMINI_API_KEY}", + model: "gemini-3.1-flash-tts-preview", + voiceName: "Kore", + // Optional natural-language style prompts: + // audioProfile: "Speak in a calm, podcast-host tone.", + // speakerName: "Alex", }, }, }, }, } ``` - -### Azure Speech primary - + + ```json5 { messages: { @@ -205,8 +166,8 @@ agent tool is: provider: "azure-speech", providers: { "azure-speech": { - // apiKey falls back to AZURE_SPEECH_KEY. - // region falls back to AZURE_SPEECH_REGION. + apiKey: "${AZURE_SPEECH_KEY}", + region: "eastus", voice: "en-US-JennyNeural", lang: "en-US", outputFormat: "audio-24khz-48kbitrate-mono-mp3", @@ -217,16 +178,8 @@ agent tool is: }, } ``` - -Azure Speech uses a Speech resource key, not an Azure OpenAI key. Resolution -order is `messages.tts.providers.azure-speech.apiKey` -> -`AZURE_SPEECH_KEY` -> `AZURE_SPEECH_API_KEY` -> `SPEECH_KEY`, plus -`messages.tts.providers.azure-speech.region` -> `AZURE_SPEECH_REGION` -> -`SPEECH_REGION` for the region. New config should use `azure-speech`; `azure` -is accepted as a provider alias. - -### Microsoft primary (no API key) - + + ```json5 { messages: { @@ -239,17 +192,16 @@ is accepted as a provider alias. 
voice: "en-US-MichelleNeural", lang: "en-US", outputFormat: "audio-24khz-48kbitrate-mono-mp3", - rate: "+10%", - pitch: "-5%", + rate: "+0%", + pitch: "+0%", }, }, }, }, } ``` - -### MiniMax primary - + + ```json5 { messages: { @@ -258,8 +210,7 @@ is accepted as a provider alias. provider: "minimax", providers: { minimax: { - apiKey: "minimax_api_key", - baseUrl: "https://api.minimax.io", + apiKey: "${MINIMAX_API_KEY}", model: "speech-2.8-hd", voiceId: "English_expressive_narrator", speed: 1.0, @@ -271,42 +222,8 @@ is accepted as a provider alias. }, } ``` - -MiniMax TTS auth resolution is `messages.tts.providers.minimax.apiKey`, then -stored `minimax-portal` OAuth/token profiles, then Token Plan environment keys -(`MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, -`MINIMAX_CODING_API_KEY`), then `MINIMAX_API_KEY`. When no explicit TTS -`baseUrl` is set, OpenClaw can reuse the configured `minimax-portal` OAuth -host for Token Plan speech. - -### Google Gemini primary - -```json5 -{ - messages: { - tts: { - auto: "always", - provider: "google", - providers: { - google: { - apiKey: "gemini_api_key", - model: "gemini-3.1-flash-tts-preview", - voiceName: "Kore", - }, - }, - }, - }, -} -``` - -Google Gemini TTS uses the Gemini API key path. A Google Cloud Console API key -restricted to the Gemini API is valid here, and it is the same style of key used -by the bundled Google image-generation provider. Resolution order is -`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` -> -`GEMINI_API_KEY` -> `GOOGLE_API_KEY`. - -### Inworld primary - + + ```json5 { messages: { @@ -315,56 +232,18 @@ by the bundled Google image-generation provider. 
Resolution order is provider: "inworld", providers: { inworld: { - apiKey: "inworld_api_key", - baseUrl: "https://api.inworld.ai", - voiceId: "Sarah", + apiKey: "${INWORLD_API_KEY}", modelId: "inworld-tts-1.5-max", - temperature: 0.8, + voiceId: "Sarah", + temperature: 0.7, }, }, }, }, } ``` - -The `apiKey` value must be the Base64-encoded credential string copied -verbatim from the Inworld dashboard (Workspace > API Keys). The provider -sends it as `Authorization: Basic ` without any additional -encoding, so do not pass a raw bearer token and do not Base64-encode it -yourself. The key falls back to the `INWORLD_API_KEY` env var. See -[Inworld provider](/providers/inworld) for full setup. - -### Volcengine primary - -```json5 -{ - messages: { - tts: { - auto: "always", - provider: "volcengine", - providers: { - volcengine: { - apiKey: "byteplus_seed_speech_api_key", - resourceId: "seed-tts-1.0", - voice: "en_female_anna_mars_bigtts", - speedRatio: 1.0, - }, - }, - }, - }, -} -``` - -Volcengine TTS uses the BytePlus Seed Speech API key from the Speech Console, -not the OpenAI-compatible `VOLCANO_ENGINE_API_KEY` used for Doubao model -providers. Resolution order is `messages.tts.providers.volcengine.apiKey` -> -`VOLCENGINE_TTS_API_KEY` -> `BYTEPLUS_SEED_SPEECH_API_KEY`. Legacy AppID/token -auth still works through `messages.tts.providers.volcengine.appId` / `token` or -`VOLCENGINE_TTS_APPID` / `VOLCENGINE_TTS_TOKEN`. Voice-note targets request -provider-native `ogg_opus`; normal audio-file targets request `mp3`. - -### xAI primary - + + ```json5 { messages: { @@ -373,25 +252,37 @@ provider-native `ogg_opus`; normal audio-file targets request `mp3`. provider: "xai", providers: { xai: { - apiKey: "xai_api_key", + apiKey: "${XAI_API_KEY}", voiceId: "eve", language: "en", responseFormat: "mp3", - speed: 1.0, }, }, }, }, } ``` - -xAI TTS uses the same `XAI_API_KEY` path as the bundled Grok model provider. 
-Resolution order is `messages.tts.providers.xai.apiKey` -> `XAI_API_KEY`. -Current live voices are `ara`, `eve`, `leo`, `rex`, `sal`, and `una`; `eve` is -the default. `language` accepts a BCP-47 tag or `auto`. - -### Xiaomi MiMo primary - + + +```json5 +{ + messages: { + tts: { + auto: "always", + provider: "volcengine", + providers: { + volcengine: { + apiKey: "${VOLCENGINE_TTS_API_KEY}", + resourceId: "seed-tts-1.0", + voice: "en_female_anna_mars_bigtts", + }, + }, + }, + }, +} +``` + + ```json5 { messages: { @@ -400,26 +291,18 @@ the default. `language` accepts a BCP-47 tag or `auto`. provider: "xiaomi", providers: { xiaomi: { - apiKey: "xiaomi_api_key", - baseUrl: "https://api.xiaomimimo.com/v1", + apiKey: "${XIAOMI_API_KEY}", model: "mimo-v2.5-tts", voice: "mimo_default", format: "mp3", - style: "Bright, natural, conversational tone.", }, }, }, }, } ``` - -Xiaomi MiMo TTS uses the same `XIAOMI_API_KEY` path as the bundled Xiaomi model -provider. The speech provider id is `xiaomi`; `mimo` is accepted as an alias. -The target text is sent as the assistant message, matching Xiaomi's TTS -contract. Optional `style` is sent as a user instruction and is not spoken. - -### OpenRouter primary - + + ```json5 { messages: { @@ -428,7 +311,7 @@ contract. Optional `style` is sent as a user instruction and is not spoken. provider: "openrouter", providers: { openrouter: { - apiKey: "openrouter_api_key", + apiKey: "${OPENROUTER_API_KEY}", model: "hexgrad/kokoro-82m", voice: "af_alloy", responseFormat: "mp3", @@ -438,14 +321,26 @@ contract. Optional `style` is sent as a user instruction and is not spoken. }, } ``` - -OpenRouter TTS uses the same `OPENROUTER_API_KEY` path as the bundled -OpenRouter model provider. Resolution order is -`messages.tts.providers.openrouter.apiKey` -> -`models.providers.openrouter.apiKey` -> `OPENROUTER_API_KEY`. 
- -### Local CLI primary - + + +```json5 +{ + messages: { + tts: { + auto: "always", + provider: "gradium", + providers: { + gradium: { + apiKey: "${GRADIUM_API_KEY}", + voiceId: "YTpq7expH9539ERJ", + }, + }, + }, + }, +} +``` + + ```json5 { messages: { @@ -464,28 +359,74 @@ OpenRouter model provider. Resolution order is }, } ``` + + -Local CLI TTS runs the configured command on the gateway host. `{{Text}}`, -`{{OutputPath}}`, `{{OutputDir}}`, and `{{OutputBase}}` placeholders are -expanded in `args`; if no `{{Text}}` placeholder is present, OpenClaw writes the -spoken text to stdin. `outputFormat` accepts `mp3`, `opus`, or `wav`. -Voice-note targets are transcoded to Ogg/Opus and telephony output is -transcoded to raw 16 kHz mono PCM with `ffmpeg`. The legacy provider alias -`cli` still works, but new config should use `tts-local-cli`. +### Per-agent voice overrides -### Gradium primary +Use `agents.list[].tts` when one agent should speak with a different provider, +voice, model, persona, or auto-TTS mode. The agent block deep-merges over +`messages.tts`, so provider credentials can stay in the global provider config: ```json5 { messages: { tts: { auto: "always", - provider: "gradium", + provider: "elevenlabs", providers: { - gradium: { - apiKey: "gradium_api_key", - baseUrl: "https://api.gradium.ai", - voiceId: "YTpq7expH9539ERJ", + elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2" }, + }, + }, + }, + agents: { + list: [ + { + id: "reader", + tts: { + providers: { + elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" }, + }, + }, + }, + ], + }, +} +``` + +To pin a per-agent persona, set `agents.list[].tts.persona` alongside provider +config — it overrides the global `messages.tts.persona` for that agent only. + +Precedence order for automatic replies, `/tts audio`, `/tts status`, and the +`tts` agent tool: + +1. `messages.tts` +2. active `agents.list[].tts` +3. local `/tts` preferences for this host +4. 
inline `[[tts:...]]` directives when [model overrides](#model-driven-directives) are enabled + +## Personas + +A **persona** is a stable spoken identity that can be applied deterministically +across providers. It can prefer one provider, define provider-neutral prompt +intent, and carry provider-specific bindings for voices, models, prompt +templates, seeds, and voice settings. + +### Minimal persona + +```json5 +{ + messages: { + tts: { + auto: "always", + persona: "narrator", + personas: { + narrator: { + label: "Narrator", + provider: "elevenlabs", + providers: { + elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL", modelId: "eleven_multilingual_v2" }, + }, }, }, }, @@ -493,12 +434,7 @@ transcoded to raw 16 kHz mono PCM with `ffmpeg`. The legacy provider alias } ``` -### TTS personas - -Use `messages.tts.personas` when you want a stable spoken identity that can be -applied deterministically across providers. A persona can prefer one provider, -define provider-neutral prompt intent, and carry provider-specific bindings for -voices, models, prompt templates, seeds, and voice settings. +### Full persona (provider-neutral prompt) ```json5 { @@ -527,10 +463,7 @@ voices, models, prompt templates, seeds, and voice settings. voiceName: "Algieba", promptTemplate: "audio-profile-v1", }, - openai: { - model: "gpt-4o-mini-tts", - voice: "cedar", - }, + openai: { model: "gpt-4o-mini-tts", voice: "cedar" }, elevenlabs: { voiceId: "voice_id", modelId: "eleven_multilingual_v2", @@ -551,376 +484,184 @@ voices, models, prompt templates, seeds, and voice settings. } ``` -Resolution is deterministic: +### Persona resolution + +The active persona is selected deterministically: 1. `/tts persona ` local preference, if set. 2. `messages.tts.persona`, if set. 3. No persona. -Provider selection is explicit-first: +Provider selection runs explicit-first: -1. Direct provider overrides from CLI, gateway, Talk, or allowed TTS directives. +1. 
Direct overrides (CLI, gateway, Talk, allowed TTS directives). 2. `/tts provider ` local preference. -3. Active persona `provider`. +3. Active persona's `provider`. 4. `messages.tts.provider`. 5. Registry auto-select. -For each provider attempt, OpenClaw merges: +For each provider attempt, OpenClaw merges configs in this order: 1. `messages.tts.providers.` 2. `messages.tts.personas..providers.` -3. trusted request overrides -4. allowed model-emitted TTS directive overrides +3. Trusted request overrides +4. Allowed model-emitted TTS directive overrides -`fallbackPolicy` controls what happens when an active persona has no binding for -an attempted provider: +### How providers use persona prompts -- `preserve-persona` keeps provider-neutral persona prompt fields available to - providers. This is the default. -- `provider-defaults` omits the persona from provider prompt preparation for - that attempt, so the provider uses its neutral defaults while still allowing - fallback to continue. -- `fail` skips that provider attempt with `reasonCode: "not_configured"` and - `personaBinding: "missing"`. Fallback providers are still tried; the whole TTS - request fails only if every attempted provider is skipped or fails. +Persona prompt fields (`profile`, `scene`, `sampleContext`, `style`, `accent`, +`pacing`, `constraints`) are **provider-neutral**. Each provider decides how +to use them: -Persona prompt fields are provider-neutral. Providers decide how to use them. -Google wraps them only when the effective Google provider config sets -`promptTemplate: "audio-profile-v1"` or `personaPrompt`; its older -`audioProfile` and `speakerName` fields are still prepended as Google-specific -prompt text. OpenAI maps prompt fields to `instructions` when no explicit -OpenAI `instructions` value is configured. Providers without prompt-like -controls use the provider-specific persona bindings only. 
+ + + Wraps persona prompt fields in a Gemini TTS prompt structure **only when** + the effective Google provider config sets `promptTemplate: "audio-profile-v1"` + or `personaPrompt`. The older `audioProfile` and `speakerName` fields are + still prepended as Google-specific prompt text. Inline audio tags such as + `[whispers]` or `[laughs]` inside a `[[tts:text]]` block are preserved + inside the Gemini transcript; OpenClaw does not generate these tags. + + + Maps persona prompt fields to the request `instructions` field **only when** + no explicit OpenAI `instructions` is configured. Explicit `instructions` + always wins. + + + Use only the provider-specific persona bindings under + `personas..providers.`. Persona prompt fields are ignored + unless the provider implements its own persona-prompt mapping. + + -Gemini inline audio tags are transcript content, not persona config. If the -assistant or an explicit `[[tts:text]]` block includes tags such as `[whispers]` -or `[laughs]`, OpenClaw preserves them inside the Gemini transcript. OpenClaw -does not generate configured start tags. +### Fallback policy -### Disable Microsoft speech +`fallbackPolicy` controls behavior when a persona has **no binding** for the +attempted provider: -```json5 -{ - messages: { - tts: { - providers: { - microsoft: { - enabled: false, - }, - }, - }, - }, -} -``` +| Policy | Behavior | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `preserve-persona` | **Default.** Provider-neutral prompt fields stay available; the provider may use them or ignore them. | +| `provider-defaults` | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. | +| `fail` | Skip that provider attempt with `reasonCode: "not_configured"` and `personaBinding: "missing"`. 
Fallback providers are still tried. | -### Custom limits + prefs path +The whole TTS request only fails when **every** attempted provider is skipped +or fails. -```json5 -{ - messages: { - tts: { - auto: "always", - maxTextLength: 4000, - timeoutMs: 30000, - prefsPath: "~/.openclaw/settings/tts.json", - }, - }, -} -``` +## Model-driven directives -### Only reply with audio after an inbound voice message +By default, the assistant **can** emit `[[tts:...]]` directives to override +voice, model, or speed for a single reply, plus an optional +`[[tts:text]]...[[/tts:text]]` block for expressive cues that should appear in +audio only: -```json5 -{ - messages: { - tts: { - auto: "inbound", - }, - }, -} -``` - -### Disable auto-summary for long replies - -```json5 -{ - messages: { - tts: { - auto: "always", - }, - }, -} -``` - -Then run: - -``` -/tts summary off -``` - -### Notes on fields - -- `auto`: auto‑TTS mode (`off`, `always`, `inbound`, `tagged`). - - `inbound` only sends audio after an inbound voice message. - - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block. -- `enabled`: legacy toggle (doctor migrates this to `auto`). -- `mode`: `"final"` (default) or `"all"` (includes tool/block replies). -- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"volcengine"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic). -- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order. -- Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and - rewritten to `provider: "microsoft"`. -- `persona`: default TTS persona id from `personas`. -- `personas.`: stable spoken identity. The id is normalized to lowercase. -- `personas..provider`: preferred speech provider for the persona. Explicit provider overrides and local provider prefs still win. 
-- `personas..fallbackPolicy`: `preserve-persona` (default), `provider-defaults`, or `fail`; see [TTS personas](#tts-personas). -- `personas..prompt`: provider-neutral persona prompt fields (`profile`, `scene`, `sampleContext`, `style`, `accent`, `pacing`, `constraints`). -- `personas..providers.`: provider-specific persona binding merged over `providers.`. -- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`. - - Accepts `provider/model` or a configured model alias. -- `modelOverrides`: allow the model to emit TTS directives (on by default). - - `allowProvider` defaults to `false` (provider switching is opt-in). -- `providers.`: provider-owned settings keyed by speech provider id. -- Legacy direct provider blocks (`messages.tts.openai`, `messages.tts.elevenlabs`, `messages.tts.microsoft`, `messages.tts.edge`) are repaired by `openclaw doctor --fix`; committed config should use `messages.tts.providers.`. -- Legacy `messages.tts.providers.edge` is also repaired by `openclaw doctor --fix`; committed config should use `messages.tts.providers.microsoft`. -- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded. -- `timeoutMs`: request timeout (ms). -- `prefsPath`: override the local prefs JSON path (provider/limit/summary). -- `apiKey` values fall back to env vars (`AZURE_SPEECH_KEY`/`AZURE_SPEECH_API_KEY`/`SPEECH_KEY`, `ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead. -- `providers.azure-speech.apiKey`: Azure Speech resource key (env: - `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`). -- `providers.azure-speech.region`: Azure Speech region such as `eastus` (env: - `AZURE_SPEECH_REGION` or `SPEECH_REGION`). 
-- `providers.azure-speech.endpoint` / `providers.azure-speech.baseUrl`: optional - Azure Speech endpoint/base URL override. -- `providers.azure-speech.voice`: Azure voice ShortName (default - `en-US-JennyNeural`). -- `providers.azure-speech.lang`: SSML language code (default `en-US`). -- `providers.azure-speech.outputFormat`: Azure `X-Microsoft-OutputFormat` for - standard audio output (default `audio-24khz-48kbitrate-mono-mp3`). -- `providers.azure-speech.voiceNoteOutputFormat`: Azure - `X-Microsoft-OutputFormat` for voice-note output (default - `ogg-24khz-16bit-mono-opus`). -- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL. -- `providers.openai.baseUrl`: override the OpenAI TTS endpoint. - - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1` - - Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted. -- `providers.elevenlabs.voiceSettings`: - - `stability`, `similarityBoost`, `style`: `0..1` - - `useSpeakerBoost`: `true|false` - - `speed`: `0.5..2.0` (1.0 = normal) -- `providers.elevenlabs.applyTextNormalization`: `auto|on|off` -- `providers.elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`) -- `providers.elevenlabs.seed`: integer `0..4294967295` (best-effort determinism) -- `providers.minimax.baseUrl`: override MiniMax API base URL (default `https://api.minimax.io`, env: `MINIMAX_API_HOST`). -- `providers.minimax.model`: TTS model (default `speech-2.8-hd`, env: `MINIMAX_TTS_MODEL`). -- `providers.minimax.voiceId`: voice identifier (default `English_expressive_narrator`, env: `MINIMAX_TTS_VOICE_ID`). -- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0). -- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0). -- `providers.minimax.pitch`: integer pitch shift `-12..12` (default 0). 
Fractional values are truncated before calling MiniMax T2A because the API rejects non-integer pitch values. -- `providers.tts-local-cli.command`: local executable or command string for CLI TTS. -- `providers.tts-local-cli.args`: command arguments; supports `{{Text}}`, `{{OutputPath}}`, `{{OutputDir}}`, and `{{OutputBase}}` placeholders. -- `providers.tts-local-cli.outputFormat`: expected CLI output format (`mp3`, `opus`, or `wav`; default `mp3` for audio attachments). -- `providers.tts-local-cli.timeoutMs`: command timeout in milliseconds (default `120000`). -- `providers.tts-local-cli.cwd`: optional command working directory. -- `providers.tts-local-cli.env`: optional string environment overrides for the command. -- `providers.inworld.baseUrl`: override Inworld API base URL (default `https://api.inworld.ai`). -- `providers.inworld.voiceId`: Inworld voice identifier (default `Sarah`). -- `providers.inworld.modelId`: Inworld TTS model (default `inworld-tts-1.5-max`; also supports `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`). -- `providers.inworld.temperature`: sampling temperature `0..2` (optional). -- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`). -- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted). -- `providers.google.audioProfile`: natural-language style prompt prepended before the spoken text. -- `providers.google.speakerName`: optional speaker label prepended before the spoken text when your TTS prompt uses a named speaker. -- `providers.google.promptTemplate`: set to `audio-profile-v1` to wrap active persona prompt fields in a deterministic Gemini TTS prompt structure. -- `providers.google.personaPrompt`: Google-specific extra persona prompt text appended to the template's Director's Notes. -- `providers.google.baseUrl`: override the Gemini API base URL. Only `https://generativelanguage.googleapis.com` is accepted. 
- - If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback. -- `providers.gradium.baseUrl`: override Gradium API base URL (default `https://api.gradium.ai`). -- `providers.gradium.voiceId`: Gradium voice identifier (default Emma, `YTpq7expH9539ERJ`). -- `providers.volcengine.apiKey`: BytePlus Seed Speech API key (env: - `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`). -- `providers.volcengine.resourceId`: BytePlus Seed Speech resource id (default - `seed-tts-1.0`, env: `VOLCENGINE_TTS_RESOURCE_ID`; use `seed-tts-2.0` when - your BytePlus project has TTS 2.0 entitlement). -- `providers.volcengine.appKey`: BytePlus Seed Speech app key header (default - `aGjiRDfUWi`, env: `VOLCENGINE_TTS_APP_KEY`). -- `providers.volcengine.baseUrl`: override the Seed Speech TTS HTTP endpoint - (env: `VOLCENGINE_TTS_BASE_URL`). -- `providers.volcengine.appId`: legacy Volcengine Speech Console application id (env: `VOLCENGINE_TTS_APPID`). -- `providers.volcengine.token`: legacy Volcengine Speech Console access token (env: `VOLCENGINE_TTS_TOKEN`). -- `providers.volcengine.cluster`: legacy Volcengine TTS cluster (default `volcano_tts`, env: `VOLCENGINE_TTS_CLUSTER`). -- `providers.volcengine.voice`: voice type (default `en_female_anna_mars_bigtts`, env: `VOLCENGINE_TTS_VOICE`). -- `providers.volcengine.speedRatio`: provider-native speed ratio. -- `providers.volcengine.emotion`: provider-native emotion tag. -- `providers.xai.apiKey`: xAI TTS API key (env: `XAI_API_KEY`). -- `providers.xai.baseUrl`: override the xAI TTS base URL (default `https://api.x.ai/v1`, env: `XAI_BASE_URL`). -- `providers.xai.voiceId`: xAI voice id (default `eve`; current live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`). -- `providers.xai.language`: BCP-47 language code or `auto` (default `en`). -- `providers.xai.responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw` (default `mp3`). 
-- `providers.xai.speed`: provider-native speed override. -- `providers.xiaomi.apiKey`: Xiaomi MiMo API key (env: `XIAOMI_API_KEY`). -- `providers.xiaomi.baseUrl`: override the Xiaomi MiMo API base URL (default `https://api.xiaomimimo.com/v1`, env: `XIAOMI_BASE_URL`). -- `providers.xiaomi.model`: TTS model (default `mimo-v2.5-tts`, env: `XIAOMI_TTS_MODEL`; `mimo-v2-tts` is also supported). -- `providers.xiaomi.voice`: MiMo voice id (default `mimo_default`, env: `XIAOMI_TTS_VOICE`). -- `providers.xiaomi.format`: `mp3` or `wav` (default `mp3`, env: `XIAOMI_TTS_FORMAT`). -- `providers.xiaomi.style`: optional natural-language style instruction sent as the user message; it is not spoken. -- `providers.openrouter.apiKey`: OpenRouter API key (env: `OPENROUTER_API_KEY`; can reuse `models.providers.openrouter.apiKey`). -- `providers.openrouter.baseUrl`: override the OpenRouter TTS base URL (default `https://openrouter.ai/api/v1`; legacy `https://openrouter.ai/v1` is normalized). -- `providers.openrouter.model`: OpenRouter TTS model id (default `hexgrad/kokoro-82m`; `modelId` is also accepted). -- `providers.openrouter.voice`: provider-specific voice id (default `af_alloy`; `voiceId` is also accepted). -- `providers.openrouter.responseFormat`: `mp3` or `pcm` (default `mp3`). -- `providers.openrouter.speed`: provider-native speed override. -- `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key). -- `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`). -- `providers.microsoft.lang`: language code (e.g. `en-US`). -- `providers.microsoft.outputFormat`: Microsoft output format (e.g. `audio-24khz-48kbitrate-mono-mp3`). - - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport. -- `providers.microsoft.rate` / `providers.microsoft.pitch` / `providers.microsoft.volume`: percent strings (e.g. `+10%`, `-5%`). 
-- `providers.microsoft.saveSubtitles`: write JSON subtitles alongside the audio file. -- `providers.microsoft.proxy`: proxy URL for Microsoft speech requests. -- `providers.microsoft.timeoutMs`: request timeout override (ms). -- `edge.*`: legacy alias for the same Microsoft settings. Run - `openclaw doctor --fix` to rewrite persisted config to `providers.microsoft`. - -## Model-driven overrides (default on) - -By default, the model **can** emit TTS directives for a single reply. -When `messages.tts.auto` is `tagged`, these directives are required to trigger audio. - -When enabled, the model can emit `[[tts:...]]` directives to override the voice -for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to -provide expressive tags (laughter, singing cues, etc) that should only appear in -the audio. - -Streaming block delivery strips these directives from visible text before the -channel sees them, even when a directive is split across adjacent blocks. Final -mode still parses the accumulated raw reply for TTS synthesis. - -`provider=...` directives are ignored unless `modelOverrides.allowProvider: true`. -When a reply declares `provider=...`, the other keys in that directive are -parsed only by that provider. Unsupported keys are stripped from visible text -and reported as TTS directive warnings instead of being routed to another -provider. - -Example reply payload: - -``` +```text Here you go. [[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]] [[tts:text]](laughs) Read the song once more.[[/tts:text]] ``` -Available directive keys (when enabled): +When `messages.tts.auto` is `"tagged"`, **directives are required** to trigger +audio. Streaming block delivery strips directives from visible text before the +channel sees them, even when split across adjacent blocks. 
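+
+For example, a minimal config that only speaks directive-tagged replies might
+look like this (a sketch; other fields omitted):
+
+```json5
+{
+  messages: {
+    tts: {
+      // synthesize audio only when the reply carries [[tts:...]] directives
+      auto: "tagged",
+      provider: "openai",
+    },
+  },
+}
+```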
-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `volcengine`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
-- `voice` (OpenAI, Gradium, Volcengine, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
-- `model` (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or `google_model` (Google TTS model)
+`provider=...` is ignored unless `modelOverrides.allowProvider: true`. When a
+reply declares `provider=...`, the other keys in that directive are parsed
+only by that provider; unsupported keys are stripped and reported as TTS
+directive warnings.
+
+**Available directive keys:**
+
+- `provider` (registered provider id; requires `allowProvider: true`)
+- `voice` / `voiceName` / `voice_name` / `google_voice` / `voiceId`
+- `model` / `google_model`
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
-- `vol` / `volume` (MiniMax volume, 0-10)
+- `vol` / `volume` (MiniMax volume, 0–10)
+- `pitch` (MiniMax integer pitch, −12 to 12; fractional values are truncated)
 - `emotion` (Volcengine emotion tag)
 - `applyTextNormalization` (`auto|on|off`)
 - `languageCode` (ISO 639-1)
 - `seed`
 
-Disable all model overrides:
+**Disable model overrides entirely:**
 
 ```json5
-{
-  messages: {
-    tts: {
-      modelOverrides: {
-        enabled: false,
-      },
-    },
-  },
-}
+{ messages: { tts: { modelOverrides: { enabled: false } } } }
 ```
 
-Optional allowlist (enable provider switching while keeping other knobs configurable):
+**Allow provider switching while keeping other knobs configurable:**
 
 ```json5
-{
-  messages: {
-    tts: {
-      modelOverrides: {
-        enabled: true,
-        allowProvider: true,
-        allowSeed: false,
-      },
-    },
-  },
-}
+{ messages: { tts: { modelOverrides: { enabled: true, allowProvider: true, allowSeed: false } } } }
 ```
 
+## Slash commands
+
+There is a single command, `/tts`. On Discord, OpenClaw also registers
+`/voice` because `/tts` is a built-in Discord command; text `/tts ...` still
+works.
+
+```text
+/tts off | on | status
+/tts chat on | off | default
+/tts latest
+/tts provider <provider>
+/tts persona <persona> | off
+/tts limit <chars>
+/tts summary off
+/tts audio <text>
+```
+
+Commands require an authorized sender (allowlist/owner rules apply), and
+either `commands.text` or native command registration must be enabled.
+
+Behavior notes:
+
+- `/tts on` writes the local TTS preference to `always`; `/tts off` writes it to `off`.
+- `/tts chat on|off|default` writes a session-scoped auto-TTS override for the current chat.
+- `/tts persona <persona>` writes the local persona preference; `/tts persona off` clears it.
+- `/tts latest` reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
+- `/tts audio` generates a one-off audio reply (does **not** toggle TTS on).
+- `limit` and `summary` are stored in **local prefs**, not the main config.
+- `/tts status` includes fallback diagnostics for the latest attempt: `Fallback: <from> -> <to>`, `Attempts: ...`, and per-attempt detail (`provider:outcome(reasonCode) latency`).
+- `/status` shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.
+
 ## Per-user preferences
 
-Slash commands write local overrides to `prefsPath` (default:
-`~/.openclaw/settings/tts.json`, override with `OPENCLAW_TTS_PREFS` or
-`messages.tts.prefsPath`).
+Slash commands write local overrides to `prefsPath`. The default is
+`~/.openclaw/settings/tts.json`; override with the `OPENCLAW_TTS_PREFS` env var
+or `messages.tts.prefsPath`.
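+
+An illustrative prefs file (the values here are examples, and `narrator` is a
+hypothetical persona id):
+
+```json
+{
+  "auto": "always",
+  "provider": "elevenlabs",
+  "persona": "narrator",
+  "maxLength": 2000,
+  "summarize": true
+}
+```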
-Stored fields: - -- `auto` -- `provider` -- `persona` -- `maxLength` (summary threshold; default 1500 chars) -- `summarize` (default `true`) +| Stored field | Effect | +| ------------ | -------------------------------------------- | +| `auto` | Local auto-TTS override (`always`, `off`, …) | +| `provider` | Local primary provider override | +| `persona` | Local persona override | +| `maxLength` | Summary threshold (default `1500` chars) | +| `summarize` | Summary toggle (default `true`) | These override the effective config from `messages.tts` plus the active `agents.list[].tts` block for that host. -## Output formats (fixed) - -- **Feishu / Matrix / Telegram / WhatsApp**: voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). - - 48kHz / 64kbps is a good voice message tradeoff. -- **Feishu / WhatsApp**: when a voice-note reply is produced as MP3/WebM/WAV/M4A - or another likely audio file, the channel plugin transcodes it to 48kHz - Ogg/Opus with `ffmpeg` before sending the native voice message. WhatsApp sends - the result through the Baileys `audio` payload with `ptt: true` and - `audio/ogg; codecs=opus`. If conversion fails, Feishu receives the original - file as an attachment; WhatsApp send fails rather than posting an incompatible - PTT payload. -- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). - - 44.1kHz / 128kbps is the default balance for speech clarity. -- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu, Telegram, and WhatsApp, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery. -- **Xiaomi MiMo**: MP3 by default, or WAV when configured. For voice-note targets such as Feishu, Telegram, and WhatsApp, OpenClaw transcodes Xiaomi output to 48kHz Opus with `ffmpeg` before delivery. -- **Local CLI**: uses the configured `outputFormat`. 
Voice-note targets are - converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM - with `ffmpeg`. -- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony. -- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony. -- **Inworld**: MP3 for normal audio attachments, native `OGG_OPUS` for voice-note targets, and raw `PCM` at 22050 Hz for Talk/telephony. -- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path. -- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). - - The bundled transport accepts an `outputFormat`, but not all formats are available from the service. - - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus). - - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need - guaranteed Opus voice messages. - - If the configured Microsoft output format fails, OpenClaw retries with MP3. - -OpenAI/ElevenLabs output formats are fixed per channel (see above). - ## Auto-TTS behavior -When enabled, OpenClaw: +When `messages.tts.auto` is enabled, OpenClaw: -- skips TTS if the reply already contains media or a `MEDIA:` directive. -- skips very short replies (< 10 chars). -- summarizes long replies when enabled using `agents.defaults.model.primary` (or `summaryModel`). -- attaches the generated audio to the reply. -- in `mode: "final"`, still sends audio-only TTS for streamed final replies +- Skips TTS if the reply already contains media or a `MEDIA:` directive. +- Skips very short replies (under 10 chars). 
+- Summarizes long replies when summaries are enabled, using + `summaryModel` (or `agents.defaults.model.primary`). +- Attaches the generated audio to the reply. +- In `mode: "final"`, still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments. If the reply exceeds `maxLength` and summary is off (or no API key for the -summary model), audio -is skipped and the normal text reply is sent. +summary model), audio is skipped and the normal text reply is sent. -## Flow diagram - -``` +```text Reply -> TTS enabled? no -> send text yes -> has media / MEDIA: / short? @@ -929,80 +670,247 @@ Reply -> TTS enabled? no -> TTS -> attach audio yes -> summary enabled? no -> send text - yes -> summarize (summaryModel or agents.defaults.model.primary) - -> TTS -> attach audio + yes -> summarize -> TTS -> attach audio ``` -## Slash command usage +## Output formats by channel -There is a single command: `/tts`. -See [Slash commands](/tools/slash-commands) for enablement details. +| Target | Format | +| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| Feishu / Matrix / Telegram / WhatsApp | Voice-note replies prefer **Opus** (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). 48 kHz / 64 kbps balances clarity and size. | +| Other channels | **MP3** (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). 44.1 kHz / 128 kbps default for speech. | +| Talk / telephony | Provider-native **PCM** (Inworld 22050 Hz, Google 24 kHz), or `ulaw_8000` from Gradium for telephony. | -Discord note: `/tts` is a built-in Discord command, so OpenClaw registers -`/voice` as the native command there. Text `/tts ...` still works. 
+Per-provider notes: -``` -/tts off -/tts on -/tts status -/tts chat on -/tts chat off -/tts chat default -/tts latest -/tts provider openai -/tts persona alfred -/tts limit 2000 -/tts summary off -/tts audio Hello from OpenClaw -``` +- **Feishu / WhatsApp transcoding:** When a voice-note reply lands as MP3/WebM/WAV/M4A, the channel plugin transcodes to 48 kHz Ogg/Opus with `ffmpeg`. WhatsApp sends through Baileys with `ptt: true` and `audio/ogg; codecs=opus`. If conversion fails: Feishu falls back to attaching the original file; WhatsApp send fails rather than posting an incompatible PTT payload. +- **MiniMax / Xiaomi MiMo:** Default MP3 (32 kHz for MiniMax `speech-2.8-hd`); transcoded to 48 kHz Opus for voice-note targets via `ffmpeg`. +- **Local CLI:** Uses configured `outputFormat`. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM. +- **Google Gemini:** Returns raw 24 kHz PCM. OpenClaw wraps as WAV for attachments, transcodes to 48 kHz Opus for voice-note targets, returns PCM directly for Talk/telephony. +- **Inworld:** MP3 attachments, native `OGG_OPUS` voice-note, raw `PCM` 22050 Hz for Talk/telephony. +- **xAI:** MP3 by default; `responseFormat` may be `mp3|wav|pcm|mulaw|alaw`. Uses xAI's batch REST endpoint — streaming WebSocket TTS is **not** used. Native Opus voice-note format is **not** supported. +- **Microsoft:** Uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3. -Notes: +OpenAI and ElevenLabs output formats are fixed per channel as listed above. -- Commands require an authorized sender (allowlist/owner rules still apply). -- `commands.text` or native command registration must be enabled. -- Config `messages.tts.auto` accepts `off|always|inbound|tagged`. 
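+
+The Local CLI path above is config-driven end to end. A sketch of a provider
+block (the `my-tts` binary and its flags are hypothetical; the `{{...}}`
+placeholders are the documented ones):
+
+```json5
+{
+  messages: {
+    tts: {
+      provider: "tts-local-cli",
+      providers: {
+        "tts-local-cli": {
+          command: "my-tts", // hypothetical local TTS binary
+          args: ["--text", "{{Text}}", "--out", "{{OutputPath}}"],
+          outputFormat: "wav", // mp3 | opus | wav
+          timeoutMs: 60000,
+        },
+      },
+    },
+  },
+}
+```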
-- `/tts on` writes the local TTS preference to `always`; `/tts off` writes it to `off`.
-- `/tts chat on|off|default` writes a session-scoped auto-TTS override for the current chat.
-- Use config when you want `inbound` or `tagged` defaults.
-- `/tts persona ` writes the local persona preference; `/tts persona off` clears it.
-- `limit` and `summary` are stored in local prefs, not the main config.
-- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
-- `/tts latest` reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
-- `/tts status` includes fallback visibility for the latest attempt:
-  - success fallback: `Fallback: -> ` plus `Attempts: ...`
-  - failure: `Error: ...` plus `Attempts: ...`
-  - detailed diagnostics: `Attempt details: provider:outcome(reasonCode) latency`
-- `/status` shows the active TTS mode plus configured provider, model, voice,
-  and sanitized custom endpoint metadata when TTS is enabled.
-- OpenAI and ElevenLabs API failures now include parsed provider error detail and request id (when returned by the provider), which is surfaced in TTS errors/logs.
+## Field reference
+
+### Core (`messages.tts`)
+
+- `auto`: Auto-TTS mode (`off|always|inbound|tagged`). `inbound` only sends audio after an inbound voice message; `tagged` only sends audio when the reply includes `[[tts:...]]` directives or a `[[tts:text]]` block.
+- `enabled`: legacy toggle. `openclaw doctor --fix` migrates this to `auto`.
+- `mode`: `"all"` includes tool/block replies in addition to final replies.
+- `provider`: speech provider id. When unset, OpenClaw uses the first configured provider in registry auto-select order. Legacy `provider: "edge"` is rewritten to `"microsoft"` by `openclaw doctor --fix`.
+- `persona`: active persona id from `personas`. Normalized to lowercase.
+- `personas`: stable spoken identities. Fields: `label`, `description`, `provider`, `fallbackPolicy`, `prompt`, `providers.<provider>`. See [Personas](#personas).
+- `summaryModel`: cheap model for auto-summary; defaults to `agents.defaults.model.primary`. Accepts `provider/model` or a configured model alias.
+- `modelOverrides`: allow the model to emit TTS directives. `enabled` defaults to `true`; `allowProvider` defaults to `false`.
+- `providers`: provider-owned settings keyed by speech provider id. Legacy direct blocks (`messages.tts.openai`, `.elevenlabs`, `.microsoft`, `.edge`) are rewritten by `openclaw doctor --fix`; commit only `messages.tts.providers.<provider>`.
+- Input hard cap: maximum number of TTS input characters; `/tts audio` fails if exceeded.
+- `timeoutMs`: request timeout in milliseconds.
+- `prefsPath`: override the local prefs JSON path (provider/limit/summary). Default `~/.openclaw/settings/tts.json`.
+
+### OpenAI (`providers.openai`)
+
+- `apiKey`: falls back to `OPENAI_API_KEY`.
+- `model`: OpenAI TTS model id (e.g. `gpt-4o-mini-tts`).
+- `voice`: voice name (e.g. `alloy`, `cedar`).
+- `instructions`: explicit OpenAI `instructions` field. When set, persona prompt fields are **not** auto-mapped.
+- `baseUrl`: override the OpenAI TTS endpoint. Resolution order: config → `OPENAI_TTS_BASE_URL` → `https://api.openai.com/v1`. Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
+
+### ElevenLabs (`providers.elevenlabs`)
+
+- `apiKey`: falls back to `ELEVENLABS_API_KEY` or `XI_API_KEY`.
+- `modelId`: model id (e.g. `eleven_multilingual_v2`, `eleven_v3`).
+- `voiceId`: ElevenLabs voice id.
+- `voiceSettings`: `stability`, `similarityBoost`, `style` (each `0..1`), `useSpeakerBoost` (`true|false`), `speed` (`0.5..2.0`, `1.0` = normal).
+- `applyTextNormalization`: text normalization mode (`auto|on|off`).
+- `languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`).
+- `seed`: integer `0..4294967295` for best-effort determinism.
+- `baseUrl`: override ElevenLabs API base URL.
+
+### Google Gemini (`providers.google`)
+
+- `apiKey`: falls back to `GEMINI_API_KEY` / `GOOGLE_API_KEY`. If omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
+- `model`: Gemini TTS model. Default `gemini-3.1-flash-tts-preview`.
+- `voiceName`: Gemini prebuilt voice name. Default `Kore`. Alias: `voice`.
+- `audioProfile`: natural-language style prompt prepended before spoken text.
+- `speakerName`: optional speaker label prepended before spoken text when your prompt uses a named speaker.
+- `promptTemplate`: set to `audio-profile-v1` to wrap active persona prompt fields in a deterministic Gemini TTS prompt structure.
+- `personaPrompt`: Google-specific extra persona prompt text appended to the template's Director's Notes.
+- `baseUrl`: only `https://generativelanguage.googleapis.com` is accepted.
+
+### Azure Speech (`providers.azure-speech`)
+
+- `apiKey`: env `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`.
+- `region`: Azure Speech region (e.g. `eastus`). Env: `AZURE_SPEECH_REGION` or `SPEECH_REGION`.
+- `endpoint`: optional Azure Speech endpoint override (alias `baseUrl`).
+- `voice`: Azure voice ShortName. Default `en-US-JennyNeural`.
+- `lang`: SSML language code. Default `en-US`.
+- `outputFormat`: Azure `X-Microsoft-OutputFormat` for standard audio. Default `audio-24khz-48kbitrate-mono-mp3`.
+- `voiceNoteOutputFormat`: Azure `X-Microsoft-OutputFormat` for voice-note output. Default `ogg-24khz-16bit-mono-opus`.
+
+### Microsoft (`providers.microsoft`)
+
+- `enabled`: allow Microsoft speech usage (default `true`; no API key).
+- `voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
+- `lang`: language code (e.g. `en-US`).
+- `outputFormat`: Microsoft output format. Default `audio-24khz-48kbitrate-mono-mp3`. Not all formats are supported by the bundled Edge-backed transport.
+- `rate` / `pitch` / `volume`: percent strings (e.g. `+10%`, `-5%`).
+- `saveSubtitles`: write JSON subtitles alongside the audio file.
+- `proxy`: proxy URL for Microsoft speech requests.
+- `timeoutMs`: request timeout override (ms).
+- `edge.*`: legacy alias. Run `openclaw doctor --fix` to rewrite persisted config to `providers.microsoft`.
+
+### MiniMax (`providers.minimax`)
+
+- `apiKey`: falls back to `MINIMAX_API_KEY`. Token Plan auth via `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or `MINIMAX_CODING_API_KEY`.
+- `baseUrl`: default `https://api.minimax.io`. Env: `MINIMAX_API_HOST`.
+- `model`: default `speech-2.8-hd`. Env: `MINIMAX_TTS_MODEL`.
+- `voiceId`: default `English_expressive_narrator`. Env: `MINIMAX_TTS_VOICE_ID`.
+- `speed`: `0.5..2.0`. Default `1.0`.
+- `vol`: `(0, 10]`. Default `1.0`.
+- `pitch`: integer `-12..12`. Default `0`. Fractional values are truncated before the request.
+
+### Inworld (`providers.inworld`)
+
+- `apiKey`: env `INWORLD_API_KEY`.
+- `baseUrl`: default `https://api.inworld.ai`.
+- `modelId`: default `inworld-tts-1.5-max`. Also: `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.
+- `voiceId`: default `Sarah`.
+- `temperature`: sampling temperature `0..2`.
+
+### xAI (`providers.xai`)
+
+- `apiKey`: env `XAI_API_KEY`.
+- `baseUrl`: default `https://api.x.ai/v1`. Env: `XAI_BASE_URL`.
+- `voiceId`: default `eve`. Live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`.
+- `language`: BCP-47 language code or `auto`. Default `en`.
+- `responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. Default `mp3`.
+- `speed`: provider-native speed override.
+
+### Volcengine (`providers.volcengine`)
+
+- `apiKey`: env `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`.
+- `resourceId`: default `seed-tts-1.0`. Env: `VOLCENGINE_TTS_RESOURCE_ID`. Use `seed-tts-2.0` when your project has TTS 2.0 entitlement.
+- `appKey`: app key header. Default `aGjiRDfUWi`. Env: `VOLCENGINE_TTS_APP_KEY`.
+- `baseUrl`: override the Seed Speech TTS HTTP endpoint. Env: `VOLCENGINE_TTS_BASE_URL`.
+- `voice`: voice type. Default `en_female_anna_mars_bigtts`. Env: `VOLCENGINE_TTS_VOICE`.
+- `speedRatio`: provider-native speed ratio.
+- `emotion`: provider-native emotion tag.
+- `appId` / `token` / `cluster`: legacy Volcengine Speech Console fields. Env: `VOLCENGINE_TTS_APPID`, `VOLCENGINE_TTS_TOKEN`, `VOLCENGINE_TTS_CLUSTER` (default `volcano_tts`).
+
+### Xiaomi MiMo (`providers.xiaomi`)
+
+- `apiKey`: env `XIAOMI_API_KEY`.
+- `baseUrl`: default `https://api.xiaomimimo.com/v1`. Env: `XIAOMI_BASE_URL`.
+- `model`: default `mimo-v2.5-tts`. Env: `XIAOMI_TTS_MODEL`. Also supports `mimo-v2-tts`.
+- `voice`: default `mimo_default`. Env: `XIAOMI_TTS_VOICE`.
+- `format`: `mp3` or `wav`. Default `mp3`. Env: `XIAOMI_TTS_FORMAT`.
+- `style`: optional natural-language style instruction sent as the user message; not spoken.
+
+### OpenRouter (`providers.openrouter`)
+
+- `apiKey`: env `OPENROUTER_API_KEY`. Can reuse `models.providers.openrouter.apiKey`.
+- `baseUrl`: default `https://openrouter.ai/api/v1`. Legacy `https://openrouter.ai/v1` is normalized.
+- `model`: default `hexgrad/kokoro-82m`. Alias: `modelId`.
+- `voice`: default `af_alloy`. Alias: `voiceId`.
+- `responseFormat`: `mp3` or `pcm`. Default `mp3`.
+- `speed`: provider-native speed override.
+
+### Gradium (`providers.gradium`)
+
+- `apiKey`: env `GRADIUM_API_KEY`.
+- `baseUrl`: default `https://api.gradium.ai`.
+- `voiceId`: default Emma (`YTpq7expH9539ERJ`).
+
+### Local CLI (`providers.tts-local-cli`)
+
+- `command`: local executable or command string for CLI TTS.
+- `args`: command arguments. Supports `{{Text}}`, `{{OutputPath}}`, `{{OutputDir}}`, `{{OutputBase}}` placeholders.
+- `outputFormat`: expected CLI output format (`mp3`, `opus`, or `wav`). Default `mp3` for audio attachments.
+- `timeoutMs`: command timeout in milliseconds. Default `120000`.
+- `cwd`: optional command working directory.
+- `env`: optional environment overrides for the command.
+
 ## Agent tool
 
 The `tts` tool converts text to speech and returns an audio attachment for
-reply delivery. When the channel is Feishu, Matrix, Telegram, or WhatsApp,
-the audio is delivered as a voice message rather than a file attachment.
-Feishu and WhatsApp can transcode non-Opus TTS output on this path when
-`ffmpeg` is available.
+reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is
+delivered as a voice message rather than a file attachment. Feishu and
+WhatsApp can transcode non-Opus TTS output on this path when `ffmpeg` is
+available.
+
 WhatsApp sends audio through Baileys as a PTT voice note (`audio` with
-`ptt: true`), and sends visible text separately from PTT audio because clients
-do not consistently render captions on voice notes.
-It accepts optional `channel` and `timeoutMs` fields; `timeoutMs` is a
+`ptt: true`) and sends visible text **separately** from PTT audio because
+clients do not consistently render captions on voice notes.
+
+The tool accepts optional `channel` and `timeoutMs` fields; `timeoutMs` is a
 per-call provider request timeout in milliseconds.
 
 ## Gateway RPC
 
-Gateway methods:
+| Method            | Purpose                                  |
+| ----------------- | ---------------------------------------- |
+| `tts.status`      | Read current TTS state and last attempt. |
+| `tts.enable`      | Set local auto preference to `always`.   |
+| `tts.disable`     | Set local auto preference to `off`.      |
+| `tts.convert`     | One-off text → audio.                    |
+| `tts.setProvider` | Set local provider preference.           |
+| `tts.setPersona`  | Set local persona preference.            |
+| `tts.providers`   | List configured providers and status.
| -- `tts.status` -- `tts.enable` -- `tts.disable` -- `tts.convert` -- `tts.setProvider` -- `tts.setPersona` -- `tts.providers` +## Service links + +- [OpenAI text-to-speech guide](https://platform.openai.com/docs/guides/text-to-speech) +- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio) +- [Azure Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech) +- [Azure Speech provider](/providers/azure-speech) +- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech) +- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication) +- [Gradium](/providers/gradium) +- [Inworld TTS API](https://docs.inworld.ai/tts/tts) +- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2) +- [Volcengine TTS HTTP API](/providers/volcengine#text-to-speech) +- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech) +- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts) +- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs) +- [xAI text to speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest) ## Related - [Media overview](/tools/media-overview) - [Music generation](/tools/music-generation) - [Video generation](/tools/video-generation) +- [Slash commands](/tools/slash-commands) +- [Voice call plugin](/plugins/voice-call)