openclaw/docs/tools/tts.md at 3de5979bdc8e4e9e9d3fee446eaab53cad2ff605

mirror of https://github.com/openclaw/openclaw.git synced 2026-05-24 03:49:50 +00:00

Peter Steinberger f91de52f0d refactor: move runtime state to SQLite

2026-05-13 13:15:12 +01:00

---
summary: "Text-to-speech for outbound replies: providers, personas, slash commands, and per-channel output"
read_when:
  - Enabling text-to-speech for replies
  - Configuring a TTS provider, fallback chain, or persona
  - Using /tts commands or directives
title: "Text-to-speech"
sidebarTitle: "Text to speech (TTS)"
---

OpenClaw can convert outbound replies into audio across 15 speech providers. It delivers native voice messages on Feishu, Matrix, Telegram, and WhatsApp, audio attachments everywhere else, and PCM/μ-law streams for telephony and Talk.

TTS is the speech-output half of Talk's stt-tts mode. Provider-native realtime Talk sessions synthesize speech inside the realtime provider instead of calling this TTS path, while transcription sessions do not synthesize an assistant voice response.

Quick start

1. Pick a provider. OpenAI and ElevenLabs are the most reliable hosted options; Microsoft and Local CLI work without an API key. See the [provider matrix](#supported-providers) for the full list.
2. Export the env var for your provider (for example `OPENAI_API_KEY` or `ELEVENLABS_API_KEY`).
3. Set `messages.tts.auto: "always"` and `messages.tts.provider`:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
    },
  },
}
```

`/tts status` shows the current state. `/tts audio Hello from OpenClaw` sends a one-off audio reply.

Auto-TTS is **off** by default. When `messages.tts.provider` is unset, OpenClaw picks the first configured provider in registry auto-select order.

The built-in `tts` agent tool is explicit-intent only: ordinary chat stays text unless the user asks for audio, uses `/tts`, or enables Auto-TTS/directive speech.

Supported providers

| Provider | Auth | Notes |
| --- | --- | --- |
| Azure Speech | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`) | Native Ogg/Opus voice-note output and telephony. |
| DeepInfra | `DEEPINFRA_API_KEY` | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`. |
| ElevenLabs | `ELEVENLABS_API_KEY` or `XI_API_KEY` | Voice cloning, multilingual, deterministic via `seed`; streamed for Discord voice playback. |
| Google Gemini | `GEMINI_API_KEY` or `GOOGLE_API_KEY` | Gemini API batch TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. |
| Gradium | `GRADIUM_API_KEY` | Voice-note and telephony output. |
| Inworld | `INWORLD_API_KEY` | Streaming TTS API. Native Opus voice-note and PCM telephony. |
| Local CLI | none | Runs a configured local TTS command. |
| Microsoft | none | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA. |
| MiniMax | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`) | T2A v2 API. Defaults to `speech-2.8-hd`. |
| OpenAI | `OPENAI_API_KEY` | Also used for auto-summary; supports persona `instructions`. |
| OpenRouter | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`) | Default model `hexgrad/kokoro-82m`. |
| Volcengine | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API. |
| Vydra | `VYDRA_API_KEY` | Shared image, video, and speech provider. |
| xAI | `XAI_API_KEY` | xAI batch TTS. Native Opus voice-note is not supported. |
| Xiaomi MiMo | `XIAOMI_API_KEY` | MiMo TTS through Xiaomi chat completions. |

If multiple providers are configured, the selected one is used first and the others are fallback options. Auto-summary uses `summaryModel` (or `agents.defaults.model.primary`), so that provider must also be authenticated if you keep summaries enabled.

The bundled **Microsoft** provider uses Microsoft Edge's online neural TTS service via `node-edge-tts`. It is a public web service without a published SLA or quota — treat it as best-effort. The legacy provider id `edge` is normalized to `microsoft` and `openclaw doctor --fix` rewrites persisted config; new configs should always use `microsoft`.

Configuration

TTS config lives under `messages.tts` in `~/.openclaw/openclaw.json`. Pick a preset and adapt the provider block:

Azure Speech:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "azure-speech",
      providers: {
        "azure-speech": {
          apiKey: "${AZURE_SPEECH_KEY}",
          region: "eastus",
          voice: "en-US-JennyNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
        },
      },
    },
  },
}
```

ElevenLabs:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      providers: {
        elevenlabs: {
          apiKey: "${ELEVENLABS_API_KEY}",
          model: "eleven_multilingual_v2",
          voiceId: "EXAVITQu4vr4xnSDxMaL",
        },
      },
    },
  },
}
```

Google Gemini:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "google",
      providers: {
        google: {
          apiKey: "${GEMINI_API_KEY}",
          model: "gemini-3.1-flash-tts-preview",
          voiceName: "Kore",
          // Optional natural-language style prompts:
          // audioProfile: "Speak in a calm, podcast-host tone.",
          // speakerName: "Alex",
        },
      },
    },
  },
}
```

Gradium:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "gradium",
      providers: {
        gradium: {
          apiKey: "${GRADIUM_API_KEY}",
          voiceId: "YTpq7expH9539ERJ",
        },
      },
    },
  },
}
```

Inworld:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "inworld",
      providers: {
        inworld: {
          apiKey: "${INWORLD_API_KEY}",
          modelId: "inworld-tts-1.5-max",
          voiceId: "Sarah",
          temperature: 0.7,
        },
      },
    },
  },
}
```

Local CLI:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "say",
          args: ["-o", "{{OutputPath}}", "{{Text}}"],
          outputFormat: "wav",
          timeoutMs: 120000,
        },
      },
    },
  },
}
```

Microsoft:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "en-US-MichelleNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          rate: "+0%",
          pitch: "+0%",
        },
      },
    },
  },
}
```

MiniMax:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "minimax",
      providers: {
        minimax: {
          apiKey: "${MINIMAX_API_KEY}",
          model: "speech-2.8-hd",
          voiceId: "English_expressive_narrator",
          speed: 1.0,
          vol: 1.0,
          pitch: 0,
        },
      },
    },
  },
}
```

OpenAI + ElevenLabs:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      modelOverrides: { enabled: true },
      providers: {
        openai: {
          apiKey: "${OPENAI_API_KEY}",
          model: "gpt-4o-mini-tts",
          voice: "alloy",
        },
        elevenlabs: {
          apiKey: "${ELEVENLABS_API_KEY}",
          model: "eleven_multilingual_v2",
          voiceId: "EXAVITQu4vr4xnSDxMaL",
          voiceSettings: {
            stability: 0.5,
            similarityBoost: 0.75,
            style: 0.0,
            useSpeakerBoost: true,
            speed: 1.0,
          },
          applyTextNormalization: "auto",
          languageCode: "en",
        },
      },
    },
  },
}
```

OpenRouter:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openrouter",
      providers: {
        openrouter: {
          apiKey: "${OPENROUTER_API_KEY}",
          model: "hexgrad/kokoro-82m",
          voice: "af_alloy",
          responseFormat: "mp3",
        },
      },
    },
  },
}
```

Volcengine:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "volcengine",
      providers: {
        volcengine: {
          apiKey: "${VOLCENGINE_TTS_API_KEY}",
          resourceId: "seed-tts-1.0",
          voice: "en_female_anna_mars_bigtts",
        },
      },
    },
  },
}
```

xAI:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "xai",
      providers: {
        xai: {
          apiKey: "${XAI_API_KEY}",
          voiceId: "eve",
          language: "en",
          responseFormat: "mp3",
        },
      },
    },
  },
}
```

Xiaomi MiMo:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "xiaomi",
      providers: {
        xiaomi: {
          apiKey: "${XIAOMI_API_KEY}",
          model: "mimo-v2.5-tts",
          voice: "mimo_default",
          format: "mp3",
        },
      },
    },
  },
}
```

Per-agent voice overrides

Use `agents.list[].tts` when one agent should speak with a different provider, voice, model, persona, or auto-TTS mode. The agent block deep-merges over `messages.tts`, so provider credentials can stay in the global provider config:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      providers: {
        elevenlabs: { apiKey: "${ELEVENLABS_API_KEY}", model: "eleven_multilingual_v2" },
      },
    },
  },
  agents: {
    list: [
      {
        id: "reader",
        tts: {
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL" },
          },
        },
      },
    ],
  },
}
```

To pin a per-agent persona, set `agents.list[].tts.persona` alongside the provider config; it overrides the global `messages.tts.persona` for that agent only.

Precedence order for automatic replies, `/tts audio`, `/tts status`, and the `tts` agent tool, from lowest to highest (later layers override earlier ones):

1. `messages.tts`
2. active `agents.list[].tts`
3. channel override, when the channel supports `channels.<channel>.tts`
4. account override, when the channel passes `channels.<channel>.accounts.<id>.tts`
5. local `/tts` preferences for this host
6. inline `[[tts:...]]` directives when model overrides are enabled

Channel and account overrides use the same shape as `messages.tts` and deep-merge over the earlier layers, so shared provider credentials can stay in `messages.tts` while a channel or bot account changes only voice, model, persona, or auto mode:

```json5
{
  messages: {
    tts: {
      provider: "openai",
      providers: {
        openai: { apiKey: "${OPENAI_API_KEY}", model: "gpt-4o-mini-tts" },
      },
    },
  },
  channels: {
    feishu: {
      accounts: {
        english: {
          tts: {
            providers: {
              openai: { voice: "shimmer" },
            },
          },
        },
      },
    },
  },
}
```
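The layering described above can be sketched as a fold over config layers. This is a minimal illustration, not OpenClaw's implementation: `deepMerge`, `resolveTtsConfig`, and the `Json` types are hypothetical names, and it assumes that later layers win and that arrays and scalars are replaced rather than merged.

```typescript
// Hypothetical sketch: names and merge rules are assumptions, not OpenClaw's API.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isObject(value: Json | undefined): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merge `override` over `base`; scalars and arrays are replaced wholesale.
function deepMerge(base: JsonObject, override: JsonObject): JsonObject {
  const result: JsonObject = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = result[key];
    result[key] = isObject(existing) && isObject(value) ? deepMerge(existing, value) : value;
  }
  return result;
}

// Apply layers in order; missing layers (undefined) are skipped.
function resolveTtsConfig(layers: Array<JsonObject | undefined>): JsonObject {
  return layers.reduce<JsonObject>((acc, layer) => (layer ? deepMerge(acc, layer) : acc), {});
}

const globalTts: JsonObject = {
  provider: "openai",
  providers: { openai: { apiKey: "${OPENAI_API_KEY}", model: "gpt-4o-mini-tts" } },
};
const accountTts: JsonObject = { providers: { openai: { voice: "shimmer" } } };

// Layer order: messages.tts, agent, channel, account.
const effectiveTts = resolveTtsConfig([globalTts, undefined, undefined, accountTts]);
console.log(JSON.stringify(effectiveTts, null, 2));
```

With the Feishu example, the account layer contributes only `voice`, while `apiKey` and `model` come from the global layer.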

Personas

A persona is a stable spoken identity that can be applied deterministically across providers. It can prefer one provider, define provider-neutral prompt intent, and carry provider-specific bindings for voices, models, prompt templates, seeds, and voice settings.

Minimal persona

```json5
{
  messages: {
    tts: {
      auto: "always",
      persona: "narrator",
      personas: {
        narrator: {
          label: "Narrator",
          provider: "elevenlabs",
          providers: {
            elevenlabs: { voiceId: "EXAVITQu4vr4xnSDxMaL", modelId: "eleven_multilingual_v2" },
          },
        },
      },
    },
  },
}
```

Full persona (provider-neutral prompt)

```json5
{
  messages: {
    tts: {
      auto: "always",
      persona: "alfred",
      personas: {
        alfred: {
          label: "Alfred",
          description: "Dry, warm British butler narrator.",
          provider: "google",
          fallbackPolicy: "preserve-persona",
          prompt: {
            profile: "A brilliant British butler. Dry, witty, warm, charming, emotionally expressive, never generic.",
            scene: "A quiet late-night study. Close-mic narration for a trusted operator.",
            sampleContext: "The speaker is answering a private technical request with concise confidence and dry warmth.",
            style: "Refined, understated, lightly amused.",
            accent: "British English.",
            pacing: "Measured, with short dramatic pauses.",
            constraints: ["Do not read configuration values aloud.", "Do not explain the persona."],
          },
          providers: {
            google: {
              model: "gemini-3.1-flash-tts-preview",
              voiceName: "Algieba",
              promptTemplate: "audio-profile-v1",
            },
            openai: { model: "gpt-4o-mini-tts", voice: "cedar" },
            elevenlabs: {
              voiceId: "voice_id",
              modelId: "eleven_multilingual_v2",
              seed: 42,
              voiceSettings: {
                stability: 0.65,
                similarityBoost: 0.8,
                style: 0.25,
                useSpeakerBoost: true,
                speed: 0.95,
              },
            },
          },
        },
      },
    },
  },
}
```

Persona resolution

The active persona is selected deterministically:

1. `/tts persona <id>` local preference, if set.
2. `messages.tts.persona`, if set.
3. No persona.

Provider selection runs explicit-first:

1. Direct overrides (CLI, gateway, Talk, allowed TTS directives).
2. `/tts provider <id>` local preference.
3. Active persona's `provider`.
4. `messages.tts.provider`.
5. Registry auto-select.

For each provider attempt, OpenClaw merges configs in this order:

1. `messages.tts.providers.<id>`
2. `messages.tts.personas.<persona>.providers.<id>`
3. Trusted request overrides
4. Allowed model-emitted TTS directive overrides
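The two selection orders above amount to "first defined value wins". The sketch below illustrates that idea only; `ProviderSelectionInput`, `selectPrimaryProvider`, and `selectPersona` are hypothetical names, and `autoSelectOrder` is assumed to already contain only configured providers.

```typescript
// Hypothetical sketch; field and function names are illustrative assumptions.
interface ProviderSelectionInput {
  directOverride?: string;     // CLI, gateway, Talk, or an allowed TTS directive
  localPreference?: string;    // written by `/tts provider <id>`
  personaProvider?: string;    // the active persona's `provider`
  configuredProvider?: string; // `messages.tts.provider`
  autoSelectOrder: string[];   // registry auto-select order (assumed pre-filtered)
}

// Explicit-first: the first defined candidate is the primary provider.
function selectPrimaryProvider(input: ProviderSelectionInput): string | undefined {
  return (
    input.directOverride ??
    input.localPreference ??
    input.personaProvider ??
    input.configuredProvider ??
    input.autoSelectOrder[0]
  );
}

// Local preference beats config; undefined means "no persona".
function selectPersona(localPreference?: string, configuredPersona?: string): string | undefined {
  return localPreference ?? configuredPersona;
}

const primaryProvider = selectPrimaryProvider({
  personaProvider: "google",
  configuredProvider: "elevenlabs",
  autoSelectOrder: ["openai"],
});
console.log(primaryProvider); // "google"
console.log(selectPersona(undefined, "alfred")); // "alfred"
```

The per-provider config merge listed last follows the same layering idea: each later source is merged over the earlier ones for the provider being attempted.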

How providers use persona prompts

Persona prompt fields (`profile`, `scene`, `sampleContext`, `style`, `accent`, `pacing`, `constraints`) are provider-neutral. Each provider decides how to use them:

- **Google Gemini:** Wraps persona prompt fields in a Gemini TTS prompt structure **only when** the effective Google provider config sets `promptTemplate: "audio-profile-v1"` or `personaPrompt`. The older `audioProfile` and `speakerName` fields are still prepended as Google-specific prompt text. Inline audio tags such as `[whispers]` or `[laughs]` inside a `tts:text` block are preserved inside the Gemini transcript; OpenClaw does not generate these tags.
- **OpenAI:** Maps persona prompt fields to the request `instructions` field **only when** no explicit OpenAI `instructions` is configured. Explicit `instructions` always wins.
- **Other providers:** Use only the provider-specific persona bindings under `personas.<persona>.providers.<id>`. Persona prompt fields are ignored unless the provider implements its own persona-prompt mapping.

Fallback policy

`fallbackPolicy` controls behavior when a persona has no binding for the attempted provider:

| Policy | Behavior |
| --- | --- |
| `preserve-persona` | Default. Provider-neutral prompt fields stay available; the provider may use them or ignore them. |
| `provider-defaults` | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. |
| `fail` | Skip that provider attempt with `reasonCode: "not_configured"` and `personaBinding: "missing"`. Fallback providers are still tried. |

The whole TTS request only fails when every attempted provider is skipped or fails.
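The three policies can be read as a per-attempt decision. The following is a rough sketch under stated assumptions: the `Persona` shape, `AttemptPlan`, and `planAttempt` are hypothetical and only mirror the table above; the real attempt pipeline is not shown in this document.

```typescript
// Hypothetical sketch of the policy table; types and names are assumptions.
type FallbackPolicy = "preserve-persona" | "provider-defaults" | "fail";
type Fields = Record<string, unknown>;

interface Persona {
  fallbackPolicy?: FallbackPolicy;
  prompt?: Fields;                    // provider-neutral prompt fields
  providers: Record<string, Fields>;  // provider-specific bindings
}

type AttemptPlan =
  | { action: "synthesize"; binding: Fields | undefined; prompt: Fields | undefined }
  | { action: "skip"; reasonCode: "not_configured"; personaBinding: "missing" };

function planAttempt(persona: Persona, providerId: string): AttemptPlan {
  const binding = persona.providers[providerId];
  if (binding !== undefined) {
    // A binding exists, so the fallback policy does not apply.
    return { action: "synthesize", binding, prompt: persona.prompt };
  }
  const policy = persona.fallbackPolicy ?? "preserve-persona";
  if (policy === "fail") {
    return { action: "skip", reasonCode: "not_configured", personaBinding: "missing" };
  }
  if (policy === "provider-defaults") {
    // Persona is left out of prompt preparation for this attempt.
    return { action: "synthesize", binding: undefined, prompt: undefined };
  }
  // preserve-persona: neutral prompt fields stay available to the provider.
  return { action: "synthesize", binding: undefined, prompt: persona.prompt };
}

const strictPersona: Persona = {
  fallbackPolicy: "fail",
  prompt: { style: "Refined, understated, lightly amused." },
  providers: { google: { voiceName: "Algieba" } },
};
console.log(planAttempt(strictPersona, "google").action);    // "synthesize"
console.log(planAttempt(strictPersona, "microsoft").action); // "skip"
```

A skipped attempt does not end the request; the caller would move on to the next provider in the fallback chain.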

Talk session provider selection is session-scoped. A Talk client should choose provider ids, model ids, voice ids, and locales from talk.catalog and pass them through the Talk session or handoff request. Opening a voice session should not mutate messages.tts or global Talk provider defaults.

Model-driven directives

By default, the assistant can emit `[[tts:...]]` directives to override voice, model, or speed for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block for expressive cues that should appear in audio only:

```text
Here you go.

[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
```

When `messages.tts.auto` is `"tagged"`, directives are required to trigger audio. Streaming block delivery strips directives from visible text before the channel sees them, even when split across adjacent blocks.

`provider=...` is ignored unless `modelOverrides.allowProvider: true`. When a reply declares `provider=...`, the other keys in that directive are parsed only by that provider; unsupported keys are stripped and reported as TTS directive warnings.

Available directive keys:

- `provider` (registered provider id; requires `allowProvider: true`)
- `voice` / `voiceName` / `voice_name` / `google_voice` / `voiceId`
- `model` / `google_model`
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0–10)
- `pitch` (MiniMax integer pitch, −12 to 12; fractional values are truncated)
- `emotion` (Volcengine emotion tag)
- `applyTextNormalization` (`auto|on|off`)
- `languageCode` (ISO 639-1)
- `seed`
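To make the directive syntax concrete, here is a minimal parsing sketch. It is not OpenClaw's parser: `parseTtsDirectives` and `ParsedDirectives` are hypothetical, it handles a complete reply only (not directives split across streamed blocks), and it does no per-provider key validation.

```typescript
// Hypothetical sketch; the real parser also validates keys per provider.
interface ParsedDirectives {
  visibleText: string;               // text left for the channel
  spokenText: string | undefined;    // contents of [[tts:text]]...[[/tts:text]], if present
  overrides: Record<string, string>; // raw key=value pairs, unvalidated
}

function parseTtsDirectives(reply: string): ParsedDirectives {
  const overrides: Record<string, string> = {};
  let spokenText: string | undefined;

  // Extract the audio-only block first so it is not read as key=value pairs.
  let text = reply.replace(/\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g, (_match, body: string) => {
    spokenText = body.trim();
    return "";
  });

  // Collect key=value pairs from the remaining [[tts:...]] directives.
  text = text.replace(/\[\[tts:([^\]]*)\]\]/g, (_match, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq > 0) overrides[pair.slice(0, eq)] = pair.slice(eq + 1);
    }
    return "";
  });

  return { visibleText: text.trim(), spokenText, overrides };
}

const sampleReply = [
  "Here you go.",
  "",
  "[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]",
  "[[tts:text]](laughs) Read the song once more.[[/tts:text]]",
].join("\n");

const parsedReply = parseTtsDirectives(sampleReply);
console.log(parsedReply.visibleText); // "Here you go."
console.log(parsedReply.overrides);
```

Values stay as strings here; converting `speed` or `seed` to numbers and rejecting unsupported keys would be a later, provider-aware step.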

Disable model overrides entirely:

```json5
{ messages: { tts: { modelOverrides: { enabled: false } } } }
```

Allow provider switching while keeping other knobs configurable:

```json5
{ messages: { tts: { modelOverrides: { enabled: true, allowProvider: true, allowSeed: false } } } }
```
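One way to picture how these flags gate model-emitted overrides is a simple filter. This is a sketch under assumptions: `applyOverridePolicy` is a hypothetical helper, and the reading that `allowSeed` gates the `seed` key is inferred from the flag name rather than stated in this document.

```typescript
// Hypothetical sketch; the gating of `seed` by `allowSeed` is an assumption.
interface ModelOverridePolicy {
  enabled: boolean;
  allowProvider: boolean;
  allowSeed: boolean;
}

function applyOverridePolicy(
  overrides: Record<string, string>,
  policy: ModelOverridePolicy,
): Record<string, string> {
  if (!policy.enabled) return {}; // all model-emitted overrides are dropped
  const allowed: Record<string, string> = {};
  for (const [key, value] of Object.entries(overrides)) {
    if (key === "provider" && !policy.allowProvider) continue;
    if (key === "seed" && !policy.allowSeed) continue;
    allowed[key] = value;
  }
  return allowed;
}

const filteredOverrides = applyOverridePolicy(
  { provider: "openai", seed: "42", speed: "1.1" },
  { enabled: true, allowProvider: true, allowSeed: false },
);
console.log(filteredOverrides); // { provider: "openai", speed: "1.1" }
```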

Slash commands

There is a single command, `/tts`. On Discord, OpenClaw also registers `/voice` because `/tts` is a built-in Discord command; the text form `/tts ...` still works.

```text
/tts off | on | status
/tts chat on | off | default
/tts latest
/tts provider <id>
/tts persona <id> | off
/tts limit <chars>
/tts summary off
/tts audio <text>
```

Commands require an authorized sender (allowlist/owner rules apply) and either `commands.text` or native command registration must be enabled.

Behavior notes:

- `/tts on` writes the local TTS preference to `always`; `/tts off` writes it to `off`.
- `/tts chat on|off|default` writes a session-scoped auto-TTS override for the current chat.
- `/tts persona <id>` writes the local persona preference; `/tts persona off` clears it.
- `/tts latest` reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
- `limit` and `summary` are stored in local prefs, not the main config.
- `/tts status` includes fallback diagnostics for the latest attempt: `Fallback: <primary> -> <used>`, `Attempts: ...`, and per-attempt detail (`provider:outcome(reasonCode) latency`).
- `/status` shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.
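The duplicate suppression described for `/tts latest` can be sketched as a compare-and-store on a reply hash. Everything here is illustrative: `SessionEntry`, `lastTtsReplyHash`, and `shouldSendLatest` are hypothetical names, and FNV-1a is only a dependency-free stand-in for whatever hash the runtime actually uses.

```typescript
// Hypothetical sketch; field names and the hash function are assumptions.
interface SessionEntry {
  lastTtsReplyHash?: string;
}

// FNV-1a (32-bit) as a simple stable hash; a real system may use a cryptographic hash.
function hashReply(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Returns true when the reply should be sent as audio, and records its hash.
function shouldSendLatest(session: SessionEntry, latestReply: string): boolean {
  const hash = hashReply(latestReply);
  if (session.lastTtsReplyHash === hash) return false; // already voiced once
  session.lastTtsReplyHash = hash;
  return true;
}

const chatSession: SessionEntry = {};
console.log(shouldSendLatest(chatSession, "Build finished.")); // true
console.log(shouldSendLatest(chatSession, "Build finished.")); // false
```

Storing only the hash keeps the reply text itself out of the session entry while still letting a repeated `/tts latest` be recognized.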

Per-user preferences

Slash commands write local overrides to SQLite plugin state by default. Legacy ~/.openclaw/settings/tts.json is imported by openclaw doctor --fix; runtime TTS prefs no longer write JSON files.

| Stored field | Effect |
| --- | --- |
| `auto` | Local auto-TTS override (`always`, `off`, …) |
| `provider` | Local primary provider override |
| `persona` | Local persona override |
| `maxLength` | Summary threshold (default `1500` chars) |
| `summarize` | Summary toggle (default `true`) |

These override the effective config from messages.tts plus the active agents.list[].tts block for that host.
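As a rough illustration of how `maxLength` and `summarize` interact, the sketch below applies the documented defaults. `decideLengthHandling` and its return values are hypothetical; in particular, what happens to over-limit text when summaries are off is not specified in this section, so it is reported as `"over-limit"` rather than guessed, and treating the threshold as inclusive is an assumption.

```typescript
// Hypothetical sketch; only the two defaults come from the table above.
interface TtsLengthPrefs {
  maxLength?: number;  // summary threshold, default 1500 chars
  summarize?: boolean; // summary toggle, default true
}

type LengthDecision = "speak" | "summarize-then-speak" | "over-limit";

function decideLengthHandling(text: string, prefs: TtsLengthPrefs = {}): LengthDecision {
  const maxLength = prefs.maxLength ?? 1500;
  const summarize = prefs.summarize ?? true;
  if (text.length <= maxLength) return "speak";
  return summarize ? "summarize-then-speak" : "over-limit";
}

console.log(decideLengthHandling("Short reply."));                             // "speak"
console.log(decideLengthHandling("x".repeat(2000)));                           // "summarize-then-speak"
console.log(decideLengthHandling("x".repeat(2000), { summarize: false }));     // "over-limit"
```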

Output formats (fixed)

TTS voice delivery is channel-capability driven. Channel plugins advertise whether voice-style TTS should ask providers for a native voice-note target or keep normal audio-file synthesis and only mark compatible output for voice delivery.

Voice-note capable channels: voice-note replies prefer Opus (opus_48000_64 from ElevenLabs, opus from OpenAI).
- 48kHz / 64kbps is a good voice message tradeoff.
Feishu / WhatsApp: when a voice-note reply is produced as MP3/WebM/WAV/M4A or another likely audio file, the channel plugin transcodes it to 48kHz Ogg/Opus with ffmpeg before sending the native voice message. WhatsApp sends the result through the Baileys audio payload with ptt: true and audio/ogg; codecs=opus. If conversion fails, Feishu receives the original file as an attachment; WhatsApp send fails rather than posting an incompatible PTT payload.
Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
MiniMax: MP3 (speech-2.8-hd model, 32kHz sample rate) for normal audio attachments. For channel-advertised voice-note targets, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with ffmpeg before delivery when the channel advertises transcoding.
Xiaomi MiMo: MP3 by default, or WAV when configured. For channel-advertised voice-note targets, OpenClaw transcodes Xiaomi output to 48kHz Opus with ffmpeg before delivery when the channel advertises transcoding.
Local CLI: uses the configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM with ffmpeg.
Google Gemini: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
Gradium: WAV for audio attachments, Opus for voice-note targets, and ulaw_8000 at 8 kHz for telephony.
Inworld: MP3 for normal audio attachments, native OGG_OPUS for voice-note targets, and raw PCM at 22050 Hz for Talk/telephony.
xAI: MP3 by default; responseFormat may be mp3, wav, pcm, mulaw, or alaw. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
Microsoft: uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3).
- The bundled transport accepts an outputFormat, but not all formats are available from the service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
- Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages.
- If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs output formats are fixed per channel (see above).
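
The Feishu / WhatsApp transcoding step described above can be sketched as an ffmpeg argument list. This is an illustrative sketch, not the channel plugins' actual invocation: `buildOpusTranscodeArgs` is a hypothetical helper, and the 64 kbps bitrate is an assumption borrowed from the provider-side Opus preset (the text only fixes 48 kHz Ogg/Opus as the target).

```typescript
// Hypothetical options for a voice-note transcode.
interface OpusTranscodeOptions {
  inputPath: string;
  outputPath: string;
  bitrateKbps?: number; // assumption: defaults to 64 kbps
}

// Builds ffmpeg arguments for a 48 kHz Ogg/Opus voice note.
function buildOpusTranscodeArgs(opts: OpusTranscodeOptions): string[] {
  const bitrate = opts.bitrateKbps ?? 64;
  return [
    "-y",                  // overwrite the output file if it exists
    "-i", opts.inputPath,  // source audio (MP3, WebM, WAV, M4A, ...)
    "-vn",                 // drop any video or cover-art stream
    "-c:a", "libopus",     // encode with the Opus codec
    "-ar", "48000",        // resample to 48 kHz
    "-b:a", `${bitrate}k`, // target audio bitrate
    "-f", "ogg",           // write an Ogg container
    opts.outputPath,
  ];
}
```

The returned array can be handed to any process spawner with `ffmpeg` as the executable. A non-zero exit would then map to the failure handling described above: Feishu attaches the original file, and the WhatsApp send fails.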

Auto-TTS behavior

When messages.tts.auto is enabled, OpenClaw:

Skips TTS if the reply already contains media or a MEDIA: directive.
Skips very short replies (under 10 chars).
Summarizes long replies when summaries are enabled, using summaryModel (or agents.defaults.model.primary).
Attaches the generated audio to the reply.
In mode: "final", still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments.

If the reply exceeds maxLength and summaries are off (or the summary model has no API key), audio is skipped and the normal text reply is sent.

Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize -> TTS -> attach audio
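
The decision flow above can be written as a small pure function. This is a sketch under assumptions: `AutoTtsInput` and `decideAutoTts` are hypothetical names, length is measured as raw string length, and the real implementation may check these conditions in a different order.

```typescript
type AutoTtsAction = "send-text" | "tts" | "summarize-then-tts";

// Hypothetical input shape for the decision.
interface AutoTtsInput {
  ttsEnabled: boolean;
  text: string;
  hasMedia: boolean;              // reply already has media or a MEDIA: directive
  maxLength: number;              // summary threshold (1500 by default)
  summarize: boolean;             // summary toggle
  summaryModelAvailable: boolean; // false when the summary model has no API key
}

const MIN_TTS_CHARS = 10; // replies under 10 chars are skipped

function decideAutoTts(input: AutoTtsInput): AutoTtsAction {
  if (!input.ttsEnabled) return "send-text";
  if (input.hasMedia || input.text.length < MIN_TTS_CHARS) return "send-text";
  if (input.text.length <= input.maxLength) return "tts";
  // Over the limit: speak only if a summary can be produced.
  if (!input.summarize || !input.summaryModelAvailable) return "send-text";
  return "summarize-then-tts";
}
```

Keeping the decision free of I/O makes each branch of the diagram testable without a provider or a channel.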

Output formats by channel

Target	Format
Feishu / Matrix / Telegram / WhatsApp	Voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). 48 kHz / 64 kbps balances clarity and size.
Other channels	MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). 44.1 kHz / 128 kbps default for speech.
Talk / telephony	Provider-native PCM (Inworld 22050 Hz, Google 24 kHz), or `ulaw_8000` from Gradium for telephony.

Per-provider notes:

Feishu / WhatsApp transcoding: When a voice-note reply lands as MP3/WebM/WAV/M4A, the channel plugin transcodes to 48 kHz Ogg/Opus with ffmpeg. WhatsApp sends through Baileys with ptt: true and audio/ogg; codecs=opus. If conversion fails: Feishu falls back to attaching the original file; WhatsApp send fails rather than posting an incompatible PTT payload.
MiniMax / Xiaomi MiMo: Default MP3 (32 kHz for MiniMax speech-2.8-hd); transcoded to 48 kHz Opus for voice-note targets via ffmpeg.
Local CLI: Uses configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM.
Google Gemini: Returns raw 24 kHz PCM. OpenClaw wraps as WAV for attachments, transcodes to 48 kHz Opus for voice-note targets, returns PCM directly for Talk/telephony.
Inworld: MP3 attachments, native OGG_OPUS voice-note, raw PCM 22050 Hz for Talk/telephony.
xAI: MP3 by default; responseFormat may be mp3|wav|pcm|mulaw|alaw. Uses xAI's batch REST endpoint — streaming WebSocket TTS is not used. Native Opus voice-note format is not supported.
Microsoft: Uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3). Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3.

OpenAI and ElevenLabs output formats are fixed per channel as listed above.
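
For the two fixed-format providers, the table above reduces to a lookup. This is a sketch under assumptions: `pickOutputFormat`, the lowercase channel ids, and the hardcoded channel set are hypothetical. In OpenClaw the voice-note decision comes from what each channel plugin advertises, not from a static list.

```typescript
type TtsTarget = "voice-note" | "audio-file";
type FixedProvider = "openai" | "elevenlabs";

// Format ids taken from the table above.
const FIXED_FORMATS: Record<FixedProvider, Record<TtsTarget, string>> = {
  elevenlabs: { "voice-note": "opus_48000_64", "audio-file": "mp3_44100_128" },
  openai: { "voice-note": "opus", "audio-file": "mp3" },
};

// Assumption: a static stand-in for channel-advertised capabilities.
const VOICE_NOTE_CHANNELS = new Set(["feishu", "matrix", "telegram", "whatsapp"]);

function pickOutputFormat(provider: FixedProvider, channel: string): string {
  const target: TtsTarget = VOICE_NOTE_CHANNELS.has(channel.toLowerCase())
    ? "voice-note"
    : "audio-file";
  return FIXED_FORMATS[provider][target];
}
```

For example, `pickOutputFormat("elevenlabs", "telegram")` selects the Opus preset, while any channel outside the set falls back to MP3.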

Field reference

Auto-TTS mode. `inbound` only sends audio after an inbound voice message; `tagged` only sends audio when the reply includes `tts:...` directives or a `tts:text` block.
Legacy toggle. `openclaw doctor --fix` migrates this to `auto`.
`"all"` includes tool/block replies in addition to final replies.
Speech provider id. When unset, OpenClaw uses the first configured provider in registry auto-select order. Legacy `provider: "edge"` is rewritten to `"microsoft"` by `openclaw doctor --fix`.
Active persona id from `personas`. Normalized to lowercase.
Stable spoken identity. Fields: `label`, `description`, `provider`, `fallbackPolicy`, `prompt`, `providers.`. See [Personas](#personas).
Cheap model for auto-summary; defaults to `agents.defaults.model.primary`. Accepts `provider/model` or a configured model alias.
Allow the model to emit TTS directives. `enabled` defaults to `true`; `allowProvider` defaults to `false`.
Provider-owned settings keyed by speech provider id. Legacy direct blocks (`messages.tts.openai`, `.elevenlabs`, `.microsoft`, `.edge`) are rewritten by `openclaw doctor --fix`; commit only `messages.tts.providers.`.
Hard cap for TTS input characters. `/tts audio` fails if exceeded.
Request timeout in milliseconds.

Env: `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`.
Azure Speech region (e.g. `eastus`). Env: `AZURE_SPEECH_REGION` or `SPEECH_REGION`.
Optional Azure Speech endpoint override (alias `baseUrl`).
Azure voice ShortName. Default `en-US-JennyNeural`.
SSML language code. Default `en-US`.
Azure `X-Microsoft-OutputFormat` for standard audio. Default `audio-24khz-48kbitrate-mono-mp3`.
Azure `X-Microsoft-OutputFormat` for voice-note output. Default `ogg-24khz-16bit-mono-opus`.

Falls back to `ELEVENLABS_API_KEY` or `XI_API_KEY`.
Model id (e.g. `eleven_multilingual_v2`, `eleven_v3`).
ElevenLabs voice id.
`stability`, `similarityBoost`, `style` (each `0..1`), `useSpeakerBoost` (`true|false`), `speed` (`0.5..2.0`, `1.0` = normal).
Text normalization mode.
2-letter ISO 639-1 (e.g. `en`, `de`).
Integer `0..4294967295` for best-effort determinism.
Override ElevenLabs API base URL.

Falls back to `GEMINI_API_KEY` / `GOOGLE_API_KEY`. If omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
Gemini TTS model. Default `gemini-3.1-flash-tts-preview`.
Gemini prebuilt voice name. Default `Kore`. Alias: `voice`.
Natural-language style prompt prepended before spoken text.
Optional speaker label prepended before spoken text when your prompt uses a named speaker.
Set to `audio-profile-v1` to wrap active persona prompt fields in a deterministic Gemini TTS prompt structure.
Google-specific extra persona prompt text appended to the template's Director's Notes.
Only `https://generativelanguage.googleapis.com` is accepted.

Env: `GRADIUM_API_KEY`.
Default `https://api.gradium.ai`.
Default Emma (`YTpq7expH9539ERJ`).

### Inworld primary

<ParamField path="apiKey" type="string">Env: `INWORLD_API_KEY`.</ParamField>
<ParamField path="baseUrl" type="string">Default `https://api.inworld.ai`.</ParamField>
<ParamField path="modelId" type="string">Default `inworld-tts-1.5-max`. Also: `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.</ParamField>
<ParamField path="voiceId" type="string">Default `Sarah`.</ParamField>
<ParamField path="temperature" type="number">Sampling temperature `0..2`.</ParamField>

Local executable or command string for CLI TTS.
Command arguments. Supports `{{Text}}`, `{{OutputPath}}`, `{{OutputDir}}`, `{{OutputBase}}` placeholders.
Expected CLI output format. Default `mp3` for audio attachments.
Command timeout in milliseconds. Default `120000`.
Optional command working directory.
Optional environment overrides for the command.

Allow Microsoft speech usage.
Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
Language code (e.g. `en-US`).
Microsoft output format. Default `audio-24khz-48kbitrate-mono-mp3`. Not all formats are supported by the bundled Edge-backed transport.
Percent strings (e.g. `+10%`, `-5%`).
Write JSON subtitles alongside the audio file.
Proxy URL for Microsoft speech requests.
Request timeout override (ms).
Legacy alias. Run `openclaw doctor --fix` to rewrite persisted config to `providers.microsoft`.

Falls back to `MINIMAX_API_KEY`. Token Plan auth via `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or `MINIMAX_CODING_API_KEY`.
Default `https://api.minimax.io`. Env: `MINIMAX_API_HOST`.
Default `speech-2.8-hd`. Env: `MINIMAX_TTS_MODEL`.
Default `English_expressive_narrator`. Env: `MINIMAX_TTS_VOICE_ID`.
`0.5..2.0`. Default `1.0`.
`(0, 10]`. Default `1.0`.
Integer `-12..12`. Default `0`. Fractional values are truncated before the request.

Falls back to `OPENAI_API_KEY`.
OpenAI TTS model id (e.g. `gpt-4o-mini-tts`).
Voice name (e.g. `alloy`, `cedar`).
Explicit OpenAI `instructions` field. When set, persona prompt fields are **not** auto-mapped.
Extra JSON fields merged into `/audio/speech` request bodies after generated OpenAI TTS fields. Use this for OpenAI-compatible endpoints such as Kokoro that require provider-specific keys like `lang`; unsafe prototype keys are ignored.
Override the OpenAI TTS endpoint. Resolution order: config → `OPENAI_TTS_BASE_URL` → `https://api.openai.com/v1`. Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.

Env: `OPENROUTER_API_KEY`. Can reuse `models.providers.openrouter.apiKey`.
Default `https://openrouter.ai/api/v1`. Legacy `https://openrouter.ai/v1` is normalized.
Default `hexgrad/kokoro-82m`. Alias: `modelId`.
Default `af_alloy`. Alias: `voiceId`.
Default `mp3`.
Provider-native speed override.

Env: `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`.
Default `seed-tts-1.0`. Env: `VOLCENGINE_TTS_RESOURCE_ID`. Use `seed-tts-2.0` when your project has TTS 2.0 entitlement.
App key header. Default `aGjiRDfUWi`. Env: `VOLCENGINE_TTS_APP_KEY`.
Override the Seed Speech TTS HTTP endpoint. Env: `VOLCENGINE_TTS_BASE_URL`.
Voice type. Default `en_female_anna_mars_bigtts`. Env: `VOLCENGINE_TTS_VOICE`.
Provider-native speed ratio.
Provider-native emotion tag.
Legacy Volcengine Speech Console fields. Env: `VOLCENGINE_TTS_APPID`, `VOLCENGINE_TTS_TOKEN`, `VOLCENGINE_TTS_CLUSTER` (default `volcano_tts`).

Env: `XAI_API_KEY`.
Default `https://api.x.ai/v1`. Env: `XAI_BASE_URL`.
Default `eve`. Live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`.
BCP-47 language code or `auto`. Default `en`.
Default `mp3`.
Provider-native speed override.

Env: `XIAOMI_API_KEY`.
Default `https://api.xiaomimimo.com/v1`. Env: `XIAOMI_BASE_URL`.
Default `mimo-v2.5-tts`. Env: `XIAOMI_TTS_MODEL`. Also supports `mimo-v2-tts`.
Default `mimo_default`. Env: `XIAOMI_TTS_VOICE`.
Default `mp3`. Env: `XIAOMI_TTS_FORMAT`.
Optional natural-language style instruction sent as the user message; not spoken.

Agent tool

The tts tool converts text to speech and returns an audio attachment for reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is delivered as a voice message rather than a file attachment. Feishu and WhatsApp can transcode non-Opus TTS output on this path when ffmpeg is available.

WhatsApp sends audio through Baileys as a PTT voice note (audio with ptt: true) and sends visible text separately from PTT audio because clients do not consistently render captions on voice notes.

The tool accepts optional channel and timeoutMs fields; timeoutMs is a per-call provider request timeout in milliseconds.

Gateway RPC

Method	Purpose
`tts.status`	Read current TTS state and last attempt.
`tts.enable`	Set local auto preference to `always`.
`tts.disable`	Set local auto preference to `off`.
`tts.convert`	One-off text → audio.
`tts.setProvider`	Set local provider preference.
`tts.setPersona`	Set local persona preference.
`tts.providers`	List configured providers and status.
