feat(tts): add BytePlus Seed Speech provider

Add Volcengine/BytePlus Seed Speech as a bundled TTS provider with current API-key auth, legacy AppID/token fallback, native Ogg/Opus voice-note output, and MP3 audio-file output.

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-06 16:50:43 +00:00 · 2026-04-25 23:34:57 +01:00
parent b1b29a8fc2
commit 1531123d35
10 changed files with 937 additions and 16 deletions
--- a/docs/.i18n/glossary.zh-CN.json
+++ b/docs/.i18n/glossary.zh-CN.json
@@ -111,6 +111,10 @@
    "source": "BytePlus (International)",
    "target": "BytePlus（国际版）"
  },
+  {
+    "source": "Volcengine TTS HTTP API",
+    "target": "Volcengine TTS HTTP API"
+  },
  {
    "source": "Amazon Bedrock Mantle",
    "target": "Amazon Bedrock Mantle"
--- a/docs/providers/volcengine.md
+++ b/docs/providers/volcengine.md
@@ -1,20 +1,23 @@
 ---
-summary: "Volcano Engine setup (Doubao models, general + coding endpoints)"
+summary: "Volcano Engine setup (Doubao models, coding endpoints, and Seed Speech TTS)"
 title: "Volcengine (Doubao)"
 read_when:
  - You want to use Volcano Engine or Doubao models with OpenClaw
  - You need the Volcengine API key setup
+  - You want to use Volcengine Speech text-to-speech
 ---

 The Volcengine provider gives access to Doubao models and third-party models
 hosted on Volcano Engine, with separate endpoints for general and coding
-workloads.
+workloads. The same bundled plugin can also register Volcengine Speech as a TTS
+provider.

-| Detail    | Value                                               |
-| --------- | --------------------------------------------------- |
-| Providers | `volcengine` (general) + `volcengine-plan` (coding) |
-| Auth      | `VOLCANO_ENGINE_API_KEY`                            |
-| API       | OpenAI-compatible                                   |
+| Detail     | Value                                                      |
+| ---------- | ---------------------------------------------------------- |
+| Providers  | `volcengine` (general + TTS) + `volcengine-plan` (coding)  |
+| Model auth | `VOLCANO_ENGINE_API_KEY`                                   |
+| TTS auth   | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` |
+| API        | OpenAI-compatible models, BytePlus Seed Speech TTS         |

 ## Getting started

@@ -95,6 +98,59 @@ Both providers are configured from a single API key. Setup registers both automa
  </Tab>
 </Tabs>

+## Text-to-speech
+
+Volcengine TTS uses the BytePlus Seed Speech HTTP API and takes its own API key,
+separate from the OpenAI-compatible Doubao model API key. In the BytePlus
+console, open Seed Speech > Settings > API Keys, copy the API key, then set:
+
+```bash
+export VOLCENGINE_TTS_API_KEY="byteplus_seed_speech_api_key"
+export VOLCENGINE_TTS_RESOURCE_ID="seed-tts-1.0"
+```
+
+Then enable it in `openclaw.json`:
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "volcengine",
+      providers: {
+        volcengine: {
+          apiKey: "byteplus_seed_speech_api_key",
+          voice: "en_female_anna_mars_bigtts",
+          speedRatio: 1.0,
+        },
+      },
+    },
+  },
+}
+```
+
+For voice-note targets, OpenClaw asks Volcengine for provider-native
+`ogg_opus`. For normal audio attachments, it asks for `mp3`. Provider aliases
+`bytedance` and `doubao` also resolve to the same speech provider.
+
+The default resource id is `seed-tts-1.0` because that is what BytePlus grants
+to newly created Seed Speech API keys in the default project. If your project
+has TTS 2.0 entitlement, set `VOLCENGINE_TTS_RESOURCE_ID=seed-tts-2.0`.
+
+<Warning>
+`VOLCANO_ENGINE_API_KEY` is for the ModelArk/Doubao model endpoints and is not a
+Seed Speech API key. TTS needs a Seed Speech API key from the BytePlus Speech
+Console, or a legacy Speech Console AppID/token pair.
+</Warning>
+
+Legacy AppID/token auth remains supported for older Speech Console applications:
+
+```bash
+export VOLCENGINE_TTS_APPID="speech_app_id"
+export VOLCENGINE_TTS_TOKEN="speech_access_token"
+export VOLCENGINE_TTS_CLUSTER="volcano_tts"
+```
+
 ## Advanced configuration

 <AccordionGroup>
@@ -112,8 +168,10 @@ Both providers are configured from a single API key. Setup registers both automa
  </Accordion>

  <Accordion title="Environment variables for daemon processes">
-    If the Gateway runs as a daemon (launchd/systemd), make sure
-    `VOLCANO_ENGINE_API_KEY` is available to that process (for example, in
+    If the Gateway runs as a daemon (launchd/systemd), make sure model and TTS
+    env vars such as `VOLCANO_ENGINE_API_KEY`, `VOLCENGINE_TTS_API_KEY`,
+    `BYTEPLUS_SEED_SPEECH_API_KEY`, `VOLCENGINE_TTS_APPID`, and
+    `VOLCENGINE_TTS_TOKEN` are available to that process (for example, in
    `~/.openclaw/.env` or via `env.shellEnv`).
  </Accordion>
 </AccordionGroup>
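
Reviewer sketch (not part of the patch): the defaults documented in this commit (`seed-tts-1.0`, `en_female_anna_mars_bigtts`, `volcano_tts`) and their env overrides could be resolved roughly as below. The helper names and types are hypothetical, and the config-before-env precedence is an assumption carried over from the documented `apiKey` order; the bundled plugin may differ.

```typescript
// Illustrative sketch only; names here are hypothetical, not the plugin's code.
type EnvVars = Record<string, string | undefined>;

interface SeedSpeechOverrides {
  resourceId?: string;
  voice?: string;
  cluster?: string;
}

interface SeedSpeechSettings {
  resourceId: string;
  voice: string;
  cluster: string;
}

// Default values as stated in the docs added by this commit.
const SEED_SPEECH_DEFAULTS: SeedSpeechSettings = {
  resourceId: "seed-tts-1.0",
  voice: "en_female_anna_mars_bigtts",
  cluster: "volcano_tts",
};

// Returns the first value that is non-empty after trimming.
function firstNonEmpty(...values: Array<string | undefined>): string | undefined {
  for (const value of values) {
    const trimmed = value?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

// Assumed precedence: explicit config, then env var, then documented default.
function resolveSeedSpeechSettings(
  config: SeedSpeechOverrides,
  env: EnvVars,
): SeedSpeechSettings {
  return {
    resourceId:
      firstNonEmpty(config.resourceId, env["VOLCENGINE_TTS_RESOURCE_ID"]) ??
      SEED_SPEECH_DEFAULTS.resourceId,
    voice:
      firstNonEmpty(config.voice, env["VOLCENGINE_TTS_VOICE"]) ??
      SEED_SPEECH_DEFAULTS.voice,
    cluster:
      firstNonEmpty(config.cluster, env["VOLCENGINE_TTS_CLUSTER"]) ??
      SEED_SPEECH_DEFAULTS.cluster,
  };
}
```

In a Node process this would be called as `resolveSeedSpeechSettings(cfg, process.env)`; passing the environment in as a parameter keeps the function easy to test.
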
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -7,7 +7,7 @@ read_when:
 title: "Text-to-speech"
 ---

-OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
 It works anywhere OpenClaw can send audio.

 ## Supported services
@@ -20,6 +20,7 @@ It works anywhere OpenClaw can send audio.
 - **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
 - **MiniMax** (primary or fallback provider; uses the T2A v2 API)
 - **OpenAI** (primary or fallback provider; also used for summaries)
+- **Volcengine** (primary or fallback provider; uses the BytePlus Seed Speech HTTP API)
 - **Vydra** (primary or fallback provider; shared image, video, and speech provider)
 - **xAI** (primary or fallback provider; uses the xAI TTS API)
 - **Xiaomi MiMo** (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions)
@@ -39,7 +40,7 @@ or ElevenLabs.

 ## Optional keys

-If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo:
+If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:

 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
@@ -49,6 +50,9 @@ If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra,
  `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or
  `MINIMAX_CODING_API_KEY`
 - `OPENAI_API_KEY`
+- `VOLCENGINE_TTS_API_KEY` (or `BYTEPLUS_SEED_SPEECH_API_KEY`);
+  legacy AppID/token auth also accepts `VOLCENGINE_TTS_APPID` and
+  `VOLCENGINE_TTS_TOKEN`
 - `VYDRA_API_KEY`
 - `XAI_API_KEY`
 - `XIAOMI_API_KEY`
@@ -68,6 +72,7 @@ so that provider must also be authenticated if you enable summaries.
 - [Gradium](/providers/gradium)
 - [Inworld TTS API](https://docs.inworld.ai/tts/tts)
 - [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
+- [Volcengine TTS HTTP API](/providers/volcengine#text-to-speech)
 - [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
 - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
 - [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
@@ -249,6 +254,35 @@ encoding, so do not pass a raw bearer token and do not Base64-encode it
 yourself. The key falls back to the `INWORLD_API_KEY` env var. See
 [Inworld provider](/providers/inworld) for full setup.

+### Volcengine primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "volcengine",
+      providers: {
+        volcengine: {
+          apiKey: "byteplus_seed_speech_api_key",
+          resourceId: "seed-tts-1.0",
+          voice: "en_female_anna_mars_bigtts",
+          speedRatio: 1.0,
+        },
+      },
+    },
+  },
+}
+```
+
+Volcengine TTS uses the BytePlus Seed Speech API key from the Speech Console,
+not the OpenAI-compatible `VOLCANO_ENGINE_API_KEY` used for Doubao model
+providers. Resolution order is `messages.tts.providers.volcengine.apiKey` ->
+`VOLCENGINE_TTS_API_KEY` -> `BYTEPLUS_SEED_SPEECH_API_KEY`. Legacy AppID/token
+auth still works through `messages.tts.providers.volcengine.appId` / `token` or
+`VOLCENGINE_TTS_APPID` / `VOLCENGINE_TTS_TOKEN`. Voice-note targets request
+provider-native `ogg_opus`; normal audio-file targets request `mp3`.
+
 ### xAI primary

 ```json5
@@ -447,7 +481,7 @@ Then run:
  - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"volcengine"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
  rewritten to `provider: "microsoft"`.
@@ -461,7 +495,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VOLCENGINE_TTS_API_KEY`/`BYTEPLUS_SEED_SPEECH_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine also accepts legacy `appId`/`token` auth.
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
  - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -497,6 +531,21 @@ Then run:
  - If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
 - `providers.gradium.baseUrl`: override Gradium API base URL (default `https://api.gradium.ai`).
 - `providers.gradium.voiceId`: Gradium voice identifier (default Emma, `YTpq7expH9539ERJ`).
+- `providers.volcengine.apiKey`: BytePlus Seed Speech API key (env:
+  `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`).
+- `providers.volcengine.resourceId`: BytePlus Seed Speech resource id (default
+  `seed-tts-1.0`, env: `VOLCENGINE_TTS_RESOURCE_ID`; use `seed-tts-2.0` when
+  your BytePlus project has TTS 2.0 entitlement).
+- `providers.volcengine.appKey`: BytePlus Seed Speech app key header (default
+  `aGjiRDfUWi`, env: `VOLCENGINE_TTS_APP_KEY`).
+- `providers.volcengine.baseUrl`: override the Seed Speech TTS HTTP endpoint
+  (env: `VOLCENGINE_TTS_BASE_URL`).
+- `providers.volcengine.appId`: legacy Volcengine Speech Console application id (env: `VOLCENGINE_TTS_APPID`).
+- `providers.volcengine.token`: legacy Volcengine Speech Console access token (env: `VOLCENGINE_TTS_TOKEN`).
+- `providers.volcengine.cluster`: legacy Volcengine TTS cluster (default `volcano_tts`, env: `VOLCENGINE_TTS_CLUSTER`).
+- `providers.volcengine.voice`: voice type (default `en_female_anna_mars_bigtts`, env: `VOLCENGINE_TTS_VOICE`).
+- `providers.volcengine.speedRatio`: provider-native speed ratio.
+- `providers.volcengine.emotion`: provider-native emotion tag.
 - `providers.xai.apiKey`: xAI TTS API key (env: `XAI_API_KEY`).
 - `providers.xai.baseUrl`: override the xAI TTS base URL (default `https://api.x.ai/v1`, env: `XAI_BASE_URL`).
 - `providers.xai.voiceId`: xAI voice id (default `eve`; current live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`).
@@ -550,12 +599,13 @@ Here you go.

 Available directive keys (when enabled):

-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
-- `voice` (OpenAI, Gradium, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `volcengine`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
+- `voice` (OpenAI, Gradium, Volcengine, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
 - `model` (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or `google_model` (Google TTS model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
 - `vol` / `volume` (MiniMax volume, 0-10)
 - `pitch` (MiniMax integer pitch, -12 to 12; fractional values are truncated before the MiniMax request)
+- `emotion` (Volcengine emotion tag)
 - `applyTextNormalization` (`auto|on|off`)
 - `languageCode` (ISO 639-1)
 - `seed`
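
Reviewer sketch (not part of the patch): the credential resolution order and output-format choice documented in the "Volcengine primary" section can be illustrated as below. All function and type names are hypothetical. Preferring API-key auth over the legacy AppID/token pair when both are present is an assumption based on the commit message calling the latter a fallback.

```typescript
// Illustrative sketch only; names here are hypothetical, not the plugin's code.
type EnvMap = Record<string, string | undefined>;

interface VolcengineTtsAuthConfig {
  apiKey?: string;
  appId?: string;
  token?: string;
}

type SeedSpeechAuth =
  | { kind: "apiKey"; apiKey: string }
  | { kind: "legacy"; appId: string; token: string };

// Returns the first value that is non-empty after trimming.
function pickFirst(...values: Array<string | undefined>): string | undefined {
  for (const value of values) {
    const trimmed = value?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

// Documented order: config apiKey -> VOLCENGINE_TTS_API_KEY ->
// BYTEPLUS_SEED_SPEECH_API_KEY. Legacy AppID/token is tried afterwards
// (assumed fallback position) and needs both halves of the pair.
function resolveSeedSpeechAuth(
  config: VolcengineTtsAuthConfig,
  env: EnvMap,
): SeedSpeechAuth | undefined {
  const apiKey = pickFirst(
    config.apiKey,
    env["VOLCENGINE_TTS_API_KEY"],
    env["BYTEPLUS_SEED_SPEECH_API_KEY"],
  );
  if (apiKey) return { kind: "apiKey", apiKey };

  const appId = pickFirst(config.appId, env["VOLCENGINE_TTS_APPID"]);
  const token = pickFirst(config.token, env["VOLCENGINE_TTS_TOKEN"]);
  if (appId && token) return { kind: "legacy", appId, token };

  return undefined; // no usable Seed Speech credentials
}

type TtsTarget = "voice-note" | "audio-file";

// Voice-note targets request provider-native Ogg/Opus; other targets request MP3.
function pickSeedSpeechFormat(target: TtsTarget): "ogg_opus" | "mp3" {
  return target === "voice-note" ? "ogg_opus" : "mp3";
}
```

Returning `undefined` when no credentials resolve lets a caller treat the provider as unconfigured and fall back to another one. `VOLCANO_ENGINE_API_KEY` is deliberately never consulted here, matching the warning that it is not a Seed Speech key.
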