feat(minimax): add native TTS speech provider (T2A v2)

Add MiniMax as a fourth TTS provider alongside OpenAI, ElevenLabs, and Microsoft. Register a SpeechProviderPlugin in the existing minimax extension with config resolution, directive parsing, and Talk Mode support. Decode the hex-encoded audio response from the T2A v2 API to MP3.

Closes #52720

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 01:01:13 +00:00 · 2026-03-27 16:24:56 +01:00
parent 49d962a82f
commit 7d7f5d85b4
6 changed files with 699 additions and 7 deletions
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -9,13 +9,14 @@ title: "Text-to-Speech"

 # Text-to-speech (TTS)

-OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, MiniMax, or OpenAI.
 It works anywhere OpenClaw can send audio.

 ## Supported services

 - **ElevenLabs** (primary or fallback provider)
 - **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
+- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
 - **OpenAI** (primary or fallback provider; also used for summaries)

 ### Microsoft speech notes
@@ -33,9 +34,10 @@ or ElevenLabs.

 ## Optional keys

-If you want OpenAI or ElevenLabs:
+If you want OpenAI, ElevenLabs, or MiniMax:

 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
+- `MINIMAX_API_KEY`
 - `OPENAI_API_KEY`

 Microsoft speech does **not** require an API key.
@@ -50,6 +52,7 @@ so that provider must also be authenticated if you enable summaries.
 - [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
 - [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
 - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
+- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
 - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
 - [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)

@@ -143,6 +146,30 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```

+### MiniMax primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "minimax",
+      providers: {
+        minimax: {
+          apiKey: "minimax_api_key",
+          baseUrl: "https://api.minimaxi.com",
+          model: "speech-2.8-hd",
+          voiceId: "English_expressive_narrator",
+          speed: 1.0,
+          vol: 1.0,
+          pitch: 0,
+        },
+      },
+    },
+  },
+}
+```
+
 ### Disable Microsoft speech

 ```json5
@@ -211,7 +238,7 @@ Then run:
  - `tagged` only sends audio when the reply includes `[[tts]]` tags.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, or `"openai"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"microsoft"`, `"minimax"`, or `"openai"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` still works and is normalized to `microsoft`.
 - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
@@ -223,7 +250,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`).
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
  - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -235,6 +262,12 @@ Then run:
 - `providers.elevenlabs.applyTextNormalization`: `auto|on|off`
 - `providers.elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
 - `providers.elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
+- `providers.minimax.baseUrl`: override MiniMax API base URL (default `https://api.minimaxi.com`, env: `MINIMAX_API_HOST`).
+- `providers.minimax.model`: TTS model (default `speech-2.8-hd`, env: `MINIMAX_TTS_MODEL`).
+- `providers.minimax.voiceId`: voice identifier (default `English_expressive_narrator`, env: `MINIMAX_TTS_VOICE_ID`).
+- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
+- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
+- `providers.minimax.pitch`: pitch shift `-12..12` (default 0).
 - `providers.microsoft.enabled`: allow Microsoft speech usage (default `true`; no API key).
 - `providers.microsoft.voice`: Microsoft neural voice name (e.g. `en-US-MichelleNeural`).
 - `providers.microsoft.lang`: language code (e.g. `en-US`).
@@ -269,10 +302,12 @@ Here you go.

 Available directive keys (when enabled):

-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, or `microsoft`; requires `allowProvider: true`)
-- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
-- `model` (OpenAI TTS model or ElevenLabs model id)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `minimax`, or `microsoft`; requires `allowProvider: true`)
+- `voice` (OpenAI voice) or `voiceId` (ElevenLabs / MiniMax)
+- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
+- `vol` / `volume` (MiniMax volume; greater than 0, up to 10)
+- `pitch` (MiniMax pitch, -12 to 12)
 - `applyTextNormalization` (`auto|on|off`)
 - `languageCode` (ISO 639-1)
 - `seed`
@@ -328,6 +363,7 @@ These override `messages.tts.*` for that host.
  - 48kHz / 64kbps is a good voice message tradeoff.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
+- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). The Opus voice-note format is not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
 - **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
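
Review note: the documented `providers.minimax.*` ranges and the hex-to-MP3 decoding step mentioned in the commit message could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the names `MinimaxVoiceSettings`, `validateMinimaxVoiceSettings`, and `decodeHexAudio` are hypothetical, and the ranges are copied from the docs in this diff rather than from the extension source.

```typescript
// Hypothetical shape; field names mirror the documented providers.minimax.* keys.
interface MinimaxVoiceSettings {
  speed?: number; // documented range 0.5..2.0, default 1.0
  vol?: number; // documented range (0, 10], default 1.0
  pitch?: number; // documented range -12..12, default 0
}

// Returns a list of human-readable problems; an empty list means the
// settings fall inside the documented ranges.
function validateMinimaxVoiceSettings(settings: MinimaxVoiceSettings): string[] {
  const errors: string[] = [];
  const speed = settings.speed ?? 1.0;
  const vol = settings.vol ?? 1.0;
  const pitch = settings.pitch ?? 0;

  // Negated comparisons so that NaN is rejected as well.
  if (!(speed >= 0.5 && speed <= 2.0)) {
    errors.push(`speed must be in 0.5..2.0, got ${speed}`);
  }
  if (!(vol > 0 && vol <= 10)) {
    errors.push(`vol must be greater than 0 and at most 10, got ${vol}`);
  }
  if (!(pitch >= -12 && pitch <= 12)) {
    errors.push(`pitch must be in -12..12, got ${pitch}`);
  }
  return errors;
}

// Converts a hex string (two characters per byte) into raw bytes.
// Assumption: the audio payload arrives as one plain hex string.
function decodeHexAudio(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new Error("invalid hex audio payload");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```

For example, `validateMinimaxVoiceSettings({ vol: 0 })` reports one problem because the documented volume range excludes 0, while an empty settings object passes by falling back to the documented defaults.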