fix(minimax): transcode voice-note tts to opus

2026-05-06 14:30:45 +00:00 · 2026-04-25 04:52:19 +01:00
parent f3cc74ec5d
commit 225ff9a866
5 changed files with 110 additions and 4 deletions
--- a/docs/providers/minimax.md
+++ b/docs/providers/minimax.md
@@ -244,6 +244,18 @@ exposed separately through the plugin-owned `MiniMax-VL-01` media provider.
 See [Image Generation](/tools/image-generation) for shared tool parameters, provider selection, and failover behavior.
 </Note>

+### Text-to-speech
+
+The bundled `minimax` plugin registers MiniMax T2A v2 as a speech provider for
+`messages.tts`.
+
+- Default TTS model: `speech-2.8-hd`
+- Default voice: `English_expressive_narrator`
+- Normal audio attachments stay MP3.
+- For voice-note targets such as Feishu and Telegram, the MiniMax MP3 is
+  transcoded to 48kHz Opus with `ffmpeg`, because the Feishu/Lark file API only
+  accepts `file_type: "opus"` for native audio messages.
+
 ### Music generation

 The bundled `minimax` plugin also registers music generation through the shared
--- a/docs/tools/tts.md
+++ b/docs/tools/tts.md
@@ -488,7 +488,7 @@ These override `messages.tts.*` for that host.
  - 48kHz / 64kbps is a good voice message tradeoff.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
+- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
 - **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
 - **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
 - **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
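
Reviewer note: the transcode step described in the docs changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit. The helper names (`needsOpusTranscode`, `buildOpusTranscodeArgs`), the `DeliveryTarget` type, and the 64 kbps default (borrowed from the voice-message tradeoff in `docs/tools/tts.md`) are assumptions. Only standard `ffmpeg` flags are used, and the output container is inferred by `ffmpeg` from the output file extension.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not taken from the OpenClaw source.
type DeliveryTarget = "voice-note" | "audio-attachment";

interface OpusTranscodeOptions {
  sampleRateHz?: number; // docs state 48kHz for voice notes
  bitrateKbps?: number; // assumed default: 64, per the tts.md voice-message tradeoff
}

// Normal audio attachments stay MP3; only voice-note targets need Opus.
function needsOpusTranscode(target: DeliveryTarget): boolean {
  return target === "voice-note";
}

// Build the ffmpeg argument list for an MP3 -> Opus transcode.
function buildOpusTranscodeArgs(
  inputPath: string,
  outputPath: string,
  options: OpusTranscodeOptions = {},
): string[] {
  const sampleRateHz = options.sampleRateHz ?? 48000;
  const bitrateKbps = options.bitrateKbps ?? 64;
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputPath, // MiniMax MP3 input
    "-vn", // ignore any non-audio streams (e.g. embedded cover art)
    "-c:a", "libopus", // encode audio with the Opus encoder
    "-ar", String(sampleRateHz), // resample to the target rate
    "-b:a", `${bitrateKbps}k`, // target audio bitrate
    outputPath,
  ];
}
```

The resulting array could be passed to `execFile("ffmpeg", args)` from `node:child_process`. Building the arguments as an array rather than a shell string avoids quoting problems with temporary file paths.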