fix(minimax): transcode voice-note tts to opus

This commit is contained in:
Peter Steinberger
2026-04-25 04:52:19 +01:00
parent f3cc74ec5d
commit 225ff9a866
5 changed files with 110 additions and 4 deletions

View File

@@ -244,6 +244,18 @@ exposed separately through the plugin-owned `MiniMax-VL-01` media provider.
See [Image Generation](/tools/image-generation) for shared tool parameters, provider selection, and failover behavior.
</Note>
### Text-to-speech
The bundled `minimax` plugin registers MiniMax T2A v2 as a speech provider for
`messages.tts`.
- Default TTS model: `speech-2.8-hd`
- Default voice: `English_expressive_narrator`
- Normal audio attachments stay MP3.
- Voice-note targets such as Feishu and Telegram are transcoded from MiniMax
MP3 to 48kHz Opus with `ffmpeg`, because the Feishu/Lark file API only
accepts `file_type: "opus"` for native audio messages.
### Music generation
The bundled `minimax` plugin also registers music generation through the shared

View File

@@ -488,7 +488,7 @@ These override `messages.tts.*` for that host.
- 48kHz / 64kbps is a good voice message tradeoff.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate). Voice-note format not natively supported; use OpenAI or ElevenLabs for guaranteed Opus voice messages.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.