mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 16:30:57 +00:00
feat(media): add voice conversion and speech plugins
@@ -147,6 +147,11 @@ STT and TTS support two-level configuration with priority fallback:

Set `enabled: false` on either to disable.

Inbound QQ voice attachments are exposed to agents as audio media metadata while
keeping raw voice files out of generic `MediaPaths`. Plain-text replies tagged
`[[audio_as_voice]]` are synthesized with TTS and sent as native QQ voice
messages when TTS is configured.

Outbound audio upload/transcode behavior can also be tuned with
`channels.qqbot.audioFormatPolicy`:

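As a minimal sketch of where that option sits in config; the `audioFormatPolicy` value shown here is purely illustrative, since the accepted values are not listed in this excerpt:

```json5
{
  channels: {
    qqbot: {
      // Hypothetical value; consult the qqbot channel reference for real options.
      audioFormatPolicy: "auto",
    },
  },
}
```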
@@ -362,7 +362,8 @@ When the linked self number is also present in `allowFrom`, WhatsApp self-chat s
<Accordion title="Outbound media behavior">
- supports image, video, audio (PTT voice-note), and document payloads
- reply payloads preserve `audioAsVoice`; WhatsApp sends audio media as Baileys PTT voice notes
- `audio/ogg` is rewritten to `audio/ogg; codecs=opus` for voice-note compatibility
- non-Ogg audio, including Microsoft Edge TTS MP3/WebM output, is transcoded to Ogg/Opus before PTT delivery
- native Ogg/Opus audio is sent with `audio/ogg; codecs=opus` for voice-note compatibility
- animated GIF playback is supported via `gifPlayback: true` on video sends
- captions are applied to the first media item when sending multi-media reply payloads
- media source can be HTTP(S), `file://`, or local paths

@@ -31,7 +31,7 @@ OpenClaw auto-detects in this order and stops at the first working option:
3. **Gemini CLI** (`gemini`) using `read_many_files`
4. **Provider auth**
   - Configured `models.providers.*` entries that support audio are tried first
   - Bundled fallback order: OpenAI → Groq → Deepgram → Google → Mistral
   - Bundled fallback order: OpenAI → Groq → xAI → Deepgram → Google → SenseAudio → ElevenLabs → Mistral

To disable auto-detection, set `tools.media.audio.enabled: false`.
To customize, set `tools.media.audio.models`.
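For example, a customized model list might pin a preferred provider and keep a fallback. A sketch (the model ids below are illustrative, not verified defaults):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        // Entries are tried in order; the first working one wins.
        models: [
          { provider: "openai", model: "whisper-1" },   // illustrative model id
          { provider: "deepgram", model: "nova-2" },    // illustrative model id
        ],
      },
    },
  },
}
```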
@@ -112,6 +112,21 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
}
```

### Provider-only (SenseAudio)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "senseaudio", model: "senseaudio-asr-pro-1.5-260319" }],
      },
    },
  },
}
```

### Echo transcript to chat (opt-in)

```json5
@@ -136,6 +151,8 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
- Mistral setup details: [Mistral](/providers/mistral).
- SenseAudio picks up `SENSEAUDIO_API_KEY` when `provider: "senseaudio"` is used.
- SenseAudio setup details: [SenseAudio](/providers/senseaudio).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Tiny/empty audio files below 1024 bytes are skipped before provider/CLI transcription.

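As a sketch, raising the size cap could look like this; the 20MB default comes from the note above, and `52428800` is simply 50MB expressed in bytes:

```json5
{
  tools: {
    media: {
      audio: {
        // 50 * 1024 * 1024 bytes; larger audio skips the model entry.
        maxBytes: 52428800,
      },
    },
  },
}
```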
@@ -167,7 +167,7 @@ working option**:
  example through `agents.defaults.imageModel` or
  `openclaw infer image describe --model ollama/<vision-model>`.
- Bundled fallback order:
  - Audio: OpenAI → Groq → xAI → Deepgram → Google → Mistral
  - Audio: OpenAI → Groq → xAI → Deepgram → Google → SenseAudio → ElevenLabs → Mistral
  - Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
  - Video: Google → Qwen → Moonshot

@@ -228,7 +228,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
| Capability | Provider integration | Notes |
| ---------- | -------------------- | ----- |
| Image | OpenAI, OpenAI Codex OAuth, Codex app-server, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Qwen, Z.AI, config providers | Vendor plugins register image support; `openai-codex/*` uses OAuth provider plumbing; `codex/*` uses a bounded Codex app-server turn; MiniMax and MiniMax OAuth both use `MiniMax-VL-01`; image-capable config providers auto-register. |
| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). |
| Audio | OpenAI, Groq, xAI, Deepgram, Google, SenseAudio, ElevenLabs, Mistral | Provider transcription (Whisper/Groq/xAI/Deepgram/Gemini/SenseAudio/Scribe/Voxtral). |
| Video | Google, Qwen, Moonshot | Provider video understanding via vendor plugins; Qwen video understanding uses the Standard DashScope endpoints. |

MiniMax note:

@@ -62,6 +62,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
- [Qianfan](/providers/qianfan)
- [Qwen Cloud](/providers/qwen)
- [Runway](/providers/runway)
- [SenseAudio](/providers/senseaudio)
- [SGLang (local models)](/providers/sglang)
- [StepFun](/providers/stepfun)
- [Synthetic](/providers/synthetic)

@@ -89,6 +90,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
- [ElevenLabs](/providers/elevenlabs#speech-to-text)
- [Mistral](/providers/mistral#audio-transcription-voxtral)
- [OpenAI](/providers/openai#speech-to-text)
- [SenseAudio](/providers/senseaudio)
- [xAI](/providers/xai#speech-to-text)

## Community tools

65 docs/providers/senseaudio.md Normal file
@@ -0,0 +1,65 @@
---
summary: "SenseAudio batch speech-to-text for inbound voice notes"
read_when:
  - You want SenseAudio speech-to-text for audio attachments
  - You need the SenseAudio API key env var or audio config path
title: "SenseAudio"
---

# SenseAudio

SenseAudio can transcribe inbound audio/voice-note attachments through
OpenClaw's shared `tools.media.audio` pipeline. OpenClaw posts multipart audio
to the OpenAI-compatible transcription endpoint and injects the returned text
as `{{Transcript}}` plus an `[Audio]` block.

| Detail | Value |
| ------------- | ------------------------------------------------ |
| Website | [senseaudio.cn](https://senseaudio.cn) |
| Docs | [senseaudio.cn/docs](https://senseaudio.cn/docs) |
| Auth | `SENSEAUDIO_API_KEY` |
| Default model | `senseaudio-asr-pro-1.5-260319` |
| Default URL | `https://api.senseaudio.cn/v1` |

## Getting Started

<Steps>
  <Step title="Set your API key">
    ```bash
    export SENSEAUDIO_API_KEY="..."
    ```
  </Step>
  <Step title="Enable the audio provider">
    ```json5
    {
      tools: {
        media: {
          audio: {
            enabled: true,
            models: [{ provider: "senseaudio", model: "senseaudio-asr-pro-1.5-260319" }],
          },
        },
      },
    }
    ```
  </Step>
  <Step title="Send a voice note">
    Send an audio message through any connected channel. OpenClaw uploads the
    audio to SenseAudio and uses the transcript in the reply pipeline.
  </Step>
</Steps>

## Options

| Option | Path | Description |
| ---------- | ------------------------------------- | ----------------------------------- |
| `model` | `tools.media.audio.models[].model` | SenseAudio ASR model id |
| `language` | `tools.media.audio.models[].language` | Optional language hint |
| `prompt` | `tools.media.audio.prompt` | Optional transcription prompt |
| `baseUrl` | `tools.media.audio.baseUrl` or model | Override the OpenAI-compatible base |
| `headers` | `tools.media.audio.request.headers` | Extra request headers |

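Putting the options above together, a sketch of a tuned entry; the `language` value and prompt text are illustrative:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        prompt: "Transcribe the audio verbatim.",
        models: [
          {
            provider: "senseaudio",
            model: "senseaudio-asr-pro-1.5-260319",
            language: "zh", // optional language hint
          },
        ],
      },
    },
  },
}
```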
<Note>
SenseAudio is batch STT only in OpenClaw. Voice Call realtime transcription
continues to use providers with streaming STT support.
</Note>
@@ -18,32 +18,35 @@ OpenClaw generates images, videos, and music, understands inbound media (images,
| Image generation | `image_generate` | ComfyUI, fal, Google, MiniMax, OpenAI, Vydra, xAI | Creates or edits images from text prompts or references |
| Video generation | `video_generate` | Alibaba, BytePlus, ComfyUI, fal, Google, MiniMax, OpenAI, Qwen, Runway, Together, Vydra, xAI | Creates videos from text, images, or existing videos |
| Music generation | `music_generate` | ComfyUI, Google, MiniMax | Creates music or audio tracks from text prompts |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Gradium, Microsoft, MiniMax, OpenAI, Vydra, xAI | Converts outbound replies to spoken audio |
| Text-to-speech (TTS) | `tts` | ElevenLabs, Google, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, Xiaomi MiMo | Converts outbound replies to spoken audio |
| Media understanding | (automatic) | Any vision/audio-capable model provider, plus CLI fallbacks | Summarizes inbound images, audio, and video |

## Provider capability matrix

This table shows which providers support which media capabilities across the platform.

| Provider | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding |
| ---------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- |
| Alibaba | | Yes | | | | | |
| BytePlus | | Yes | | | | | |
| ComfyUI | Yes | Yes | Yes | | | | |
| Deepgram | | | | | Yes | | |
| ElevenLabs | | | | Yes | Yes | | |
| fal | Yes | Yes | | | | | |
| Google | Yes | Yes | Yes | Yes | | Yes | Yes |
| Gradium | | | | Yes | | | |
| Microsoft | | | | Yes | | | |
| MiniMax | Yes | Yes | Yes | Yes | | | |
| Mistral | | | | | Yes | | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes | Yes |
| Qwen | | Yes | | | | | |
| Runway | | Yes | | | | | |
| Together | | Yes | | | | | |
| Vydra | Yes | Yes | | Yes | | | |
| xAI | Yes | Yes | | Yes | Yes | | Yes |
| Provider | Image | Video | Music | TTS | STT / Transcription | Realtime Voice | Media Understanding |
| ----------- | ----- | ----- | ----- | --- | ------------------- | -------------- | ------------------- |
| Alibaba | | Yes | | | | | |
| BytePlus | | Yes | | | | | |
| ComfyUI | Yes | Yes | Yes | | | | |
| Deepgram | | | | | Yes | Yes | |
| ElevenLabs | | | | Yes | Yes | | |
| fal | Yes | Yes | | | | | |
| Google | Yes | Yes | Yes | Yes | | Yes | Yes |
| Gradium | | | | Yes | | | |
| Local CLI | | | | Yes | | | |
| Microsoft | | | | Yes | | | |
| MiniMax | Yes | Yes | Yes | Yes | | | |
| Mistral | | | | | Yes | | |
| OpenAI | Yes | Yes | | Yes | Yes | Yes | Yes |
| Qwen | | Yes | | | | | |
| Runway | | Yes | | | | | |
| SenseAudio | | | | | Yes | | |
| Together | | Yes | | | | | |
| Vydra | Yes | Yes | | Yes | | | |
| xAI | Yes | Yes | | Yes | Yes | | Yes |
| Xiaomi MiMo | Yes | | | Yes | | | Yes |

<Note>
Media understanding uses any vision-capable or audio-capable model registered in your provider config. The table above highlights providers with dedicated media-understanding support; most LLM providers with multimodal models (Anthropic, Google, OpenAI, etc.) can also understand inbound media when configured as the active reply model.

@@ -53,11 +56,11 @@ Media understanding uses any vision-capable or audio-capable model registered in

Video and music generation run as background tasks because provider processing typically takes 30 seconds to several minutes. When the agent calls `video_generate` or `music_generate`, OpenClaw submits the request to the provider, returns a task ID immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel. Image generation and TTS are synchronous and complete inline with the reply.

Deepgram, ElevenLabs, Mistral, OpenAI, and xAI can all transcribe inbound
audio through the batch `tools.media.audio` path when configured. Deepgram,
ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT
providers, so live phone audio can be forwarded to the selected vendor
without waiting for a completed recording.
Deepgram, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe
inbound audio through the batch `tools.media.audio` path when configured.
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.

Google maps to OpenClaw's image, video, music, batch TTS, backend realtime
voice, and media-understanding surfaces. OpenAI maps to OpenClaw's image,

@@ -7,7 +7,7 @@ read_when:
title: "Text-to-speech"
---

OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.

## Supported services
@@ -15,6 +15,7 @@ It works anywhere OpenClaw can send audio.
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Gradium** (primary or fallback provider; supports voice-note and telephony output)
- **Local CLI** (primary or fallback provider; runs a configured local TTS command)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)

@@ -50,7 +51,7 @@ If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, xAI, or
- `XAI_API_KEY`
- `XIAOMI_API_KEY`

Microsoft speech does **not** require an API key.
Local CLI and Microsoft speech do **not** require an API key.

If multiple providers are configured, the selected provider is used first and the others are fallback options.
Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
@@ -297,6 +298,35 @@ OpenRouter model provider. Resolution order is
`messages.tts.providers.openrouter.apiKey` ->
`models.providers.openrouter.apiKey` -> `OPENROUTER_API_KEY`.

### Local CLI primary

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "say",
          args: ["-o", "{{OutputPath}}", "{{Text}}"],
          outputFormat: "wav",
          timeoutMs: 120000,
        },
      },
    },
  },
}
```

Local CLI TTS runs the configured command on the gateway host. `{{Text}}`,
`{{OutputPath}}`, `{{OutputDir}}`, and `{{OutputBase}}` placeholders are
expanded in `args`; if no `{{Text}}` placeholder is present, OpenClaw writes the
spoken text to stdin. `outputFormat` accepts `mp3`, `opus`, or `wav`.
Voice-note targets are transcoded to Ogg/Opus and telephony output is
transcoded to raw 16 kHz mono PCM with `ffmpeg`. The legacy provider alias
`cli` still works, but new config should use `tts-local-cli`.

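As a sketch of the stdin mode described above: when `args` contains no `{{Text}}` placeholder, the spoken text arrives on stdin. The `espeak-ng` invocation here is illustrative; any CLI that reads text from stdin and writes an audio file should work.

```json5
{
  messages: {
    tts: {
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "espeak-ng",
          // No {{Text}} placeholder, so OpenClaw pipes the text to stdin.
          args: ["--stdin", "-w", "{{OutputPath}}"],
          outputFormat: "wav",
        },
      },
    },
  },
}
```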
### Gradium primary

```json5
@@ -417,6 +447,12 @@ Then run:
- `providers.minimax.speed`: playback speed `0.5..2.0` (default 1.0).
- `providers.minimax.vol`: volume `(0, 10]` (default 1.0; must be greater than 0).
- `providers.minimax.pitch`: integer pitch shift `-12..12` (default 0). Fractional values are truncated before calling MiniMax T2A because the API rejects non-integer pitch values.
- `providers.tts-local-cli.command`: local executable or command string for CLI TTS.
- `providers.tts-local-cli.args`: command arguments; supports `{{Text}}`, `{{OutputPath}}`, `{{OutputDir}}`, and `{{OutputBase}}` placeholders.
- `providers.tts-local-cli.outputFormat`: expected CLI output format (`mp3`, `opus`, or `wav`; default `mp3` for audio attachments).
- `providers.tts-local-cli.timeoutMs`: command timeout in milliseconds (default `120000`).
- `providers.tts-local-cli.cwd`: optional command working directory.
- `providers.tts-local-cli.env`: optional string environment overrides for the command.
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.audioProfile`: natural-language style prompt prepended before the spoken text.
@@ -545,6 +581,9 @@ These override `messages.tts.*` for that host.
- 44.1kHz / 128kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
- **Xiaomi MiMo**: MP3 by default, or WAV when configured. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz Opus with `ffmpeg` before delivery.
- **Local CLI**: uses the configured `outputFormat`. Voice-note targets are converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM with `ffmpeg`.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.