feat(tts): add Azure Speech provider

Co-authored-by: Leon Chui <84605354+leonchui@users.noreply.github.com>
Peter Steinberger
2026-04-26 01:35:45 +01:00
parent 753ccf615c
commit 5b80d0c15e
17 changed files with 1230 additions and 3 deletions


@@ -11,6 +11,14 @@
"source": "OpenAI provider",
"target": "OpenAI provider"
},
{
"source": "Azure Speech",
"target": "Azure Speech"
},
{
"source": "Azure Speech provider",
"target": "Azure Speech provider"
},
{
"source": "Status",
"target": "Status"


@@ -1301,6 +1301,7 @@
"providers/bedrock-mantle",
"providers/anthropic",
"providers/arcee",
"providers/azure-speech",
"providers/chutes",
"providers/claude-max-api-proxy",
"providers/cloudflare-ai-gateway",


@@ -0,0 +1,119 @@
---
summary: "Azure AI Speech text-to-speech for OpenClaw replies"
read_when:
- You want Azure Speech synthesis for outbound replies
  - You need native Ogg/Opus voice-note output from Azure Speech
title: "Azure Speech"
---
Azure Speech is a text-to-speech provider backed by Azure AI Speech. In
OpenClaw it synthesizes outbound reply audio as MP3 by default, emits native
Ogg/Opus for voice notes, and produces 8 kHz mulaw audio for telephony
channels such as Voice Call. OpenClaw calls the Azure Speech REST API
directly with SSML and selects the provider-owned output format via the
`X-Microsoft-OutputFormat` header.

| Detail | Value |
| ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| Website | [Azure AI Speech](https://azure.microsoft.com/products/ai-services/ai-speech) |
| Docs | [Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech) |
| Auth | `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` |
| Default voice | `en-US-JennyNeural` |
| Default file output | `audio-24khz-48kbitrate-mono-mp3` |
| Default voice-note file | `ogg-24khz-16bit-mono-opus` |
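Under the hood, one synthesis call boils down to a single POST against the regional TTS endpoint. A minimal sketch in Python (the `build_tts_request` helper is hypothetical; the endpoint, headers, and SSML shape follow the Azure Speech REST docs):

```python
from xml.sax.saxutils import escape

def build_tts_request(region: str, key: str, text: str,
                      voice: str = "en-US-JennyNeural",
                      lang: str = "en-US",
                      output_format: str = "audio-24khz-48kbitrate-mono-mp3"):
    """Build the URL, headers, and SSML body for one Azure Speech TTS call."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
    }
    # Azure expects SSML, so the text is escaped and wrapped in a voice element.
    ssml = (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return url, headers, ssml
```

The actual request would then be sent with an HTTP client, e.g. `requests.post(url, headers=headers, data=ssml)`, and the response body is the encoded audio.
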
## Getting started
<Steps>
<Step title="Create an Azure Speech resource">
In the Azure portal, create a Speech resource. Copy **KEY 1** from
Resource Management > Keys and Endpoint, and copy the resource location
such as `eastus`.
```
AZURE_SPEECH_KEY=<speech-resource-key>
AZURE_SPEECH_REGION=eastus
```
</Step>
<Step title="Select Azure Speech in messages.tts">
```json5
{
messages: {
tts: {
auto: "always",
provider: "azure-speech",
providers: {
"azure-speech": {
voice: "en-US-JennyNeural",
lang: "en-US",
},
},
},
},
}
```
</Step>
<Step title="Send a message">
Send a reply through any connected channel. OpenClaw synthesizes the audio
with Azure Speech and delivers MP3 for standard audio, or Ogg/Opus when
the channel expects a voice note.
</Step>
</Steps>
## Configuration options
| Option | Path | Description |
| ----------------------- | ----------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `apiKey` | `messages.tts.providers.azure-speech.apiKey` | Azure Speech resource key. Falls back to `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`. |
| `region` | `messages.tts.providers.azure-speech.region` | Azure Speech resource region. Falls back to `AZURE_SPEECH_REGION` or `SPEECH_REGION`. |
| `endpoint` | `messages.tts.providers.azure-speech.endpoint` | Optional Azure Speech endpoint/base URL override. |
| `baseUrl` | `messages.tts.providers.azure-speech.baseUrl` | Optional Azure Speech base URL override. |
| `voice` | `messages.tts.providers.azure-speech.voice` | Azure voice ShortName (default `en-US-JennyNeural`). |
| `lang` | `messages.tts.providers.azure-speech.lang` | SSML language code (default `en-US`). |
| `outputFormat` | `messages.tts.providers.azure-speech.outputFormat` | Audio-file output format (default `audio-24khz-48kbitrate-mono-mp3`). |
| `voiceNoteOutputFormat` | `messages.tts.providers.azure-speech.voiceNoteOutputFormat` | Voice-note output format (default `ogg-24khz-16bit-mono-opus`). |
## Notes
<AccordionGroup>
<Accordion title="Authentication">
Azure Speech uses a Speech resource key, not an Azure OpenAI key. The key
is sent as `Ocp-Apim-Subscription-Key`; OpenClaw derives
`https://<region>.tts.speech.microsoft.com` from `region` unless you
provide `endpoint` or `baseUrl`.
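    That resolution might be sketched as (hypothetical `resolve_base_url` helper; only the derived `*.tts.speech.microsoft.com` hostname comes from the Azure docs):

    ```python
    def resolve_base_url(config: dict) -> str:
        """Pick the base URL: an explicit endpoint/baseUrl wins, else derive from region."""
        for key in ("endpoint", "baseUrl"):
            value = config.get(key)
            if value:
                return value.rstrip("/")
        region = config.get("region")
        if not region:
            raise ValueError("azure-speech requires endpoint, baseUrl, or region")
        return f"https://{region}.tts.speech.microsoft.com"
    ```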
</Accordion>
<Accordion title="Voice names">
    Use the Azure Speech voice `ShortName` value, for example
    `en-US-JennyNeural`. The bundled provider can list voices through the
    same Speech resource and filters out voices marked as deprecated or retired.
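    The filtering step might look like this sketch (hypothetical helper; `ShortName` and `Status` are fields of the Azure voices-list response):

    ```python
    def usable_voices(voices: list[dict]) -> list[str]:
        """Keep voice ShortNames whose Status is not Deprecated or Retired."""
        skip = {"deprecated", "retired"}
        return [
            v["ShortName"]
            for v in voices
            if str(v.get("Status", "")).lower() not in skip
        ]
    ```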
</Accordion>
<Accordion title="Audio outputs">
Azure accepts output formats such as `audio-24khz-48kbitrate-mono-mp3`,
`ogg-24khz-16bit-mono-opus`, and `riff-24khz-16bit-mono-pcm`. OpenClaw
requests Ogg/Opus for `voice-note` targets so channels can send native
voice bubbles without an extra MP3 conversion.
</Accordion>
<Accordion title="Alias">
`azure` is accepted as a provider alias for existing PRs and user config,
but new config should use `azure-speech` to avoid confusion with Azure
OpenAI model providers.
</Accordion>
</AccordionGroup>
## Related
<CardGroup cols={2}>
<Card title="Text-to-speech" href="/tools/tts" icon="waveform-lines">
TTS overview, providers, and `messages.tts` config.
</Card>
<Card title="Configuration" href="/gateway/configuration" icon="gear">
Full config reference including `messages.tts` settings.
</Card>
<Card title="Providers" href="/providers" icon="grid">
All bundled OpenClaw providers.
</Card>
<Card title="Troubleshooting" href="/help/troubleshooting" icon="wrench">
Common issues and debugging steps.
</Card>
</CardGroup>


@@ -31,6 +31,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
- [Amazon Bedrock Mantle](/providers/bedrock-mantle)
- [Anthropic (API + Claude CLI)](/providers/anthropic)
- [Arcee AI (Trinity models)](/providers/arcee)
- [Azure Speech](/providers/azure-speech)
- [BytePlus (International)](/concepts/model-providers#byteplus-international)
- [Chutes](/providers/chutes)
- [Cloudflare AI Gateway](/providers/cloudflare-ai-gateway)


@@ -7,11 +7,12 @@ read_when:
title: "Text-to-speech"
---
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.
## Supported services
- **Azure Speech** (primary or fallback provider; uses the Azure AI Speech REST API)
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Gradium** (primary or fallback provider; supports voice-note and telephony output)
@@ -40,8 +41,10 @@ or ElevenLabs.
## Optional keys
If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
If you want Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
- `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` (also accepts
`AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, and `SPEECH_REGION`)
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `GRADIUM_API_KEY`
@@ -67,6 +70,8 @@ so that provider must also be authenticated if you enable summaries.
- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
- [Azure Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech)
- [Azure Speech provider](/providers/azure-speech)
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
- [Gradium](/providers/gradium)
@@ -145,6 +150,36 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```
### Azure Speech primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "azure-speech",
providers: {
"azure-speech": {
// apiKey falls back to AZURE_SPEECH_KEY.
// region falls back to AZURE_SPEECH_REGION.
voice: "en-US-JennyNeural",
lang: "en-US",
outputFormat: "audio-24khz-48kbitrate-mono-mp3",
voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
},
},
},
},
}
```
Azure Speech uses a Speech resource key, not an Azure OpenAI key. The key
resolves as `messages.tts.providers.azure-speech.apiKey` ->
`AZURE_SPEECH_KEY` -> `AZURE_SPEECH_API_KEY` -> `SPEECH_KEY`, and the region
as `messages.tts.providers.azure-speech.region` -> `AZURE_SPEECH_REGION` ->
`SPEECH_REGION`. New config should use `azure-speech`; `azure` is accepted
as a provider alias.
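That resolution order can be sketched as a small helper (hypothetical; the env var names are the ones listed above):

```python
import os

def resolve_credentials(config: dict) -> tuple[str, str]:
    """Resolve (key, region): explicit config first, then env-var fallbacks in order."""
    key = config.get("apiKey") or next(
        (os.environ[n]
         for n in ("AZURE_SPEECH_KEY", "AZURE_SPEECH_API_KEY", "SPEECH_KEY")
         if os.environ.get(n)),
        None)
    region = config.get("region") or next(
        (os.environ[n]
         for n in ("AZURE_SPEECH_REGION", "SPEECH_REGION")
         if os.environ.get(n)),
        None)
    if not key or not region:
        raise ValueError("Azure Speech needs both a key and a region")
    return key, region
```
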
### Microsoft primary (no API key)
```json5
@@ -495,7 +530,21 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `apiKey` values fall back to env vars (`AZURE_SPEECH_KEY`/`AZURE_SPEECH_API_KEY`/`SPEECH_KEY`, `ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `providers.azure-speech.apiKey`: Azure Speech resource key (env:
`AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`).
- `providers.azure-speech.region`: Azure Speech region such as `eastus` (env:
`AZURE_SPEECH_REGION` or `SPEECH_REGION`).
- `providers.azure-speech.endpoint` / `providers.azure-speech.baseUrl`: optional
Azure Speech endpoint/base URL override.
- `providers.azure-speech.voice`: Azure voice ShortName (default
`en-US-JennyNeural`).
- `providers.azure-speech.lang`: SSML language code (default `en-US`).
- `providers.azure-speech.outputFormat`: Azure `X-Microsoft-OutputFormat` for
standard audio output (default `audio-24khz-48kbitrate-mono-mp3`).
- `providers.azure-speech.voiceNoteOutputFormat`: Azure
`X-Microsoft-OutputFormat` for voice-note output (default
`ogg-24khz-16bit-mono-opus`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`