feat(tts): add Azure Speech provider

Co-authored-by: Leon Chui <84605354+leonchui@users.noreply.github.com>
Peter Steinberger
2026-04-26 01:35:45 +01:00
parent 753ccf615c
commit 5b80d0c15e
17 changed files with 1230 additions and 3 deletions


@@ -11,6 +11,14 @@
"source": "OpenAI provider",
"target": "OpenAI provider"
},
{
"source": "Azure Speech",
"target": "Azure Speech"
},
{
"source": "Azure Speech provider",
"target": "Azure Speech provider"
},
{
"source": "Status",
"target": "Status"


@@ -1301,6 +1301,7 @@
"providers/bedrock-mantle",
"providers/anthropic",
"providers/arcee",
"providers/azure-speech",
"providers/chutes",
"providers/claude-max-api-proxy",
"providers/cloudflare-ai-gateway",


@@ -0,0 +1,119 @@
---
summary: "Azure AI Speech text-to-speech for OpenClaw replies"
read_when:
- You want Azure Speech synthesis for outbound replies
  - You need native Ogg/Opus voice-note output from Azure Speech
title: "Azure Speech"
---
Azure Speech is a text-to-speech provider backed by Azure AI Speech. In
OpenClaw it synthesizes outbound reply audio as MP3 by default, emits native
Ogg/Opus for voice notes, and produces 8 kHz mulaw audio for telephony
channels such as Voice Call. OpenClaw calls the Azure Speech REST API
directly with SSML and selects the provider-owned output format via the
`X-Microsoft-OutputFormat` header.

| Detail | Value |
| ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| Website | [Azure AI Speech](https://azure.microsoft.com/products/ai-services/ai-speech) |
| Docs | [Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech) |
| Auth | `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` |
| Default voice | `en-US-JennyNeural` |
| Default file output | `audio-24khz-48kbitrate-mono-mp3` |
| Default voice-note file | `ogg-24khz-16bit-mono-opus` |
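Under the hood, one synthesis call boils down to a single POST against the regional TTS endpoint. A minimal sketch in Python (the `build_tts_request` helper is hypothetical; the endpoint, headers, and SSML shape follow the Azure Speech REST docs):

```python
from xml.sax.saxutils import escape

def build_tts_request(region: str, key: str, text: str,
                      voice: str = "en-US-JennyNeural",
                      lang: str = "en-US",
                      output_format: str = "audio-24khz-48kbitrate-mono-mp3"):
    """Build the URL, headers, and SSML body for one Azure Speech TTS call."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
    }
    # Azure expects SSML, so the text is escaped and wrapped in a voice element.
    ssml = (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return url, headers, ssml
```

The actual request would then be sent with an HTTP client, e.g. `requests.post(url, headers=headers, data=ssml)`, and the response body is the encoded audio.
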
## Getting started
<Steps>
<Step title="Create an Azure Speech resource">
In the Azure portal, create a Speech resource. Copy **KEY 1** from
Resource Management > Keys and Endpoint, and copy the resource location
such as `eastus`.
```
AZURE_SPEECH_KEY=<speech-resource-key>
AZURE_SPEECH_REGION=eastus
```
</Step>
<Step title="Select Azure Speech in messages.tts">
```json5
{
messages: {
tts: {
auto: "always",
provider: "azure-speech",
providers: {
"azure-speech": {
voice: "en-US-JennyNeural",
lang: "en-US",
},
},
},
},
}
```
</Step>
<Step title="Send a message">
Send a reply through any connected channel. OpenClaw synthesizes the audio
with Azure Speech and delivers MP3 for standard audio, or Ogg/Opus when
the channel expects a voice note.
</Step>
</Steps>
## Configuration options
| Option | Path | Description |
| ----------------------- | ----------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `apiKey` | `messages.tts.providers.azure-speech.apiKey` | Azure Speech resource key. Falls back to `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`. |
| `region` | `messages.tts.providers.azure-speech.region` | Azure Speech resource region. Falls back to `AZURE_SPEECH_REGION` or `SPEECH_REGION`. |
| `endpoint` | `messages.tts.providers.azure-speech.endpoint` | Optional Azure Speech endpoint/base URL override. |
| `baseUrl` | `messages.tts.providers.azure-speech.baseUrl` | Optional Azure Speech base URL override. |
| `voice` | `messages.tts.providers.azure-speech.voice` | Azure voice ShortName (default `en-US-JennyNeural`). |
| `lang` | `messages.tts.providers.azure-speech.lang` | SSML language code (default `en-US`). |
| `outputFormat` | `messages.tts.providers.azure-speech.outputFormat` | Audio-file output format (default `audio-24khz-48kbitrate-mono-mp3`). |
| `voiceNoteOutputFormat` | `messages.tts.providers.azure-speech.voiceNoteOutputFormat` | Voice-note output format (default `ogg-24khz-16bit-mono-opus`). |
## Notes
<AccordionGroup>
<Accordion title="Authentication">
Azure Speech uses a Speech resource key, not an Azure OpenAI key. The key
is sent as `Ocp-Apim-Subscription-Key`; OpenClaw derives
`https://<region>.tts.speech.microsoft.com` from `region` unless you
provide `endpoint` or `baseUrl`.
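    That resolution might be sketched as (hypothetical `resolve_base_url` helper; only the derived `*.tts.speech.microsoft.com` hostname comes from the Azure docs):

    ```python
    def resolve_base_url(config: dict) -> str:
        """Pick the base URL: an explicit endpoint/baseUrl wins, else derive from region."""
        for key in ("endpoint", "baseUrl"):
            value = config.get(key)
            if value:
                return value.rstrip("/")
        region = config.get("region")
        if not region:
            raise ValueError("azure-speech requires endpoint, baseUrl, or region")
        return f"https://{region}.tts.speech.microsoft.com"
    ```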
</Accordion>
<Accordion title="Voice names">
    Use the Azure Speech voice `ShortName` value, for example
    `en-US-JennyNeural`. The bundled provider can list voices through the
    same Speech resource and filters out voices marked as deprecated or retired.
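    The filtering step might look like this sketch (hypothetical helper; `ShortName` and `Status` are fields of the Azure voices-list response):

    ```python
    def usable_voices(voices: list[dict]) -> list[str]:
        """Keep voice ShortNames whose Status is not Deprecated or Retired."""
        skip = {"deprecated", "retired"}
        return [
            v["ShortName"]
            for v in voices
            if str(v.get("Status", "")).lower() not in skip
        ]
    ```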
</Accordion>
<Accordion title="Audio outputs">
Azure accepts output formats such as `audio-24khz-48kbitrate-mono-mp3`,
`ogg-24khz-16bit-mono-opus`, and `riff-24khz-16bit-mono-pcm`. OpenClaw
requests Ogg/Opus for `voice-note` targets so channels can send native
voice bubbles without an extra MP3 conversion.
</Accordion>
<Accordion title="Alias">
`azure` is accepted as a provider alias for existing PRs and user config,
but new config should use `azure-speech` to avoid confusion with Azure
OpenAI model providers.
</Accordion>
</AccordionGroup>
## Related
<CardGroup cols={2}>
<Card title="Text-to-speech" href="/tools/tts" icon="waveform-lines">
TTS overview, providers, and `messages.tts` config.
</Card>
<Card title="Configuration" href="/gateway/configuration" icon="gear">
Full config reference including `messages.tts` settings.
</Card>
<Card title="Providers" href="/providers" icon="grid">
All bundled OpenClaw providers.
</Card>
<Card title="Troubleshooting" href="/help/troubleshooting" icon="wrench">
Common issues and debugging steps.
</Card>
</CardGroup>


@@ -31,6 +31,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
- [Amazon Bedrock Mantle](/providers/bedrock-mantle)
- [Anthropic (API + Claude CLI)](/providers/anthropic)
- [Arcee AI (Trinity models)](/providers/arcee)
- [Azure Speech](/providers/azure-speech)
- [BytePlus (International)](/concepts/model-providers#byteplus-international)
- [Chutes](/providers/chutes)
- [Cloudflare AI Gateway](/providers/cloudflare-ai-gateway)


@@ -7,11 +7,12 @@ read_when:
title: "Text-to-speech"
---
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.
## Supported services
- **Azure Speech** (primary or fallback provider; uses the Azure AI Speech REST API)
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Gradium** (primary or fallback provider; supports voice-note and telephony output)
@@ -40,8 +41,10 @@ or ElevenLabs.
## Optional keys
If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
If you want Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
- `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` (also accepts
`AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, and `SPEECH_REGION`)
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `GRADIUM_API_KEY`
@@ -67,6 +70,8 @@ so that provider must also be authenticated if you enable summaries.
- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
- [Azure Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech)
- [Azure Speech provider](/providers/azure-speech)
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
- [Gradium](/providers/gradium)
@@ -145,6 +150,36 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```
### Azure Speech primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "azure-speech",
providers: {
"azure-speech": {
// apiKey falls back to AZURE_SPEECH_KEY.
// region falls back to AZURE_SPEECH_REGION.
voice: "en-US-JennyNeural",
lang: "en-US",
outputFormat: "audio-24khz-48kbitrate-mono-mp3",
voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
},
},
},
},
}
```
Azure Speech uses a Speech resource key, not an Azure OpenAI key. The key
resolves as `messages.tts.providers.azure-speech.apiKey` ->
`AZURE_SPEECH_KEY` -> `AZURE_SPEECH_API_KEY` -> `SPEECH_KEY`, and the region
as `messages.tts.providers.azure-speech.region` -> `AZURE_SPEECH_REGION` ->
`SPEECH_REGION`. New config should use `azure-speech`; `azure` is accepted
as a provider alias.
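That resolution order can be sketched as a small helper (hypothetical; the env var names are the ones listed above):

```python
import os

def resolve_credentials(config: dict) -> tuple[str, str]:
    """Resolve (key, region): explicit config first, then env-var fallbacks in order."""
    key = config.get("apiKey") or next(
        (os.environ[n]
         for n in ("AZURE_SPEECH_KEY", "AZURE_SPEECH_API_KEY", "SPEECH_KEY")
         if os.environ.get(n)),
        None)
    region = config.get("region") or next(
        (os.environ[n]
         for n in ("AZURE_SPEECH_REGION", "SPEECH_REGION")
         if os.environ.get(n)),
        None)
    if not key or not region:
        raise ValueError("Azure Speech needs both a key and a region")
    return key, region
```
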
### Microsoft primary (no API key)
```json5
@@ -495,7 +530,21 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `apiKey` values fall back to env vars (`AZURE_SPEECH_KEY`/`AZURE_SPEECH_API_KEY`/`SPEECH_KEY`, `ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `providers.azure-speech.apiKey`: Azure Speech resource key (env:
`AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`).
- `providers.azure-speech.region`: Azure Speech region such as `eastus` (env:
`AZURE_SPEECH_REGION` or `SPEECH_REGION`).
- `providers.azure-speech.endpoint` / `providers.azure-speech.baseUrl`: optional
Azure Speech endpoint/base URL override.
- `providers.azure-speech.voice`: Azure voice ShortName (default
`en-US-JennyNeural`).
- `providers.azure-speech.lang`: SSML language code (default `en-US`).
- `providers.azure-speech.outputFormat`: Azure `X-Microsoft-OutputFormat` for
standard audio output (default `audio-24khz-48kbitrate-mono-mp3`).
- `providers.azure-speech.voiceNoteOutputFormat`: Azure
`X-Microsoft-OutputFormat` for voice-note output (default
`ogg-24khz-16bit-mono-opus`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`