feat(tts): add Azure Speech provider
Co-authored-by: Leon Chui <84605354+leonchui@users.noreply.github.com>
@@ -11,6 +11,14 @@
    "source": "OpenAI provider",
    "target": "OpenAI provider"
  },
  {
    "source": "Azure Speech",
    "target": "Azure Speech"
  },
  {
    "source": "Azure Speech provider",
    "target": "Azure Speech provider"
  },
  {
    "source": "Status",
    "target": "Status"
@@ -1301,6 +1301,7 @@
    "providers/bedrock-mantle",
    "providers/anthropic",
    "providers/arcee",
    "providers/azure-speech",
    "providers/chutes",
    "providers/claude-max-api-proxy",
    "providers/cloudflare-ai-gateway",
docs/providers/azure-speech.md (new file, 119 lines)
@@ -0,0 +1,119 @@
---
summary: "Azure AI Speech text-to-speech for OpenClaw replies"
read_when:
  - You want Azure Speech synthesis for outbound replies
  - You need native Ogg Opus voice-note output from Azure Speech
title: "Azure Speech"
---

Azure Speech is an Azure AI Speech text-to-speech provider. In OpenClaw it
synthesizes outbound reply audio as MP3 by default, native Ogg/Opus for voice
notes, and 8 kHz mulaw audio for telephony channels such as Voice Call.

OpenClaw calls the Azure Speech REST API directly with SSML and sends the
configured output format through the `X-Microsoft-OutputFormat` header.
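A minimal sketch of that request (illustrative only: the endpoint, headers, and SSML wrapper follow the Azure Speech REST text-to-speech docs, while the helper name and options object here are hypothetical, not OpenClaw internals):

```ts
// Hypothetical helper; endpoint, headers, and SSML shape per Azure Speech REST docs.
async function synthesize(
  text: string,
  opts: { key: string; region: string; voice: string; lang: string; outputFormat: string },
): Promise<ArrayBuffer> {
  const endpoint = `https://${opts.region}.tts.speech.microsoft.com/cognitiveservices/v1`;
  const ssml =
    `<speak version="1.0" xml:lang="${opts.lang}">` +
    `<voice name="${opts.voice}">${text}</voice></speak>`;
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": opts.key, // Speech resource key, not an Azure OpenAI key
      "Content-Type": "application/ssml+xml",
      "X-Microsoft-OutputFormat": opts.outputFormat, // e.g. audio-24khz-48kbitrate-mono-mp3
    },
    body: ssml,
  });
  if (!res.ok) throw new Error(`Azure Speech TTS failed: ${res.status}`);
  return res.arrayBuffer();
}
```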
| Detail                  | Value                                                                                                           |
| ----------------------- | --------------------------------------------------------------------------------------------------------------- |
| Website                 | [Azure AI Speech](https://azure.microsoft.com/products/ai-services/ai-speech)                                    |
| Docs                    | [Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech)   |
| Auth                    | `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION`                                                                     |
| Default voice           | `en-US-JennyNeural`                                                                                               |
| Default file output     | `audio-24khz-48kbitrate-mono-mp3`                                                                                 |
| Default voice-note file | `ogg-24khz-16bit-mono-opus`                                                                                       |

## Getting started

<Steps>
  <Step title="Create an Azure Speech resource">
    In the Azure portal, create a Speech resource. Copy **KEY 1** from
    Resource Management > Keys and Endpoint, and copy the resource location
    such as `eastus`.

    ```
    AZURE_SPEECH_KEY=<speech-resource-key>
    AZURE_SPEECH_REGION=eastus
    ```
  </Step>
  <Step title="Select Azure Speech in messages.tts">
    ```json5
    {
      messages: {
        tts: {
          auto: "always",
          provider: "azure-speech",
          providers: {
            "azure-speech": {
              voice: "en-US-JennyNeural",
              lang: "en-US",
            },
          },
        },
      },
    }
    ```
  </Step>
  <Step title="Send a message">
    Send a reply through any connected channel. OpenClaw synthesizes the audio
    with Azure Speech and delivers MP3 for standard audio, or Ogg/Opus when
    the channel expects a voice note (see the sketch after these steps).
  </Step>
</Steps>

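The delivery behavior above reduces to a per-target format choice; a hypothetical sketch (the `target` variable and `opts` object are illustrative, the defaults mirror the table below):

```ts
// Illustrative only: pick the Azure output format per delivery target.
const format =
  target === "voice-note"
    ? opts.voiceNoteOutputFormat ?? "ogg-24khz-16bit-mono-opus"
    : opts.outputFormat ?? "audio-24khz-48kbitrate-mono-mp3";
```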
## Configuration options

| Option                  | Path                                                         | Description                                                                                            |
| ----------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------ |
| `apiKey`                | `messages.tts.providers.azure-speech.apiKey`                 | Azure Speech resource key. Falls back to `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`.   |
| `region`                | `messages.tts.providers.azure-speech.region`                 | Azure Speech resource region. Falls back to `AZURE_SPEECH_REGION` or `SPEECH_REGION`.                   |
| `endpoint`              | `messages.tts.providers.azure-speech.endpoint`               | Optional Azure Speech endpoint/base URL override.                                                       |
| `baseUrl`               | `messages.tts.providers.azure-speech.baseUrl`                | Optional Azure Speech base URL override.                                                                |
| `voice`                 | `messages.tts.providers.azure-speech.voice`                  | Azure voice ShortName (default `en-US-JennyNeural`).                                                    |
| `lang`                  | `messages.tts.providers.azure-speech.lang`                   | SSML language code (default `en-US`).                                                                   |
| `outputFormat`          | `messages.tts.providers.azure-speech.outputFormat`           | Audio-file output format (default `audio-24khz-48kbitrate-mono-mp3`).                                   |
| `voiceNoteOutputFormat` | `messages.tts.providers.azure-speech.voiceNoteOutputFormat`  | Voice-note output format (default `ogg-24khz-16bit-mono-opus`).                                         |

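The same options as a hypothetical TypeScript shape (field names mirror the config paths above; this is not an exported OpenClaw type):

```ts
// Hypothetical shape of messages.tts.providers["azure-speech"].
interface AzureSpeechTtsOptions {
  apiKey?: string;                // falls back to AZURE_SPEECH_KEY / AZURE_SPEECH_API_KEY / SPEECH_KEY
  region?: string;                // falls back to AZURE_SPEECH_REGION / SPEECH_REGION
  endpoint?: string;              // optional endpoint override
  baseUrl?: string;               // optional base URL override
  voice?: string;                 // default "en-US-JennyNeural"
  lang?: string;                  // default "en-US"
  outputFormat?: string;          // default "audio-24khz-48kbitrate-mono-mp3"
  voiceNoteOutputFormat?: string; // default "ogg-24khz-16bit-mono-opus"
}
```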
## Notes

<AccordionGroup>
  <Accordion title="Authentication">
    Azure Speech uses a Speech resource key, not an Azure OpenAI key. The key
    is sent as `Ocp-Apim-Subscription-Key`; OpenClaw derives
    `https://<region>.tts.speech.microsoft.com` from `region` unless you
    provide `endpoint` or `baseUrl`.
  </Accordion>
  <Accordion title="Voice names">
    Use the Azure Speech voice `ShortName` value, for example
    `en-US-JennyNeural`. The bundled provider can list voices through the
    same Speech resource and filters out voices marked deprecated or retired
    (see the sketch after these notes).
  </Accordion>
  <Accordion title="Audio outputs">
    Azure accepts output formats such as `audio-24khz-48kbitrate-mono-mp3`,
    `ogg-24khz-16bit-mono-opus`, and `riff-24khz-16bit-mono-pcm`. OpenClaw
    requests Ogg/Opus for `voice-note` targets so channels can send native
    voice bubbles without an extra MP3 conversion.
  </Accordion>
  <Accordion title="Alias">
    `azure` is accepted as a provider alias for existing PRs and user config,
    but new config should use `azure-speech` to avoid confusion with Azure
    OpenAI model providers.
  </Accordion>
</AccordionGroup>

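For voice discovery, the same Speech resource exposes a voices list endpoint. A minimal sketch (the endpoint and header follow the Azure Speech REST docs; the helper and the exact `Status` filtering are illustrative, not OpenClaw's implementation):

```ts
// Sketch: list voice ShortNames for a region, skipping deprecated/retired voices.
async function listVoices(key: string, region: string): Promise<string[]> {
  const res = await fetch(
    `https://${region}.tts.speech.microsoft.com/cognitiveservices/voices/list`,
    { headers: { "Ocp-Apim-Subscription-Key": key } },
  );
  if (!res.ok) throw new Error(`voices/list failed: ${res.status}`);
  const voices: Array<{ ShortName: string; Status?: string }> = await res.json();
  return voices
    .filter((v) => v.Status !== "Deprecated" && v.Status !== "Retired")
    .map((v) => v.ShortName);
}
```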
## Related

<CardGroup cols={2}>
  <Card title="Text-to-speech" href="/tools/tts" icon="waveform-lines">
    TTS overview, providers, and `messages.tts` config.
  </Card>
  <Card title="Configuration" href="/gateway/configuration" icon="gear">
    Full config reference including `messages.tts` settings.
  </Card>
  <Card title="Providers" href="/providers" icon="grid">
    All bundled OpenClaw providers.
  </Card>
  <Card title="Troubleshooting" href="/help/troubleshooting" icon="wrench">
    Common issues and debugging steps.
  </Card>
</CardGroup>
@@ -31,6 +31,7 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/Mattermost (plugi
- [Amazon Bedrock Mantle](/providers/bedrock-mantle)
- [Anthropic (API + Claude CLI)](/providers/anthropic)
- [Arcee AI (Trinity models)](/providers/arcee)
- [Azure Speech](/providers/azure-speech)
- [BytePlus (International)](/concepts/model-providers#byteplus-international)
- [Chutes](/providers/chutes)
- [Cloudflare AI Gateway](/providers/cloudflare-ai-gateway)
@@ -7,11 +7,12 @@ read_when:
title: "Text-to-speech"
---

OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.

## Supported services

- **Azure Speech** (primary or fallback provider; uses the Azure AI Speech REST API)
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Gradium** (primary or fallback provider; supports voice-note and telephony output)
@@ -40,8 +41,10 @@ or ElevenLabs.
## Optional keys

If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
If you want Azure Speech, ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:

- `AZURE_SPEECH_KEY` plus `AZURE_SPEECH_REGION` (also accepts
  `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, and `SPEECH_REGION`)
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `GRADIUM_API_KEY`
@@ -67,6 +70,8 @@ so that provider must also be authenticated if you enable summaries.
- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
- [Azure Speech REST text-to-speech](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech)
- [Azure Speech provider](/providers/azure-speech)
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
- [Gradium](/providers/gradium)
@@ -145,6 +150,36 @@ Full schema is in [Gateway configuration](/gateway/configuration).
}
```

### Azure Speech primary

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "azure-speech",
      providers: {
        "azure-speech": {
          // apiKey falls back to AZURE_SPEECH_KEY.
          // region falls back to AZURE_SPEECH_REGION.
          voice: "en-US-JennyNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          voiceNoteOutputFormat: "ogg-24khz-16bit-mono-opus",
        },
      },
    },
  },
}
```

Azure Speech uses a Speech resource key, not an Azure OpenAI key. Resolution
order is `messages.tts.providers.azure-speech.apiKey` ->
`AZURE_SPEECH_KEY` -> `AZURE_SPEECH_API_KEY` -> `SPEECH_KEY`, plus
`messages.tts.providers.azure-speech.region` -> `AZURE_SPEECH_REGION` ->
`SPEECH_REGION` for the region. New config should use `azure-speech`; `azure`
is accepted as a provider alias.

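The same fallback order as a hypothetical snippet (`cfg` stands for `messages.tts.providers["azure-speech"]` and `env` for the process environment; not actual OpenClaw code):

```ts
// Sketch of the resolution order described above.
const apiKey =
  cfg.apiKey ?? env.AZURE_SPEECH_KEY ?? env.AZURE_SPEECH_API_KEY ?? env.SPEECH_KEY;
const region = cfg.region ?? env.AZURE_SPEECH_REGION ?? env.SPEECH_REGION;
// Endpoint falls back to the regional host unless endpoint/baseUrl is set.
const baseUrl =
  cfg.endpoint ?? cfg.baseUrl ?? `https://${region}.tts.speech.microsoft.com`;
```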
### Microsoft primary (no API key)

```json5
@@ -495,7 +530,21 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `apiKey` values fall back to env vars (`AZURE_SPEECH_KEY`/`AZURE_SPEECH_API_KEY`/`SPEECH_KEY`, `ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `providers.azure-speech.apiKey`: Azure Speech resource key (env:
  `AZURE_SPEECH_KEY`, `AZURE_SPEECH_API_KEY`, or `SPEECH_KEY`).
- `providers.azure-speech.region`: Azure Speech region such as `eastus` (env:
  `AZURE_SPEECH_REGION` or `SPEECH_REGION`).
- `providers.azure-speech.endpoint` / `providers.azure-speech.baseUrl`: optional
  Azure Speech endpoint/base URL override.
- `providers.azure-speech.voice`: Azure voice ShortName (default
  `en-US-JennyNeural`).
- `providers.azure-speech.lang`: SSML language code (default `en-US`).
- `providers.azure-speech.outputFormat`: Azure `X-Microsoft-OutputFormat` for
  standard audio output (default `audio-24khz-48kbitrate-mono-mp3`).
- `providers.azure-speech.voiceNoteOutputFormat`: Azure
  `X-Microsoft-OutputFormat` for voice-note output (default
  `ogg-24khz-16bit-mono-opus`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`