feat(tts): add Inworld speech provider (#55972)

Adds the bundled Inworld speech provider with docs, config surface, SSRF-guarded fetches, directive overrides, native voice-note/telephony output coverage, and live `.profile` verification.

Co-authored-by: cshape <cshape@users.noreply.github.com>
Cale Shapera
2026-04-25 14:33:21 -07:00
committed by GitHub
parent 167588cb4f
commit 0bcb4c95c1
23 changed files with 1295 additions and 16 deletions


@@ -1,4 +1,4 @@
-9ac3d271f9bfa9611557f0b52e4d0a600693bdd1de75cc1bafc320fc4d4f0075 config-baseline.json
+0b0d796bceddfb9e2929518ba84af626da7f5d75c392a217041f36e850c4e74f config-baseline.json
 271fdf1d6652927e0fc160a6f25276bf6dccb8f1b27fab15e0fc2620e8cacab4 config-baseline.core.json
 7cd9c908f066c143eab2a201efbc9640f483ab28bba92ddeca1d18cc2b528bc3 config-baseline.channel.json
-7825b56a5b3fcdbe2e09ef8fe5d9f12ac3598435afebe20413051e45b0d1968e config-baseline.plugin.json
+17eb3f8887193579ff32e35f9bd520ba2bd6049e52ab18855c5d41fcbf195d83 config-baseline.plugin.json


@@ -1317,6 +1317,7 @@
"providers/groq",
"providers/huggingface",
"providers/inferrs",
"providers/inworld",
"providers/kilocode",
"providers/litellm",
"providers/lmstudio",

docs/providers/inworld.md Normal file

@@ -0,0 +1,115 @@
---
summary: "Inworld streaming text-to-speech for OpenClaw replies"
read_when:
- You want Inworld speech synthesis for outbound replies
- You need PCM telephony or OGG_OPUS voice-note output from Inworld
title: "Inworld"
---
Inworld is a streaming text-to-speech (TTS) provider. In OpenClaw it
synthesizes outbound reply audio (MP3 by default, OGG_OPUS for voice notes)
and PCM audio for telephony channels such as Voice Call.
OpenClaw posts to Inworld's streaming TTS endpoint, concatenates the
returned base64 audio chunks into a single buffer, and hands the result to
the standard reply-audio pipeline.
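The chunk handling described above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the `decodeAudioChunks` helper is hypothetical, and it assumes the base64 payloads have already been extracted from the streamed response in arrival order. Each chunk is decoded on its own before concatenation, because joining the base64 strings first can yield wrong bytes when a chunk ends with `=` padding.

```typescript
// Sketch only: `decodeAudioChunks` is a hypothetical helper, not OpenClaw's API.
// Input: base64 audio payloads in the order they arrived from the stream.
function decodeAudioChunks(base64Chunks: readonly string[]): Buffer {
  // Decode each chunk separately so per-chunk "=" padding is handled correctly.
  const decoded = base64Chunks.map((chunk) => Buffer.from(chunk, "base64"));
  // Join the decoded bytes into one contiguous buffer for the audio pipeline.
  return Buffer.concat(decoded);
}

// Usage: two chunks that decode to "hello " and "world".
const audio = decodeAudioChunks(["aGVsbG8g", "d29ybGQ="]);
console.log(audio.length); // 11
```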
| Detail | Value |
| ------------- | ----------------------------------------------------------- |
| Website | [inworld.ai](https://inworld.ai) |
| Docs | [docs.inworld.ai/tts/tts](https://docs.inworld.ai/tts/tts) |
| Auth | `INWORLD_API_KEY` (HTTP Basic, Base64 dashboard credential) |
| Default voice | `Sarah` |
| Default model | `inworld-tts-1.5-max` |
## Getting started
<Steps>
<Step title="Set your API key">
Copy the credential from your Inworld dashboard (Workspace > API Keys)
and set it as an env var. The value is sent verbatim as the HTTP Basic
credential, so do not Base64-encode it again or convert it to a bearer
token.
```
INWORLD_API_KEY=<base64-credential-from-dashboard>
```
</Step>
<Step title="Select Inworld in messages.tts">
```json5
{
messages: {
tts: {
auto: "always",
provider: "inworld",
providers: {
inworld: {
voiceId: "Sarah",
modelId: "inworld-tts-1.5-max",
},
},
},
},
}
```
</Step>
<Step title="Send a message">
Send a message through any connected channel. OpenClaw synthesizes the
reply audio with Inworld and delivers it as MP3 (or OGG_OPUS when the channel
expects a voice note).
</Step>
</Steps>
## Configuration options
| Option | Path | Description |
| ------------- | -------------------------------------------- | ----------------------------------------------------------------- |
| `apiKey` | `messages.tts.providers.inworld.apiKey` | Base64 dashboard credential. Falls back to `INWORLD_API_KEY`. |
| `baseUrl` | `messages.tts.providers.inworld.baseUrl` | Override Inworld API base URL (default `https://api.inworld.ai`). |
| `voiceId` | `messages.tts.providers.inworld.voiceId` | Voice identifier (default `Sarah`). |
| `modelId` | `messages.tts.providers.inworld.modelId` | TTS model id (default `inworld-tts-1.5-max`). |
| `temperature` | `messages.tts.providers.inworld.temperature` | Sampling temperature `0..2` (optional). |
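As a rough illustration of how the options above combine with their documented defaults, consider the sketch below. The `InworldTtsOptions` type and `applyInworldDefaults` function are hypothetical names, not OpenClaw's real types. The sketch also assumes the `0..2` temperature range is inclusive and that out-of-range values are rejected; the actual provider may clamp or ignore them instead.

```typescript
// Hypothetical shape mirroring the documented options; not OpenClaw's real types.
interface InworldTtsOptions {
  apiKey?: string;
  baseUrl?: string;
  voiceId?: string;
  modelId?: string;
  temperature?: number;
}

// Defaults as listed in the options table.
const INWORLD_DEFAULTS = {
  baseUrl: "https://api.inworld.ai",
  voiceId: "Sarah",
  modelId: "inworld-tts-1.5-max",
} as const;

function applyInworldDefaults(options: InworldTtsOptions) {
  const { temperature } = options;
  // Assumption: the 0..2 range is inclusive and invalid values (including NaN) throw.
  if (temperature !== undefined && !(temperature >= 0 && temperature <= 2)) {
    throw new RangeError(`temperature must be within 0..2, got ${temperature}`);
  }
  return {
    apiKey: options.apiKey,
    baseUrl: options.baseUrl ?? INWORLD_DEFAULTS.baseUrl,
    voiceId: options.voiceId ?? INWORLD_DEFAULTS.voiceId,
    modelId: options.modelId ?? INWORLD_DEFAULTS.modelId,
    temperature, // stays undefined when not configured
  };
}

// Usage: only the model is overridden; everything else falls back to defaults.
const resolved = applyInworldDefaults({ modelId: "inworld-tts-1.5-mini" });
console.log(resolved.voiceId); // "Sarah"
```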
## Notes
<AccordionGroup>
<Accordion title="Authentication">
Inworld uses HTTP Basic auth with a single Base64-encoded credential
string. Copy it verbatim from the Inworld dashboard. The provider sends
it as `Authorization: Basic <apiKey>` without any further encoding, so
do not Base64-encode it yourself and do not pass a bearer-style token.
See [TTS auth notes](/tools/tts#inworld-primary) for the same callout.
</Accordion>
<Accordion title="Models">
Supported model ids: `inworld-tts-1.5-max` (default),
`inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.
</Accordion>
<Accordion title="Audio outputs">
Replies use MP3 by default. When the channel target is `voice-note`,
OpenClaw asks Inworld for `OGG_OPUS` so the audio plays as a native
voice bubble. Telephony synthesis uses raw `PCM` at 22050 Hz to feed
the telephony bridge.
</Accordion>
<Accordion title="Custom endpoints">
Override the API host with `messages.tts.providers.inworld.baseUrl`.
Trailing slashes are stripped before requests are sent.
</Accordion>
</AccordionGroup>
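The notes above can be summarized in a short sketch. All names here (`TtsTarget`, `pickAudioFormat`, `normalizeBaseUrl`, `buildAuthHeaders`) are hypothetical and chosen for illustration. Only the `voice-note` target id, the format names, the 22050 Hz rate, and the `Authorization: Basic <apiKey>` header come from the notes; the `audio` and `telephony` target ids are assumptions.

```typescript
// Sketch only: names below are illustrative, not OpenClaw's or Inworld's API.
// Assumption: "audio" and "telephony" are stand-in target ids; "voice-note" is documented.
type TtsTarget = "audio" | "voice-note" | "telephony";

interface AudioFormat {
  encoding: "MP3" | "OGG_OPUS" | "PCM";
  sampleRateHz?: number;
}

// Map the delivery target to the audio format described in the notes.
function pickAudioFormat(target: TtsTarget): AudioFormat {
  switch (target) {
    case "voice-note":
      return { encoding: "OGG_OPUS" };
    case "telephony":
      return { encoding: "PCM", sampleRateHz: 22050 };
    default:
      return { encoding: "MP3" };
  }
}

// Strip any trailing slashes from a custom base URL.
function normalizeBaseUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "");
}

// The dashboard credential is already Base64, so it is passed through verbatim.
function buildAuthHeaders(apiKey: string): Record<string, string> {
  return { Authorization: `Basic ${apiKey}` };
}

console.log(normalizeBaseUrl("https://api.inworld.ai/")); // "https://api.inworld.ai"
```

Keeping the credential untouched in `buildAuthHeaders` mirrors the authentication note: re-encoding it would produce a header that Inworld rejects.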
## Related
<CardGroup cols={2}>
<Card title="Text-to-speech" href="/tools/tts" icon="waveform-lines">
TTS overview, providers, and `messages.tts` config.
</Card>
<Card title="Configuration" href="/gateway/configuration" icon="gear">
Full config reference including `messages.tts` settings.
</Card>
<Card title="Providers" href="/providers" icon="grid">
All bundled OpenClaw providers.
</Card>
<Card title="Troubleshooting" href="/help/troubleshooting" icon="wrench">
Common issues and debugging steps.
</Card>
</CardGroup>


@@ -7,7 +7,7 @@ read_when:
title: "Text-to-speech"
---
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.
## Supported services
@@ -15,6 +15,7 @@ It works anywhere OpenClaw can send audio.
- **ElevenLabs** (primary or fallback provider)
- **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
- **Gradium** (primary or fallback provider; supports voice-note and telephony output)
- **Inworld** (primary or fallback provider; uses the Inworld streaming TTS API)
- **Local CLI** (primary or fallback provider; runs a configured local TTS command)
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
@@ -38,11 +39,12 @@ or ElevenLabs.
## Optional keys
If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, xAI, or Xiaomi MiMo:
If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
- `GRADIUM_API_KEY`
- `INWORLD_API_KEY`
- `MINIMAX_API_KEY`; MiniMax TTS also accepts Token Plan auth via
`MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or
`MINIMAX_CODING_API_KEY`
@@ -64,6 +66,7 @@ so that provider must also be authenticated if you enable summaries.
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
- [Gradium](/providers/gradium)
- [Inworld TTS API](https://docs.inworld.ai/tts/tts)
- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
@@ -217,6 +220,35 @@ by the bundled Google image-generation provider. Resolution order is
`messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
`GEMINI_API_KEY` -> `GOOGLE_API_KEY`.
### Inworld primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "inworld",
providers: {
inworld: {
apiKey: "inworld_api_key",
baseUrl: "https://api.inworld.ai",
voiceId: "Sarah",
modelId: "inworld-tts-1.5-max",
temperature: 0.8,
},
},
},
},
}
```
The `apiKey` value must be the Base64-encoded credential string copied
verbatim from the Inworld dashboard (Workspace > API Keys). The provider
sends it as `Authorization: Basic <apiKey>` without any additional
encoding, so do not pass a raw bearer token and do not Base64-encode it
yourself. The key falls back to the `INWORLD_API_KEY` env var. See
[Inworld provider](/providers/inworld) for full setup.
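The fallback described above can be pictured with a small sketch. `resolveInworldApiKey` is a hypothetical helper, not part of OpenClaw; it takes the environment as an explicit argument so the lookup is easy to test, and it assumes an empty string counts as "not set".

```typescript
// Hypothetical helper illustrating the documented order: config value first,
// then the INWORLD_API_KEY environment variable.
function resolveInworldApiKey(
  configuredKey: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  // Assumption: an empty string is treated the same as a missing value.
  if (configuredKey) return configuredKey;
  return env["INWORLD_API_KEY"] || undefined;
}

// Usage: no key in config, so the environment value is used unchanged.
const key = resolveInworldApiKey(undefined, { INWORLD_API_KEY: "ZGVtbzpkZW1v" });
console.log(key); // "ZGVtbzpkZW1v"
```

In a Node process the second argument would typically be `process.env`. The key is returned without trimming or re-encoding, in line with the "copied verbatim" guidance.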
### xAI primary
```json5
@@ -415,7 +447,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
rewritten to `provider: "microsoft"`.
@@ -429,7 +461,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -453,6 +485,10 @@ Then run:
- `providers.tts-local-cli.timeoutMs`: command timeout in milliseconds (default `120000`).
- `providers.tts-local-cli.cwd`: optional command working directory.
- `providers.tts-local-cli.env`: optional string environment overrides for the command.
- `providers.inworld.baseUrl`: override Inworld API base URL (default `https://api.inworld.ai`).
- `providers.inworld.voiceId`: Inworld voice identifier (default `Sarah`).
- `providers.inworld.modelId`: Inworld TTS model (default `inworld-tts-1.5-max`; also supports `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`).
- `providers.inworld.temperature`: sampling temperature `0..2` (optional).
- `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
- `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
- `providers.google.audioProfile`: natural-language style prompt prepended before the spoken text.
@@ -586,6 +622,7 @@ These override `messages.tts.*` for that host.
with `ffmpeg`.
- **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
- **Inworld**: MP3 for normal audio attachments, native `OGG_OPUS` for voice-note targets, and raw `PCM` at 22050 Hz for Talk/telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
- The bundled transport accepts an `outputFormat`, but not all formats are available from the service.