Mirror of https://github.com/openclaw/openclaw.git (synced 2026-05-06 16:01:01 +00:00)
feat(tts): add Inworld speech provider (#55972)
Adds the bundled Inworld speech provider with docs, config surface, SSRF-guarded fetches, directive overrides, native voice-note/telephony output coverage, and live `.profile` verification.

Co-authored-by: cshape <cshape@users.noreply.github.com>
@@ -1,4 +1,4 @@
-9ac3d271f9bfa9611557f0b52e4d0a600693bdd1de75cc1bafc320fc4d4f0075 config-baseline.json
+0b0d796bceddfb9e2929518ba84af626da7f5d75c392a217041f36e850c4e74f config-baseline.json
 271fdf1d6652927e0fc160a6f25276bf6dccb8f1b27fab15e0fc2620e8cacab4 config-baseline.core.json
 7cd9c908f066c143eab2a201efbc9640f483ab28bba92ddeca1d18cc2b528bc3 config-baseline.channel.json
-7825b56a5b3fcdbe2e09ef8fe5d9f12ac3598435afebe20413051e45b0d1968e config-baseline.plugin.json
+17eb3f8887193579ff32e35f9bd520ba2bd6049e52ab18855c5d41fcbf195d83 config-baseline.plugin.json
@@ -1317,6 +1317,7 @@
 "providers/groq",
 "providers/huggingface",
 "providers/inferrs",
+"providers/inworld",
 "providers/kilocode",
 "providers/litellm",
 "providers/lmstudio",
115
docs/providers/inworld.md
Normal file
@@ -0,0 +1,115 @@
+---
+summary: "Inworld streaming text-to-speech for OpenClaw replies"
+read_when:
+  - You want Inworld speech synthesis for outbound replies
+  - You need PCM telephony or OGG_OPUS voice-note output from Inworld
+title: "Inworld"
+---
+
+Inworld is a streaming text-to-speech (TTS) provider. In OpenClaw it
+synthesizes outbound reply audio (MP3 by default, OGG_OPUS for voice notes)
+and PCM audio for telephony channels such as Voice Call.
+
+OpenClaw posts to Inworld's streaming TTS endpoint, concatenates the
+returned base64 audio chunks into a single buffer, and hands the result to
+the standard reply-audio pipeline.
+
+| Detail | Value |
+| ------------- | ----------------------------------------------------------- |
+| Website | [inworld.ai](https://inworld.ai) |
+| Docs | [docs.inworld.ai/tts/tts](https://docs.inworld.ai/tts/tts) |
+| Auth | `INWORLD_API_KEY` (HTTP Basic, Base64 dashboard credential) |
+| Default voice | `Sarah` |
+| Default model | `inworld-tts-1.5-max` |
+
+## Getting started
+
+<Steps>
+<Step title="Set your API key">
+Copy the credential from your Inworld dashboard (Workspace > API Keys)
+and set it as an env var. The value is sent verbatim as the HTTP Basic
+credential, so do not Base64-encode it again or convert it to a bearer
+token.
+
+```
+INWORLD_API_KEY=<base64-credential-from-dashboard>
+```
+
+</Step>
+<Step title="Select Inworld in messages.tts">
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "inworld",
+      providers: {
+        inworld: {
+          voiceId: "Sarah",
+          modelId: "inworld-tts-1.5-max",
+        },
+      },
+    },
+  },
+}
+```
+</Step>
+<Step title="Send a message">
+Send a reply through any connected channel. OpenClaw synthesizes the
+audio with Inworld and delivers it as MP3 (or OGG_OPUS when the channel
+expects a voice note).
+</Step>
+</Steps>
+
+## Configuration options
+
+| Option | Path | Description |
+| ------------- | -------------------------------------------- | ----------------------------------------------------------------- |
+| `apiKey` | `messages.tts.providers.inworld.apiKey` | Base64 dashboard credential. Falls back to `INWORLD_API_KEY`. |
+| `baseUrl` | `messages.tts.providers.inworld.baseUrl` | Override Inworld API base URL (default `https://api.inworld.ai`). |
+| `voiceId` | `messages.tts.providers.inworld.voiceId` | Voice identifier (default `Sarah`). |
+| `modelId` | `messages.tts.providers.inworld.modelId` | TTS model id (default `inworld-tts-1.5-max`). |
+| `temperature` | `messages.tts.providers.inworld.temperature` | Sampling temperature `0..2` (optional). |
+
+## Notes
+
+<AccordionGroup>
+<Accordion title="Authentication">
+Inworld uses HTTP Basic auth with a single Base64-encoded credential
+string. Copy it verbatim from the Inworld dashboard. The provider sends
+it as `Authorization: Basic <apiKey>` without any further encoding, so
+do not Base64-encode it yourself and do not pass a bearer-style token.
+See [TTS auth notes](/tools/tts#inworld-primary) for the same callout.
+</Accordion>
+<Accordion title="Models">
+Supported model ids: `inworld-tts-1.5-max` (default),
+`inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`.
+</Accordion>
+<Accordion title="Audio outputs">
+Replies use MP3 by default. When the channel target is `voice-note`
+OpenClaw asks Inworld for `OGG_OPUS` so the audio plays as a native
+voice bubble. Telephony synthesis uses raw `PCM` at 22050 Hz to feed
+the telephony bridge.
+</Accordion>
+<Accordion title="Custom endpoints">
+Override the API host with `messages.tts.providers.inworld.baseUrl`.
+Trailing slashes are stripped before requests are sent.
+</Accordion>
+</AccordionGroup>
+
+## Related
+
+<CardGroup cols={2}>
+<Card title="Text-to-speech" href="/tools/tts" icon="waveform-lines">
+TTS overview, providers, and `messages.tts` config.
+</Card>
+<Card title="Configuration" href="/gateway/configuration" icon="gear">
+Full config reference including `messages.tts` settings.
+</Card>
+<Card title="Providers" href="/providers" icon="grid">
+All bundled OpenClaw providers.
+</Card>
+<Card title="Troubleshooting" href="/help/troubleshooting" icon="wrench">
+Common issues and debugging steps.
+</Card>
+</CardGroup>
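Illustrative sketch, not part of this commit: the page added above says OpenClaw concatenates the returned base64 audio chunks into a single buffer and requests MP3, `OGG_OPUS`, or 22050 Hz `PCM` depending on the target. A minimal TypeScript version of those two steps might look like the following. The names `OutputTarget`, `selectAudioFormat`, and `concatAudioChunks`, and the `audioEncoding` / `sampleRateHertz` field names, are assumptions made for illustration; they are not the provider's actual code and not a confirmed Inworld request schema.

```typescript
// Hypothetical types and helpers; names are illustrative assumptions only.
type OutputTarget = "audio-file" | "voice-note" | "telephony";

interface AudioRequestFormat {
  audioEncoding: "MP3" | "OGG_OPUS" | "PCM";
  sampleRateHertz?: number;
}

// Map the delivery target to the format the docs describe:
// MP3 by default, OGG_OPUS for voice notes, raw PCM at 22050 Hz for telephony.
function selectAudioFormat(target: OutputTarget): AudioRequestFormat {
  switch (target) {
    case "voice-note":
      return { audioEncoding: "OGG_OPUS" };
    case "telephony":
      return { audioEncoding: "PCM", sampleRateHertz: 22050 };
    default:
      return { audioEncoding: "MP3" };
  }
}

// Decode one base64 chunk into raw bytes.
function decodeBase64(chunk: string): Uint8Array {
  const binary = atob(chunk);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Decode every streamed chunk separately, then join the bytes in arrival order.
function concatAudioChunks(base64Chunks: string[]): Uint8Array {
  const parts = base64Chunks.map(decodeBase64);
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}
```

Decoding each chunk before joining, rather than joining the base64 strings first, keeps the result intact when an individual chunk carries its own `=` padding.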
@@ -7,7 +7,7 @@ read_when:
 title: "Text-to-speech"
 ---

-OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
 It works anywhere OpenClaw can send audio.

 ## Supported services
@@ -15,6 +15,7 @@ It works anywhere OpenClaw can send audio.
 - **ElevenLabs** (primary or fallback provider)
 - **Google Gemini** (primary or fallback provider; uses Gemini API TTS)
 - **Gradium** (primary or fallback provider; supports voice-note and telephony output)
+- **Inworld** (primary or fallback provider; uses the Inworld streaming TTS API)
 - **Local CLI** (primary or fallback provider; runs a configured local TTS command)
 - **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
 - **MiniMax** (primary or fallback provider; uses the T2A v2 API)
@@ -38,11 +39,12 @@ or ElevenLabs.

 ## Optional keys

-If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, xAI, or Xiaomi MiMo:
+If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo:

 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
 - `GRADIUM_API_KEY`
+- `INWORLD_API_KEY`
 - `MINIMAX_API_KEY`; MiniMax TTS also accepts Token Plan auth via
 `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or
 `MINIMAX_CODING_API_KEY`
@@ -64,6 +66,7 @@ so that provider must also be authenticated if you enable summaries.
 - [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
 - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
 - [Gradium](/providers/gradium)
+- [Inworld TTS API](https://docs.inworld.ai/tts/tts)
 - [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
 - [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
 - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
@@ -217,6 +220,35 @@ by the bundled Google image-generation provider. Resolution order is
 `messages.tts.providers.google.apiKey` -> `models.providers.google.apiKey` ->
 `GEMINI_API_KEY` -> `GOOGLE_API_KEY`.

+### Inworld primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "inworld",
+      providers: {
+        inworld: {
+          apiKey: "inworld_api_key",
+          baseUrl: "https://api.inworld.ai",
+          voiceId: "Sarah",
+          modelId: "inworld-tts-1.5-max",
+          temperature: 0.8,
+        },
+      },
+    },
+  },
+}
+```
+
+The `apiKey` value must be the Base64-encoded credential string copied
+verbatim from the Inworld dashboard (Workspace > API Keys). The provider
+sends it as `Authorization: Basic <apiKey>` without any additional
+encoding, so do not pass a raw bearer token and do not Base64-encode it
+yourself. The key falls back to the `INWORLD_API_KEY` env var. See
+[Inworld provider](/providers/inworld) for full setup.
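
Illustrative sketch, not part of this commit: the credential handling described in the added paragraph (config `apiKey` first, then `INWORLD_API_KEY`, sent verbatim after `Basic `) can be pictured roughly as below, together with the documented defaults and the trailing-slash stripping mentioned on the provider page. `resolveInworldRequest` and its types are hypothetical names, and rejecting out-of-range `temperature` values is an assumption based on the documented `0..2` range, not confirmed provider behaviour.

```typescript
// Hypothetical helper; names and shapes are illustrative, not OpenClaw's real API.
interface InworldTtsConfig {
  apiKey?: string;
  baseUrl?: string;
  voiceId?: string;
  modelId?: string;
  temperature?: number;
}

interface ResolvedInworldRequest {
  baseUrl: string;
  headers: { Authorization: string };
  voiceId: string;
  modelId: string;
  temperature: number | undefined;
}

function resolveInworldRequest(
  config: InworldTtsConfig,
  env: Record<string, string | undefined>,
): ResolvedInworldRequest {
  // Config value wins; otherwise fall back to the INWORLD_API_KEY env var.
  const apiKey = config.apiKey ?? env["INWORLD_API_KEY"];
  if (!apiKey) {
    throw new Error("Inworld API key is not configured");
  }

  // Assumption: values outside the documented 0..2 range are rejected up front.
  const temperature = config.temperature;
  if (temperature !== undefined && (temperature < 0 || temperature > 2)) {
    throw new RangeError("temperature must be within 0..2");
  }

  return {
    // Strip trailing slashes so request paths can be appended safely.
    baseUrl: (config.baseUrl ?? "https://api.inworld.ai").replace(/\/+$/, ""),
    // The dashboard credential is already Base64; send it as-is after "Basic ".
    headers: { Authorization: `Basic ${apiKey}` },
    voiceId: config.voiceId ?? "Sarah",
    modelId: config.modelId ?? "inworld-tts-1.5-max",
    temperature,
  };
}
```

Note that the helper never calls a Base64 encoder: re-encoding the dashboard credential is exactly the mistake the docs warn against.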
+
 ### xAI primary

 ```json5
@@ -415,7 +447,7 @@ Then run:
 - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
 rewritten to `provider: "microsoft"`.
@@ -429,7 +461,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
 - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -453,6 +485,10 @@ Then run:
 - `providers.tts-local-cli.timeoutMs`: command timeout in milliseconds (default `120000`).
 - `providers.tts-local-cli.cwd`: optional command working directory.
 - `providers.tts-local-cli.env`: optional string environment overrides for the command.
+- `providers.inworld.baseUrl`: override Inworld API base URL (default `https://api.inworld.ai`).
+- `providers.inworld.voiceId`: Inworld voice identifier (default `Sarah`).
+- `providers.inworld.modelId`: Inworld TTS model (default `inworld-tts-1.5-max`; also supports `inworld-tts-1.5-mini`, `inworld-tts-1-max`, `inworld-tts-1`).
+- `providers.inworld.temperature`: sampling temperature `0..2` (optional).
 - `providers.google.model`: Gemini TTS model (default `gemini-3.1-flash-tts-preview`).
 - `providers.google.voiceName`: Gemini prebuilt voice name (default `Kore`; `voice` is also accepted).
 - `providers.google.audioProfile`: natural-language style prompt prepended before the spoken text.
@@ -586,6 +622,7 @@ These override `messages.tts.*` for that host.
 with `ffmpeg`.
 - **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
 - **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
+- **Inworld**: MP3 for normal audio attachments, native `OGG_OPUS` for voice-note targets, and raw `PCM` at 22050 Hz for Talk/telephony.
 - **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
 - **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
 - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.