docs: outline unified talk API

Peter Steinberger
2026-05-05 20:59:13 +01:00
parent 1f7d0ef310
commit 24853ced11
13 changed files with 625 additions and 13 deletions

@@ -14,6 +14,12 @@ media capabilities are tool-driven: the agent decides when to use them based
on the conversation, and each tool only appears when at least one backing
provider is configured.
Live speech uses the Talk session contract instead of the one-shot media tool
path. Talk has three modes: provider-native `realtime`, local or streaming
`stt-tts`, and `transcription` for observe-only speech capture. Those modes
share provider catalogs, event envelopes, and cancellation semantics with
telephony, meetings, browser realtime, and native push-to-talk clients.
## Capabilities
<CardGroup cols={2}>
@@ -110,6 +116,11 @@ Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.
For live user conversations, prefer [Talk mode](/nodes/talk). Batch audio
attachments stay on the media path; browser realtime, native push-to-talk,
telephony, and meeting audio should use Talk events and the session-scoped
catalogs returned by the Gateway.
## Provider mappings (how vendors split across surfaces)
<AccordionGroup>
@@ -144,3 +155,4 @@ vendor without waiting for a completed recording.
- [Text-to-speech](/tools/tts)
- [Media understanding](/nodes/media-understanding)
- [Audio nodes](/nodes/audio)
- [Talk mode](/nodes/talk)

@@ -12,6 +12,11 @@ OpenClaw can convert outbound replies into audio across **14 speech providers**
and deliver native voice messages on Feishu, Matrix, Telegram, and WhatsApp,
audio attachments everywhere else, and PCM/Ulaw streams for telephony and Talk.
TTS is the speech-output half of Talk's `stt-tts` mode. Provider-native
`realtime` Talk sessions synthesize speech inside the realtime provider instead
of calling this TTS path, while `transcription` sessions do not synthesize an
assistant voice response.
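The split between the three Talk modes can be summarized as a small lookup. This is a minimal TypeScript sketch, not OpenClaw's actual types: the three mode strings come from the text above, while `SynthesisPath` and `synthesisPathFor` are hypothetical names introduced only for illustration.

```typescript
// The mode strings are taken from the docs; all other names are hypothetical.
type TalkMode = "realtime" | "stt-tts" | "transcription";

// Where the assistant voice is synthesized for a given mode, if anywhere.
type SynthesisPath = "realtime-provider" | "tts-path" | "none";

function synthesisPathFor(mode: TalkMode): SynthesisPath {
  switch (mode) {
    case "realtime":
      // Speech is synthesized inside the realtime provider.
      return "realtime-provider";
    case "stt-tts":
      // Speech output goes through the TTS path described on this page.
      return "tts-path";
    case "transcription":
      // Observe-only capture: no assistant voice response.
      return "none";
  }
}
```

Under these assumptions, only `stt-tts` sessions would ever reach the TTS provider fallback logic documented here.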
## Quick start
<Steps>
@@ -586,6 +591,11 @@ attempted provider:
The whole TTS request only fails when **every** attempted provider is skipped
or fails.
Talk session provider selection is session-scoped. A Talk client should choose
provider ids, model ids, voice ids, and locales from `talk.catalog` and pass
them through the Talk session or handoff request. Opening a voice session should
not mutate `messages.tts` or global Talk provider defaults.
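As a rough illustration of session-scoped selection, the sketch below picks ids from a catalog and returns a fresh selection object, leaving any shared configuration untouched. The source does not show the real `talk.catalog` payload, so `CatalogEntry`, `TalkSelection`, and `selectFromCatalog` are hypothetical shapes and names, assumed purely for this example.

```typescript
// Hypothetical catalog entry; the real `talk.catalog` shape may differ.
interface CatalogEntry {
  providerId: string;
  modelIds: string[];
  voiceIds: string[];
  locales: string[];
}

// Hypothetical per-session selection passed along with a Talk session request.
interface TalkSelection {
  providerId: string;
  modelId: string;
  voiceId: string;
  locale: string;
}

// Return the wanted value when the catalog offers it, else the first option.
function pick(options: readonly string[], wanted?: string): string | undefined {
  if (wanted !== undefined && options.indexOf(wanted) !== -1) return wanted;
  return options[0];
}

// Build a selection from catalog values only. Nothing outside the returned
// object is modified, so global defaults stay as they were.
function selectFromCatalog(
  catalog: readonly CatalogEntry[],
  wanted: Partial<TalkSelection> = {},
): TalkSelection | undefined {
  const entry =
    catalog.find((e) => e.providerId === wanted.providerId) ?? catalog[0];
  if (!entry) return undefined;

  const modelId = pick(entry.modelIds, wanted.modelId);
  const voiceId = pick(entry.voiceIds, wanted.voiceId);
  const locale = pick(entry.locales, wanted.locale);
  if (!modelId || !voiceId || !locale) return undefined;

  return { providerId: entry.providerId, modelId, voiceId, locale };
}
```

A client following this pattern would attach the returned selection to its session or handoff request, rather than writing the chosen ids back into `messages.tts` or the global Talk defaults.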
## Model-driven directives
By default, the assistant **can** emit `[[tts:...]]` directives to override