mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 13:40:44 +00:00
docs: outline unified talk API
@@ -14,6 +14,12 @@ media capabilities are tool-driven: the agent decides when to use them based
on the conversation, and each tool only appears when at least one backing
provider is configured.
Live speech uses the Talk session contract instead of the one-shot media tool
path. Talk has three modes: provider-native `realtime`, local or streaming
`stt-tts`, and `transcription` for observe-only speech capture. Those modes
share provider catalogs, event envelopes, and cancellation semantics with
telephony, meetings, browser realtime, and native push-to-talk clients.
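The mode split above can be sketched in a few lines of TypeScript. This is a hypothetical illustration only: the names `TalkMode`, `TalkSessionRequest`, and `synthesizesAssistantVoice` are invented for this sketch and are not taken from the OpenClaw source.

```typescript
// Hypothetical shapes for illustration; the real Talk contract is defined by
// the OpenClaw Gateway, and these names are not taken from it.
type TalkMode = "realtime" | "stt-tts" | "transcription";

interface TalkSessionRequest {
  mode: TalkMode;
  // Session-scoped selections, drawn from the provider catalogs shared with
  // telephony, meetings, browser realtime, and push-to-talk clients.
  providerId?: string;
  modelId?: string;
  voiceId?: string;
  locale?: string;
}

// `transcription` is observe-only speech capture: it never produces an
// assistant voice, while the other two modes do.
function synthesizesAssistantVoice(mode: TalkMode): boolean {
  return mode !== "transcription";
}
```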
## Capabilities
<CardGroup cols={2}>
@@ -110,6 +116,11 @@ Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.
For live user conversations, prefer [Talk mode](/nodes/talk). Batch audio
attachments stay on the media path; browser realtime, native push-to-talk,
telephony, and meeting audio should use Talk events and the session-scoped
catalogs returned by the Gateway.
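The routing rule above reduces to a single predicate. A minimal sketch, with names (`AudioSurface`, `usesTalkPath`) invented for illustration rather than taken from the codebase:

```typescript
// Hypothetical surface labels; invented for this sketch.
type AudioSurface =
  | "batch-attachment"
  | "browser-realtime"
  | "push-to-talk"
  | "telephony"
  | "meeting";

// Batch attachments stay on the one-shot media path; every live surface
// goes through Talk events and the Gateway's session-scoped catalogs.
function usesTalkPath(surface: AudioSurface): boolean {
  return surface !== "batch-attachment";
}
```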
## Provider mappings (how vendors split across surfaces)
<AccordionGroup>
@@ -144,3 +155,4 @@ vendor without waiting for a completed recording.
- [Text-to-speech](/tools/tts)
- [Media understanding](/nodes/media-understanding)
- [Audio nodes](/nodes/audio)
- [Talk mode](/nodes/talk)
@@ -12,6 +12,11 @@ OpenClaw can convert outbound replies into audio across **14 speech providers**
and deliver them as native voice messages on Feishu, Matrix, Telegram, and WhatsApp,
as audio attachments everywhere else, and as PCM/Ulaw streams for telephony and Talk.
TTS is the speech-output half of Talk's `stt-tts` mode. Provider-native
`realtime` Talk sessions synthesize speech inside the realtime provider instead
of calling this TTS path, while `transcription` sessions do not synthesize an
assistant voice response.
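The paragraph above amounts to a three-row lookup. A hypothetical sketch (the names `SpeechOutput` and `SPEECH_OUTPUT` are invented here, not from the OpenClaw source):

```typescript
type TalkMode = "realtime" | "stt-tts" | "transcription";
type SpeechOutput = "provider-native" | "tts-pipeline" | "none";

// Where assistant speech comes from in each Talk mode, per the text above.
const SPEECH_OUTPUT: Record<TalkMode, SpeechOutput> = {
  "realtime": "provider-native", // synthesized inside the realtime provider
  "stt-tts": "tts-pipeline",     // the speech-output half handled by this TTS path
  "transcription": "none",       // observe-only, no assistant voice response
};
```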
## Quick start
<Steps>
@@ -586,6 +591,11 @@ attempted provider:
The whole TTS request only fails when **every** attempted provider is skipped
or fails.
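That failure rule can be sketched as a simple fallback loop. This is a minimal illustration of the described behavior, not the actual implementation; `attempt` stands in for a real per-provider synthesis call.

```typescript
// Sketch of the fallback rule: return audio from the first provider that
// succeeds; the request as a whole fails only when every attempt does.
function synthesizeWithFallback(
  providerIds: string[],
  attempt: (providerId: string) => Uint8Array, // hypothetical per-provider call
): Uint8Array {
  const failures: string[] = [];
  for (const id of providerIds) {
    try {
      return attempt(id);
    } catch (err) {
      failures.push(`${id}: ${String(err)}`); // record and fall through
    }
  }
  throw new Error(`every TTS provider failed: ${failures.join("; ")}`);
}
```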
Talk session provider selection is session-scoped. A Talk client should choose
provider ids, model ids, voice ids, and locales from `talk.catalog` and pass
them through the Talk session or handoff request. Opening a voice session should
not mutate `messages.tts` or global Talk provider defaults.
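A sketch of that rule: select from a catalog snapshot and carry the choice in the session or handoff request, leaving globals untouched. The `CatalogEntry` shape and `buildTalkSelection` helper are invented for illustration; the real entry shape comes from the Gateway's `talk.catalog` response.

```typescript
// Invented catalog shape; the real entries come from `talk.catalog`.
interface CatalogEntry {
  providerId: string;
  modelIds: string[];
  voiceIds: string[];
  locales: string[];
}

// Build a session-scoped selection: nothing here writes to `messages.tts`
// or to any global Talk provider default.
function buildTalkSelection(catalog: CatalogEntry[], locale: string) {
  const entry = catalog.find((e) => e.locales.includes(locale));
  if (!entry) throw new Error(`no catalog entry supports locale ${locale}`);
  return {
    providerId: entry.providerId,
    modelId: entry.modelIds[0],
    voiceId: entry.voiceIds[0],
    locale,
  };
}
```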
## Model-driven directives
By default, the assistant **can** emit `[[tts:...]]` directives to override