mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 13:40:44 +00:00
docs: outline unified talk API
@@ -14,6 +14,12 @@ media capabilities are tool-driven: the agent decides when to use them based
on the conversation, and each tool only appears when at least one backing
provider is configured.
Live speech uses the Talk session contract instead of the one-shot media tool
path. Talk has three modes: provider-native `realtime`, local or streaming
`stt-tts`, and `transcription` for observe-only speech capture. Those modes
share provider catalogs, event envelopes, and cancellation semantics with
telephony, meetings, browser realtime, and native push-to-talk clients.
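The mode split above can be sketched in a few lines of TypeScript. This is a hypothetical illustration only: the names `TalkMode`, `TalkSessionRequest`, and `synthesizesAssistantVoice` are invented for this sketch and are not taken from the OpenClaw source.

```typescript
// Hypothetical shapes for illustration; the real Talk contract is defined by
// the OpenClaw Gateway, and these names are not taken from it.
type TalkMode = "realtime" | "stt-tts" | "transcription";

interface TalkSessionRequest {
  mode: TalkMode;
  // Session-scoped selections, drawn from the provider catalogs shared with
  // telephony, meetings, browser realtime, and push-to-talk clients.
  providerId?: string;
  modelId?: string;
  voiceId?: string;
  locale?: string;
}

// `transcription` is observe-only speech capture: it never produces an
// assistant voice, while the other two modes do.
function synthesizesAssistantVoice(mode: TalkMode): boolean {
  return mode !== "transcription";
}
```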
## Capabilities
<CardGroup cols={2}>
@@ -110,6 +116,11 @@ Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call
streaming STT providers, so live phone audio can be forwarded to the selected
vendor without waiting for a completed recording.
For live user conversations, prefer [Talk mode](/nodes/talk). Batch audio
attachments stay on the media path; browser realtime, native push-to-talk,
telephony, and meeting audio should use Talk events and the session-scoped
catalogs returned by the Gateway.
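The routing rule above reduces to a single predicate. A minimal sketch, with names (`AudioSurface`, `usesTalkPath`) invented for illustration rather than taken from the codebase:

```typescript
// Hypothetical surface labels; invented for this sketch.
type AudioSurface =
  | "batch-attachment"
  | "browser-realtime"
  | "push-to-talk"
  | "telephony"
  | "meeting";

// Batch attachments stay on the one-shot media path; every live surface
// goes through Talk events and the Gateway's session-scoped catalogs.
function usesTalkPath(surface: AudioSurface): boolean {
  return surface !== "batch-attachment";
}
```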
## Provider mappings (how vendors split across surfaces)
<AccordionGroup>
@@ -144,3 +155,4 @@ vendor without waiting for a completed recording.
- [Text-to-speech](/tools/tts)
- [Media understanding](/nodes/media-understanding)
- [Audio nodes](/nodes/audio)
- [Talk mode](/nodes/talk)
@@ -12,6 +12,11 @@ OpenClaw can convert outbound replies into audio across **14 speech providers**
and deliver them as native voice messages on Feishu, Matrix, Telegram, and WhatsApp,
as audio attachments everywhere else, and as PCM/Ulaw streams for telephony and Talk.
TTS is the speech-output half of Talk's `stt-tts` mode. Provider-native
`realtime` Talk sessions synthesize speech inside the realtime provider instead
of calling this TTS path, while `transcription` sessions do not synthesize an
assistant voice response.
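The paragraph above amounts to a three-row lookup. A hypothetical sketch (the names `SpeechOutput` and `SPEECH_OUTPUT` are invented here, not from the OpenClaw source):

```typescript
type TalkMode = "realtime" | "stt-tts" | "transcription";
type SpeechOutput = "provider-native" | "tts-pipeline" | "none";

// Where assistant speech comes from in each Talk mode, per the text above.
const SPEECH_OUTPUT: Record<TalkMode, SpeechOutput> = {
  "realtime": "provider-native", // synthesized inside the realtime provider
  "stt-tts": "tts-pipeline",     // the speech-output half handled by this TTS path
  "transcription": "none",       // observe-only, no assistant voice response
};
```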
## Quick start
<Steps>
@@ -586,6 +591,11 @@ attempted provider:
The whole TTS request only fails when **every** attempted provider is skipped
or fails.
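That failure rule can be sketched as a simple fallback loop. This is a minimal illustration of the described behavior, not the actual implementation; `attempt` stands in for a real per-provider synthesis call.

```typescript
// Sketch of the fallback rule: return audio from the first provider that
// succeeds; the request as a whole fails only when every attempt does.
function synthesizeWithFallback(
  providerIds: string[],
  attempt: (providerId: string) => Uint8Array, // hypothetical per-provider call
): Uint8Array {
  const failures: string[] = [];
  for (const id of providerIds) {
    try {
      return attempt(id);
    } catch (err) {
      failures.push(`${id}: ${String(err)}`); // record and fall through
    }
  }
  throw new Error(`every TTS provider failed: ${failures.join("; ")}`);
}
```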
Talk session provider selection is session-scoped. A Talk client should choose
provider ids, model ids, voice ids, and locales from `talk.catalog` and pass
them through the Talk session or handoff request. Opening a voice session should
not mutate `messages.tts` or global Talk provider defaults.
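A sketch of that rule: select from a catalog snapshot and carry the choice in the session or handoff request, leaving globals untouched. The `CatalogEntry` shape and `buildTalkSelection` helper are invented for illustration; the real entry shape comes from the Gateway's `talk.catalog` response.

```typescript
// Invented catalog shape; the real entries come from `talk.catalog`.
interface CatalogEntry {
  providerId: string;
  modelIds: string[];
  voiceIds: string[];
  locales: string[];
}

// Build a session-scoped selection: nothing here writes to `messages.tts`
// or to any global Talk provider default.
function buildTalkSelection(catalog: CatalogEntry[], locale: string) {
  const entry = catalog.find((e) => e.locales.includes(locale));
  if (!entry) throw new Error(`no catalog entry supports locale ${locale}`);
  return {
    providerId: entry.providerId,
    modelId: entry.modelIds[0],
    voiceId: entry.voiceIds[0],
    locale,
  };
}
```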
## Model-driven directives
By default, the assistant **can** emit `[[tts:...]]` directives to override