# Talk mode

Continuous speech conversations across local STT/TTS and realtime voice.
Talk mode has two runtime shapes:

- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
- Browser Talk uses `talk.client.create` for client-owned `webrtc` and `provider-websocket` sessions, or `talk.session.create` for Gateway-owned `gateway-relay` sessions. `managed-room` is reserved for Gateway handoff and walkie-talkie rooms.
- Transcription-only clients use `talk.session.create({ mode: "transcription", transport: "gateway-relay", brain: "none" })`, then `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close` when they need captions or dictation without an assistant voice response.
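As an illustration, a transcription-only client could drive these calls as in the sketch below. Only the `talk.session.*` method names come from this page; the `GatewayClient` wrapper and its `call` method are stand-ins for the project's real RPC transport.

```typescript
// Hypothetical minimal Gateway RPC wrapper; the real transport is
// project-specific, so this stub just records the calls it receives.
type RpcCall = { method: string; params?: unknown };

class GatewayClient {
  calls: RpcCall[] = [];
  async call(method: string, params?: unknown): Promise<{ sessionId: string }> {
    this.calls.push({ method, params });
    return { sessionId: "session-1" }; // stubbed response
  }
}

// Captions/dictation without an assistant voice response.
async function runTranscriptionSession(
  gw: GatewayClient,
  chunks: Uint8Array[],
): Promise<string[]> {
  const { sessionId } = await gw.call("talk.session.create", {
    mode: "transcription",
    transport: "gateway-relay",
    brain: "none",
  });
  for (const chunk of chunks) {
    await gw.call("talk.session.appendAudio", { sessionId, audio: chunk });
  }
  await gw.call("talk.session.close", { sessionId });
  return gw.calls.map((c) => c.method);
}
```

A real client would also subscribe to `talk.event` for partial/final transcripts, and call `talk.session.cancelTurn` to abandon an in-flight turn.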
Native Talk is a continuous voice conversation loop:

- Listen for speech
- Send the transcript to the model through the active session
- Wait for the response
- Speak it via the configured Talk provider (`talk.speak`)
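The loop above can be sketched as one "turn"; `listenOnce`, `sendToSession`, and `speak` are hypothetical stand-ins for platform STT, the active Gateway session, and `talk.speak` on the configured provider.

```typescript
// Hypothetical pluggable steps for one Talk turn; the real implementations
// are platform speech recognition, Gateway chat, and the Talk provider.
interface TalkIO {
  listenOnce(): Promise<string>; // resolves when a silence window closes
  sendToSession(text: string): Promise<string>; // Gateway chat round-trip
  speak(text: string): Promise<void>; // talk.speak on the active provider
}

async function talkTurn(io: TalkIO): Promise<string> {
  const transcript = await io.listenOnce(); // 1. listen for speech
  const reply = await io.sendToSession(transcript); // 2./3. send, wait for response
  await io.speak(reply); // 4. speak via the configured provider
  return reply;
}
```

Native Talk effectively runs `talkTurn` in a loop until the mode is toggled off, with interruption handling layered on top.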
Browser realtime Talk forwards provider tool calls through `talk.client.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
## Behavior (macOS)
- Always-on overlay while Talk mode is enabled.
- Listening → Thinking → Speaking phase transitions.
- On a short pause (silence window), the current transcript is sent.
- Replies are written to WebChat (same as typing).
- Interrupt on speech (default on): if the user starts talking while the assistant is speaking, we stop playback and note the interruption timestamp for the next prompt.
## Voice directives in replies

The assistant may prefix its reply with a single JSON line to control voice:

```json
{ "voice": "<voice-id>", "once": true }
```
Rules:

- First non-empty line only.
- Unknown keys are ignored.
- `once: true` applies to the current reply only.
- Without `once`, the voice becomes the new default for Talk mode.
- The JSON line is stripped before TTS playback.
Supported keys:

- `voice` / `voice_id` / `voiceId`
- `model` / `model_id` / `modelId`
- `speed`, `rate` (WPM), `stability`, `similarity`, `style`, `speakerBoost`
- `seed`, `normalize`, `lang`, `output_format`, `latency_tier`
- `once`
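The directive handling described above (first non-empty line only, unknown keys ignored, JSON stripped before playback) could look like this sketch; the function name is illustrative.

```typescript
type VoiceDirective = { voice?: string; once?: boolean; [key: string]: unknown };

// Parse an optional one-line JSON voice directive from the first non-empty
// line of a reply, returning the directive (if any) and the text to speak.
function stripVoiceDirective(reply: string): {
  directive?: VoiceDirective;
  speech: string;
} {
  const lines = reply.split("\n");
  const idx = lines.findIndex((l) => l.trim() !== "");
  if (idx === -1) return { speech: reply };
  const first = lines[idx].trim();
  if (first.startsWith("{") && first.endsWith("}")) {
    try {
      const directive = JSON.parse(first) as VoiceDirective;
      // Strip the JSON line before TTS playback.
      const speech = lines.slice(idx + 1).join("\n").trimStart();
      return { directive, speech };
    } catch {
      // Not valid JSON: treat the line as ordinary reply text.
    }
  }
  return { speech: reply };
}
```

Unknown keys survive `JSON.parse` but are simply ignored by whatever consumes the directive, matching the rule above.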
## Config (`~/.openclaw/openclaw.json`)

```json5
{
  talk: {
    provider: "elevenlabs",
    providers: {
      elevenlabs: {
        voiceId: "elevenlabs_voice_id",
        modelId: "eleven_v3",
        outputFormat: "mp3_44100_128",
        apiKey: "elevenlabs_api_key",
      },
      mlx: {
        modelId: "mlx-community/Soprano-80M-bf16",
      },
      system: {},
    },
    speechLocale: "ru-RU",
    silenceTimeoutMs: 1500,
    interruptOnSpeech: true,
    realtime: {
      provider: "openai",
      providers: {
        openai: {
          apiKey: "openai_api_key",
          model: "gpt-realtime",
          voice: "alloy",
        },
      },
      mode: "realtime",
      transport: "webrtc",
      brain: "agent-consult",
    },
  },
}
```
Defaults:

- `interruptOnSpeech: true`
- `silenceTimeoutMs`: when unset, Talk keeps the platform default pause window before sending the transcript (700 ms on macOS and Android, 900 ms on iOS)
- `provider`: selects the active Talk provider. Use `elevenlabs`, `mlx`, or `system` for the macOS-local playback paths.
- `providers.<provider>.voiceId`: falls back to `ELEVENLABS_VOICE_ID` / `SAG_VOICE_ID` for ElevenLabs (or the first ElevenLabs voice when an API key is available).
- `providers.elevenlabs.modelId`: defaults to `eleven_v3` when unset.
- `providers.mlx.modelId`: defaults to `mlx-community/Soprano-80M-bf16` when unset.
- `providers.elevenlabs.apiKey`: falls back to `ELEVENLABS_API_KEY` (or the gateway shell profile if available).
- `realtime.provider`: selects the active browser/server realtime voice provider. Use `openai` for WebRTC, `google` for provider WebSocket, or a bridge-only provider through Gateway relay.
- `realtime.providers.<provider>`: stores provider-owned realtime config. The browser receives only ephemeral or constrained session credentials, never a standard API key.
- `realtime.brain`: `agent-consult` routes realtime tool calls through Gateway policy; `direct-tools` is owner-only compatibility behavior; `none` is for transcription or external orchestration.
- `talk.catalog` exposes each provider's valid modes, transports, brain strategies, realtime audio formats, and capability flags so first-party Talk clients can avoid unsupported combinations.
- Streaming transcription providers are discovered through `talk.catalog.transcription`. The current Gateway relay uses the Voice Call streaming provider config until the dedicated Talk transcription config surface is added.
- `speechLocale`: optional BCP 47 locale id for on-device Talk speech recognition on iOS/macOS. Leave unset to use the device default.
- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming).
## macOS UI
- Menu bar toggle: Talk
- Config tab: Talk Mode group (voice id + interrupt toggle)
- Overlay:
  - Listening: cloud pulses with mic level
  - Thinking: sinking animation
  - Speaking: radiating rings
  - Click cloud: stop speaking
  - Click X: exit Talk mode
## Android UI
- Voice tab toggle: Talk
- Manual Mic and Talk are mutually exclusive runtime capture modes.
- Manual Mic stops when the app leaves the foreground or the user leaves the Voice tab.
- Talk Mode keeps running until toggled off or the Android node disconnects, and uses Android's microphone foreground-service type while active.
## Notes
- Requires Speech + Microphone permissions.
- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
- Browser realtime Talk uses `talk.client.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
- Transcription-only Talk uses `talk.session.create`, `talk.session.appendAudio`, `talk.session.cancelTurn`, and `talk.session.close`; clients subscribe to `talk.event` for partial/final transcript updates.
- The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
- macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
- `latency_tier` is validated to `0..4` when set.
- Android supports `pcm_16000`, `pcm_22050`, `pcm_24000`, and `pcm_44100` output formats for low-latency AudioTrack streaming.
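The `stability` validation rule above could be enforced with a check like this sketch (the function name is illustrative, not from the codebase):

```typescript
// Validate the `stability` directive key: eleven_v3 only accepts three
// discrete values, while other models accept the continuous 0..1 range.
function isValidStability(model: string, stability: number): boolean {
  if (model === "eleven_v3") {
    return stability === 0.0 || stability === 0.5 || stability === 1.0;
  }
  return stability >= 0 && stability <= 1;
}
```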