| summary | read_when | title |
|---|---|---|
| How inbound audio/voice notes are downloaded, transcribed, and injected into replies | | Audio and voice notes |

# Audio / Voice Notes (2026-01-17)
## What works

- Media understanding (audio): If audio understanding is enabled (or auto-detected), OpenClaw:
  - Locates the first audio attachment (local path or URL) and downloads it if needed.
  - Enforces `maxBytes` before sending to each model entry.
  - Runs the first eligible model entry in order (provider or CLI).
  - If it fails or skips (size/timeout), it tries the next entry.
  - On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- Command parsing: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- Verbose logging: With `--verbose`, we log when transcription runs and when it replaces the body.
## Auto-detection (default)

If you don't configure models and `tools.media.audio.enabled` is not set to `false`,
OpenClaw auto-detects in this order and stops at the first working option:

- Active reply model, when its provider supports audio understanding.
- Local CLIs (if installed):
  - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
  - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
  - `whisper` (Python CLI; downloads models automatically)
- Gemini CLI (`gemini`) using `read_many_files`
- Provider auth:
  - Configured `models.providers.*` entries that support audio are tried first
  - Bundled fallback order: OpenAI → Groq → xAI → Deepgram → Google → SenseAudio → ElevenLabs → Mistral

To disable auto-detection, set `tools.media.audio.enabled: false`.
To customize, set `tools.media.audio.models`.

Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on PATH (we expand `~`), or set an explicit CLI model with a full command path, as in the sketch below.
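For reference, an explicit CLI entry with a full command path might look like the following minimal sketch; the binary path is a placeholder for your own install, and the `--model`/`{{MediaPath}}` args mirror the Whisper CLI example below:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            // Full path sidesteps PATH lookup; adjust to wherever whisper lives on your machine.
            command: "/usr/local/bin/whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 60,
          },
        ],
      },
    },
  },
}
```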
## Config examples

### Provider + CLI fallback (OpenAI + Whisper CLI)

```json5
{
tools: {
media: {
audio: {
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "gpt-4o-mini-transcribe" },
{
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"],
timeoutSeconds: 45,
},
],
},
},
},
}
```

### Provider-only with scope gating

```json5
{
tools: {
media: {
audio: {
enabled: true,
scope: {
default: "allow",
rules: [{ action: "deny", match: { chatType: "group" } }],
},
models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
},
},
},
}
```

### Provider-only (Deepgram)

```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "deepgram", model: "nova-3" }],
},
},
},
}
```

### Provider-only (Mistral Voxtral)

```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "mistral", model: "voxtral-mini-latest" }],
},
},
},
}
```

### Provider-only (SenseAudio)

```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "senseaudio", model: "senseaudio-asr-pro-1.5-260319" }],
},
},
},
}
```

### Echo transcript to chat (opt-in)

```json5
{
tools: {
media: {
audio: {
enabled: true,
echoTranscript: true, // default is false
echoFormat: '📝 "{transcript}"', // optional, supports {transcript}
models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
},
},
},
}
```

## Notes & limits

- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Groq setup details: Groq.
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: Deepgram (audio transcription).
- Mistral setup details: Mistral.
- SenseAudio picks up `SENSEAUDIO_API_KEY` when `provider: "senseaudio"` is used.
- SenseAudio setup details: SenseAudio.
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20 MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Tiny/empty audio files below 1024 bytes are skipped before provider/CLI transcription.
- Default `maxChars` for audio is unset (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`); see the sketch after this list.
- Transcript is available to templates as `{{Transcript}}`.
- `tools.media.audio.echoTranscript` is off by default; enable it to send a transcript confirmation back to the originating chat before agent processing.
- `tools.media.audio.echoFormat` customizes the echo text (placeholder: `{transcript}`).
- CLI stdout is capped (5 MB); keep CLI output concise.
- CLI `args` should use `{{MediaPath}}` for the local audio file path. Run `openclaw doctor --fix` to migrate deprecated `{input}` placeholders from older `audio.transcription.command` configs.
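A minimal sketch combining several of these knobs; the field names come from the notes above, but the exact nesting of `attachments` and the example values are assumptions to adapt:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        // Process every voice note in the message, not just the first (assumed shape).
        attachments: { mode: "all", maxAttachments: 3 },
        // Trim long transcripts before they reach templates.
        maxChars: 4000,
        // Example provider override; point at your own gateway if needed.
        baseUrl: "https://api.openai.com/v1",
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```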
## Proxy environment support

Provider-based audio transcription honors the standard outbound proxy env vars:

- `HTTPS_PROXY` / `https_proxy`
- `HTTP_PROXY` / `http_proxy`
- `ALL_PROXY` / `all_proxy`
If no proxy env vars are set, direct egress is used. If proxy config is malformed, OpenClaw logs a warning and falls back to direct fetch.
## Mention detection in groups

When `requireMention: true` is set for a group chat, OpenClaw transcribes audio before checking for mentions, so voice notes whose mention appears only in the spoken audio can still be processed.
How it works:
- If a voice message has no text body and the group requires mentions, OpenClaw performs a "preflight" transcription.
- The transcript is checked for mention patterns (e.g., `@BotName`, emoji triggers).
- If a mention is found, the message proceeds through the full reply pipeline.
- The transcript is used for mention detection, so voice notes can pass the mention gate.
Fallback behavior:
- If transcription fails during preflight (timeout, API error, etc.), the message is processed based on text-only mention detection.
- This ensures that mixed messages (text + audio) are never incorrectly dropped.
Opt-out per Telegram group/topic:

- Set `channels.telegram.groups.<chatId>.disableAudioPreflight: true` to skip preflight transcript mention checks for that group.
- Set `channels.telegram.groups.<chatId>.topics.<threadId>.disableAudioPreflight` to override per-topic (`true` to skip, `false` to force-enable).
- Default is `false` (preflight enabled when mention-gated conditions match).
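A config sketch of that opt-out, assuming the nesting shown in the keys above; the chat ID and topic ID are placeholders:

```json5
{
  channels: {
    telegram: {
      groups: {
        // Placeholder group chat ID: skip audio preflight for this group...
        "-1001234567890": {
          disableAudioPreflight: true,
          topics: {
            // ...but force-enable it for one topic/thread.
            "42": { disableAudioPreflight: false },
          },
        },
      },
    },
  },
}
```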
Example: A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies.
## Gotchas

- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON output needs to be massaged via `jq -r .text`.
- For `parakeet-mlx`, if you pass `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when `--output-format` is `txt` (or omitted); non-`txt` output formats fall back to stdout parsing (see the sketch below).
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
- Preflight transcription only processes the first audio attachment for mention detection. Additional audio is processed during the main media understanding phase.
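A minimal sketch of a `parakeet-mlx` CLI entry matching the output-dir behavior above; only `--output-dir`/`--output-format` come from this doc, so the argument order, the output directory, and whether the media path goes last are assumptions to verify against your installed CLI:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "parakeet-mlx",
            // With --output-format txt, OpenClaw reads <output-dir>/<media-basename>.txt
            // instead of parsing stdout.
            args: ["--output-dir", "/tmp/transcripts", "--output-format", "txt", "{{MediaPath}}"],
            timeoutSeconds: 60,
          },
        ],
      },
    },
  },
}
```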