feat(webchat): add server-side dictation (#76021)
Summary:
- This PR adds WebChat server-side dictation through a new authenticated `chat.transcribeAudio` Gateway RPC, MediaRecorder composer controls, docs/changelog updates, and focused gateway/UI tests.
- Reproducibility: yes. Current main reproduces the missing feature by inspection: the Gateway method list, write scopes, docs, and WebChat voice-control test have no `chat.transcribeAudio` server-dictation path.

ClawSweeper fixups:
- Included follow-up commit: feat(webchat): add server-side dictation
- Included follow-up commit: fix(clawsweeper): address review for automerge-openclaw-openclaw-7602…

Validation:
- ClawSweeper review passed for head 850571380a.
- Required merge gates passed before the squash merge.

Prepared head SHA: 850571380a
Review: https://github.com/openclaw/openclaw/pull/76021#issuecomment-4363514226

Co-authored-by: Peter Steinberger <steipete@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Committed by GitHub · parent 15bbf4f2f3 · commit 68359cacbf
@@ -17,6 +17,7 @@ title: "Audio and voice notes"
5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
- **Control UI dictation**: The Chat composer can send a browser-recorded microphone clip to `chat.transcribeAudio`. That Gateway RPC writes the clip to a temporary local file, runs this same audio transcription pipeline, returns draft text to the browser, and deletes the temporary file. It does not create an agent run by itself.
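To make the composer flow above concrete, here is a minimal browser-side sketch. It assumes a generic `gateway.call(method, params)` WebSocket RPC helper and `audioBase64`/`mimeType` parameter names; those names are illustrative only. The only details taken from the docs are the `chat.transcribeAudio` RPC name and the fact that it returns draft text without starting an agent run.

```typescript
// Minimal sketch of composer dictation, under the assumptions named above.
// `gateway.call`, `audioBase64`, and `mimeType` are illustrative names only.
async function dictateIntoDraft(
  gateway: { call: (method: string, params: unknown) => Promise<any> },
  setDraft: (text: string) => void,
): Promise<void> {
  // 1. Record a short microphone clip in the browser.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  const stopped = new Promise<void>((resolve) => { recorder.onstop = () => resolve(); });
  recorder.start();
  await new Promise((r) => setTimeout(r, 5000)); // record ~5 s for the example
  recorder.stop();
  await stopped;
  stream.getTracks().forEach((t) => t.stop());

  // 2. Base64-encode the clip and send it to the Gateway RPC. The Gateway
  //    writes a temp file, runs the audio transcription pipeline, returns the
  //    draft text, and deletes the temp file; no agent run is started.
  const bytes = new Uint8Array(await new Blob(chunks, { type: recorder.mimeType }).arrayBuffer());
  let binary = "";
  bytes.forEach((b) => { binary += String.fromCharCode(b); });
  const result = await gateway.call("chat.transcribeAudio", {
    audioBase64: btoa(binary),     // assumed field name
    mimeType: recorder.mimeType,   // assumed field name
  });

  // 3. Insert the transcript into the composer draft.
  setDraft(result.text);
}
```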
## Auto-detection (default)
@@ -96,6 +96,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Chat and Talk">
- Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
- Dictate into the Chat composer with server-side STT (`chat.transcribeAudio`). The browser records a short microphone clip and sends it to the Gateway, which runs the configured `tools.media.audio` transcription pipeline and returns draft text without exposing provider credentials to the browser.
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and sends `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model (a rough sketch of the relay flow follows this list).
- Stream tool calls + live tool output cards in Chat (agent events).
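The relay transport in the Talk bullet above can be pictured roughly as follows. Only the `talk.realtime.relay*` RPC prefix and the "browser streams microphone PCM" behaviour come from the docs; the concrete method names (`relayStart`, `relayAudio`) and every field name here are hypothetical.

```typescript
// Hypothetical sketch of streaming mic PCM through the Gateway relay so that
// provider credentials never reach the browser. Method names after
// "talk.realtime." and all field names are assumptions.
async function streamMicThroughRelay(
  call: (method: string, params: unknown) => Promise<any>,
): Promise<void> {
  const session = await call("talk.realtime.relayStart", {}); // hypothetical method

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // deprecated, but simple for a sketch

  processor.onaudioprocess = (event) => {
    // Convert float samples to 16-bit PCM and forward them through the relay.
    const floats = event.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(floats.length);
    for (let i = 0; i < floats.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(floats[i] * 32767)));
    }
    const bytes = new Uint8Array(pcm.buffer);
    let binary = "";
    bytes.forEach((b) => { binary += String.fromCharCode(b); });
    void call("talk.realtime.relayAudio", {   // hypothetical method
      sessionId: session.sessionId,           // hypothetical field
      pcmBase64: btoa(binary),                // hypothetical field
    });
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}
```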
@@ -149,6 +150,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Send and history semantics">
- `chat.send` is **non-blocking**: it acks immediately with `{ runId, status: "started" }` and the response streams via `chat` events.
- `chat.transcribeAudio` is a one-shot dictation helper for Chat drafts. It accepts browser-recorded base64 audio, keeps uploads below the Gateway WebSocket frame limit, writes a temporary local file, runs media-understanding audio transcription with the active Gateway config, returns `{ text, provider, model }`, and removes the temporary file. It does not create an agent run and is separate from realtime Talk.
- Chat uploads accept images plus non-video files. Images keep the native image path; other files are stored as managed media and shown in history as attachment links.
- Re-sending with the same `idempotencyKey` returns `{ status: "in_flight" }` while running, and `{ status: "ok" }` after completion (see the sketch after this list).
- `chat.history` responses are size-bounded for UI safety. When transcript entries are too large, Gateway may truncate long text fields, omit heavy metadata blocks, and replace oversized messages with a placeholder (`[chat.history omitted: message too large]`).
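A small sketch of the send semantics above, assuming a generic `call(method, params)` helper. The ack shape `{ runId, status: "started" }`, the `idempotencyKey` parameter, and the `in_flight`/`ok` statuses come from the bullets in this section; the helper, the `text` field, and the variable names are assumptions.

```typescript
// Sketch of the non-blocking chat.send ack and idempotencyKey semantics above.
// The `call` helper, the `text` field, and variable names are assumptions; the
// ack and status shapes are taken from this section.
async function sendOnce(
  call: (method: string, params: unknown) => Promise<any>,
  text: string,
): Promise<string> {
  const idempotencyKey = crypto.randomUUID();

  // chat.send acks immediately; the assistant reply streams via `chat` events.
  const ack = await call("chat.send", { text, idempotencyKey });
  console.log(ack.runId, ack.status); // { runId, status: "started" }

  // Re-sending with the same key never starts a duplicate run: the Gateway
  // answers { status: "in_flight" } while running and { status: "ok" } after.
  const retry = await call("chat.send", { text, idempotencyKey });
  if (retry.status === "in_flight") {
    console.log("original run still streaming");
  }
  return ack.runId;
}
```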
@@ -22,7 +22,7 @@ Status: the macOS/iOS SwiftUI chat UI talks directly to the Gateway WebSocket.
## How it works (behavior)
- The UI connects to the Gateway WebSocket and uses `chat.history`, `chat.send`, and `chat.inject`.
- The UI connects to the Gateway WebSocket and uses `chat.history`, `chat.send`, `chat.inject`, and `chat.transcribeAudio`.
- `chat.history` is bounded for stability: Gateway may truncate long text fields, omit heavy metadata, and replace oversized entries with `[chat.history omitted: message too large]`.
- `chat.history` follows the active transcript branch for modern append-only session files, so abandoned rewrite branches and superseded prompt copies are not rendered in WebChat.
- Control UI remembers the backing Gateway `sessionId` returned by `chat.history` and includes it on follow-up `chat.send` calls, so reconnects and page refreshes continue the same stored conversation unless the user starts or resets a session.
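A minimal sketch of the session continuation in the last bullet, assuming a generic `call` helper; only the `sessionId` returned by `chat.history` and echoed on follow-up `chat.send` calls is taken from the docs, everything else is illustrative.

```typescript
// Sketch of keeping the same stored conversation across reconnects, as
// described above. The `call` helper, the `text` field, and variable names are
// assumptions; the chat.history/chat.send `sessionId` behaviour is from the docs.
let sessionId: string | undefined;

async function loadHistory(call: (method: string, params: unknown) => Promise<any>) {
  const history = await call("chat.history", {});
  sessionId = history.sessionId; // remember the backing Gateway session
  return history;
}

async function sendDraft(call: (method: string, params: unknown) => Promise<any>, text: string) {
  // Include the remembered sessionId so page refreshes and reconnects continue
  // the same stored conversation instead of starting a new one.
  return call("chat.send", { text, sessionId });
}
```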
@@ -37,6 +37,7 @@ Status: the macOS/iOS SwiftUI chat UI talks directly to the Gateway WebSocket.
and assistant entries whose whole visible text is only the exact silent
token `NO_REPLY` / `no_reply` are omitted.
- Reasoning-flagged reply payloads (`isReasoning: true`) are excluded from WebChat assistant content, transcript replay text, and audio content blocks, so thinking-only payloads do not surface as visible assistant messages or playable audio (see the filtering sketch after this list).
- `chat.transcribeAudio` powers server-side dictation in the Control UI chat composer. The browser records microphone audio, sends it as base64 to the Gateway, and the Gateway runs the configured `tools.media.audio` pipeline. The returned transcript is inserted into the draft; no agent run is started until the user sends it.
- `chat.inject` appends an assistant note directly to the transcript and broadcasts it to the UI (no agent run).
- Aborted runs can keep partial assistant output visible in the UI.
- Gateway persists aborted partial assistant text into transcript history when buffered output exists, and marks those entries with abort metadata.
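A hedged sketch of the visibility filtering implied by the silent-token and reasoning bullets above. The transcript-entry shape is an assumption; only the `NO_REPLY`/`no_reply` tokens and the `isReasoning` flag come from the docs.

```typescript
// Sketch of the visibility filtering described above. The entry shape is an
// assumption; only the NO_REPLY/no_reply tokens and the isReasoning flag are
// taken from the docs.
interface TranscriptEntry {
  role: "user" | "assistant";
  text: string;
  isReasoning?: boolean; // reasoning-flagged payloads are never rendered
}

function visibleEntries(entries: TranscriptEntry[]): TranscriptEntry[] {
  return entries.filter((entry) => {
    if (entry.isReasoning) return false; // thinking-only payloads stay hidden
    const trimmed = entry.text.trim();
    if (entry.role === "assistant" && (trimmed === "NO_REPLY" || trimmed === "no_reply")) {
      return false; // silent-token replies are omitted from WebChat
    }
    return true;
  });
}
```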