feat(webchat): add server-side dictation (#76021)
Summary:
- This PR adds WebChat server-side dictation through a new authenticated `chat.transcribeAudio` Gateway RPC, MediaRecorder composer controls, docs/changelog updates, and focused gateway/UI tests.
- Reproducibility: yes. Current main reproduces the missing feature by inspection: the Gateway method list, write scopes, docs, and WebChat voice-control test have no `chat.transcribeAudio` server-dictation path.

ClawSweeper fixups:
- Included follow-up commit: feat(webchat): add server-side dictation
- Included follow-up commit: fix(clawsweeper): address review for automerge-openclaw-openclaw-7602…

Validation:
- ClawSweeper review passed for head 850571380a.
- Required merge gates passed before the squash merge.

Prepared head SHA: 850571380a
Review: https://github.com/openclaw/openclaw/pull/76021#issuecomment-4363514226

Co-authored-by: Peter Steinberger <steipete@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Committed by GitHub · parent 15bbf4f2f3 · commit 68359cacbf
@@ -17,6 +17,7 @@ title: "Audio and voice notes"
5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
- **Control UI dictation**: The Chat composer can send a browser-recorded microphone clip to `chat.transcribeAudio`. That Gateway RPC writes the clip to a temporary local file, runs this same audio transcription pipeline, returns draft text to the browser, and deletes the temporary file. It does not create an agent run by itself.
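To make the composer flow above concrete, here is a minimal browser-side sketch. It assumes a generic `gateway.call(method, params)` WebSocket RPC helper and `audioBase64`/`mimeType` parameter names; those names are illustrative only. The only details taken from the docs are the `chat.transcribeAudio` RPC name and the fact that it returns draft text without starting an agent run.

```typescript
// Minimal sketch of composer dictation, under the assumptions named above.
// `gateway.call`, `audioBase64`, and `mimeType` are illustrative names only.
async function dictateIntoDraft(
  gateway: { call: (method: string, params: unknown) => Promise<any> },
  setDraft: (text: string) => void,
): Promise<void> {
  // 1. Record a short microphone clip in the browser.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  const stopped = new Promise<void>((resolve) => { recorder.onstop = () => resolve(); });
  recorder.start();
  await new Promise((r) => setTimeout(r, 5000)); // record ~5 s for the example
  recorder.stop();
  await stopped;
  stream.getTracks().forEach((t) => t.stop());

  // 2. Base64-encode the clip and send it to the Gateway RPC. The Gateway
  //    writes a temp file, runs the audio transcription pipeline, returns the
  //    draft text, and deletes the temp file; no agent run is started.
  const bytes = new Uint8Array(await new Blob(chunks, { type: recorder.mimeType }).arrayBuffer());
  let binary = "";
  bytes.forEach((b) => { binary += String.fromCharCode(b); });
  const result = await gateway.call("chat.transcribeAudio", {
    audioBase64: btoa(binary),     // assumed field name
    mimeType: recorder.mimeType,   // assumed field name
  });

  // 3. Insert the transcript into the composer draft.
  setDraft(result.text);
}
```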
## Auto-detection (default)
@@ -96,6 +96,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Chat and Talk">
- Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
- Dictate into the Chat composer with server-side STT (`chat.transcribeAudio`). The browser records a short microphone clip and sends it to the Gateway, which runs the configured `tools.media.audio` transcription pipeline and returns draft text without exposing provider credentials to the browser.
- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and sends `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model (a rough sketch of the relay flow follows this list).
- Stream tool calls + live tool output cards in Chat (agent events).
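The relay transport in the Talk bullet above can be pictured roughly as follows. Only the `talk.realtime.relay*` RPC prefix and the "browser streams microphone PCM" behaviour come from the docs; the concrete method names (`relayStart`, `relayAudio`) and every field name here are hypothetical.

```typescript
// Hypothetical sketch of streaming mic PCM through the Gateway relay so that
// provider credentials never reach the browser. Method names after
// "talk.realtime." and all field names are assumptions.
async function streamMicThroughRelay(
  call: (method: string, params: unknown) => Promise<any>,
): Promise<void> {
  const session = await call("talk.realtime.relayStart", {}); // hypothetical method

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // deprecated, but simple for a sketch

  processor.onaudioprocess = (event) => {
    // Convert float samples to 16-bit PCM and forward them through the relay.
    const floats = event.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(floats.length);
    for (let i = 0; i < floats.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(floats[i] * 32767)));
    }
    const bytes = new Uint8Array(pcm.buffer);
    let binary = "";
    bytes.forEach((b) => { binary += String.fromCharCode(b); });
    void call("talk.realtime.relayAudio", {   // hypothetical method
      sessionId: session.sessionId,           // hypothetical field
      pcmBase64: btoa(binary),                // hypothetical field
    });
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}
```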
@@ -149,6 +150,7 @@ Imported themes are stored only in the current browser profile. They are not wri
<AccordionGroup>
<Accordion title="Send and history semantics">
- `chat.send` is **non-blocking**: it acks immediately with `{ runId, status: "started" }` and the response streams via `chat` events.
- `chat.transcribeAudio` is a one-shot dictation helper for Chat drafts. It accepts browser-recorded base64 audio, keeps uploads below the Gateway WebSocket frame limit, writes a temporary local file, runs media-understanding audio transcription with the active Gateway config, returns `{ text, provider, model }`, and removes the temporary file. It does not create an agent run and is separate from realtime Talk.
- Chat uploads accept images plus non-video files. Images keep the native image path; other files are stored as managed media and shown in history as attachment links.
- Re-sending with the same `idempotencyKey` returns `{ status: "in_flight" }` while running, and `{ status: "ok" }` after completion (see the sketch after this list).
- `chat.history` responses are size-bounded for UI safety. When transcript entries are too large, Gateway may truncate long text fields, omit heavy metadata blocks, and replace oversized messages with a placeholder (`[chat.history omitted: message too large]`).
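A small sketch of the send semantics above, assuming a generic `call(method, params)` helper. The ack shape `{ runId, status: "started" }`, the `idempotencyKey` parameter, and the `in_flight`/`ok` statuses come from the bullets in this section; the helper, the `text` field, and the variable names are assumptions.

```typescript
// Sketch of the non-blocking chat.send ack and idempotencyKey semantics above.
// The `call` helper, the `text` field, and variable names are assumptions; the
// ack and status shapes are taken from this section.
async function sendOnce(
  call: (method: string, params: unknown) => Promise<any>,
  text: string,
): Promise<string> {
  const idempotencyKey = crypto.randomUUID();

  // chat.send acks immediately; the assistant reply streams via `chat` events.
  const ack = await call("chat.send", { text, idempotencyKey });
  console.log(ack.runId, ack.status); // { runId, status: "started" }

  // Re-sending with the same key never starts a duplicate run: the Gateway
  // answers { status: "in_flight" } while running and { status: "ok" } after.
  const retry = await call("chat.send", { text, idempotencyKey });
  if (retry.status === "in_flight") {
    console.log("original run still streaming");
  }
  return ack.runId;
}
```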
@@ -22,7 +22,7 @@ Status: the macOS/iOS SwiftUI chat UI talks directly to the Gateway WebSocket.
## How it works (behavior)
- The UI connects to the Gateway WebSocket and uses `chat.history`, `chat.send`, and `chat.inject`.
- The UI connects to the Gateway WebSocket and uses `chat.history`, `chat.send`, `chat.inject`, and `chat.transcribeAudio`.
- `chat.history` is bounded for stability: Gateway may truncate long text fields, omit heavy metadata, and replace oversized entries with `[chat.history omitted: message too large]`.
- `chat.history` follows the active transcript branch for modern append-only session files, so abandoned rewrite branches and superseded prompt copies are not rendered in WebChat.
- Control UI remembers the backing Gateway `sessionId` returned by `chat.history` and includes it on follow-up `chat.send` calls, so reconnects and page refreshes continue the same stored conversation unless the user starts or resets a session.
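A minimal sketch of the session continuation in the last bullet, assuming a generic `call` helper; only the `sessionId` returned by `chat.history` and echoed on follow-up `chat.send` calls is taken from the docs, everything else is illustrative.

```typescript
// Sketch of keeping the same stored conversation across reconnects, as
// described above. The `call` helper, the `text` field, and variable names are
// assumptions; the chat.history/chat.send `sessionId` behaviour is from the docs.
let sessionId: string | undefined;

async function loadHistory(call: (method: string, params: unknown) => Promise<any>) {
  const history = await call("chat.history", {});
  sessionId = history.sessionId; // remember the backing Gateway session
  return history;
}

async function sendDraft(call: (method: string, params: unknown) => Promise<any>, text: string) {
  // Include the remembered sessionId so page refreshes and reconnects continue
  // the same stored conversation instead of starting a new one.
  return call("chat.send", { text, sessionId });
}
```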
@@ -37,6 +37,7 @@ Status: the macOS/iOS SwiftUI chat UI talks directly to the Gateway WebSocket.
and assistant entries whose whole visible text is only the exact silent
token `NO_REPLY` / `no_reply` are omitted.
- Reasoning-flagged reply payloads (`isReasoning: true`) are excluded from WebChat assistant content, transcript replay text, and audio content blocks, so thinking-only payloads do not surface as visible assistant messages or playable audio (see the filtering sketch after this list).
- `chat.transcribeAudio` powers server-side dictation in the Control UI chat composer. The browser records microphone audio, sends it as base64 to the Gateway, and the Gateway runs the configured `tools.media.audio` pipeline. The returned transcript is inserted into the draft; no agent run is started until the user sends it.
- `chat.inject` appends an assistant note directly to the transcript and broadcasts it to the UI (no agent run).
- Aborted runs can keep partial assistant output visible in the UI.
- Gateway persists aborted partial assistant text into transcript history when buffered output exists, and marks those entries with abort metadata.
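A hedged sketch of the visibility filtering implied by the silent-token and reasoning bullets above. The transcript-entry shape is an assumption; only the `NO_REPLY`/`no_reply` tokens and the `isReasoning` flag come from the docs.

```typescript
// Sketch of the visibility filtering described above. The entry shape is an
// assumption; only the NO_REPLY/no_reply tokens and the isReasoning flag are
// taken from the docs.
interface TranscriptEntry {
  role: "user" | "assistant";
  text: string;
  isReasoning?: boolean; // reasoning-flagged payloads are never rendered
}

function visibleEntries(entries: TranscriptEntry[]): TranscriptEntry[] {
  return entries.filter((entry) => {
    if (entry.isReasoning) return false; // thinking-only payloads stay hidden
    const trimmed = entry.text.trim();
    if (entry.role === "assistant" && (trimmed === "NO_REPLY" || trimmed === "no_reply")) {
      return false; // silent-token replies are omitted from WebChat
    }
    return true;
  });
}
```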