openclaw/docs/concepts/streaming.md at 16bae70af4cb1c5f46551f52feb0f4b52090d91c

mirror of https://github.com/openclaw/openclaw.git synced 2026-06-03 21:24:05 +00:00


summary: Streaming + chunking behavior (block replies, channel preview streaming, mode mapping)

read_when:

Explaining how streaming or chunking works on channels
Changing block streaming or channel chunking behavior
Debugging duplicate/early block replies or channel preview streaming

Streaming and chunking

OpenClaw has two separate streaming layers:

Block streaming (channels): emit completed blocks as the assistant writes. These are normal channel messages (not token deltas).
Preview streaming (Telegram/Discord/Slack): update a temporary preview message while generating.

There is no true token-delta streaming to channel messages today. Preview streaming is message-based (send + edits/appends).

Block streaming (channel messages)

Block streaming sends assistant output in coarse chunks as it becomes available.

Model output
  └─ text_delta/events
       ├─ (blockStreamingBreak=text_end)
       │    └─ chunker emits blocks as buffer grows
       └─ (blockStreamingBreak=message_end)
            └─ chunker flushes at message_end
                   └─ channel send (block replies)

Legend:

text_delta/events: model stream events (may be sparse for non-streaming models).
chunker: EmbeddedBlockChunker applying min/max bounds + break preference.
channel send: actual outbound messages (block replies).

Controls:

agents.defaults.blockStreamingDefault: "on"/"off" (default off).
Channel overrides: *.blockStreaming (and per-account variants) to force "on"/"off" per channel.
agents.defaults.blockStreamingBreak: "text_end" or "message_end".
agents.defaults.blockStreamingChunk: { minChars, maxChars, breakPreference? }.
agents.defaults.blockStreamingCoalesce: { minChars?, maxChars?, idleMs? } (merge streamed blocks before send).
Channel hard cap: *.textChunkLimit (e.g., channels.whatsapp.textChunkLimit).
Channel chunk mode: *.chunkMode (length is the default; newline splits on blank lines, i.e. paragraph boundaries, before length chunking).
Discord soft cap: channels.discord.maxLinesPerMessage (default 17) splits tall replies to avoid UI clipping.
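
To make the interaction between textChunkLimit and chunkMode concrete, here is a minimal sketch. The function names (sliceByLength, splitForChannel) are hypothetical and not OpenClaw APIs, and treating each paragraph as its own message in newline mode is an assumption; the real implementation may group or break text differently.

```typescript
type ChunkMode = "length" | "newline";

// Cut text into pieces of at most `limit` characters (plain hard slicing).
function sliceByLength(text: string, limit: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += limit) {
    out.push(text.slice(i, i + limit));
  }
  return out;
}

// Hypothetical helper: apply chunkMode first, then the channel hard cap.
function splitForChannel(
  text: string,
  textChunkLimit: number,
  chunkMode: ChunkMode = "length",
): string[] {
  const limit = Math.max(1, textChunkLimit);
  // "newline" mode splits on blank lines (paragraph boundaries) before length chunking.
  const units = chunkMode === "newline" ? text.split(/\n\s*\n/) : [text];
  const out: string[] = [];
  for (const unit of units) {
    const trimmed = unit.replace(/^\s+|\s+$/g, "");
    if (trimmed.length === 0) continue;
    for (const piece of sliceByLength(trimmed, limit)) {
      out.push(piece);
    }
  }
  return out;
}
```

With a limit of 10, the text "one\n\ntwo" stays a single message in length mode and becomes two messages in newline mode.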

Boundary semantics:

text_end: stream blocks as soon as chunker emits; flush on each text_end.
message_end: wait until assistant message finishes, then flush buffered output.

message_end still uses the chunker if the buffered text exceeds maxChars, so it can emit multiple chunks at the end.
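
The difference between the two boundaries can be sketched as a small simulation. This is illustrative only: the event shape and the function name simulateBlockFlushes are assumptions, and the sketch deliberately ignores the chunker (no early emission as the buffer grows, no maxChars splitting) so that only the flush points are visible.

```typescript
type BreakMode = "text_end" | "message_end";

// Hypothetical, simplified model stream events.
type ModelStreamEvent =
  | { kind: "text_delta"; text: string }
  | { kind: "text_end" }
  | { kind: "message_end" };

// Returns the blocks that would be handed to the channel, in order.
function simulateBlockFlushes(mode: BreakMode, events: ModelStreamEvent[]): string[] {
  const sent: string[] = [];
  let buffer = "";
  const flush = (): void => {
    if (buffer.length > 0) {
      sent.push(buffer);
      buffer = "";
    }
  };
  for (const ev of events) {
    if (ev.kind === "text_delta") {
      buffer += ev.text;
    } else if (ev.kind === "text_end") {
      // text_end mode flushes at every text_end; message_end mode keeps buffering.
      if (mode === "text_end") flush();
    } else {
      // message_end always flushes whatever is still buffered.
      flush();
    }
  }
  return sent;
}
```

For a stream with two text segments, text_end mode yields two blocks while message_end mode yields one block once the message finishes.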

Media delivery with block streaming

Streaming media must use structured payload fields such as mediaUrl or mediaUrls; streamed text is not parsed as an attachment command. When block streaming sends media early, OpenClaw remembers that delivery for the turn. If the final assistant payload repeats the same media URL, the final delivery strips the duplicate media instead of sending the attachment again.

Exact duplicate final payloads are suppressed. If the final payload adds distinct text around media that was already streamed, OpenClaw still sends the new text while keeping the media single-delivery. This prevents duplicate voice notes or files on channels such as Telegram.
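
The single-delivery rule can be sketched as a per-turn tracker. The class name TurnMediaTracker, its methods, and the ReplyPayload shape are hypothetical; only the mediaUrls field name comes from the text above, and the real bookkeeping may differ.

```typescript
interface ReplyPayload {
  text?: string;
  mediaUrls?: string[];
}

// Hypothetical per-turn tracker for media that block streaming already sent.
class TurnMediaTracker {
  private deliveredMedia: string[] = [];
  private deliveredKeys: string[] = [];

  private static keyOf(p: ReplyPayload): string {
    return JSON.stringify([p.text || "", p.mediaUrls || []]);
  }

  // Remember a payload that was delivered early by block streaming.
  recordStreamed(p: ReplyPayload): void {
    this.deliveredKeys.push(TurnMediaTracker.keyOf(p));
    for (const url of p.mediaUrls || []) {
      if (this.deliveredMedia.indexOf(url) < 0) this.deliveredMedia.push(url);
    }
  }

  // Returns what the final delivery should still send, or null if nothing is new.
  prepareFinal(p: ReplyPayload): ReplyPayload | null {
    // Exact duplicates of an already streamed payload are suppressed.
    if (this.deliveredKeys.indexOf(TurnMediaTracker.keyOf(p)) >= 0) return null;
    // Strip media that was already delivered; keep any new text.
    const freshMedia = (p.mediaUrls || []).filter(
      (url) => this.deliveredMedia.indexOf(url) < 0,
    );
    const text = (p.text || "").replace(/^\s+|\s+$/g, "");
    if (text.length === 0 && freshMedia.length === 0) return null;
    const out: ReplyPayload = {};
    if (text.length > 0) out.text = text;
    if (freshMedia.length > 0) out.mediaUrls = freshMedia;
    return out;
  }
}
```

In this sketch, a final payload that repeats a streamed voice note and adds a caption is reduced to the caption alone, so the attachment is not sent twice.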

Chunking algorithm (low/high bounds)

Block chunking is implemented by EmbeddedBlockChunker:

Low bound: don't emit until buffer >= minChars (unless forced).
High bound: prefer splits before maxChars; if forced, split at maxChars.
Break preference: paragraph → newline → sentence → whitespace → hard break.
Code fences: never split inside fences; when forced at maxChars, close + reopen the fence to keep Markdown valid.

maxChars is clamped to the channel textChunkLimit, so you can't exceed per-channel caps.
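
The bounds and break preference above can be illustrated with a simplified splitter. This is not EmbeddedBlockChunker: the names chunkBlocks, findPreferredBreak, and BlockChunkOptions are hypothetical, the sketch works on a completed buffer rather than a live stream, and it omits code-fence handling entirely.

```typescript
interface BlockChunkOptions {
  minChars: number;
  maxChars: number;
  textChunkLimit?: number; // channel hard cap, if any
}

// Separator groups in preference order: paragraph, newline, sentence, whitespace.
const BREAK_GROUPS: string[][] = [["\n\n"], ["\n"], [". ", "! ", "? "], [" "]];

// Returns the cut position inside `head`, or -1 when only a hard break is possible.
function findPreferredBreak(head: string, minChars: number): number {
  for (const group of BREAK_GROUPS) {
    let best = -1;
    for (const sep of group) {
      const idx = head.lastIndexOf(sep);
      // Low bound: the emitted block must hold at least minChars characters.
      if (idx >= minChars && idx + sep.length > best) best = idx + sep.length;
    }
    if (best > 0) return best;
  }
  return -1;
}

function chunkBlocks(text: string, opts: BlockChunkOptions): string[] {
  // maxChars is clamped to the channel cap.
  const capped =
    opts.textChunkLimit !== undefined
      ? Math.min(opts.maxChars, opts.textChunkLimit)
      : opts.maxChars;
  const maxChars = Math.max(1, capped);
  const minChars = Math.min(Math.max(1, opts.minChars), maxChars);
  const blocks: string[] = [];
  let rest = text;
  while (rest.length > maxChars) {
    const head = rest.slice(0, maxChars);
    let cut = findPreferredBreak(head, minChars);
    if (cut < 0) cut = maxChars; // forced: hard break at maxChars
    const block = rest.slice(0, cut).replace(/\s+$/, "");
    if (block.length > 0) blocks.push(block);
    rest = rest.slice(cut).replace(/^\s+/, "");
  }
  // Final flush: the remainder is sent even if it is shorter than minChars.
  if (rest.length > 0) blocks.push(rest);
  return blocks;
}
```

With minChars 10 and maxChars 40, a two-paragraph input is cut first at the blank line and then at a sentence end, while an unbroken run of characters falls through to hard breaks at maxChars.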

Coalescing (merge streamed blocks)

When block streaming is enabled, OpenClaw can merge consecutive block chunks before sending them out. This reduces "single-line spam" while still providing progressive output.

Coalescing waits for idle gaps (idleMs) before flushing.
Buffers are capped by maxChars and will flush if they exceed it.
minChars prevents tiny fragments from sending until enough text accumulates (final flush always sends remaining text).
Joiner is derived from blockStreamingChunk.breakPreference (paragraph → \n\n, newline → \n, sentence → space).
Channel overrides are available via *.blockStreamingCoalesce (including per-account configs).
Default coalesce minChars is bumped to 1500 for Signal/Slack/Discord unless overridden.
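
A rough sketch of the coalescing rules follows. The class BlockCoalescerSketch and its add/tick/finish methods are hypothetical, and time is passed in explicitly instead of using real timers so the behavior is easy to follow; the actual scheduling in OpenClaw may differ.

```typescript
type BreakPreference = "paragraph" | "newline" | "sentence";

// Joiner derived from the break preference, as described above.
function joinerFor(pref: BreakPreference): string {
  if (pref === "paragraph") return "\n\n";
  if (pref === "newline") return "\n";
  return " ";
}

interface CoalesceOptions {
  minChars: number;
  maxChars: number;
  idleMs: number;
  breakPreference: BreakPreference;
}

class BlockCoalescerSketch {
  private parts: string[] = [];
  private lastAddedAt = 0;
  private readonly opts: CoalesceOptions;
  private readonly joiner: string;

  constructor(opts: CoalesceOptions) {
    this.opts = opts;
    this.joiner = joinerFor(opts.breakPreference);
  }

  private take(): string {
    const merged = this.parts.join(this.joiner);
    this.parts = [];
    return merged;
  }

  // Add a streamed block; returns any message that must be sent now.
  add(block: string, nowMs: number): string[] {
    const out: string[] = [];
    const projected = this.parts.concat([block]).join(this.joiner).length;
    // Cap: flush the current buffer if the new block would push it past maxChars.
    if (this.parts.length > 0 && projected > this.opts.maxChars) out.push(this.take());
    this.parts.push(block);
    this.lastAddedAt = nowMs;
    return out;
  }

  // Called periodically; flushes after an idle gap once minChars is reached.
  tick(nowMs: number): string[] {
    const size = this.parts.join(this.joiner).length;
    const idle = nowMs - this.lastAddedAt >= this.opts.idleMs;
    return this.parts.length > 0 && idle && size >= this.opts.minChars ? [this.take()] : [];
  }

  // Final flush always sends the remaining text.
  finish(): string[] {
    return this.parts.length > 0 ? [this.take()] : [];
  }
}
```

In this model a short "Hi." block is held back until a later block brings the buffer past minChars and an idle gap elapses, at which point both are sent as one message.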

Human-like pacing between blocks

When block streaming is enabled, you can add a randomized pause between block replies (after the first block). This makes multi-bubble responses feel more natural.

Config: agents.defaults.humanDelay (override per agent via agents.list[].humanDelay).
Modes: off (default), natural (800-2500ms), custom (minMs/maxMs).
Applies only to block replies, not final replies or tool summaries.
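
As a sketch of how such a pause could be chosen: the function name pickHumanDelayMs and the mode field name are assumptions (the text above names the modes and minMs/maxMs but not the exact config shape), and uniform sampling is also an assumption.

```typescript
// Hypothetical shape; only the mode names and minMs/maxMs come from the docs above.
type HumanDelayConfig =
  | { mode: "off" }
  | { mode: "natural" }
  | { mode: "custom"; minMs: number; maxMs: number };

// Returns the pause (in ms) to wait before sending the block at `blockIndex`.
function pickHumanDelayMs(
  cfg: HumanDelayConfig,
  blockIndex: number,
  random: () => number = Math.random,
): number {
  // No pause when disabled, and never before the first block.
  if (cfg.mode === "off" || blockIndex === 0) return 0;
  const lo = cfg.mode === "natural" ? 800 : Math.min(cfg.minMs, cfg.maxMs);
  const hi = cfg.mode === "natural" ? 2500 : Math.max(cfg.minMs, cfg.maxMs);
  return Math.round(lo + random() * (hi - lo));
}
```

The random source is injectable so the range can be checked deterministically; callers would normally rely on the Math.random default.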

"Stream chunks or everything"

This maps to:

Stream chunks: blockStreamingDefault: "on" + blockStreamingBreak: "text_end" (emit as you go). Non-Telegram channels also need *.blockStreaming: true.
Stream everything at end: blockStreamingBreak: "message_end" (flush once, possibly multiple chunks if very long).
No block streaming: blockStreamingDefault: "off" (only final reply).

Channel note: Block streaming is off unless *.blockStreaming is explicitly set to true. Channels can stream a live preview (channels.<channel>.streaming) without block replies.

Config location reminder: the blockStreaming* defaults live under agents.defaults, not the root config.
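
The mapping above, including where each key lives, can be sketched as a small builder. The names StreamingChoice, SketchConfig, and configForChoice are hypothetical. Setting blockStreamingDefault to "on" for the "everything at end" case and always adding the channel override are assumptions made for illustration; check the channel docs before copying this shape.

```typescript
type StreamingChoice = "chunks" | "everything_at_end" | "none";

interface SketchConfig {
  agents: {
    defaults: {
      blockStreamingDefault: "on" | "off";
      blockStreamingBreak?: "text_end" | "message_end";
    };
  };
  channels: { [channel: string]: { blockStreaming: boolean } };
}

// Hypothetical builder: defaults go under agents.defaults, overrides under channels.<channel>.
function configForChoice(choice: StreamingChoice, channel: string): SketchConfig {
  if (choice === "none") {
    return { agents: { defaults: { blockStreamingDefault: "off" } }, channels: {} };
  }
  const breakMode = choice === "chunks" ? "text_end" : "message_end";
  const channels: { [channel: string]: { blockStreaming: boolean } } = {};
  channels[channel] = { blockStreaming: true };
  return {
    agents: { defaults: { blockStreamingDefault: "on", blockStreamingBreak: breakMode } },
    channels: channels,
  };
}
```

For example, configForChoice("chunks", "discord") places the "on" and "text_end" values under agents.defaults and sets channels.discord.blockStreaming to true.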

Preview streaming modes

Canonical key: channels.<channel>.streaming

Modes:

off: disable preview streaming.
partial: single preview that is replaced with latest text.
block: preview updates in chunked/appended steps.
progress: progress/status preview during generation, final answer at completion.

streaming.mode: "block" is a preview-streaming mode for edit-capable channels such as Discord and Telegram. It does not enable channel block delivery there. Use streaming.block.enabled or the legacy blockStreaming channel key when you want normal block replies. Microsoft Teams is the exception: it has no draft-preview block transport, so streaming.mode: "block" maps to Teams block delivery instead of native partial/progress streaming.

Channel mapping

Channel	`off`	`partial`	`block`	`progress`
Telegram	✅	✅	✅	editable progress draft
Discord	✅	✅	✅	editable progress draft
Slack	✅	✅	✅	✅
Mattermost	✅	✅	✅	✅
MS Teams	✅	✅	✅	native progress stream

Slack-only:

channels.slack.streaming.nativeTransport toggles Slack native streaming API calls when channels.slack.streaming.mode="partial" (default: true).
Slack native streaming and Slack assistant thread status require a reply thread target. Top-level DMs do not show that thread-style preview, but they can still use Slack draft preview posts and edits.

Legacy key migration:

Telegram: legacy streamMode and scalar/boolean streaming values are detected and migrated by doctor/config compatibility paths to streaming.mode.
Discord: streamMode + boolean streaming remain runtime aliases for the streaming enum; run openclaw doctor --fix to rewrite persisted config.
Slack: streamMode remains a runtime alias for streaming.mode; boolean streaming remains a runtime alias for streaming.mode plus streaming.nativeTransport; legacy nativeStreaming remains a runtime alias for streaming.nativeTransport. Run openclaw doctor --fix to rewrite persisted config.

Runtime behavior

Telegram:

Uses sendMessage + editMessageText preview updates across DMs and group/topics.
Final text edits the active preview in place; long finals reuse that message for the first chunk and send only the remaining chunks.
progress mode keeps tool progress in an editable status draft, clears that draft at completion, and sends the final answer through normal delivery.
If the final edit fails before the completed text is confirmed, OpenClaw uses normal final delivery and cleans up the stale preview.
Preview streaming is skipped when Telegram block streaming is explicitly enabled (to avoid double-streaming).
/reasoning stream can write reasoning to a transient preview that is deleted after final delivery.

Discord:

Uses send + edit preview messages.
block mode uses draft chunking (draftChunk).
Preview streaming is skipped when Discord block streaming is explicitly enabled.
Final media, error, and explicit-reply payloads cancel pending previews without flushing a new draft, then use normal delivery.

Slack:

partial can use Slack native streaming (chat.startStream/append/stop) when available.
block uses append-style draft previews.
progress uses status preview text, then final answer.
Top-level DMs without a reply thread use draft preview posts and edits instead of Slack native streaming.
Native and draft preview streaming suppress block replies for that turn, so a Slack reply is streamed by one delivery path only.
Final media/error payloads and progress finals do not create throwaway draft messages; only text/block finals that can edit the preview flush pending draft text.

Mattermost:

Streams thinking, tool activity, and partial reply text into a single draft preview post that finalizes in place when the final answer is safe to send.
Falls back to sending a fresh final post if the preview post was deleted or is otherwise unavailable at finalize time.
Final media/error payloads cancel pending preview updates before normal delivery instead of flushing a temporary preview post.

Matrix:

Draft previews finalize in place when the final text can reuse the preview event.
Media-only, error, and reply-target-mismatch finals cancel pending preview updates before normal delivery; an already-visible stale preview is redacted.

Tool-progress preview updates

Preview streaming can also include tool-progress updates - short status lines like "searching the web", "reading file", or "calling tool" - that appear in the same preview message while tools are running, ahead of the final reply. In Codex app-server mode, Codex preamble/commentary messages use this same preview path, so short "I am checking..." progress notes can stream into the editable draft without becoming part of the final answer. This keeps multi-step tool turns visually alive rather than silent between the first thinking preview and the final answer.

Long-running tools may emit typed progress before they return. For example, web_fetch arms a five-second timer when it starts: if the fetch is still pending, the preview can show Fetching page content...; if the fetch finishes or is canceled before then, no progress line is emitted. The later final tool result is still delivered normally to the model.
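
The delayed-progress pattern described for web_fetch can be sketched with a cancellable timer. The names armDelayedProgress, ScheduleFn, and realSchedule are hypothetical and this is not the web_fetch implementation; the scheduler is injectable only so the behavior can be exercised without waiting five seconds.

```typescript
type CancelFn = () => void;
type ScheduleFn = (delayMs: number, callback: () => void) => CancelFn;

// Default scheduler backed by setTimeout/clearTimeout.
const realSchedule: ScheduleFn = (delayMs, callback) => {
  const handle = setTimeout(callback, delayMs);
  return () => clearTimeout(handle);
};

// Arms a timer when the tool starts; returns a function to call when the tool
// finishes or is canceled. The progress line is emitted only if the timer
// fires while the work is still pending.
function armDelayedProgress(
  emit: (line: string) => void,
  schedule: ScheduleFn = realSchedule,
  delayMs: number = 5000,
): () => void {
  let settled = false;
  const cancel = schedule(delayMs, () => {
    if (!settled) emit("Fetching page content...");
  });
  return () => {
    settled = true;
    cancel();
  };
}
```

A caller would invoke armDelayedProgress when the fetch starts and call the returned function in a finally block, so a fast fetch never produces a progress line.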

Supported surfaces:

Discord, Slack, Telegram, and Matrix stream tool-progress and Codex preamble updates into the live preview edit by default when preview streaming is active. Microsoft Teams uses its native progress stream in personal chats.
Telegram has shipped with tool-progress preview updates enabled since v2026.4.22; keeping them enabled preserves that released behavior.
Mattermost already folds tool activity into its single draft preview post (see above).
Tool-progress edits follow the active preview streaming mode; they are skipped when preview streaming is off or when block streaming has taken over the message. On Telegram, streaming.mode: "off" is final-only: generic progress chatter is also suppressed instead of being delivered as standalone status messages, while approval prompts, media payloads, and errors still route normally.
To keep preview streaming but hide tool-progress lines, set streaming.preview.toolProgress to false for that channel. To keep tool-progress lines visible while hiding command/exec text, set streaming.preview.commandText to "status" or streaming.progress.commandText to "status"; the default is "raw" to preserve released behavior. This policy is shared by draft/progress channels that use OpenClaw's compact progress renderer, including Discord, Matrix, Microsoft Teams, Mattermost, Slack draft previews, and Telegram. To disable preview edits entirely, set streaming.mode to off.
Telegram selected quote replies are an exception: when replyToMode is not "off" and selected quote text is present, OpenClaw skips the answer preview stream for that turn so tool-progress preview lines cannot render. Current-message replies without selected quote text still keep preview streaming. See Telegram channel docs for details.

Keep progress lines visible but hide raw command/exec text:

{
  "channels": {
    "telegram": {
      "streaming": {
        "mode": "partial",
        "preview": {
          "toolProgress": true,
          "commandText": "status"
        }
      }
    }
  }
}

Use the same shape under another compact progress channel key, for example channels.discord, channels.matrix, channels.msteams, channels.mattermost, or Slack draft previews. For progress-draft mode, put the same policy under streaming.progress:

{
  "channels": {
    "telegram": {
      "streaming": {
        "mode": "progress",
        "progress": {
          "toolProgress": true,
          "commandText": "status"
        }
      }
    }
  }
}

Related:

Message lifecycle refactor - target shared preview, edit, stream, and finalization design
Progress drafts - visible work-in-progress messages that update during long turns
Messages - message lifecycle and delivery
Retry - retry behavior on delivery failure
Channels - per-channel streaming support
