openclaw/docs/reference/rich-output-protocol.md
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00


summary: Rich output protocol for structured media, embeds, audio hints, and replies
read_when:
  • Changing assistant output rendering in the Control UI
  • Debugging `[embed ...]`, structured media, reply, or audio presentation directives
title: Rich output protocol

Assistant output can carry a small set of delivery/render directives:

  • structured mediaUrl / mediaUrls fields for attachment delivery
  • [[audio_as_voice]] for audio presentation hints
  • [[reply_to_current]] / [[reply_to:<id>]] for reply metadata
  • [embed ...] for Control UI rich rendering

Remote media attachments must be public https: URLs. Plain http: URLs and URLs whose host is loopback, link-local, private, or internal are ignored as attachment directives; server-side media fetchers still enforce their own network guards.
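The remote-URL rule above can be approximated with a small pre-check. This is a minimal sketch, not OpenClaw's validator: the function name `isLikelyPublicHttpsUrl` and the exact hostname heuristics are assumptions made for illustration, and it does not resolve DNS or cover every reserved range.

```typescript
// Hypothetical helper for illustration; not OpenClaw's actual validator.
function isLikelyPublicHttpsUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not an absolute URL
  }
  if (url.protocol !== "https:") {
    return false; // plain http: and every other scheme are rejected
  }
  const host = url.hostname.toLowerCase();

  // IPv6 literals keep their brackets in URL.hostname, e.g. "[::1]".
  if (host.startsWith("[")) {
    const v6 = host.slice(1, -1);
    const isLoopback = v6 === "::1";
    const isLinkLocal = /^fe[89ab][0-9a-f]:/.test(v6); // fe80::/10
    const isUniqueLocal = /^f[cd][0-9a-f]{2}:/.test(v6); // fc00::/7
    return !(isLoopback || isLinkLocal || isUniqueLocal);
  }

  // Dotted IPv4 literals: reject loopback, link-local, and private ranges.
  const v4 = /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/.exec(host);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    const blocked =
      a === 0 ||
      a === 10 ||
      a === 127 ||
      (a === 169 && b === 254) ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168);
    return !blocked;
  }

  // Assumed heuristic for "internal hostnames": single-label names and
  // a few well-known non-public suffixes.
  if (host === "localhost" || !host.includes(".")) {
    return false;
  }
  return ![".localhost", ".local", ".internal"].some((s) => host.endsWith(s));
}
```

A check like this only filters obvious cases before delivery; as the paragraph above notes, the server-side fetchers remain the authoritative guard.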

Local media attachments can use absolute paths, workspace-relative paths, or home-relative ~/ paths. They still pass through the agent file-read policy and media type checks before delivery.
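The three accepted local path forms can be sketched as a resolver. The function names and the `workspaceDir` / `homeDir` parameters are hypothetical, the code assumes POSIX-style paths, and it deliberately does not normalize `..` segments or apply the file-read policy and media type checks, which still run afterwards.

```typescript
// Hypothetical helpers for illustration; POSIX-style paths only.
function joinPosix(base: string, rel: string): string {
  // Join with exactly one separator between the two parts.
  return base.replace(/\/+$/, "") + "/" + rel.replace(/^\/+/, "");
}

function resolveLocalMediaPath(
  input: string,
  workspaceDir: string,
  homeDir: string,
): string {
  if (input.startsWith("~/")) {
    return joinPosix(homeDir, input.slice(2)); // home-relative
  }
  if (input.startsWith("/")) {
    return input; // already absolute
  }
  return joinPosix(workspaceDir, input); // workspace-relative
}
```

For example, `resolveLocalMediaPath("image.png", "/workspace", "/home/user")` yields `/workspace/image.png`.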

Do not emit text commands for attachments from tools, plugins, streaming blocks, browser output, or message actions. Use structured media fields instead.

Valid message-tool payload:

{ "message": "Here is your image.", "mediaUrl": "/workspace/image.png" }
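A tool or plugin might build such a payload as in the sketch below. The `message`, `mediaUrl`, and `mediaUrls` field names come from this page; the `MessageToolPayload` type, the `buildMessagePayload` helper, and the choice to use `mediaUrl` for a single attachment and `mediaUrls` for several are assumptions for illustration.

```typescript
// Assumed shape: only the three field names are taken from the docs.
interface MessageToolPayload {
  message: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

// Hypothetical helper: attach media through structured fields,
// never through text commands embedded in `message`.
function buildMessagePayload(
  message: string,
  attachments: string[],
): MessageToolPayload {
  const payload: MessageToolPayload = { message };
  if (attachments.length === 1) {
    payload.mediaUrl = attachments[0];
  } else if (attachments.length > 1) {
    payload.mediaUrls = attachments.slice();
  }
  return payload;
}

const examplePayload = buildMessagePayload("Here is your image.", [
  "/workspace/image.png",
]);
console.log(JSON.stringify(examplePayload));
```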

Legacy final assistant reply text may still be normalized for compatibility, but it is not a general plugin/tool protocol.

Plain Markdown image syntax stays text by default. Channels that intentionally map Markdown image replies to media attachments opt in at their outbound adapter; Telegram does this so ![alt](url) can still become a media reply.

These directives are independent of one another. Structured media fields and reply/voice tags are delivery metadata; [embed ...] is the web-only rich render path.

When block streaming is enabled, media must be carried on structured payload fields. If the same media URL is sent in a streamed block and repeated in the final assistant payload, OpenClaw delivers the attachment once and strips the duplicate from the final payload.
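The de-duplication described above can be sketched as follows. This is an illustrative model, not OpenClaw's implementation: the `MediaPayload` type, its `text` field, the function names, and exact-string URL comparison are all assumptions.

```typescript
// Assumed payload shape; `text` is a hypothetical name for the visible reply.
interface MediaPayload {
  text?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

function collectMediaUrls(payload: MediaPayload): string[] {
  const urls: string[] = [];
  if (payload.mediaUrl !== undefined) {
    urls.push(payload.mediaUrl);
  }
  for (const url of payload.mediaUrls ?? []) {
    urls.push(url);
  }
  return urls;
}

// Returns a copy of the final payload without media that a streamed
// block already delivered. URLs are compared as exact strings (assumption).
function stripAlreadyStreamedMedia(
  streamed: MediaPayload[],
  finalPayload: MediaPayload,
): MediaPayload {
  const delivered = new Set<string>();
  for (const block of streamed) {
    for (const url of collectMediaUrls(block)) {
      delivered.add(url);
    }
  }
  const remaining: string[] = [];
  for (const url of collectMediaUrls(finalPayload)) {
    if (!delivered.has(url)) {
      delivered.add(url); // also collapses repeats inside the final payload
      remaining.push(url);
    }
  }
  const result: MediaPayload = {};
  if (finalPayload.text !== undefined) {
    result.text = finalPayload.text;
  }
  if (remaining.length > 0) {
    result.mediaUrls = remaining;
  }
  return result;
}
```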

[embed ...]

[embed ...] is the only agent-facing rich render syntax for the Control UI.

Self-closing example:

[embed ref="cv_123" title="Status" /]

Rules:

  • [view ...] is no longer valid for new output.
  • Embed shortcodes render in the assistant message surface only.
  • Only URL-backed embeds are rendered. Use ref="..." or url="...".
  • Block-form inline HTML embed shortcodes are not rendered.
  • The web UI strips the shortcode from visible text and renders the embed inline.
  • Structured media is not an embed alias and should not be used for rich embed rendering.
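The rules above can be illustrated with a small parser for the self-closing form. This is a sketch under stated assumptions, not the Control UI's parser: `extractEmbeds` and `EmbedDirective` are hypothetical names, attribute values are assumed to be double-quoted and free of `]`, and leaving a shortcode without `ref` or `url` in the text is an assumed behavior.

```typescript
// Hypothetical types and parser for illustration only.
interface EmbedDirective {
  ref?: string;
  url?: string;
  title?: string;
}

function extractEmbeds(text: string): { text: string; embeds: EmbedDirective[] } {
  const embeds: EmbedDirective[] = [];
  // Matches self-closing shortcodes such as [embed ref="cv_123" title="Status" /].
  const embedRe = /\[embed\s+([^\]]*?)\s*\/\]/g;
  const stripped = text.replace(embedRe, (match: string, attrText: string) => {
    const attrs: Record<string, string> = {};
    const attrRe = /([A-Za-z_][\w-]*)="([^"]*)"/g;
    let m: RegExpExecArray | null;
    while ((m = attrRe.exec(attrText)) !== null) {
      attrs[m[1]] = m[2];
    }
    const ref = attrs["ref"];
    const url = attrs["url"];
    if (ref === undefined && url === undefined) {
      return match; // not URL-backed: left as plain text (assumption)
    }
    const embed: EmbedDirective = {};
    if (ref !== undefined) embed.ref = ref;
    if (url !== undefined) embed.url = url;
    if (attrs["title"] !== undefined) embed.title = attrs["title"];
    embeds.push(embed);
    return ""; // strip the shortcode from the visible text
  });
  return { text: stripped, embeds };
}
```

Because the pattern only matches `[embed ... /]`, a legacy `[view ...]` shortcode passes through as ordinary text in this sketch.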

Stored rendering shape

The normalized/stored assistant content block is a structured canvas item:

{
  "type": "canvas",
  "preview": {
    "kind": "canvas",
    "surface": "assistant_message",
    "render": "url",
    "viewId": "cv_123",
    "url": "/__openclaw__/canvas/documents/cv_123/index.html",
    "title": "Status",
    "preferredHeight": 320
  }
}

Stored/rendered rich blocks use this canvas shape directly. present_view is not recognized.
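For consumers that read stored content, the shape above might be modeled and checked as in this sketch. The field names and literal values come from the example; which fields are optional, and the names `CanvasBlock`, `CanvasPreview`, and `isCanvasBlock`, are assumptions for illustration.

```typescript
// Assumed typing of the stored block; optionality is a guess.
interface CanvasPreview {
  kind: "canvas";
  surface: "assistant_message";
  render: "url";
  url: string;
  viewId?: string;
  title?: string;
  preferredHeight?: number;
}

interface CanvasBlock {
  type: "canvas";
  preview: CanvasPreview;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical type guard: accepts only the canvas shape, so blocks
// tagged with other types (for example "present_view") are rejected.
function isCanvasBlock(value: unknown): value is CanvasBlock {
  if (!isRecord(value) || value["type"] !== "canvas") {
    return false;
  }
  const preview = value["preview"];
  if (!isRecord(preview)) {
    return false;
  }
  return (
    preview["kind"] === "canvas" &&
    preview["surface"] === "assistant_message" &&
    preview["render"] === "url" &&
    typeof preview["url"] === "string"
  );
}
```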