mirror of
https://github.com/openclaw/openclaw.git
synced 2026-06-05 18:42:53 +00:00
* feat(browser): add optional vision understanding to screenshot tool
* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles
* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support
* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage
* style(browser): add curly braces to satisfy eslint curly rule
* fix(browser): correct tools.browser.enabled help text to match actual behavior
* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision
* refactor(browser): move vision config from tools.browser to browser.models
The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.
- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape
* docs(browser): add screenshot vision configuration section
Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.
* fix(browser): remove deliverable media markers from vision result, drop unused import
P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.
P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.
* fix(browser): add command/args fields to browser models schema
The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.
* chore(browser): remove debug console.log statements
* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback
ClawSweeper #84247 review round 2:
P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
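The neutralization described above can be sketched roughly as follows. This is an illustration only, not the code in vision.ts: the fast-path check and the split on "\n" are assumptions, and only the '[neutralized] ' prefix and the case-insensitive, leading-whitespace-tolerant match are taken from the description.

```typescript
// Sketch only: an assumed shape for neutralizeMediaDirectives, not the repository code.
const LINE_START_MEDIA = /^\s*MEDIA:/i;

function neutralizeMediaDirectives(text: string): string {
  // Fast path (assumption): return the input untouched when "MEDIA:" never appears.
  if (!/MEDIA:/i.test(text)) {
    return text;
  }
  return text
    .split("\n")
    // Prefix only lines that start with the directive after optional whitespace;
    // a mid-line "MEDIA:" is left alone.
    .map((line) => (LINE_START_MEDIA.test(line) ? `[neutralized] ${line}` : line))
    .join("\n");
}

const sample = "Page summary\n  media:/tmp/secret.png\nsee MEDIA: mid-line";
console.log(neutralizeMediaDirectives(sample));
```

Prefixing the whole line keeps the original path readable while ensuring no line still matches /^\s*MEDIA:/i.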
P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.
Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
  - 'neutralizes MEDIA: directives in vision text and does not attach media'
    asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
    is preserved verbatim, details.media is absent, and imageResultFromFile
    is not called on the success path.
  - 'preserves screenshot image sanitization on vision failure fallback'
    mocks describeImageFileWithModel to reject and asserts the fallback
    imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
    plus the 'browser screenshot vision failed' extraText.
* fix(browser): apply clawsweeper fallback media fix from PR #84247
* refactor: reuse media image understanding for browser screenshots
* refactor: use structured media delivery
* test: update music completion media instruction expectation
* fix: trim buffered reply directive padding
* test: refresh codex prompt snapshots for message media aliases
---------
Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
94 lines
3.1 KiB
Markdown
---
summary: "Rich output protocol for structured media, embeds, audio hints, and replies"
read_when:
  - Changing assistant output rendering in the Control UI
  - Debugging `[embed ...]`, structured media, reply, or audio presentation directives
title: "Rich output protocol"
---

Assistant output can carry a small set of delivery/render directives:

- structured `mediaUrl` / `mediaUrls` fields for attachment delivery
- `[[audio_as_voice]]` for audio presentation hints
- `[[reply_to_current]]` / `[[reply_to:<id>]]` for reply metadata
- `[embed ...]` for Control UI rich rendering

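As a rough illustration of the reply and voice tags listed above, the sketch below pulls them out of a text reply with regular expressions. The `extractDeliveryTags` helper, its return shape, and the choice to strip the tags from the text are all assumptions made for this example; only the tag spellings come from this page.

```typescript
// Hypothetical parser for illustration; OpenClaw's real tag handling may differ.
interface DeliveryTags {
  text: string; // input with recognized tags removed (assumed behavior)
  audioAsVoice: boolean; // [[audio_as_voice]] was present
  replyToCurrent: boolean; // [[reply_to_current]] was present
  replyToId: string | undefined; // id from [[reply_to:<id>]], if any
}

function extractDeliveryTags(input: string): DeliveryTags {
  let audioAsVoice = false;
  let replyToCurrent = false;
  let replyToId: string | undefined;
  const text = input
    .replace(/\[\[audio_as_voice\]\]/g, () => {
      audioAsVoice = true;
      return "";
    })
    .replace(/\[\[reply_to_current\]\]/g, () => {
      replyToCurrent = true;
      return "";
    })
    .replace(/\[\[reply_to:([^\]]+)\]\]/g, (_match: string, id: string) => {
      replyToId = id.trim();
      return "";
    })
    .trim();
  return { text, audioAsVoice, replyToCurrent, replyToId };
}

console.log(extractDeliveryTags("[[reply_to:msg-42]] Done. [[audio_as_voice]]"));
```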
Remote media attachments must be public `https:` URLs. Plain `http:`,
loopback, link-local, private, and internal hostnames are ignored as attachment
directives; server-side media fetchers still enforce their own network guards.

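The remote URL rule above can be approximated with a check like the one below. It is a sketch under stated assumptions, not OpenClaw's guard: `isPublicHttpsUrl` and `isPrivateIPv4` are hypothetical names, the list of "internal" hostname suffixes is a guess, and the IPv6 handling is deliberately coarse.

```typescript
// Hypothetical check for illustration; OpenClaw's real guard may differ.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) {
    return false;
  }
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || // "this network"
    a === 10 || // 10.0.0.0/8 private
    a === 127 || // loopback
    (a === 169 && b === 254) || // link-local
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 private
    (a === 192 && b === 168) // 192.168.0.0/16 private
  );
}

function isPublicHttpsUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not an absolute URL (for example a local path)
  }
  if (url.protocol !== "https:") {
    return false; // rejects plain http: and every other scheme
  }
  const host = url.hostname.toLowerCase();
  if (host.startsWith("[")) {
    // URL.hostname keeps the brackets around IPv6 literals.
    const v6 = host.slice(1, -1);
    const blocked =
      v6 === "::1" || // loopback
      v6 === "::" || // unspecified
      v6.startsWith("::ffff:") || // IPv4-mapped, rejected wholesale in this sketch
      /^fe[89ab][0-9a-f]:/.test(v6) || // fe80::/10 link-local
      /^f[cd][0-9a-f]{2}:/.test(v6); // fc00::/7 unique-local
    return !blocked;
  }
  // Assumed notion of "internal" hostnames: localhost, common private suffixes,
  // and single-label names.
  if (
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host.endsWith(".local") ||
    host.endsWith(".internal") ||
    !host.includes(".")
  ) {
    return false;
  }
  return !isPrivateIPv4(host);
}

console.log(isPublicHttpsUrl("https://example.com/a.png"));
console.log(isPublicHttpsUrl("http://example.com/a.png"));
```

A hostname check like this cannot see what a public-looking name resolves to, which is one reason the server-side fetchers keep their own network guards.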
Local media attachments can use absolute paths, workspace-relative paths, or
home-relative `~/` paths. They still pass through the agent file-read policy and
media type checks before delivery.

<Warning>
Do not emit text commands for attachments from tools, plugins, streaming blocks,
browser output, or message actions. Use structured media fields instead.

Valid message-tool payload:

```json
{ "message": "Here is your image.", "mediaUrl": "/workspace/image.png" }
```

Legacy final assistant reply text may still be normalized for compatibility, but
it is not a general plugin/tool protocol.
</Warning>

Plain Markdown image syntax stays text by default. Channels that intentionally
map Markdown image replies to media attachments opt in at their outbound
adapter; Telegram does this so `![alt](url)` can still become a media reply.

These directives are separate. Structured media fields and reply/voice tags are
delivery metadata; `[embed ...]` is the web-only rich render path.

When block streaming is enabled, media must be carried on structured payload
fields. If the same media URL is sent in a streamed block and repeated in the
final assistant payload, OpenClaw delivers the attachment once and strips the
duplicate from the final payload.

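The de-duplication described above might look something like the following sketch. The `ReplyPayload` shape reuses the `message`, `mediaUrl`, and `mediaUrls` field names from this page, but `stripAlreadyDelivered` and the idea of tracking delivered URLs in a `Set` are assumptions for illustration, not OpenClaw's implementation.

```typescript
// Hypothetical helper; only the field names come from this page.
interface ReplyPayload {
  message?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

function stripAlreadyDelivered(
  finalPayload: ReplyPayload,
  delivered: ReadonlySet<string>,
): ReplyPayload {
  // Collect every media URL the final payload mentions, in order.
  const all = [
    ...(finalPayload.mediaUrl ? [finalPayload.mediaUrl] : []),
    ...(finalPayload.mediaUrls ?? []),
  ];
  // Drop repeats within the payload, then drop anything a streamed block already sent.
  const remaining = Array.from(new Set(all)).filter((url) => !delivered.has(url));
  const result: ReplyPayload = {};
  if (finalPayload.message !== undefined) {
    result.message = finalPayload.message;
  }
  if (remaining.length > 0) {
    result.mediaUrls = remaining;
  }
  return result;
}

const alreadySent = new Set(["/workspace/chart.png"]);
console.log(
  stripAlreadyDelivered(
    { message: "Done", mediaUrls: ["/workspace/chart.png", "/workspace/table.png"] },
    alreadySent,
  ),
);
```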
## `[embed ...]`

`[embed ...]` is the only agent-facing rich render syntax for the Control UI.

Self-closing example:

```text
[embed ref="cv_123" title="Status" /]
```

Rules:

- `[view ...]` is no longer valid for new output.
- Embed shortcodes render in the assistant message surface only.
- Only URL-backed embeds are rendered. Use `ref="..."` or `url="..."`.
- Block-form inline HTML embed shortcodes are not rendered.
- The web UI strips the shortcode from visible text and renders the embed inline.
- Structured media is not an embed alias and should not be used for rich embed rendering.

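To make the rules above concrete, here is a minimal sketch of extracting self-closing embed shortcodes and stripping them from the visible text. The `extractEmbeds` helper and the attribute grammar (double-quoted `key="value"` pairs) are inferred from the single example on this page and are assumptions, as is leaving shortcodes without `ref` or `url` untouched.

```typescript
// Hypothetical extractor; the attribute grammar is inferred from the example, not a spec.
interface EmbedShortcode {
  ref: string | undefined;
  url: string | undefined;
  title: string | undefined;
}

function extractEmbeds(text: string): { visibleText: string; embeds: EmbedShortcode[] } {
  const embeds: EmbedShortcode[] = [];
  const visibleText = text
    .replace(/\[embed\s+([^\]]*?)\s*\/\]/g, (match: string, attrText: string) => {
      const attrs: Record<string, string> = {};
      const attrPattern = /(\w+)="([^"]*)"/g;
      let m: RegExpExecArray | null;
      while ((m = attrPattern.exec(attrText)) !== null) {
        attrs[m[1]] = m[2];
      }
      // Only URL-backed embeds are rendered: require ref or url.
      if (!attrs["ref"] && !attrs["url"]) {
        return match; // assumption: leave unrecognized shortcodes as plain text
      }
      embeds.push({ ref: attrs["ref"], url: attrs["url"], title: attrs["title"] });
      return ""; // strip the shortcode from the visible text
    })
    .trim();
  return { visibleText, embeds };
}

console.log(extractEmbeds('Build finished. [embed ref="cv_123" title="Status" /]'));
```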
## Stored rendering shape

The normalized/stored assistant content block is a structured `canvas` item:

```json
{
  "type": "canvas",
  "preview": {
    "kind": "canvas",
    "surface": "assistant_message",
    "render": "url",
    "viewId": "cv_123",
    "url": "/__openclaw__/canvas/documents/cv_123/index.html",
    "title": "Status",
    "preferredHeight": 320
  }
}
```

Stored/rendered rich blocks use this `canvas` shape directly. `present_view` is not recognized.

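For readers consuming stored content, the shape above could be modeled with types like these. The field names mirror the JSON example; which fields are optional, the literal-typed values, and the `isCanvasBlock` guard are assumptions for illustration rather than the project's actual type definitions.

```typescript
// Field names mirror the JSON example; optionality and the guard are assumptions.
interface CanvasPreview {
  kind: "canvas";
  surface: "assistant_message";
  render: "url";
  viewId: string;
  url: string;
  title?: string;
  preferredHeight?: number;
}

interface CanvasBlock {
  type: "canvas";
  preview: CanvasPreview;
}

function isCanvasBlock(value: unknown): value is CanvasBlock {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const block = value as { type?: unknown; preview?: unknown };
  if (block.type !== "canvas" || typeof block.preview !== "object" || block.preview === null) {
    return false;
  }
  const preview = block.preview as Record<string, unknown>;
  return (
    preview["kind"] === "canvas" &&
    preview["surface"] === "assistant_message" &&
    preview["render"] === "url" &&
    typeof preview["viewId"] === "string" &&
    typeof preview["url"] === "string"
  );
}

const stored: unknown = JSON.parse(
  '{"type":"canvas","preview":{"kind":"canvas","surface":"assistant_message","render":"url","viewId":"cv_123","url":"/__openclaw__/canvas/documents/cv_123/index.html","title":"Status","preferredHeight":320}}',
);
console.log(isCanvasBlock(stored));
```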
## Related

- [RPC adapters](/reference/rpc)
- [Typebox](/concepts/typebox)