openclaw/extensions/canvas/src/cli.test.ts
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape
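
As a rough illustration of the move described above, the sketch below reads vision settings from a top-level `browser` namespace. Only the names `browser.models`, `browser.visionEnabled`, `browser.visionPrompt` and `getBrowserVisionConfig` come from this commit message. The entry fields, the array shape of `models`, the return type and the enable/disable semantics are assumptions made for the example, and the function is suffixed `Sketch` to mark it as hypothetical.

```typescript
// Hypothetical shapes. Only `models`, `visionEnabled` and `visionPrompt`
// are named in the commit message; every other field is an assumption.
interface BrowserVisionModelEntry {
  provider?: string; // assumed field
  model?: string; // assumed field
  command?: string; // CLI-type entry, per the later schema fix
  args?: string[];
}

interface BrowserConfig {
  models?: BrowserVisionModelEntry[]; // assumed to be a list
  visionEnabled?: boolean;
  visionPrompt?: string;
}

interface AppConfig {
  browser?: BrowserConfig; // plugin-level settings live here, not under tools.browser
}

interface BrowserVisionConfigSketch {
  models: BrowserVisionModelEntry[];
  prompt?: string;
}

// Assumed semantics: vision is active when at least one model is configured
// and visionEnabled has not been set to false.
function getBrowserVisionConfigSketch(cfg: AppConfig): BrowserVisionConfigSketch | undefined {
  const browser = cfg.browser;
  const models = browser?.models ?? [];
  if (models.length === 0 || browser?.visionEnabled === false) {
    return undefined;
  }
  return { models, prompt: browser?.visionPrompt };
}

const visionConfig = getBrowserVisionConfigSketch({
  browser: {
    models: [{ provider: "example-provider", model: "example-vision-model" }],
    visionPrompt: "Describe the visible page content.",
  },
});
console.log(visionConfig?.models.length);
```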

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
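
The neutralization rule described above can be sketched as follows. This is a minimal reconstruction from the description, not the code in vision.ts: the function name carries a `Sketch` suffix, and the fast-path check and the newline handling are assumptions.

```typescript
// Illustrative reconstruction of the rule described in the commit message.
// The actual neutralizeMediaDirectives in vision.ts may differ in detail.
const NEUTRALIZED_PREFIX = "[neutralized] ";

function neutralizeMediaDirectivesSketch(text: string): string {
  // Fast path (assumed): no "MEDIA:" anywhere, so return the input unchanged.
  if (!/media:/i.test(text)) {
    return text;
  }
  return text
    .split("\n")
    .map((line) =>
      // Only lines whose first non-whitespace characters are MEDIA: are
      // touched; a mid-line MEDIA: is not a parser anchor and is left alone.
      line.trimStart().toLowerCase().startsWith("media:")
        ? `${NEUTRALIZED_PREFIX}${line}`
        : line,
    )
    .join("\n");
}

const description = "Page shows a login form.\n  MEDIA:/tmp/secret.png\nSee MEDIA: docs.";
console.log(neutralizeMediaDirectivesSketch(description));
```

Because the prefix is prepended to the whole line, the original path text stays readable while no line begins with optional whitespace followed by `MEDIA:`.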

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.
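
A minimal sketch of the fallback behaviour described above, with the collaborators injected so the example stays self-contained. The names `imageResultFromFile`, `imageSanitization`, `maxDimensionPx` and the 'browser screenshot vision failed' text appear in this commit message; the interfaces, the dependency object and the exact `extraText` format are assumptions, and external-content wrapping of the description is omitted.

```typescript
interface ImageSanitization {
  maxDimensionPx?: number;
}

interface ImageResultOptions {
  path: string;
  extraText?: string;
  imageSanitization?: ImageSanitization;
}

interface ToolResult {
  text: string;
  imagePath?: string;
}

// Hypothetical dependency bundle standing in for the real modules.
interface ScreenshotVisionDeps {
  describeImage: (path: string) => Promise<string>;
  imageResultFromFile: (opts: ImageResultOptions) => Promise<ToolResult>;
  resolveImageSanitization: () => ImageSanitization;
}

async function screenshotWithVisionSketch(
  path: string,
  deps: ScreenshotVisionDeps,
): Promise<ToolResult> {
  try {
    // Success path: return text only, with no image attached.
    return { text: await deps.describeImage(path) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Failure path: fall back to the image, but still pass the configured
    // sanitization so the size limit is not silently bypassed.
    return deps.imageResultFromFile({
      path,
      extraText: `browser screenshot vision failed: ${reason}`,
      imageSanitization: deps.resolveImageSanitization(),
    });
  }
}
```

Passing the sanitization option in the `catch` branch is the point of the fix: a provider timeout then yields the same downscaled image as the non-vision path.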

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00

import { Command } from "commander";
import { describe, expect, it, vi } from "vitest";
import { registerNodesCanvasCommands, type CanvasCliDependencies } from "./cli.js";

function createCanvasCliDeps() {
  const writtenFiles: Array<{ filePath: string; base64: string }> = [];
  const runtime = {
    log: vi.fn(),
    error: vi.fn(),
    exit: vi.fn((code: number) => {
      throw new Error(`exit ${code}`);
    }),
    writeJson: vi.fn(),
  };
  const deps: CanvasCliDependencies = {
    defaultRuntime: runtime,
    nodesCallOpts: (cmd) =>
      cmd
        .option("--url <url>", "Gateway WebSocket URL")
        .option("--token <token>", "Gateway token")
        .option("--timeout <ms>", "Timeout in ms", "10000")
        .option("--json", "Output JSON", false),
    runNodesCommand: async (_label, action) => {
      await action();
    },
    getNodesTheme: () => ({ ok: (value) => value }),
    parseTimeoutMs: (raw) => (typeof raw === "string" ? Number.parseInt(raw, 10) : undefined),
    resolveNodeId: async (opts) => opts.node ?? "ios-node",
    buildNodeInvokeParams: ({ nodeId, command, params, timeoutMs }) => ({
      nodeId,
      command,
      params,
      ...(typeof timeoutMs === "number" ? { timeoutMs } : {}),
    }),
    callGatewayCli: vi.fn(async () => ({
      payload: {
        format: "png",
        base64: "aGk=",
      },
    })),
    writeBase64ToFile: async (filePath, base64) => {
      writtenFiles.push({ filePath, base64 });
    },
    shortenHomePath: (filePath) => filePath,
  };
  return { deps, runtime, writtenFiles };
}

describe("canvas CLI", () => {
  it("registers under nodes and captures a snapshot media path", async () => {
    const program = new Command();
    program.exitOverride();
    const nodes = program.command("nodes");
    const { deps, runtime, writtenFiles } = createCanvasCliDeps();
    registerNodesCanvasCommands(nodes, deps);
    await program.parseAsync(["nodes", "canvas", "snapshot", "--node", "ios-node"], {
      from: "user",
    });
    expect(deps.callGatewayCli).toHaveBeenCalledTimes(1);
    expect(deps.callGatewayCli).toHaveBeenCalledWith(
      "node.invoke",
      {
        node: "ios-node",
        format: "jpg",
        timeout: "10000",
        json: false,
        invokeTimeout: "20000",
      },
      {
        nodeId: "ios-node",
        command: "canvas.snapshot",
        params: {
          format: "jpeg",
          maxWidth: undefined,
          quality: undefined,
        },
        timeoutMs: 20000,
      },
    );
    expect(writtenFiles).toHaveLength(1);
    const [writtenFile] = writtenFiles;
    if (!writtenFile) {
      throw new Error("Expected canvas snapshot file");
    }
    expect(writtenFile.filePath).toMatch(/openclaw-canvas-snapshot-.*\.png$/);
    expect(writtenFile.base64).toBe("aGk=");
    expect(runtime.log).toHaveBeenCalledTimes(1);
    const savedPath = runtime.log.mock.calls[0]?.[0];
    expect(savedPath?.startsWith("MEDIA:")).toBe(false);
    expect(savedPath?.endsWith(".png")).toBe(true);
  });

  it("rejects node-controlled snapshot formats before writing", async () => {
    const program = new Command();
    program.exitOverride();
    const nodes = program.command("nodes");
    const { deps, writtenFiles } = createCanvasCliDeps();
    vi.mocked(deps.callGatewayCli).mockResolvedValueOnce({
      payload: {
        format: "/../../target.sh",
        base64: "aGk=",
      },
    });
    registerNodesCanvasCommands(nodes, deps);
    await expect(
      program.parseAsync(["nodes", "canvas", "snapshot", "--node", "ios-node"], {
        from: "user",
      }),
    ).rejects.toThrow(/invalid canvas\.snapshot payload/i);
    expect(writtenFiles).toHaveLength(0);
  });

  it("rejects unsupported snapshot formats before invoking the node", async () => {
    const program = new Command();
    program.exitOverride();
    const nodes = program.command("nodes");
    const { deps, writtenFiles } = createCanvasCliDeps();
    registerNodesCanvasCommands(nodes, deps);
    await expect(
      program.parseAsync(["nodes", "canvas", "snapshot", "--node", "ios-node", "--format", "gif"], {
        from: "user",
      }),
    ).rejects.toThrow(/invalid format: gif/i);
    expect(deps.callGatewayCli).not.toHaveBeenCalled();
    expect(writtenFiles).toHaveLength(0);
  });

  it.each([
    ["--max-width", "640px", "--max-width must be a positive integer."],
    ["--quality", "0.8x", "--quality must be a number."],
  ])("rejects partial numeric snapshot %s values", async (flag, value, message) => {
    const program = new Command();
    program.exitOverride();
    const nodes = program.command("nodes");
    const { deps } = createCanvasCliDeps();
    registerNodesCanvasCommands(nodes, deps);
    await expect(
      program.parseAsync(["nodes", "canvas", "snapshot", "--node", "ios-node", flag, value], {
        from: "user",
      }),
    ).rejects.toThrow(message);
    expect(deps.callGatewayCli).not.toHaveBeenCalled();
  });

  it.each([
    ["--x", "1x"],
    ["--y", "2px"],
    ["--width", "800wide"],
    ["--height", "600tall"],
  ])("rejects partial numeric present %s values", async (flag, value) => {
    const program = new Command();
    program.exitOverride();
    const nodes = program.command("nodes");
    const { deps } = createCanvasCliDeps();
    registerNodesCanvasCommands(nodes, deps);
    await expect(
      program.parseAsync(["nodes", "canvas", "present", "--node", "ios-node", flag, value], {
        from: "user",
      }),
    ).rejects.toThrow(`${flag} must be a number.`);
    expect(deps.callGatewayCli).not.toHaveBeenCalled();
  });
});
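
The tests above imply two kinds of input validation in the CLI: numeric flags must parse in full (so "640px" is an error rather than 640), and a node-supplied snapshot format must not be used as a file extension unless it is recognised. The sketch below shows one way such checks could look. The helper names and the allowlist contents are assumptions; only the error messages are taken from the test expectations, and the real cli.ts may be structured differently.

```typescript
// Assumed allowlist: the tests only show that "png" is accepted and that
// path-like strings such as "/../../target.sh" are rejected.
const ALLOWED_SNAPSHOT_FORMATS = new Set(["png", "jpg", "jpeg"]);

// Accepts plain decimal numbers only. Number.parseInt("640px", 10) would
// return 640, which is the partial parse the tests guard against.
function parseStrictNumber(raw: string, flag: string): number {
  const trimmed = raw.trim();
  if (!/^-?\d+(\.\d+)?$/.test(trimmed)) {
    throw new Error(`${flag} must be a number.`);
  }
  return Number(trimmed);
}

function parsePositiveInteger(raw: string, flag: string): number {
  const trimmed = raw.trim();
  const value = /^\d+$/.test(trimmed) ? Number(trimmed) : Number.NaN;
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`${flag} must be a positive integer.`);
  }
  return value;
}

// Validates the format reported by the node before it becomes part of a
// file name, so a hostile payload cannot smuggle in path segments.
function snapshotExtension(format: unknown): string {
  if (typeof format !== "string" || !ALLOWED_SNAPSHOT_FORMATS.has(format.toLowerCase())) {
    throw new Error("invalid canvas.snapshot payload");
  }
  const normalized = format.toLowerCase();
  return normalized === "jpeg" ? "jpg" : normalized;
}

console.log(parsePositiveInteger("640", "--max-width"), snapshotExtension("png"));
```

An allowlist is used here instead of stripping path characters because it fails closed: any format the CLI does not recognise is rejected before a file is written.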