openclaw/src/auto-reply/reply/reply-delivery.test.ts
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
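
The neutralization rule described above can be sketched roughly as follows. This is a minimal illustration written from the description in this message, not the actual code in vision.ts: the function body, the fast-path check, and the splitting on "\n" are assumptions, and only the '[neutralized] ' prefix and the case-insensitive line-start match come from the text above.

```typescript
// Hypothetical sketch of the helper described above; the real
// implementation in vision.ts may differ in its details.
const NEUTRALIZED_PREFIX = "[neutralized] ";

function neutralizeMediaDirectives(text: string): string {
  // Fast path (assumed): no "media:" anywhere, so nothing to defang.
  if (!/media:/i.test(text)) {
    return text;
  }
  return text
    .split("\n")
    .map((line) =>
      // Only lines that start with MEDIA: after leading whitespace are
      // prefixed; a mid-line "MEDIA:" is left untouched.
      /^media:/i.test(line.trimStart()) ? `${NEUTRALIZED_PREFIX}${line}` : line,
    )
    .join("\n");
}

const defanged = neutralizeMediaDirectives("Page summary\nMEDIA:/tmp/secret.png");
console.log(defanged);
```

Prefixing the line, rather than deleting it, keeps the original text readable while moving MEDIA: away from the line start that the parser anchors on.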

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00


import path from "node:path";
import { describe, expect, it, vi } from "vitest";
import { getReplyPayloadMetadata, setReplyPayloadMetadata } from "../reply-payload.js";
import { createBlockReplyContentKey } from "./block-reply-pipeline.js";
import {
  createBlockReplyDeliveryHandler,
  normalizeReplyPayloadDirectives,
} from "./reply-delivery.js";
import type { TypingSignaler } from "./typing-mode.js";

type BlockReplyPipelineLike = NonNullable<
  Parameters<typeof createBlockReplyDeliveryHandler>[0]["blockReplyPipeline"]
>;

describe("createBlockReplyDeliveryHandler", () => {
  it("sends captioned media-bearing block replies when block streaming is disabled", async () => {
    const onBlockReply = vi.fn(async () => {});
    const normalizeStreamingText = vi.fn((payload: { text?: string }) => ({
      text: payload.text,
      skip: false,
    }));
    const directlySentBlockKeys = new Set<string>();
    const typingSignals = {
      signalTextDelta: vi.fn(async () => {}),
    } as unknown as TypingSignaler;
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply,
      normalizeStreamingText,
      applyReplyToMode: (payload) => payload,
      typingSignals,
      blockStreamingEnabled: false,
      blockReplyPipeline: null,
      directlySentBlockKeys,
    });
    await handler({
      text: "here's the vibe",
      mediaUrls: ["/tmp/generated.png"],
      replyToCurrent: true,
    });
    const expectedPayload = {
      text: "here's the vibe",
      mediaUrl: "/tmp/generated.png",
      mediaUrls: ["/tmp/generated.png"],
      replyToCurrent: true,
      replyToId: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
    };
    expect(onBlockReply).toHaveBeenCalledWith(expectedPayload);
    expect(directlySentBlockKeys).toEqual(new Set([createBlockReplyContentKey(expectedPayload)]));
    expect(typingSignals.signalTextDelta).toHaveBeenCalledWith("here's the vibe");
  });

  it("sends captioned audio-as-voice block replies when block streaming is disabled", async () => {
    const onBlockReply = vi.fn(async () => {});
    const directlySentBlockKeys = new Set<string>();
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply,
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: false,
      blockReplyPipeline: null,
      directlySentBlockKeys,
    });
    await handler({
      text: "spoken confirmation",
      mediaUrls: ["/tmp/voice.opus"],
      audioAsVoice: true,
    });
    const expectedPayload = {
      text: "spoken confirmation",
      mediaUrl: "/tmp/voice.opus",
      mediaUrls: ["/tmp/voice.opus"],
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: undefined,
      audioAsVoice: true,
    };
    expect(onBlockReply).toHaveBeenCalledWith(expectedPayload);
    expect(directlySentBlockKeys).toEqual(new Set([createBlockReplyContentKey(expectedPayload)]));
  });

  it("sends media-only block replies when block streaming is disabled", async () => {
    const onBlockReply = vi.fn(async () => {});
    const directlySentBlockKeys = new Set<string>();
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply,
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: false,
      blockReplyPipeline: null,
      directlySentBlockKeys,
    });
    await handler({
      mediaUrls: ["/tmp/generated.png"],
      replyToCurrent: true,
    });
    expect(onBlockReply).toHaveBeenCalledWith({
      mediaUrl: "/tmp/generated.png",
      mediaUrls: ["/tmp/generated.png"],
      replyToCurrent: true,
      replyToId: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
      text: undefined,
    });
    expect(directlySentBlockKeys).toEqual(
      new Set([
        createBlockReplyContentKey({
          mediaUrls: ["/tmp/generated.png"],
          replyToCurrent: true,
        }),
      ]),
    );
  });

  it("sends presentation-only block replies when block streaming is disabled", async () => {
    const onBlockReply = vi.fn(async () => {});
    const directlySentBlockKeys = new Set<string>();
    const presentation = {
      blocks: [{ type: "buttons" as const, buttons: [{ label: "Open", value: "open" }] }],
    };
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply,
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: false,
      blockReplyPipeline: null,
      directlySentBlockKeys,
    });
    await handler({ presentation });
    const expectedPayload = {
      presentation,
      text: undefined,
      mediaUrl: undefined,
      mediaUrls: undefined,
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
    };
    expect(onBlockReply).toHaveBeenCalledWith(expectedPayload);
    expect(directlySentBlockKeys).toEqual(new Set([createBlockReplyContentKey(expectedPayload)]));
  });

  it("keeps text-only block replies buffered when block streaming is disabled", async () => {
    const onBlockReply = vi.fn(async () => {});
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply,
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: false,
      blockReplyPipeline: null,
      directlySentBlockKeys: new Set(),
    });
    await handler({ text: "text only" });
    expect(onBlockReply).not.toHaveBeenCalled();
  });

  it("trims leading whitespace in block-streamed replies", async () => {
    const blockReplyPipeline = {
      enqueue: vi.fn(),
    } as unknown as BlockReplyPipelineLike;
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply: vi.fn(async () => {}),
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: true,
      blockReplyPipeline,
      directlySentBlockKeys: new Set(),
    });
    await handler({ text: "\n\n Hello from stream" });
    expect(blockReplyPipeline.enqueue).toHaveBeenCalledWith({
      text: "Hello from stream",
      mediaUrl: undefined,
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
      mediaUrls: undefined,
    });
  });

  it("suppresses implicit current-message threading for block replies when reply threading denies it", async () => {
    const blockReplyPipeline = {
      enqueue: vi.fn(),
    } as unknown as BlockReplyPipelineLike;
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply: vi.fn(async () => {}),
      currentMessageId: "msg-123",
      replyThreading: { implicitCurrentMessage: "deny" },
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: true,
      blockReplyPipeline,
      directlySentBlockKeys: new Set(),
    });
    await handler({ text: "reset intro" });
    expect(blockReplyPipeline.enqueue).toHaveBeenCalledWith({
      text: "reset intro",
      mediaUrl: undefined,
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
      mediaUrls: undefined,
    });
  });

  it("parses media directives in block replies before path normalization", () => {
    const normalized = normalizeReplyPayloadDirectives({
      payload: { text: "Result\nMEDIA: ./image.png" },
      trimLeadingWhitespace: true,
      parseMode: "auto",
    });
    expect(normalized.payload.text).toBe("Result");
    expect(normalized.payload.mediaUrl).toBe("./image.png");
    expect(normalized.payload.mediaUrls).toEqual(["./image.png"]);
  });

  it("parses lowercase media directives in block replies before path normalization", () => {
    const normalized = normalizeReplyPayloadDirectives({
      payload: { text: "media: ./report.pdf" },
      trimLeadingWhitespace: true,
      parseMode: "auto",
    });
    expect(normalized.payload.text).toBeUndefined();
    expect(normalized.payload.mediaUrl).toBe("./report.pdf");
    expect(normalized.payload.mediaUrls).toEqual(["./report.pdf"]);
  });

  it("leaves media-looking text alone when media directive parsing is disabled", () => {
    const normalized = normalizeReplyPayloadDirectives({
      payload: { text: "Result\nMEDIA: ./image.png" },
      trimLeadingWhitespace: true,
      parseMode: "auto",
      extractMediaDirectives: false,
    });
    expect(normalized.payload.text).toBe("Result\nMEDIA: ./image.png");
    expect(normalized.payload.mediaUrl).toBeUndefined();
    expect(normalized.payload.mediaUrls).toBeUndefined();
  });

  it("does not mark plain replies as explicit reply_to_current opt-outs", () => {
    const normalized = normalizeReplyPayloadDirectives({
      payload: { text: "plain reply" },
      trimLeadingWhitespace: true,
      parseMode: "auto",
    });
    expect(normalized.payload.replyToCurrent).toBeUndefined();
  });

  it("passes structured media block replies through media path normalization", async () => {
    const blockReplyPipeline = {
      enqueue: vi.fn(),
    } as unknown as BlockReplyPipelineLike;
    const absPath = path.join("/tmp/home", "openclaw", "image.png");
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply: vi.fn(async () => {}),
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      normalizeMediaPaths: async (payload) => ({
        ...payload,
        mediaUrl: absPath,
        mediaUrls: [absPath],
      }),
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: true,
      blockReplyPipeline,
      directlySentBlockKeys: new Set(),
    });
    await handler({ text: "Result", mediaUrl: "./image.png" });
    expect(blockReplyPipeline.enqueue).toHaveBeenCalledWith({
      text: "Result",
      mediaUrl: absPath,
      mediaUrls: [absPath],
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: undefined,
      audioAsVoice: false,
    });
  });

  it("suppresses generated media-failure warning text for silent structured block replies", async () => {
    const blockReplyPipeline = {
      enqueue: vi.fn(),
    } as unknown as BlockReplyPipelineLike;
    const absPath = path.join("/tmp/home", "openclaw", "survived.png");
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply: vi.fn(async () => {}),
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => payload,
      normalizeMediaPaths: async (payload) => ({
        ...payload,
        text: "⚠️ Media failed.",
        mediaUrl: absPath,
        mediaUrls: [absPath],
      }),
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: true,
      blockReplyPipeline,
      directlySentBlockKeys: new Set(),
    });
    await handler({ text: "NO_REPLY", mediaUrls: ["./missing.png", "./survived.png"] });
    expect(blockReplyPipeline.enqueue).toHaveBeenCalledWith({
      text: undefined,
      mediaUrl: absPath,
      mediaUrls: [absPath],
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: false,
      audioAsVoice: false,
    });
  });

  it("preserves reply payload metadata across block-reply normalization", async () => {
    const enqueue = vi.fn();
    const blockReplyPipeline = {
      enqueue,
    } as unknown as BlockReplyPipelineLike;
    const handler = createBlockReplyDeliveryHandler({
      onBlockReply: vi.fn(async () => {}),
      normalizeStreamingText: (payload) => ({ text: payload.text, skip: false }),
      applyReplyToMode: (payload) => ({ ...payload, replyToTag: true }),
      typingSignals: {
        signalTextDelta: vi.fn(async () => {}),
      } as unknown as TypingSignaler,
      blockStreamingEnabled: true,
      blockReplyPipeline,
      directlySentBlockKeys: new Set(),
    });
    const payload = setReplyPayloadMetadata({ text: "Alpha" }, { assistantMessageIndex: 7 });
    await handler(payload);
    expect(enqueue).toHaveBeenCalledTimes(1);
    const [firstCall] = enqueue.mock.calls;
    if (!firstCall) {
      throw new Error("Expected block reply pipeline enqueue call");
    }
    const [enqueuedPayload] = firstCall;
    if (enqueuedPayload === undefined) {
      throw new Error("Expected block reply pipeline payload");
    }
    expect(enqueuedPayload).toEqual({
      text: "Alpha",
      mediaUrl: undefined,
      replyToId: undefined,
      replyToCurrent: undefined,
      replyToTag: true,
      audioAsVoice: false,
      mediaUrls: undefined,
    });
    expect(getReplyPayloadMetadata(enqueuedPayload)).toEqual({
      assistantMessageIndex: 7,
    });
  });
});