openclaw/docs/nodes/camera.md
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00

summary: Camera capture (iOS/Android nodes + macOS app) for agent use: photos (jpg) and short video clips (mp4)
read_when:
  • Adding or modifying camera capture on iOS/Android nodes or macOS
  • Extending agent-accessible MEDIA temp-file workflows
title: Camera capture

OpenClaw supports camera capture for agent workflows:

  • iOS node (paired via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.
  • Android node (paired via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.
  • macOS app (node via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.

All camera access is gated behind user-controlled settings.

iOS node

User setting (default on)

  • iOS Settings tab → Camera → Allow Camera (camera.enabled)
    • Default: on (missing key is treated as enabled).
    • When off: camera.* commands return CAMERA_DISABLED.

Commands (via Gateway node.invoke)

  • camera.list

    • Response payload:
      • devices: array of { id, name, position, deviceType }
  • camera.snap

    • Params:
      • facing: front|back (default: front)
      • maxWidth: number (optional; default 1600 on the iOS node)
      • quality: 0..1 (optional; default 0.9)
      • format: currently jpg
      • delayMs: number (optional; default 0)
      • deviceId: string (optional; from camera.list)
    • Response payload:
      • format: "jpg"
      • base64: "<...>"
      • width, height
    • Payload guard: photos are recompressed to keep the base64 payload under 5 MB.
  • camera.clip

    • Params:
      • facing: front|back (default: front)
      • durationMs: number (default 3000, clamped to a max of 60000)
      • includeAudio: boolean (default true)
      • format: currently mp4
      • deviceId: string (optional; from camera.list)
    • Response payload:
      • format: "mp4"
      • base64: "<...>"
      • durationMs
      • hasAudio
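
The parameter lists above can be sketched as TypeScript types with a small normalization step. This is a minimal illustration, not the project's actual definitions: the type and function names are hypothetical, the defaults are the iOS ones documented above, and clamping quality to 0..1 and durationMs to a minimum of 0 are assumptions.

```typescript
// Hypothetical types mirroring the documented params; not the real OpenClaw definitions.
type Facing = "front" | "back";

interface SnapParams {
  facing?: Facing;
  maxWidth?: number;
  quality?: number;
  format?: "jpg";
  delayMs?: number;
  deviceId?: string;
}

interface ClipParams {
  facing?: Facing;
  durationMs?: number;
  includeAudio?: boolean;
  format?: "mp4";
  deviceId?: string;
}

// Documented upper bound for camera.clip durationMs.
const MAX_CLIP_DURATION_MS = 60000;

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Apply the documented iOS defaults for camera.snap.
function normalizeSnapParams(p: SnapParams = {}) {
  return {
    facing: p.facing ?? "front",
    maxWidth: p.maxWidth ?? 1600,
    quality: clamp(p.quality ?? 0.9, 0, 1), // assumption: out-of-range values are clamped
    format: "jpg" as const,
    delayMs: Math.max(p.delayMs ?? 0, 0),
    deviceId: p.deviceId,
  };
}

// Apply the documented defaults for camera.clip and clamp the duration.
function normalizeClipParams(p: ClipParams = {}) {
  return {
    facing: p.facing ?? "front",
    durationMs: clamp(p.durationMs ?? 3000, 0, MAX_CLIP_DURATION_MS),
    includeAudio: p.includeAudio ?? true,
    format: "mp4" as const,
    deviceId: p.deviceId,
  };
}
```

For example, normalizeClipParams({ durationMs: 120000 }) yields durationMs 60000, matching the documented clamp.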

Foreground requirement

Like canvas.*, the iOS node only allows camera.* commands in the foreground. Background invocations return NODE_BACKGROUND_UNAVAILABLE.

CLI helper

The easiest way to get media files is via the CLI helper, which writes decoded media to a temp file and prints the saved path.

Examples:

openclaw nodes camera snap --node <id>               # default: both front + back (2 MEDIA lines)
openclaw nodes camera snap --node <id> --facing front
openclaw nodes camera clip --node <id> --duration 3000
openclaw nodes camera clip --node <id> --no-audio

Notes:

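  • nodes camera snap captures both facings when --facing is omitted, so the agent gets both views (the underlying camera.snap command itself defaults to front).
  • Output files are written to the OS temp directory and are temporary; copy them elsewhere (for example, from your own wrapper) if you need to keep them.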
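
A wrapper that wants to keep the captured files first has to find the saved paths in the helper's output. The sketch below assumes each saved file is reported on its own line in the form MEDIA:<path>; that exact line format is an assumption based on the "2 MEDIA lines" comment above, and extractMediaPaths is a hypothetical helper, not part of the CLI.

```typescript
// Assumption: every saved file appears on its own line as `MEDIA:<path>`.
// Lines that merely mention MEDIA: mid-sentence are ignored.
function extractMediaPaths(output: string): string[] {
  const paths: string[] = [];
  for (const line of output.split(/\r?\n/)) {
    const match = /^\s*MEDIA:\s*(.+?)\s*$/.exec(line);
    if (match) {
      paths.push(match[1]);
    }
  }
  return paths;
}
```

A caller could then copy each returned path out of the temp directory before it is cleaned up.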

Android node

Android user setting (default on)

  • Android Settings sheet → Camera → Allow Camera (camera.enabled)
    • Default: on (missing key is treated as enabled).
    • When off: camera.* commands return CAMERA_DISABLED.

Permissions

  • Android requires runtime permissions:
    • CAMERA for both camera.snap and camera.clip.
    • RECORD_AUDIO for camera.clip when includeAudio=true.

If permissions are missing, the app will prompt when possible; if denied, camera.* requests fail with a *_PERMISSION_REQUIRED error.

Android foreground requirement

Like canvas.*, the Android node only allows camera.* commands in the foreground. Background invocations return NODE_BACKGROUND_UNAVAILABLE.
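
The error codes named in the iOS and Android sections each call for a different remedy. A caller might map them to a next step as in the sketch below. It assumes the code is available to the caller as a plain string; how errors actually surface from node.invoke is not specified here, and classifyCameraError and its return values are hypothetical.

```typescript
// Hypothetical remediation categories for the documented camera error codes.
type CameraErrorAction =
  | "enable-setting"
  | "bring-to-foreground"
  | "grant-permission"
  | "unknown";

function classifyCameraError(code: string): CameraErrorAction {
  if (code === "CAMERA_DISABLED") {
    return "enable-setting"; // user turned off Allow Camera
  }
  if (code === "NODE_BACKGROUND_UNAVAILABLE") {
    return "bring-to-foreground"; // node app is not in the foreground
  }
  if (/_PERMISSION_REQUIRED$/.test(code)) {
    return "grant-permission"; // matches the *_PERMISSION_REQUIRED family
  }
  return "unknown";
}
```

Matching the permission family by suffix avoids hard-coding individual permission codes, which this page does not enumerate.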

Android commands (via Gateway node.invoke)

  • camera.list
    • Response payload:
      • devices: array of { id, name, position, deviceType }

Payload guard

Photos are recompressed to keep the base64 payload under 5 MB.
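
To see what the guard implies for the raw JPEG size, recall that padded base64 emits 4 characters for every 3 input bytes. The sketch below works through that arithmetic. It assumes "5 MB" means 5 * 1024 * 1024 characters and treats the limit as inclusive; the node's actual threshold and recompression strategy are not described on this page, and all names are hypothetical.

```typescript
// Assumption: the limit is 5 * 1024 * 1024 base64 characters, inclusive.
const MAX_BASE64_CHARS = 5 * 1024 * 1024;

// Padded base64 length for a given number of raw bytes.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Largest raw byte count whose base64 encoding still fits within the limit.
function maxRawBytes(limitChars: number = MAX_BASE64_CHARS): number {
  return Math.floor(limitChars / 4) * 3;
}

// True when a photo of this raw size could be sent without further recompression.
function fitsPayloadGuard(rawBytes: number): boolean {
  return base64Length(rawBytes) <= MAX_BASE64_CHARS;
}
```

Under these assumptions, a photo must come in at roughly 3.75 MiB of raw JPEG data or less to pass the guard, since base64 adds about a third on top.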

macOS app

User setting (default off)

The macOS companion app exposes a checkbox:

  • Settings → General → Allow Camera (openclaw.cameraEnabled)
    • Default: off
    • When off: camera requests return "Camera disabled by user".

CLI helper (node invoke)

Use the main openclaw CLI to invoke camera commands on the macOS node.

Examples:

openclaw nodes camera list --node <id>            # list camera ids
openclaw nodes camera snap --node <id>            # prints saved path
openclaw nodes camera snap --node <id> --max-width 1280
openclaw nodes camera snap --node <id> --delay-ms 2000
openclaw nodes camera snap --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --duration 10s          # prints saved path
openclaw nodes camera clip --node <id> --duration-ms 3000      # prints saved path (legacy flag)
openclaw nodes camera clip --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --no-audio

Notes:

  • openclaw nodes camera snap defaults to maxWidth=1600 unless overridden.
  • On macOS, camera.snap waits delayMs (default 2000 ms) after warm-up and exposure settling before capturing.
  • Photo payloads are recompressed to keep base64 under 5 MB.
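
The clip examples on this page pass durations both with a unit suffix (10s) and as a bare number (3000). A wrapper script that accepts the same inputs could normalize them to milliseconds as sketched below. This is a hypothetical helper, not the CLI's own parser, and it assumes a bare number means milliseconds.

```typescript
// Hypothetical parser: accepts "10s", "1500ms", or a bare number (assumed to be ms).
function parseDurationMs(input: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s)?$/.exec(input.trim());
  if (!match) {
    throw new Error(`Unrecognized duration: ${input}`);
  }
  const value = Number(match[1]);
  const factor = match[2] === "s" ? 1000 : 1;
  return Math.round(value * factor);
}
```

Rejecting unrecognized input early keeps a typo such as "10sec" from silently turning into a default-length clip.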

Safety + practical limits

  • Camera and microphone access trigger the usual OS permission prompts (and require usage strings in Info.plist).
  • Video clips are capped (currently <= 60s) to avoid oversized node payloads (base64 overhead + message limits).

macOS screen video (OS-level)

For screen video (not camera), use the macOS companion:

openclaw nodes screen record --node <id> --duration 10s --fps 15   # prints saved path

Notes:

  • Requires macOS Screen Recording permission (TCC).