openclaw/docs/nodes/camera.md at a71b121c69f03e1a7471d5ba9d303b53e8778357

mirror of https://github.com/openclaw/openclaw.git synced 2026-06-03 22:24:06 +00:00

summary: Camera capture (iOS/Android nodes + macOS app) for agent use: photos (jpg) and short video clips (mp4)
read_when:
- Adding or modifying camera capture on iOS/Android nodes or macOS
- Extending agent-accessible MEDIA temp-file workflows
title: Camera capture

OpenClaw supports camera capture for agent workflows:

iOS node (paired via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.
Android node (paired via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.
macOS app (node via Gateway): capture a photo (jpg) or short video clip (mp4, with optional audio) via node.invoke.

All camera access is gated behind user-controlled settings.

iOS node

User setting (default on)

iOS Settings tab → Camera → Allow Camera (camera.enabled)
- Default: on (missing key is treated as enabled).
- When off: camera.* commands return CAMERA_DISABLED.
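
The default-on rule can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the `settings` mapping and both function names are hypothetical, and only the `camera.enabled` key and the `CAMERA_DISABLED` code come from the text above.

```python
from typing import Any, Mapping, Optional


def camera_enabled(settings: Mapping[str, Any]) -> bool:
    """Return True unless camera.enabled is explicitly set to a falsy value."""
    # A missing key is treated as enabled (default on).
    return bool(settings.get("camera.enabled", True))


def gate_camera_command(settings: Mapping[str, Any], command: str) -> Optional[str]:
    """Return an error code for a blocked camera.* command, else None."""
    if command.startswith("camera.") and not camera_enabled(settings):
        return "CAMERA_DISABLED"
    return None
```

With an empty settings mapping, `gate_camera_command({}, "camera.snap")` returns `None`, so the command is allowed through.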

Commands (via Gateway `node.invoke`)

camera.list
- Response payload:
  - devices: array of { id, name, position, deviceType }
camera.snap
- Params:
  - facing: front|back (default: front)
  - maxWidth: number (optional; default 1600 on the iOS node)
  - quality: 0..1 (optional; default 0.9)
  - format: currently jpg
  - delayMs: number (optional; default 0)
  - deviceId: string (optional; from camera.list)
- Response payload:
  - format: "jpg"
  - base64: "<...>"
  - width, height
- Payload guard: photos are recompressed to keep the base64 payload under 5 MB.
camera.clip
- Params:
  - facing: front|back (default: front)
  - durationMs: number (default 3000, clamped to a max of 60000)
  - includeAudio: boolean (default true)
  - format: currently mp4
  - deviceId: string (optional; from camera.list)
- Response payload:
  - format: "mp4"
  - base64: "<...>"
  - durationMs
  - hasAudio
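
A caller that builds `camera.snap` and `camera.clip` params might mirror the documented defaults and the 60000 ms clamp on its side. The sketch below assumes that approach; the helper names are hypothetical, and rejecting an invalid `facing` or `quality` locally is an assumption about reasonable client behavior, not a description of what the node does.

```python
from typing import Any, Dict

MAX_CLIP_DURATION_MS = 60_000  # documented clamp for camera.clip


def _check_facing(facing: str) -> None:
    if facing not in ("front", "back"):
        raise ValueError(f"facing must be 'front' or 'back', got {facing!r}")


def snap_params(**overrides: Any) -> Dict[str, Any]:
    """Build camera.snap params, starting from the documented iOS defaults."""
    params: Dict[str, Any] = {
        "facing": "front",
        "maxWidth": 1600,
        "quality": 0.9,
        "format": "jpg",
        "delayMs": 0,
    }
    params.update(overrides)
    _check_facing(params["facing"])
    if not 0 <= params["quality"] <= 1:
        raise ValueError("quality must be in the range 0..1")
    return params


def clip_params(**overrides: Any) -> Dict[str, Any]:
    """Build camera.clip params and clamp durationMs to the documented maximum."""
    params: Dict[str, Any] = {
        "facing": "front",
        "durationMs": 3000,
        "includeAudio": True,
        "format": "mp4",
    }
    params.update(overrides)
    _check_facing(params["facing"])
    params["durationMs"] = min(int(params["durationMs"]), MAX_CLIP_DURATION_MS)
    return params
```

For example, `clip_params(durationMs=120000)` yields a `durationMs` of 60000, matching the clamp described above.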

Foreground requirement

Like canvas.*, the iOS node only allows camera.* commands in the foreground. Background invocations return NODE_BACKGROUND_UNAVAILABLE.

CLI helper

The easiest way to get media files is via the CLI helper, which writes decoded media to a temp file and prints the saved path.
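
For callers that invoke `node.invoke` directly, the decode-and-save step can be approximated as below. This is a sketch under assumptions, not the CLI's actual code: the function name and the `openclaw-camera-` file prefix are hypothetical, and the payload is assumed to carry the `format` and `base64` fields described in the response payloads above.

```python
import base64
import os
import tempfile
from typing import Mapping


def save_media_payload(payload: Mapping[str, str]) -> str:
    """Decode a camera payload and write it to a temp file; return the path."""
    # validate=True rejects characters outside the base64 alphabet.
    data = base64.b64decode(payload["base64"], validate=True)
    suffix = "." + payload["format"]  # "jpg" or "mp4" per the docs above
    fd, path = tempfile.mkstemp(prefix="openclaw-camera-", suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The file lands in the OS temp directory, so a caller that needs to keep it should move or copy it elsewhere.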

Examples:

openclaw nodes camera snap --node <id>               # default: both front + back (2 MEDIA lines)
openclaw nodes camera snap --node <id> --facing front
openclaw nodes camera clip --node <id> --duration 3000
openclaw nodes camera clip --node <id> --no-audio

Notes:

nodes camera snap defaults to both facings to give the agent both views.
Output files are temporary (in the OS temp directory) unless you build your own wrapper.
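
A wrapper that wants to keep the captured files could collect the saved paths from the helper's output. The sketch below assumes each saved file is reported on its own line as `MEDIA:` followed by a path; that exact line shape is an assumption drawn from the "2 MEDIA lines" comment above, and the function name and sample paths are hypothetical.

```python
from typing import List


def media_paths(cli_output: str) -> List[str]:
    """Collect paths from lines that start with MEDIA: (assumed output shape)."""
    paths: List[str] = []
    for line in cli_output.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("MEDIA:"):
            path = stripped[len("MEDIA:"):].strip()
            if path:
                paths.append(path)
    return paths


# Hypothetical output for a default snap (front + back).
sample = "MEDIA:/tmp/front.jpg\nMEDIA:/tmp/back.jpg\n"
front_and_back = media_paths(sample)
```

Lines that do not start with `MEDIA:` are ignored, so log output mixed into the stream does not produce spurious paths.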

Android node

Android user setting (default on)

Android Settings sheet → Camera → Allow Camera (camera.enabled)
- Default: on (missing key is treated as enabled).
- When off: camera.* commands return CAMERA_DISABLED.

Permissions

Android requires runtime permissions:
- CAMERA for both camera.snap and camera.clip.
- RECORD_AUDIO for camera.clip when includeAudio=true.

If permissions are missing, the app will prompt when possible; if denied, camera.* requests fail with a *_PERMISSION_REQUIRED error.
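
The permission rule above reduces to a small lookup. The sketch is illustrative only: the function names are hypothetical, and defaulting `include_audio` to true mirrors the `includeAudio` default documented for the iOS node, which is assumed to apply here as well.

```python
from typing import Iterable, Set


def required_permissions(command: str, include_audio: bool = True) -> Set[str]:
    """Return the Android runtime permissions a camera command needs."""
    permissions: Set[str] = set()
    if command in ("camera.snap", "camera.clip"):
        permissions.add("CAMERA")
    if command == "camera.clip" and include_audio:
        permissions.add("RECORD_AUDIO")
    return permissions


def missing_permissions(
    command: str, granted: Iterable[str], include_audio: bool = True
) -> Set[str]:
    """Return the required permissions that have not been granted yet."""
    return required_permissions(command, include_audio) - set(granted)
```

A non-empty result from `missing_permissions` corresponds to the case where the app would prompt or the request would fail.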

Android foreground requirement

Like canvas.*, the Android node only allows camera.* commands in the foreground. Background invocations return NODE_BACKGROUND_UNAVAILABLE.
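
A caller can map the error codes mentioned so far to a suggested next step. This is a hedged sketch: the function name and the hint strings are hypothetical, and `CAMERA_PERMISSION_REQUIRED` in the usage note is only an illustrative instance of the `*_PERMISSION_REQUIRED` pattern, not a confirmed code.

```python
def camera_error_hint(code: str) -> str:
    """Map a camera error code to a suggested next step for the caller."""
    if code == "CAMERA_DISABLED":
        return "enable Allow Camera in the node settings"
    if code == "NODE_BACKGROUND_UNAVAILABLE":
        return "bring the node app to the foreground and retry"
    if code.endswith("_PERMISSION_REQUIRED"):
        return "grant the missing runtime permission and retry"
    return "unrecognized error; inspect the node response"
```

For instance, `camera_error_hint("CAMERA_PERMISSION_REQUIRED")` returns the permission hint, while an unknown code falls through to the generic message.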

Android commands (via Gateway `node.invoke`)

camera.list
- Response payload:
  - devices: array of { id, name, position, deviceType }

Payload guard

Photos are recompressed to keep the base64 payload under 5 MB.
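
The payload guard can be reasoned about with base64 arithmetic: every 3 raw bytes become 4 encoded characters, so the encoded size is roughly 4/3 of the raw size. The sketch below is one way such a guard could work, not the node's actual algorithm. It assumes "5 MB" means 5 MiB, and `encode` is a hypothetical callable that returns JPEG bytes for a given quality.

```python
from typing import Callable

MAX_BASE64_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def fit_under_limit(
    encode: Callable[[float], bytes],
    quality: float = 0.9,
    step: float = 0.1,
    min_quality: float = 0.1,
) -> bytes:
    """Lower the quality until the base64 form of the image fits the limit."""
    while True:
        data = encode(quality)
        if base64_length(len(data)) < MAX_BASE64_BYTES:
            return data
        if quality <= min_quality:
            raise ValueError("payload still too large at minimum quality")
        quality = max(min_quality, round(quality - step, 2))
```

Under the 5 MiB assumption, the raw image has to stay below roughly 3.75 MiB for its base64 form to fit.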

macOS app

User setting (default off)

The macOS companion app exposes a checkbox:

Settings → General → Allow Camera (openclaw.cameraEnabled)
- Default: off
- When off: camera requests return "Camera disabled by user".

CLI helper (node invoke)

Use the main openclaw CLI to invoke camera commands on the macOS node.

Examples:

openclaw nodes camera list --node <id>            # list camera ids
openclaw nodes camera snap --node <id>            # prints saved path
openclaw nodes camera snap --node <id> --max-width 1280
openclaw nodes camera snap --node <id> --delay-ms 2000
openclaw nodes camera snap --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --duration 10s          # prints saved path
openclaw nodes camera clip --node <id> --duration-ms 3000      # prints saved path (legacy flag)
openclaw nodes camera clip --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --no-audio

Notes:

openclaw nodes camera snap defaults to maxWidth=1600 unless overridden.
On macOS, camera.snap waits delayMs (default 2000 ms) after camera warm-up and exposure settling before it captures.
Photo payloads are recompressed to keep base64 under 5 MB.

Safety + practical limits

Camera and microphone access trigger the usual OS permission prompts (and require usage strings in Info.plist).
Video clips are capped (currently <= 60s) to avoid oversized node payloads (base64 overhead + message limits).
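
The CLI examples pass durations both as a bare number (`3000`) and with a unit suffix (`10s`). The sketch below shows one way to normalize such values to milliseconds and apply the 60 s cap. It assumes a bare number means milliseconds and that only `ms` and `s` suffixes are accepted; the function name is hypothetical, and the real CLI parser may behave differently.

```python
import re

MAX_CLIP_MS = 60_000  # documented cap: clips are currently <= 60s
_UNIT_TO_MS = {"ms": 1, "s": 1000}


def parse_duration_ms(text: str) -> int:
    """Parse '3000', '1500ms', or '10s' into milliseconds, capped at 60 s."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    # Assumption: a bare number is interpreted as milliseconds.
    milliseconds = round(float(value) * _UNIT_TO_MS[unit or "ms"])
    return min(milliseconds, MAX_CLIP_MS)
```

Capping on the caller side keeps requests within the limit before the base64 payload is ever produced.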

macOS screen video (OS-level)

For screen video (not camera), use the macOS companion:

openclaw nodes screen record --node <id> --duration 10s --fps 15   # prints saved path

Notes:

Requires macOS Screen Recording permission (TCC).
