mirror of
https://github.com/openclaw/openclaw.git
synced 2026-06-03 22:34:05 +00:00
* feat(browser): add optional vision understanding to screenshot tool
* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles
* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support
* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage
* style(browser): add curly braces to satisfy eslint curly rule
* fix(browser): correct tools.browser.enabled help text to match actual behavior
* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision
* refactor(browser): move vision config from tools.browser to browser.models
The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.
- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape
* docs(browser): add screenshot vision configuration section
Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.
* fix(browser): remove deliverable media markers from vision result, drop unused import
P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.
P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.
* fix(browser): add command/args fields to browser models schema
The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.
* chore(browser): remove debug console.log statements
* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback
ClawSweeper #84247 review round 2:
P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.
P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.
Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
tool execute path:
- 'neutralizes MEDIA: directives in vision text and does not attach media'
asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
is preserved verbatim, details.media is absent, and imageResultFromFile
is not called on the success path.
- 'preserves screenshot image sanitization on vision failure fallback'
mocks describeImageFileWithModel to reject and asserts the fallback
imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
plus the 'browser screenshot vision failed' extraText.
* fix(browser): apply clawsweeper fallback media fix from PR #84247
* refactor: reuse media image understanding for browser screenshots
* refactor: use structured media delivery
* test: update music completion media instruction expectation
* fix: trim buffered reply directive padding
* test: refresh codex prompt snapshots for message media aliases
---------
Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
167 lines
5.6 KiB
Markdown
---
summary: "Camera capture (iOS/Android nodes + macOS app) for agent use: photos (jpg) and short video clips (mp4)"
read_when:
  - Adding or modifying camera capture on iOS/Android nodes or macOS
  - Extending agent-accessible MEDIA temp-file workflows
title: "Camera capture"
---

OpenClaw supports **camera capture** for agent workflows:

- **iOS node** (paired via Gateway): capture a **photo** (`jpg`) or **short video clip** (`mp4`, with optional audio) via `node.invoke`.
- **Android node** (paired via Gateway): capture a **photo** (`jpg`) or **short video clip** (`mp4`, with optional audio) via `node.invoke`.
- **macOS app** (node via Gateway): capture a **photo** (`jpg`) or **short video clip** (`mp4`, with optional audio) via `node.invoke`.

All camera access is gated behind **user-controlled settings**.

## iOS node

### User setting (default on)

- iOS Settings tab → **Camera** → **Allow Camera** (`camera.enabled`)
  - Default: **on** (missing key is treated as enabled).
  - When off: `camera.*` commands return `CAMERA_DISABLED`.

### Commands (via Gateway `node.invoke`)

- `camera.list`
  - Response payload:
    - `devices`: array of `{ id, name, position, deviceType }`

- `camera.snap`
  - Params:
    - `facing`: `front|back` (default: `front`)
    - `maxWidth`: number (optional; default `1600` on the iOS node)
    - `quality`: `0..1` (optional; default `0.9`)
    - `format`: currently `jpg`
    - `delayMs`: number (optional; default `0`)
    - `deviceId`: string (optional; from `camera.list`)
  - Response payload:
    - `format: "jpg"`
    - `base64: "<...>"`
    - `width`, `height`
  - Payload guard: photos are recompressed to keep the base64 payload under 5 MB.

- `camera.clip`
  - Params:
    - `facing`: `front|back` (default: `front`)
    - `durationMs`: number (default `3000`, clamped to a max of `60000`)
    - `includeAudio`: boolean (default `true`)
    - `format`: currently `mp4`
    - `deviceId`: string (optional; from `camera.list`)
  - Response payload:
    - `format: "mp4"`
    - `base64: "<...>"`
    - `durationMs`
    - `hasAudio`

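
The `camera.clip` defaults and the `durationMs` clamp listed above can be sketched as a small normalization step. This is only an illustration: `CameraClipParams`, `NormalizedClipParams`, and `normalizeClipParams` are hypothetical names rather than OpenClaw APIs, and the lower bound of `0` is an assumption the parameter list does not state.

```typescript
// Hypothetical types mirroring the documented camera.clip params.
type Facing = "front" | "back";

interface CameraClipParams {
  facing?: Facing;
  durationMs?: number;
  includeAudio?: boolean;
  deviceId?: string;
}

interface NormalizedClipParams {
  facing: Facing;
  durationMs: number;
  includeAudio: boolean;
  format: "mp4";
  deviceId?: string;
}

const DEFAULT_CLIP_DURATION_MS = 3000; // documented default
const MAX_CLIP_DURATION_MS = 60000; // documented clamp

function normalizeClipParams(params: CameraClipParams = {}): NormalizedClipParams {
  // Fall back to the default when durationMs is missing or not a finite number.
  const requested =
    typeof params.durationMs === "number" && Number.isFinite(params.durationMs)
      ? params.durationMs
      : DEFAULT_CLIP_DURATION_MS;

  const normalized: NormalizedClipParams = {
    facing: params.facing ?? "front",
    // Assumption: negative values are raised to 0; only the upper clamp is documented.
    durationMs: Math.min(Math.max(requested, 0), MAX_CLIP_DURATION_MS),
    includeAudio: params.includeAudio ?? true,
    format: "mp4",
  };
  if (params.deviceId !== undefined) {
    normalized.deviceId = params.deviceId;
  }
  return normalized;
}
```

For example, `normalizeClipParams({ durationMs: 120000 })` yields a `durationMs` of `60000`, and calling it with no arguments yields the documented defaults.
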
### Foreground requirement

Like `canvas.*`, the iOS node only allows `camera.*` commands in the **foreground**. Background invocations return `NODE_BACKGROUND_UNAVAILABLE`.

### CLI helper

The easiest way to get media files is via the CLI helper, which writes decoded media to a temp file and prints the saved path.

Examples:

```bash
openclaw nodes camera snap --node <id>               # default: both front + back (2 MEDIA lines)
openclaw nodes camera snap --node <id> --facing front
openclaw nodes camera clip --node <id> --duration 3000
openclaw nodes camera clip --node <id> --no-audio
```

Notes:

- `nodes camera snap` defaults to **both** facings to give the agent both views.
- Output files are temporary (in the OS temp directory) unless you build your own wrapper.

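
If you do build your own wrapper, the decode-and-save step the CLI helper performs can be approximated as follows. This is a minimal sketch using Node.js built-ins, not the CLI's actual implementation: `savePayloadToTempFile`, the `camera-` directory prefix, and the `capture.<format>` file name are all hypothetical.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Subset of the documented camera.snap / camera.clip response payload.
interface CameraMediaPayload {
  format: "jpg" | "mp4";
  base64: string;
}

// Hypothetical helper: decode the base64 payload into a fresh temp directory
// and return the path of the written file.
function savePayloadToTempFile(payload: CameraMediaPayload): string {
  const dir = mkdtempSync(join(tmpdir(), "camera-"));
  const filePath = join(dir, `capture.${payload.format}`);
  writeFileSync(filePath, Buffer.from(payload.base64, "base64"));
  return filePath;
}

// Usage with a stand-in payload (real payloads come from node.invoke).
const saved = savePayloadToTempFile({
  format: "jpg",
  base64: Buffer.from("demo").toString("base64"),
});
console.log(saved, readFileSync(saved).length);
```

Writing into a directory created by `mkdtempSync` avoids name collisions between captures; a wrapper that needs persistent files would copy the result elsewhere.
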
## Android node

### Android user setting (default on)

- Android Settings sheet → **Camera** → **Allow Camera** (`camera.enabled`)
  - Default: **on** (missing key is treated as enabled).
  - When off: `camera.*` commands return `CAMERA_DISABLED`.

### Permissions

- Android requires runtime permissions:
  - `CAMERA` for both `camera.snap` and `camera.clip`.
  - `RECORD_AUDIO` for `camera.clip` when `includeAudio=true`.

If permissions are missing, the app will prompt when possible; if denied, `camera.*` requests fail with a `*_PERMISSION_REQUIRED` error.

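
The permission rules above reduce to a small lookup. The sketch below is illustrative only: `requiredAndroidPermissions` is a hypothetical name, and the `includeAudio` default of `true` is assumed to match the iOS `camera.clip` parameter list, which this section does not confirm for Android.

```typescript
type CameraCommand = "camera.snap" | "camera.clip";
type AndroidPermission = "CAMERA" | "RECORD_AUDIO";

// Hypothetical helper: which runtime permissions a command needs.
function requiredAndroidPermissions(
  command: CameraCommand,
  includeAudio = true, // assumption: same default as the iOS camera.clip param
): AndroidPermission[] {
  // CAMERA is needed for both snap and clip.
  const permissions: AndroidPermission[] = ["CAMERA"];
  // RECORD_AUDIO only matters for clips that include audio.
  if (command === "camera.clip" && includeAudio) {
    permissions.push("RECORD_AUDIO");
  }
  return permissions;
}
```

For example, `requiredAndroidPermissions("camera.clip", false)` returns only `CAMERA`, which matches a clip captured with `--no-audio`.
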
### Android foreground requirement

Like `canvas.*`, the Android node only allows `camera.*` commands in the **foreground**. Background invocations return `NODE_BACKGROUND_UNAVAILABLE`.

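
A caller may want to tell the documented failure modes apart before deciding whether to retry. The following is a rough sketch: only the error code strings come from this page, while `classifyCameraError`, the category names, and the idea that the code arrives as a plain string are assumptions.

```typescript
// Hypothetical categories for the error codes named on this page.
type CameraFailure =
  | "setting-disabled"
  | "app-in-background"
  | "permission-missing"
  | "unknown";

function classifyCameraError(code: string): CameraFailure {
  if (code === "CAMERA_DISABLED") {
    // The user turned off Allow Camera; retrying will not help.
    return "setting-disabled";
  }
  if (code === "NODE_BACKGROUND_UNAVAILABLE") {
    // The node app must be brought to the foreground first.
    return "app-in-background";
  }
  if (code.endsWith("_PERMISSION_REQUIRED")) {
    // Matches the documented `*_PERMISSION_REQUIRED` pattern.
    return "permission-missing";
  }
  return "unknown";
}
```

Only the `app-in-background` case is likely to succeed on a later retry without the user changing a setting or granting a permission.
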
### Android commands (via Gateway `node.invoke`)

- `camera.list`
  - Response payload:
    - `devices`: array of `{ id, name, position, deviceType }`

### Payload guard

Photos are recompressed to keep the base64 payload under 5 MB.

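
Base64 encodes every 3 raw bytes as 4 characters, so the raw JPEG has to be noticeably smaller than the limit. The sketch below shows that arithmetic plus one possible recompression loop. It is not the node's actual algorithm: the 5 MiB interpretation of "5 MB", the stepwise quality reduction, and the `encode` callback are all assumptions made for illustration.

```typescript
// Assumption: "5 MB" is read as 5 MiB; the document does not say which.
const MAX_BASE64_BYTES = 5 * 1024 * 1024;

// Length of the padded base64 encoding of `rawBytes` bytes.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Hypothetical guard: re-encode at decreasing quality until the base64 size
// is under the limit. `encode` stands in for a real JPEG encoder.
function recompressToFit(
  encode: (quality: number) => Uint8Array,
): Uint8Array | undefined {
  // Step through 0.9, 0.8, ..., 0.1 using integers to avoid float drift.
  for (let percent = 90; percent >= 10; percent -= 10) {
    const bytes = encode(percent / 100);
    if (base64Length(bytes.length) < MAX_BASE64_BYTES) {
      return bytes;
    }
  }
  return undefined; // nothing fit; a real implementation might also downscale
}
```

Under the 5 MiB assumption, a raw image must stay below roughly 3.75 MiB to pass the guard, which is why a default `maxWidth` and `quality` already keep most photos within range.
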
## macOS app

### User setting (default off)

The macOS companion app exposes a checkbox:

- **Settings → General → Allow Camera** (`openclaw.cameraEnabled`)
  - Default: **off**
  - When off: camera requests return "Camera disabled by user".

### CLI helper (node invoke)

Use the main `openclaw` CLI to invoke camera commands on the macOS node.

Examples:

```bash
openclaw nodes camera list --node <id>            # list camera ids
openclaw nodes camera snap --node <id>            # prints saved path
openclaw nodes camera snap --node <id> --max-width 1280
openclaw nodes camera snap --node <id> --delay-ms 2000
openclaw nodes camera snap --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --duration 10s          # prints saved path
openclaw nodes camera clip --node <id> --duration-ms 3000      # prints saved path (legacy flag)
openclaw nodes camera clip --node <id> --device-id <id>
openclaw nodes camera clip --node <id> --no-audio
```

Notes:

- `openclaw nodes camera snap` defaults to `maxWidth=1600` unless overridden.
- On macOS, `camera.snap` waits `delayMs` (default 2000ms) after warm-up/exposure settle before capturing.
- Photo payloads are recompressed to keep base64 under 5 MB.

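
The examples on this page pass `--duration` both as `10s` and as a bare number such as `3000`. How the CLI parses these values is not stated here; the sketch below shows one plausible reading, in which a bare number means milliseconds. `parseDurationMs` and the accepted suffixes (`ms`, `s`) are hypothetical.

```typescript
// Hypothetical parser: "10s" -> 10000, "3000" -> 3000, "250ms" -> 250.
function parseDurationMs(input: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s)?$/.exec(input.trim());
  if (!match) {
    throw new Error(`Unrecognized duration: ${input}`);
  }
  const value = Number(match[1]);
  // Assumption: a value without a unit is already in milliseconds.
  const unit = match[2] ?? "ms";
  return Math.round(unit === "s" ? value * 1000 : value);
}
```

Under this reading, `--duration 10s` and `--duration-ms 10000` would request the same clip length, and either value would still be subject to the 60 s cap.
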
## Safety + practical limits

- Camera and microphone access trigger the usual OS permission prompts (and require usage strings in Info.plist).
- Video clips are capped (currently `<= 60s`) to avoid oversized node payloads (base64 overhead + message limits).

## macOS screen video (OS-level)

For _screen_ video (not camera), use the macOS companion:

```bash
openclaw nodes screen record --node <id> --duration 10s --fps 15   # prints saved path
```

Notes:

- Requires macOS **Screen Recording** permission (TCC).

## Related

- [Image and media support](/nodes/images)
- [Media understanding](/nodes/media-understanding)
- [Location command](/nodes/location-command)