openclaw/docs/nodes/index.md at cf315ddef69f321f987488fc2cee036c715f9ca4

mirror of https://github.com/openclaw/openclaw.git synced 2026-06-03 21:24:05 +00:00

Files

scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding

* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>

2026-05-31 00:00:19 +01:00

17 KiB

Raw Blame History

summary, read_when, title

summary

read_when

title

Nodes: pairing, capabilities, permissions, and CLI helpers for canvas/camera/screen/device/notifications/system

Pairing iOS/Android nodes to a gateway

Using node canvas/camera for agent context

Adding new node commands or CLI helpers

Nodes

A node is a companion device (macOS/iOS/Android/headless) that connects to the Gateway WebSocket (same port as operators) with role: "node" and exposes a command surface (e.g. canvas.*, camera.*, device.*, notifications.*, system.*) via node.invoke. Protocol details: Gateway protocol.

Legacy transport: Bridge protocol (TCP JSONL; historical only for current nodes).

macOS can also run in node mode: the menubar app connects to the Gateway's WS server and exposes its local canvas/camera commands as a node (so openclaw nodes … works against this Mac). In remote gateway mode, browser automation is handled by the CLI node host (openclaw node run or the installed node service), not by the native app node.

Notes:

Nodes are peripherals, not gateways. They don't run the gateway service.
Telegram/WhatsApp/etc. messages land on the gateway, not on nodes.
Troubleshooting runbook: /nodes/troubleshooting

Pairing + status

WS nodes use device pairing. Nodes present a device identity during connect; the Gateway creates a device pairing request for role: node. Approve via the devices CLI (or UI).

Quick CLI:

openclaw devices list
openclaw devices approve <requestId>
openclaw devices reject <requestId>
openclaw nodes status
openclaw nodes describe --node <idOrNameOrIp>

If a node retries with changed auth details (role/scopes/public key), the prior pending request is superseded and a new requestId is created. Re-run openclaw devices list before approving.

Notes:

nodes status marks a node as paired when its device pairing role includes node.
The device pairing record is the durable approved-role contract. Token rotation stays inside that contract; it cannot upgrade a paired node into a different role that pairing approval never granted.
node.pair.* (CLI: openclaw nodes pending/approve/reject/remove/rename) is a separate gateway-owned node pairing store; it does not gate the WS connect handshake.
openclaw nodes remove --node <id|name|ip> deletes stale entries from that separate gateway-owned node pairing store.
Approval scope follows the pending request's declared commands:
- commandless request: operator.pairing
- non-exec node commands: operator.pairing + operator.write
- system.run / system.run.prepare / system.which: operator.pairing + operator.admin

Remote node host (system.run)

Use a node host when your Gateway runs on one machine and you want commands to execute on another. The model still talks to the gateway; the gateway forwards exec calls to the node host when host=node is selected.

What runs where

Gateway host: receives messages, runs the model, routes tool calls.
Node host: executes system.run/system.which on the node machine.
Approvals: enforced on the node host via ~/.openclaw/exec-approvals.json.

Approval note:

Approval-backed node runs bind exact request context.
For direct shell/runtime file executions, OpenClaw also best-effort binds one concrete local file operand and denies the run if that file changes before execution.
If OpenClaw cannot identify exactly one concrete local file for an interpreter/runtime command, approval-backed execution is denied instead of pretending full runtime coverage. Use sandboxing, separate hosts, or an explicit trusted allowlist/full workflow for broader interpreter semantics.

Start a node host (foreground)

On the node machine:

openclaw node run --host <gateway-host> --port 18789 --display-name "Build Node"

Remote gateway via SSH tunnel (loopback bind)

If the Gateway binds to loopback (gateway.bind=loopback, default in local mode), remote node hosts cannot connect directly. Create an SSH tunnel and point the node host at the local end of the tunnel.

Example (node host -> gateway host):

# Terminal A (keep running): forward local 18790 -> gateway 127.0.0.1:18789
ssh -N -L 18790:127.0.0.1:18789 user@gateway-host

# Terminal B: export the gateway token and connect through the tunnel
export OPENCLAW_GATEWAY_TOKEN="<gateway-token>"
openclaw node run --host 127.0.0.1 --port 18790 --display-name "Build Node"

Notes:

openclaw node run supports token or password auth.
Env vars are preferred: OPENCLAW_GATEWAY_TOKEN / OPENCLAW_GATEWAY_PASSWORD.
Config fallback is gateway.auth.token / gateway.auth.password.
In local mode, node host intentionally ignores gateway.remote.token / gateway.remote.password.
In remote mode, gateway.remote.token / gateway.remote.password are eligible per remote precedence rules.
If active local gateway.auth.* SecretRefs are configured but unresolved, node-host auth fails closed.
Node-host auth resolution only honors OPENCLAW_GATEWAY_* env vars.

Start a node host (service)

openclaw node install --host <gateway-host> --port 18789 --display-name "Build Node"
openclaw node start
openclaw node restart

Pair + name

On the gateway host:

openclaw devices list
openclaw devices approve <requestId>
openclaw nodes status

If the node retries with changed auth details, re-run openclaw devices list and approve the current requestId.

Naming options:

--display-name on openclaw node run / openclaw node install (persists in ~/.openclaw/node.json on the node).
openclaw nodes rename --node <id|name|ip> --name "Build Node" (gateway override).

Allowlist the commands

Exec approvals are per node host. Add allowlist entries from the gateway:

openclaw approvals allowlist add --node <id|name|ip> "/usr/bin/uname"
openclaw approvals allowlist add --node <id|name|ip> "/usr/bin/sw_vers"

Approvals live on the node host at ~/.openclaw/exec-approvals.json.

Point exec at the node

Configure defaults (gateway config):

openclaw config set tools.exec.host node
openclaw config set tools.exec.security allowlist
openclaw config set tools.exec.node "<id-or-name>"

Or per session:

/exec host=node security=allowlist node=<id-or-name>

Once set, any exec call with host=node runs on the node host (subject to the node allowlist/approvals).

host=auto will not implicitly choose the node on its own, but an explicit per-call host=node request is allowed from auto. If you want node exec to be the default for the session, set tools.exec.host=node or /exec host=node ... explicitly.

Invoking commands

Low-level (raw RPC):

openclaw nodes invoke --node <idOrNameOrIp> --command canvas.eval --params '{"javaScript":"location.href"}'

Higher-level helpers exist for the common "give the agent a MEDIA attachment" workflows.

Command policy

Node commands must pass two gates before they can be invoked:

The node must declare the command in its WebSocket connect.commands list.
The gateway's platform policy must allow the declared command.

Windows and macOS companion nodes allow safe declared commands such as canvas.*, camera.list, location.get, and screen.snapshot by default. Trusted nodes that advertise the talk capability or declare talk.* commands also allow declared push-to-talk commands (talk.ptt.start, talk.ptt.stop, talk.ptt.cancel, talk.ptt.once) by default, independent of platform label. Dangerous or privacy-heavy commands such as camera.snap, camera.clip, and screen.record still require explicit opt-in with gateway.nodes.allowCommands. gateway.nodes.denyCommands always wins over defaults and extra allowlist entries.

Plugin-owned node commands can add a Gateway node-invoke policy. That policy runs after the allowlist check and before forwarding to the node, so raw node.invoke, CLI helpers, and dedicated agent tools share the same plugin permission boundary. Dangerous plugin node commands still require explicit gateway.nodes.allowCommands opt-in.

After a node changes its declared command list, reject the old device pairing and approve the new request so the gateway stores the updated command snapshot.

Screenshots (canvas snapshots)

If the node is showing the Canvas (WebView), canvas.snapshot returns { format, base64 }.

CLI helper (writes to a temp file and prints the saved path):

openclaw nodes canvas snapshot --node <idOrNameOrIp> --format png
openclaw nodes canvas snapshot --node <idOrNameOrIp> --format jpg --max-width 1200 --quality 0.9

Canvas controls

openclaw nodes canvas present --node <idOrNameOrIp> --target https://example.com
openclaw nodes canvas hide --node <idOrNameOrIp>
openclaw nodes canvas navigate https://example.com --node <idOrNameOrIp>
openclaw nodes canvas eval --node <idOrNameOrIp> --js "document.title"

Notes:

canvas present accepts URLs or local file paths (--target), plus optional --x/--y/--width/--height for positioning.
canvas eval accepts inline JS (--js) or a positional arg.

A2UI (Canvas)

openclaw nodes canvas a2ui push --node <idOrNameOrIp> --text "Hello"
openclaw nodes canvas a2ui push --node <idOrNameOrIp> --jsonl ./payload.jsonl
openclaw nodes canvas a2ui reset --node <idOrNameOrIp>

Notes:

Only A2UI v0.8 JSONL is supported (v0.9/createSurface is rejected).

Photos + videos (node camera)

Photos (jpg):

openclaw nodes camera list --node <idOrNameOrIp>
openclaw nodes camera snap --node <idOrNameOrIp>            # default: both facings (2 MEDIA lines)
openclaw nodes camera snap --node <idOrNameOrIp> --facing front

Video clips (mp4):

openclaw nodes camera clip --node <idOrNameOrIp> --duration 10s
openclaw nodes camera clip --node <idOrNameOrIp> --duration 3000 --no-audio

Notes:

The node must be foregrounded for canvas.* and camera.* (background calls return NODE_BACKGROUND_UNAVAILABLE).
Clip duration is clamped (currently <= 60s) to avoid oversized base64 payloads.
Android will prompt for CAMERA/RECORD_AUDIO permissions when possible; denied permissions fail with *_PERMISSION_REQUIRED.

Screen recordings (nodes)

Supported nodes expose screen.record (mp4). Example:

openclaw nodes screen record --node <idOrNameOrIp> --duration 10s --fps 10
openclaw nodes screen record --node <idOrNameOrIp> --duration 10s --fps 10 --no-audio

Notes:

screen.record availability depends on node platform.
Screen recordings are clamped to <= 60s.
--no-audio disables microphone capture on supported platforms.
Use --screen <index> to select a display when multiple screens are available.

Location (nodes)

Nodes expose location.get when Location is enabled in settings.

CLI helper:

openclaw nodes location get --node <idOrNameOrIp>
openclaw nodes location get --node <idOrNameOrIp> --accuracy precise --max-age 15000 --location-timeout 10000

Notes:

Location is off by default.
"Always" requires system permission; background fetch is best-effort.
The response includes lat/lon, accuracy (meters), and timestamp.

SMS (Android nodes)

Android nodes can expose sms.send when the user grants SMS permission and the device supports telephony.

Low-level invoke:

openclaw nodes invoke --node <idOrNameOrIp> --command sms.send --params '{"to":"+15555550123","message":"Hello from OpenClaw"}'

Notes:

The permission prompt must be accepted on the Android device before the capability is advertised.
Wi-Fi-only devices without telephony will not advertise sms.send.

Android device + personal data commands

Android nodes can advertise additional command families when the corresponding capabilities are enabled.

Available families:

device.status, device.info, device.permissions, device.health
notifications.list, notifications.actions
photos.latest
contacts.search, contacts.add
calendar.events, calendar.add
callLog.search
sms.search
motion.activity, motion.pedometer

Example invokes:

openclaw nodes invoke --node <idOrNameOrIp> --command device.status --params '{}'
openclaw nodes invoke --node <idOrNameOrIp> --command notifications.list --params '{}'
openclaw nodes invoke --node <idOrNameOrIp> --command photos.latest --params '{"limit":1}'

Notes:

Motion commands are capability-gated by available sensors.

System commands (node host / mac node)

The macOS node exposes system.run, system.notify, and system.execApprovals.get/set. The headless node host exposes system.run, system.which, and system.execApprovals.get/set.

Examples:

openclaw nodes notify --node <idOrNameOrIp> --title "Ping" --body "Gateway ready"
openclaw nodes invoke --node <idOrNameOrIp> --command system.which --params '{"name":"git"}'

Notes:

system.run returns stdout/stderr/exit code in the payload.
Shell execution now goes through the exec tool with host=node; nodes remains the direct-RPC surface for explicit node commands.
nodes invoke does not expose system.run or system.run.prepare; those stay on the exec path only.
The exec path prepares a canonical systemRunPlan before approval. Once an approval is granted, the gateway forwards that stored plan, not any later caller-edited command/cwd/session fields.
system.notify respects notification permission state on the macOS app.
Unrecognized node platform / deviceFamily metadata uses a conservative default allowlist that excludes system.run and system.which. If you intentionally need those commands for an unknown platform, add them explicitly via gateway.nodes.allowCommands.
system.run supports --cwd, --env KEY=VAL, --command-timeout, and --needs-screen-recording.
For shell wrappers (bash|sh|zsh ... -c/-lc), request-scoped --env values are reduced to an explicit allowlist (TERM, LANG, LC_*, COLORTERM, NO_COLOR, FORCE_COLOR).
For allow-always decisions in allowlist mode, known dispatch wrappers (env, nice, nohup, stdbuf, timeout) persist inner executable paths instead of wrapper paths. If unwrapping is not safe, no allowlist entry is persisted automatically.
On Windows node hosts in allowlist mode, shell-wrapper runs via cmd.exe /c require approval (allowlist entry alone does not auto-allow the wrapper form).
system.notify supports --priority <passive|active|timeSensitive> and --delivery <system|overlay|auto>.
Node hosts ignore PATH overrides and strip dangerous startup/shell keys (DYLD_*, LD_*, NODE_OPTIONS, NODE_REDIRECT_WARNINGS, NODE_REPL_EXTERNAL_MODULE, NODE_REPL_HISTORY, NODE_V8_COVERAGE, PYTHON*, PERL*, RUBYOPT, SHELLOPTS, PS4). If you need extra PATH entries, configure the node host service environment (or install tools in standard locations) instead of passing PATH via --env.
On macOS node mode, system.run is gated by exec approvals in the macOS app (Settings → Exec approvals). Ask/allowlist/full behave the same as the headless node host; denied prompts return SYSTEM_RUN_DENIED.
On headless node host, system.run is gated by exec approvals (~/.openclaw/exec-approvals.json).

Exec node binding

When multiple nodes are available, you can bind exec to a specific node. This sets the default node for exec host=node (and can be overridden per agent).

Global default:

openclaw config set tools.exec.node "node-id-or-name"

Per-agent override:

openclaw config get agents.list
openclaw config set 'agents.list[0].tools.exec.node' "node-id-or-name"

Unset to allow any node:

openclaw config unset tools.exec.node
openclaw config unset 'agents.list[0].tools.exec.node'

Permissions map

Nodes may include a permissions map in node.list / node.describe, keyed by permission name (e.g. screenRecording, accessibility) with boolean values (true = granted).

Headless node host (cross-platform)

OpenClaw can run a headless node host (no UI) that connects to the Gateway WebSocket and exposes system.run / system.which. This is useful on Linux/Windows or for running a minimal node alongside a server.

Start it:

openclaw node run --host <gateway-host> --port 18789

Notes:

Pairing is still required (the Gateway will show a device pairing prompt).
The node host stores its node id, token, display name, and gateway connection info in ~/.openclaw/node.json.
Exec approvals are enforced locally via ~/.openclaw/exec-approvals.json (see Exec approvals).
On macOS, the headless node host executes system.run locally by default. Set OPENCLAW_NODE_EXEC_HOST=app to route system.run through the companion app exec host; add OPENCLAW_NODE_EXEC_FALLBACK=0 to require the app host and fail closed if it is unavailable.
Add --tls / --tls-fingerprint when the Gateway WS uses TLS.

Mac node mode

The macOS menubar app connects to the Gateway WS server as a node (so openclaw nodes … works against this Mac).
In remote mode, the app opens an SSH tunnel for the Gateway port and connects to localhost.

17 KiB Raw Blame History

Pairing + status

Remote node host (system.run)

What runs where

Start a node host (foreground)

Remote gateway via SSH tunnel (loopback bind)

Start a node host (service)

Pair + name

Allowlist the commands

Point exec at the node

Invoking commands

Command policy

Screenshots (canvas snapshots)

Canvas controls

A2UI (Canvas)

Photos + videos (node camera)

Screen recordings (nodes)

Location (nodes)

SMS (Android nodes)

Android device + personal data commands

System commands (node host / mac node)

Exec node binding

Permissions map

Headless node host (cross-platform)

Mac node mode

17 KiB

Raw Blame History