openclaw/docs/tools/browser-control.md at 4dad7bd93b6caae036342fd4efbdb47c217b459f

mirror of https://github.com/openclaw/openclaw.git synced 2026-06-03 17:14:06 +00:00


summary: OpenClaw browser control API, CLI reference, and scripting actions

read_when:
- Scripting or debugging the agent browser via the local control API
- Looking for the `openclaw browser` CLI reference
- Adding custom browser automation with snapshots and refs

Browser control API

For setup, configuration, and troubleshooting, see Browser. This page is the reference for the local control HTTP API, the openclaw browser CLI, and scripting patterns (snapshots, refs, waits, debug flows).

Control API (optional)

For local integrations only, the Gateway exposes a small loopback HTTP API:

Status/start/stop: GET /, POST /start, POST /stop
Tabs: GET /tabs, POST /tabs/open, POST /tabs/focus, DELETE /tabs/:targetId
Snapshot/screenshot: GET /snapshot, POST /screenshot
Actions: POST /navigate, POST /act
Hooks: POST /hooks/file-chooser, POST /hooks/dialog
Downloads: POST /download, POST /wait/download
Permissions: POST /permissions/grant
Debugging: GET /console, POST /pdf
Debugging: GET /errors, GET /requests, POST /trace/start, POST /trace/stop, POST /highlight
Network: POST /response/body
State: GET /cookies, POST /cookies/set, POST /cookies/clear
State: GET /storage/:kind, POST /storage/:kind/set, POST /storage/:kind/clear
Settings: POST /set/offline, POST /set/headers, POST /set/credentials, POST /set/geolocation, POST /set/media, POST /set/timezone, POST /set/locale, POST /set/device

All endpoints accept ?profile=<name>. POST /start?headless=true requests a one-shot headless launch for local managed profiles without changing persisted browser config; attach-only, remote CDP, and existing-session profiles reject that override because OpenClaw does not launch those browser processes.
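As a minimal sketch of how a script might build URLs for these routes, the helper below appends ?profile=<name> and any extra query parameters. The function name `control_url` and the base URL in the example are hypothetical; this page does not state the control server's host or port, so pass in whatever your Gateway reports.

```python
from urllib.parse import urlencode


def control_url(base_url, path, profile=None, **query):
    """Build a control API URL, adding ?profile=<name> when a profile is given.

    base_url is an assumption supplied by the caller; the page above does not
    document the port of the loopback control server.
    """
    params = dict(query)
    if profile is not None:
        params["profile"] = profile
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return url + ("?" + urlencode(params) if params else "")


# Placeholder base URL, for illustration only.
print(control_url("http://127.0.0.1:9999", "/tabs", profile="work"))
print(control_url("http://127.0.0.1:9999", "/start", profile="work", headless="true"))
```

Keeping the base URL a required argument avoids baking an unverified port into scripts.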

If shared-secret gateway auth is configured, browser HTTP routes require auth too:

Authorization: Bearer <gateway token>
x-openclaw-password: <gateway password> or HTTP Basic auth with that password

Notes:

This standalone loopback browser API does not consume trusted-proxy or Tailscale Serve identity headers.
If gateway.auth.mode is none or trusted-proxy, these loopback browser routes do not inherit those identity-bearing modes; keep them loopback-only.
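The auth options above can be sketched as a small header builder. The function name `auth_headers` is hypothetical, and the Basic auth username is an assumption: the page only says Basic auth works "with that password", so the username is left as a caller-supplied value that defaults to empty.

```python
import base64


def auth_headers(token=None, password=None, use_basic=False, basic_user=""):
    """Return request headers for the browser control routes.

    Prefers a bearer token, then the password header, then HTTP Basic.
    basic_user is an assumption; the docs do not say which username is expected.
    """
    if token:
        return {"Authorization": f"Bearer {token}"}
    if password and use_basic:
        raw = f"{basic_user}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if password:
        return {"x-openclaw-password": password}
    return {}  # no shared-secret auth configured
```

Pass the returned dict as the headers of whatever HTTP client you use; nothing here contacts the Gateway.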

`/act` error contract

POST /act uses a structured error response for route-level validation and policy failures:

{ "error": "<message>", "code": "ACT_*" }

Current code values:

ACT_KIND_REQUIRED (HTTP 400): kind is missing or unrecognized.
ACT_INVALID_REQUEST (HTTP 400): action payload failed normalization or validation.
ACT_SELECTOR_UNSUPPORTED (HTTP 400): selector was used with an unsupported action kind.
ACT_EVALUATE_DISABLED (HTTP 403): evaluate (or wait --fn) is disabled by config.
ACT_TARGET_ID_MISMATCH (HTTP 403): top-level or batched targetId conflicts with request target.
ACT_EXISTING_SESSION_UNSUPPORTED (HTTP 501): action is not supported for existing-session profiles.

Other runtime failures may still return { "error": "<message>" } without a code field.
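A client can branch on this contract roughly as follows. This is a sketch, not the Gateway's own code: the helper names are hypothetical, and the table simply mirrors the code values listed above, so it will go stale if new codes are added.

```python
import json

# Documented code -> HTTP status pairs, copied from the list above.
ACT_ERROR_STATUS = {
    "ACT_KIND_REQUIRED": 400,
    "ACT_INVALID_REQUEST": 400,
    "ACT_SELECTOR_UNSUPPORTED": 400,
    "ACT_EVALUATE_DISABLED": 403,
    "ACT_TARGET_ID_MISMATCH": 403,
    "ACT_EXISTING_SESSION_UNSUPPORTED": 501,
}


def parse_act_error(body_text):
    """Return (code, message); code is None for uncoded runtime failures."""
    try:
        body = json.loads(body_text)
    except ValueError:
        return None, body_text
    if not isinstance(body, dict):
        return None, body_text
    return body.get("code"), str(body.get("error", ""))


def is_request_bug(code):
    """True when the caller should fix the payload instead of retrying as-is."""
    return ACT_ERROR_STATUS.get(code) == 400
```

Treating a missing code as "unknown runtime failure" keeps the script correct for responses that only carry an error message.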

Playwright requirement

Some features (navigate/act/AI snapshot/role snapshot, element screenshots, PDF) require Playwright. If Playwright isn't installed, those endpoints return a clear 501 error.

What still works without Playwright:

ARIA snapshots
Role-style accessibility snapshots (--interactive, --compact, --depth, --efficient) when a per-tab CDP WebSocket is available. This is a fallback for inspection and ref discovery; Playwright remains the primary action engine.
Page screenshots for the managed openclaw browser when a per-tab CDP WebSocket is available
Page screenshots for existing-session / Chrome MCP profiles
existing-session ref-based screenshots (--ref) from snapshot output

What still needs Playwright:

navigate
act
AI snapshots that depend on Playwright's native AI snapshot format
CSS-selector element screenshots (--element)
full browser PDF export

Element screenshots also reject --full-page; the route returns "fullPage is not supported for element screenshots".

If you see "Playwright is not available in this gateway build", the packaged Gateway is missing the core browser runtime dependency. Reinstall or update OpenClaw, then restart the gateway. For Docker, also install the Chromium browser binaries as shown below.

Docker Playwright install

If your Gateway runs in Docker, avoid npx playwright (npm override conflicts). For custom images, bake Chromium into the image:

OPENCLAW_INSTALL_BROWSER=1 ./scripts/docker/setup.sh

For an existing image, install through the bundled CLI instead:

docker compose run --rm openclaw-cli \
  node /app/node_modules/playwright-core/cli.js install chromium

To persist browser downloads, set PLAYWRIGHT_BROWSERS_PATH (for example, /home/node/.cache/ms-playwright) and make sure /home/node is persisted via OPENCLAW_HOME_VOLUME or a bind mount. OpenClaw auto-detects the persisted Chromium on Linux. See Docker.

How it works (internal)

A small loopback control server accepts HTTP requests and connects to Chromium-based browsers via CDP. Advanced actions (click/type/snapshot/PDF) go through Playwright on top of CDP; when Playwright is missing, only non-Playwright operations are available. The agent sees one stable interface while local/remote browsers and profiles swap freely underneath.

CLI quick reference

All commands accept --browser-profile <name> to target a specific profile, and --json for machine-readable output.

openclaw browser status
openclaw browser start
openclaw browser start --headless # one-shot local managed headless launch
openclaw browser stop            # also clears emulation on attach-only/remote CDP
openclaw browser tabs
openclaw browser tab             # shortcut for current tab
openclaw browser tab new
openclaw browser tab select 2
openclaw browser tab close 2
openclaw browser open https://example.com
openclaw browser focus abcd1234
openclaw browser close abcd1234

openclaw browser screenshot
openclaw browser screenshot --full-page
openclaw browser screenshot --ref 12        # or --ref e12
openclaw browser screenshot --labels
openclaw browser snapshot
openclaw browser snapshot --format aria --limit 200
openclaw browser snapshot --interactive --compact --depth 6
openclaw browser snapshot --efficient
openclaw browser snapshot --labels
openclaw browser snapshot --urls
openclaw browser snapshot --selector "#main" --interactive
openclaw browser snapshot --frame "iframe#main" --interactive
openclaw browser console --level error
openclaw browser errors --clear
openclaw browser requests --filter api --clear
openclaw browser pdf
openclaw browser responsebody "**/api" --max-chars 5000

openclaw browser navigate https://example.com
openclaw browser resize 1280 720
openclaw browser click 12 --double           # or e12 for role refs
openclaw browser click-coords 120 340        # viewport coordinates
openclaw browser type 23 "hello" --submit
openclaw browser press Enter
openclaw browser hover 44
openclaw browser scrollintoview e12
openclaw browser drag 10 11
openclaw browser select 9 OptionA OptionB
openclaw browser download e12 report.pdf
openclaw browser waitfordownload report.pdf
openclaw browser upload /tmp/openclaw/uploads/file.pdf
openclaw browser upload media://inbound/file.pdf
openclaw browser fill --fields '[{"ref":"1","type":"text","value":"Ada"}]'
openclaw browser dialog --accept
openclaw browser dialog --dismiss --dialog-id d1
openclaw browser wait --text "Done"
openclaw browser wait "#main" --url "**/dash" --load networkidle --fn "window.ready===true"
openclaw browser evaluate --fn '(el) => el.textContent' --ref 7
openclaw browser evaluate --timeout-ms 30000 --fn 'async () => { await window.ready; return true; }'
openclaw browser highlight e12
openclaw browser trace start
openclaw browser trace stop

openclaw browser cookies
openclaw browser cookies set session abc123 --url "https://example.com"
openclaw browser cookies clear
openclaw browser storage local get
openclaw browser storage local set theme dark
openclaw browser storage session clear
openclaw browser set offline on
openclaw browser set headers --headers-json '{"X-Debug":"1"}'
openclaw browser set credentials user pass            # --clear to remove
openclaw browser set geo 37.7749 -122.4194 --origin "https://example.com"
openclaw browser set media dark
openclaw browser set timezone America/New_York
openclaw browser set locale en-US
openclaw browser set device "iPhone 14"

Notes:

upload and dialog are arming calls; run them before the click/press that triggers the chooser/dialog. If an action opens a modal, the action response includes blockedByDialog and browserState.dialogs.pending; pass that dialogId to respond directly. Dialogs handled outside OpenClaw appear under browserState.dialogs.recent.
click/type/etc require a ref from snapshot (numeric 12, role ref e12, or actionable ARIA ref ax12). CSS selectors are intentionally not supported for actions. Use click-coords when the visible viewport position is the only reliable target.
Download and trace paths are constrained to OpenClaw temp roots: /tmp/openclaw{,/downloads} (fallback: ${os.tmpdir()}/openclaw/...).
upload accepts files from the OpenClaw temp uploads root and OpenClaw-managed inbound media. Managed inbound media can be referenced as media://inbound/<id>, sandbox-relative media/inbound/<id>, or a resolved path inside the managed inbound media directory. Nested media refs, traversal, symlinks, hardlinks, and arbitrary local paths are still rejected.
upload can also set file inputs directly via --input-ref or --element.
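Scripts that pass download or trace paths may want a client-side pre-check before calling the CLI. The sketch below tests whether a path resolves inside the temp roots named above. It is an illustration only: `is_under_root` is a hypothetical helper, and OpenClaw's own enforcement (including its symlink and hardlink checks) is not reproduced here.

```python
import os
import tempfile


def allowed_roots():
    """Temp roots from the note above: /tmp/openclaw, else <tmpdir>/openclaw."""
    return ["/tmp/openclaw", os.path.join(tempfile.gettempdir(), "openclaw")]


def is_under_root(path, roots=None):
    """Return True when path resolves inside one of the allowed roots."""
    real = os.path.realpath(path)  # collapses ".." and follows symlinks
    for root in roots or allowed_roots():
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            continue  # e.g. paths on different drives
    return False
```

Comparing resolved paths with `commonpath` avoids the prefix bug where /tmp/openclawx would pass a plain `startswith` check.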

Stable tab ids and labels survive Chromium raw-target replacement when OpenClaw can identify the replacement tab with confidence, for example when the URL is unchanged or when a single old tab becomes a single new tab after a form submission. Raw target ids are still volatile; in scripts, prefer the suggestedTargetId reported by tabs.

Snapshot flags at a glance:

--format ai (default with Playwright): AI snapshot with numeric refs (aria-ref="<n>").
--format aria: accessibility tree with axN refs. When Playwright is available, OpenClaw binds refs with backend DOM ids to the live page so follow-up actions can use them; otherwise treat the output as inspection-only.
--efficient (or --mode efficient): compact role snapshot preset. Set browser.snapshotDefaults.mode: "efficient" to make this the default (see Gateway configuration).
--interactive, --compact, --depth, --selector force a role snapshot with ref=e12 refs. --frame "<iframe>" scopes role snapshots to an iframe.
--labels adds a viewport-only screenshot with overlayed ref labels and prints the saved path.
--urls appends discovered link destinations to AI snapshots.

Snapshots and refs

OpenClaw supports three "snapshot" styles:

AI snapshot (numeric refs): openclaw browser snapshot (default; --format ai)
- Output: a text snapshot that includes numeric refs.
- Actions: openclaw browser click 12, openclaw browser type 23 "hello".
- Internally, the ref is resolved via Playwright's aria-ref.
Role snapshot (role refs like e12): openclaw browser snapshot --interactive (or --compact, --depth, --selector, --frame)
- Output: a role-based list/tree with [ref=e12] (and optional [nth=1]).
- Actions: openclaw browser click e12, openclaw browser highlight e12.
- Internally, the ref is resolved via getByRole(...) (plus nth() for duplicates).
- Add --labels to include a viewport screenshot with overlayed e12 labels.
- Add --urls when link text is ambiguous and the agent needs concrete navigation targets.
ARIA snapshot (ARIA refs like ax12): openclaw browser snapshot --format aria
- Output: the accessibility tree as structured nodes.
- Actions: openclaw browser click ax12 works when the snapshot path can bind the ref through Playwright and Chrome backend DOM ids.
If Playwright is unavailable, ARIA snapshots can still be useful for inspection, but refs may not be actionable. Re-snapshot with --format ai or --interactive when you need action refs.
Docker proof for the raw-CDP fallback path: pnpm test:docker:browser-cdp-snapshot starts Chromium with CDP, runs browser doctor --deep, and verifies role snapshots include link URLs, cursor-promoted clickables, and iframe metadata.

Ref behavior:

Refs are not stable across navigations; if something fails, re-run snapshot and use a fresh ref.
/act returns the current raw targetId after action-triggered replacement when it can prove the replacement tab. Keep using stable tab ids/labels for follow-up commands.
If the role snapshot was taken with --frame, role refs are scoped to that iframe until the next role snapshot.
Unknown or stale axN refs fail fast instead of falling through to Playwright's aria-ref selector. Run a fresh snapshot on the same tab when that happens.
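Because the three ref shapes are distinguishable by form alone (numeric 12, role ref e12, ARIA ref ax12), a script can sanity-check a ref before sending an action. The helper name `ref_kind` is hypothetical; it only inspects the string and cannot tell whether a ref is stale.

```python
import re

# Ref shapes as described in this section.
_REF_PATTERNS = [
    ("aria", re.compile(r"ax\d+")),  # ARIA snapshot refs, e.g. ax12
    ("role", re.compile(r"e\d+")),   # role snapshot refs, e.g. e12
    ("ai", re.compile(r"\d+")),      # AI snapshot refs, e.g. 12
]


def ref_kind(ref):
    """Return "ai", "role", or "aria" for a well-formed ref, else None."""
    for kind, pattern in _REF_PATTERNS:
        if pattern.fullmatch(ref):
            return kind
    return None
```

A None result is a useful early signal that a CSS selector was passed where a ref is required.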

Wait power-ups

You can wait on more than just time/text:

Wait for URL (globs supported by Playwright):
- openclaw browser wait --url "**/dash"
Wait for load state:
- openclaw browser wait --load networkidle
Wait for a JS predicate:
- openclaw browser wait --fn "window.ready===true"
Wait for a selector to become visible:
- openclaw browser wait "#main"

These can be combined:

openclaw browser wait "#main" \
  --url "**/dash" \
  --load networkidle \
  --fn "window.ready===true" \
  --timeout-ms 15000
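When driving the CLI from a script, the combined form can be assembled as an argument list, which sidesteps shell quoting of globs and JS predicates. The function `build_wait_args` is a hypothetical wrapper; it only uses flags shown on this page.

```python
import shlex


def build_wait_args(selector=None, text=None, url=None, load=None, fn=None,
                    timeout_ms=None):
    """Build argv for `openclaw browser wait` from the conditions provided."""
    args = ["openclaw", "browser", "wait"]
    if selector:
        args.append(selector)
    if text:
        args += ["--text", text]
    if url:
        args += ["--url", url]
    if load:
        args += ["--load", load]
    if fn:
        args += ["--fn", fn]
    if len(args) == 3:
        raise ValueError("at least one wait condition is required")
    if timeout_ms is not None:
        args += ["--timeout-ms", str(timeout_ms)]
    return args


argv = build_wait_args("#main", url="**/dash", load="networkidle",
                       fn="window.ready===true", timeout_ms=15000)
print(shlex.join(argv))  # shell-safe rendering for logs
```

Pass `argv` to `subprocess.run` directly rather than through a shell so that `**/dash` is never glob-expanded.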

Debug workflows

When an action fails (e.g. "not visible", "strict mode violation", "covered"):

1. openclaw browser snapshot --interactive
2. Use click <ref> / type <ref> (prefer role refs in interactive mode)
3. If it still fails: openclaw browser highlight <ref> to see what Playwright is targeting
4. If the page behaves oddly:
- openclaw browser errors --clear
- openclaw browser requests --filter api --clear
5. For deep debugging, record a trace:
- openclaw browser trace start
- reproduce the issue
- openclaw browser trace stop (prints TRACE:<path>)
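A script that automates the trace step needs the saved path from the TRACE:<path> line. The sketch below assumes the marker appears at the start of its own line in the command output; that layout, the helper name `extract_trace_path`, and the sample output are all assumptions.

```python
import re


def extract_trace_path(output):
    """Return the path from the first line starting with TRACE:, else None."""
    for line in output.splitlines():
        match = re.match(r"\s*TRACE:(.+)$", line)
        if match:
            return match.group(1).strip()
    return None


# Hypothetical output, for illustration only.
sample = "trace stopped\nTRACE:/tmp/openclaw/trace-example.zip\n"
print(extract_trace_path(sample))
```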

JSON output

--json is for scripting and structured tooling.

Examples:

openclaw browser status --json
openclaw browser snapshot --interactive --json
openclaw browser requests --filter api --json
openclaw browser cookies --json

Role snapshots in JSON include refs plus a small stats block (lines/chars/refs/interactive) so tools can reason about payload size and density.
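One way to use the stats block is to estimate ref density before feeding a snapshot to a model. This is a sketch under assumptions: the page names the stats fields (lines/chars/refs/interactive) but not the JSON layout, so the top-level `stats` key is a guess, and both helper names are hypothetical.

```python
import json
import subprocess


def run_browser_json(*args):
    """Run `openclaw browser <args> --json` and parse stdout as JSON."""
    cmd = ["openclaw", "browser", *args, "--json"]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def refs_per_1000_chars(payload):
    """Ref density from the stats block; 0.0 when stats are missing.

    Assumes payload["stats"] holds "chars" and "refs" (layout not documented here).
    """
    stats = payload.get("stats") or {}
    chars = stats.get("chars", 0)
    refs = stats.get("refs", 0)
    return refs * 1000 / chars if chars else 0.0
```

For example, `refs_per_1000_chars(run_browser_json("snapshot", "--interactive"))` would give a rough signal for when to switch to --efficient or a smaller --depth.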

State and environment knobs

These are useful for "make the site behave like X" workflows:

Cookies: cookies, cookies set, cookies clear
Storage: storage local|session get|set|clear
Offline: set offline on|off
Headers: set headers --headers-json '{"X-Debug":"1"}' (legacy set headers --json '{"X-Debug":"1"}' remains supported)
HTTP basic auth: set credentials user pass (or --clear)
Geolocation: set geo <lat> <lon> --origin "https://example.com" (or --clear)
Media: set media dark|light|no-preference|none
Timezone / locale: set timezone ..., set locale ...
Device / viewport:
- set device "iPhone 14" (Playwright device presets)
- set viewport 1280 720
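For scripted "behave like X" setups, a few of these knobs can be wrapped in small argv builders that validate values up front. The function names are hypothetical; the accepted values and flags are taken from the list above.

```python
import json

MEDIA_VALUES = {"dark", "light", "no-preference", "none"}


def set_media_args(value):
    """argv for `set media`, rejecting values not listed on this page."""
    if value not in MEDIA_VALUES:
        raise ValueError(f"unsupported media value: {value!r}")
    return ["openclaw", "browser", "set", "media", value]


def set_offline_args(enabled):
    """argv for `set offline on|off`."""
    return ["openclaw", "browser", "set", "offline", "on" if enabled else "off"]


def set_headers_args(headers):
    """argv for `set headers --headers-json`, serializing the dict as JSON."""
    return ["openclaw", "browser", "set", "headers",
            "--headers-json", json.dumps(headers)]
```

Serializing headers with `json.dumps` and passing argv as a list avoids hand-quoting JSON for the shell.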

Security and privacy

The openclaw browser profile may contain logged-in sessions; treat it as sensitive.
browser act kind=evaluate / openclaw browser evaluate and wait --fn execute arbitrary JavaScript in the page context. Prompt injection can steer this. Disable it with browser.evaluateEnabled=false if you do not need it.
Use openclaw browser evaluate --timeout-ms <ms> when the page-side function may need longer than the default evaluate timeout.
For logins and anti-bot notes (X/Twitter, etc.), see Browser login + X/Twitter posting.
Keep the Gateway/node host private (loopback or tailnet-only).
Remote CDP endpoints are powerful; tunnel and protect them.

Strict-mode example (block private/internal destinations by default):

{
  browser: {
    ssrfPolicy: {
      dangerouslyAllowPrivateNetwork: false,
      hostnameAllowlist: ["*.example.com", "example.com"],
      allowedHostnames: ["localhost"], // optional exact allow
    },
  },
}
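To reason about which hostnames a policy like this is meant to admit, the sketch below combines exact names with wildcard patterns. It is an illustration only: OpenClaw's actual matching rules are not described on this page, glob-style matching via `fnmatchcase` is an assumption, and private-network handling (dangerouslyAllowPrivateNetwork) is not modeled.

```python
from fnmatch import fnmatchcase


def host_allowed(hostname, policy):
    """Return True when hostname matches an exact entry or a wildcard pattern.

    Assumed semantics: allowedHostnames are exact names, hostnameAllowlist
    entries are shell-style globs. Not OpenClaw's real implementation.
    """
    host = hostname.lower().rstrip(".")
    exact = {name.lower() for name in policy.get("allowedHostnames", [])}
    if host in exact:
        return True
    patterns = [p.lower() for p in policy.get("hostnameAllowlist", [])]
    return any(fnmatchcase(host, pattern) for pattern in patterns)


policy = {
    "hostnameAllowlist": ["*.example.com", "example.com"],
    "allowedHostnames": ["localhost"],
}
```

Under these assumed semantics, "*.example.com" does not cover the bare "example.com", which would explain why the example lists both.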

Related

Browser - overview, configuration, profiles, security
Browser login - signing in to sites
Browser Linux troubleshooting
Browser WSL2 troubleshooting
