openclaw/docs/start/openclaw.md
scotthuang 7920af0c9e refactor: route browser screenshot vision through shared media understanding
* feat(browser): add optional vision understanding to screenshot tool

* fix(browser): wrap vision output as external content, enforce maxBytes, forward auth profiles

* fix(browser): remove no-op scope/attachments config, drop profile pass-through lacking runtime support

* feat(media-understanding): add profile/preferredProfile to DescribeImageFileWithModelParams and forward to describeImage

* style(browser): add curly braces to satisfy eslint curly rule

* fix(browser): correct tools.browser.enabled help text to match actual behavior

* fix(browser): thread agentDir/workspaceDir from plugin tool context into browser vision

* refactor(browser): move vision config from tools.browser to browser.models

The browser plugin's vision configuration now lives on the top-level
`browser` config namespace (browser.models, browser.visionEnabled,
browser.visionPrompt, etc.) instead of `tools.browser`. This aligns
with the plugin's existing config location and avoids confusion between
tool-level and plugin-level settings.

- Remove tools.browser from ToolsSchema and ToolsConfig
- Add models/vision* fields to BrowserConfig and its zod schema
- Update getBrowserVisionConfig to read from cfg.browser
- Update schema help, labels, and quality test
- Update vision.test.ts to use new config shape

* docs(browser): add screenshot vision configuration section

Document the new browser.models config for automatic screenshot
description via vision models, enabling text-only main models to
reason about web page content.

* fix(browser): remove deliverable media markers from vision result, drop unused import

P1: Vision-success path no longer exposes the raw screenshot as
deliverable media (removes MEDIA: line and details.media.mediaUrl).
This prevents channel delivery from auto-sending sensitive page content
when the intended output is a text description.

P2: Remove unused ToolsMediaUnderstandingSchema import that would fail
noUnusedLocals typecheck.

* fix(browser): add command/args fields to browser models schema

The browser vision model schema uses .strict(), so CLI-type entries
with command/args were rejected by TypeScript. Add these fields to
align with MediaUnderstandingModelSchema.

* chore(browser): remove debug console.log statements

* fix(browser): harden screenshot vision result against MEDIA: directive injection and restore image sanitization on failure fallback

ClawSweeper #84247 review round 2:

P1 (security, high): neutralize line-start MEDIA: directives in vision descriptions
before wrapping with wrapExternalContent. The agent media extractor scans every
browser tool-result text block via splitMediaFromOutput which treats line-start
MEDIA: as a trusted local-media delivery directive, and browser is on the
trusted-media allowlist. Without neutralization, page or vision-provider output
containing 'MEDIA:/tmp/secret.png' could synthesize a channel-deliverable media
artifact from untrusted content. wrapExternalContent itself does not strip
line-start directives. Introduce neutralizeMediaDirectives in vision.ts that
prepends '[neutralized] ' to any line whose trimStart() begins with MEDIA:
(case-insensitive), defanging the parser anchor while keeping the original
text human-readable.

P2 (compatibility): pass resolveRuntimeImageSanitization() to imageResultFromFile
in the vision-failure catch fallback. The non-vision screenshot path already
forwards this option (d5cc0d53b7) so configured agents.defaults.imageMaxDimensionPx
takes effect. Without this fix, any provider timeout/error silently bypasses the
sanitization guard and returns a raw full-resolution screenshot.

Regression coverage:
- vision.test.ts: 6 unit cases for neutralizeMediaDirectives (no-op fast path,
  mid-line MEDIA: untouched, line-start defanged, leading-whitespace defanged,
  case-insensitive, multiple directives per blob).
- browser-tool.test.ts: 2 integration cases that drive the full screenshot
  tool execute path:
    - 'neutralizes MEDIA: directives in vision text and does not attach media'
      asserts no line matches /^\s*MEDIA:/i in returned text, secret path text
      is preserved verbatim, details.media is absent, and imageResultFromFile
      is not called on the success path.
    - 'preserves screenshot image sanitization on vision failure fallback'
      mocks describeImageFileWithModel to reject and asserts the fallback
      imageResultFromFile call receives imageSanitization: {maxDimensionPx:1600}
      plus the 'browser screenshot vision failed' extraText.

* fix(browser): apply clawsweeper fallback media fix from PR #84247

* refactor: reuse media image understanding for browser screenshots

* refactor: use structured media delivery

* test: update music completion media instruction expectation

* fix: trim buffered reply directive padding

* test: refresh codex prompt snapshots for message media aliases

---------

Co-authored-by: scotthuang <scotthuang@tencent.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-31 00:00:19 +01:00


summary: End-to-end guide for running OpenClaw as a personal assistant with safety cautions
read_when:
  - Onboarding a new assistant instance
  - Reviewing safety/permission implications
title: Personal assistant setup

OpenClaw is a self-hosted gateway that connects Discord, Google Chat, iMessage, Matrix, Microsoft Teams, Signal, Slack, Telegram, WhatsApp, Zalo, and more to AI agents. This guide covers the "personal assistant" setup: a dedicated WhatsApp number that behaves like your always-on AI assistant.

⚠️ Safety first

You're putting an agent in a position to:

  • run commands on your machine (depending on your tool policy)
  • read/write files in your workspace
  • send messages back out via WhatsApp/Telegram/Discord/Mattermost and other bundled channels

Start conservative:

  • Always set channels.whatsapp.allowFrom (never run open-to-the-world on your personal Mac).
  • Use a dedicated WhatsApp number for the assistant.
  • Heartbeats default to every 30 minutes. Until you trust the setup, disable them by setting agents.defaults.heartbeat.every: "0m".

Prerequisites

  • OpenClaw installed and onboarded - see Getting Started if you haven't done this yet
  • A second phone number (SIM/eSIM/prepaid) for the assistant

You want this:

flowchart TB
    A["<b>Your Phone (personal)<br></b><br>Your WhatsApp<br>+1-555-YOU"] -- message --> B["<b>Second Phone (assistant)<br></b><br>Assistant WA<br>+1-555-ASSIST"]
    B -- linked via QR --> C["<b>Your Mac (openclaw)<br></b><br>AI agent"]

If you link your personal WhatsApp to OpenClaw, every message to you becomes "agent input". That's rarely what you want.

5-minute quick start

  1. Pair WhatsApp Web (shows QR; scan with the assistant phone):
openclaw channels login
  2. Start the Gateway (leave it running):
openclaw gateway --port 18789
  3. Put a minimal config in ~/.openclaw/openclaw.json:
{
  gateway: { mode: "local" },
  channels: { whatsapp: { allowFrom: ["+15555550123"] } },
}

Now message the assistant number from your allowlisted phone.
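
To make the effect of allowFrom concrete, here is a minimal TypeScript sketch of an allowlist gate. The helper names (normalizeNumber, isAllowedSender) and the "exact match after stripping punctuation" rule are assumptions for illustration only; OpenClaw's real matching logic is not described in this guide and may differ.

```typescript
// Hypothetical helpers, not part of OpenClaw's API.
// Strips spaces, dashes, and parentheses so "+1 555 555 0123" compares equal to "+15555550123".
function normalizeNumber(raw: string): string {
  return raw.replace(/[\s\-()]/g, "");
}

// Returns true only when the sender matches an allowFrom entry exactly (after normalization).
function isAllowedSender(allowFrom: string[], sender: string): boolean {
  const target = normalizeNumber(sender);
  return allowFrom.some((entry) => normalizeNumber(entry) === target);
}

const allowFrom = ["+15555550123"];
const fromOwner = isAllowedSender(allowFrom, "+1 555 555 0123");
const fromStranger = isAllowedSender(allowFrom, "+15555559999");
```

With the quick-start config, fromOwner would be accepted and fromStranger dropped, which is the behavior the allowlist is meant to guarantee.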

When onboarding finishes, OpenClaw auto-opens the dashboard and prints a clean (non-tokenized) link. If the dashboard prompts for auth, paste the configured shared secret into Control UI settings. Onboarding uses a token by default (gateway.auth.token), but password auth works too if you switched gateway.auth.mode to password. To reopen later: openclaw dashboard.

Give the agent a workspace (AGENTS)

OpenClaw reads operating instructions and "memory" from its workspace directory.

By default, OpenClaw uses ~/.openclaw/workspace as the agent workspace, and will create it (plus starter AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md) automatically on setup/first agent run. BOOTSTRAP.md is only created when the workspace is brand new (it should not come back after you delete it). MEMORY.md is optional (not auto-created); when present, it is loaded for normal sessions. Subagent sessions only inject AGENTS.md and TOOLS.md.
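
The loading rules above can be summarized in a short sketch. The file names come from this section; the function filesToInject, the SessionKind type, and the assumption that normal sessions load every starter file that exists are hypothetical and only meant to illustrate the normal versus subagent split.

```typescript
type SessionKind = "normal" | "subagent";

// File names taken from the paragraph above; the selection logic is an assumption.
const STARTER_FILES = ["AGENTS.md", "SOUL.md", "TOOLS.md", "IDENTITY.md", "USER.md", "HEARTBEAT.md"];
const SUBAGENT_FILES = ["AGENTS.md", "TOOLS.md"];

// Returns the workspace files a session of the given kind would load,
// limited to files that actually exist in the workspace.
function filesToInject(kind: SessionKind, present: string[]): string[] {
  const candidates =
    kind === "subagent" ? SUBAGENT_FILES : STARTER_FILES.concat(["MEMORY.md"]);
  return candidates.filter((name) => present.indexOf(name) !== -1);
}

const workspace = ["AGENTS.md", "SOUL.md", "TOOLS.md", "MEMORY.md"];
const normalFiles = filesToInject("normal", workspace);
const subagentFiles = filesToInject("subagent", workspace);
```

Here MEMORY.md is only picked up because it is present, and the subagent list never grows beyond AGENTS.md and TOOLS.md.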

Treat this folder like OpenClaw's memory and make it a git repo (ideally private) so your `AGENTS.md` and memory files are backed up. If git is installed, brand-new workspaces are auto-initialized.
openclaw setup

Full workspace layout + backup guide: Agent workspace
Memory workflow: Memory

Optional: choose a different workspace with agents.defaults.workspace (supports ~).

{
  agents: {
    defaults: {
      workspace: "~/.openclaw/workspace",
    },
  },
}

If you already ship your own workspace files from a repo, you can disable bootstrap file creation entirely:

{
  agents: {
    defaults: {
      skipBootstrap: true,
    },
  },
}

The config that turns it into "an assistant"

OpenClaw defaults to a good assistant setup, but you'll usually want to tune:

  • persona/instructions in SOUL.md
  • thinking defaults (if desired)
  • heartbeats (once you trust it)

Example:

{
  logging: { level: "info" },
  agents: {
    defaults: {
      model: { primary: "anthropic/claude-opus-4-6" },
      workspace: "~/.openclaw/workspace",
      thinkingDefault: "high",
      timeoutSeconds: 1800,
      // Start with 0; enable later.
      heartbeat: { every: "0m" },
    },
    list: [
      {
        id: "main",
        default: true,
        groupChat: {
          mentionPatterns: ["@openclaw", "openclaw"],
        },
      },
    ],
  },
  channels: {
    whatsapp: {
      allowFrom: ["+15555550123"],
      groups: {
        "*": { requireMention: true },
      },
    },
  },
  session: {
    scope: "per-sender",
    resetTriggers: ["/new", "/reset"],
    reset: {
      mode: "daily",
      atHour: 4,
      idleMinutes: 10080,
    },
  },
}
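
As a rough illustration of what the session.reset block in this example expresses, the sketch below decides whether a session is stale. It assumes that the daily boundary is evaluated in local time and that a session resets when either the daily boundary or the idle window has passed; neither rule is confirmed by this guide, and ResetPolicy, lastDailyBoundary, and isSessionStale are hypothetical names.

```typescript
interface ResetPolicy {
  atHour: number;      // local hour of the daily reset boundary (0-23)
  idleMinutes: number; // idle window before a session counts as stale
}

// Most recent daily boundary at or before `now`, in local time.
function lastDailyBoundary(now: Date, atHour: number): Date {
  const boundary = new Date(now.getFullYear(), now.getMonth(), now.getDate(), atHour, 0, 0, 0);
  if (boundary.getTime() > now.getTime()) {
    boundary.setDate(boundary.getDate() - 1);
  }
  return boundary;
}

// Assumption: either trigger alone is enough to start a fresh session.
function isSessionStale(lastActivity: Date, now: Date, policy: ResetPolicy): boolean {
  const crossedDaily = lastActivity.getTime() < lastDailyBoundary(now, policy.atHour).getTime();
  const idleExpired = now.getTime() - lastActivity.getTime() >= policy.idleMinutes * 60 * 1000;
  return crossedDaily || idleExpired;
}

const policy: ResetPolicy = { atHour: 4, idleMinutes: 10080 };
const lastActivity = new Date(2026, 0, 10, 23, 0);
const staleBefore4am = isSessionStale(lastActivity, new Date(2026, 0, 11, 3, 0), policy);
const staleAfter4am = isSessionStale(lastActivity, new Date(2026, 0, 11, 5, 0), policy);
```

Under these assumptions, a chat that was last active at 23:00 continues in the same session at 03:00 and starts fresh once the 04:00 boundary has passed.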

Sessions and memory

  • Session files: ~/.openclaw/agents/<agentId>/sessions/{{SessionId}}.jsonl
  • Session metadata (token usage, last route, etc): ~/.openclaw/agents/<agentId>/sessions/sessions.json (legacy: ~/.openclaw/sessions/sessions.json)
  • /new or /reset starts a fresh session for that chat (configurable via resetTriggers). If sent alone, OpenClaw acknowledges the reset without invoking the model.
  • /compact [instructions] compacts the session context and reports the remaining context budget.
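
The reset-trigger behavior can be pictured with a small classifier. This is a sketch only: classifyInbound and the Inbound type are hypothetical, and forwarding the remainder of the message to the new session is an assumption (the guide only states what happens when the trigger is sent alone).

```typescript
type Inbound =
  | { kind: "reset"; rest: string }   // rest === "" means the trigger was sent alone
  | { kind: "message"; text: string };

// Checks whether an inbound chat message starts with one of the configured reset triggers.
function classifyInbound(text: string, resetTriggers: string[]): Inbound {
  const trimmed = text.trim();
  for (const trigger of resetTriggers) {
    if (trimmed === trigger) {
      return { kind: "reset", rest: "" };
    }
    if (trimmed.indexOf(trigger + " ") === 0) {
      return { kind: "reset", rest: trimmed.slice(trigger.length).trim() };
    }
  }
  return { kind: "message", text: trimmed };
}

const triggers = ["/new", "/reset"];
const bare = classifyInbound("/new", triggers);
const withText = classifyInbound("/reset plan my week", triggers);
const ordinary = classifyInbound("what's new today?", triggers);
```

A bare trigger yields an empty rest, which corresponds to the "acknowledge without invoking the model" case described above.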

Heartbeats (proactive mode)

By default, OpenClaw runs a heartbeat every 30 minutes with the prompt: "Read HEARTBEAT.md if it exists (workspace context). Follow it strictly. Do not infer or repeat old tasks from prior chats. If nothing needs attention, reply HEARTBEAT_OK." Set agents.defaults.heartbeat.every: "0m" to disable.

  • If HEARTBEAT.md exists but is effectively empty (only blank lines and markdown headers like # Heading), OpenClaw skips the heartbeat run to save API calls.
  • If the file is missing, the heartbeat still runs and the model decides what to do.
  • If the agent replies with HEARTBEAT_OK (optionally with short padding; see agents.defaults.heartbeat.ackMaxChars), OpenClaw suppresses outbound delivery for that heartbeat.
  • By default, heartbeat delivery to DM-style user:<id> targets is allowed. Set agents.defaults.heartbeat.directPolicy: "block" to suppress direct-target delivery while keeping heartbeat runs active.
  • Heartbeats run full agent turns - shorter intervals burn more tokens.
{
  agents: {
    defaults: {
      heartbeat: { every: "30m" },
    },
  },
}
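
Two of the heartbeat rules above lend themselves to a short sketch: the "effectively empty" check for HEARTBEAT.md and the HEARTBEAT_OK suppression. Both functions are hypothetical illustrations; in particular, how ackMaxChars counts padding is an assumption here (the remaining text around the token, trimmed).

```typescript
// True when the file holds only blank lines and markdown headers such as "# Heading".
function isEffectivelyEmpty(heartbeatMd: string): boolean {
  return heartbeatMd.split(/\r?\n/).every((line) => {
    const t = line.trim();
    return t === "" || /^#{1,6}(\s|$)/.test(t);
  });
}

// Assumption: delivery is suppressed when the reply contains HEARTBEAT_OK and the
// text left over around the token is no longer than ackMaxChars.
function isSilentAck(reply: string, ackMaxChars: number): boolean {
  const token = "HEARTBEAT_OK";
  const idx = reply.indexOf(token);
  if (idx === -1) {
    return false;
  }
  const padding = (reply.slice(0, idx) + reply.slice(idx + token.length)).trim();
  return padding.length <= ackMaxChars;
}

const skipRun = isEffectivelyEmpty("# Tasks\n\n## Later\n");
const runNeeded = !isEffectivelyEmpty("# Tasks\n- check mail");
const suppressed = isSilentAck("HEARTBEAT_OK", 30);
```

In this sketch a headers-only file skips the run entirely, while a reply that carries real content alongside HEARTBEAT_OK is still delivered once it exceeds the padding budget.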

Media in and out

Inbound attachments (images/audio/docs) can be surfaced to your command via templates:

  • {{MediaPath}} (local temp file path)
  • {{MediaUrl}} (pseudo-URL)
  • {{Transcript}} (if audio transcription is enabled)
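
Conceptually, these placeholders behave like simple string substitution. The sketch below shows that idea; renderTemplate is a hypothetical helper, the describe-image command and the file path are made up for the example, and replacing unset variables with an empty string is an assumption.

```typescript
type TemplateVars = { [name: string]: string | undefined };

// Replaces {{Name}} placeholders with values from vars.
// Unknown or unset variables become empty strings (an assumption).
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    const value = vars[name];
    return value === undefined ? "" : value;
  });
}

// Hypothetical command line and path, for illustration only.
const command = renderTemplate("describe-image --file {{MediaPath}}", {
  MediaPath: "/tmp/openclaw/media/inbound-1.jpg",
});
```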

Outbound attachments from the agent use structured media fields on the message tool or reply payload, such as media, mediaUrl, mediaUrls, path, or filePath. Example message-tool arguments:

{
  "message": "Here's the screenshot.",
  "mediaUrl": "https://example.com/screenshot.png"
}

OpenClaw sends structured media alongside the text. Legacy final assistant replies may still be normalized for compatibility, but tool output, browser output, streaming blocks, and message actions do not parse text as attachment commands.
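
The practical consequence is that attachments come only from dedicated fields, never from the message text. A minimal sketch of that separation follows; the field names are the ones listed above, while the OutboundMessage type, its field types, and collectMedia are assumptions for illustration.

```typescript
// Field names come from this section; the types are assumptions.
interface OutboundMessage {
  message?: string;
  media?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
  path?: string;
  filePath?: string;
}

// Collects media references from the structured fields only.
// The text in `message` is never scanned for attachment directives.
function collectMedia(msg: OutboundMessage): string[] {
  const refs: string[] = [];
  for (const value of [msg.media, msg.mediaUrl, msg.path, msg.filePath]) {
    if (typeof value === "string" && value !== "") {
      refs.push(value);
    }
  }
  if (msg.mediaUrls) {
    for (const url of msg.mediaUrls) {
      refs.push(url);
    }
  }
  return refs;
}

const attachments = collectMedia({
  message: "Here's the screenshot.",
  mediaUrl: "https://example.com/screenshot.png",
});
```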

Local-path behavior follows the same file-read trust model as the agent:

  • If tools.fs.workspaceOnly is true, outbound local media paths stay restricted to the OpenClaw temp root, the media cache, agent workspace paths, and sandbox-generated files.
  • If tools.fs.workspaceOnly is false, outbound local media can use host-local files the agent is already allowed to read.
  • Local paths can be absolute, workspace-relative, or home-relative with ~/.
  • Host-local sends still only allow media and safe document types (images, audio, video, PDF, and Office documents). Plain text and secret-like files are not treated as sendable media.

That means generated images/files outside the workspace can now be sent when your fs policy already allows those reads, without reopening arbitrary host-text attachment exfiltration.
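
The "media and safe document types only" rule can be pictured as a file-type gate. The sketch below uses a hypothetical extension list and helper name; the real classification is not specified in this guide (it may well be MIME-based) and the exact set of accepted types is an assumption.

```typescript
// Hypothetical extension list covering images, audio, video, PDF, and Office documents.
const SENDABLE_EXTENSIONS = [
  "png", "jpg", "jpeg", "gif", "webp",
  "mp3", "ogg", "wav",
  "mp4", "mov",
  "pdf", "docx", "xlsx", "pptx",
];

// Returns true when a host-local file looks like sendable media.
// Plain text, extensionless files, and dotfiles such as ".env" are rejected.
function isSendableHostFile(filePath: string): boolean {
  const name = filePath.split("/").pop() || "";
  const dot = name.lastIndexOf(".");
  if (dot <= 0) {
    return false;
  }
  const ext = name.slice(dot + 1).toLowerCase();
  return SENDABLE_EXTENSIONS.indexOf(ext) !== -1;
}

const chartOk = isSendableHostFile("~/Pictures/chart.PNG");
const envBlocked = !isSendableHostFile("/home/me/.env");
```

A check like this would run in addition to the tools.fs.workspaceOnly path restrictions, not instead of them.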

Operations checklist

openclaw status          # local status (creds, sessions, queued events)
openclaw status --all    # full diagnosis (read-only, pasteable)
openclaw status --deep   # asks the gateway for a live health probe with channel probes when supported
openclaw health --json   # gateway health snapshot (WS; default can return a fresh cached snapshot)

Logs live under /tmp/openclaw/ (default: openclaw-YYYY-MM-DD.log).
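
If you script around the logs, the default file name can be derived from the date. This is a small sketch based only on the pattern above; logFilePath is a hypothetical helper, and using the local calendar date (rather than UTC) is an assumption.

```typescript
// Builds the default log path for a given day, e.g. /tmp/openclaw/openclaw-2026-05-31.log.
// Assumption: the date stamp uses the local calendar date.
function logFilePath(date: Date, dir: string = "/tmp/openclaw"): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const stamp = date.getFullYear() + "-" + pad(date.getMonth() + 1) + "-" + pad(date.getDate());
  return dir + "/openclaw-" + stamp + ".log";
}

const exampleLog = logFilePath(new Date(2026, 4, 31));
```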

Next steps