openclaw/docs/tools/web-fetch.md at 4dad7bd93b6caae036342fd4efbdb47c217b459f

mirror of https://github.com/openclaw/openclaw.git synced 2026-06-03 19:24:05 +00:00

Files

Peter Steinberger aff6d079d3 fix(agents): add typed tool progress updates

Add a general typed tool-progress contract so long-running non-exec tools can emit public channel progress without overloading model-facing tool content.

`web_fetch` now uses the generic delayed progress helper: it shows `Fetching page content...` only when the fetch is still pending after five seconds, clears the timer on completion/abort, passes the abort signal into guarded fetch, and avoids provider fallback or cached success after cancellation. The subscriber path accepts only explicit `visibility: "channel"` and `privacy: "public"` progress metadata, while untyped tool partials and exec output keep their existing behavior.

Docs now explain typed progress, delayed producer examples, and the `web_fetch` timing behavior.

Proof: `pnpm test src/agents/tools/web-tools.fetch.test.ts src/agents/embedded-agent-subscribe.handlers.tools.test.ts -- --run`; `pnpm docs:check-mdx`; changed-file `pnpm exec oxlint ...`; `git diff --check`; autoreview clean.

2026-05-29 11:06:13 +01:00

6.7 KiB

Raw Blame History

summary, read_when, title, sidebarTitle

summary

read_when

title

sidebarTitle

web_fetch tool -- HTTP fetch with readable content extraction

You want to fetch a URL and extract readable content

You need to configure web_fetch or its Firecrawl fallback

You want to understand web_fetch limits and caching

Web fetch

Web Fetch

The web_fetch tool does a plain HTTP GET and extracts readable content (HTML to markdown or text). It does not execute JavaScript.

For JS-heavy sites or login-protected pages, use the Web Browser instead.

Quick start

web_fetch is enabled by default -- no configuration needed. The agent can call it immediately:

await web_fetch({ url: "https://example.com/article" });

Tool parameters

URL to fetch. `http(s)` only. Output format after main-content extraction. Truncate output to this many characters.

How it works

Sends an HTTP GET with a Chrome-like User-Agent and `Accept-Language` header. Blocks private/internal hostnames and re-checks redirects. Runs Readability (main-content extraction) on the HTML response. If Readability fails and Firecrawl is configured, retries through the Firecrawl API with bot-circumvention mode. Results are cached for 15 minutes (configurable) to reduce repeated fetches of the same URL.

Progress updates

web_fetch emits a public progress line only when the fetch is still pending after five seconds:

Fetching page content...

Fast cache hits and quick network responses finish before the timer fires, so they do not show a progress line. If the call is canceled, the timer is cleared. When the fetch eventually completes, the agent receives the normal tool result; the progress line is only channel UI state and never contains fetched page content.

Config

{
  tools: {
    web: {
      fetch: {
        enabled: true, // default: true
        provider: "firecrawl", // optional; omit for auto-detect
        maxChars: 50000, // max output chars
        maxCharsCap: 50000, // hard cap for maxChars param
        maxResponseBytes: 2000000, // max download size before truncation
        timeoutSeconds: 30,
        cacheTtlMinutes: 15,
        maxRedirects: 3,
        useTrustedEnvProxy: false, // let a trusted HTTP(S) env proxy resolve DNS
        readability: true, // use Readability extraction
        userAgent: "Mozilla/5.0 ...", // override User-Agent
        ssrfPolicy: {
          allowRfc2544BenchmarkRange: true, // opt-in for trusted fake-IP proxies using 198.18.0.0/15
          allowIpv6UniqueLocalRange: true, // opt-in for trusted fake-IP proxies using fc00::/7
        },
      },
    },
  },
}

Firecrawl fallback

If Readability extraction fails, web_fetch can fall back to Firecrawl for bot-circumvention and better extraction:

{
  tools: {
    web: {
      fetch: {
        provider: "firecrawl", // optional; omit for auto-detect from available credentials
      },
    },
  },
  plugins: {
    entries: {
      firecrawl: {
        enabled: true,
        config: {
          webFetch: {
            apiKey: "fc-...", // optional if FIRECRAWL_API_KEY is set
            baseUrl: "https://api.firecrawl.dev",
            onlyMainContent: true,
            maxAgeMs: 86400000, // cache duration (1 day)
            timeoutSeconds: 60,
          },
        },
      },
    },
  },
}

plugins.entries.firecrawl.config.webFetch.apiKey supports SecretRef objects. Legacy tools.web.fetch.firecrawl.* config is auto-migrated by openclaw doctor --fix.

If Firecrawl is enabled and its SecretRef is unresolved with no `FIRECRAWL_API_KEY` env fallback, gateway startup fails fast. Firecrawl `baseUrl` overrides are locked down: hosted traffic uses `https://api.firecrawl.dev`; self-hosted overrides must target private or internal endpoints, and `http://` is accepted only for those private targets.

Current runtime behavior:

tools.web.fetch.provider selects the fetch fallback provider explicitly.
If provider is omitted, OpenClaw auto-detects the first ready web-fetch provider from available credentials. Non-sandboxed web_fetch can use installed plugins that declare contracts.webFetchProviders and register a matching provider at runtime. Today the bundled provider is Firecrawl.
Sandboxed web_fetch calls stay limited to bundled providers.
If Readability is disabled, web_fetch skips straight to the selected provider fallback. If no provider is available, it fails closed.

Trusted env proxy

If your deployment requires web_fetch to go through a trusted outbound HTTP(S) proxy, set tools.web.fetch.useTrustedEnvProxy: true.

In this mode, OpenClaw still applies hostname-based SSRF checks before sending the request, but it lets the proxy resolve DNS instead of doing local DNS pinning. Enable this only when the proxy is operator-controlled and enforces outbound policy after DNS resolution.

If no HTTP(S) proxy env var is configured, or the target host is excluded by `NO_PROXY`, `web_fetch` falls back to the normal strict path with local DNS pinning.

Limits and safety

maxChars is clamped to tools.web.fetch.maxCharsCap
Response body is capped at maxResponseBytes before parsing; oversized responses are truncated with a warning
Private/internal hostnames are blocked
tools.web.fetch.ssrfPolicy.allowRfc2544BenchmarkRange and tools.web.fetch.ssrfPolicy.allowIpv6UniqueLocalRange are narrow opt-ins for trusted fake-IP proxy stacks; leave them unset unless your proxy owns those synthetic ranges and enforces its own destination policy
Redirects are checked and limited by maxRedirects
useTrustedEnvProxy is an explicit opt-in and should only be enabled for operator-controlled proxies that still enforce outbound policy after DNS resolution
web_fetch is best-effort -- some sites need the Web Browser

Tool profiles

If you use tool profiles or allowlists, add web_fetch or group:web:

{
  tools: {
    allow: ["web_fetch"],
    // or: allow: ["group:web"]  (includes web_fetch, web_search, and x_search)
  },
}

Web Search -- search the web with multiple providers
Web Browser -- full browser automation for JS-heavy sites
Firecrawl -- Firecrawl search and scrape tools

6.7 KiB Raw Blame History