mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-04 22:50:22 +00:00
docs(tools): restructure web tools IA and rewrite web.md
Navigation restructure:

- "Browser" group -> "Web Browser"
- New "Web Tools" group containing Web Fetch, Web Search, and all 7 search provider sub-pages
- Other tools (btw, diffs, etc.) stay at top level

New page:

- tools/web-fetch.md: dedicated web_fetch reference with Steps, config, Firecrawl fallback, limits

Rewritten page:

- tools/web.md: now "Web Search" -- a focused search overview with Steps quick-start, CardGroup provider picker, Tabs for key storage, provider comparison table, auto-detection, parameters, and examples. Removed all inline provider setup (lives in sub-pages) and web_fetch content (now in the dedicated page).

Final sidebar:

Tools
├── Web Browser (browser, login, troubleshooting)
├── Web Tools
│   ├── Web Fetch
│   ├── Web Search
│   └── Brave / Firecrawl / Gemini / Grok / Kimi / Perplexity / Tavily
├── btw, diffs, exec, ...
135
docs/tools/web-fetch.md
Normal file
@@ -0,0 +1,135 @@
---
summary: "web_fetch tool -- HTTP fetch with readable content extraction"
read_when:
  - You want to fetch a URL and extract readable content
  - You need to configure web_fetch or its Firecrawl fallback
  - You want to understand web_fetch limits and caching
title: "Web Fetch"
sidebarTitle: "Web Fetch"
---

# Web Fetch

The `web_fetch` tool does a plain HTTP GET and extracts readable content (HTML to markdown or text). It does **not** execute JavaScript.

For JS-heavy sites or login-protected pages, use the [Web Browser](/tools/browser) instead.

## Quick start

`web_fetch` is **enabled by default** -- no configuration needed. The agent can call it immediately:

```javascript
await web_fetch({ url: "https://example.com/article" });
```

## Tool parameters

| Parameter     | Type     | Description                              |
| ------------- | -------- | ---------------------------------------- |
| `url`         | `string` | URL to fetch (required, http/https only) |
| `extractMode` | `string` | `"markdown"` (default) or `"text"`       |
| `maxChars`    | `number` | Truncate output to this many chars       |

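The constraints in the table (http/https only, two extract modes) can be sketched as a small validation helper. This is an illustrative sketch with a hypothetical `validateFetchArgs` name, not the actual openclaw code:

```javascript
// Hypothetical pre-flight check mirroring the parameter table above.
function validateFetchArgs(args) {
  const url = new URL(args.url); // throws on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`unsupported protocol: ${url.protocol}`);
  }
  if (args.extractMode && !["markdown", "text"].includes(args.extractMode)) {
    throw new Error(`invalid extractMode: ${args.extractMode}`);
  }
  return args;
}
```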
## How it works

<Steps>
  <Step title="Fetch">
    Sends an HTTP GET with a Chrome-like User-Agent and `Accept-Language` header. Blocks private/internal hostnames and re-checks redirects.
  </Step>
  <Step title="Extract">
    Runs Readability (main-content extraction) on the HTML response.
  </Step>
  <Step title="Fallback (optional)">
    If Readability fails and Firecrawl is configured, retries through the Firecrawl API with bot-circumvention mode.
  </Step>
  <Step title="Cache">
    Results are cached for 15 minutes (configurable) to reduce repeated fetches of the same URL.
  </Step>
</Steps>

## Config

```json5
{
  tools: {
    web: {
      fetch: {
        enabled: true, // default: true
        maxChars: 50000, // max output chars
        maxCharsCap: 50000, // hard cap for the maxChars parameter
        maxResponseBytes: 2000000, // max download size before truncation
        timeoutSeconds: 30,
        cacheTtlMinutes: 15,
        maxRedirects: 3,
        readability: true, // use Readability extraction
        userAgent: "Mozilla/5.0 ...", // override User-Agent
      },
    },
  },
}
```

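The `cacheTtlMinutes` setting drives the per-URL result cache from the Cache step. A minimal sketch of such a TTL cache, with an injectable clock for testing (assumed shape; the real cache lives inside the gateway):

```javascript
// Minimal TTL cache keyed by URL, as described in the Cache step.
function makeFetchCache(ttlMinutes = 15, now = Date.now) {
  const entries = new Map();
  return {
    get(url) {
      const entry = entries.get(url);
      if (!entry) return undefined;
      if (now() - entry.at > ttlMinutes * 60_000) {
        entries.delete(url); // expired
        return undefined;
      }
      return entry.value;
    },
    set(url, value) {
      entries.set(url, { value, at: now() });
    },
  };
}
```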
## Firecrawl fallback

If Readability extraction fails, `web_fetch` can fall back to [Firecrawl](/tools/firecrawl) for bot circumvention and better extraction:

```json5
{
  tools: {
    web: {
      fetch: {
        firecrawl: {
          enabled: true,
          apiKey: "fc-...", // optional if FIRECRAWL_API_KEY is set
          baseUrl: "https://api.firecrawl.dev",
          onlyMainContent: true,
          maxAgeMs: 86400000, // cache duration (1 day)
          timeoutSeconds: 60,
        },
      },
    },
  },
}
```

`tools.web.fetch.firecrawl.apiKey` supports SecretRef objects.

<Note>
  If Firecrawl is enabled and its SecretRef is unresolved with no `FIRECRAWL_API_KEY` env fallback, gateway startup fails fast.
</Note>

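The fallback ordering above (Readability first, Firecrawl only when extraction fails and the fallback is enabled) can be sketched as follows. The helper names here are hypothetical, not openclaw's internals:

```javascript
// Sketch of the extraction pipeline: try Readability, then Firecrawl.
async function extractContent(url, { readability, firecrawl }) {
  const primary = await readability(url);
  if (primary) return { source: "readability", content: primary };
  if (firecrawl?.enabled) {
    const rescued = await firecrawl.scrape(url);
    if (rescued) return { source: "firecrawl", content: rescued };
  }
  throw new Error(`web_fetch: could not extract readable content from ${url}`);
}
```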
## Limits and safety

- `maxChars` is clamped to `tools.web.fetch.maxCharsCap`
- Response body is capped at `maxResponseBytes` before parsing; oversized responses are truncated with a warning
- Private/internal hostnames are blocked
- Redirects are checked and limited by `maxRedirects`
- `web_fetch` is best-effort -- some sites need the [Web Browser](/tools/browser)

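The first rule above -- clamping a caller's `maxChars` to the configured cap -- amounts to a one-line min with config defaults. A sketch under the assumption that both settings default to 50000, as shown in the Config section:

```javascript
// Clamp the per-call maxChars against the configured hard cap.
function effectiveMaxChars(requested, config) {
  const fallback = config.maxChars ?? 50000;
  const cap = config.maxCharsCap ?? 50000;
  return Math.min(requested ?? fallback, cap);
}
```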
## Tool profiles

If you use tool profiles or allowlists, add `web_fetch` or `group:web`:

```json5
{
  tools: {
    allow: ["web_fetch"],
    // or: allow: ["group:web"] (includes both web_fetch and web_search)
  },
}
```

## Related

- [Web Search](/tools/web) -- search the web with multiple providers
- [Web Browser](/tools/browser) -- full browser automation for JS-heavy sites
- [Firecrawl](/tools/firecrawl) -- Firecrawl search and scrape tools