mirror of
https://github.com/openclaw/openclaw.git
synced 2026-04-06 23:01:08 +00:00
256 lines
10 KiB
Markdown
256 lines
10 KiB
Markdown
---
|
|
title: "Prompt Caching"
|
|
summary: "Prompt caching knobs, merge order, provider behavior, and tuning patterns"
|
|
read_when:
|
|
- You want to reduce prompt token costs with cache retention
|
|
- You need per-agent cache behavior in multi-agent setups
|
|
- You are tuning heartbeat and cache-ttl pruning together
|
|
---
|
|
|
|
# Prompt caching
|
|
|
|
Prompt caching means the model provider can reuse unchanged prompt prefixes (usually system/developer instructions and other stable context) across turns instead of re-processing them every time. OpenClaw normalizes provider usage into `cacheRead` and `cacheWrite` where the upstream API exposes those counters directly.
|
|
|
|
Why this matters: lower token cost, faster responses, and more predictable performance for long-running sessions. Without caching, repeated prompts pay the full prompt cost on every turn even when most input did not change.
|
|
|
|
This page covers all cache-related knobs that affect prompt reuse and token cost.
|
|
|
|
Provider references:
|
|
|
|
- Anthropic prompt caching: [https://platform.claude.com/docs/en/build-with-claude/prompt-caching](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
|
|
- OpenAI prompt caching: [https://developers.openai.com/api/docs/guides/prompt-caching](https://developers.openai.com/api/docs/guides/prompt-caching)
|
|
- OpenAI API headers and request IDs: [https://developers.openai.com/api/reference/overview](https://developers.openai.com/api/reference/overview)
|
|
- Anthropic request IDs and errors: [https://platform.claude.com/docs/en/api/errors](https://platform.claude.com/docs/en/api/errors)
|
|
|
|
## Primary knobs
|
|
|
|
### `cacheRetention` (global default, model, and per-agent)
|
|
|
|
Set cache retention as a global default for all models:
|
|
|
|
```yaml
|
|
agents:
|
|
defaults:
|
|
params:
|
|
cacheRetention: "long" # none | short | long
|
|
```
|
|
|
|
Override per-model:
|
|
|
|
```yaml
|
|
agents:
|
|
defaults:
|
|
models:
|
|
"anthropic/claude-opus-4-6":
|
|
params:
|
|
cacheRetention: "short" # none | short | long
|
|
```
|
|
|
|
Per-agent override:
|
|
|
|
```yaml
|
|
agents:
|
|
list:
|
|
- id: "alerts"
|
|
params:
|
|
cacheRetention: "none"
|
|
```
|
|
|
|
Config merge order:
|
|
|
|
1. `agents.defaults.params` (global default — applies to all models)
|
|
2. `agents.defaults.models["provider/model"].params` (per-model override)
|
|
3. `agents.list[].params` (matching agent id; overrides by key)
|
|
|
|
### `contextPruning.mode: "cache-ttl"`
|
|
|
|
Prunes old tool-result context after cache TTL windows so post-idle requests do not re-cache oversized history.
|
|
|
|
```yaml
|
|
agents:
|
|
defaults:
|
|
contextPruning:
|
|
mode: "cache-ttl"
|
|
ttl: "1h"
|
|
```
|
|
|
|
See [Session Pruning](/concepts/session-pruning) for full behavior.
|
|
|
|
### Heartbeat keep-warm
|
|
|
|
Heartbeat can keep cache windows warm and reduce repeated cache writes after idle gaps.
|
|
|
|
```yaml
|
|
agents:
|
|
defaults:
|
|
heartbeat:
|
|
every: "55m"
|
|
```
|
|
|
|
Per-agent heartbeat is supported at `agents.list[].heartbeat`.
|
|
|
|
## Provider behavior
|
|
|
|
### Anthropic (direct API)
|
|
|
|
- `cacheRetention` is supported.
|
|
- With Anthropic API-key auth profiles, OpenClaw seeds `cacheRetention: "short"` for Anthropic model refs when unset.
|
|
- Anthropic native Messages responses expose both `cache_read_input_tokens` and `cache_creation_input_tokens`, so OpenClaw can show both `cacheRead` and `cacheWrite`.
|
|
- For native Anthropic requests, `cacheRetention: "short"` maps to the default 5-minute ephemeral cache, and `cacheRetention: "long"` upgrades to the 1-hour TTL only on direct `api.anthropic.com` hosts.
|
|
|
|
### OpenAI (direct API)
|
|
|
|
- Prompt caching is automatic on supported recent models. OpenClaw does not need to inject block-level cache markers.
|
|
- OpenClaw uses `prompt_cache_key` to keep cache routing stable across turns and uses `prompt_cache_retention: "24h"` only when `cacheRetention: "long"` is selected on direct OpenAI hosts.
|
|
- OpenAI responses expose cached prompt tokens via `usage.prompt_tokens_details.cached_tokens` (or `input_tokens_details.cached_tokens` on Responses API events). OpenClaw maps that to `cacheRead`.
|
|
- OpenAI does not expose a separate cache-write token counter, so `cacheWrite` stays `0` on OpenAI paths even when the provider is warming a cache.
|
|
- OpenAI returns useful tracing and rate-limit headers such as `x-request-id`, `openai-processing-ms`, and `x-ratelimit-*`, but cache-hit accounting should come from the usage payload, not from headers.
|
|
- In practice, OpenAI often behaves like an initial-prefix cache rather than Anthropic-style moving full-history reuse. Stable long-prefix text turns can land near a `4864` cached-token plateau in current live probes, while tool-heavy or MCP-style transcripts often plateau near `4608` cached tokens even on exact repeats.
|
|
|
|
### Amazon Bedrock
|
|
|
|
- Anthropic Claude model refs (`amazon-bedrock/*anthropic.claude*`) support explicit `cacheRetention` pass-through.
|
|
- Non-Anthropic Bedrock models are forced to `cacheRetention: "none"` at runtime.
|
|
|
|
### OpenRouter Anthropic models
|
|
|
|
For `openrouter/anthropic/*` model refs, OpenClaw injects Anthropic `cache_control` on system/developer prompt blocks to improve prompt-cache reuse.
|
|
|
|
### Other providers
|
|
|
|
If the provider does not support this cache mode, `cacheRetention` has no effect.
|
|
|
|
## Tuning patterns
|
|
|
|
### Mixed traffic (recommended default)
|
|
|
|
Keep a long-lived baseline on your main agent, disable caching on bursty notifier agents:
|
|
|
|
```yaml
|
|
agents:
|
|
defaults:
|
|
model:
|
|
primary: "anthropic/claude-opus-4-6"
|
|
models:
|
|
"anthropic/claude-opus-4-6":
|
|
params:
|
|
cacheRetention: "long"
|
|
list:
|
|
- id: "research"
|
|
default: true
|
|
heartbeat:
|
|
every: "55m"
|
|
- id: "alerts"
|
|
params:
|
|
cacheRetention: "none"
|
|
```
|
|
|
|
### Cost-first baseline
|
|
|
|
- Set baseline `cacheRetention: "short"`.
|
|
- Enable `contextPruning.mode: "cache-ttl"`.
|
|
- Keep heartbeat below your TTL only for agents that benefit from warm caches.
|
|
|
|
## Cache diagnostics
|
|
|
|
OpenClaw exposes dedicated cache-trace diagnostics for embedded agent runs.
|
|
|
|
## Live regression tests
|
|
|
|
OpenClaw keeps one combined live cache regression gate for repeated prefixes, tool turns, image turns, MCP-style tool transcripts, and an Anthropic no-cache control.
|
|
|
|
- `src/agents/live-cache-regression.live.test.ts`
|
|
- `src/agents/live-cache-regression-baseline.ts`
|
|
|
|
Run the narrow live gate with:
|
|
|
|
```sh
|
|
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_CACHE_TEST=1 pnpm test:live:cache
|
|
```
|
|
|
|
The baseline file stores the most recent observed live numbers plus the provider-specific regression floors used by the test.
|
|
The runner also uses fresh per-run session IDs and prompt namespaces so previous cache state does not pollute the current regression sample.
|
|
|
|
These tests intentionally do not use identical success criteria across providers.
|
|
|
|
### Anthropic live expectations
|
|
|
|
- Expect explicit warmup writes via `cacheWrite`.
|
|
- Expect near-full history reuse on repeated turns because Anthropic cache control advances the cache breakpoint through the conversation.
|
|
- Current live assertions still use high hit-rate thresholds for stable, tool, and image paths.
|
|
|
|
### OpenAI live expectations
|
|
|
|
- Expect `cacheRead` only. `cacheWrite` remains `0`.
|
|
- Treat repeated-turn cache reuse as a provider-specific plateau, not as Anthropic-style moving full-history reuse.
|
|
- Current live assertions use conservative floor checks derived from observed live behavior on `gpt-5.4-mini`:
|
|
- stable prefix: `cacheRead >= 4608`, hit rate `>= 0.90`
|
|
- tool transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
|
|
- image transcript: `cacheRead >= 3840`, hit rate `>= 0.82`
|
|
- MCP-style transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
|
|
|
|
Fresh combined live verification on 2026-04-04 landed at:
|
|
|
|
- stable prefix: `cacheRead=4864`, hit rate `0.966`
|
|
- tool transcript: `cacheRead=4608`, hit rate `0.896`
|
|
- image transcript: `cacheRead=4864`, hit rate `0.954`
|
|
- MCP-style transcript: `cacheRead=4608`, hit rate `0.891`
|
|
|
|
Recent local wall-clock time for the combined gate was about `88s`.
|
|
|
|
Why the assertions differ:
|
|
|
|
- Anthropic exposes explicit cache breakpoints and moving conversation-history reuse.
|
|
- OpenAI prompt caching is still exact-prefix sensitive, but the effective reusable prefix in live Responses traffic can plateau earlier than the full prompt.
|
|
- Because of that, comparing Anthropic and OpenAI by a single cross-provider percentage threshold creates false regressions.
|
|
|
|
### `diagnostics.cacheTrace` config
|
|
|
|
```yaml
|
|
diagnostics:
|
|
cacheTrace:
|
|
enabled: true
|
|
filePath: "~/.openclaw/logs/cache-trace.jsonl" # optional
|
|
includeMessages: false # default true
|
|
includePrompt: false # default true
|
|
includeSystem: false # default true
|
|
```
|
|
|
|
Defaults:
|
|
|
|
- `filePath`: `$OPENCLAW_STATE_DIR/logs/cache-trace.jsonl`
|
|
- `includeMessages`: `true`
|
|
- `includePrompt`: `true`
|
|
- `includeSystem`: `true`
|
|
|
|
### Env toggles (one-off debugging)
|
|
|
|
- `OPENCLAW_CACHE_TRACE=1` enables cache tracing.
|
|
- `OPENCLAW_CACHE_TRACE_FILE=/path/to/cache-trace.jsonl` overrides output path.
|
|
- `OPENCLAW_CACHE_TRACE_MESSAGES=0|1` toggles full message payload capture.
|
|
- `OPENCLAW_CACHE_TRACE_PROMPT=0|1` toggles prompt text capture.
|
|
- `OPENCLAW_CACHE_TRACE_SYSTEM=0|1` toggles system prompt capture.
|
|
|
|
### What to inspect
|
|
|
|
- Cache trace events are JSONL and include staged snapshots like `session:loaded`, `prompt:before`, `stream:context`, and `session:after`.
|
|
- Per-turn cache token impact is visible in normal usage surfaces via `cacheRead` and `cacheWrite` (for example `/usage full` and session usage summaries).
|
|
- For Anthropic, expect both `cacheRead` and `cacheWrite` when caching is active.
|
|
- For OpenAI, expect `cacheRead` on cache hits and `cacheWrite` to remain `0`; OpenAI does not publish a separate cache-write token field.
|
|
- If you need request tracing, log request IDs and rate-limit headers separately from cache metrics. OpenClaw's current cache-trace output is focused on prompt/session shape and normalized token usage rather than raw provider response headers.
|
|
|
|
## Quick troubleshooting
|
|
|
|
- High `cacheWrite` on most turns: check for volatile system-prompt inputs and verify model/provider supports your cache settings.
|
|
- High `cacheWrite` on Anthropic: often means the cache breakpoint is landing on content that changes every request.
|
|
- Low OpenAI `cacheRead`: verify the stable prefix is at the front, the repeated prefix is at least 1024 tokens, and the same `prompt_cache_key` is reused for turns that should share a cache.
|
|
- No effect from `cacheRetention`: confirm model key matches `agents.defaults.models["provider/model"]`.
|
|
- Bedrock Nova/Mistral requests with cache settings: expected runtime force to `none`.
|
|
|
|
Related docs:
|
|
|
|
- [Anthropic](/providers/anthropic)
|
|
- [Token Use and Costs](/reference/token-use)
|
|
- [Session Pruning](/concepts/session-pruning)
|
|
- [Gateway Configuration Reference](/gateway/configuration-reference)
|