| title | summary | read_when |
|---|---|---|
| Prompt Caching | Prompt caching knobs, merge order, provider behavior, and tuning patterns | |
# Prompt caching
Prompt caching means the model provider can reuse unchanged prompt prefixes (usually system/developer instructions and other stable context) across turns instead of re-processing them every time. OpenClaw normalizes provider usage into `cacheRead` and `cacheWrite` where the upstream API exposes those counters directly.
Why this matters: lower token cost, faster responses, and more predictable performance for long-running sessions. Without caching, repeated prompts pay the full prompt cost on every turn even when most input did not change.
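As a rough back-of-envelope, caching trades a slightly more expensive first write for much cheaper reads. The multipliers below follow Anthropic's published pricing at the time of writing (cache writes around 1.25x base input for the 5-minute tier, cache reads around 0.1x); treat them as illustrative assumptions, not OpenClaw constants:

```typescript
// Relative cost of one turn, given how its input tokens were served.
const BASE = 1.0;          // uncached input token
const CACHE_WRITE = 1.25;  // token written to the 5-minute cache (assumed multiplier)
const CACHE_READ = 0.1;    // token read back from cache (assumed multiplier)

function turnCost(fresh: number, cacheWrite: number, cacheRead: number): number {
  return fresh * BASE + cacheWrite * CACHE_WRITE + cacheRead * CACHE_READ;
}

// First turn: a 10k-token stable prefix is written to the cache.
const first = turnCost(500, 10_000, 0);
// Later turns: the same prefix is read back instead of re-processed.
const later = turnCost(500, 0, 10_000);

console.log(first); // 13000
console.log(later); // 1500
```

Every cached turn after the first costs a small fraction of re-processing the full prompt, which is why long-running sessions benefit most.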
This page covers all cache-related knobs that affect prompt reuse and token cost.
Provider references:
- Anthropic prompt caching: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- OpenAI prompt caching: https://developers.openai.com/api/docs/guides/prompt-caching
- OpenAI API headers and request IDs: https://developers.openai.com/api/reference/overview
- Anthropic request IDs and errors: https://platform.claude.com/docs/en/api/errors
## Primary knobs

### `cacheRetention` (global default, model, and per-agent)
Set cache retention as a global default for all models:
```yaml
agents:
  defaults:
    params:
      cacheRetention: "long" # none | short | long
```
Override per-model:
```yaml
agents:
  defaults:
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # none | short | long
```
Per-agent override:
```yaml
agents:
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"
```
Config merge order (later entries override earlier ones):

1. `agents.defaults.params` (global default; applies to all models)
2. `agents.defaults.models["provider/model"].params` (per-model override)
3. `agents.list[].params` (matching agent id; overrides by key)
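For example, with all three levels set, the per-agent value wins for that agent and the per-model value wins everywhere else; the model name and agent id below are illustrative:

```yaml
agents:
  defaults:
    params:
      cacheRetention: "long"      # global default
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # overrides the global default for this model
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"    # wins for the "alerts" agent
```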
### Legacy `cacheControlTtl`

Legacy values are still accepted and mapped:

- `5m` -> `short`
- `1h` -> `long`

Prefer `cacheRetention` for new config.
### `contextPruning.mode: "cache-ttl"`

Prunes old tool-result context after cache TTL windows so post-idle requests do not re-cache oversized history.

```yaml
agents:
  defaults:
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
```
See Session Pruning for full behavior.
### Heartbeat keep-warm

Heartbeat can keep cache windows warm and reduce repeated cache writes after idle gaps.

```yaml
agents:
  defaults:
    heartbeat:
      every: "55m"
```

Per-agent heartbeat is supported at `agents.list[].heartbeat`.
## Provider behavior

### Anthropic (direct API)

- `cacheRetention` is supported.
- With Anthropic API-key auth profiles, OpenClaw seeds `cacheRetention: "short"` for Anthropic model refs when unset.
- Anthropic native Messages responses expose both `cache_read_input_tokens` and `cache_creation_input_tokens`, so OpenClaw can show both `cacheRead` and `cacheWrite`.
- For native Anthropic requests, `cacheRetention: "short"` maps to the default 5-minute ephemeral cache, and `cacheRetention: "long"` upgrades to the 1-hour TTL only on direct `api.anthropic.com` hosts.
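For reference, a native Anthropic Messages request with a cache breakpoint looks roughly like the sketch below. The `cache_control` field follows Anthropic's Messages API; the model name and prompt text are placeholders:

```typescript
// Minimal shape of an Anthropic Messages request body with a cache
// breakpoint on the system block (illustrative values).
const request = {
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "Long, stable system instructions go here...",
      // "ephemeral" is the default 5-minute tier; adding ttl: "1h"
      // selects the 1-hour tier on direct api.anthropic.com hosts.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarize the latest alerts." }],
};

console.log(request.system[0].cache_control.type); // ephemeral
```

Everything up to and including the block carrying `cache_control` is the cacheable prefix; content after it is processed fresh each turn.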
### OpenAI (direct API)

- Prompt caching is automatic on supported recent models. OpenClaw does not need to inject block-level cache markers.
- OpenClaw uses `prompt_cache_key` to keep cache routing stable across turns, and uses `prompt_cache_retention: "24h"` only when `cacheRetention: "long"` is selected on direct OpenAI hosts.
- OpenAI responses expose cached prompt tokens via `usage.prompt_tokens_details.cached_tokens` (or `input_tokens_details.cached_tokens` on Responses API events). OpenClaw maps that to `cacheRead`.
- OpenAI does not expose a separate cache-write token counter, so `cacheWrite` stays `0` on OpenAI paths even when the provider is warming a cache.
- OpenAI returns useful tracing and rate-limit headers such as `x-request-id`, `openai-processing-ms`, and `x-ratelimit-*`, but cache-hit accounting should come from the usage payload, not from headers.
- In practice, OpenAI often behaves like an initial-prefix cache rather than Anthropic-style moving full-history reuse. Stable long-prefix text turns can land near a `4864` cached-token plateau in current live probes, while tool-heavy or MCP-style transcripts often plateau near `4608` cached tokens even on exact repeats.
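The usage-payload mapping described above can be sketched in a few lines. The field paths follow OpenAI's API; the `normalizeUsage` helper name is hypothetical, not actual OpenClaw code:

```typescript
// Normalize an OpenAI usage payload into the cacheRead/cacheWrite
// counters this page describes (sketch, not the real implementation).
interface OpenAIUsage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

function normalizeUsage(u: OpenAIUsage) {
  return {
    cacheRead: u.prompt_tokens_details?.cached_tokens ?? 0,
    cacheWrite: 0, // OpenAI exposes no cache-write counter
  };
}

const usage = normalizeUsage({
  prompt_tokens: 5000,
  prompt_tokens_details: { cached_tokens: 4864 },
});
console.log(usage.cacheRead, usage.cacheWrite); // 4864 0
```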
### Amazon Bedrock

- Anthropic Claude model refs (`amazon-bedrock/*anthropic.claude*`) support explicit `cacheRetention` pass-through.
- Non-Anthropic Bedrock models are forced to `cacheRetention: "none"` at runtime.
### OpenRouter Anthropic models

For `openrouter/anthropic/*` model refs, OpenClaw injects Anthropic `cache_control` on system/developer prompt blocks to improve prompt-cache reuse.
### Other providers

If the provider does not support this cache mode, `cacheRetention` has no effect.
## Tuning patterns

### Mixed traffic (recommended default)

Keep a long-lived baseline on your main agent, and disable caching on bursty notifier agents:
```yaml
agents:
  defaults:
    model:
      primary: "anthropic/claude-opus-4-6"
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "long"
  list:
    - id: "research"
      default: true
      heartbeat:
        every: "55m"
    - id: "alerts"
      params:
        cacheRetention: "none"
```
### Cost-first baseline

- Set baseline `cacheRetention: "short"`.
- Enable `contextPruning.mode: "cache-ttl"`.
- Keep heartbeat below your TTL only for agents that benefit from warm caches.
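Put together, a cost-first baseline might look like this (the agent id and intervals are illustrative):

```yaml
agents:
  defaults:
    params:
      cacheRetention: "short"
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
  list:
    - id: "research"
      heartbeat:
        every: "55m" # below the 1h TTL, keeps this agent's cache warm
```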
## Cache diagnostics
OpenClaw exposes dedicated cache-trace diagnostics for embedded agent runs.
### Live regression tests

OpenClaw keeps provider-specific live cache probes for repeated prefixes, tool turns, image turns, and MCP-style tool transcripts.

- `src/agents/pi-embedded-runner.cache.live.test.ts`
- `src/agents/pi-mcp-style.cache.live.test.ts`
These tests intentionally do not use identical success criteria across providers.
#### Anthropic live expectations

- Expect explicit warmup writes via `cacheWrite`.
- Expect near-full history reuse on repeated turns because Anthropic cache control advances the cache breakpoint through the conversation.
- Current live assertions still use high hit-rate thresholds for stable, tool, and image paths.
#### OpenAI live expectations

- Expect `cacheRead` only; `cacheWrite` remains `0`.
- Treat repeated-turn cache reuse as a provider-specific plateau, not as Anthropic-style moving full-history reuse.
- Current live assertions use conservative floor checks derived from observed live behavior on `gpt-5.4-mini`:
  - stable prefix: `cacheRead >= 4608`, hit rate `>= 0.90`
  - tool transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
  - image transcript: `cacheRead >= 3840`, hit rate `>= 0.82`
  - MCP-style transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
Fresh OpenAI verification on 2026-04-04 landed at:

- stable prefix: `cacheRead=4864`, hit rate `0.971`
- tool transcript: `cacheRead=4608`, hit rate `0.900`
- image transcript: `cacheRead=4864`, hit rate `0.959`
- MCP-style transcript: `cacheRead=4608`, hit rate `0.895`
Why the assertions differ:
- Anthropic exposes explicit cache breakpoints and moving conversation-history reuse.
- OpenAI prompt caching is still exact-prefix sensitive, but the effective reusable prefix in live Responses traffic can plateau earlier than the full prompt.
- Because of that, comparing Anthropic and OpenAI by a single cross-provider percentage threshold creates false regressions.
### `diagnostics.cacheTrace` config

```yaml
diagnostics:
  cacheTrace:
    enabled: true
    filePath: "~/.openclaw/logs/cache-trace.jsonl" # optional
    includeMessages: false # default true
    includePrompt: false # default true
    includeSystem: false # default true
```
Defaults:

- `filePath`: `$OPENCLAW_STATE_DIR/logs/cache-trace.jsonl`
- `includeMessages`: `true`
- `includePrompt`: `true`
- `includeSystem`: `true`
### Env toggles (one-off debugging)

- `OPENCLAW_CACHE_TRACE=1` enables cache tracing.
- `OPENCLAW_CACHE_TRACE_FILE=/path/to/cache-trace.jsonl` overrides the output path.
- `OPENCLAW_CACHE_TRACE_MESSAGES=0|1` toggles full message payload capture.
- `OPENCLAW_CACHE_TRACE_PROMPT=0|1` toggles prompt text capture.
- `OPENCLAW_CACHE_TRACE_SYSTEM=0|1` toggles system prompt capture.
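For a one-off debug session, the toggles can be exported inline. The `openclaw` invocation below is commented out because the exact CLI entry point is an assumption, not something this page documents:

```shell
# Enable cache tracing for a single debugging session.
export OPENCLAW_CACHE_TRACE=1
export OPENCLAW_CACHE_TRACE_FILE="$HOME/cache-trace.jsonl"
export OPENCLAW_CACHE_TRACE_MESSAGES=0   # skip full message payloads

# Hypothetical run command; substitute your actual entry point:
# openclaw run --agent research

echo "cache trace: $OPENCLAW_CACHE_TRACE -> $OPENCLAW_CACHE_TRACE_FILE"
```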
### What to inspect

- Cache trace events are JSONL and include staged snapshots like `session:loaded`, `prompt:before`, `stream:context`, and `session:after`.
- Per-turn cache token impact is visible in normal usage surfaces via `cacheRead` and `cacheWrite` (for example `/usage full` and session usage summaries).
- For Anthropic, expect both `cacheRead` and `cacheWrite` when caching is active.
- For OpenAI, expect `cacheRead` on cache hits and `cacheWrite` to remain `0`; OpenAI does not publish a separate cache-write token field.
- If you need request tracing, log request IDs and rate-limit headers separately from cache metrics. OpenClaw's current cache-trace output is focused on prompt/session shape and normalized token usage rather than raw provider response headers.
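Since the trace file is JSONL, a few lines of script are enough to pull the normalized counters out of it. The stage names match the docs above; the event shape (a `usage` object carrying `cacheRead`/`cacheWrite`) is an assumption for illustration:

```typescript
// Sample trace lines; in practice you would read them from
// cache-trace.jsonl, e.g. fs.readFileSync(path, "utf8").split("\n").
const lines = [
  '{"stage":"session:loaded","usage":{"cacheRead":0,"cacheWrite":0}}',
  '{"stage":"stream:context","usage":{"cacheRead":4608,"cacheWrite":0}}',
];

// Sum the normalized cache counters across trace events.
let cacheRead = 0;
let cacheWrite = 0;
for (const line of lines) {
  const event = JSON.parse(line);
  cacheRead += event.usage?.cacheRead ?? 0;
  cacheWrite += event.usage?.cacheWrite ?? 0;
}

console.log(cacheRead, cacheWrite); // 4608 0
```

On an Anthropic path you would expect nonzero `cacheWrite` on warmup turns; on OpenAI paths `cacheWrite` should stay `0` throughout.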
## Quick troubleshooting

- High `cacheWrite` on most turns: check for volatile system-prompt inputs and verify the model/provider supports your cache settings.
- High `cacheWrite` on Anthropic: often means the cache breakpoint is landing on content that changes every request.
- Low OpenAI `cacheRead`: verify the stable prefix is at the front, the repeated prefix is at least 1024 tokens, and the same `prompt_cache_key` is reused for turns that should share a cache.
- No effect from `cacheRetention`: confirm the model key matches `agents.defaults.models["provider/model"]`.
- Bedrock Nova/Mistral requests with cache settings: the runtime forcing these to `none` is expected.
Related docs: