| title | summary | read_when |
|---|---|---|
| Prompt Caching | Prompt caching knobs, merge order, provider behavior, and tuning patterns | |
# Prompt caching
Prompt caching means the model provider can reuse unchanged prompt prefixes (usually system/developer instructions and other stable context) across turns instead of re-processing them every time. OpenClaw normalizes provider usage into `cacheRead` and `cacheWrite` where the upstream API exposes those counters directly.
Why this matters: lower token cost, faster responses, and more predictable performance for long-running sessions. Without caching, repeated prompts pay the full prompt cost on every turn even when most input did not change.
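As a rough back-of-envelope, caching trades a slightly more expensive first write for much cheaper reads. The multipliers below follow Anthropic's published pricing at the time of writing (cache writes around 1.25x base input for the 5-minute tier, cache reads around 0.1x); treat them as illustrative assumptions, not OpenClaw constants:

```typescript
// Relative cost of one turn, given how its input tokens were served.
const BASE = 1.0;          // uncached input token
const CACHE_WRITE = 1.25;  // token written to the 5-minute cache (assumed multiplier)
const CACHE_READ = 0.1;    // token read back from cache (assumed multiplier)

function turnCost(fresh: number, cacheWrite: number, cacheRead: number): number {
  return fresh * BASE + cacheWrite * CACHE_WRITE + cacheRead * CACHE_READ;
}

// First turn: a 10k-token stable prefix is written to the cache.
const first = turnCost(500, 10_000, 0);
// Later turns: the same prefix is read back instead of re-processed.
const later = turnCost(500, 0, 10_000);

console.log(first); // 13000
console.log(later); // 1500
```

Every cached turn after the first costs a small fraction of re-processing the full prompt, which is why long-running sessions benefit most.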
This page covers all cache-related knobs that affect prompt reuse and token cost.
Provider references:
- Anthropic prompt caching: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- OpenAI prompt caching: https://developers.openai.com/api/docs/guides/prompt-caching
- OpenAI API headers and request IDs: https://developers.openai.com/api/reference/overview
- Anthropic request IDs and errors: https://platform.claude.com/docs/en/api/errors
## Primary knobs

### `cacheRetention` (global default, model, and per-agent)
Set cache retention as a global default for all models:
```yaml
agents:
  defaults:
    params:
      cacheRetention: "long" # none | short | long
```
Override per-model:
```yaml
agents:
  defaults:
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # none | short | long
```
Per-agent override:
```yaml
agents:
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"
```
Config merge order (later entries override earlier ones):

1. `agents.defaults.params` (global default; applies to all models)
2. `agents.defaults.models["provider/model"].params` (per-model override)
3. `agents.list[].params` (matching agent id; overrides by key)
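For example, with all three levels set, the per-agent value wins for that agent and the per-model value wins everywhere else; the model name and agent id below are illustrative:

```yaml
agents:
  defaults:
    params:
      cacheRetention: "long"      # global default
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # overrides the global default for this model
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"    # wins for the "alerts" agent
```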
### Legacy `cacheControlTtl`

Legacy values are still accepted and mapped:

- `5m` -> `short`
- `1h` -> `long`

Prefer `cacheRetention` for new config.
### `contextPruning.mode: "cache-ttl"`

Prunes old tool-result context after cache TTL windows so post-idle requests do not re-cache oversized history.

```yaml
agents:
  defaults:
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
```
See Session Pruning for full behavior.
### Heartbeat keep-warm

Heartbeat can keep cache windows warm and reduce repeated cache writes after idle gaps.

```yaml
agents:
  defaults:
    heartbeat:
      every: "55m"
```

Per-agent heartbeat is supported at `agents.list[].heartbeat`.
## Provider behavior

### Anthropic (direct API)

- `cacheRetention` is supported.
- With Anthropic API-key auth profiles, OpenClaw seeds `cacheRetention: "short"` for Anthropic model refs when unset.
- Anthropic native Messages responses expose both `cache_read_input_tokens` and `cache_creation_input_tokens`, so OpenClaw can show both `cacheRead` and `cacheWrite`.
- For native Anthropic requests, `cacheRetention: "short"` maps to the default 5-minute ephemeral cache, and `cacheRetention: "long"` upgrades to the 1-hour TTL only on direct `api.anthropic.com` hosts.
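For reference, a native Anthropic Messages request with a cache breakpoint looks roughly like the sketch below. The `cache_control` field follows Anthropic's Messages API; the model name and prompt text are placeholders:

```typescript
// Minimal shape of an Anthropic Messages request body with a cache
// breakpoint on the system block (illustrative values).
const request = {
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "Long, stable system instructions go here...",
      // "ephemeral" is the default 5-minute tier; adding ttl: "1h"
      // selects the 1-hour tier on direct api.anthropic.com hosts.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarize the latest alerts." }],
};

console.log(request.system[0].cache_control.type); // ephemeral
```

Everything up to and including the block carrying `cache_control` is the cacheable prefix; content after it is processed fresh each turn.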
### OpenAI (direct API)

- Prompt caching is automatic on supported recent models. OpenClaw does not need to inject block-level cache markers.
- OpenClaw uses `prompt_cache_key` to keep cache routing stable across turns, and uses `prompt_cache_retention: "24h"` only when `cacheRetention: "long"` is selected on direct OpenAI hosts.
- OpenAI responses expose cached prompt tokens via `usage.prompt_tokens_details.cached_tokens` (or `input_tokens_details.cached_tokens` on Responses API events). OpenClaw maps that to `cacheRead`.
- OpenAI does not expose a separate cache-write token counter, so `cacheWrite` stays `0` on OpenAI paths even when the provider is warming a cache.
- OpenAI returns useful tracing and rate-limit headers such as `x-request-id`, `openai-processing-ms`, and `x-ratelimit-*`, but cache-hit accounting should come from the usage payload, not from headers.
- In practice, OpenAI often behaves like an initial-prefix cache rather than Anthropic-style moving full-history reuse. Stable long-prefix text turns can land near a `4864` cached-token plateau in current live probes, while tool-heavy or MCP-style transcripts often plateau near `4608` cached tokens even on exact repeats.
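The usage-payload mapping described above can be sketched in a few lines. The field paths follow OpenAI's API; the `normalizeUsage` helper name is hypothetical, not actual OpenClaw code:

```typescript
// Normalize an OpenAI usage payload into the cacheRead/cacheWrite
// counters this page describes (sketch, not the real implementation).
interface OpenAIUsage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

function normalizeUsage(u: OpenAIUsage) {
  return {
    cacheRead: u.prompt_tokens_details?.cached_tokens ?? 0,
    cacheWrite: 0, // OpenAI exposes no cache-write counter
  };
}

const usage = normalizeUsage({
  prompt_tokens: 5000,
  prompt_tokens_details: { cached_tokens: 4864 },
});
console.log(usage.cacheRead, usage.cacheWrite); // 4864 0
```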
### Amazon Bedrock

- Anthropic Claude model refs (`amazon-bedrock/*anthropic.claude*`) support explicit `cacheRetention` pass-through.
- Non-Anthropic Bedrock models are forced to `cacheRetention: "none"` at runtime.
### OpenRouter Anthropic models

For `openrouter/anthropic/*` model refs, OpenClaw injects Anthropic `cache_control` on system/developer prompt blocks to improve prompt-cache reuse.
### Other providers

If the provider does not support this cache mode, `cacheRetention` has no effect.
## Tuning patterns

### Mixed traffic (recommended default)

Keep a long-lived baseline on your main agent, and disable caching on bursty notifier agents:
```yaml
agents:
  defaults:
    model:
      primary: "anthropic/claude-opus-4-6"
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "long"
  list:
    - id: "research"
      default: true
      heartbeat:
        every: "55m"
    - id: "alerts"
      params:
        cacheRetention: "none"
```
### Cost-first baseline

- Set baseline `cacheRetention: "short"`.
- Enable `contextPruning.mode: "cache-ttl"`.
- Keep heartbeat below your TTL only for agents that benefit from warm caches.
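Put together, a cost-first baseline might look like this (the agent id and intervals are illustrative):

```yaml
agents:
  defaults:
    params:
      cacheRetention: "short"
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
  list:
    - id: "research"
      heartbeat:
        every: "55m" # below the 1h TTL, keeps this agent's cache warm
```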
## Cache diagnostics
OpenClaw exposes dedicated cache-trace diagnostics for embedded agent runs.
### Live regression tests

OpenClaw keeps provider-specific live cache probes for repeated prefixes, tool turns, image turns, and MCP-style tool transcripts.

- `src/agents/pi-embedded-runner.cache.live.test.ts`
- `src/agents/pi-mcp-style.cache.live.test.ts`
These tests intentionally do not use identical success criteria across providers.
#### Anthropic live expectations

- Expect explicit warmup writes via `cacheWrite`.
- Expect near-full history reuse on repeated turns because Anthropic cache control advances the cache breakpoint through the conversation.
- Current live assertions still use high hit-rate thresholds for stable, tool, and image paths.
#### OpenAI live expectations

- Expect `cacheRead` only; `cacheWrite` remains `0`.
- Treat repeated-turn cache reuse as a provider-specific plateau, not as Anthropic-style moving full-history reuse.
- Current live assertions use conservative floor checks derived from observed live behavior on `gpt-5.4-mini`:
  - stable prefix: `cacheRead >= 4608`, hit rate `>= 0.90`
  - tool transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
  - image transcript: `cacheRead >= 3840`, hit rate `>= 0.82`
  - MCP-style transcript: `cacheRead >= 4096`, hit rate `>= 0.85`
Fresh OpenAI verification on 2026-04-04 landed at:

- stable prefix: `cacheRead=4864`, hit rate `0.971`
- tool transcript: `cacheRead=4608`, hit rate `0.900`
- image transcript: `cacheRead=4864`, hit rate `0.959`
- MCP-style transcript: `cacheRead=4608`, hit rate `0.895`
Why the assertions differ:
- Anthropic exposes explicit cache breakpoints and moving conversation-history reuse.
- OpenAI prompt caching is still exact-prefix sensitive, but the effective reusable prefix in live Responses traffic can plateau earlier than the full prompt.
- Because of that, comparing Anthropic and OpenAI by a single cross-provider percentage threshold creates false regressions.
### `diagnostics.cacheTrace` config

```yaml
diagnostics:
  cacheTrace:
    enabled: true
    filePath: "~/.openclaw/logs/cache-trace.jsonl" # optional
    includeMessages: false # default true
    includePrompt: false # default true
    includeSystem: false # default true
```
Defaults:

- `filePath`: `$OPENCLAW_STATE_DIR/logs/cache-trace.jsonl`
- `includeMessages`: `true`
- `includePrompt`: `true`
- `includeSystem`: `true`
### Env toggles (one-off debugging)

- `OPENCLAW_CACHE_TRACE=1` enables cache tracing.
- `OPENCLAW_CACHE_TRACE_FILE=/path/to/cache-trace.jsonl` overrides the output path.
- `OPENCLAW_CACHE_TRACE_MESSAGES=0|1` toggles full message payload capture.
- `OPENCLAW_CACHE_TRACE_PROMPT=0|1` toggles prompt text capture.
- `OPENCLAW_CACHE_TRACE_SYSTEM=0|1` toggles system prompt capture.
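For a one-off debug session, the toggles can be exported inline. The `openclaw` invocation below is commented out because the exact CLI entry point is an assumption, not something this page documents:

```shell
# Enable cache tracing for a single debugging session.
export OPENCLAW_CACHE_TRACE=1
export OPENCLAW_CACHE_TRACE_FILE="$HOME/cache-trace.jsonl"
export OPENCLAW_CACHE_TRACE_MESSAGES=0   # skip full message payloads

# Hypothetical run command; substitute your actual entry point:
# openclaw run --agent research

echo "cache trace: $OPENCLAW_CACHE_TRACE -> $OPENCLAW_CACHE_TRACE_FILE"
```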
### What to inspect

- Cache trace events are JSONL and include staged snapshots like `session:loaded`, `prompt:before`, `stream:context`, and `session:after`.
- Per-turn cache token impact is visible in normal usage surfaces via `cacheRead` and `cacheWrite` (for example `/usage full` and session usage summaries).
- For Anthropic, expect both `cacheRead` and `cacheWrite` when caching is active.
- For OpenAI, expect `cacheRead` on cache hits and `cacheWrite` to remain `0`; OpenAI does not publish a separate cache-write token field.
- If you need request tracing, log request IDs and rate-limit headers separately from cache metrics. OpenClaw's current cache-trace output is focused on prompt/session shape and normalized token usage rather than raw provider response headers.
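Since the trace file is JSONL, a few lines of script are enough to pull the normalized counters out of it. The stage names match the docs above; the event shape (a `usage` object carrying `cacheRead`/`cacheWrite`) is an assumption for illustration:

```typescript
// Sample trace lines; in practice you would read them from
// cache-trace.jsonl, e.g. fs.readFileSync(path, "utf8").split("\n").
const lines = [
  '{"stage":"session:loaded","usage":{"cacheRead":0,"cacheWrite":0}}',
  '{"stage":"stream:context","usage":{"cacheRead":4608,"cacheWrite":0}}',
];

// Sum the normalized cache counters across trace events.
let cacheRead = 0;
let cacheWrite = 0;
for (const line of lines) {
  const event = JSON.parse(line);
  cacheRead += event.usage?.cacheRead ?? 0;
  cacheWrite += event.usage?.cacheWrite ?? 0;
}

console.log(cacheRead, cacheWrite); // 4608 0
```

On an Anthropic path you would expect nonzero `cacheWrite` on warmup turns; on OpenAI paths `cacheWrite` should stay `0` throughout.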
## Quick troubleshooting

- High `cacheWrite` on most turns: check for volatile system-prompt inputs and verify the model/provider supports your cache settings.
- High `cacheWrite` on Anthropic: often means the cache breakpoint is landing on content that changes every request.
- Low OpenAI `cacheRead`: verify the stable prefix is at the front, the repeated prefix is at least 1024 tokens, and the same `prompt_cache_key` is reused for turns that should share a cache.
- No effect from `cacheRetention`: confirm the model key matches `agents.defaults.models["provider/model"]`.
- Bedrock Nova/Mistral requests with cache settings: the runtime forcing these to `none` is expected.
Related docs: