Aller au contenu principal
William Balance
← Back to blog

Anthropic Prompt Caching in Production: What You Actually Save

Published on April 29, 2026 · 4 min read
  • Claude
  • Prompt Caching
  • Performance
  • TypeScript
  • Anthropic
  • Cost

I opened my Anthropic bill for April: 3× higher than March. Not a disaster — usage went up, that’s expected. But looking closely, 60% of the cost was avoidable. The cause: an 8,000-token system prompt replayed on every agent call, with no cache. Three changes to the payload and the next month’s bill dropped by 40%.

Anthropic prompt caching has been out for almost two years. Yet I keep seeing production integrations where it’s not enabled — either out of unfamiliarity or because the upside is misunderstood. Here’s what I’ve learned after wiring it into 6-7 client projects.

How it works, in 30 seconds

You mark segments of your prompt with cache_control: { type: "ephemeral" }. Anthropic stores those segments in an ephemeral server-side cache. On the next call, if every token preceding the cache_control matches exactly, those segments are re-read from the cache instead of being reprocessed.

Cost: +25% over normal input on the first write. 10% of normal input on cache reads. Default TTL is 5 minutes (extensible to 1 hour as an opt-in). Minimum cacheable size: 1,024 tokens on Sonnet/Opus, 2,048 on Haiku.

The math is simple: replay a 5,000-token segment four times within 5 minutes and you pay 1.25 + 0.10 × 3 = 1.55× the normal cost instead of 4×. That’s -61%.

When it’s worth it

  • Long, stable system prompts (> 2,000 tokens): tone of voice, task instructions, few-shot examples. The most cost-effective case.
  • Tool definitions in an agent: 10-20 tools = 3,000-8,000 tokens of JSON schema, identical at every loop turn. Massive savings.
  • Fixed RAG context: legal docs, terms of service, company policy. If you pass the same 50 chunks on every user request, cache them.
  • Iterative agent workflows: a loop of 10 tool calls reusing the same baggage = 9 cache reads at 10% each.

When it doesn’t help (or costs more)

  • Short sessions: one call per user, no reuse. You pay the 25% write surcharge with no read to amortize it.
  • Prompts < 1,024 tokens: below the threshold, cache_control is silently ignored. The API doesn’t warn you.
  • Dynamic prefix: if any variable changes before the cached zone, the entire prefix is invalidated. Put dynamic segments after cached ones, never before.
  • TTL exceeded: past 5 minutes idle, the cache is flushed. Fine if the user comes back — but it ruins the math on sporadic traffic.

The pattern that works

Four segments, in this order:

  1. System rule (rarely changes) → cache_control
  2. Tool definitions (change on each release) → cache_control
  3. Fixed context (RAG, user profile, session data) → cache_control if > 1,024 tokens
  4. User turn (varies on every call) → no cache_control
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    { type: "text", text: SYSTEM_RULE, cache_control: { type: "ephemeral" } },
    { type: "text", text: TOOL_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
    { type: "text", text: USER_PROFILE, cache_control: { type: "ephemeral" } },
  ],
  tools: TOOLS,
  messages: [{ role: "user", content: userInput }],
});

console.log(response.usage);
// cache_creation_input_tokens vs cache_read_input_tokens

The usage.cache_read_input_tokens field is your friend. If you don’t monitor it, you won’t know whether the cache hits. I’ve seen three projects where caching was enabled but never actually used — a single character of difference in the system prompt on every call (timestamp, UUID), systematic invalidation.

Production gotchas

  • Invisible debugging: a cache miss never raises an error. You just watch the bill grow. Log cache_read_input_tokens on every call and alert if the ratio drops.
  • Measuring hit rate: before enabling caching, capture a day of production calls. Count how many prompts could have been reused. If < 30%, caching costs more than it saves.
  • Releases invalidate everything: each deploy modifying the system prompt forces a full write for every active user. On a service with 10k concurrent users, that hurts. Atomic releases outside peak hours recommended.
  • Multi-tenancy: if each tenant has a slightly different system prompt, you end up with N separate caches. Factor the common trunk into cache_control and put tenant-specific parts after.

When NOT to use it

If you make fewer than 100 calls per day, don’t bother with caching. The complexity isn’t worth it, and the first-write surcharge never gets amortized. Turn it on when you see your agent looping and the same baggage replaying 5 times — not before.

Otherwise, it’s probably the highest-leverage optimization on a production LLM app.

Got a project in mind?

30 free minutes to talk it through, AI or full-stack.