Anthropic Prompt Caching in Production: What You Actually Save
- Claude
- Prompt Caching
- Performance
- TypeScript
- Anthropic
- Cost
I opened my Anthropic bill for April: 3× higher than March. Not a disaster — usage went up, that’s expected. But looking closely, 60% of the cost was avoidable. The cause: an 8,000-token system prompt replayed on every agent call, with no cache. Three changes to the payload and the next month’s bill dropped by 40%.
Anthropic prompt caching has been out for almost two years. Yet I keep seeing production integrations where it’s not enabled — either out of unfamiliarity or because the upside is misunderstood. Here’s what I’ve learned after wiring it into 6-7 client projects.
How it works, in 30 seconds
You mark segments of your prompt with cache_control: { type: "ephemeral" }. Anthropic stores those segments in an ephemeral server-side cache. On the next call, if every token up to and including the marked block matches exactly, that prefix is read from the cache instead of being reprocessed.
Cost: +25% over normal input on the first write. 10% of normal input on cache reads. Default TTL is 5 minutes (extensible to 1 hour as an opt-in). Minimum cacheable size: 1,024 tokens on Sonnet/Opus, 2,048 on Haiku.
The math is simple: replay a 5,000-token segment four times within 5 minutes and you pay 1.25 + 0.10 × 3 = 1.55× the normal cost instead of 4×. That’s -61%.
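To make that arithmetic reusable, here is a minimal sketch that uses only the multipliers quoted above (1.25× on write, 0.10× on read). The function names are mine, and it assumes every call after the first lands inside the TTL and hits the cache.

```typescript
// Relative pricing from the figures above (5-minute TTL).
const WRITE_MULTIPLIER = 1.25; // the first call writes the segment to the cache
const READ_MULTIPLIER = 0.1; // later calls read it back

/** Cost of `calls` calls sharing one cached segment, in units of one uncached pass. */
function cachedCost(calls: number): number {
  if (calls < 1) return 0;
  return WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1);
}

/** Fraction saved versus sending the segment uncached on every call. */
function savings(calls: number): number {
  if (calls < 1) return 0;
  return 1 - cachedCost(calls) / calls;
}

for (const calls of [1, 2, 4, 10]) {
  console.log(`${calls} calls: ${cachedCost(calls).toFixed(2)}x cached vs ${calls}x uncached`);
}
```

A single call loses 25%, but two calls already come out ahead (1.35× instead of 2×), which is why reuse within the TTL is the only thing that matters.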
When it’s worth it
- Long, stable system prompts (> 2,000 tokens): tone of voice, task instructions, few-shot examples. The most cost-effective case.
- Tool definitions in an agent: 10-20 tools = 3,000-8,000 tokens of JSON schema, identical at every loop turn. Massive savings.
- Fixed RAG context: legal docs, terms of service, company policy. If you pass the same 50 chunks on every user request, cache them.
- Iterative agent workflows: a loop of 10 tool calls reusing the same baggage = 9 cache reads at 10% each.
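For the tool-definition case, one way to do it is to put the breakpoint on the last tool, so that the whole tools array is cached as a single prefix. The sketch below uses local stand-in types instead of the SDK’s, and the tool names are hypothetical.

```typescript
// Local stand-in types so the sketch runs without the SDK installed.
type CacheControl = { type: "ephemeral" };

interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
}

/** Returns a copy of `tools` with a cache breakpoint on the last definition. */
function withCachedTools(tools: ToolDef[]): ToolDef[] {
  const lastIndex = tools.length - 1;
  return tools.map((tool, i): ToolDef =>
    i === lastIndex ? { ...tool, cache_control: { type: "ephemeral" } } : { ...tool },
  );
}

// Hypothetical tools, for illustration only.
const exampleTools: ToolDef[] = [
  { name: "get_weather", description: "Look up the weather", input_schema: { type: "object" } },
  { name: "search_docs", description: "Search internal docs", input_schema: { type: "object" } },
];

const cachedTools = withCachedTools(exampleTools);
console.log(cachedTools.map((t) => `${t.name}: ${t.cache_control ? "breakpoint" : "no breakpoint"}`));
```

Keep the array order stable between calls: reordering tools changes the prefix just as much as editing one.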
When it doesn’t help (or costs more)
- Short sessions: one call per user, no reuse. You pay the 25% write surcharge with no read to amortize it.
- Prompts < 1,024 tokens: below the threshold, cache_control is silently ignored. The API doesn’t warn you.
- Dynamic prefix: if any variable changes before the cached zone, the entire prefix is invalidated. Put dynamic segments after cached ones, never before.
- TTL exceeded: after 5 minutes without a read, the cache entry expires (each read resets the timer). Fine if the user comes back within that window, but it ruins the math on sporadic traffic.
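To put a number on “costs more”, here is a rough break-even sketch. It assumes every call either writes the segment (1.25×) or reads it (0.10×), with the same segment size each time; the names are mine.

```typescript
// Multipliers quoted earlier, relative to one uncached pass over the segment.
const WRITE_COST = 1.25;
const READ_COST = 0.1;

/** Average cost per call, relative to no caching, for a given cache hit rate in [0, 1]. */
function expectedRelativeCost(hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) throw new RangeError("hitRate must be in [0, 1]");
  return hitRate * READ_COST + (1 - hitRate) * WRITE_COST;
}

// Hit rate at which caching costs exactly as much as not caching:
// h * READ + (1 - h) * WRITE = 1  =>  h = (WRITE - 1) / (WRITE - READ)
const breakEvenHitRate = (WRITE_COST - 1) / (WRITE_COST - READ_COST);

console.log(`break-even hit rate: ${(breakEvenHitRate * 100).toFixed(1)}%`);
console.log(`relative cost at 50% hits: ${expectedRelativeCost(0.5).toFixed(3)}`);
```

Under these prices the break-even sits a little above one hit in five. Below that, the write surcharge outweighs the reads.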
The pattern that works
Four segments, in this order:
- System rule (rarely changes) → cache_control
- Tool definitions (change on each release) → cache_control
- Fixed context (RAG, user profile, session data) → cache_control if > 1,024 tokens
- User turn (varies on every call) → no cache_control
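import Anthropic from "@anthropic-ai/sdk";

// SYSTEM_RULE, TOOL_INSTRUCTIONS, USER_PROFILE, TOOLS and userInput are defined elsewhere.
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  // Stable segments first, each closed by a cache breakpoint.
  system: [
    { type: "text", text: SYSTEM_RULE, cache_control: { type: "ephemeral" } },
    { type: "text", text: TOOL_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
    { type: "text", text: USER_PROFILE, cache_control: { type: "ephemeral" } },
  ],
  // Tools sit before system in the cached prefix, so changing TOOLS invalidates the system caches too.
  tools: TOOLS,
  // The user turn varies on every call, so it carries no cache_control.
  messages: [{ role: "user", content: userInput }],
});

// Compare cache_creation_input_tokens (writes) with cache_read_input_tokens (hits).
console.log(response.usage);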
The usage.cache_read_input_tokens field is your friend. If you don’t monitor it, you won’t know whether the cache hits. I’ve seen three projects where caching was enabled but never actually hit: a single character changed in the system prompt on every call (a timestamp, a UUID), so the prefix was invalidated every time.
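One cheap way to catch that kind of drift is to fingerprint the supposedly stable prefix before each call and log when it changes. This is a sketch: PrefixWatcher is a hypothetical helper, and the hash is a simple 32-bit FNV-1a standing in for whatever hash you already use, not a cryptographic one.

```typescript
/** 32-bit FNV-1a over UTF-16 code units: a cheap fingerprint, not a cryptographic hash. */
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

/** Remembers the last prefix fingerprint per key and reports when it changes. */
class PrefixWatcher {
  private last = new Map<string, string>();

  /** Returns true if the prefix differs from the one seen on the previous call for this key. */
  changed(key: string, prefix: string): boolean {
    const current = fingerprint(prefix);
    const previous = this.last.get(key);
    this.last.set(key, current);
    return previous !== undefined && previous !== current;
  }
}

const watcher = new PrefixWatcher();
const stablePrompt = "You are a support agent for a hypothetical product.";

watcher.changed("system", stablePrompt); // first sighting, nothing to compare against
if (watcher.changed("system", `${stablePrompt} Current time: ${Date.now()}`)) {
  console.warn("system prefix changed between calls: the cache will miss");
}
```

If the warning fires on every call, something volatile has leaked into the cached zone and should move after the last breakpoint.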
Production gotchas
- Invisible debugging: a cache miss never raises an error. You just watch the bill grow. Log cache_read_input_tokens on every call and alert if the ratio drops.
- Measuring hit rate: before enabling caching, capture a day of production calls. Count how many prompts could have been reused. If < 30%, caching costs more than it saves.
- Releases invalidate everything: each deploy that modifies the system prompt forces a full cache write for every active user. On a service with 10k concurrent users, that hurts. Prefer atomic releases outside peak hours.
- Multi-tenancy: if each tenant has a slightly different system prompt, you end up with N separate caches. Factor the common trunk into cache_control and put tenant-specific parts after.
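The “log and alert” advice can be sketched as a small rolling monitor fed with response.usage after each call. HitRatioMonitor is a hypothetical helper, and the window size and 30% threshold are assumptions to tune. It treats total input as input_tokens plus the two cache counters.

```typescript
// Subset of the usage object returned by the Messages API.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

/** Tracks the share of input tokens served from cache over the last `size` calls. */
class HitRatioMonitor {
  private window: CacheUsage[] = [];
  private readonly size: number;
  private readonly minRatio: number;

  constructor(size = 100, minRatio = 0.3) {
    this.size = size;
    this.minRatio = minRatio;
  }

  record(usage: CacheUsage): void {
    this.window.push(usage);
    if (this.window.length > this.size) this.window.shift();
  }

  /** Cached-read tokens divided by all input tokens in the window (0 if empty). */
  ratio(): number {
    let read = 0;
    let total = 0;
    for (const u of this.window) {
      const r = u.cache_read_input_tokens ?? 0;
      const w = u.cache_creation_input_tokens ?? 0;
      read += r;
      total += r + w + u.input_tokens;
    }
    return total === 0 ? 0 : read / total;
  }

  /** Only alerts once the window is full, to avoid noise right after a deploy. */
  shouldAlert(): boolean {
    return this.window.length === this.size && this.ratio() < this.minRatio;
  }
}

const monitor = new HitRatioMonitor(2, 0.3);
monitor.record({ input_tokens: 100, cache_creation_input_tokens: 900, cache_read_input_tokens: 0 });
monitor.record({ input_tokens: 100, cache_creation_input_tokens: 0, cache_read_input_tokens: 900 });
console.log(`hit ratio: ${monitor.ratio()}, alert: ${monitor.shouldAlert()}`);
```

Keeping one monitor per tenant or per prompt version makes it easier to see which prefix stopped hitting.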
When NOT to use it
If you make fewer than 100 calls per day, don’t bother with caching. The complexity isn’t worth it, and the first-write surcharge never gets amortized. Turn it on when you see your agent looping and the same baggage replaying 5 times — not before.
Otherwise, it’s probably the highest-leverage optimization on a production LLM app.