Prompt Caching Didn’t “Save” Me — It Exposed How Dumb My Prompts Were
I’m looking at my API usage thinking:
“Why am I paying full price for the same system prompt again?”
Because that’s exactly what I was doing.
Every request included the same giant prefix:
- system rules
- formatting instructions
- tool schemas
- “do RAG like this”
- “cite sources like that”
- agent scaffolding
And I was re-sending it hundreds of times per day like a psychopath.
Then I learned the part nobody internalizes:
Prompt caching is automatic… but your prompt structure can still make cache hits effectively zero.
Let me save you from the same mistake.
The Misconception: “Caching Just Works”
Most people assume:
“If caching is automatic, I’ll automatically save money.”
Nope.
Caching is automatic, but cacheability is on you.
OpenAI caches the longest previously-computed prefix (starting at 1,024 tokens and growing in 128-token increments). If the prefix changes, you don't get the discount.
Claude also caches prompt prefixes, but you need to think in terms of stable prefix blocks and cache lifetime.
What Prompt Caching Actually Is (No Vibes, Just Mechanics)
OpenAI
- Routes requests to servers that recently processed the same prompt prefix
- Can reduce input token costs up to 90% and latency significantly (per their docs)
- Works automatically on supported requests (no code changes)
- Caching starts when your prompt prefix reaches 1,024 tokens
Claude (Anthropic)
- Is opt-in: you mark cache breakpoints with cache_control, and everything up to a breakpoint becomes a cacheable prefix block
- Default cache lifetime is 5 minutes, refreshed when reused
- Optional 1-hour cache exists at additional cost
So yes: real feature, real savings, real production value.
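For Claude, the opt-in looks roughly like this. A minimal sketch with the Anthropic Python SDK; the model name, prompt text, and token budget are placeholders, not a prescription:

```python
# Sketch: opting into Claude prompt caching by marking a cache breakpoint.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "System rules, formatting instructions, tool schemas..."  # your big stable prefix

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes a cacheable prefix block.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User-specific question goes here"}],
)

# Usage shows whether you wrote to the cache or read from it on this request.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```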
The “Discount” Is Real: Cached Input Is Cheaper Than Input
OpenAI literally lists Cached Input pricing separately on their pricing tables (for many models).
Example from OpenAI’s pricing page:
- GPT-4.1 input: $2.00 / 1M tokens
- GPT-4.1 cached input: $0.50 / 1M tokens
- GPT-4.1 output: $8.00 / 1M tokens
That’s a 4× discount on the cached part.
If your product repeats big prompt prefixes, that’s not “nice to have.”
That’s rent money.
The One Metric That Matters: cached_tokens
If you don’t log cache hits, you’re just telling yourself bedtime stories.
OpenAI exposes cache hits via a cached token count in usage (prompt_tokens_details.cached_tokens on Chat Completions). Also: if your prompt is under the 1,024-token threshold, the cached count will be zero.
Rule: instrument this in production and watch it daily:
- cached_tokens per request
- cache hit rate (cached_tokens / prompt_tokens)
- cost per request (input vs cached input vs output)
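Here's a minimal sketch of that instrumentation with the OpenAI Python SDK (Chat Completions; the model name and prompt contents are placeholders):

```python
# Sketch: log cached_tokens and the cache hit rate for every request.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "...your long, stable prefix..."},
        {"role": "user", "content": "user-specific question"},
    ],
)

usage = response.usage
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # stays 0 below the 1,024-token threshold
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

print(f"prompt={usage.prompt_tokens} cached={cached} hit_rate={hit_rate:.1%}")
```

Ship those three numbers to whatever metrics system you already have; the point is that someone actually looks at them daily.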
Real Cost Breakdown (Why This Adds Up Fast)
Let's do brutally simple math using OpenAI's published pricing example (GPT-4.1).
Assume you send:
- 10,000 tokens stable prefix (policies + schemas + instructions)
- 500 tokens user-specific content
- 800 tokens output
- 1,000 requests/day
Without caching
Input/day = 10,500,000 tokens = 10.5M
Cost(input) = 10.5 × $2.00 = $21.00/day
Output/day = 800,000 tokens = 0.8M
Cost(output) = 0.8 × $8.00 = $6.40/day
Total ≈ $27.40/day
With caching (prefix hits)
Cached input/day = 10M → 10 × $0.50 = $5.00/day
Non-cached input/day = 0.5M → 0.5 × $2.00 = $1.00/day
Output stays = $6.40/day
Total ≈ $12.40/day
That’s a ~55% reduction… without changing models… without “optimizing” anything…
just by stopping prompt self-sabotage.
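If you want to replay the arithmetic with your own volumes, here's a tiny sketch. The prices are hard-coded to the GPT-4.1 numbers above; swap in whatever your pricing page says today:

```python
# Back-of-the-envelope daily cost, with and without prefix cache hits.
INPUT, CACHED_INPUT, OUTPUT = 2.00, 0.50, 8.00  # $ per 1M tokens (GPT-4.1-style pricing)

REQUESTS_PER_DAY = 1_000
STABLE_PREFIX, USER_PART, OUTPUT_PART = 10_000, 500, 800  # tokens per request

def daily_cost(cache_hits: bool) -> float:
    output_m = OUTPUT_PART * REQUESTS_PER_DAY / 1e6        # output tokens, in millions
    if cache_hits:
        cached_m = STABLE_PREFIX * REQUESTS_PER_DAY / 1e6  # prefix billed at the cached rate
        fresh_m = USER_PART * REQUESTS_PER_DAY / 1e6       # suffix billed at the full rate
        return cached_m * CACHED_INPUT + fresh_m * INPUT + output_m * OUTPUT
    input_m = (STABLE_PREFIX + USER_PART) * REQUESTS_PER_DAY / 1e6
    return input_m * INPUT + output_m * OUTPUT

print(round(daily_cost(False), 2), round(daily_cost(True), 2))  # 27.4 vs 12.4
```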
The Hidden Money Pits (How You Accidentally Get 0 Cache Hits)
I’ve seen these kill caching in real apps:
#1 “Helpful” dynamic junk at the top
- timestamps (“Today is…”)
- random request IDs
- rotating “examples”
- debug metadata
If it’s near the top, it poisons the prefix.
#2 Non-deterministic formatting
- JSON keys in different order
- inconsistent whitespace
- tool schema changes per request
Caching doesn’t care that it’s “equivalent.” It cares that it’s the same prefix.
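If you build parts of that prefix from dicts or schema objects, one cheap defense is deterministic serialization. A sketch using Python's standard json module:

```python
# Serialize schemas/config with sorted keys and fixed separators so the
# prompt prefix is byte-identical across requests and processes.
import json

def canonical(obj) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

a = {"name": "search_docs", "parameters": {"type": "object"}}
b = {"parameters": {"type": "object"}, "name": "search_docs"}

assert canonical(a) == canonical(b)  # different insertion order, identical bytes
```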
#3 Shoving RAG output into the prefix
If your “stable” portion includes retrieved docs, it’s not stable.
Put RAG results at the end.
#4 Forgetting the threshold
OpenAI's caching behavior starts at 1,024 tokens for the cached prefix. If your stable prefix is 600–900 tokens, you'll see zero cached tokens and wonder why.
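A quick sanity check is to count the stable prefix before you ship it. A sketch using tiktoken's o200k_base encoding as an approximation (exact counts vary slightly by model; stable_prefix.txt is a hypothetical file holding your prefix):

```python
# Does the stable prefix clear OpenAI's 1,024-token caching floor?
import tiktoken

STABLE_PREFIX = open("stable_prefix.txt", encoding="utf-8").read()

enc = tiktoken.get_encoding("o200k_base")
n = len(enc.encode(STABLE_PREFIX))

status = "cacheable" if n >= 1024 else "below the 1,024-token threshold"
print(f"stable prefix = {n} tokens -> {status}")
```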
My Survival Strategy (How to Actually Get Cache Hits)
Here’s the prompt structure that stops bleeding money:
1. Stable Prefix Block
- system rules
- formatting requirements
- tool schemas
- policies
Keep this as identical as possible.
2. Semi-stable Block
- product instructions that change occasionally
Version it (v1, v2) instead of mutating per request.
3. Volatile Suffix
- user message
- retrieved context (RAG)
- current state
This can change freely.
OpenAI caches the longest previously-computed prefix (starting at the threshold), so the more of your prompt that is truly stable, the bigger your savings.
Claude: treat caching as "prefix blocks with a lifetime," default 5 minutes unless you explicitly extend it.
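Put together, the assembly can look something like this (names and contents are illustrative, not a prescribed API; the point is the ordering):

```python
# Stable prefix first, versioned block second, volatile stuff (user input,
# RAG context, timestamps) last -- after the cacheable prefix, not inside it.
STABLE_PREFIX = "System rules, formatting requirements, tool schemas, policies..."
PRODUCT_INSTRUCTIONS_V2 = "Semi-stable product instructions (bump the version, don't mutate)."

def build_messages(user_message: str, rag_chunks: list[str], today: str) -> list[dict]:
    return [
        # 1. Stable prefix: identical bytes on every request -> cache hits.
        {"role": "system", "content": STABLE_PREFIX},
        # 2. Semi-stable block: changes rarely, and only via explicit versions.
        {"role": "system", "content": PRODUCT_INSTRUCTIONS_V2},
        # 3. Volatile suffix: free to change on every request.
        {"role": "user", "content": (
            f"Today is {today}\n\n"
            "Retrieved context:\n" + "\n---\n".join(rag_chunks) +
            f"\n\nQuestion: {user_message}"
        )},
    ]
```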
Don’t Guess. Calculate It on TokenBurner
If you want to do this like an adult:
- Paste your full “stable prefix” into the estimator
- Add your typical RAG context size
- Multiply by your daily request volume
- Compare input vs cached input economics
And if your workflow is RAG-heavy, don’t forget the other half of the bill:
- vector storage & reads/writes (vector DB)
- chunking strategy (chunks × embedding cost × retrieval behavior)
You can sanity-check those too:
- /playground/vector-db-cost
- /playground/rag-chunk-visualizer
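For a rough feel of the embedding side, here's a toy calculation; the corpus size, chunking numbers, and the $0.02 / 1M embedding price are placeholder assumptions, not recommendations:

```python
# Toy estimate: how many chunks a corpus produces and what embedding it once costs.
import math

corpus_tokens = 5_000_000        # total tokens across your documents
chunk_size, overlap = 800, 100   # tokens per chunk, overlap between neighbors
embed_price_per_1m = 0.02        # $ per 1M tokens for the embedding model

step = chunk_size - overlap
num_chunks = math.ceil(corpus_tokens / step)
embedded_tokens = num_chunks * chunk_size  # overlap means some tokens get embedded twice

print(f"{num_chunks:,} chunks, ~{embedded_tokens / 1e6:.1f}M embedded tokens, "
      f"~${embedded_tokens / 1e6 * embed_price_per_1m:.2f} to embed once")
```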
Conclusion: Prompt Caching Is the Only Discount You Can Actually Earn
You don’t “enable” caching.
You deserve caching by writing prompts that stop changing at the top.
If your cached_tokens is near zero, it’s not because caching is broken.
It’s because your prompt is.
(And yes, that’s fixable.)
Sources (Official)
- OpenAI Prompt Caching guide (automatic, savings, 1024+ threshold)
- OpenAI announcement (prefix caching starting at 1,024 tokens, increments)
- OpenAI pricing tables (cached input pricing listed)
- Claude prompt caching docs (5m default, optional 1h)