Prompt Caching Didn’t “Save” Me — It Exposed How Dumb My Prompts Were

I was looking at my API usage, thinking:

“Why am I paying full price for the same system prompt again?”

Because that’s exactly what I was doing.

Every request included the same giant prefix:

  • system rules
  • formatting instructions
  • tool schemas
  • “do RAG like this”
  • “cite sources like that”
  • agent scaffolding

And I was re-sending it hundreds of times per day like a psychopath.

Then I learned the part nobody internalizes:

Prompt caching is automatic… but your prompt structure can still make cache hits effectively zero.

Let me save you from the same mistake.


The Misconception: “Caching Just Works”

Most people assume:

“If caching is automatic, I’ll automatically save money.”

Nope.

Caching is automatic, but cacheability is on you.

OpenAI caches the longest previously-computed prefix (starting at 1,024 tokens, growing in increments). If the prefix changes, you don't get the discount.

Claude also caches prompt prefixes, but you need to think in terms of stable prefix blocks and cache lifetime.


What Prompt Caching Actually Is (No Vibes, Just Mechanics)

OpenAI

  • Routes requests to servers that recently processed the same prompt prefix
  • Can reduce input token costs by up to 90% and cut latency significantly, depending on the model (per their docs)
  • Works automatically on supported requests (no code changes)
  • Caching starts when your prompt prefix reaches 1,024 tokens

Claude (Anthropic)

  • Caches prompt prefixes up to breakpoints you mark explicitly with cache_control (opt-in, not automatic)
  • Default cache lifetime is 5 minutes, refreshed when reused
  • Optional 1-hour cache exists at additional cost
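
A minimal sketch of what that looks like with the Anthropic Python SDK (the model string, STABLE_RULES_AND_SCHEMAS, and user_message are placeholders, not anything from this post):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumption: any model with prompt caching support
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_RULES_AND_SCHEMAS,        # the big, unchanging prefix
            "cache_control": {"type": "ephemeral"},  # marks the end of the cacheable block
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# usage reports cache activity: cache_creation_input_tokens on the first call,
# cache_read_input_tokens on later requests that hit the cache within its lifetime
print(response.usage)
```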

So yes: real feature, real savings, real production value.


The “Discount” Is Real: Cached Input Is Cheaper Than Input

OpenAI literally lists Cached Input pricing separately on their pricing tables (for many models).

Example from OpenAI’s pricing page (as of writing):

  • GPT-4.1 input: $3.00 / 1M tokens
  • GPT-4.1 cached input: $0.75 / 1M tokens
  • GPT-4.1 output: $12.00 / 1M tokens

That’s a 4× discount on the cached part.

If your product repeats big prompt prefixes, that’s not “nice to have.”
That’s rent money.


The One Metric That Matters: cached_tokens

If you don’t log cache hits, you’re just telling yourself bedtime stories.

OpenAI exposes cache hits as cached_tokens inside the usage object (under prompt_tokens_details). Also: if your prompt is under the caching threshold, the cached count will be zero.

Rule: instrument this in production and watch it daily:

  • cached_tokens per request
  • cache hit rate (cached_tokens / prompt_tokens)
  • cost per request (input vs cached input vs output)
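
Here’s a minimal sketch of that instrumentation with the OpenAI Python SDK (STABLE_PREFIX, user_message, and log_metric are placeholders for whatever prompt and metrics pipeline you already have):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # assumption: any model that supports prompt caching
    messages=[
        {"role": "system", "content": STABLE_PREFIX},  # stable prefix, byte-identical every request
        {"role": "user", "content": user_message},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # stays 0 until the prefix crosses the 1,024-token threshold
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

# ship all three to your metrics pipeline and alert when the hit rate tanks
log_metric("cached_tokens", cached)
log_metric("cache_hit_rate", hit_rate)
log_metric("prompt_tokens", usage.prompt_tokens)
```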

Real Cost Breakdown (Why This Adds Up Fast)

Let's do brutally simple math using OpenAI's published pricing example (GPT-4.1).

Assume you send:

  • 10,000 tokens stable prefix (policies + schemas + instructions)
  • 500 tokens user-specific content
  • 800 tokens output
  • 1,000 requests/day

Without caching

Input/day = 10,500,000 tokens = 10.5M
Cost(input) = 10.5 × $3.00 = $31.50/day

Output/day = 800,000 tokens = 0.8M
Cost(output) = 0.8 × $12.00 = $9.60/day

Total ≈ $41.10/day

With caching (prefix hits)

Cached input/day = 10M → 10 × $0.75 = $7.50/day
Non-cached input/day = 0.5M → 0.5 × $3.00 = $1.50/day
Output stays = $9.60/day

Total ≈ $18.60/day

That’s a ~55% reduction… without changing models… without “optimizing” anything…
just by stopping prompt self-sabotage.
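
If you’d rather script the sanity check, the same arithmetic takes a few lines (prices are the per-1M figures from the example above):

```python
# Per-1M-token prices from the GPT-4.1 example above
PRICE_INPUT, PRICE_CACHED, PRICE_OUTPUT = 3.00, 0.75, 12.00

REQUESTS_PER_DAY = 1_000
PREFIX_TOKENS, USER_TOKENS, OUTPUT_TOKENS = 10_000, 500, 800

def daily_cost(prefix_cached: bool) -> float:
    """Daily spend in dollars, assuming the stable prefix either always hits or always misses the cache."""
    prefix_rate = PRICE_CACHED if prefix_cached else PRICE_INPUT
    prefix = PREFIX_TOKENS * REQUESTS_PER_DAY / 1e6 * prefix_rate
    user = USER_TOKENS * REQUESTS_PER_DAY / 1e6 * PRICE_INPUT
    output = OUTPUT_TOKENS * REQUESTS_PER_DAY / 1e6 * PRICE_OUTPUT
    return prefix + user + output

print(f"No cache hits: ${daily_cost(prefix_cached=False):.2f}/day")  # ≈ $41.10
print(f"Prefix cached: ${daily_cost(prefix_cached=True):.2f}/day")   # ≈ $18.60
```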


The Hidden Money Pits (How You Accidentally Get 0 Cache Hits)

I’ve seen these kill caching in real apps:

#1 “Helpful” dynamic junk at the top

  • timestamps (“Today is…”)
  • random request IDs
  • rotating “examples”
  • debug metadata

If it’s near the top, it poisons the prefix.
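
Concretely (STABLE_RULES and user_message are stand-ins for your own prompt pieces), the difference looks like this:

```python
from datetime import datetime, date

# Bad: a timestamp at the very top changes the prefix every request,
# so nothing after it can ever match a previously cached prefix.
bad_system = f"Today is {datetime.now():%Y-%m-%d %H:%M}.\n\n" + STABLE_RULES

# Better: keep the system prompt byte-identical and push anything
# time-sensitive into the volatile user turn at the end.
good_system = STABLE_RULES
good_user = f"{user_message}\n\n(Current date: {date.today():%Y-%m-%d})"
```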

#2 Non-deterministic formatting

  • JSON keys in different order
  • inconsistent whitespace
  • tool schema changes per request

Caching doesn’t care that it’s “equivalent.” It cares that it’s the same prefix.
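
A cheap fix here is to serialize anything that lands in the prefix deterministically (tool_schema below is a made-up example):

```python
import json

tool_schema = {"parameters": {"top_k": "integer", "query": "string"}, "name": "search_docs"}

# Same dict, same bytes, every request: sorted keys, fixed separators, no stray whitespace.
canonical = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))

# Build the prompt from this canonical string only; never from str(some_dict) or ad-hoc formatting.
```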

#3 Shoving RAG output into the prefix

If your “stable” portion includes retrieved docs, it’s not stable.

Put RAG results at the end.

#4 Forgetting the threshold

OpenAI's caching behavior starts at 1,024 tokens for the cached prefix. If your stable prefix is 600–900 tokens, you'll see zero cached tokens and wonder why.
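
A quick way to check whether your stable prefix even clears that threshold (a sketch using tiktoken; the encoding name is an assumption about which models you’re calling):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")       # assumption: the encoding used by recent GPT models
prefix_tokens = len(enc.encode(STABLE_PREFIX))  # STABLE_PREFIX: your rules + schemas + policies

if prefix_tokens < 1024:
    print(f"Prefix is only {prefix_tokens} tokens: below the threshold, expect cached_tokens == 0")
```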


My Survival Strategy (How to Actually Get Cache Hits)

Here’s the prompt structure that stops bleeding money:

  1. Stable Prefix Block

    • system rules
    • formatting requirements
    • tool schemas
    • policies

    Keep this as identical as possible.

  2. Semi-stable Block

    • product instructions that change occasionally

    Version it (v1, v2) instead of mutating per request.

  3. Volatile Suffix

    • user message
    • retrieved context (RAG)
    • current state

    This can change freely.

OpenAI caches the longest previously-computed prefix (starting at the threshold), so the more of your prompt that is truly stable, the bigger your savings.

Claude: treat caching as "prefix blocks with a lifetime," default 5 minutes unless you explicitly extend it.
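
Putting that structure into code, a minimal sketch (the uppercase constants are hypothetical stand-ins for your own blocks):

```python
def build_messages(user_message: str, rag_chunks: list[str]) -> list[dict]:
    """Assemble the prompt so everything volatile sits after the stable prefix."""
    # 1. Stable prefix: byte-identical on every request
    # 2. Semi-stable block: versioned, changes rarely
    stable = STABLE_RULES + "\n\n" + TOOL_SCHEMAS + "\n\n" + POLICIES
    semi_stable = PRODUCT_INSTRUCTIONS_V2

    # 3. Volatile suffix: user input and retrieved context go last, where churn is harmless
    context = "\n\n".join(rag_chunks)

    return [
        {"role": "system", "content": stable + "\n\n" + semi_stable},
        {"role": "user", "content": f"{user_message}\n\nContext:\n{context}"},
    ]
```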


Don’t Guess. Calculate It on TokenBurner

If you want to do this like an adult:

  1. Paste your full “stable prefix” into the estimator
  2. Add your typical RAG context size
  3. Multiply by your daily request volume
  4. Compare input vs cached input economics

Paste your prompt prefix into the calculator and see what cached input would save, in dollars, not vibes.

And if your workflow is RAG-heavy, don’t forget the other half of the bill:

  • vector storage & reads/writes (Vector DB)
  • chunking strategy (chunks × embedding cost × retrieval behavior)

You can sanity-check those too:

  • /playground/vector-db-cost
  • /playground/rag-chunk-visualizer

Conclusion: Prompt Caching Is the Only Discount You Can Actually Earn

You don’t “enable” caching.

You earn caching by writing prompts that stop changing at the top.

If your cached_tokens is near zero, it’s not because caching is broken.

It’s because your prompt is.

(And yes, that’s fixable.)


Sources (Official)

  • OpenAI Prompt Caching guide (automatic, savings, 1024+ threshold)
  • OpenAI announcement (prefix caching starting at 1,024 tokens, increments)
  • OpenAI pricing tables (cached input pricing listed)
  • Claude prompt caching docs (5m default, optional 1h)