Prompt Caching Didn’t “Save” Me — It Exposed How Dumb My Prompts Were
I’m looking at my API usage thinking:
“Why am I paying full price for the same system prompt again?”
Because that’s exactly what I was doing.
Every request included the same giant prefix:
- system rules
- formatting instructions
- tool schemas
- “do RAG like this”
- “cite sources like that”
- agent scaffolding
And I was re-sending it hundreds of times per day like a psychopath.
Then I learned the part nobody internalizes:
Prompt caching is automatic… but your prompt structure can still make cache hits effectively zero.
Let me save you from the same mistake.
The Misconception: “Caching Just Works”
Most people assume:
“If caching is automatic, I’ll automatically save money.”
Nope.
Caching is automatic, but cacheability is on you.
OpenAI caches the longest previously-computed prefix (starting at 1,024 tokens and growing in 128-token increments). If the prefix changes, you don't get the discount.
Claude also caches prompt prefixes, but you need to think in terms of stable prefix blocks and cache lifetime.
What Prompt Caching Actually Is (No Vibes, Just Mechanics)
OpenAI
- Routes requests to servers that recently processed the same prompt prefix
- Can reduce input token costs up to 90% and latency significantly (per their docs)
- Works automatically on supported requests (no code changes)
- Caching starts when your prompt prefix reaches 1,024 tokens
Claude (Anthropic)
- Is opt-in: you mark cache breakpoints with cache_control, and everything up to a breakpoint becomes a cacheable prefix block
- Default cache lifetime is 5 minutes, refreshed when reused
- Optional 1-hour cache exists at additional cost
So yes: real feature, real savings, real production value.
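For Claude, the opt-in looks roughly like this. A minimal sketch with the Anthropic Python SDK; the model name, prompt text, and token budget are placeholders, not a prescription:

```python
# Sketch: opting into Claude prompt caching by marking a cache breakpoint.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "System rules, formatting instructions, tool schemas..."  # your big stable prefix

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes a cacheable prefix block.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User-specific question goes here"}],
)

# Usage shows whether you wrote to the cache or read from it on this request.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```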
The “Discount” Is Real: Cached Input Is Cheaper Than Input
OpenAI literally lists Cached Input pricing separately on their pricing tables (for many models).
Example from OpenAI’s pricing page:
- GPT-4.1 input: $2.00 / 1M tokens
- GPT-4.1 cached input: $0.50 / 1M tokens
- GPT-4.1 output: $8.00 / 1M tokens
That’s a 4× discount on the cached part.
If your product repeats big prompt prefixes, that’s not “nice to have.”
That’s rent money.
The One Metric That Matters: cached_tokens
If you don’t log cache hits, you’re just telling yourself bedtime stories.
OpenAI exposes cache hits via a cached token count in usage (prompt_tokens_details.cached_tokens on Chat Completions). Also: if your prompt is under the 1,024-token threshold, the cached count will be zero.
Rule: instrument this in production and watch it daily:
- cached_tokens per request
- cache hit rate (cached_tokens / prompt_tokens)
- cost per request (input vs cached input vs output)
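Here's a minimal sketch of that instrumentation with the OpenAI Python SDK (Chat Completions; the model name and prompt contents are placeholders):

```python
# Sketch: log cached_tokens and the cache hit rate for every request.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "...your long, stable prefix..."},
        {"role": "user", "content": "user-specific question"},
    ],
)

usage = response.usage
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # stays 0 below the 1,024-token threshold
hit_rate = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

print(f"prompt={usage.prompt_tokens} cached={cached} hit_rate={hit_rate:.1%}")
```

Ship those three numbers to whatever metrics system you already have; the point is that someone actually looks at them daily.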
Real Cost Breakdown (Why This Adds Up Fast)
Let's do brutally simple math using OpenAI's published pricing example (GPT-4.1).
Assume you send:
- 10,000 tokens stable prefix (policies + schemas + instructions)
- 500 tokens user-specific content
- 800 tokens output
- 1,000 requests/day
Without caching
Input/day = 10,500,000 tokens = 10.5M
Cost(input) = 10.5 × $2.00 = $21.00/day
Output/day = 800,000 tokens = 0.8M
Cost(output) = 0.8 × $8.00 = $6.40/day
Total ≈ $27.40/day
With caching (prefix hits)
Cached input/day = 10M → 10 × $0.50 = $5.00/day
Non-cached input/day = 0.5M → 0.5 × $2.00 = $1.00/day
Output stays = $6.40/day
Total ≈ $12.40/day
That’s a ~55% reduction… without changing models… without “optimizing” anything…
just by stopping prompt self-sabotage.
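If you want to replay the arithmetic with your own volumes, here's a tiny sketch. The prices are hard-coded to the GPT-4.1 numbers above; swap in whatever your pricing page says today:

```python
# Back-of-the-envelope daily cost, with and without prefix cache hits.
INPUT, CACHED_INPUT, OUTPUT = 2.00, 0.50, 8.00  # $ per 1M tokens (GPT-4.1-style pricing)

REQUESTS_PER_DAY = 1_000
STABLE_PREFIX, USER_PART, OUTPUT_PART = 10_000, 500, 800  # tokens per request

def daily_cost(cache_hits: bool) -> float:
    output_m = OUTPUT_PART * REQUESTS_PER_DAY / 1e6        # output tokens, in millions
    if cache_hits:
        cached_m = STABLE_PREFIX * REQUESTS_PER_DAY / 1e6  # prefix billed at the cached rate
        fresh_m = USER_PART * REQUESTS_PER_DAY / 1e6       # suffix billed at the full rate
        return cached_m * CACHED_INPUT + fresh_m * INPUT + output_m * OUTPUT
    input_m = (STABLE_PREFIX + USER_PART) * REQUESTS_PER_DAY / 1e6
    return input_m * INPUT + output_m * OUTPUT

print(round(daily_cost(False), 2), round(daily_cost(True), 2))  # 27.4 vs 12.4
```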
The Hidden Money Pits (How You Accidentally Get 0 Cache Hits)
I’ve seen these kill caching in real apps:
#1 “Helpful” dynamic junk at the top
- timestamps (“Today is…”)
- random request IDs
- rotating “examples”
- debug metadata
If it’s near the top, it poisons the prefix.
#2 Non-deterministic formatting
- JSON keys in different order
- inconsistent whitespace
- tool schema changes per request
Caching doesn’t care that it’s “equivalent.” It cares that it’s the same prefix.
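If you build parts of that prefix from dicts or schema objects, one cheap defense is deterministic serialization. A sketch using Python's standard json module:

```python
# Serialize schemas/config with sorted keys and fixed separators so the
# prompt prefix is byte-identical across requests and processes.
import json

def canonical(obj) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

a = {"name": "search_docs", "parameters": {"type": "object"}}
b = {"parameters": {"type": "object"}, "name": "search_docs"}

assert canonical(a) == canonical(b)  # different insertion order, identical bytes
```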
#3 Shoving RAG output into the prefix
If your “stable” portion includes retrieved docs, it’s not stable.
Put RAG results at the end.
#4 Forgetting the threshold
OpenAI's caching behavior starts at 1,024 tokens for the cached prefix. If your stable prefix is 600–900 tokens, you'll see zero cached tokens and wonder why.
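A quick sanity check is to count the stable prefix before you ship it. A sketch using tiktoken's o200k_base encoding as an approximation (exact counts vary slightly by model; stable_prefix.txt is a hypothetical file holding your prefix):

```python
# Does the stable prefix clear OpenAI's 1,024-token caching floor?
import tiktoken

STABLE_PREFIX = open("stable_prefix.txt", encoding="utf-8").read()

enc = tiktoken.get_encoding("o200k_base")
n = len(enc.encode(STABLE_PREFIX))

status = "cacheable" if n >= 1024 else "below the 1,024-token threshold"
print(f"stable prefix = {n} tokens -> {status}")
```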
My Survival Strategy (How to Actually Get Cache Hits)
Here’s the prompt structure that stops bleeding money:
1. Stable Prefix Block
- system rules
- formatting requirements
- tool schemas
- policies
Keep this as identical as possible.
2. Semi-stable Block
- product instructions that change occasionally
Version it (v1, v2) instead of mutating per request.
3. Volatile Suffix
- user message
- retrieved context (RAG)
- current state
This can change freely.
OpenAI caches the longest previously-computed prefix (starting at the threshold), so the more of your prompt that is truly stable, the bigger your savings.
Claude: treat caching as "prefix blocks with a lifetime," default 5 minutes unless you explicitly extend it.
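Put together, the assembly can look something like this (names and contents are illustrative, not a prescribed API; the point is the ordering):

```python
# Stable prefix first, versioned block second, volatile stuff (user input,
# RAG context, timestamps) last -- after the cacheable prefix, not inside it.
STABLE_PREFIX = "System rules, formatting requirements, tool schemas, policies..."
PRODUCT_INSTRUCTIONS_V2 = "Semi-stable product instructions (bump the version, don't mutate)."

def build_messages(user_message: str, rag_chunks: list[str], today: str) -> list[dict]:
    return [
        # 1. Stable prefix: identical bytes on every request -> cache hits.
        {"role": "system", "content": STABLE_PREFIX},
        # 2. Semi-stable block: changes rarely, and only via explicit versions.
        {"role": "system", "content": PRODUCT_INSTRUCTIONS_V2},
        # 3. Volatile suffix: free to change on every request.
        {"role": "user", "content": (
            f"Today is {today}\n\n"
            "Retrieved context:\n" + "\n---\n".join(rag_chunks) +
            f"\n\nQuestion: {user_message}"
        )},
    ]
```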
Don’t Guess. Calculate It on TokenBurner
If you want to do this like an adult:
- Paste your full “stable prefix” into the estimator
- Add your typical RAG context size
- Multiply by your daily request volume
- Compare input vs cached input economics
And if your workflow is RAG-heavy, don’t forget the other half of the bill:
- vector storage & reads/writes (vector DB)
- chunking strategy (chunks × embedding cost × retrieval behavior)
You can sanity-check those too:
- /playground/vector-db-cost
- /playground/rag-chunk-visualizer
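For a rough feel of the embedding side, here's a toy calculation; the corpus size, chunking numbers, and the $0.02 / 1M embedding price are placeholder assumptions, not recommendations:

```python
# Toy estimate: how many chunks a corpus produces and what embedding it once costs.
import math

corpus_tokens = 5_000_000        # total tokens across your documents
chunk_size, overlap = 800, 100   # tokens per chunk, overlap between neighbors
embed_price_per_1m = 0.02        # $ per 1M tokens for the embedding model

step = chunk_size - overlap
num_chunks = math.ceil(corpus_tokens / step)
embedded_tokens = num_chunks * chunk_size  # overlap means some tokens get embedded twice

print(f"{num_chunks:,} chunks, ~{embedded_tokens / 1e6:.1f}M embedded tokens, "
      f"~${embedded_tokens / 1e6 * embed_price_per_1m:.2f} to embed once")
```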
Conclusion: Prompt Caching Is the Only Discount You Can Actually Earn
You don’t “enable” caching.
You deserve caching by writing prompts that stop changing at the top.
If your cached_tokens is near zero, it’s not because caching is broken.
It’s because your prompt is.
(And yes, that’s fixable.)
Sources (Official)
- OpenAI Prompt Caching guide (automatic, savings, 1024+ threshold)
- OpenAI announcement (prefix caching starting at 1,024 tokens, increments)
- OpenAI pricing tables (cached input pricing listed)
- Claude prompt caching docs (5m default, optional 1h)