Prompt Caching Cost Optimization: The 80% Savings Most LLM Workflows Miss (2026)
Prompt Caching Cost Optimization: The 80% Savings Most LLM Workflows Miss (2026)
Prompt caching is the single largest cost-reduction technique available in LLM APIs in 2026. Used correctly, it converts a $20 research session into a $4 one. Most developers I talk to either don't use it or use it in ways that miss most of the savings.
This piece is the full breakdown — when caching wins, how to implement it correctly, and the common mistakes.
How caching actually works
When you call an LLM API, your input gets tokenized and fed through the model's attention mechanism. The most expensive part of inference is the first pass through this attention computation. Once the model has "read" your context, subsequent generation is much cheaper.
Prompt caching exposes this internal optimization. You tell the API "this prefix is stable; remember the work you did processing it." On the next call with the same prefix:
- The server checks: have I processed this prefix recently?
- If yes (cache hit): skip re-processing, use the stored intermediate state. Charge you ~10% of the standard input rate.
- If no (cache miss): full processing, full price, store result for next time.
For Claude (Anthropic API):
client.messages.create(
model="claude-opus-4-7",
system=[
{"type": "text", "text": "You are a research assistant."},
{
"type": "text",
"text": LONG_DOCUMENT, # ← cache this
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": current_question}]
)
For OpenAI (GPT-5.5):
client.chat.completions.create(
model="gpt-5.5",
messages=[
{"role": "system", "content": LONG_DOCUMENT}, # auto-cached if long
{"role": "user", "content": current_question}
],
# OpenAI caches automatically for messages > 1024 tokens
)
OpenAI caches automatically for sufficiently long messages. Anthropic requires explicit cache_control markers but gives you finer control.
Real-world cost impact
Consider a research workflow: load a 200k-token corpus into context, ask 8 questions across follow-up turns.
Without caching (Claude Opus 4.7):
Turn 1: 200k input + 2k output = $3.00 + $0.15 = $3.15
Turn 2: 202k input + 2k output = $3.03 + $0.15 = $3.18 ← re-processes the 200k corpus
Turn 3: 204k input + 2k output = $3.06 + $0.15 = $3.21
... (each turn re-reads the whole conversation)
Turn 8: 218k input + 2k output = $3.27 + $0.15 = $3.42
Total: ~$26.00
With caching:
Turn 1: 200k input (cached) + 2k output = $3.00 (cache write) + $0.15 = $3.15
Turn 2: 200k cached + 4k new input + 2k output = $0.30 + $0.06 + $0.15 = $0.51
Turn 3: 200k cached + 6k new + 2k output = $0.30 + $0.09 + $0.15 = $0.54
...
Turn 8: 200k cached + 16k new + 2k output = $0.30 + $0.24 + $0.15 = $0.69
Total: ~$7.50
71% savings in this example. For sessions with more turns, the percentage savings grow because the cache amortizes across more queries.
When caching is the wrong choice
Three cases where caching adds overhead without enough savings:
- One-shot calls. If you make a single API call and never follow up, caching costs you the "cache write" premium (Claude charges 25% more on the initial write) with no chance to recoup. Don't bother.
- Very short prompts. OpenAI requires 1024+ tokens to auto-cache. Anthropic has no minimum but the percentage savings on a small prompt aren't worth the cache management overhead.
- Highly variable prefixes. If your prompt changes substantially between calls (different user, different document, different system message), caching has nothing stable to hold onto. Cache hits won't materialize.
Cache TTL management
Anthropic ephemeral cache: 5 minutes from last hit. Each subsequent call with the same prefix resets the TTL. Practical implication: if you keep asking questions within 5 minutes, the cache stays warm indefinitely. If you walk away for lunch, you pay the cache-write premium again on return.
Anthropic also offers a 1-hour cache tier on certain plans, useful for workflows that pause and resume.
OpenAI: similar 5-minute auto-refresh, no explicit TTL configuration.
If you're building a chat application where users come back hours later, consider:
- Storing the corpus somewhere outside the cache and re-sending it (cheaper than expecting cache to survive)
- Using Anthropic's 1-hour cache if available
- Pre-warming the cache before high-traffic times if you control session boundaries
Mistakes that destroy cache savings
Five things I see developers do that void caching benefits:
1. Putting the user query in the cached portion
# WRONG — invalidates cache every call
system=[{"type": "text", "text": f"{DOCUMENT}\n\nQuestion: {question}", "cache_control": ...}]
The cache key is the cached content. Mixing the variable question in means the cache key changes every call → 100% cache miss rate.
2. Inserting timestamps or request IDs
# WRONG — every call has a new timestamp
system=[{"type": "text", "text": f"Today is {datetime.now()}\n{DOCUMENT}", "cache_control": ...}]
Same problem. Anything dynamic in the cached portion breaks the cache key.
3. Reordering messages
# WRONG — different message order = different cache key
messages = [msg2, msg1, msg3, ...] # changed order
The cache hits on the exact sequence of cached prefix tokens. Reordering creates a miss.
4. Adding small variations to the system prompt
# WRONG — string differences invalidate cache
system_prompts = ["You are a researcher.", "You're a researcher."] # different strings
Cache is byte-exact match. "You are" and "You're" are different.
5. Using ephemeral cache for one-off calls
# WASTEFUL — pay cache write premium with no second turn
client.messages.create(... cache_control: {"type": "ephemeral"} ...)
# never followed up by another call within 5 minutes
The cache write costs more than a no-cache call would. Only enable caching when you expect repeated turns.
Optimal caching pattern
For Claude (Anthropic API):
def make_query(corpus, question, conversation_history):
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a research assistant. Cite sources when possible.",
},
{
"type": "text",
"text": corpus, # ← long stable corpus, cached
"cache_control": {"type": "ephemeral"}
},
],
messages=[
*conversation_history, # accumulating history
{"role": "user", "content": question},
],
)
return response
The corpus is the only thing marked cache_control. Conversation history grows but stays prefix-stable. User questions are at the end (variable, not cached).
For OpenAI (GPT-5.5):
def make_query(corpus, question, conversation_history):
response = client.chat.completions.create(
model="gpt-5.5",
messages=[
{"role": "system", "content": corpus}, # auto-cached if long
*conversation_history,
{"role": "user", "content": question},
],
)
return response
OpenAI handles caching automatically. Just keep your prefix stable.
Caching with Web2MD pipelines
A common pattern: use Web2MD to build a research corpus, paste into Claude, ask multiple questions. The corpus is the natural cache candidate.
# 1. Build corpus once with Web2MD bulk export
corpus = open("research-corpus-2026-06.md").read() # 200k tokens
# 2. First question — cache write
r1 = client.messages.create(
model="claude-opus-4-7",
system=[{"type": "text", "text": corpus, "cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": "Summarize the 3 main themes."}],
)
# 3. Follow-up questions within 5 minutes — cache hits
r2 = client.messages.create(
model="claude-opus-4-7",
system=[{"type": "text", "text": corpus, "cache_control": {"type": "ephemeral"}}],
messages=[
{"role": "user", "content": "Summarize the 3 main themes."},
{"role": "assistant", "content": r1.content[0].text},
{"role": "user", "content": "What's the strongest evidence for theme 2?"},
],
)
Cost: r1 pays full 200k input. r2 and onward pay 10% on the cached 200k. Eight follow-ups on a 200k corpus drops from $24 to $7.50. That's the math.
Subscription vs API + caching
If you use Claude Pro or ChatGPT Plus subscriptions, caching happens transparently inside the chat UI. You don't manage it. Pricing is bundled — your $20/mo covers the cached usage.
If you build via API, caching is your job. Done right, you can do work that would cost $300/mo on API for under $50/mo, comparable to subscription value at much higher volumes.
The takeaway
Prompt caching is not optional for serious LLM workflows in 2026. It's the largest single cost optimization available — bigger than picking a cheaper model, bigger than HTML-to-Markdown conversion, bigger than prompt compression.
The setup is one line of code. The savings are 70-85% on every workflow with repeated context. If you're not using it, you're paying 4-7x what you need to.
Related
- Claude vs GPT-5.5 vs DeepSeek R2 token costs
- How to reduce LLM token costs (practical)
- How to fill Claude's 1M context window
- Markdown tokenization deep dive
Install
Web2MD on the Chrome Web Store →
Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + bulk export (build cacheable research corpora) + token estimates.