What is prompt caching in LLM APIs?

Prompt caching is a feature where you mark part of your input as 'cacheable' on the API call. The provider stores it server-side for ~5 minutes. Subsequent requests with the same cached prefix pay roughly 10% of the standard input rate for those cached tokens. It's the right tool when the same long context (a document, a system prompt, a conversation history) gets reused across multiple turns.

Which models support prompt caching in 2026?

Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5) via the Anthropic API and OpenAI's GPT-5.5/5.5 Mini both support prompt caching. Gemini 2 Pro supports context caching (slightly different API). DeepSeek R2 and Kimi K2 do not have official caching yet. Cursor and Claude Code use Anthropic caching transparently.

How much does prompt caching save in real workflows?

Take a 200k-token research corpus and 8 questions on Claude Opus. Without caching, the corpus alone costs about $24 in input (8 × ~$3). With caching it costs about $6: the first turn pays the cache-write price (about $3.75, the $3 base plus a 25% write premium), and each of the 7 later turns pays the cache-hit price (about $0.30). That is roughly 75% off the input bill for the corpus. Break-even comes on the second turn; past that, caching dominates.

What's the cache TTL and how do I extend it?

Anthropic's ephemeral cache is 5 minutes by default. Each new request with the same cache key resets the TTL. So as long as you keep asking questions, the cache stays warm. Anthropic also offers a longer cache tier (1-hour, higher rate) on some plans. OpenAI's caching is similar with 5-minute auto-refresh.

Does caching work for streaming responses?

Yes. Cache hits work identically for streaming and non-streaming responses. The cache key is based on the input prefix; the response generation is independent.

What's the right thing to cache vs not cache?

Cache the long, stable parts of your prompt: a research corpus, a system prompt with detailed instructions, conversation history older than a few turns. Don't cache: the user's current question (that's the variable part), small system prompts (overhead isn't worth it), one-off calls (no second turn to amortize against).

Prompt Caching Cost Optimization: The 80% Savings Most LLM Workflows Miss (2026)

Prompt caching is the single largest cost-reduction technique available in LLM APIs in 2026. Used correctly, it converts a research session that costs roughly $26 into one that costs roughly $8. Most developers I talk to either don't use it or use it in ways that miss most of the savings.

This piece is the full breakdown — when caching wins, how to implement it correctly, and the common mistakes.

How caching actually works

When you call an LLM API, your input is tokenized and run through the model's attention layers. For a long prompt, a large part of the cost is this first pass over the input (often called prefill). Once the model has "read" your context, each generated token is much cheaper by comparison.

Prompt caching exposes this internal optimization. You tell the API "this prefix is stable; remember the work you did processing it." On the next call with the same prefix:

The server checks: have I processed this prefix recently?
If yes (cache hit): skip re-processing, use the stored intermediate state. Charge you ~10% of the standard input rate.
If no (cache miss): full processing, full price, store result for next time.
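The hit/miss flow above can be sketched as a toy model. This is not how any provider implements its cache; the class name `ToyPrefixCache`, the hash-based key, and the 10% read multiplier are assumptions made for illustration, and the 5-minute TTL is the figure quoted in this article.

```python
import hashlib
import time


class ToyPrefixCache:
    """Toy model of a provider-side prompt cache (illustrative only)."""

    def __init__(self, ttl_seconds=300, read_multiplier=0.10):
        self.ttl_seconds = ttl_seconds
        self.read_multiplier = read_multiplier
        self._expires_at = {}  # prefix hash -> expiry timestamp

    def charge(self, prefix, base_cost, now=None):
        """Return (status, cost) for processing `prefix` at time `now`."""
        now = time.time() if now is None else now
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = self._expires_at.get(key, float("-inf")) > now
        # Hit or miss, the entry is (re)stored and its TTL restarts.
        self._expires_at[key] = now + self.ttl_seconds
        if hit:
            return "hit", base_cost * self.read_multiplier
        return "miss", base_cost


cache = ToyPrefixCache()
print(cache.charge("LONG DOCUMENT", base_cost=3.00, now=0))    # miss, full price
print(cache.charge("LONG DOCUMENT", base_cost=3.00, now=60))   # hit, about 10%
print(cache.charge("LONG DOCUMENT", base_cost=3.00, now=400))  # TTL lapsed, miss
```

The third call misses because more than 300 seconds passed since the second call refreshed the entry, which is the "walked away for lunch" case in miniature.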

For Claude (Anthropic API):

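import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_DOCUMENT and current_question are defined elsewhere
client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=[
        {"type": "text", "text": "You are a research assistant."},
        {
            "type": "text",
            "text": LONG_DOCUMENT,  # ← cache this
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": current_question}]
)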

For OpenAI (GPT-5.5):

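from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# LONG_DOCUMENT and current_question are defined elsewhere
client.chat.completions.create(
    model="gpt-5.5",
    messages=[
        {"role": "system", "content": LONG_DOCUMENT},  # auto-cached if long
        {"role": "user", "content": current_question}
    ],
    # OpenAI caches automatically once the prompt reaches 1,024 tokens
)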

OpenAI caches automatically for sufficiently long messages. Anthropic requires explicit cache_control markers but gives you finer control.

Real-world cost impact

Consider a research workflow: load a 200k-token corpus into context, ask 8 questions across follow-up turns.

Without caching (Claude Opus 4.7):

Turn 1: 200k input + 2k output = $3.00 + $0.15 = $3.15
Turn 2: 202k input + 2k output = $3.03 + $0.15 = $3.18  ← re-processes the 200k corpus
Turn 3: 204k input + 2k output = $3.06 + $0.15 = $3.21
... (each turn re-reads the whole conversation)
Turn 8: 214k input + 2k output = $3.21 + $0.15 = $3.36

Total: ~$26.00

With caching:

Turn 1: 200k input (cache write) + 2k output = $3.75 ($3.00 base plus the 25% write premium) + $0.15 = $3.90
Turn 2: 200k cached + 2k new input + 2k output = $0.30 + $0.03 + $0.15 = $0.48
Turn 3: 200k cached + 4k new + 2k output = $0.30 + $0.06 + $0.15 = $0.51
...
Turn 8: 200k cached + 14k new + 2k output = $0.30 + $0.21 + $0.15 = $0.66

Total: ~$7.90

About 70% savings in this example. For sessions with more turns, the percentage savings grow because the one-time cache write is amortized across more queries.
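To check the arithmetic or plug in your own numbers, here is a small calculator. It is a sketch under the assumptions used in this example: $15 per million input tokens, $75 per million output tokens, 2k tokens of conversation history added per turn, a cache read at 10% of the input rate, and a 25% premium on the first cache write. The function name and parameter names are mine, not part of any SDK.

```python
def session_cost(turns=8, corpus_tokens=200_000, growth_per_turn=2_000,
                 output_tokens=2_000, input_per_mtok=15.00, output_per_mtok=75.00,
                 cached=False, write_multiplier=1.25, read_multiplier=0.10):
    """Total dollar cost of a multi-turn session over one fixed corpus."""
    total = 0.0
    for turn in range(turns):
        corpus_rate = input_per_mtok
        if cached:
            # The first turn writes the cache; later turns read it.
            corpus_rate *= write_multiplier if turn == 0 else read_multiplier
        history_tokens = growth_per_turn * turn  # uncached conversation so far
        total += corpus_tokens * corpus_rate / 1e6
        total += history_tokens * input_per_mtok / 1e6
        total += output_tokens * output_per_mtok / 1e6
    return total


uncached = session_cost(cached=False)
cached = session_cost(cached=True)
print(f"uncached: ${uncached:.2f}")
print(f"cached:   ${cached:.2f}")
print(f"savings:  {1 - cached / uncached:.0%}")
```

With the defaults this comes out to $26.04 uncached and $7.89 cached. Setting `write_multiplier=1.0` shows what the session would cost if the first write carried no premium.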

When caching is the wrong choice

Three cases where caching adds overhead without enough savings:

One-shot calls. If you make a single API call and never follow up, caching costs you the "cache write" premium (Claude charges 25% more on the initial write) with no chance to recoup. Don't bother.
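Very short prompts. OpenAI only auto-caches prompts of 1,024 tokens or more. Anthropic also enforces a minimum cacheable length (1,024 tokens on most models, higher on some), and even above that threshold the absolute savings on a small prompt rarely justify the cache management overhead.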
Highly variable prefixes. If your prompt changes substantially between calls (different user, different document, different system message), caching has nothing stable to hold onto. Cache hits won't materialize.
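The one-shot case comes down to a break-even count. A minimal sketch, assuming a cache write costs 1.25x the normal input rate and a cache read costs 0.10x (the multipliers used in this article); both function names are hypothetical helpers.

```python
def caching_pays_off(expected_calls, write_multiplier=1.25, read_multiplier=0.10):
    """True if caching a prefix is cheaper than re-sending it uncached.

    Costs are in units of "one uncached pass over the prefix".
    """
    if expected_calls < 1:
        return False
    uncached = float(expected_calls)
    cached = write_multiplier + (expected_calls - 1) * read_multiplier
    return cached < uncached


def break_even_calls(write_multiplier=1.25, read_multiplier=0.10, limit=1000):
    """Smallest number of calls at which caching wins, or None within `limit`."""
    for calls in range(1, limit + 1):
        if caching_pays_off(calls, write_multiplier, read_multiplier):
            return calls
    return None


print(break_even_calls())                      # 5-minute-style pricing
print(break_even_calls(write_multiplier=2.0))  # a pricier write, for comparison
```

The second call uses a write multiplier of 2.0 purely to show how a more expensive write pushes the break-even point out by one call.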

Cache TTL management

Anthropic ephemeral cache: 5 minutes from last hit. Each subsequent call with the same prefix resets the TTL. Practical implication: if you keep asking questions within 5 minutes, the cache stays warm indefinitely. If you walk away for lunch, you pay the cache-write premium again on return.

Anthropic also offers a 1-hour cache option (requested by adding "ttl": "1h" to cache_control, at a higher cache-write rate), useful for workflows that pause and resume.

OpenAI: similar 5-minute auto-refresh, no explicit TTL configuration.

If you're building a chat application where users come back hours later, consider:

Storing the corpus yourself and re-sending it when the user returns, accepting one fresh cache write (more reliable than expecting the cache to survive)
Using Anthropic's 1-hour cache if available
Pre-warming the cache before high-traffic times if you control session boundaries
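If you control session boundaries, it can help to track on the client side whether the cache is likely still warm. The sketch below only estimates this from the time of your last call; the provider gives no such guarantee, and the class name `CacheWarmth` and its methods are hypothetical.

```python
import time


class CacheWarmth:
    """Client-side guess at whether a cached prefix is still alive."""

    def __init__(self, ttl_seconds=300):
        self.ttl_seconds = ttl_seconds
        self._last_call = None

    def record_call(self, now=None):
        """Call this right after each request that used the cached prefix."""
        self._last_call = time.monotonic() if now is None else now

    def seconds_remaining(self, now=None):
        """Estimated seconds until expiry; 0.0 if cold or never used."""
        if self._last_call is None:
            return 0.0
        now = time.monotonic() if now is None else now
        return max(0.0, self.ttl_seconds - (now - self._last_call))

    def is_probably_warm(self, now=None):
        return self.seconds_remaining(now) > 0.0


warmth = CacheWarmth()
warmth.record_call(now=0)
print(warmth.is_probably_warm(now=120))   # within the TTL
print(warmth.is_probably_warm(now=3600))  # an hour later
```

A chat backend could use `is_probably_warm` to decide whether a returning user's first message will pay the cache-write price again, and budget accordingly.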

Mistakes that destroy cache savings

Five things I see developers do that void caching benefits:

1. Putting the user query in the cached portion

# WRONG — invalidates cache every call
system=[{"type": "text", "text": f"{DOCUMENT}\n\nQuestion: {question}", "cache_control": ...}]

The cache key is the cached content. Mixing the variable question in means the cache key changes every call → 100% cache miss rate.

2. Inserting timestamps or request IDs

# WRONG — every call has a new timestamp
system=[{"type": "text", "text": f"Today is {datetime.now()}\n{DOCUMENT}", "cache_control": ...}]

Same problem. Anything dynamic in the cached portion breaks the cache key.
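Mistakes 1 and 2 share a fix: keep the cached block purely static and move anything dynamic into the user message, which sits after the cache breakpoint. A minimal sketch of that request shape follows; `build_request` is a hypothetical helper, and its return value is meant to be unpacked into `client.messages.create(...)` alongside `model` and `max_tokens`.

```python
from datetime import date


def build_request(document, question, today=None):
    """Build system/messages kwargs with all dynamic text outside the cached block."""
    today = today or date.today()
    return {
        "system": [
            # Static only: identical on every call, so the prefix can hit.
            {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic parts (date, question) go after the cached prefix.
            {
                "role": "user",
                "content": f"Today is {today.isoformat()}.\n\nQuestion: {question}",
            },
        ],
    }


first = build_request("DOCUMENT TEXT", "What changed in Q2?", today=date(2026, 6, 1))
second = build_request("DOCUMENT TEXT", "Who is cited most?", today=date(2026, 6, 2))
print(first["system"] == second["system"])  # the cached portion is unchanged
```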

3. Reordering messages

# WRONG — different message order = different cache key
messages = [msg2, msg1, msg3, ...]  # changed order

The cache hits on the exact sequence of cached prefix tokens. Reordering creates a miss.
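One way to see why order matters is to measure how much of the previous request is still a prefix of the new one. This is a sketch at the message level (real matching happens on tokens), and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(previous, current):
    """Number of leading items that are identical in both sequences."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


msg1 = {"role": "user", "content": "Summarize the corpus."}
msg2 = {"role": "assistant", "content": "Three themes stand out."}
msg3 = {"role": "user", "content": "Expand on theme 2."}

appended = [msg1, msg2, msg3]
reordered = [msg2, msg1, msg3]

print(shared_prefix_len([msg1, msg2], appended))   # history only grew at the end
print(shared_prefix_len([msg1, msg2], reordered))  # order changed at position 0
```

Appending keeps the whole earlier conversation as a reusable prefix; reordering leaves nothing in common from the first message onward.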

4. Adding small variations to the system prompt

# WRONG — string differences invalidate cache
system_prompts = ["You are a researcher.", "You're a researcher."]  # different strings

Cache matching is exact. "You are" and "You're" are different strings, so they produce different cache keys.
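Small drifts like this are easy to miss in review, so I find it useful to log a fingerprint of the cached prefix with every request and alert when it changes. A minimal sketch, assuming the prefix is the list of system blocks; `prefix_fingerprint` is a hypothetical helper and the 12-character truncation is arbitrary.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks):
    """Short, stable hash of the blocks that are supposed to be cached."""
    canonical = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


a = [{"type": "text", "text": "You are a researcher."}]
b = [{"type": "text", "text": "You're a researcher."}]

print(prefix_fingerprint(a) == prefix_fingerprint(a))  # same text, same fingerprint
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # one contraction, new fingerprint
```

If the fingerprint changes between two calls that were meant to share a cache, something dynamic has leaked into the prefix.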

5. Using ephemeral cache for one-off calls

# WASTEFUL — pay cache write premium with no second turn
client.messages.create(... cache_control: {"type": "ephemeral"} ...)
# never followed up by another call within 5 minutes

The cache write costs more than a no-cache call would. Only enable caching when you expect repeated turns.

Optimal caching pattern

For Claude (Anthropic API):

def make_query(corpus, question, conversation_history):
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a research assistant. Cite sources when possible.",
            },
            {
                "type": "text",
                "text": corpus,  # ← long stable corpus, cached
                "cache_control": {"type": "ephemeral"}
            },
        ],
        messages=[
            *conversation_history,  # accumulating history
            {"role": "user", "content": question},
        ],
    )
    return response

The corpus is the only thing marked cache_control. Conversation history grows but stays prefix-stable. User questions are at the end (variable, not cached).
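It is worth confirming that the cache is actually being hit. To my understanding, the Anthropic response's `usage` object reports `cache_creation_input_tokens`, `cache_read_input_tokens`, and `input_tokens` (the uncached remainder). The helper below is a sketch that reads those fields defensively; `cache_report` is a hypothetical name, and the `SimpleNamespace` stands in for a real `response.usage` so the example runs without an API key.

```python
from types import SimpleNamespace


def cache_report(usage):
    """Summarize how many input tokens were written to or read from the cache."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached
    return {
        "written": written,
        "read": read,
        "uncached": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


# Stand-in for response.usage on a follow-up turn (values are made up).
fake_usage = SimpleNamespace(
    cache_creation_input_tokens=0,
    cache_read_input_tokens=200_000,
    input_tokens=2_000,
)
print(cache_report(fake_usage))
```

In real use you would call `cache_report(response.usage)` after `make_query`. A follow-up turn that shows `read` at zero means the prefix changed or the TTL lapsed.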

For OpenAI (GPT-5.5):

def make_query(corpus, question, conversation_history):
    response = client.chat.completions.create(
        model="gpt-5.5",
        messages=[
            {"role": "system", "content": corpus},  # auto-cached if long
            *conversation_history,
            {"role": "user", "content": question},
        ],
    )
    return response

OpenAI handles caching automatically. Just keep your prefix stable.
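Because OpenAI's caching is implicit, the only way to know it worked is to read the usage data. As far as I know, Chat Completions responses expose `usage.prompt_tokens` and `usage.prompt_tokens_details.cached_tokens`. The helper below is a sketch around those two fields; `cached_fraction` is a hypothetical name, and the `SimpleNamespace` objects stand in for a real response.

```python
from types import SimpleNamespace


def cached_fraction(usage):
    """Fraction of prompt tokens served from the cache (0.0 if unknown)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    prompt = getattr(usage, "prompt_tokens", 0) or 0
    return cached / prompt if prompt else 0.0


# Stand-in for response.usage on a follow-up turn (values are made up).
fake_usage = SimpleNamespace(
    prompt_tokens=202_000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=199_936),
)
print(f"{cached_fraction(fake_usage):.1%} of the prompt came from cache")
```

If this stays at zero across follow-up turns, check for the prefix mistakes listed earlier before assuming the feature is unavailable.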

Caching with Web2MD pipelines

A common pattern: use Web2MD to build a research corpus, paste into Claude, ask multiple questions. The corpus is the natural cache candidate.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 1. Build corpus once with Web2MD bulk export
with open("research-corpus-2026-06.md", encoding="utf-8") as f:
    corpus = f.read()  # 200k tokens

# The cached prefix must be identical on every call, so build it once.
system = [{"type": "text", "text": corpus, "cache_control": {"type": "ephemeral"}}]

# 2. First question: cache write
r1 = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=system,
    messages=[{"role": "user", "content": "Summarize the 3 main themes."}],
)

# 3. Follow-up questions within 5 minutes: cache hits
r2 = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=system,
    messages=[
        {"role": "user", "content": "Summarize the 3 main themes."},
        {"role": "assistant", "content": r1.content[0].text},
        {"role": "user", "content": "What's the strongest evidence for theme 2?"},
    ],
)

Cost: r1 pays the cache-write price on the 200k corpus. r2 and onward pay 10% on the cached 200k. Eight questions on a 200k corpus drop from about $26 to about $8. That's the math.
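For more than two questions, the r1/r2 pattern generalizes to a loop that keeps the system block fixed and appends to the history. The sketch below takes the API call as a `send` function so it can run offline; `ask_all` and `fake_send` are hypothetical names. A real `send` might wrap `client.messages.create(...)` and return `response.content[0].text`.

```python
def ask_all(send, corpus, questions):
    """Ask each question in turn, reusing one cached system block."""
    # Built once so the cached prefix is identical on every call.
    system = [{"type": "text", "text": corpus, "cache_control": {"type": "ephemeral"}}]
    history = []
    answers = []
    for question in questions:
        history.append({"role": "user", "content": question})
        answer = send(system=system, messages=list(history))
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers


def fake_send(system, messages):
    """Offline stand-in for an API call: echoes the last question."""
    return f"(answer to: {messages[-1]['content']})"


for reply in ask_all(fake_send, "CORPUS TEXT", ["Main themes?", "Evidence for theme 2?"]):
    print(reply)
```

History only ever grows at the end, so every call shares the corpus prefix with the one before it.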

Subscription vs API + caching

If you use Claude Pro or ChatGPT Plus subscriptions, caching happens transparently inside the chat UI. You don't manage it. Pricing is bundled — your $20/mo covers the cached usage.

If you build via API, caching is your job. Done right, you can do work that would cost $300/mo on API for under $50/mo, comparable to subscription value at much higher volumes.

The takeaway

Prompt caching is not optional for serious LLM workflows in 2026. It's the largest single cost optimization available — bigger than picking a cheaper model, bigger than HTML-to-Markdown conversion, bigger than prompt compression.

The setup is one line of code. The savings are roughly 70-85% on every workflow with repeated context. If you're not using it, you're paying about 3-7x what you need to.

Install

Web2MD on the Chrome Web Store →

Free tier: 3 conversions/day. Pro at $9/mo unlocks unlimited + bulk export (build cacheable research corpora) + token estimates.
