Prompt caching allows providers to reuse previously processed prompt content. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens — reducing both cost and latency significantly.

Inspecting Cache Usage

Cache usage is reflected in the usage object of every response:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
| Field | Description |
| --- | --- |
| prompt_tokens_details.cached_tokens | Tokens read from cache (cache hit, billed at a reduced rate) |
| prompt_tokens_details.cache_write_tokens | Tokens written to cache on this request (one-time write cost) |
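
These two fields combine with your provider's pricing to give the effective prompt cost. As a minimal sketch (the per-token rates below are hypothetical placeholders, not real prices; check your provider's pricing page), you can estimate the prompt cost of a response:

```python
# Hypothetical per-million-token rates -- real prices vary by model and provider.
INPUT_RATE = 3.00        # $/M tokens, uncached input
CACHE_READ_RATE = 0.30   # $/M tokens, read from cache (the cheaper path)
CACHE_WRITE_RATE = 3.75  # $/M tokens, one-time premium for writing to cache

def prompt_cost(usage: dict) -> float:
    """Estimate prompt cost in dollars from a usage object like the one above."""
    details = usage.get("prompt_tokens_details", {})
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    uncached = usage["prompt_tokens"] - cached - written
    return (uncached * INPUT_RATE
            + cached * CACHE_READ_RATE
            + written * CACHE_WRITE_RATE) / 1_000_000

# The usage object from the example response above
usage = {
    "prompt_tokens": 1500,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}
print(f"${prompt_cost(usage):.6f}")  # → $0.002092
```

In this example all 1,500 prompt tokens were either read from or written to cache, so the uncached term is zero and the write premium dominates; on subsequent hits the same prefix is billed at the read rate only.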

OpenAI Automatic Caching

OpenAI caches prompt prefixes automatically. No special request configuration is needed. How it works:
  • Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
  • Minimum prompt length: 1,024 tokens
  • Cache entries expire after ~1 hour of inactivity
  • Cached tokens are charged at a reduced rate (typically 50% discount)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Long system prompt gets cached automatically on repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)

Anthropic Claude Prompt Caching

Anthropic supports two caching modes:
  • Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
  • Explicit caching (cache_control): You mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what gets cached.

Cache TTL

| Cache Type | TTL |
| --- | --- |
| Automatic | 5 minutes |
| Explicit (ephemeral) | 1 hour (Claude 3.5+) or 5 minutes (Claude 3) |

Supported Models

| Model | Min Tokens (text) | Min Tokens (images) |
| --- | --- | --- |
| anthropic/claude-sonnet-4.6 | 1,024 | 1,024 |
| anthropic/claude-opus-4.5 | 1,024 | 1,024 |
| anthropic/claude-haiku-3.5 | 2,048 | 2,048 |
| anthropic/claude-3-5-sonnet | 1,024 | 1,024 |

Explicit Caching Example

Mark content with cache_control to control caching at the content-block level:
{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}
For the OpenAI-compatible endpoint, include cache_control directly in the structured content blocks of a message:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # Ensure > 1024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))

DeepSeek Automatic Caching

DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)

xAI (Grok) Automatic Caching

Grok models cache prompt prefixes automatically when the same prefix is reused across requests. No special configuration is required.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Grok caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "system", "content": "<long system prompt>" * 100},
        {"role": "user", "content": "Answer the question."},
    ],
)

# Cache hit reflected in usage
print(response.usage.prompt_tokens_details)

Groq Automatic Caching

Groq’s inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reflected in the response usage object.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Groq caches automatically on repeated calls
response = client.chat.completions.create(
    model="groq/meta-llama/llama-4-maverick",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

print(response.usage.prompt_tokens_details.cached_tokens)

Google Gemini Prompt Caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit Caching

Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.
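
Because implicit caching requires no configuration, verifying it is just a matter of reading the usage object. As a minimal sketch (the payload below is illustrative, not from a live response), you can compute what fraction of a prompt was served from cache:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 means a full miss)."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return cached / prompt if prompt else 0.0

# Illustrative usage payload in the OpenAI-compatible shape shown earlier
sample = {
    "prompt_tokens": 2000,
    "prompt_tokens_details": {"cached_tokens": 1500},
}
print(cache_hit_ratio(sample))  # → 0.75
```

A ratio near 1.0 across repeated calls indicates the shared prefix is being cached as expected; a persistent 0.0 usually means the prefix changed between requests or is below the provider's minimum length.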

Explicit Caching via Native Gemini API

For fine-grained control, use the native Gemini cachedContents API. You create a cache object and reference it in subsequent requests:
{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}
Use the native Gemini endpoint via ARouter’s provider proxy to work with cached content:
# Create cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'
The response includes a name field (e.g., cachedContents/abc123) you reference in subsequent requests:
curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
Cache usage appears in the response:
{
  "usageMetadata": {
    "promptTokenCount": 200,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 250
  }
}

Provider Sticky Routing

To maximize cache hits, your repeated requests should reach the same provider instance. ARouter supports sticky routing to ensure this for providers that require it. When you include Anthropic cache_control blocks in your request, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint, preserving cache validity.

How Sticky Routing Works

  1. Your first request with a cache_control block is processed and cached at the provider
  2. ARouter records which provider instance handled the request
  3. Subsequent requests with the same cache prefix are routed to the same instance
  4. Cache hits lower your cost (reads are cheaper than writes) and reduce latency

Verifying Cache Hits

Check the usage object to confirm cache hits across requests:
# First request — cache miss, content is written
response1 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 1"},
    ],
)
# prompt_tokens_details.cache_write_tokens > 0
print(response1.usage.prompt_tokens_details)

# Second request — cache hit
response2 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 2"},
    ],
)
# prompt_tokens_details.cached_tokens > 0 (cache hit!)
print(response2.usage.prompt_tokens_details)

Provider Cache Support Summary

| Provider | Cache Type | Min Tokens | TTL | Configuration |
| --- | --- | --- | --- | --- |
| OpenAI | Automatic | 1,024 | ~1 hour | None required |
| Anthropic | Automatic + Explicit | 1,024 | 5 min (auto), 1 hour (explicit) | cache_control blocks |
| DeepSeek | Automatic | 1,024 | Provider-defined | None required |
| Google Gemini | Automatic + Explicit | 32,768 | 1 hour default | cachedContents API |
| xAI (Grok) | Automatic | Provider-defined | Provider-defined | None required |
| Groq | Automatic | Provider-defined | Provider-defined | None required |