Prompt caching allows providers to reuse previously processed prompt content. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens — reducing both cost and latency significantly.

Inspecting Cache Usage

Cache usage is reflected in the usage object of every response:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
Field                                        Description
prompt_tokens_details.cached_tokens          Tokens read from cache (cache hit; billed at a reduced rate)
prompt_tokens_details.cache_write_tokens     Tokens written to cache on this request (one-time write cost)
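These two fields make it easy to measure how well caching is working for a given workload. The helper below is a hypothetical sketch (not part of any SDK) that computes the fraction of prompt tokens served from cache, assuming the usage shape shown above:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens read from cache (0.0 when nothing was cached)."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return cached / prompt if prompt else 0.0

usage = {
    "prompt_tokens": 1500,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}
print(f"{cache_hit_rate(usage):.0%}")  # → 68%
```

A consistently low hit rate usually means the prompt prefix is changing between requests (e.g. a timestamp near the top of the system prompt), which prevents prefix matching.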

OpenAI Automatic Caching

OpenAI caches prompt prefixes automatically. No special request configuration is needed. How it works:
  • Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
  • Minimum prompt length: 1,024 tokens
  • Cache entries expire after ~1 hour of inactivity
  • Cached tokens are charged at a reduced rate (typically 50% discount)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Long system prompt gets cached automatically on repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)
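Because caching only triggers above the 1,024-token minimum, it can be useful to sanity-check prompt length before relying on cache hits. The helper below is a rough sketch using the common ~4-characters-per-token heuristic, not a real tokenizer; for exact counts, use a tokenizer such as tiktoken.

```python
def likely_cacheable(text: str, min_tokens: int = 1024) -> bool:
    """Rough pre-check: estimates tokens at ~4 characters each (heuristic only)."""
    return len(text) / 4 >= min_tokens

doc = "reference text " * 400  # ~6,000 characters, roughly 1,500 tokens
print(likely_cacheable(doc))  # → True
```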

Anthropic Claude Prompt Caching

Anthropic supports two caching modes:
  • Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
  • Explicit caching (cache_control): You mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what gets cached.

Cache TTL

Cache Type              TTL
Automatic               5 minutes
Explicit (ephemeral)    1 hour (Claude 3.5+) or 5 minutes (Claude 3)

Supported Models

Model                          Min Tokens (text)    Min Tokens (images)
anthropic/claude-sonnet-4.6    1,024                1,024
anthropic/claude-opus-4.5      1,024                1,024
anthropic/claude-haiku-3.5     2,048                2,048
anthropic/claude-3-5-sonnet    1,024                1,024
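Content below the per-model minimum is silently not cached, so it can help to check against the table before adding cache_control markers. The lookup below is a hypothetical helper that just encodes the text-token minimums from the table above:

```python
# Minimum cacheable prompt sizes (text), taken from the table above.
MIN_CACHE_TOKENS = {
    "anthropic/claude-sonnet-4.6": 1024,
    "anthropic/claude-opus-4.5": 1024,
    "anthropic/claude-haiku-3.5": 2048,
    "anthropic/claude-3-5-sonnet": 1024,
}

def meets_cache_minimum(model: str, prompt_tokens: int) -> bool:
    """True if the prompt is long enough for Anthropic to cache it."""
    return prompt_tokens >= MIN_CACHE_TOKENS.get(model, 1024)

print(meets_cache_minimum("anthropic/claude-haiku-3.5", 1500))  # → False
```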

Explicit Caching Example

Mark content with cache_control to control caching at the content-block level:
{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}
For the OpenAI-compatible endpoint, include cache_control directly in the content blocks:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # Ensure > 1024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))
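Cache writes cost slightly more than uncached tokens, while cache reads are heavily discounted, so the write cost pays off only when the prefix is reused. The sketch below estimates prompt cost from the usage fields; the 1.25x write and 0.10x read multipliers and the $3/MTok base rate are illustrative assumptions, not official pricing, and it assumes this API's convention that prompt_tokens = uncached + cached + written.

```python
def prompt_cost(usage: dict, base_rate_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Estimated prompt cost in dollars. Multipliers are illustrative, not official pricing."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    uncached = usage["prompt_tokens"] - cached - written
    billable = uncached + cached * read_mult + written * write_mult
    return billable * base_rate_per_mtok / 1_000_000

usage = {"prompt_tokens": 1500,
         "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476}}
with_cache = prompt_cost(usage, base_rate_per_mtok=3.00)
no_cache = 1500 * 3.00 / 1_000_000
print(f"${with_cache:.6f} with caching vs ${no_cache:.6f} without")
```

In this example the cached request costs roughly half the uncached one; subsequent requests that hit the cache (no write cost) save even more.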

DeepSeek Automatic Caching

DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)

Google Gemini Prompt Caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit Caching

Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.

Explicit Caching via Native Gemini API

For fine-grained control, use the native Gemini cachedContents API. You create a cache object and reference it in subsequent requests:
{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}
Use the native Gemini endpoint via ARouter’s provider proxy to work with cached content:
# Create cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'
The response includes a name field (e.g., cachedContents/abc123) you reference in subsequent requests:
curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
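The same two calls can be scripted in Python. The sketch below mirrors the curl examples above using only the standard library; the endpoint paths and request bodies follow those examples, and the network wrapper is an assumption-laden convenience, not an official client:

```python
import json
import urllib.request

BASE = "https://api.arouter.ai/google/v1beta"
HEADERS = {"Authorization": "Bearer lr_live_xxxx", "Content-Type": "application/json"}

def cache_body(document: str, ttl_seconds: int = 3600) -> dict:
    """Request body for POST /cachedContents (mirrors the first curl call)."""
    return {
        "model": "models/gemini-2.5-flash",
        "contents": [{"role": "user", "parts": [{"text": document}]}],
        "ttl": f"{ttl_seconds}s",
    }

def generate_body(cache_name: str, question: str) -> dict:
    """Request body for generateContent referencing an existing cache."""
    return {
        "contents": [{"role": "user", "parts": [{"text": question}]}],
        "cachedContent": cache_name,
    }

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE}/{path}", data=json.dumps(body).encode(),
        headers=HEADERS, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# cache = post("cachedContents", cache_body("<large document content>"))
# answer = post("models/gemini-2.5-flash:generateContent",
#               generate_body(cache["name"], "Summarize"))
```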
Cache usage appears in the response:
{
  "usageMetadata": {
    "promptTokenCount": 1700,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 1750
  }
}
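To track how much of each request is served from the cache, the usageMetadata fields can be reduced to a single ratio. The helper below is a hypothetical sketch; it assumes cachedContentTokenCount is included within promptTokenCount, as in the Gemini API's usage reporting:

```python
def cached_fraction(usage_metadata: dict) -> float:
    """Fraction of prompt tokens served from the explicit cache.

    Assumes cachedContentTokenCount is a subset of promptTokenCount,
    per the Gemini API's usageMetadata semantics.
    """
    prompt = usage_metadata.get("promptTokenCount", 0)
    cached = usage_metadata.get("cachedContentTokenCount", 0)
    return cached / prompt if prompt else 0.0

meta = {"promptTokenCount": 1700, "cachedContentTokenCount": 1500,
        "candidatesTokenCount": 50, "totalTokenCount": 1750}
print(f"{cached_fraction(meta):.0%}")  # → 88%
```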