Prompt caching allows providers to reuse previously processed prompt content. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens — reducing both cost and latency significantly.

Inspecting Cache Usage

Cache usage is reflected in the usage object of every response:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
| Field | Description |
| --- | --- |
| prompt_tokens_details.cached_tokens | Tokens read from cache (cache hit, billed at a reduced rate) |
| prompt_tokens_details.cache_write_tokens | Tokens written to cache on this request (one-time write cost) |
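
These two fields combine with your provider's pricing to give the effective prompt cost. As a minimal sketch (the per-token rates below are hypothetical placeholders, not real prices; check your provider's pricing page), you can estimate the prompt cost of a response:

```python
# Hypothetical per-million-token rates -- real prices vary by model and provider.
INPUT_RATE = 3.00        # $/M tokens, uncached input
CACHE_READ_RATE = 0.30   # $/M tokens, read from cache (the cheaper path)
CACHE_WRITE_RATE = 3.75  # $/M tokens, one-time premium for writing to cache

def prompt_cost(usage: dict) -> float:
    """Estimate prompt cost in dollars from a usage object like the one above."""
    details = usage.get("prompt_tokens_details", {})
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    uncached = usage["prompt_tokens"] - cached - written
    return (uncached * INPUT_RATE
            + cached * CACHE_READ_RATE
            + written * CACHE_WRITE_RATE) / 1_000_000

# The usage object from the example response above
usage = {
    "prompt_tokens": 1500,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}
print(f"${prompt_cost(usage):.6f}")  # → $0.002092
```

In this example all 1,500 prompt tokens were either read from or written to cache, so the uncached term is zero and the write premium dominates; on subsequent hits the same prefix is billed at the read rate only.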

OpenAI Automatic Caching

OpenAI caches prompt prefixes automatically. No special request configuration is needed. How it works:
  • Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
  • Minimum prompt length: 1,024 tokens
  • Cache entries expire after ~1 hour of inactivity
  • Cached tokens are charged at a reduced rate (typically 50% discount)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Long system prompt gets cached automatically on repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)

Anthropic Claude Prompt Caching

Anthropic supports two caching modes:
  • Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
  • Explicit caching (cache_control): You mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what gets cached.

Cache TTL

| Cache Type | TTL |
| --- | --- |
| Automatic | 5 minutes |
| Explicit (ephemeral) | 1 hour (Claude 3.5+) or 5 minutes (Claude 3) |

Supported Models

| Model | Min Tokens (text) | Min Tokens (images) |
| --- | --- | --- |
| anthropic/claude-sonnet-4.6 | 1,024 | 1,024 |
| anthropic/claude-opus-4.5 | 1,024 | 1,024 |
| anthropic/claude-haiku-3.5 | 2,048 | 2,048 |
| anthropic/claude-3-5-sonnet | 1,024 | 1,024 |

Explicit Caching Example

Mark content with cache_control to control caching at the content-block level:
{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}
For the OpenAI-compatible endpoint, include cache_control directly in the structured content blocks of a message:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # Ensure > 1024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))

DeepSeek Automatic Caching

DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)

xAI (Grok) Automatic Caching

Grok models cache prompt prefixes automatically when the same prefix is reused across requests. No special configuration is required.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Grok caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "system", "content": "<long system prompt>" * 100},
        {"role": "user", "content": "Answer the question."},
    ],
)

# Cache hit reflected in usage
print(response.usage.prompt_tokens_details)

Groq Automatic Caching

Groq’s inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reflected in the response usage object.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Groq caches automatically on repeated calls
response = client.chat.completions.create(
    model="groq/meta-llama/llama-4-maverick",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

print(response.usage.prompt_tokens_details.cached_tokens)

Google Gemini Prompt Caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit Caching

Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.
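
Because implicit caching requires no configuration, verifying it is just a matter of reading the usage object. As a minimal sketch (the payload below is illustrative, not from a live response), you can compute what fraction of a prompt was served from cache:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 means a full miss)."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return cached / prompt if prompt else 0.0

# Illustrative usage payload in the OpenAI-compatible shape shown earlier
sample = {
    "prompt_tokens": 2000,
    "prompt_tokens_details": {"cached_tokens": 1500},
}
print(cache_hit_ratio(sample))  # → 0.75
```

A ratio near 1.0 across repeated calls indicates the shared prefix is being cached as expected; a persistent 0.0 usually means the prefix changed between requests or is below the provider's minimum length.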

Explicit Caching via Native Gemini API

For fine-grained control, use the native Gemini cachedContents API. You create a cache object and reference it in subsequent requests:
{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}
Use the native Gemini endpoint via ARouter’s provider proxy to work with cached content:
# Create cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'
The response includes a name field (e.g., cachedContents/abc123) you reference in subsequent requests:
curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
Cache usage appears in the response:
{
  "usageMetadata": {
    "promptTokenCount": 200,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 250
  }
}

Provider Sticky Routing

To maximize cache hits, your repeated requests should reach the same provider instance. ARouter supports sticky routing to ensure this for providers that require it. When you include Anthropic cache_control blocks in your request, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint, preserving cache validity.

How Sticky Routing Works

  1. Your first request with a cache_control block is processed and cached at the provider
  2. ARouter records which provider instance handled the request
  3. Subsequent requests with the same cache prefix are routed to the same instance
  4. Cache hits lower your cost (reads are cheaper than writes) and reduce latency

Verifying Cache Hits

Check the usage object to confirm cache hits across requests:
# First request — cache miss, content is written
response1 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 1"},
    ],
)
# prompt_tokens_details.cache_write_tokens > 0
print(response1.usage.prompt_tokens_details)

# Second request — cache hit
response2 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 2"},
    ],
)
# prompt_tokens_details.cached_tokens > 0 (cache hit!)
print(response2.usage.prompt_tokens_details)

Provider Cache Support Summary

| Provider | Cache Type | Min Tokens | TTL | Configuration |
| --- | --- | --- | --- | --- |
| OpenAI | Automatic | 1,024 | ~1 hour | None required |
| Anthropic | Automatic + Explicit | 1,024 | 5 min (auto), 1 hour (explicit) | cache_control blocks |
| DeepSeek | Automatic | 1,024 | Provider-defined | None required |
| Google Gemini | Automatic + Explicit | 32,768 | 1 hour default | cachedContents API |
| xAI (Grok) | Automatic | Provider-defined | Provider-defined | None required |
| Groq | Automatic | Provider-defined | Provider-defined | None required |