Prompt caching allows providers to reuse previously processed prompt content. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens — reducing both cost and latency significantly.

Inspecting Cache Usage

Cache usage is reflected in the usage object of every response:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
Field                                        Description
prompt_tokens_details.cached_tokens          Tokens read from cache (cache hit; billed at a reduced rate)
prompt_tokens_details.cache_write_tokens     Tokens written to cache on this request (one-time write cost)
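These two fields make it easy to measure how well caching is working for a given workload. The helper below is a hypothetical sketch (not part of any SDK) that computes the fraction of prompt tokens served from cache, assuming the usage shape shown above:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens read from cache (0.0 when nothing was cached)."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return cached / prompt if prompt else 0.0

usage = {
    "prompt_tokens": 1500,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}
print(f"{cache_hit_rate(usage):.0%}")  # → 68%
```

A consistently low hit rate usually means the prompt prefix is changing between requests (e.g. a timestamp near the top of the system prompt), which prevents prefix matching.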

OpenAI Automatic Caching

OpenAI caches prompt prefixes automatically. No special request configuration is needed. How it works:
  • Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
  • Minimum prompt length: 1,024 tokens
  • Cache entries expire after ~1 hour of inactivity
  • Cached tokens are charged at a reduced rate (typically 50% discount)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Long system prompt gets cached automatically on repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)
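Because caching only triggers above the 1,024-token minimum, it can be useful to sanity-check prompt length before relying on cache hits. The helper below is a rough sketch using the common ~4-characters-per-token heuristic, not a real tokenizer; for exact counts, use a tokenizer such as tiktoken.

```python
def likely_cacheable(text: str, min_tokens: int = 1024) -> bool:
    """Rough pre-check: estimates tokens at ~4 characters each (heuristic only)."""
    return len(text) / 4 >= min_tokens

doc = "reference text " * 400  # ~6,000 characters, roughly 1,500 tokens
print(likely_cacheable(doc))  # → True
```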

Anthropic Claude Prompt Caching

Anthropic supports two caching modes:
  • Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
  • Explicit caching (cache_control): You mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what gets cached.

Cache TTL

Cache Type              TTL
Automatic               5 minutes
Explicit (ephemeral)    1 hour (Claude 3.5+) or 5 minutes (Claude 3)

Supported Models

Model                          Min Tokens (text)    Min Tokens (images)
anthropic/claude-sonnet-4.6    1,024                1,024
anthropic/claude-opus-4.5      1,024                1,024
anthropic/claude-haiku-3.5     2,048                2,048
anthropic/claude-3-5-sonnet    1,024                1,024
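Content below the per-model minimum is silently not cached, so it can help to check against the table before adding cache_control markers. The lookup below is a hypothetical helper that just encodes the text-token minimums from the table above:

```python
# Minimum cacheable prompt sizes (text), taken from the table above.
MIN_CACHE_TOKENS = {
    "anthropic/claude-sonnet-4.6": 1024,
    "anthropic/claude-opus-4.5": 1024,
    "anthropic/claude-haiku-3.5": 2048,
    "anthropic/claude-3-5-sonnet": 1024,
}

def meets_cache_minimum(model: str, prompt_tokens: int) -> bool:
    """True if the prompt is long enough for Anthropic to cache it."""
    return prompt_tokens >= MIN_CACHE_TOKENS.get(model, 1024)

print(meets_cache_minimum("anthropic/claude-haiku-3.5", 1500))  # → False
```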

Explicit Caching Example

Mark content with cache_control to control caching at the content-block level:
{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}
For the OpenAI-compatible endpoint, include cache_control directly in the content blocks:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # Ensure > 1024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))
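Cache writes cost slightly more than uncached tokens, while cache reads are heavily discounted, so the write cost pays off only when the prefix is reused. The sketch below estimates prompt cost from the usage fields; the 1.25x write and 0.10x read multipliers and the $3/MTok base rate are illustrative assumptions, not official pricing, and it assumes this API's convention that prompt_tokens = uncached + cached + written.

```python
def prompt_cost(usage: dict, base_rate_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Estimated prompt cost in dollars. Multipliers are illustrative, not official pricing."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    uncached = usage["prompt_tokens"] - cached - written
    billable = uncached + cached * read_mult + written * write_mult
    return billable * base_rate_per_mtok / 1_000_000

usage = {"prompt_tokens": 1500,
         "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476}}
with_cache = prompt_cost(usage, base_rate_per_mtok=3.00)
no_cache = 1500 * 3.00 / 1_000_000
print(f"${with_cache:.6f} with caching vs ${no_cache:.6f} without")
```

In this example the cached request costs roughly half the uncached one; subsequent requests that hit the cache (no write cost) save even more.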

DeepSeek Automatic Caching

DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)

Google Gemini Prompt Caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit Caching

Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.

Explicit Caching via Native Gemini API

For fine-grained control, use the native Gemini cachedContents API. You create a cache object and reference it in subsequent requests:
{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}
Use the native Gemini endpoint via ARouter’s provider proxy to work with cached content:
# Create cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'
The response includes a name field (e.g., cachedContents/abc123) you reference in subsequent requests:
curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
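The same two calls can be scripted in Python. The sketch below mirrors the curl examples above using only the standard library; the endpoint paths and request bodies follow those examples, and the network wrapper is an assumption-laden convenience, not an official client:

```python
import json
import urllib.request

BASE = "https://api.arouter.ai/google/v1beta"
HEADERS = {"Authorization": "Bearer lr_live_xxxx", "Content-Type": "application/json"}

def cache_body(document: str, ttl_seconds: int = 3600) -> dict:
    """Request body for POST /cachedContents (mirrors the first curl call)."""
    return {
        "model": "models/gemini-2.5-flash",
        "contents": [{"role": "user", "parts": [{"text": document}]}],
        "ttl": f"{ttl_seconds}s",
    }

def generate_body(cache_name: str, question: str) -> dict:
    """Request body for generateContent referencing an existing cache."""
    return {
        "contents": [{"role": "user", "parts": [{"text": question}]}],
        "cachedContent": cache_name,
    }

def post(path: str, body: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE}/{path}", data=json.dumps(body).encode(),
        headers=HEADERS, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# cache = post("cachedContents", cache_body("<large document content>"))
# answer = post("models/gemini-2.5-flash:generateContent",
#               generate_body(cache["name"], "Summarize"))
```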
Cache usage appears in the response:
{
  "usageMetadata": {
    "promptTokenCount": 1700,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 1750
  }
}
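To track how much of each request is served from the cache, the usageMetadata fields can be reduced to a single ratio. The helper below is a hypothetical sketch; it assumes cachedContentTokenCount is included within promptTokenCount, as in the Gemini API's usage reporting:

```python
def cached_fraction(usage_metadata: dict) -> float:
    """Fraction of prompt tokens served from the explicit cache.

    Assumes cachedContentTokenCount is a subset of promptTokenCount,
    per the Gemini API's usageMetadata semantics.
    """
    prompt = usage_metadata.get("promptTokenCount", 0)
    cached = usage_metadata.get("cachedContentTokenCount", 0)
    return cached / prompt if prompt else 0.0

meta = {"promptTokenCount": 1700, "cachedContentTokenCount": 1500,
        "candidatesTokenCount": 50, "totalTokenCount": 1750}
print(f"{cached_fraction(meta):.0%}")  # → 88%
```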