Prompt caching allows providers to reuse previously processed prompt content. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens, which reduces both cost and latency.
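As a rough illustration of prefix matching, the sketch below counts how many leading tokens two prompts share. Whitespace splitting stands in for a real tokenizer, and shared_prefix_length is a hypothetical helper rather than part of any SDK; providers match on their own tokenization.

```python
def shared_prefix_length(a: list[str], b: list[str]) -> int:
    """Return how many leading items two token sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Assumption: whitespace splitting approximates tokenization for illustration only.
system = "You are a support agent. Follow the policy document below."
first = (system + " What is the refund window?").split()
second = (system + " How do I reset my password?").split()

# Only the shared leading portion (the system prompt) is eligible for reuse.
print(shared_prefix_length(first, second))
```

Keeping stable content (system prompt, reference documents) at the start of the prompt and variable content at the end makes the shared prefix as long as possible.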
Inspecting Cache Usage
Cache usage is reflected in the usage object of every response:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
| Field | Description |
|---|---|
| prompt_tokens_details.cached_tokens | Tokens read from cache (cache hit, cheaper) |
| prompt_tokens_details.cache_write_tokens | Tokens written to cache this request (one-time write cost) |
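A minimal sketch of reading these fields from a response that has been parsed into a dict. The helper name cache_hit_ratio is hypothetical, and it assumes cached_tokens is counted as part of prompt_tokens, as the sample numbers suggest.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens that were read from cache (0.0 to 1.0)."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Values copied from the sample usage object.
usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}

print(f"{cache_hit_ratio(usage):.1%}")
```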
OpenAI Automatic Caching
OpenAI caches prompt prefixes automatically. No special request configuration is needed.
How it works:
- Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
- Minimum prompt length: 1,024 tokens
- Cache entries expire after ~1 hour of inactivity
- Cached tokens are charged at a reduced rate (typically 50% discount)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Long system prompt gets cached automatically on repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)
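To see what the reduced rate means in practice, here is a small estimate of input cost with and without a cache hit. The price of 2.00 per million tokens is a made-up placeholder, and the 50% discount is the typical figure mentioned above; check actual pricing for your model.

```python
def estimate_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,
) -> float:
    """Estimate input cost when cached tokens are billed at a discount."""
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1.0 - cached_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical price per million input tokens; substitute your model's rate.
PRICE = 2.00

without_cache = estimate_input_cost(1500, 0, PRICE)
with_cache = estimate_input_cost(1500, 1024, PRICE)
print(f"no cache: {without_cache:.6f}, with cache: {with_cache:.6f}")
```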
Anthropic Claude Prompt Caching
Anthropic supports two caching modes:
- Automatic caching (default): Claude caches the system prompt automatically. The minimum is 1,024 tokens for most models; see Supported Models for per-model minimums.
- Explicit caching (cache_control): You mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what gets cached.
Cache TTL
| Cache Type | TTL |
|---|---|
| Automatic | 5 minutes |
| Explicit (ephemeral) | 1 hour (Claude 3.5+) or 5 minutes (Claude 3) |
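The sketch below shows one way to reason about these TTLs on the client side. It assumes the TTL is measured from the last time the cached prefix was used, which is an assumption for illustration; likely_still_cached is a hypothetical helper, and only the provider knows the real cache state.

```python
from datetime import datetime, timedelta

# TTLs taken from the table above (explicit value shown is the Claude 3.5+ one).
AUTOMATIC_TTL = timedelta(minutes=5)
EXPLICIT_TTL = timedelta(hours=1)


def likely_still_cached(last_used: datetime, now: datetime, ttl: timedelta) -> bool:
    """Return True if less than one TTL has elapsed since the last use."""
    return now - last_used < ttl


last_used = datetime(2025, 1, 1, 12, 0, 0)
now = datetime(2025, 1, 1, 12, 10, 0)  # ten minutes after the last request

print(likely_still_cached(last_used, now, AUTOMATIC_TTL))  # False
print(likely_still_cached(last_used, now, EXPLICIT_TTL))   # True
```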
Supported Models
| Model | Min Tokens (text) | Min Tokens (images) |
|---|---|---|
| anthropic/claude-sonnet-4.6 | 1,024 | 1,024 |
| anthropic/claude-opus-4.5 | 1,024 | 1,024 |
| anthropic/claude-haiku-3.5 | 2,048 | 2,048 |
| anthropic/claude-3-5-sonnet | 1,024 | 1,024 |
Explicit Caching Example
Mark content with cache_control to control caching at the content-block level:
{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}
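If you build this request body in code, a small helper keeps the cache_control marker consistent. The following is a sketch only: cached_text_block and build_request are hypothetical names, and the body simply mirrors the JSON shape shown above.

```python
import json


def cached_text_block(text: str) -> dict:
    """Wrap text in a content block marked for explicit (ephemeral) caching."""
    return {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}


def build_request(model: str, document: str, question: str, max_tokens: int = 1024) -> dict:
    """Build a request body with the reference document as a cached system block."""
    return {
        "model": model,
        "system": [cached_text_block(f"Reference document:\n\n{document}")],
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }


body = build_request(
    "claude-sonnet-4.6",
    "<document>...</document>",
    "What does the document say about pricing?",
)
print(json.dumps(body, indent=2))
```

Only the system block carries cache_control here, so the question can change between requests without altering the cached prefix.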
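For the OpenAI-compatible endpoint, include cache_control directly in the content block of the message. The three examples that follow use, in order, the OpenAI Python SDK, the OpenAI Node.js SDK, and the Anthropic SDK: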
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Placeholder text; real content must exceed the 1,024-token minimum to be cached
long_document = "<document content here>" * 50

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
# prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const longDocument = "<document content here>".repeat(50);

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: `Reference document:\n\n${longDocument}`,
          // @ts-ignore: cache_control is Anthropic-specific
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "Summarize the key points." },
  ],
});

console.log(response.usage?.prompt_tokens_details);
// { cached_tokens: 1024, cache_write_tokens: 476 }
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.arouter.ai",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key points."}
    ],
)

print(response.usage)
# Usage(input_tokens=476, output_tokens=80,
# cache_creation_input_tokens=1024, cache_read_input_tokens=0)

# Second call: cache is read instead of written
response2 = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What's the main topic?"}
    ],
)

print(response2.usage)
# Usage(input_tokens=476, output_tokens=40,
# cache_creation_input_tokens=0, cache_read_input_tokens=1024)
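The Anthropic SDK reports cache activity under different field names than the OpenAI-compatible usage object. The sketch below shows one plausible mapping between the two shapes. It is an illustration, not ARouter's documented conversion: to_openai_style_usage is a hypothetical helper, and it assumes input_tokens counts only uncached tokens, as the second call's numbers suggest.

```python
def to_openai_style_usage(anthropic_usage: dict) -> dict:
    """Map Anthropic-style usage fields onto the OpenAI-compatible shape."""
    uncached = anthropic_usage.get("input_tokens", 0)
    cache_read = anthropic_usage.get("cache_read_input_tokens", 0)
    cache_write = anthropic_usage.get("cache_creation_input_tokens", 0)
    output = anthropic_usage.get("output_tokens", 0)

    # Assumption: the total prompt size is the sum of the three input counters.
    prompt_tokens = uncached + cache_read + cache_write
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": output,
        "total_tokens": prompt_tokens + output,
        "prompt_tokens_details": {
            "cached_tokens": cache_read,
            "cache_write_tokens": cache_write,
        },
    }


# Usage values from the second call above (a cache read).
second_call = {
    "input_tokens": 476,
    "output_tokens": 40,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1024,
}
print(to_openai_style_usage(second_call))
```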
DeepSeek Automatic Caching
DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration needed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)
xAI (Grok) Automatic Caching
Grok models cache prompt prefixes automatically when the same prefix is reused across requests. No special configuration is required.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Grok caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "system", "content": "<long system prompt>" * 100},
        {"role": "user", "content": "Answer the question."},
    ],
)

# Cache hit reflected in usage
print(response.usage.prompt_tokens_details)
Groq Automatic Caching
Groq’s inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reflected in the response usage object.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Groq caches automatically on repeated calls
response = client.chat.completions.create(
    model="groq/meta-llama/llama-4-maverick",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

print(response.usage.prompt_tokens_details.cached_tokens)
Google Gemini Prompt Caching
Gemini supports both implicit (automatic) and explicit caching.
Implicit Caching
Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.
Explicit Caching via Native Gemini API
For fine-grained control, use the native Gemini cachedContents API. You create a cache object and reference it in subsequent requests:
{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}
Use the native Gemini endpoint via ARouter’s provider proxy to work with cached content:
# Create cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'
The response includes a name field (e.g., cachedContents/abc123) you reference in subsequent requests:
curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
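The two-step flow above can be sketched in Python as a pure function that takes the parsed create-cache response and produces the body for the follow-up generateContent call. The helper name request_with_cache is hypothetical, and the sample response contains only the name field described above; sending the request is left to whichever HTTP client you use.

```python
def request_with_cache(cache_response: dict, question: str) -> dict:
    """Build a generateContent body that references previously cached content."""
    return {
        "contents": [{"role": "user", "parts": [{"text": question}]}],
        # The name returned when the cache was created, e.g. cachedContents/abc123
        "cachedContent": cache_response["name"],
    }


# Assumed shape of the create-cache response; only the name field is used here.
cache_response = {"name": "cachedContents/abc123"}

body = request_with_cache(cache_response, "Summarize")
print(body)
```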
Cache usage appears in the response:
{
  "usageMetadata": {
    "promptTokenCount": 200,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 250
  }
}
Provider Sticky Routing
To maximize cache hits, your repeated requests should reach the same provider instance. ARouter supports sticky routing to ensure this for providers that require it.
When you include Anthropic cache_control blocks in your request, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint, preserving cache validity.
How Sticky Routing Works
- Your first request with a cache_control block is processed and cached at the provider
- ARouter records which provider instance handled the request
- Subsequent requests with the same cache prefix are routed to the same instance
- Cache hits lower your cost (reads are cheaper than writes) and reduce latency
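To make the idea concrete, here is a toy model of prefix-keyed sticky routing. It is not ARouter's implementation: the StickyRouter class, the instance names, and the choice of a SHA-256 key over model plus prefix are all assumptions made for illustration.

```python
import hashlib


class StickyRouter:
    """Toy router that pins each (model, cached prefix) pair to one instance."""

    def __init__(self, instances: list[str]):
        self._instances = instances
        self._assignments: dict[str, str] = {}
        self._next = 0

    @staticmethod
    def prefix_key(model: str, cached_prefix: str) -> str:
        """Derive a stable key from the model and the cacheable prefix."""
        return hashlib.sha256(f"{model}\n{cached_prefix}".encode("utf-8")).hexdigest()

    def route(self, model: str, cached_prefix: str) -> str:
        key = self.prefix_key(model, cached_prefix)
        if key not in self._assignments:
            # First request for this prefix: pick an instance round-robin and record it.
            self._assignments[key] = self._instances[self._next % len(self._instances)]
            self._next += 1
        # Later requests with the same prefix reuse the recorded instance.
        return self._assignments[key]


router = StickyRouter(["instance-a", "instance-b"])
print(router.route("anthropic/claude-sonnet-4.6", "document A"))
print(router.route("anthropic/claude-sonnet-4.6", "document A"))
print(router.route("anthropic/claude-sonnet-4.6", "document B"))
```

The first two calls share a prefix and land on the same instance, while the third uses a different prefix and may be placed elsewhere.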
Verifying Cache Hits
Check the usage object to confirm cache hits across requests:
# Assumes client is the OpenAI client configured earlier and long_doc holds the document text.

# First request: cache miss, content is written
response1 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}
            ],
        },
        {"role": "user", "content": "Question 1"},
    ],
)
# prompt_tokens_details.cache_write_tokens > 0
print(response1.usage.prompt_tokens_details)

# Second request: cache hit
response2 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}
            ],
        },
        {"role": "user", "content": "Question 2"},
    ],
)
# prompt_tokens_details.cached_tokens > 0 (cache hit!)
print(response2.usage.prompt_tokens_details)
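For logging or tests, the check can be wrapped in a small classifier. This is a sketch: cache_status is a hypothetical helper, and SimpleNamespace objects stand in for the SDK's prompt_tokens_details so the example runs without a network call.

```python
from types import SimpleNamespace


def cache_status(details) -> str:
    """Classify a prompt_tokens_details object as 'hit', 'write', or 'none'."""
    cached = getattr(details, "cached_tokens", 0) or 0
    written = getattr(details, "cache_write_tokens", 0) or 0
    if cached > 0:
        return "hit"
    if written > 0:
        return "write"
    return "none"


# Stand-ins for response1 and response2 usage details.
first_details = SimpleNamespace(cached_tokens=0, cache_write_tokens=1024)
second_details = SimpleNamespace(cached_tokens=1024, cache_write_tokens=0)

print(cache_status(first_details))   # write
print(cache_status(second_details))  # hit
```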
Provider Cache Support Summary
| Provider | Cache Type | Min Tokens | TTL | Configuration |
|---|---|---|---|---|
| OpenAI | Automatic | 1,024 | ~1 hour | None required |
| Anthropic | Automatic + Explicit | 1,024 | 5 min (auto), 1 hour (explicit) | cache_control blocks |
| DeepSeek | Automatic | 1,024 | Provider-defined | None required |
| Google Gemini | Automatic + Explicit | 32,768 | 1 hour default | cachedContents API |
| xAI (Grok) | Automatic | Provider-defined | Provider-defined | None required |
| Groq | Automatic | Provider-defined | Provider-defined | None required |
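The minimums in the table can be encoded as a quick pre-flight check. The sketch below is illustrative: meets_cache_minimum is a hypothetical helper, the provider is read from the model slug prefix, and the "google" prefix is an assumed slug for Gemini models. Providers listed as provider-defined return None because the table gives no number, and per-model exceptions (such as a 2,048-token minimum) are not covered.

```python
from typing import Optional

# Provider-level minimums from the summary table; None means provider-defined.
MIN_CACHE_TOKENS = {
    "openai": 1024,
    "anthropic": 1024,
    "deepseek": 1024,
    "google": 32768,  # assumed slug prefix for Gemini models
    "x-ai": None,
    "groq": None,
}


def meets_cache_minimum(model: str, prompt_tokens: int) -> Optional[bool]:
    """Return True/False if the minimum is known, or None if it is not."""
    provider = model.split("/", 1)[0]
    minimum = MIN_CACHE_TOKENS.get(provider)
    if minimum is None:
        return None
    return prompt_tokens >= minimum


print(meets_cache_minimum("openai/gpt-5.4", 1500))                    # True
print(meets_cache_minimum("anthropic/claude-sonnet-4.6", 800))        # False
print(meets_cache_minimum("groq/meta-llama/llama-4-maverick", 5000))  # None
```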