Inspecting Cache Usage
Cache usage is reflected in the `usage` object of every response:
| Field | Description |
|---|---|
| `prompt_tokens_details.cached_tokens` | Tokens read from cache (a cache hit; billed at the discounted rate) |
| `prompt_tokens_details.cache_write_tokens` | Tokens written to cache on this request (a one-time write cost) |
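As a sketch, these fields can be read from any response's `usage` object; the payload below is illustrative, not a real API response:

```python
# Illustrative response payload; field names follow the usage schema above.
sample_response = {
    "usage": {
        "prompt_tokens": 2048,
        "completion_tokens": 120,
        "prompt_tokens_details": {
            "cached_tokens": 1024,        # read from cache (discounted)
            "cache_write_tokens": 0,      # written to cache this request
        },
    }
}

def cache_stats(response: dict) -> dict:
    """Extract cache read/write counts from a response's usage object."""
    details = response["usage"].get("prompt_tokens_details", {})
    return {
        "cached": details.get("cached_tokens", 0),
        "written": details.get("cache_write_tokens", 0),
    }

print(cache_stats(sample_response))  # {'cached': 1024, 'written': 0}
```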
OpenAI Automatic Caching
OpenAI caches prompt prefixes automatically; no special request configuration is needed. How it works:
- Caching happens server-side at OpenAI, triggered automatically when prompts are long enough
- Minimum prompt length: 1,024 tokens
- Cache entries expire after ~1 hour of inactivity
- Cached tokens are charged at a reduced rate (typically 50% discount)
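Because caching only kicks in at the 1,024-token minimum, it can be useful to check whether a prompt is even eligible. The helper below uses a rough characters-per-token heuristic (an assumption; a real tokenizer such as tiktoken gives exact counts):

```python
MIN_CACHEABLE_TOKENS = 1024  # OpenAI's documented minimum prompt length

def estimated_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use a real tokenizer (e.g. tiktoken) for exact counts.
    return len(text) // 4

def may_be_cached(prompt: str) -> bool:
    """True if the prompt is long enough to trigger automatic caching."""
    return estimated_tokens(prompt) >= MIN_CACHEABLE_TOKENS

short_prompt = "Summarize this paragraph."
long_prompt = "You are a helpful assistant. " * 200  # long repeated preamble
print(may_be_cached(short_prompt), may_be_cached(long_prompt))  # False True
```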
Anthropic Claude Prompt Caching
Anthropic supports two caching modes:
- Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
- Explicit caching (`cache_control`): you mark specific content blocks with `"cache_control": {"type": "ephemeral"}` to control exactly what gets cached.
Cache TTL
| Cache Type | TTL |
|---|---|
| Automatic | 5 minutes |
| Explicit (ephemeral) | 1 hour (Claude 3.5+) or 5 minutes (Claude 3) |
Supported Models
| Model | Min Tokens (text) | Min Tokens (images) |
|---|---|---|
| `anthropic/claude-sonnet-4.6` | 1,024 | 1,024 |
| `anthropic/claude-opus-4.5` | 1,024 | 1,024 |
| `anthropic/claude-haiku-3.5` | 2,048 | 2,048 |
| `anthropic/claude-3-5-sonnet` | 1,024 | 1,024 |
Explicit Caching Example
Mark content with `cache_control` to control caching at the content-block level:
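A minimal sketch of explicit caching, assuming an OpenAI-compatible endpoint that passes Anthropic content blocks through unchanged; the document text and question here are placeholders:

```python
# Placeholder long reference text; assume it clears the 1,024-token minimum.
big_document = "Long reference text about the product... " * 300

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You answer questions about the document."},
            {
                "type": "text",
                "text": big_document,
                # Mark this block for Anthropic's ephemeral cache.
                "cache_control": {"type": "ephemeral"},
            },
        ],
    },
    {"role": "user", "content": "What does section 2 cover?"},
]

# Providers that understand cache_control honor the marker; others ignore it.
cached_blocks = [
    part
    for msg in messages
    if isinstance(msg["content"], list)
    for part in msg["content"]
    if "cache_control" in part
]
print(len(cached_blocks))  # 1
```

Only the marked block is cached; the short instruction block and the user turn are processed normally on every request.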
DeepSeek Automatic Caching
DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration is needed.
xAI (Grok) Automatic Caching
Grok models cache prompt prefixes automatically when the same prefix is reused across requests. No special configuration is required.
Groq Automatic Caching
Groq's inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reflected in the response `usage` object.
Google Gemini Prompt Caching
Gemini supports both implicit (automatic) and explicit caching.
Implicit Caching
Gemini 2.5 Flash and Pro cache large contexts automatically at no extra cost. Cache hits are visible in the response usage.
Explicit Caching via Native Gemini API
For fine-grained control, use the native Gemini `cachedContents` API. You create a cache object, and the response's `name` field (e.g., `cachedContents/abc123`) is what you reference in subsequent requests:
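A sketch of the two request bodies involved, built offline; the model name, TTL, and cache name are placeholders, and field names are assumed to follow the public `cachedContents` REST schema:

```python
import json

# Body for creating a cache (POST .../v1beta/cachedContents).
# Model name and TTL below are placeholder assumptions.
create_body = {
    "model": "models/gemini-2.5-flash",
    "contents": [
        {"role": "user", "parts": [{"text": "Very large context to cache..."}]}
    ],
    "ttl": "3600s",  # explicit caches default to 1 hour
}

# The create call returns a resource name such as "cachedContents/abc123";
# subsequent generateContent requests reference it instead of resending
# the large context.
generate_body = {
    "cachedContent": "cachedContents/abc123",  # placeholder name
    "contents": [
        {"role": "user", "parts": [{"text": "Summarize the cached context."}]}
    ],
}

print(json.dumps(create_body, indent=2))
```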
Provider Sticky Routing
To maximize cache hits, repeated requests must reach the same provider instance. ARouter supports sticky routing to ensure this for providers that require it. When you include Anthropic `cache_control` blocks in your request, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint, preserving cache validity.
How Sticky Routing Works
- Your first request with a `cache_control` block is processed and cached at the provider
- ARouter records which provider instance handled the request
- Subsequent requests with the same cache prefix are routed to the same instance
- Cache hits lower your cost (reads are cheaper than writes) and reduce latency
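The flow above can be sketched as a prefix-keyed routing table. This is an illustration of the idea, not ARouter's actual internals:

```python
import hashlib

# Illustrative sticky-routing sketch: requests that share a cached prefix
# are mapped to the same endpoint.
routing_table: dict[str, str] = {}

def prefix_key(messages: list[dict]) -> str:
    """Hash the stable prefix (here: the system message) as the sticky key."""
    prefix = next((m["content"] for m in messages if m["role"] == "system"), "")
    return hashlib.sha256(str(prefix).encode()).hexdigest()

def route(messages: list[dict], endpoints: list[str]) -> str:
    key = prefix_key(messages)
    if key not in routing_table:
        # First request with this prefix: pick an endpoint and remember it.
        routing_table[key] = endpoints[len(routing_table) % len(endpoints)]
    return routing_table[key]

msgs = [
    {"role": "system", "content": "big cached prefix"},
    {"role": "user", "content": "question 1"},
]
endpoints = ["endpoint-a", "endpoint-b"]
print(route(msgs, endpoints) == route(msgs, endpoints))  # True
```

The second call returns the same endpoint as the first, so the provider-side cache entry written on the first request stays valid.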
Verifying Cache Hits
Check the `usage` object to confirm cache hits across requests:
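For example, comparing the usage objects of two consecutive requests with the same prefix; the numbers below are illustrative:

```python
def cached_tokens(response: dict) -> int:
    """Tokens served from cache on this request."""
    return (response["usage"]
            .get("prompt_tokens_details", {})
            .get("cached_tokens", 0))

# Illustrative usage objects from two requests sharing the same prefix.
first = {"usage": {"prompt_tokens": 2000,
                   "prompt_tokens_details": {"cached_tokens": 0,
                                             "cache_write_tokens": 1500}}}
second = {"usage": {"prompt_tokens": 2000,
                    "prompt_tokens_details": {"cached_tokens": 1500,
                                              "cache_write_tokens": 0}}}

# The first request writes the cache; the second should hit it.
assert cached_tokens(first) == 0
assert cached_tokens(second) > 0
print("cache hit on second request:", cached_tokens(second), "tokens")
```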
Provider Cache Support Summary
| Provider | Cache Type | Min Tokens | TTL | Configuration |
|---|---|---|---|---|
| OpenAI | Automatic | 1,024 | ~1 hour | None required |
| Anthropic | Automatic + Explicit | 1,024 (2,048 for Haiku 3.5) | 5 min (auto), 1 hour (explicit) | `cache_control` blocks |
| DeepSeek | Automatic | 1,024 | Provider-defined | None required |
| Google Gemini | Automatic + Explicit | 32,768 | 1 hour default | cachedContents API |
| xAI (Grok) | Automatic | Provider-defined | Provider-defined | None required |
| Groq | Automatic | Provider-defined | Provider-defined | None required |