Prompt Caching - ARouter

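Prompt caching lets providers reuse prompt content they have already processed. When the beginning of your prompt matches a previously cached prefix, the provider skips reprocessing those tokens, which significantly reduces cost and latency.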
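
Because a hit depends on the start of the prompt matching a cached prefix, it helps to keep stable content (instructions, reference documents) first and variable content (the user question) last. The sketch below illustrates the idea by counting shared leading tokens between two prompts. The function name is hypothetical, and whitespace splitting is only a stand-in for a real tokenizer.

```python
def shared_prefix_length(previous: list[str], current: list[str]) -> int:
    """Count how many leading tokens two prompts have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Whitespace splitting stands in for a real tokenizer (illustration only).
first = "You are a support bot. Policy: refunds within 30 days. Q: where is my order?".split()
second = "You are a support bot. Policy: refunds within 30 days. Q: can I return this?".split()

# Only the shared leading portion could be served from a prefix cache.
print(shared_prefix_length(first, second))  # 11
```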

Viewing cache usage

Cache usage is reported in the usage object of every response:

{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}

Field	Description
`prompt_tokens_details.cached_tokens`	Tokens read from the cache (a cache hit, billed at a lower rate)
`prompt_tokens_details.cache_write_tokens`	Tokens written to the cache by this request (a one-time write charge)
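
As a minimal sketch of how these fields might be used, the helper below computes the share of prompt tokens served from the cache. The function name `cache_hit_ratio` is hypothetical, and the usage payload is the sample shown above, parsed into a plain dict.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Return the fraction of prompt tokens that were read from the cache."""
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    return details.get("cached_tokens", 0) / prompt_tokens


usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}

print(f"{cache_hit_ratio(usage):.1%}")  # 68.3%
```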

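OpenAI automatic caching

OpenAI caches prompt prefixes automatically; no special request configuration is needed. How it works:

Caching happens on OpenAI's servers and is triggered automatically once the prompt is long enough
Minimum prompt length: 1,024 tokens
Cache entries expire after roughly 1 hour of inactivity
Cached tokens are billed at a discounted rate (typically 50% off)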
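
To make the discount concrete, here is a rough cost estimate under stated assumptions: the per-million-token price of 2.00 is a made-up figure, the 50% discount is the typical value mentioned above rather than a guaranteed rate, and `estimate_prompt_cost` is a hypothetical helper.

```python
def estimate_prompt_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,
) -> float:
    """Estimate input cost when cached tokens are billed at a discount."""
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1 - cached_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical price: 2.00 per million input tokens.
with_cache = estimate_prompt_cost(1500, 1024, price_per_million=2.0)
without_cache = estimate_prompt_cost(1500, 0, price_per_million=2.0)

print(f"{with_cache:.6f}")     # 0.001976
print(f"{without_cache:.6f}")  # 0.003000
```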

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# On repeated calls, the long system prompt is cached automatically
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)

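Anthropic Claude prompt caching

Anthropic supports two caching modes:

Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
Explicit caching (cache_control): mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what is cached.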

Cache TTL

Cache type	TTL
Automatic	5 minutes
Explicit (`ephemeral`)	1 hour (Claude 3.5+) or 5 minutes (Claude 3)

Supported models

Model	Minimum tokens (text)	Minimum tokens (images)
`anthropic/claude-sonnet-4.6`	1,024	1,024
`anthropic/claude-opus-4.5`	1,024	1,024
`anthropic/claude-haiku-3.5`	2,048	2,048
`anthropic/claude-3-5-sonnet`	1,024	1,024
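
One way to use these thresholds is a pre-flight check before relying on a cache hit. The sketch below copies the text minimums from the table into a lookup; `is_cacheable` is a hypothetical helper, and the token count is assumed to come from whatever tokenizer or counting endpoint you already use.

```python
# Minimum cacheable prompt sizes (text), copied from the table above.
MIN_CACHEABLE_TOKENS = {
    "anthropic/claude-sonnet-4.6": 1024,
    "anthropic/claude-opus-4.5": 1024,
    "anthropic/claude-haiku-3.5": 2048,
    "anthropic/claude-3-5-sonnet": 1024,
}


def is_cacheable(model: str, token_count: int) -> bool:
    """Return True if a prompt of token_count tokens meets the model's minimum."""
    minimum = MIN_CACHEABLE_TOKENS.get(model)
    if minimum is None:
        raise ValueError(f"no caching threshold known for model {model!r}")
    return token_count >= minimum


print(is_cacheable("anthropic/claude-sonnet-4.6", 1500))  # True
print(is_cacheable("anthropic/claude-haiku-3.5", 1500))   # False
```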

Explicit caching example

Use cache_control to control caching at the content-block level:

{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}

When going through the OpenAI-compatible endpoint, set cache_control directly on the content block:

Python (OpenAI)
Node.js (OpenAI)
Anthropic SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # placeholder; real content must exceed 1,024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const longDocument = "<document content here>".repeat(50);

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: `Reference document:\n\n${longDocument}`,
          // @ts-ignore: cache_control is Anthropic-specific
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "Summarize the key points." },
  ],
});

console.log(response.usage?.prompt_tokens_details);
// { cached_tokens: 1024, cache_write_tokens: 476 }

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.arouter.ai",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key points."}
    ],
)

print(response.usage)
# Usage(input_tokens=476, output_tokens=80,
#   cache_creation_input_tokens=1024, cache_read_input_tokens=0)

# Second call: reads from the cache instead of writing to it
response2 = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What's the main topic?"}
    ],
)

print(response2.usage)
# Usage(input_tokens=476, output_tokens=40,
#   cache_creation_input_tokens=0, cache_read_input_tokens=1024)
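
The Anthropic SDK reports cache activity under different field names than the OpenAI-style usage object shown earlier. The sketch below maps one shape onto the other so both can feed the same monitoring code. The mapping is an assumption rather than a documented ARouter behavior: it treats `input_tokens` as excluding cached tokens, and `normalize_anthropic_usage` is a hypothetical helper.

```python
def normalize_anthropic_usage(usage: dict) -> dict:
    """Convert Anthropic-style usage fields to the OpenAI-style layout."""
    cache_read = usage.get("cache_read_input_tokens", 0)
    cache_write = usage.get("cache_creation_input_tokens", 0)
    # Assumption: input_tokens counts only tokens outside the cache.
    prompt_tokens = usage.get("input_tokens", 0) + cache_read + cache_write
    completion_tokens = usage.get("output_tokens", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "prompt_tokens_details": {
            "cached_tokens": cache_read,
            "cache_write_tokens": cache_write,
        },
    }


# Values taken from the second call above.
second_call = {
    "input_tokens": 476,
    "output_tokens": 40,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1024,
}
print(normalize_anthropic_usage(second_call))
```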

DeepSeek automatic caching

Like OpenAI, DeepSeek caches prompt prefixes automatically, with no configuration required.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically when the same prefix is reused across calls
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check for a cache hit in usage
print(response.usage.prompt_tokens_details.cached_tokens)

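xAI (Grok) automatic caching

Grok models cache prompt prefixes automatically when the same prefix is reused across requests; no special configuration is needed.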

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Grok caches automatically when the same prefix is reused across calls
response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "system", "content": "<long system prompt>" * 100},
        {"role": "user", "content": "Answer the question."},
    ],
)

# Cache hits are reflected in usage
print(response.usage.prompt_tokens_details)

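Groq automatic caching

Groq's inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reported in the response's usage object.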

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Groq caches automatically on repeated calls
response = client.chat.completions.create(
    model="groq/meta-llama/llama-4-maverick",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

print(response.usage.prompt_tokens_details.cached_tokens)
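
With automatic-caching providers, it may be safer not to assume that every response carries `prompt_tokens_details`. The sketch below is a defensive accessor that falls back to zero. Whether a given provider omits the field is an assumption here, `cached_tokens` is a hypothetical helper, and `SimpleNamespace` stands in for the SDK's usage object.

```python
from types import SimpleNamespace


def cached_tokens(usage) -> int:
    """Return cached_tokens from a usage object, or 0 if any level is missing."""
    details = getattr(usage, "prompt_tokens_details", None)
    value = getattr(details, "cached_tokens", None)
    return value or 0


# Stand-ins for SDK usage objects (illustration only).
with_details = SimpleNamespace(
    prompt_tokens_details=SimpleNamespace(cached_tokens=1024)
)
without_details = SimpleNamespace(prompt_tokens_details=None)

print(cached_tokens(with_details))     # 1024
print(cached_tokens(without_details))  # 0
```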

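Google Gemini prompt caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit caching

Gemini 2.5 Flash and Pro cache large contexts automatically at no extra charge. Cache hits are visible in the response usage.

Explicit caching through the Gemini native API

For fine-grained control, use Gemini's native cachedContents API. You create a cache object and reference it in subsequent requests: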

{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}

Use the Gemini native endpoint through ARouter's provider proxy to work with cached content:

# Create the cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'

The response includes a name field (for example, cachedContents/abc123), which you reference in subsequent requests:

curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
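
If you build these request bodies in code rather than by hand, a small helper can keep the cache reference consistent. This is a sketch only: `build_cached_request` is a hypothetical name, the prefix check is a convenience rather than an API requirement, and the body mirrors the JSON used in the curl call above.

```python
import json


def build_cached_request(cache_name: str, prompt: str) -> dict:
    """Build a generateContent body that references an existing cache entry."""
    if not cache_name.startswith("cachedContents/"):
        raise ValueError("cache_name should look like 'cachedContents/<id>'")
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "cachedContent": cache_name,
    }


body = build_cached_request("cachedContents/abc123", "Summarize")
print(json.dumps(body, indent=2))
```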

Cache usage appears in the response:

{
  "usageMetadata": {
    "promptTokenCount": 200,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 250
  }
}

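Provider sticky routing

To maximize cache hit rates, your repeated requests should reach the same provider instance. ARouter supports sticky routing to provide this guarantee for providers that need it. When a request contains Anthropic cache_control blocks, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint, keeping the cache valid.

How sticky routing works

The first request with a cache_control block is processed and cached at the provider
ARouter records which provider instance handled the request
Subsequent requests with the same cached prefix are routed to the same instance
Cache hits lower your cost (reads are cheaper than writes) and reduce latency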
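
The steps above can be pictured with a toy router. This is purely illustrative and not ARouter's actual implementation: the `StickyRouter` class, the round-robin choice for new prefixes, and the instance names are all assumptions made for the sketch.

```python
import hashlib


class StickyRouter:
    """Toy router that pins each cached prefix to one provider instance."""

    def __init__(self, instances: list[str]) -> None:
        self._instances = instances
        self._pinned: dict[str, str] = {}
        self._next = 0

    def route(self, cached_prefix: str) -> str:
        # Identify the prefix by a hash so long documents are cheap to look up.
        key = hashlib.sha256(cached_prefix.encode("utf-8")).hexdigest()
        if key not in self._pinned:
            # First sighting: pick an instance round-robin and remember it.
            self._pinned[key] = self._instances[self._next % len(self._instances)]
            self._next += 1
        return self._pinned[key]


router = StickyRouter(["instance-a", "instance-b"])
print(router.route("reference document A"))  # instance-a
print(router.route("reference document B"))  # instance-b
print(router.route("reference document A"))  # instance-a
```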

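Verifying cache hits

Check the usage object to confirm cache hits across requests:

# Assumes client is configured as in the earlier examples and long_doc holds the document text
# First request: cache miss, content is written to the cache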
response1 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 1"},
    ],
)
# prompt_tokens_details.cache_write_tokens > 0
print(response1.usage.prompt_tokens_details)

# Second request: cache hit
response2 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 2"},
    ],
)
# prompt_tokens_details.cached_tokens > 0 (cache hit)
print(response2.usage.prompt_tokens_details)
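
For automated checks, the two conditions noted in the comments can be folded into a small classifier. This is a sketch with a hypothetical helper name, `cache_status`; it takes `prompt_tokens_details` as a plain dict and reports a hit whenever any tokens were read from the cache, even if the same request also wrote new ones.

```python
def cache_status(details: dict) -> str:
    """Classify a request as a cache 'hit', a cache 'write', or 'none'."""
    if details.get("cached_tokens", 0) > 0:
        return "hit"
    if details.get("cache_write_tokens", 0) > 0:
        return "write"
    return "none"


print(cache_status({"cached_tokens": 0, "cache_write_tokens": 1500}))  # write
print(cache_status({"cached_tokens": 1024, "cache_write_tokens": 0}))  # hit
print(cache_status({}))                                                # none
```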

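Provider cache support summary

Provider	Cache type	Minimum tokens	TTL	Configuration
OpenAI	Automatic	1,024	About 1 hour	None required
Anthropic	Automatic + explicit	1,024 (2,048 for `anthropic/claude-haiku-3.5`)	5 minutes (automatic), 1 hour (explicit, Claude 3.5+)	`cache_control` blocks
DeepSeek	Automatic	1,024	Provider-defined	None required
Google Gemini	Automatic + explicit	32,768	1 hour by default	`cachedContents` API
xAI (Grok)	Automatic	Provider-defined	Provider-defined	None required
Groq	Automatic	Provider-defined	Provider-defined	None required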
