Prompt Caching

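Prompt caching lets providers reuse prompt content they have already processed. When the beginning of a prompt matches a previously cached prefix, the provider skips reprocessing those tokens, which significantly reduces both cost and latency.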
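To make the prefix-matching idea concrete, here is a minimal conceptual sketch. It compares two token sequences and counts how many leading tokens are identical. The function name and the token IDs are made up for illustration; real providers match prefixes in their own provider-specific ways, so treat this as a mental model rather than a description of any provider's implementation.

```python
def shared_prefix_length(cached: list[int], prompt: list[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    count = 0
    for cached_token, prompt_token in zip(cached, prompt):
        if cached_token != prompt_token:
            break
        count += 1
    return count


# Hypothetical token IDs: the new prompt repeats the first four tokens.
cached_prompt = [101, 7, 7, 42, 9]
new_prompt = [101, 7, 7, 42, 13, 5]

reusable = shared_prefix_length(cached_prompt, new_prompt)
print(reusable)  # 4
```

Only the leading run counts: a change near the start of the prompt invalidates everything after it, which is why stable content (system prompts, reference documents) belongs at the beginning.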

Checking Cache Usage

Cache usage is reported in the usage object of every response:

{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "cache_write_tokens": 476
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}

Field	Description
`prompt_tokens_details.cached_tokens`	Tokens read from the cache (cache hit, billed at a lower rate)
`prompt_tokens_details.cache_write_tokens`	Tokens written to the cache by this request (one-time write cost)
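As a small illustration of reading these fields, the sketch below computes the share of prompt tokens served from the cache. It works on a plain dict shaped like the JSON above; `cache_hit_ratio` is a hypothetical helper name, not part of any SDK.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Return cached_tokens / prompt_tokens, or 0.0 when no data is present."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Same numbers as the example response above.
usage = {
    "prompt_tokens": 1500,
    "completion_tokens": 100,
    "total_tokens": 1600,
    "prompt_tokens_details": {"cached_tokens": 1024, "cache_write_tokens": 476},
}

ratio = cache_hit_ratio(usage)
print(f"{ratio:.1%}")  # 68.3%
```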

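OpenAI Automatic Caching

OpenAI caches prompt prefixes automatically. No special request configuration is required. How it works:

Caching is triggered automatically on OpenAI's servers when the prompt is long enough
Minimum prompt length: 1,024 tokens
Cache entries expire after about 1 hour of inactivity
Cached tokens are billed at a discounted rate (typically 50% off)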
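The effect of the discount on a single request can be estimated with simple arithmetic. The sketch below assumes the 50% figure mentioned above and a made-up input price of 2.00 per million tokens; both the price and the helper name `estimate_prompt_cost` are illustrative assumptions, so substitute the actual rates for your model.

```python
def estimate_prompt_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.5,  # assumption: 50% off cached tokens
) -> float:
    """Estimate input cost when part of the prompt is billed at a discount."""
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1 - cached_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical price; token counts taken from the usage example above.
cost_with_cache = estimate_prompt_cost(1500, 1024, price_per_million=2.00)
cost_without_cache = estimate_prompt_cost(1500, 0, price_per_million=2.00)
print(f"{cost_with_cache:.6f} vs {cost_without_cache:.6f}")  # 0.001976 vs 0.003000
```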

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# A long system prompt is cached automatically across repeated calls
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "system",
            "content": "You are an expert assistant. " + "<long context>" * 100,
        },
        {"role": "user", "content": "Summarize the above."},
    ],
)

print(response.usage.prompt_tokens_details)
# PromptTokensDetails(cached_tokens=1024, audio_tokens=0)

Anthropic Claude Prompt Caching

Anthropic supports two caching modes:

Automatic caching (default): Claude caches the system prompt automatically. Minimum 1,024 tokens.
Explicit caching (cache_control): Mark specific content blocks with "cache_control": {"type": "ephemeral"} to control exactly what is cached.

Cache TTL

Cache type	TTL
Automatic	5 minutes
Explicit (`ephemeral`)	1 hour (Claude 3.5+) or 5 minutes (Claude 3)
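When spacing out requests, it can help to estimate whether a cache entry is probably still alive. The following sketch uses the TTL values from the table and assumes the TTL is counted from the most recent request that touched the entry; that assumption, the `TTL_SECONDS` mapping, and the helper name are illustrative and not confirmed provider behavior.

```python
# TTLs from the table above, in seconds. The explicit value assumes Claude 3.5+.
TTL_SECONDS = {"automatic": 5 * 60, "explicit": 60 * 60}


def is_cache_likely_warm(last_used_at: float, now: float, cache_type: str) -> bool:
    """Guess whether an entry is still cached, given when it was last used."""
    return (now - last_used_at) < TTL_SECONDS[cache_type]


# Timestamps are seconds on an arbitrary clock (e.g. from time.time()).
print(is_cache_likely_warm(last_used_at=0, now=200, cache_type="automatic"))  # True
print(is_cache_likely_warm(last_used_at=0, now=400, cache_type="automatic"))  # False
print(is_cache_likely_warm(last_used_at=0, now=400, cache_type="explicit"))   # True
```

The usage object of the next response remains the only reliable confirmation of a hit.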

Supported Models

Model	Minimum tokens (text)	Minimum tokens (images)
`anthropic/claude-sonnet-4.6`	1,024	1,024
`anthropic/claude-opus-4.5`	1,024	1,024
`anthropic/claude-haiku-3.5`	2,048	2,048
`anthropic/claude-3-5-sonnet`	1,024	1,024
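The table can be turned into a quick pre-flight check. This sketch copies the text minimums into a dict and tests a token count against them; the token count itself must come from a tokenizer or a previous response's usage, and `is_cacheable` is a hypothetical helper name.

```python
# Minimum cacheable prompt length per model, copied from the table above.
MIN_CACHEABLE_TOKENS = {
    "anthropic/claude-sonnet-4.6": 1024,
    "anthropic/claude-opus-4.5": 1024,
    "anthropic/claude-haiku-3.5": 2048,
    "anthropic/claude-3-5-sonnet": 1024,
}


def is_cacheable(model: str, token_count: int) -> bool:
    """Return True when the prompt meets the model's minimum cacheable length."""
    if model not in MIN_CACHEABLE_TOKENS:
        raise KeyError(f"no minimum known for model {model!r}")
    return token_count >= MIN_CACHEABLE_TOKENS[model]


print(is_cacheable("anthropic/claude-sonnet-4.6", 1500))  # True
print(is_cacheable("anthropic/claude-haiku-3.5", 1500))   # False
```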

Explicit Caching Example

Use cache_control to control caching at the content-block level:

{
  "model": "claude-sonnet-4.6",
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with access to the following reference document:\n\n<document>...</document>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What does the document say about pricing?" }
  ],
  "max_tokens": 1024
}

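On the OpenAI-compatible endpoint, attach cache_control to the content block inside messages. The examples below use, in order, the OpenAI Python SDK, the OpenAI Node.js SDK, and the Anthropic SDK: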

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50  # aim for at least 1,024 tokens

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"Reference document:\n\n{long_document}",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
)

print(response.usage)
# Usage(prompt_tokens=1500, completion_tokens=80, total_tokens=1580,
#   prompt_tokens_details=PromptTokensDetails(cached_tokens=1024, cache_write_tokens=476))

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const longDocument = "<document content here>".repeat(50);

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: `Reference document:\n\n${longDocument}`,
          // @ts-ignore (cache_control is Anthropic-specific)
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "Summarize the key points." },
  ],
});

console.log(response.usage?.prompt_tokens_details);
// { cached_tokens: 1024, cache_write_tokens: 476 }

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.arouter.ai",
    api_key="lr_live_xxxx",
)

long_document = "<document content here>" * 50

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key points."}
    ],
)

print(response.usage)
# Usage(input_tokens=476, output_tokens=80,
#   cache_creation_input_tokens=1024, cache_read_input_tokens=0)

# Second call: reads from the cache instead of writing
response2 = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Reference document:\n\n{long_document}",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "What's the main topic?"}
    ],
)

print(response2.usage)
# Usage(input_tokens=476, output_tokens=40,
#   cache_creation_input_tokens=0, cache_read_input_tokens=1024)

DeepSeek Automatic Caching

DeepSeek caches prompt prefixes automatically, similar to OpenAI. No configuration is required.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# DeepSeek caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

# Check usage for cache hits
print(response.usage.prompt_tokens_details.cached_tokens)

xAI (Grok) Automatic Caching

Grok models cache prompt prefixes automatically when the same prefix is reused across requests. No special configuration is required.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Grok caches automatically on repeated calls with the same prefix
response = client.chat.completions.create(
    model="x-ai/grok-4.20",
    messages=[
        {"role": "system", "content": "<long system prompt>" * 100},
        {"role": "user", "content": "Answer the question."},
    ],
)

# Cache hits are reported in usage
print(response.usage.prompt_tokens_details)

Groq Automatic Caching

Groq's inference infrastructure caches prompt prefixes automatically for supported models. Cache hits reduce latency and are reported in the response usage object.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Groq caches automatically on repeated calls
response = client.chat.completions.create(
    model="groq/meta-llama/llama-4-maverick",
    messages=[
        {"role": "system", "content": "<long context>" * 100},
        {"role": "user", "content": "Analyze the above."},
    ],
)

print(response.usage.prompt_tokens_details.cached_tokens)

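Google Gemini Prompt Caching

Gemini supports both implicit (automatic) and explicit caching.

Implicit Caching

Gemini 2.5 Flash and Pro automatically cache large contexts at no additional cost. Cache hits are visible in the response usage.

Explicit Caching via the Native Gemini API

For fine-grained control, use the native Gemini cachedContents API. Create a cache object, then reference it in subsequent requests: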

{
  "model": "models/gemini-2.5-flash",
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "What are the key points in this document?"
        }
      ]
    }
  ],
  "cachedContent": "cachedContents/abc123"
}

Use the native Gemini endpoint through ARouter's provider proxy to work with cached content:

# Create the cached content
curl https://api.arouter.ai/google/v1beta/cachedContents \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "models/gemini-2.5-flash",
    "contents": [
      {
        "role": "user",
        "parts": [{"text": "<large document content>"}]
      }
    ],
    "ttl": "3600s"
  }'

The response includes a name field (for example, cachedContents/abc123) that you reference in subsequent requests:

curl https://api.arouter.ai/google/v1beta/models/gemini-2.5-flash:generateContent \
  -X POST \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"role": "user", "parts": [{"text": "Summarize"}]}],
    "cachedContent": "cachedContents/abc123"
  }'
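For callers who build these payloads in code rather than with curl, the two request bodies above can be sketched as plain Python dicts. The helper names are hypothetical and the snippet makes no network calls; it only mirrors the JSON shapes shown in the curl examples, with the `name` value taken from the example response.

```python
import json


def build_cache_request(model: str, document_text: str, ttl_seconds: int = 3600) -> dict:
    """Body for POST .../cachedContents, mirroring the first curl example."""
    return {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": document_text}]}],
        "ttl": f"{ttl_seconds}s",
    }


def build_generate_request(cache_name: str, question: str) -> dict:
    """Body for POST ...:generateContent that references an existing cache."""
    return {
        "contents": [{"role": "user", "parts": [{"text": question}]}],
        "cachedContent": cache_name,
    }


# Example response shape from the text above; a real value comes from the API.
create_response = {"name": "cachedContents/abc123"}

create_body = build_cache_request("models/gemini-2.5-flash", "<large document content>")
generate_body = build_generate_request(create_response["name"], "Summarize")
print(json.dumps(generate_body, indent=2))
```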

Cache usage appears in the response:

{
  "usageMetadata": {
    "promptTokenCount": 200,
    "cachedContentTokenCount": 1500,
    "candidatesTokenCount": 50,
    "totalTokenCount": 250
  }
}

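Provider Sticky Routing

To maximize the cache hit rate, repeated requests must reach the same provider instance. ARouter supports sticky routing for providers that need it. When a request contains an Anthropic cache_control block, ARouter automatically routes subsequent requests with the same prefix to the same provider endpoint so the cache stays valid.

How Sticky Routing Works

The first request with a cache_control block is processed and cached by the provider
ARouter records which provider instance handled the request
Subsequent requests with the same cache prefix are routed to the same instance
Cache hits lower cost (reads are cheaper than writes) and reduce latency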
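The steps above can be pictured with a toy router. This is not ARouter's implementation, which the text does not describe; it is a minimal sketch of the general idea, assuming the router keys on a hash of the cached prefix and remembers which instance served it first. The class, method, and instance names are all hypothetical.

```python
import hashlib


class StickyRouter:
    """Toy router: the same cached prefix always maps to the same instance."""

    def __init__(self, instances: list[str]) -> None:
        self.instances = list(instances)
        self.assignments: dict[str, str] = {}
        self._next_index = 0

    @staticmethod
    def prefix_key(cached_prefix: str) -> str:
        # Hash the prefix so the routing table does not store full prompts.
        return hashlib.sha256(cached_prefix.encode("utf-8")).hexdigest()

    def route(self, cached_prefix: str) -> str:
        key = self.prefix_key(cached_prefix)
        if key not in self.assignments:
            # First time this prefix is seen: pick an instance round-robin
            # and record it for later requests.
            chosen = self.instances[self._next_index % len(self.instances)]
            self.assignments[key] = chosen
            self._next_index += 1
        return self.assignments[key]


router = StickyRouter(["instance-a", "instance-b"])
first = router.route("Reference document: ...")
second = router.route("Reference document: ...")   # same prefix, same instance
other = router.route("A different system prompt")  # new prefix, next instance
print(first, second, other)  # instance-a instance-a instance-b
```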

Verifying Cache Hits

Inspect the usage object to confirm cache hits across requests:

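# Assumes `client` is configured as in the examples above and `long_doc` holds the document text
# First request: cache miss, content is written to the cache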
response1 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 1"},
    ],
)
# prompt_tokens_details.cache_write_tokens > 0
print(response1.usage.prompt_tokens_details)

# Second request: cache hit
response2 = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": "Question 2"},
    ],
)
# prompt_tokens_details.cached_tokens > 0 (cache hit)
print(response2.usage.prompt_tokens_details)
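The two checks in the comments can be folded into one helper for logging or tests. The sketch below works on a plain dict with the `cached_tokens` and `cache_write_tokens` fields described earlier; `classify_cache_event` is a hypothetical name, and a response that both reads and writes cache tokens is reported as a hit here by choice.

```python
def classify_cache_event(details: dict) -> str:
    """Label a response as 'hit', 'write', or 'none' from its token details."""
    cached = details.get("cached_tokens") or 0
    written = details.get("cache_write_tokens") or 0
    if cached > 0:
        return "hit"    # at least part of the prompt was read from the cache
    if written > 0:
        return "write"  # nothing was read, but the prefix was stored
    return "none"       # caching did not apply to this request


print(classify_cache_event({"cached_tokens": 0, "cache_write_tokens": 1024}))  # write
print(classify_cache_event({"cached_tokens": 1024, "cache_write_tokens": 0}))  # hit
print(classify_cache_event({}))                                                # none
```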

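Provider Cache Support Summary

Provider	Cache type	Minimum tokens	TTL	Configuration
OpenAI	Automatic	1,024	About 1 hour	None required
Anthropic	Automatic + explicit	1,024 (2,048 for `anthropic/claude-haiku-3.5`)	5 minutes (automatic), 1 hour (explicit)	`cache_control` blocks
DeepSeek	Automatic	1,024	Provider-defined	None required
Google Gemini	Automatic + explicit	32,768	1 hour by default	`cachedContents` API
xAI (Grok)	Automatic	Provider-defined	Provider-defined	None required
Groq	Automatic	Provider-defined	Provider-defined	None required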
