Context Compression

When a prompt exceeds a model’s context length, rather than failing the request, ARouter can automatically compress it using the context-compression plugin.

Enable context compression per-request by passing the plugin in the request body:
{
  "model": "anthropic/claude-sonnet-4.6",
  "messages": [...],
  "plugins": [{"id": "context-compression"}]
}
Or with the OpenAI SDK in TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.6",
  messages: veryLongConversation,
  // plugins is an ARouter extension, not part of the OpenAI SDK's request types
  // @ts-ignore
  plugins: [{ id: "context-compression" }],
});

How It Works

The plugin removes or truncates messages from the middle of the conversation until the prompt fits within the model’s context window. This strategy is based on research showing that LLMs pay less attention to the middle of long sequences: preserving the beginning of a conversation (system instructions, initial context) and its end (the most recent messages) generally produces better results than truncating from either end.

Compression steps:
  1. Check if total tokens (prompt + estimated completion) exceed the model’s context length
  2. If over limit: remove or truncate messages from the middle of messages[]
  3. Repeat until the prompt fits
  4. Forward the compressed prompt to the model
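The loop above can be sketched as follows. This is a minimal illustration, not ARouter's internals: the function names and the rough 4-characters-per-token estimate are assumptions.

```typescript
type Message = { role: string; content: string };

// Rough token estimate, assuming ~4 characters per token.
const estimateTokens = (msgs: Message[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Hypothetical sketch: remove messages from the middle until
// prompt + estimated completion fits the context window.
function compressMiddle(
  messages: Message[],
  contextLength: number,
  estimatedCompletion: number,
): Message[] {
  const msgs = [...messages];
  while (
    msgs.length > 2 &&
    estimateTokens(msgs) + estimatedCompletion > contextLength
  ) {
    // Drop the middle message; the start and end of the
    // conversation are always preserved.
    msgs.splice(Math.floor(msgs.length / 2), 1);
  }
  return msgs;
}
```

Note that because removal works inward from the middle, the system prompt and the most recent turns survive even under aggressive compression.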

Message Count Limits

Some models enforce a maximum number of messages per request, regardless of token count; Anthropic Claude models, for example, cap the message count. When this limit is exceeded with context compression enabled, the plugin keeps half of the allowed messages from the start of the conversation and half from the end.
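That half-and-half strategy can be sketched as below; the function name and `maxMessages` parameter are illustrative, not ARouter API.

```typescript
type Msg = { role: string; content: string };

// Hypothetical sketch: when a model's message-count cap is exceeded,
// keep half the allowed messages from the start and half from the end.
function capMessageCount(messages: Msg[], maxMessages: number): Msg[] {
  if (messages.length <= maxMessages) return messages;
  const head = Math.ceil(maxMessages / 2); // slots taken from the start
  const tail = maxMessages - head;         // remaining slots from the end
  return [
    ...messages.slice(0, head),
    ...messages.slice(messages.length - tail),
  ];
}
```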

Default Behavior for Small Context Models

Context compression is enabled by default for all models with a context length of 8,192 tokens or fewer. To explicitly disable compression for these models:
{
  "model": "some-small-context-model",
  "messages": [...],
  "plugins": [{"id": "context-compression", "enabled": false}]
}
With compression disabled, if your total tokens exceed the model’s context length, the request fails with an error suggesting that you reduce the input length or enable compression.

Model Selection with Compression

When context compression is active, ARouter first tries to find models whose context length is at least half of your total required tokens (input + estimated completion). For example, if your prompt requires 10,000 tokens total:
  • Models with at least 5,000 context length are considered
  • If no models meet this threshold, ARouter uses the model with the highest available context length
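A minimal sketch of this selection rule, assuming a simplified model shape (the `Model` type and function name are hypothetical):

```typescript
type Model = { id: string; contextLength: number };

// Hypothetical sketch of the selection rule: prefer models whose
// context length is at least half the total token requirement,
// otherwise fall back to the single largest-context model.
function selectCandidates(models: Model[], totalTokens: number): Model[] {
  const eligible = models.filter((m) => m.contextLength >= totalTokens / 2);
  if (eligible.length > 0) return eligible;
  return [
    models.reduce((a, b) => (b.contextLength > a.contextLength ? b : a)),
  ];
}
```

Under this rule, a 10,000-token request considers any model with at least a 5,000-token context, trusting compression to close the remaining gap.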

When to Use

Context compression is useful when:
  • You have long multi-turn conversations that grow over time
  • You’re processing documents that may occasionally exceed the context window
  • You want resilient behavior without manually managing context length
Context compression is not ideal when:
  • Perfect recall of all conversation history is required (e.g. document Q&A where any message may contain the answer)
  • You need deterministic behavior (compression is non-deterministic in which messages are removed)
For use cases requiring full context retention, consider models with larger context windows (see Model Variants :extended).

Combining with Other Plugins

Context compression can be combined with other plugins:
{
  "model": "openai/gpt-5.4:online",
  "messages": [...],
  "plugins": [
    {"id": "context-compression"},
    {"id": "web"}
  ]
}
See Plugins Overview for the complete list of available plugins.