Context Compression
Enable context compression per-request by passing the plugin in the request body.
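A minimal TypeScript sketch of such a request body. The endpoint URL, model name, and the exact `plugins` field shape are assumptions for illustration, not ARouter's confirmed API:

```typescript
// Hypothetical request body — field names are assumed for illustration.
const body = {
  model: "example/model",
  messages: [{ role: "user", content: "Hello" }],
  // Enable the context-compression plugin for this request only.
  plugins: [{ id: "context-compression" }],
};

// Sending it might look like this (endpoint and auth are placeholders):
// await fetch("https://api.example.com/v1/chat/completions", {
//   method: "POST",
//   headers: {
//     "Content-Type": "application/json",
//     Authorization: `Bearer ${API_KEY}`,
//   },
//   body: JSON.stringify(body),
// });
```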
How It Works
The plugin removes or truncates messages from the middle of the conversation until the prompt fits within the model’s context window. This strategy is based on research showing that LLMs pay less attention to the middle of long sequences. Preserving the beginning (system instructions, initial context) and the end (most recent messages) of a conversation generally produces better results than truncating from either end.

Compression steps:
- Check whether total tokens (prompt + estimated completion) exceed the model’s context length
- If over the limit: remove or truncate messages from the middle of messages[]
- Repeat until the prompt fits
- Forward the compressed prompt to the model
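The steps above can be sketched in TypeScript. The crude character-based token estimate and the one-message-at-a-time removal are simplifying assumptions, not ARouter's actual implementation:

```typescript
interface Message {
  role: string;
  content: string;
}

// Rough token estimate (~4 characters per token) — a stand-in for a real
// tokenizer, used here only to keep the sketch self-contained.
const estimateTokens = (messages: Message[]): number =>
  messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Remove messages from the middle until prompt + estimated completion
// fits within the model's context length, preserving both ends.
function compressMiddle(
  messages: Message[],
  contextLength: number,
  estimatedCompletion: number,
): Message[] {
  const result = [...messages];
  while (
    result.length > 2 &&
    estimateTokens(result) + estimatedCompletion > contextLength
  ) {
    // Drop the message closest to the middle of the conversation.
    result.splice(Math.floor(result.length / 2), 1);
  }
  return result;
}
```

Note that the earliest messages (system instructions) and the most recent ones survive compression, matching the rationale above.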
Message Count Limits
Some models enforce a maximum number of messages regardless of token count. For example, Anthropic Claude models have a maximum message count. When this limit is exceeded with context compression enabled, the plugin keeps half of the messages from the start and half from the end of the conversation.

Default Behavior for Small Context Models
All models with a context length of 8,192 tokens or fewer have context compression enabled by default. Compression can still be explicitly disabled for these models in the request body.

Model Selection with Compression
When context compression is active, ARouter first tries to find models whose context length is at least half of your total required tokens (input + estimated completion). For example, if your prompt requires 10,000 tokens total:
- Models with a context length of at least 5,000 tokens are considered
- If no models meet this threshold, ARouter uses the model with the highest available context length
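The selection rule can be sketched as follows. `ModelInfo` and `candidateModels` are illustrative names, not ARouter's API; how ARouter routes among the eligible models afterwards is out of scope here:

```typescript
interface ModelInfo {
  id: string;
  contextLength: number;
}

// Return the models considered for routing under compression: those with a
// context length of at least half the required tokens, or, if none qualify,
// just the model with the highest available context length.
function candidateModels(
  models: ModelInfo[],
  requiredTokens: number,
): ModelInfo[] {
  const eligible = models.filter(
    (m) => m.contextLength >= requiredTokens / 2,
  );
  if (eligible.length > 0) return eligible;
  // Fallback: single model with the largest context window.
  const largest = models.reduce((best, m) =>
    m.contextLength > best.contextLength ? m : best,
  );
  return [largest];
}
```

With the 10,000-token example above, a model with a 5,000-token context qualifies, while a pool of only 4,000-token models falls back to the largest one.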
When to Use
Context compression is useful when:
- You have long multi-turn conversations that grow over time
- You’re processing documents that may occasionally exceed the context window
- You want resilient behavior without manually managing context length

Avoid context compression when:
- Perfect recall of all conversation history is required (e.g. document Q&A where any message may contain the answer)
- You need deterministic behavior (compression is non-deterministic in which messages are removed)