Minimal Overhead
ARouter adds minimal latency through:
- Edge computing: Gateway nodes are deployed globally to stay as close as possible to your application
- Efficient caching: User credentials and API key data are cached at the edge to avoid database roundtrips on every request
- Optimized routing: Provider selection and key pool lookup are designed to complete in single-digit milliseconds
Performance Considerations
Cache Warming
When edge caches are cold (typically during the first 1–2 minutes after a deployment or in a new region), you may experience slightly higher latency as caches warm up. This normalizes quickly.

Credit Balance Checks
To maintain accurate billing and prevent overages, ARouter performs additional database checks when:
- A user’s credit balance is in single-digit dollars
- An API key is approaching its configured credit limit
These checks can add latency to affected requests. To avoid them:
- Maintain a healthy credit balance (recommended minimum: $10–20)
- Set up auto-topup or periodic billing alerts
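A minimal sketch of a balance check that could drive an alert or auto-topup. The endpoint URL and response shape below are assumptions for illustration, not a confirmed ARouter API; the $10 threshold comes from the recommendation above.

```python
import json
import urllib.request

MIN_BALANCE = 10.0  # lower end of the recommended $10-20 range

def needs_topup(balance: float, minimum: float = MIN_BALANCE) -> bool:
    """True when the balance has fallen below the recommended minimum."""
    return balance < minimum

def fetch_balance(api_key: str) -> float:
    # Hypothetical credits endpoint -- check ARouter's API reference
    # for the real route and response fields.
    req = urllib.request.Request(
        "https://arouter.example/api/v1/credits",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return float(json.load(resp)["balance"])

# Usage: if needs_topup(fetch_balance(key)), trigger a topup or alert.
```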
Multi-Model Routing Latency
When using ordered candidate model lists, if the first candidate is unavailable, ARouter routes to the next model. A failed first attempt adds latency to that request. ARouter tracks provider health continuously and routes around known-unavailable providers to minimize how often this occurs.

Best Practices
1. Use Streaming
For user-facing applications, use streaming responses to reduce perceived latency. The first token arrives sooner than the full response, making the application feel faster even if total generation time is the same. See Streaming.

2. Use Prompt Caching
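A sketch of the request body that opts into streaming, assuming an OpenAI-compatible chat-completions payload (an assumption; consult ARouter's API reference for the exact schema):

```python
def build_streaming_request(model: str, prompt: str) -> dict:
    # stream=True asks the gateway to send tokens incrementally (e.g. via
    # server-sent events) instead of waiting for the full completion.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

# Usage: POST this as JSON, then read chunks from the response as they
# arrive rather than buffering the whole body.
```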
For requests with repetitive prefixes (system prompts, few-shot examples, large documents), enable prompt caching. Cached tokens are served at significantly lower latency and reduced cost. See Prompt Caching.

3. Choose the Right Model
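The key to benefiting from prompt caching is keeping the repetitive prefix identical across requests, with only the trailing content varying. A minimal sketch, assuming the common chat-completions message format (the exact caching mechanics depend on the provider):

```python
# Keep the long, static prefix in one constant so every request sends
# byte-identical cacheable content.
SYSTEM_PROMPT = "You are a support assistant. Follow the policy carefully."

def build_messages(user_input: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # cacheable prefix
        {"role": "user", "content": user_input},       # varying suffix
    ]
```

Rebuilding or reordering the prefix per request (e.g., injecting a timestamp into the system prompt) defeats the cache.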
Smaller models are faster. If your use case doesn’t require the highest capability, a smaller model (e.g., google/gemini-2.5-flash, anthropic/claude-haiku-4-5) can reduce latency by 2–5x compared to the largest variants.
4. Use :nitro Provider Variants
For latency-critical workloads, append :nitro to a model ID to prefer high-throughput provider endpoints:
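A small helper sketching the suffix convention (the helper name is ours; the :nitro suffix itself comes from the docs above):

```python
def with_nitro(model_id: str) -> str:
    """Append the :nitro suffix unless it is already present."""
    return model_id if model_id.endswith(":nitro") else model_id + ":nitro"

# with_nitro("google/gemini-2.5-flash") -> "google/gemini-2.5-flash:nitro"
```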
:nitro routes to the provider configuration optimized for maximum throughput and lowest time-to-first-token. See Provider Routing for details.
5. Maintain a Healthy Credit Balance
Keeping your credit balance above a reasonable threshold (≥$10) prevents aggressive cache invalidation during billing checks, which can add measurable latency.

Measuring Performance
Use the x-response-time response header (if present) or measure round-trip time in your client to benchmark latency. ARouter also surfaces per-request latency data in your Activity feed.
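A minimal client-side timing helper; the HTTP usage in the comments is an outline only, and the x-response-time header should be treated as optional since the gateway may not always set it:

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Usage outline with any HTTP client:
#   resp, ms = timed(lambda: client.post(url, json=payload))
#   server_ms = resp.headers.get("x-response-time")  # may be absent
# Comparing ms (round trip) with server_ms isolates network overhead.
```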