ARouter is designed with performance as a top priority. The gateway is heavily optimized to add as little overhead as possible to your requests.

Minimal Overhead

ARouter adds minimal latency through:
  • Edge computing: Gateway nodes are deployed globally to stay as close as possible to your application
  • Efficient caching: User credentials and API key data are cached at the edge to avoid database roundtrips on every request
  • Optimized routing: Provider selection and key pool lookup are designed to complete in single-digit milliseconds
The gateway overhead on a typical request is well under 50ms.

Performance Considerations

Cache Warming

When edge caches are cold (typically during the first 1–2 minutes after a deployment or in a new region), you may experience slightly higher latency as caches warm up. This normalizes quickly.

Credit Balance Checks

To maintain accurate billing and prevent overages, ARouter performs additional database checks when:
  • A user’s credit balance is in single-digit dollars
  • An API key is approaching its configured credit limit
Under these conditions, caches are invalidated more aggressively, which increases latency until more credits are added. To avoid this:
  • Maintain a healthy credit balance (recommended minimum: $10–20)
  • Set up auto-topup or periodic billing alerts
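A client-side heuristic for deciding when to alert or top up might look like the sketch below. The thresholds are illustrative assumptions, not ARouter's internal values:

```python
def low_balance(balance_usd, key_limit_usd=None, key_spend_usd=0.0,
                min_balance=10.0, limit_headroom=0.10):
    """Return True when extra billing checks (and cache invalidation) are likely.

    All thresholds here are illustrative, not ARouter's actual values.
    """
    # A single-digit account balance triggers additional database checks.
    if balance_usd < min_balance:
        return True
    # So does an API key approaching its configured credit limit.
    if key_limit_usd is not None:
        remaining = key_limit_usd - key_spend_usd
        if remaining <= key_limit_usd * limit_headroom:
            return True
    return False
```

Wiring this into a periodic job or billing alert keeps requests out of the aggressive-invalidation zone.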

Multi-Model Routing Latency

When using ordered candidate model lists, if the first candidate is unavailable, ARouter routes to the next model. A failed first attempt adds latency to that request. ARouter tracks provider health continuously and routes around known-unavailable providers to minimize how often this occurs.
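ARouter performs this fallback server-side; conceptually it is equivalent to the following client-side sketch, where `send` is a hypothetical transport function (not part of the ARouter API) that raises on provider unavailability:

```python
def complete_with_fallback(send, candidate_models, prompt):
    """Try each candidate model in order until one succeeds.

    `send(model, prompt) -> str` is a stand-in for one gateway request;
    it raises RuntimeError when the provider is unavailable.
    """
    last_error = None
    for model in candidate_models:
        try:
            return model, send(model, prompt)
        except RuntimeError as err:
            # Each failed attempt adds latency to the request before
            # the next candidate is tried.
            last_error = err
    raise last_error
```

Because provider health is tracked continuously, the first candidate usually succeeds and the loop exits on the first iteration.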

Best Practices

1. Use Streaming

For user-facing applications, use streaming responses to reduce perceived latency. The first token arrives sooner than the full response, making the application feel faster even if total generation time is the same. See Streaming.
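Streamed responses arrive as server-sent events; a minimal parser, assuming an OpenAI-style `data:` payload shape (the actual wire format may differ), looks like this:

```python
import json

def iter_tokens(sse_lines):
    """Yield text deltas from an OpenAI-style SSE stream (assumed format)."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        event = json.loads(data)
        delta = event["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta
```

The first yielded token can be rendered immediately, which is what reduces perceived latency.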

2. Use Prompt Caching

For requests with repetitive prefixes (system prompts, few-shot examples, large documents), enable prompt caching. Cached tokens are served at significantly lower latency and reduced cost. See Prompt Caching.
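As a sketch, a cacheable prefix can be marked with an Anthropic-style `cache_control` block; this structure is an assumption about the request format, so check the Prompt Caching page for the exact shape ARouter expects:

```python
# Hypothetical request body: the large, repeated system prompt is marked
# cacheable so later requests can reuse it at lower latency and cost.
payload = {
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a support assistant. (large shared instructions...)",
                    "cache_control": {"type": "ephemeral"},  # marks the prefix as cacheable
                }
            ],
        },
        {"role": "user", "content": "Where is my order?"},
    ],
}
```

Only the marked prefix is cached; the trailing user message still varies per request.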

3. Choose the Right Model

Smaller models are faster. If your use case doesn’t require the highest capability, a smaller model (e.g., google/gemini-2.5-flash, anthropic/claude-haiku-4-5) can reduce latency by 2–5x compared to the largest variants.

4. Use :nitro Provider Variants

For latency-critical workloads, append :nitro to a model ID to prefer high-throughput provider endpoints:
{ "model": "anthropic/claude-sonnet-4-6:nitro" }
:nitro routes to the provider configuration optimized for maximum throughput and lowest time-to-first-token. See Provider Routing for details.
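When the model ID comes from configuration, the suffix can be appended programmatically; a trivial helper (not part of any ARouter SDK) might be:

```python
def with_nitro(model_id):
    """Append the :nitro variant suffix unless it is already present."""
    return model_id if model_id.endswith(":nitro") else model_id + ":nitro"
```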

5. Maintain a Healthy Credit Balance

Keeping your credit balance above a reasonable threshold (≥$10) prevents aggressive cache invalidation during billing checks, which can add measurable latency.

Measuring Performance

Use the x-response-time response header (if present) or measure round-trip time in your client to benchmark. ARouter also surfaces per-request latency data in your Activity feed.
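A minimal client-side benchmark of round-trip time, where `call` stands in for one full request to the gateway:

```python
import time
import statistics

def benchmark(call, n=20):
    """Measure client-side round-trip time over n calls, in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()  # one full request/response cycle
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[max(0, int(len(samples) * 0.95) - 1)],
    }
```

Comparing these client-side numbers against the `x-response-time` header (when present) separates network latency from gateway and provider time.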