Some models perform internal chain-of-thought reasoning before generating a final response. These reasoning steps consume tokens — called reasoning tokens — which affect cost and latency but are not visible to the user by default.

Supported Models

Model                         Reasoning Support
openai/o4-mini                Always-on reasoning
openai/o3                     Always-on reasoning
anthropic/claude-sonnet-4-6   Optional extended thinking
anthropic/claude-opus-4-6     Optional extended thinking
deepseek/deepseek-r1          Always-on reasoning
google/gemini-2.5-pro         Optional thinking mode
google/gemini-2.5-flash       Optional thinking mode

How Reasoning Tokens Appear in Usage

Reasoning tokens are reported in the usage object as part of completion_tokens_details:
{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 520,
    "total_tokens": 670,
    "completion_tokens_details": {
      "reasoning_tokens": 400
    }
  }
}
In this example, 400 of the 520 completion tokens were used for internal reasoning. Only the remaining 120 tokens appear in the visible response.
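The arithmetic above can be done directly on the usage object. A minimal sketch, using the example payload; the `.get()` fallbacks are a defensive assumption for models that omit `completion_tokens_details`:

```python
# Usage payload from the example above
usage = {
    "prompt_tokens": 150,
    "completion_tokens": 520,
    "total_tokens": 670,
    "completion_tokens_details": {"reasoning_tokens": 400},
}

# Fall back to 0 if the provider does not report a reasoning breakdown
reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)

# Tokens that actually appear in the visible response
visible = usage["completion_tokens"] - reasoning
print(visible)  # 120
```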

Billing for Reasoning Tokens

Reasoning tokens are billed at the completion token rate for that model. They are included in completion_tokens for billing purposes — the breakdown is informational. ARouter passes through the upstream provider’s reasoning token counts without modification.
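Because reasoning tokens are already counted inside completion_tokens, cost is computed from the top-level totals only. A sketch with illustrative per-token rates (the rates below are assumptions, not real pricing; check each model's pricing page):

```python
# Illustrative rates only (assumed for this example)
PROMPT_RATE = 1.10 / 1_000_000      # $ per prompt token
COMPLETION_RATE = 4.40 / 1_000_000  # $ per completion token

usage = {"prompt_tokens": 150, "completion_tokens": 520}

# Reasoning tokens are inside completion_tokens, so they are
# billed once at the completion rate; do not add them again.
cost = (usage["prompt_tokens"] * PROMPT_RATE
        + usage["completion_tokens"] * COMPLETION_RATE)
print(f"${cost:.6f}")
```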

Controlling Reasoning Behavior

OpenAI o-series (o4-mini, o3)

Reasoning is always on for o-series models. Use reasoning_effort to control how much reasoning the model does:
{
  "model": "openai/o4-mini",
  "reasoning_effort": "high",
  "messages": [...]
}
Valid values: "low", "medium", "high". Higher effort allows the model to spend more reasoning tokens, which generally improves answer quality but increases both cost and latency.

Anthropic Extended Thinking

Enable extended thinking by passing thinking in your request:
import anthropic

client = anthropic.Anthropic(
    api_key="your-arouter-key",
    base_url="https://api.arouter.ai/anthropic",
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,
    },
    messages=[{"role": "user", "content": "Solve this step by step: ..."}],
)
budget_tokens caps how many tokens can be used for thinking. The thinking content is returned as a separate block in the response.
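Separating the thinking block from the final answer is straightforward. A sketch using plain dicts shaped like Messages API content blocks (the block contents here are invented for illustration):

```python
# Simulated response.content, shaped like the Messages API
# returns it when extended thinking is enabled (illustrative data)
content = [
    {"type": "thinking", "thinking": "First, break the problem into cases..."},
    {"type": "text", "text": "The answer is 42."},
]

# Collect each block type separately
thinking_parts = [b["thinking"] for b in content if b["type"] == "thinking"]
answer_parts = [b["text"] for b in content if b["type"] == "text"]

print("".join(answer_parts))
```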

DeepSeek R1

Reasoning is always on for DeepSeek R1. The model returns a reasoning_content field alongside the regular content:
from openai import OpenAI

client = OpenAI(
    api_key="your-arouter-key",
    base_url="https://api.arouter.ai/v1",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
)

# Reasoning content (if exposed by provider)
print(response.choices[0].message.reasoning_content)
# Final answer
print(response.choices[0].message.content)

Google Gemini Thinking

Enable thinking for Gemini 2.5 models by passing a thinking object through extra_body (the OpenAI SDK forwards extra_body fields verbatim in the request):
response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[...],
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 5000
        }
    }
)

Activity Export and Reasoning Tokens

The Activity Export includes a breakdown of reasoning tokens, so you can accurately track their contribution to total costs. Reasoning tokens are included in completion tokens in the export summary.

Best Practices

  • Start with "low" or "medium" effort for o-series models unless you need maximum reasoning quality. This reduces cost and latency significantly.
  • Set a budget_tokens cap for Anthropic and Gemini thinking models to avoid unexpectedly large bills on complex queries.
  • Monitor reasoning token ratios in your activity feed. A high ratio of reasoning to output tokens is normal for complex tasks but may indicate the model is overthinking simple queries.
  • Don’t disable reasoning to save costs on tasks that genuinely require multi-step reasoning — output quality degrades significantly.
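The ratio check from the third bullet can be sketched as a small helper over the usage object (the helper name is ours, not part of any API):

```python
def reasoning_ratio(usage: dict) -> float:
    """Fraction of completion tokens spent on internal reasoning."""
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return reasoning / completion if completion else 0.0

# From the earlier example: 400 reasoning tokens out of 520 completion tokens
ratio = reasoning_ratio({
    "completion_tokens": 520,
    "completion_tokens_details": {"reasoning_tokens": 400},
})
print(f"{ratio:.0%}")  # 77%
```

A persistently high ratio on simple prompts is a signal to lower reasoning_effort or tighten budget_tokens.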