ARouter supports multimodal inputs and outputs — you can send images, PDFs, and audio alongside text messages, and receive images or spoken audio as output.

Supported Modalities

Modality | Direction | Notes
Text | Input + Output | All models
Images (URL / base64) | Input | Vision models: JPEG, PNG, GIF, WebP
PDFs (base64) | Input | Anthropic Claude, Google Gemini
Audio (base64) | Input | Multimodal audio models
Image generation | Output | DALL-E 3, Flux, Stable Diffusion
Audio output (TTS / spoken) | Output | TTS models, audio chat models

Use GET /v1/models with query parameters to discover models supporting specific modalities:
# Models that accept image input
GET /v1/models?supported_parameters=vision

# Models that output images
GET /v1/models?output_modalities=image

# Models that output audio
GET /v1/models?output_modalities=audio
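
The same queries can be assembled programmatically. The sketch below only builds the request URL; `models_url` is a hypothetical helper name, the base URL is the one used in this page's SDK examples, and the response shape is not shown here, so sending the request and parsing the result is left out.

```python
from urllib.parse import urlencode

# Base URL taken from the SDK examples on this page.
BASE_URL = "https://api.arouter.ai/v1"


def models_url(base_url: str = BASE_URL, **filters: str) -> str:
    """Build a GET /v1/models URL, appending any query filters given."""
    query = urlencode(filters)
    return f"{base_url}/models?{query}" if query else f"{base_url}/models"


# Filters mirror the three queries listed above.
print(models_url(supported_parameters="vision"))
print(models_url(output_modalities="image"))
print(models_url(output_modalities="audio"))
```

The resulting URL can be passed to any HTTP client together with your API key.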

Images

Using an Image URL

Pass a publicly accessible image URL in the image_url content part:
{
  "model": "openai/gpt-5.4",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
          }
        }
      ]
    }
  ]
}

Using Base64-Encoded Images

For private images or when you don’t have a public URL, encode the image as base64:
{
  "model": "openai/gpt-5.4",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAA..."
          }
        }
      ]
    }
  ]
}
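
A data URL like the one above can be produced from raw file bytes. This is a minimal sketch, not part of the ARouter API: `to_data_url` is a hypothetical helper that guesses the MIME type from the file name with the standard library and refuses anything that is not recognized as an image.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode image bytes as a data URL for the image_url content part."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        # Fail early rather than send a mislabeled payload.
        raise ValueError(f"cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Example with placeholder bytes; real usage would read the file in "rb" mode.
print(to_data_url(b"abc", "photo.png"))
```

Guessing the type from the file name keeps the `data:` prefix consistent with the actual format instead of hard-coding `image/jpeg`.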

Image Detail Level

Use the detail parameter to control resolution. Higher detail costs more tokens:
Value | Description
auto (default) | Provider decides based on image size
low | Faster, cheaper: 85 tokens, image resized to 512×512
high | Full resolution: tiles the image, uses more tokens
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.jpg",
    "detail": "high"
  }
}
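
When building requests in code, it can help to validate the detail value before sending. The following is a small sketch under that assumption; `image_part` and `VALID_DETAILS` are hypothetical names, and the accepted values are the three listed in the table above.

```python
# The three detail values documented above.
VALID_DETAILS = ("auto", "low", "high")


def image_part(url: str, detail: str = "auto") -> dict:
    """Return an image_url content part, rejecting unknown detail values."""
    if detail not in VALID_DETAILS:
        raise ValueError(f"detail must be one of {VALID_DETAILS}, got {detail!r}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


part = image_part("https://example.com/image.jpg", detail="high")
print(part)
```

The returned dictionary can be placed directly in a message's `content` list next to a text part.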

Full Example — Vision

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

# Option 1: Image URL
response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "auto",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

# Option 2: Base64 image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-5.4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

PDFs

Some models can process PDF documents directly. PDFs are passed as base64-encoded content.

Anthropic Claude — PDF Support

import base64
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.arouter.ai",
    api_key="lr_live_xxxx",
)

with open("document.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    },
                },
                {"type": "text", "text": "Summarize the key points of this document."},
            ],
        }
    ],
)
print(response.content[0].text)

Google Gemini — PDF Support

import base64
import google.generativeai as genai

genai.configure(
    api_key="lr_live_xxxx",
    transport="rest",
    client_options={"api_endpoint": "https://api.arouter.ai"},
)

with open("document.pdf", "rb") as f:
    pdf_data = base64.b64encode(f.read()).decode("utf-8")

model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "application/pdf",
            "data": pdf_data,
        }
    },
    "Summarize the key points of this document.",
])
print(response.text)

Model Compatibility

Model | Image URL | Image Base64 | PDF | Audio Input
openai/gpt-5.4
openai/gpt-5.4-pro
openai/gpt-5.4-audio-preview
anthropic/claude-sonnet-4.6
anthropic/claude-opus-4.5
google/gemini-2.5-flash
google/gemini-2.5-pro

Use GET /v1/models to query the latest capability information.

Input Format Support

Format | When to Use
Image URL | Public images accessible on the internet
Image base64 | Private images, local files, or when a URL is not available
PDF base64 | Document analysis (Claude and Gemini only)
Audio base64 | Voice input for audio chat models
Image tokens count toward the prompt token limit. Large, high-resolution images with detail: "high" can consume significantly more tokens than text. Always check usage.prompt_tokens to monitor consumption.
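
One way to keep an eye on this is to compare `usage.prompt_tokens` against a budget after each call. The sketch below assumes an OpenAI-style response object with a `usage.prompt_tokens` attribute, as returned by the vision example earlier; `check_prompt_tokens` and the budget value are hypothetical, and a stand-in object replaces a real API response so the snippet runs offline.

```python
from types import SimpleNamespace


def check_prompt_tokens(response, budget: int) -> bool:
    """Return True if the prompt stayed within budget, otherwise warn and return False."""
    used = response.usage.prompt_tokens
    if used > budget:
        print(f"warning: prompt used {used} tokens, budget is {budget}")
        return False
    return True


# Stand-in for a real chat completion response (illustrative numbers only).
fake_response = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1200))
within_budget = check_prompt_tokens(fake_response, budget=1000)
```

If a request regularly exceeds the budget, switching the image to `detail: "low"` or downscaling it before upload are the levers this page describes.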

Other Modalities

For dedicated audio and image generation documentation:
  • Audio — Speech-to-text, text-to-speech, and audio chat models
  • Image Generation — Generate images from text prompts using DALL-E, Flux, and more