ARouter provides comprehensive audio support across three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and produce spoken output).
Audio Transcription
Transcribe audio files to text using the OpenAI-compatible /v1/audio/transcriptions endpoint.
```bash
curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text" the SDK returns the transcript as a plain string
print(transcription)
```
```javascript
import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const transcription = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3",
  file: fs.createReadStream("audio.mp3"),
  response_format: "text",
});

// With response_format "text" the SDK returns the transcript as a plain string
console.log(transcription);
```
```bash
curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3" \
  -F response_format="text"
```
Transcription Parameters
| Parameter | Type | Description |
|---|---|---|
| file | file | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Model ID, e.g. openai/whisper-large-v3 |
| language | string | BCP-47 language code (e.g. "en", "zh"). Improves accuracy when specified. |
| prompt | string | Optional text to guide transcription style or provide vocabulary hints |
| response_format | string | Output format: json (default), text, srt, verbose_json, vtt |
| temperature | number | Sampling temperature 0–1. Higher values increase randomness. |
| timestamp_granularities | string[] | ["word"] or ["segment"] for timestamped output (requires verbose_json) |
Word-Level Timestamps
```python
transcription = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"],
)

for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
```
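Word timestamps can be post-processed into caption formats. A minimal sketch that groups word entries into SRT captions (the start/end/word fields follow the verbose_json word entries; the fixed words-per-caption grouping is our own choice, not part of the API):

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, per_caption=7):
    """Group word timestamps into SRT caption blocks of up to
    per_caption words. Each item in words is a dict with start,
    end (seconds) and word keys, as in the verbose_json response."""
    blocks = []
    for i in range(0, len(words), per_caption):
        group = words[i:i + per_caption]
        text = " ".join(w["word"] for w in group)
        blocks.append(
            f"{len(blocks) + 1}\n"
            f"{srt_time(group[0]['start'])} --> {srt_time(group[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)
```

For simple cases, requesting response_format="srt" directly avoids this step entirely; the helper is only useful when you need custom caption grouping.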
Audio Translation
Translate audio from any language into English text:
```python
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text" the SDK returns the translation as a plain string
print(translation)
```
```bash
curl https://api.arouter.ai/v1/audio/translations \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@foreign_audio.mp3" \
  -F model="openai/whisper-large-v3"
```
Text-to-Speech
Convert text to natural-sounding speech:
```python
with client.audio.speech.with_streaming_response.create(
    model="openai/tts-1-hd",
    voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway.",
) as response:
    response.stream_to_file("output.mp3")
```
```javascript
import fs from "fs";

const response = await client.audio.speech.create({
  model: "openai/tts-1-hd",
  voice: "nova",
  input: "Hello! Welcome to ARouter, the universal AI gateway.",
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);
```
```bash
curl https://api.arouter.ai/v1/audio/speech \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1-hd",
    "input": "Hello! Welcome to ARouter.",
    "voice": "nova"
  }' \
  --output output.mp3
```
TTS Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | TTS model, e.g. openai/tts-1 or openai/tts-1-hd |
| input | string | Text to synthesize. Maximum 4,096 characters. |
| voice | string | Voice to use: alloy, echo, fable, onyx, nova, shimmer |
| response_format | string | Audio format: mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | Playback speed from 0.25 to 4.0 (default 1.0) |
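Because input is capped at 4,096 characters, longer documents must be synthesized in pieces. A rough sketch of sentence-boundary chunking (the limit comes from the table above; the splitting strategy is just one possible approach):

```python
import re


def chunk_text(text, limit=4096):
    """Split text into chunks of at most limit characters,
    preferring to break after sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split any single sentence longer than the limit.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to a separate speech request and the resulting audio files concatenated in order.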
Available Voices
| Voice | Character |
|---|---|
| alloy | Neutral, balanced |
| echo | Softer, reflective |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Friendly, energetic |
| shimmer | Warm, gentle |
Audio Chat (Multimodal Models)
Some models accept audio directly as a chat message input and can respond with spoken audio. Use the standard chat completions endpoint with input_audio content parts.
Send audio as a content part in a chat message:
```json
{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
```
```python
import base64

with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav",
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
```javascript
import fs from "fs";

const audioData = fs.readFileSync("question.wav").toString("base64");

const response = await client.chat.completions.create({
  model: "openai/gpt-5.4-audio-preview",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "input_audio",
          input_audio: {
            data: audioData,
            format: "wav",
          },
        },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```
Supported Input Formats
| Format | MIME Type |
|---|---|
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| flac | audio/flac |
| m4a | audio/m4a |
| webm | audio/webm |
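When accepting arbitrary uploads, it can help to derive the format field from the file extension and fail early on unsupported types. A small sketch based on the format table above (the helper name is ours, not part of any SDK):

```python
import base64
from pathlib import Path

# Input formats from the table above.
SUPPORTED_INPUT_FORMATS = {"wav", "mp3", "ogg", "flac", "m4a", "webm"}


def input_audio_part(path):
    """Build an input_audio content part from a local file,
    inferring the format field from the file extension."""
    ext = Path(path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": ext}}
```

The returned dict can be dropped straight into a message's content list, as in the examples above.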
Audio Output
Request spoken audio as part of the model response:
```json
{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "nova",
    "format": "mp3"
  },
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
```
The response includes an audio field with base64-encoded audio:
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "audio": {
          "id": "audio_abc123",
          "data": "<base64-encoded-mp3>",
          "expires_at": 1234567890,
          "transcript": "Why don't scientists trust atoms? Because they make up everything!"
        }
      }
    }
  ]
}
```
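The data field is plain base64, so saving the audio is a matter of decoding it. A sketch that works on the raw JSON shape shown above (field names are taken from that example response):

```python
import base64


def save_audio_response(response_json, path):
    """Decode the base64 audio from a chat completion response dict,
    write it to path, and return the transcript."""
    audio = response_json["choices"][0]["message"]["audio"]
    with open(path, "wb") as f:
        f.write(base64.b64decode(audio["data"]))
    return audio["transcript"]
```

When using the Python SDK object rather than raw JSON, the same fields should be reachable as response.choices[0].message.audio.data and .transcript.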
Streaming Audio Output
Audio output can be streamed for real-time playback:
```python
stream = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "pcm16"},
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    audio = getattr(delta, "audio", None)
    if audio and audio.get("data"):
        # audio["data"] is a base64-encoded PCM chunk;
        # play_audio_chunk is a placeholder for your playback code
        play_audio_chunk(audio["data"])
```
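pcm16 deltas are raw 16-bit samples, so instead of playing them live you can buffer them and wrap the result in a WAV header with the standard library. A sketch (the 24 kHz mono parameters are an assumption; confirm the rate and channel count for the model you use):

```python
import base64
import wave


def pcm_chunks_to_wav(b64_chunks, path, sample_rate=24000):
    """Concatenate base64-encoded PCM16 chunks and write a mono WAV
    file. Returns the number of PCM bytes written. sample_rate=24000
    is an assumption, not confirmed by this page."""
    pcm = b"".join(base64.b64decode(c) for c in b64_chunks)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return len(pcm)
```

Collect each delta's base64 audio data into a list while streaming, then call this once the stream ends.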
Supported Models
Speech-to-Text
| Model | Languages | Notes |
|---|---|---|
| openai/whisper-large-v3 | 99+ | Best accuracy |
| openai/whisper-large-v3-turbo | 99+ | Faster, lower cost |
Text-to-Speech
| Model | Quality | Latency |
|---|---|---|
| openai/tts-1 | Standard | Low |
| openai/tts-1-hd | High | Medium |
Audio Chat
Use GET /v1/models?output_modalities=audio to discover models supporting audio output.
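The same filter can be applied client-side to an unfiltered model list. A sketch under the assumption that each entry in the /v1/models response carries an output_modalities list mirroring the query parameter (the field name is not confirmed by this page):

```python
def audio_output_models(models_payload):
    """Return IDs of models whose output_modalities include "audio",
    given a parsed /v1/models response dict. Assumes each model entry
    exposes an output_modalities list (an assumption, see above)."""
    return [
        model["id"]
        for model in models_payload.get("data", [])
        if "audio" in model.get("output_modalities", [])
    ]
```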
Token Pricing
Audio tokens are tracked separately in usage.prompt_tokens_details and usage.completion_tokens_details:
```json
{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {
      "audio_tokens": 100,
      "cached_tokens": 0
    },
    "completion_tokens": 50,
    "completion_tokens_details": {
      "audio_tokens": 30
    }
  }
}
```
Audio tokens are priced differently from text tokens. Check usage.cost in the response for the actual charge for each request.
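Since the totals include both kinds of tokens, splitting them out is simple arithmetic. A helper for breaking a usage block into audio and text counts (field names follow the example payload above):

```python
def token_breakdown(usage):
    """Split prompt and completion token counts into audio vs. text,
    using the *_tokens_details fields from the usage payload."""
    prompt_audio = usage.get("prompt_tokens_details", {}).get("audio_tokens", 0)
    completion_audio = usage.get("completion_tokens_details", {}).get("audio_tokens", 0)
    return {
        "prompt_text": usage["prompt_tokens"] - prompt_audio,
        "prompt_audio": prompt_audio,
        "completion_text": usage["completion_tokens"] - completion_audio,
        "completion_audio": completion_audio,
    }
```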