ARouter provides comprehensive audio support across three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and produce spoken output).

Audio Transcription

Transcribe audio files to text using the OpenAI-compatible /v1/audio/transcriptions endpoint.
curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text", the SDK returns a plain string
print(transcription)

Transcription Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| file | file | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Model ID, e.g. openai/whisper-large-v3 |
| language | string | BCP-47 language code (e.g. "en", "zh"). Improves accuracy when specified. |
| prompt | string | Optional text to guide transcription style or provide vocabulary hints |
| response_format | string | Output format: json (default), text, srt, verbose_json, vtt |
| temperature | number | Sampling temperature 0–1. Higher values increase randomness. |
| timestamp_granularities | string[] | ["word"] or ["segment"] for timestamped output (requires verbose_json) |
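
For example, a language hint and a vocabulary prompt can be combined in one request. A sketch wrapping the call in a helper (the prompt text, temperature, and helper name are illustrative choices, not requirements):

```python
def transcribe_with_hints(client, path: str, language: str = "en"):
    """Transcribe a local audio file with a language hint and vocabulary prompt."""
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            model="openai/whisper-large-v3",
            file=f,
            language=language,          # BCP-47 code improves accuracy
            prompt="ARouter, Whisper",  # hints for proper nouns and jargon
            response_format="verbose_json",
            temperature=0,              # deterministic decoding
        )
```

Usage: `result = transcribe_with_hints(client, "audio.mp3")`.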

Word-Level Timestamps

transcription = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"],
)

for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
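
The API can return SRT directly via response_format="srt", but word-level timestamps let you build custom captions, for example grouping a fixed number of words per cue. A sketch (the grouping size and helper names are illustrative):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, per_cue: int = 7) -> str:
    """Group word timings into SRT cues of per_cue words each."""
    cues = []
    for i in range(0, len(words), per_cue):
        group = words[i : i + per_cue]
        start, end = group[0].start, group[-1].end
        text = " ".join(w.word for w in group)
        cues.append(
            f"{i // per_cue + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues)
```

Usage: `print(words_to_srt(transcription.words))`.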

Audio Translation

Translate audio from any language into English text:
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text", the SDK returns a plain string
print(translation)

Text-to-Speech

Convert text to natural-sounding speech:
response = client.audio.speech.create(
    model="openai/tts-1-hd",
    voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway.",
)

response.write_to_file("output.mp3")

TTS Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | TTS model, e.g. openai/tts-1 or openai/tts-1-hd |
| input | string | Text to synthesize. Maximum 4,096 characters. |
| voice | string | Voice to use: alloy, echo, fable, onyx, nova, shimmer |
| response_format | string | Audio format: mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | Playback speed from 0.25 to 4.0 (default 1.0) |
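
For instance, a request can combine a non-default format and speed. A sketch (the voice, format, and speed are illustrative choices; `write_to_file` saves the binary response to disk):

```python
def synthesize(client, text: str, path: str = "speech.opus"):
    """Generate speech as opus at 1.25x playback speed and save it."""
    response = client.audio.speech.create(
        model="openai/tts-1",
        voice="alloy",
        input=text,
        response_format="opus",  # smaller files than mp3
        speed=1.25,              # 0.25 to 4.0, default 1.0
    )
    response.write_to_file(path)
```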

Available Voices

| Voice | Character |
| --- | --- |
| alloy | Neutral, balanced |
| echo | Softer, reflective |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Friendly, energetic |
| shimmer | Warm, gentle |

Audio Chat (Multimodal Models)

Some models accept audio directly as a chat message input and can respond with spoken audio. Use the standard chat completions endpoint with input_audio content parts.

Audio Input

Send audio alongside text in a chat message:
{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
import base64

with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav",
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)

Supported Input Audio Formats

| Format | MIME Type |
| --- | --- |
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| flac | audio/flac |
| m4a | audio/m4a |
| webm | audio/webm |
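
A small helper can map a local file's extension to the format field and build the content part in one step. A sketch (the helper name is illustrative; the extension map mirrors the table above):

```python
import base64
import os

# Extension-to-format map mirroring the supported-formats table
AUDIO_FORMATS = {
    ".wav": "wav", ".mp3": "mp3", ".ogg": "ogg",
    ".flac": "flac", ".m4a": "m4a", ".webm": "webm",
}

def input_audio_part(path: str) -> dict:
    """Build an input_audio content part from a local audio file."""
    fmt = AUDIO_FORMATS.get(os.path.splitext(path)[1].lower())
    if fmt is None:
        raise ValueError(f"unsupported audio format: {path}")
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

Usage: pass the result directly into a message's content list.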

Audio Output

Request spoken audio as part of the model response:
{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "nova",
    "format": "mp3"
  },
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
The response includes an audio field with base64-encoded audio:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "audio": {
          "id": "audio_abc123",
          "data": "<base64-encoded-mp3>",
          "expires_at": 1234567890,
          "transcript": "Why don't scientists trust atoms? Because they make up everything!"
        }
      }
    }
  ]
}
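
To play or save the result, decode the base64 payload from the message. A minimal helper (the function name is illustrative):

```python
import base64

def save_response_audio(message, path: str) -> str:
    """Write the assistant's base64 audio to disk; return its transcript."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(message.audio.data))
    return message.audio.transcript
```

Usage: `transcript = save_response_audio(response.choices[0].message, "joke.mp3")`.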

Streaming Audio Output

Audio output can be streamed for real-time playback:
stream = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "pcm16"},
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    audio = getattr(delta, "audio", None)
    if audio and audio.get("data"):
        # audio["data"] contains a base64-encoded PCM chunk
        play_audio_chunk(audio["data"])
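
With format pcm16 the stream carries raw PCM samples rather than a container format, so the chunks must be wrapped before most players can open them. A sketch using the standard-library wave module, assuming 24 kHz mono 16-bit output (the sample rate is an assumption; confirm it for your model):

```python
import base64
import wave

def write_pcm16_wav(b64_chunks, path: str, sample_rate: int = 24_000):
    """Assemble base64-encoded PCM16 chunks into a playable mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in b64_chunks:
            wav.writeframes(base64.b64decode(chunk))
```

Collect the base64 chunks from the stream above, then call `write_pcm16_wav(chunks, "joke.wav")`.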

Supported Models

Speech-to-Text

| Model | Languages | Notes |
| --- | --- | --- |
| openai/whisper-large-v3 | 99+ | Best accuracy |
| openai/whisper-large-v3-turbo | 99+ | Faster, lower cost |

Text-to-Speech

| Model | Quality | Latency |
| --- | --- | --- |
| openai/tts-1 | Standard | Low |
| openai/tts-1-hd | High | Medium |

Audio Chat

Use GET /v1/models?output_modalities=audio to discover models supporting audio output.

Token Pricing

Audio tokens are tracked separately in usage.prompt_tokens_details and usage.completion_tokens_details:
{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {
      "audio_tokens": 100,
      "cached_tokens": 0
    },
    "completion_tokens": 50,
    "completion_tokens_details": {
      "audio_tokens": 30
    }
  }
}
Audio tokens are priced differently from text tokens. Check usage.cost in the response for the actual charge for each request.
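
As a rough sketch of how mixed text/audio usage adds up, here is the arithmetic with hypothetical per-million-token rates (the numbers below are made up for illustration; always rely on usage.cost for the real charge):

```python
# Hypothetical USD-per-million-token rates, for illustration only
RATES = {
    "text_in": 2.50, "audio_in": 40.00,
    "text_out": 10.00, "audio_out": 80.00,
}

def estimate_cost(usage: dict) -> float:
    """Split token counts into text/audio buckets and apply per-bucket rates."""
    pt = usage["prompt_tokens"]
    pa = usage["prompt_tokens_details"]["audio_tokens"]
    ct = usage["completion_tokens"]
    ca = usage["completion_tokens_details"]["audio_tokens"]
    return (
        (pt - pa) * RATES["text_in"]      # text portion of the prompt
        + pa * RATES["audio_in"]          # audio portion of the prompt
        + (ct - ca) * RATES["text_out"]   # text portion of the completion
        + ca * RATES["audio_out"]         # audio portion of the completion
    ) / 1_000_000
```

Applied to the usage object above, this splits 150 prompt tokens into 50 text and 100 audio, and 50 completion tokens into 20 text and 30 audio.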