ARouter provides comprehensive audio support across three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and produce spoken output).
Audio Transcription
Transcribe audio files to text using the OpenAI-compatible /v1/audio/transcriptions endpoint.
```bash
curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.arouter.ai/v1",
    api_key="lr_live_xxxx",
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text" the SDK returns the transcript as a plain string
print(transcription)
```
```javascript
import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  baseURL: "https://api.arouter.ai/v1",
  apiKey: "lr_live_xxxx",
});

const transcription = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3",
  file: fs.createReadStream("audio.mp3"),
  response_format: "text",
});

// With response_format "text" the SDK returns the transcript as a plain string
console.log(transcription);
```
```bash
curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3" \
  -F response_format="text"
```
Transcription Parameters
| Parameter | Type | Description |
|---|---|---|
| file | file | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Model ID, e.g. openai/whisper-large-v3 |
| language | string | BCP-47 language code (e.g. "en", "zh"). Improves accuracy when specified. |
| prompt | string | Optional text to guide transcription style or provide vocabulary hints |
| response_format | string | Output format: json (default), text, srt, verbose_json, vtt |
| temperature | number | Sampling temperature 0–1. Higher values increase randomness. |
| timestamp_granularities | string[] | ["word"] or ["segment"] for timestamped output (requires verbose_json) |
Word-Level Timestamps
```python
transcription = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"],
)

for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
```
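Word timestamps can be post-processed into caption formats. A minimal sketch that groups word entries into SRT captions (the start/end/word fields follow the verbose_json word entries; the fixed words-per-caption grouping is our own choice, not part of the API):

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, per_caption=7):
    """Group word timestamps into SRT caption blocks of up to
    per_caption words. Each item in words is a dict with start,
    end (seconds) and word keys, as in the verbose_json response."""
    blocks = []
    for i in range(0, len(words), per_caption):
        group = words[i:i + per_caption]
        text = " ".join(w["word"] for w in group)
        blocks.append(
            f"{len(blocks) + 1}\n"
            f"{srt_time(group[0]['start'])} --> {srt_time(group[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)
```

For simple cases, requesting response_format="srt" directly avoids this step entirely; the helper is only useful when you need custom caption grouping.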
Audio Translation
Translate audio from any language into English text:
```python
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="text",
    )

# With response_format="text" the SDK returns the translation as a plain string
print(translation)
```
```bash
curl https://api.arouter.ai/v1/audio/translations \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@foreign_audio.mp3" \
  -F model="openai/whisper-large-v3"
```
Text-to-Speech
Convert text to natural-sounding speech:
```python
with client.audio.speech.with_streaming_response.create(
    model="openai/tts-1-hd",
    voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway.",
) as response:
    response.stream_to_file("output.mp3")
```
```javascript
import fs from "fs";

const response = await client.audio.speech.create({
  model: "openai/tts-1-hd",
  voice: "nova",
  input: "Hello! Welcome to ARouter, the universal AI gateway.",
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);
```
```bash
curl https://api.arouter.ai/v1/audio/speech \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1-hd",
    "input": "Hello! Welcome to ARouter.",
    "voice": "nova"
  }' \
  --output output.mp3
```
TTS Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | TTS model, e.g. openai/tts-1 or openai/tts-1-hd |
| input | string | Text to synthesize. Maximum 4,096 characters. |
| voice | string | Voice to use: alloy, echo, fable, onyx, nova, shimmer |
| response_format | string | Audio format: mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | Playback speed from 0.25 to 4.0 (default 1.0) |
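Because input is capped at 4,096 characters, longer documents must be synthesized in pieces. A rough sketch of sentence-boundary chunking (the limit comes from the table above; the splitting strategy is just one possible approach):

```python
import re


def chunk_text(text, limit=4096):
    """Split text into chunks of at most limit characters,
    preferring to break after sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split any single sentence longer than the limit.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to a separate speech request and the resulting audio files concatenated in order.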
Available Voices
| Voice | Character |
|---|---|
| alloy | Neutral, balanced |
| echo | Softer, reflective |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Friendly, energetic |
| shimmer | Warm, gentle |
Audio Chat (Multimodal Models)
Some models accept audio directly as a chat message input and can respond with spoken audio. Use the standard chat completions endpoint with input_audio content parts.
Send audio as a content part in a chat message:
```json
{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64-encoded-audio>",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
```
```python
import base64

with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav",
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
```javascript
import fs from "fs";

const audioData = fs.readFileSync("question.wav").toString("base64");

const response = await client.chat.completions.create({
  model: "openai/gpt-5.4-audio-preview",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "input_audio",
          input_audio: {
            data: audioData,
            format: "wav",
          },
        },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```
Supported Input Formats
| Format | MIME Type |
|---|---|
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| flac | audio/flac |
| m4a | audio/m4a |
| webm | audio/webm |
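When accepting arbitrary uploads, it can help to derive the format field from the file extension and fail early on unsupported types. A small sketch based on the format table above (the helper name is ours, not part of any SDK):

```python
import base64
from pathlib import Path

# Input formats from the table above.
SUPPORTED_INPUT_FORMATS = {"wav", "mp3", "ogg", "flac", "m4a", "webm"}


def input_audio_part(path):
    """Build an input_audio content part from a local file,
    inferring the format field from the file extension."""
    ext = Path(path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": ext}}
```

The returned dict can be dropped straight into a message's content list, as in the examples above.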
Audio Output
Request spoken audio as part of the model response:
```json
{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "nova",
    "format": "mp3"
  },
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
```
The response includes an audio field with base64-encoded audio:
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "audio": {
          "id": "audio_abc123",
          "data": "<base64-encoded-mp3>",
          "expires_at": 1234567890,
          "transcript": "Why don't scientists trust atoms? Because they make up everything!"
        }
      }
    }
  ]
}
```
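The data field is plain base64, so saving the audio is a matter of decoding it. A sketch that works on the raw JSON shape shown above (field names are taken from that example response):

```python
import base64


def save_audio_response(response_json, path):
    """Decode the base64 audio from a chat completion response dict,
    write it to path, and return the transcript."""
    audio = response_json["choices"][0]["message"]["audio"]
    with open(path, "wb") as f:
        f.write(base64.b64decode(audio["data"]))
    return audio["transcript"]
```

When using the Python SDK object rather than raw JSON, the same fields should be reachable as response.choices[0].message.audio.data and .transcript.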
Streaming Audio Output
Audio output can be streamed for real-time playback:
```python
stream = client.chat.completions.create(
    model="openai/gpt-5.4-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "nova", "format": "pcm16"},
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    audio = getattr(delta, "audio", None)
    if audio and audio.get("data"):
        # audio["data"] is a base64-encoded PCM chunk;
        # play_audio_chunk is a placeholder for your playback code
        play_audio_chunk(audio["data"])
```
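pcm16 deltas are raw 16-bit samples, so instead of playing them live you can buffer them and wrap the result in a WAV header with the standard library. A sketch (the 24 kHz mono parameters are an assumption; confirm the rate and channel count for the model you use):

```python
import base64
import wave


def pcm_chunks_to_wav(b64_chunks, path, sample_rate=24000):
    """Concatenate base64-encoded PCM16 chunks and write a mono WAV
    file. Returns the number of PCM bytes written. sample_rate=24000
    is an assumption, not confirmed by this page."""
    pcm = b"".join(base64.b64decode(c) for c in b64_chunks)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return len(pcm)
```

Collect each delta's base64 audio data into a list while streaming, then call this once the stream ends.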
Supported Models
Speech-to-Text
| Model | Languages | Notes |
|---|---|---|
| openai/whisper-large-v3 | 99+ | Best accuracy |
| openai/whisper-large-v3-turbo | 99+ | Faster, lower cost |
Text-to-Speech
| Model | Quality | Latency |
|---|---|---|
| openai/tts-1 | Standard | Low |
| openai/tts-1-hd | High | Medium |
Audio Chat
Use GET /v1/models?output_modalities=audio to discover models supporting audio output.
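The same filter can be applied client-side to an unfiltered model list. A sketch under the assumption that each entry in the /v1/models response carries an output_modalities list mirroring the query parameter (the field name is not confirmed by this page):

```python
def audio_output_models(models_payload):
    """Return IDs of models whose output_modalities include "audio",
    given a parsed /v1/models response dict. Assumes each model entry
    exposes an output_modalities list (an assumption, see above)."""
    return [
        model["id"]
        for model in models_payload.get("data", [])
        if "audio" in model.get("output_modalities", [])
    ]
```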
Token Pricing
Audio tokens are tracked separately in usage.prompt_tokens_details and usage.completion_tokens_details:
```json
{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {
      "audio_tokens": 100,
      "cached_tokens": 0
    },
    "completion_tokens": 50,
    "completion_tokens_details": {
      "audio_tokens": 30
    }
  }
}
```
Audio tokens are priced differently from text tokens. Check usage.cost in the response for the actual charge for each request.
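Since the totals include both kinds of tokens, splitting them out is simple arithmetic. A helper for breaking a usage block into audio and text counts (field names follow the example payload above):

```python
def token_breakdown(usage):
    """Split prompt and completion token counts into audio vs. text,
    using the *_tokens_details fields from the usage payload."""
    prompt_audio = usage.get("prompt_tokens_details", {}).get("audio_tokens", 0)
    completion_audio = usage.get("completion_tokens_details", {}).get("audio_tokens", 0)
    return {
        "prompt_text": usage["prompt_tokens"] - prompt_audio,
        "prompt_audio": prompt_audio,
        "completion_text": usage["completion_tokens"] - completion_audio,
        "completion_audio": completion_audio,
    }
```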