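Audio - ARouter

ARouter supports audio in three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and produce spoken output).

Audio Transcription

Use the OpenAI-compatible /v1/audio/transcriptions endpoint to transcribe an audio file to text.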

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"

Python

from openai import OpenAI

client = OpenAI(base_url="https://api.arouter.ai/v1", api_key="lr_live_xxxx")

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )

# With response_format="text" the SDK returns a plain string, not an object with .text.
print(transcription)

Node.js

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({ baseURL: "https://api.arouter.ai/v1", apiKey: "lr_live_xxxx" });
const transcription = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3", file: fs.createReadStream("audio.mp3"), response_format: "text"
});
// With response_format "text" the result is a plain string.
console.log(transcription);

cURL

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3" \
  -F response_format="text"

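Transcription Parameters

Parameter	Type	Description
`file`	`file`	The audio file to transcribe. Supported formats: `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`
`model`	`string`	Model ID, e.g. `openai/whisper-large-v3`
`language`	`string`	BCP-47 language code (e.g. `"en"`, `"zh"`). Specifying it can improve accuracy.
`prompt`	`string`	Optional text that guides the transcription style or supplies vocabulary hints
`response_format`	`string`	Output format: `json` (default), `text`, `srt`, `verbose_json`, `vtt`
`temperature`	`number`	Sampling temperature, 0 to 1. Higher values produce more random output.
`timestamp_granularities`	`string[]`	Granularity of timestamped output: `["word"]` or `["segment"]` (requires `verbose_json`)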
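
The constraints in the table can be checked on the client before a file is uploaded. The sketch below is illustrative only: `build_transcription_fields` is a hypothetical helper, not part of ARouter or the OpenAI SDK, and it assumes the allowed values are exactly those listed in the table.

```python
from pathlib import Path

# Values copied from the parameter table above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_transcription_fields(filename, model, response_format="json",
                               language=None, temperature=None,
                               timestamp_granularities=None):
    """Hypothetical helper: validate arguments and return the non-file form fields."""
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {extension!r}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if timestamp_granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities requires response_format='verbose_json'")

    # Only include optional fields the caller actually set.
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language
    if temperature is not None:
        fields["temperature"] = temperature
    if timestamp_granularities:
        fields["timestamp_granularities"] = list(timestamp_granularities)
    return fields
```

The returned dictionary could be passed as keyword arguments to `client.audio.transcriptions.create` alongside the opened file.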

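Word-Level Timestamps

# Reuses the client created in the transcription example above.
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file,
        response_format="verbose_json", timestamp_granularities=["word"]
    )

# Each entry carries the word plus its start and end time in seconds.
for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")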
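
Word-level timestamps are useful when captions need custom grouping rather than the segments the `srt` response format produces. The following is a minimal sketch of that idea: the `Word` dataclass is a stand-in for the entries of `transcription.words`, `words_to_srt` and `format_timestamp` are hypothetical helpers, and the sample data is made up rather than real model output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for one entry of transcription.words (times in seconds)."""
    word: str
    start: float
    end: float


def format_timestamp(seconds):
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group consecutive words into numbered SRT cues of at most max_words each."""
    cues = []
    for index in range(0, len(words), max_words):
        group = words[index:index + max_words]
        text = " ".join(w.word.strip() for w in group)
        start = format_timestamp(group[0].start)
        end = format_timestamp(group[-1].end)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"


# Made-up sample data, not real model output.
sample = [Word("Hello", 0.0, 0.42), Word("and", 0.42, 0.6), Word("welcome", 0.6, 1.1)]
print(words_to_srt(sample, max_words=2))
```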

Audio Translation

Translate audio in any language into English text:

Python

# Reuses the client created in the transcription example above.
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )

# With response_format="text" the SDK returns a plain string.
print(translation)

cURL

curl https://api.arouter.ai/v1/audio/translations \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@foreign_audio.mp3" \
  -F model="openai/whisper-large-v3"

Text-to-Speech

Convert text to natural-sounding speech:

Python

# Reuses the client created in the transcription example above.
# The streaming variant writes audio to disk as it arrives; calling
# stream_to_file on a non-streaming response is deprecated in recent SDK versions.
with client.audio.speech.with_streaming_response.create(
    model="openai/tts-1-hd", voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway."
) as response:
    response.stream_to_file("output.mp3")

Node.js

import fs from "fs";

// Reuses the client created in the transcription example above.
const response = await client.audio.speech.create({
  model: "openai/tts-1-hd", voice: "nova",
  input: "Hello! Welcome to ARouter, the universal AI gateway."
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

cURL

curl https://api.arouter.ai/v1/audio/speech \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/tts-1-hd", "input": "Hello! Welcome to ARouter.", "voice": "nova"}' \
  --output output.mp3

TTS Parameters

Parameter	Type	Description
`model`	`string`	TTS model, e.g. `openai/tts-1` or `openai/tts-1-hd`
`input`	`string`	The text to synthesize. Maximum 4,096 characters.
`voice`	`string`	Voice to use: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
`response_format`	`string`	Audio format: `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
`speed`	`number`	Playback speed, from `0.25` to `4.0` (default `1.0`)
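
Because `input` is capped at 4,096 characters, longer text has to be split across several requests. One possible approach is sketched below; `split_for_tts` is a hypothetical helper, and breaking at whitespace is an assumption about what sounds acceptable, not something the API requires.

```python
MAX_INPUT_CHARS = 4096  # limit from the TTS parameter table


def split_for_tts(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters, preferring whitespace breaks."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Find the last space or newline inside the window; fall back to a hard cut.
        cut = max(remaining.rfind(" ", 0, limit + 1), remaining.rfind("\n", 0, limit + 1))
        if cut <= 0:
            cut = limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk can then be sent as a separate speech request and the resulting audio files concatenated or played in order.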

Available Voices

Voice	Characteristics
`alloy`	Neutral, balanced
`echo`	Soft, contemplative
`fable`	Expressive, narrative
`onyx`	Deep, authoritative
`nova`	Friendly, energetic
`shimmer`	Warm, gentle

Audio Chat (Multimodal Models)

Some models accept audio directly as chat message input and can reply with spoken audio.

Audio Input

{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}]
  }]
}

Supported Input Audio Formats

Format	MIME type
`wav`	`audio/wav`
`mp3`	`audio/mpeg`
`ogg`	`audio/ogg`
`flac`	`audio/flac`
`m4a`	`audio/m4a`
`webm`	`audio/webm`
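
Building the audio-input message comes down to base64-encoding the raw bytes and naming the format. The sketch below assumes the message shape shown in the Audio Input example and the formats in the table above; `build_audio_message` is a hypothetical helper name.

```python
import base64

# Formats and MIME types copied from the table above.
MIME_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "ogg": "audio/ogg",
    "flac": "audio/flac",
    "m4a": "audio/m4a",
    "webm": "audio/webm",
}


def build_audio_message(audio_bytes, audio_format):
    """Return a user message whose content is a single base64-encoded audio part."""
    if audio_format not in MIME_TYPES:
        raise ValueError(f"unsupported input audio format: {audio_format!r}")
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": encoded, "format": audio_format}}
        ],
    }
```

In practice `audio_bytes` would come from reading a file in binary mode, and the returned dictionary would be placed in the request's `messages` list.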

Audio Output

Request spoken audio in the model response:

{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {"voice": "nova", "format": "mp3"},
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
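
This page does not show the response body for audio output. The sketch below assumes it follows the OpenAI-compatible shape, where the assistant message carries an `audio` object with base64 `data` and a text `transcript`; treat those field names as an assumption to verify against a real response. `extract_audio` is a hypothetical helper.

```python
import base64


def extract_audio(response):
    """Return (audio_bytes, transcript) from a chat completion dict, or (None, None).

    Assumes the OpenAI-compatible layout: choices[0].message.audio.data / .transcript.
    """
    message = response["choices"][0]["message"]
    audio = message.get("audio")
    if not audio:
        return None, None
    return base64.b64decode(audio["data"]), audio.get("transcript")
```

The returned bytes can be written to a file opened in binary mode, using the extension that matches the `format` requested.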

Supported Models

Speech-to-Text

Model	Languages	Notes
`openai/whisper-large-v3`	99+	Best accuracy
`openai/whisper-large-v3-turbo`	99+	Faster, lower cost

Text-to-Speech

Model	Quality	Latency
`openai/tts-1`	Standard	Low
`openai/tts-1-hd`	High	Medium

Token Pricing

Audio tokens are reported separately in usage.prompt_tokens_details and usage.completion_tokens_details:

{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {"audio_tokens": 100, "cached_tokens": 0},
    "completion_tokens": 50,
    "completion_tokens_details": {"audio_tokens": 30}
  }
}

Audio tokens are priced differently from text tokens. Check usage.cost in the response for the actual cost of each request.
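
To see how much of a request was audio, the usage object can be split into audio and non-audio counts. The sketch below assumes `prompt_tokens` and `completion_tokens` are totals that already include the audio tokens listed in the details objects; `split_token_usage` is a hypothetical helper.

```python
def split_token_usage(usage):
    """Break a usage dict into audio and non-audio token counts for input and output."""
    prompt_audio = usage.get("prompt_tokens_details", {}).get("audio_tokens", 0)
    completion_audio = usage.get("completion_tokens_details", {}).get("audio_tokens", 0)
    return {
        "input_audio": prompt_audio,
        # Assumption: the total includes audio, so the rest is non-audio.
        "input_text": usage["prompt_tokens"] - prompt_audio,
        "output_audio": completion_audio,
        "output_text": usage["completion_tokens"] - completion_audio,
    }


# The usage object from the example above.
usage = {
    "prompt_tokens": 150,
    "prompt_tokens_details": {"audio_tokens": 100, "cached_tokens": 0},
    "completion_tokens": 50,
    "completion_tokens_details": {"audio_tokens": 30},
}
print(split_token_usage(usage))
```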
