Audio - ARouter

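ARouter supports audio in three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and produce spoken output).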

Audio transcription

Use the OpenAI-compatible /v1/audio/transcriptions endpoint to transcribe an audio file to text.

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"

Python

from openai import OpenAI
client = OpenAI(base_url="https://api.arouter.ai/v1", api_key="lr_live_xxxx")
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )
# With response_format="text" the SDK returns a plain string, not an object with .text
print(transcription)

Node.js

import OpenAI from "openai";
import fs from "fs";
const client = new OpenAI({ baseURL: "https://api.arouter.ai/v1", apiKey: "lr_live_xxxx" });
const transcription = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3", file: fs.createReadStream("audio.mp3"), response_format: "text"
});
// With response_format "text" the result is a plain string
console.log(transcription);

cURL

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3" \
  -F response_format="text"

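Transcription parameters

Parameter	Type	Description
`file`	`file`	The audio file to transcribe. Supported formats: `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`
`model`	`string`	Model ID, e.g. `openai/whisper-large-v3`
`language`	`string`	BCP-47 language code (e.g. `"en"`, `"zh"`). Specifying it can improve accuracy.
`prompt`	`string`	Optional text that guides the transcription style or supplies vocabulary hints
`response_format`	`string`	Output format: `json` (default), `text`, `srt`, `verbose_json`, `vtt`
`temperature`	`number`	Sampling temperature, 0–1. Higher values produce more random output.
`timestamp_granularities`	`string[]`	Granularity of timestamped output: `["word"]` or `["segment"]` (requires `verbose_json`)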
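
The constraints in the table above can be checked locally before a request is sent. The following is a minimal sketch, not part of ARouter or the OpenAI SDK: `build_transcription_params` is a hypothetical helper that only assembles keyword arguments and never calls the API.

```python
from pathlib import Path

# Formats listed for `file` in the parameter table.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def build_transcription_params(path, model="openai/whisper-large-v3", language=None,
                               prompt=None, response_format="json",
                               temperature=None, timestamp_granularities=None):
    """Validate inputs and return keyword arguments for a transcription request.

    Hypothetical helper: it builds a dict only and performs no network call.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if timestamp_granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities requires response_format='verbose_json'")

    params = {"model": model, "response_format": response_format}
    # Include only the optional fields the caller actually set.
    optional = {"language": language, "prompt": prompt, "temperature": temperature,
                "timestamp_granularities": timestamp_granularities}
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

The returned dict could then be passed along with the open file, for example `client.audio.transcriptions.create(file=audio_file, **params)`.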

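Word-level timestamps

# Reuses the `client` created in the transcription example
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file,
        response_format="verbose_json", timestamp_granularities=["word"]
    )
# Each entry carries the word and its start/end offsets in seconds
for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")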
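
Word timestamps are often turned into subtitles. The sketch below groups words into SubRip (SRT) cues; `srt_timestamp` and `words_to_srt` are illustrative names, and the sample words are stand-ins for `transcription.words`, assumed to expose `.word`, `.start` and `.end` as in the loop above. Words are joined with spaces, which suits space-delimited languages but not Chinese or Japanese.

```python
from types import SimpleNamespace


def srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, words_per_cue=7):
    """Group word timestamps into numbered SRT cues of up to `words_per_cue` words."""
    cues = []
    for i in range(0, len(words), words_per_cue):
        chunk = words[i:i + words_per_cue]
        text = " ".join(w.word.strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n"
                    f"{srt_timestamp(chunk[0].start)} --> {srt_timestamp(chunk[-1].end)}\n"
                    f"{text}")
    return ("\n\n".join(cues) + "\n") if cues else ""


# Stand-in data shaped like transcription.words (hypothetical values).
sample_words = [
    SimpleNamespace(word="Hello", start=0.0, end=0.42),
    SimpleNamespace(word="world", start=0.5, end=1.0),
]
print(words_to_srt(sample_words))
```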

Audio translation

Translate audio in any language into English text:

Python

# Reuses the `client` created in the transcription example
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )
# With response_format="text" the SDK returns a plain string, not an object with .text
print(translation)

cURL

curl https://api.arouter.ai/v1/audio/translations \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@foreign_audio.mp3" \
  -F model="openai/whisper-large-v3"

Text-to-speech

Convert text into natural-sounding speech:

Python

response = client.audio.speech.create(
    model="openai/tts-1-hd", voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway."
)
response.stream_to_file("output.mp3")

Node.js

import fs from "fs";
const response = await client.audio.speech.create({
  model: "openai/tts-1-hd", voice: "nova",
  input: "Hello! Welcome to ARouter, the universal AI gateway."
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

cURL

curl https://api.arouter.ai/v1/audio/speech \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/tts-1-hd", "input": "Hello! Welcome to ARouter.", "voice": "nova"}' \
  --output output.mp3

TTS parameters

Parameter	Type	Description
`model`	`string`	TTS model, e.g. `openai/tts-1` or `openai/tts-1-hd`
`input`	`string`	The text to synthesize. Maximum 4,096 characters.
`voice`	`string`	Voice to use: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
`response_format`	`string`	Audio format: `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
`speed`	`number`	Playback speed, from `0.25` to `4.0` (default `1.0`)
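
Because `input` is capped at 4,096 characters, longer text has to be sent as several requests. One possible approach, sketched here with a hypothetical `split_for_tts` helper, is to split at sentence boundaries and hard-split only when a single sentence exceeds the limit. Sentences are rejoined with a single space, so the original whitespace is not preserved exactly.

```python
import re

MAX_TTS_CHARS = 4096  # per-request limit on `input` from the table above


def split_for_tts(text, limit=MAX_TTS_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text.strip())
    chunks, current = [], ""
    for sentence in filter(None, sentences):
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed as `input` to a separate `client.audio.speech.create` call, with the resulting audio files concatenated afterwards.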

Available voices

Voice	Characteristics
`alloy`	Neutral, balanced
`echo`	Soft, contemplative
`fable`	Expressive, narrative
`onyx`	Deep, authoritative
`nova`	Friendly, energetic
`shimmer`	Warm, gentle

Audio chat (multimodal models)

Some models accept audio directly as chat message input and can reply with spoken audio.

Audio input

{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}]
  }]
}

Supported input audio formats

Format	MIME type
`wav`	`audio/wav`
`mp3`	`audio/mpeg`
`ogg`	`audio/ogg`
`flac`	`audio/flac`
`m4a`	`audio/m4a`
`webm`	`audio/webm`
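
The `data` field in the request body expects base64-encoded audio, and `format` takes one of the values in the table above. A minimal sketch of building such a message from raw bytes or a file follows; `audio_message` and `audio_message_from_file` are hypothetical helper names, and using the file extension as the `format` value is an assumption.

```python
import base64
from pathlib import Path

# Formats from the table above.
INPUT_FORMATS = {"wav", "mp3", "ogg", "flac", "m4a", "webm"}


def audio_message(audio_bytes, audio_format):
    """Build a user message whose content is one base64-encoded input_audio part."""
    if audio_format not in INPUT_FORMATS:
        raise ValueError(f"unsupported input audio format: {audio_format!r}")
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [{"type": "input_audio",
                     "input_audio": {"data": encoded, "format": audio_format}}],
    }


def audio_message_from_file(path):
    """Read a file and infer `format` from its extension (an assumption)."""
    p = Path(path)
    return audio_message(p.read_bytes(), p.suffix.lower().lstrip("."))
```

The result can be placed in the `messages` array of a chat request, for example `{"model": "openai/gpt-5.4-audio-preview", "messages": [audio_message_from_file("question.wav")]}`.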

Audio output

Request spoken audio in the model's response:

{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {"voice": "nova", "format": "mp3"},
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
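
This page does not show the response shape for audio output. The sketch below assumes the OpenAI-style layout, where `choices[0].message.audio` carries base64 `data` and a text `transcript`; that layout is an assumption and should be checked against a real ARouter response. `extract_audio` is a hypothetical helper.

```python
import base64


def extract_audio(response):
    """Return (audio_bytes, transcript) from a chat completion given as a dict.

    ASSUMPTION: the response uses the OpenAI-style shape
    choices[0].message.audio = {"data": <base64 str>, "transcript": <str>}.
    Falls back to (None, text content) when no audio is present.
    """
    message = response["choices"][0]["message"]
    audio = message.get("audio")
    if not audio:
        return None, message.get("content")
    return base64.b64decode(audio["data"]), audio.get("transcript")
```

The decoded bytes could then be written to a file whose extension matches the requested `format`, such as `joke.mp3`.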

Supported models

Speech-to-text

Model	Languages	Notes
`openai/whisper-large-v3`	99+	Best accuracy
`openai/whisper-large-v3-turbo`	99+	Faster, lower cost

Text-to-speech

Model	Quality	Latency
`openai/tts-1`	Standard	Low
`openai/tts-1-hd`	High	Medium

Token pricing

Audio tokens are reported separately under usage.prompt_tokens_details and usage.completion_tokens_details:

{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {"audio_tokens": 100, "cached_tokens": 0},
    "completion_tokens": 50,
    "completion_tokens_details": {"audio_tokens": 30}
  }
}
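
For cost tracking it can help to separate audio tokens from the rest. The following sketch uses a hypothetical `split_usage` helper and assumes, as in OpenAI-style usage objects, that `prompt_tokens` and `completion_tokens` already include the audio tokens listed in their `*_details` fields. This page does not state that explicitly.

```python
def split_usage(usage):
    """Split a usage dict into audio and non-audio token counts.

    ASSUMPTION: prompt_tokens and completion_tokens are totals that
    already include the audio_tokens reported in the *_details objects.
    """
    prompt_audio = usage.get("prompt_tokens_details", {}).get("audio_tokens", 0)
    completion_audio = usage.get("completion_tokens_details", {}).get("audio_tokens", 0)
    return {
        "input_audio": prompt_audio,
        "input_other": usage["prompt_tokens"] - prompt_audio,
        "output_audio": completion_audio,
        "output_other": usage["completion_tokens"] - completion_audio,
    }


# The usage object from the example response above.
sample_usage = {
    "prompt_tokens": 150,
    "prompt_tokens_details": {"audio_tokens": 100, "cached_tokens": 0},
    "completion_tokens": 50,
    "completion_tokens_details": {"audio_tokens": 30},
}
print(split_usage(sample_usage))
```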

Audio tokens are priced differently from text tokens. Check usage.cost in the response for the actual cost of each request.