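Audio - ARouter

ARouter supports audio in three modes: speech-to-text (transcription and translation), text-to-speech (TTS), and audio chat (multimodal models that accept audio input and generate audio output).

Audio transcription

Transcribe an audio file to text with the OpenAI-compatible /v1/audio/transcriptions endpoint.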

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3"

Python

from openai import OpenAI

# Point the OpenAI SDK at ARouter's OpenAI-compatible endpoint.
client = OpenAI(base_url="https://api.arouter.ai/v1", api_key="lr_live_xxxx")

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )
# With response_format="text" the SDK returns a plain string, not an object with .text.
print(transcription)

Node.js

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({ baseURL: "https://api.arouter.ai/v1", apiKey: "lr_live_xxxx" });
const transcription = await client.audio.transcriptions.create({
  model: "openai/whisper-large-v3", file: fs.createReadStream("audio.mp3"), response_format: "text"
});
// With response_format "text" the result is a plain string.
console.log(transcription);

cURL

curl https://api.arouter.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@audio.mp3" \
  -F model="openai/whisper-large-v3" \
  -F response_format="text"

Transcription parameters

Parameter	Type	Description
`file`	`file`	The audio file to transcribe. Supported formats: `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`
`model`	`string`	Model ID (for example, `openai/whisper-large-v3`)
`language`	`string`	BCP-47 language code (for example, `"en"`, `"ja"`). Specifying it improves accuracy.
`prompt`	`string`	Optional text that guides the transcription style or supplies vocabulary hints
`response_format`	`string`	Output format: `json` (default), `text`, `srt`, `verbose_json`, `vtt`
`temperature`	`number`	Sampling temperature from 0 to 1. Higher values increase randomness.
`timestamp_granularities`	`string[]`	Granularity of timestamped output: `["word"]` or `["segment"]` (requires `verbose_json`)
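
The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any ARouter SDK: `build_transcription_kwargs` is a hypothetical helper, and the extension check assumes the file name reflects the actual audio format.

```python
from pathlib import Path

# Values copied from the parameter table above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_transcription_kwargs(path, model="openai/whisper-large-v3", language=None,
                               prompt=None, response_format="json", temperature=None,
                               timestamp_granularities=None):
    """Hypothetical helper: validate options and return keyword arguments
    (everything except `file`) for client.audio.transcriptions.create()."""
    extension = Path(path).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {extension!r}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if timestamp_granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities requires response_format='verbose_json'")

    kwargs = {"model": model, "response_format": response_format}
    # Only include optional parameters the caller actually set.
    optional = {"language": language, "prompt": prompt, "temperature": temperature,
                "timestamp_granularities": timestamp_granularities}
    kwargs.update({key: value for key, value in optional.items() if value is not None})
    return kwargs
```

With a `client` configured as in the transcription example, the result could be passed along as `client.audio.transcriptions.create(file=audio_file, **build_transcription_kwargs("meeting.mp3", language="ja"))`.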

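Word-level timestamps

# Reuses the `client` created in the transcription example.
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3", file=audio_file,
        response_format="verbose_json", timestamp_granularities=["word"]
    )
# Each entry carries the word and its start and end offsets in seconds.
for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")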
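
Word-level timestamps make it possible to build captions with custom grouping, which the built-in `srt` response format does not let you control. The sketch below groups words into SubRip (SRT) blocks by a fixed word count. The function names and the `max_words` parameter are illustrative assumptions, and the input is a plain list of `(word, start, end)` tuples rather than SDK objects.

```python
def format_srt_time(seconds):
    """Format a number of seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT caption blocks."""
    blocks = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        # A caption spans from its first word's start to its last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        number = index // max_words + 1
        blocks.append(f"{number}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

Given the `transcription` object from the example above, the input could be built with `[(w.word, w.start, w.end) for w in transcription.words]`.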

Audio translation

Translate audio in any language into English text:

Python

# Reuses the `client` created in the transcription example.
with open("foreign_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3", file=audio_file, response_format="text"
    )
# With response_format="text" the SDK returns a plain string, not an object with .text.
print(translation)

cURL

curl https://api.arouter.ai/v1/audio/translations \
  -H "Authorization: Bearer lr_live_xxxx" \
  -F file="@foreign_audio.mp3" \
  -F model="openai/whisper-large-v3"

Text-to-speech

Convert text into natural-sounding speech:

Python

# Reuses the `client` created in the transcription example.
# The streaming form is used because calling stream_to_file on a
# non-streaming response is deprecated in the OpenAI Python SDK.
with client.audio.speech.with_streaming_response.create(
    model="openai/tts-1-hd", voice="nova",
    input="Hello! Welcome to ARouter, the universal AI gateway."
) as response:
    response.stream_to_file("output.mp3")

Node.js

// Reuses the `client` created in the transcription example.
import fs from "fs";
const response = await client.audio.speech.create({
  model: "openai/tts-1-hd", voice: "nova",
  input: "Hello! Welcome to ARouter, the universal AI gateway."
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

cURL

curl https://api.arouter.ai/v1/audio/speech \
  -H "Authorization: Bearer lr_live_xxxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/tts-1-hd", "input": "Hello! Welcome to ARouter.", "voice": "nova"}' \
  --output output.mp3

TTS parameters

Parameter	Type	Description
`model`	`string`	TTS model (for example, `openai/tts-1` or `openai/tts-1-hd`)
`input`	`string`	The text to synthesize. Maximum 4,096 characters.
`voice`	`string`	The voice to use: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
`response_format`	`string`	Audio format: `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
`speed`	`number`	Playback speed from `0.25` to `4.0` (default `1.0`)
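
Because `input` is limited to 4,096 characters, longer text has to be synthesized in several requests. The sketch below shows one possible way to split text at sentence boundaries. `split_for_tts` is a hypothetical helper, and it assumes the limit is counted in Unicode code points (what Python's `len` measures), which the table does not specify.

```python
import re

MAX_INPUT_CHARS = 4096  # limit from the `input` row of the table above


def split_for_tts(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate speech request and the resulting audio files played or concatenated in order.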

Available voices

Voice	Characteristics
`alloy`	Neutral, balanced
`echo`	Soft, reflective
`fable`	Expressive, storytelling
`onyx`	Deep, authoritative
`nova`	Friendly, energetic
`shimmer`	Warm, calm

Audio chat (multimodal models)

Some models accept audio directly as input in chat messages and can respond with spoken audio.

Audio input

{
  "model": "openai/gpt-5.4-audio-preview",
  "messages": [{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<base64-encoded-audio>", "format": "wav"}}]
  }]
}

Supported input audio formats

Format	MIME type
`wav`	`audio/wav`
`mp3`	`audio/mpeg`
`ogg`	`audio/ogg`
`flac`	`audio/flac`
`m4a`	`audio/m4a`
`webm`	`audio/webm`
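
The `input_audio` part of a chat message carries base64-encoded audio plus a format name. The sketch below builds such a message from raw bytes and checks the format against the table above. `audio_message` is a hypothetical helper, and inferring the format from the file extension is an assumption made for illustration.

```python
import base64
from pathlib import Path

# Formats from the table above, keyed by file extension.
INPUT_AUDIO_MIME_TYPES = {
    "wav": "audio/wav", "mp3": "audio/mpeg", "ogg": "audio/ogg",
    "flac": "audio/flac", "m4a": "audio/m4a", "webm": "audio/webm",
}


def audio_message(filename, audio_bytes):
    """Hypothetical helper: build a user message with one `input_audio` part."""
    audio_format = Path(filename).suffix.lower().lstrip(".")
    if audio_format not in INPUT_AUDIO_MIME_TYPES:
        raise ValueError(f"unsupported input audio format: {audio_format!r}")
    return {
        "role": "user",
        "content": [{
            "type": "input_audio",
            "input_audio": {
                # The API expects the audio as a base64 string, not raw bytes.
                "data": base64.b64encode(audio_bytes).decode("ascii"),
                "format": audio_format,
            },
        }],
    }
```

The returned dictionary can be placed in the `messages` array of a request shaped like the audio input example.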

Audio output

Request spoken audio in the model's response:

{
  "model": "openai/gpt-5.4-audio-preview",
  "modalities": ["text", "audio"],
  "audio": {"voice": "nova", "format": "mp3"},
  "messages": [{"role": "user", "content": "Tell me a short joke."}]
}
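
The text above does not show the shape of the response. The sketch below assumes the OpenAI-style layout, in which the audio is returned base64-encoded under `choices[0].message.audio.data` with an optional `transcript`. Treat the field names as an assumption to verify against an actual ARouter response; `extract_audio` is a hypothetical helper.

```python
import base64


def extract_audio(response):
    """Return (audio_bytes, transcript) from a chat completion response dict.

    Assumes the OpenAI-style layout `choices[0].message.audio` with a
    base64 `data` field and an optional `transcript` field.
    """
    message = response["choices"][0]["message"]
    audio = message.get("audio")
    if not audio:
        raise ValueError("response contains no audio output")
    return base64.b64decode(audio["data"]), audio.get("transcript")
```

The decoded bytes can be written to a file whose extension matches the `format` requested in the `audio` object, for example `output.mp3`.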

Supported models

Speech-to-text

Model	Languages	Notes
`openai/whisper-large-v3`	99+	Highest accuracy
`openai/whisper-large-v3-turbo`	99+	Faster, lower cost

Text-to-speech

Model	Quality	Latency
`openai/tts-1`	Standard	Low
`openai/tts-1-hd`	High	Medium

Token pricing

Audio tokens are reported separately in usage.prompt_tokens_details (input) and usage.completion_tokens_details (output):

{
  "usage": {
    "prompt_tokens": 150,
    "prompt_tokens_details": {"audio_tokens": 100, "cached_tokens": 0},
    "completion_tokens": 50,
    "completion_tokens_details": {"audio_tokens": 30}
  }
}

Audio tokens are billed at different rates from text tokens. Check usage.cost in the response for the actual cost of each request.
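
Since audio and text tokens are priced differently, it can help to separate the two counts when logging usage. The sketch below assumes that `prompt_tokens` and `completion_tokens` are totals that already include the audio tokens, as the sample values suggest (150 with 100 audio, 50 with 30 audio). `split_usage` is a hypothetical helper.

```python
def split_usage(usage):
    """Separate text and audio token counts from a `usage` object given as a dict."""
    # Missing detail objects are treated as zero audio tokens.
    prompt_audio = usage.get("prompt_tokens_details", {}).get("audio_tokens", 0)
    completion_audio = usage.get("completion_tokens_details", {}).get("audio_tokens", 0)
    return {
        "prompt_text": usage["prompt_tokens"] - prompt_audio,
        "prompt_audio": prompt_audio,
        "completion_text": usage["completion_tokens"] - completion_audio,
        "completion_audio": completion_audio,
    }
```

For the sample usage object above, this yields 50 text and 100 audio prompt tokens, and 20 text and 30 audio completion tokens.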
