Multimodal

Supported modalities

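Modality	Direction	Notes
Text	Input + output	All models
Images (URL / base64)	Input	Vision models: JPEG, PNG, GIF, WebP
PDF (base64)	Input	Anthropic Claude, Google Gemini
Audio (base64)	Input	Multimodal audio models
Image generation	Output	DALL-E 3, Flux, Stable Diffusion
Audio output (TTS / speech)	Output	TTS models, audio chat models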

# Models that accept image input
GET /v1/models?supported_parameters=vision
# Models that output images
GET /v1/models?output_modalities=image
# Models that output audio
GET /v1/models?output_modalities=audio
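
The three queries above differ only in their query string. A minimal Python sketch that builds them follows. The host `https://api.example.com` and the helper name `models_url` are assumptions for illustration; the source only shows the `/v1/models` path and its filters.

```python
from urllib.parse import urlencode

# Hypothetical host: the documentation only shows the path /v1/models.
BASE_URL = "https://api.example.com"


def models_url(**filters: str) -> str:
    """Return the /v1/models URL with the given filters as a query string."""
    query = urlencode(filters)
    return f"{BASE_URL}/v1/models" + (f"?{query}" if query else "")


# The three filters shown in the documentation.
vision_url = models_url(supported_parameters="vision")
image_out_url = models_url(output_modalities="image")
audio_out_url = models_url(output_modalities="audio")
```

Each resulting URL can be passed to any HTTP client as a GET request; authentication is not covered here because the source does not describe it.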

Images

Using an image URL

{
  "model": "openai/gpt-5.4",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
    ]
  }]
}
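
The same request body can be assembled in code. The sketch below mirrors the JSON above and adds a helper for the base64 case. The `data:<mime>;base64,<payload>` form used by `to_data_url` is an assumption: the source says base64 images are accepted but does not show how they are encoded in the request. Both helper names are hypothetical.

```python
import base64
import json


def image_message(prompt: str, image_url: str) -> dict:
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (assumed wire format)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


payload = {
    "model": "openai/gpt-5.4",
    "messages": [
        image_message("What's in this image?", "https://example.com/image.jpg")
    ],
}
body = json.dumps(payload)  # JSON string ready to send as the request body
```

For a local file, the bytes read from disk would go through `to_data_url` and the result would take the place of the `https://` URL.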

Image detail levels

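Value	Description
`auto` (default)	The provider decides based on image size
`low`	Faster and cheaper: 85 tokens, image resized to 512×512
`high`	Full resolution: the image is tiled and consumes more tokens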
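
A short sketch of how a detail level might be attached to an image part. Placing `detail` inside the `image_url` object is an assumption; the source names the `detail` values but does not show where the field goes. The helper names are hypothetical, and the cost helper only covers `low`, the one level for which the table gives a fixed figure.

```python
VALID_DETAIL = {"auto", "low", "high"}
LOW_DETAIL_TOKENS = 85  # fixed per-image cost for detail "low", from the table


def image_part(url: str, detail: str = "auto") -> dict:
    """Build an image content part with an explicit detail level."""
    if detail not in VALID_DETAIL:
        raise ValueError(f"detail must be one of {sorted(VALID_DETAIL)}")
    # Assumed placement: detail sits next to url inside image_url.
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def low_detail_token_cost(image_count: int) -> int:
    """Prompt tokens consumed by image_count images sent with detail 'low'."""
    return image_count * LOW_DETAIL_TOKENS


part = image_part("https://example.com/image.jpg", detail="low")
cost = low_detail_token_cost(4)  # 4 images at 85 tokens each
```

Choosing `low` keeps the image cost predictable; `high` depends on how the provider tiles the image, so no formula is given for it here.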

Model compatibility

Model	Image URL	Image Base64	PDF	Audio input
`openai/gpt-5.4`	✓	✓	—	—
`anthropic/claude-sonnet-4.6`	✓	✓	✓	—
`google/gemini-2.5-flash`	✓	✓	✓	✓
`google/gemini-2.5-pro`	✓	✓	✓	✓
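
The table above can be mirrored as a small lookup to validate a request before sending it. This is only a static copy of the four rows listed; the modality labels (`image_url`, `image_base64`, `pdf`, `audio_input`) and function names are hypothetical, and a live integration would more likely query the models endpoint.

```python
# Static copy of the compatibility table; labels are illustrative only.
SUPPORT = {
    "openai/gpt-5.4": {"image_url", "image_base64"},
    "anthropic/claude-sonnet-4.6": {"image_url", "image_base64", "pdf"},
    "google/gemini-2.5-flash": {"image_url", "image_base64", "pdf", "audio_input"},
    "google/gemini-2.5-pro": {"image_url", "image_base64", "pdf", "audio_input"},
}


def supports(model: str, modality: str) -> bool:
    """Return True if the table lists the modality for the model."""
    return modality in SUPPORT.get(model, set())


def models_supporting(modality: str) -> list[str]:
    """Return the listed models that support the modality, in table order."""
    return [model for model, caps in SUPPORT.items() if modality in caps]


pdf_models = models_supporting("pdf")
audio_models = models_supporting("audio_input")
```

Unknown models return `False` rather than raising, so the check can sit in front of any request without extra error handling.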

Image tokens count toward the prompt token limit. Large, high-resolution images sent with detail: "high" can consume far more tokens than text.
