Speech-to-Text (Transcription/Translation) Support

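This document walks you through adding support for speech-to-text (ASR) models to vLLM's transcription and translation APIs by implementing the SupportsTranscription interface. Refer to the supported models for further guidance.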

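Update the base vLLM model

This guide assumes you have already implemented your model in vLLM according to the basic model guide. Extend your model with the SupportsTranscription interface and implement the following class attributes and methods.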

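`supported_languages` and `supports_transcription_only`

Declare the supported languages and capabilities:

The supported_languages mapping is validated at initialization time.
Set supports_transcription_only=True if the model should not serve text generation (e.g. Whisper).

supported_languages and supports_transcription_only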

from typing import ClassVar, Literal, Mapping, cast
import numpy as np
import torch
from torch import nn

from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType
from vllm.model_executor.models.interfaces import SupportsTranscription

class YourASRModel(nn.Module, SupportsTranscription):
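    # Mapping from ISO 639-1 language codes to language names
    supported_languages: ClassVar[Mapping[str, str]] = {
        "en": "English",
        "it": "Italian",
        # ... add more languages as needed
    }

    # Enable this flag if your model only supports audio-conditioned
    # generation (no text-only generation).
    supports_transcription_only: ClassVar[bool] = True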
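To illustrate what validating the mapping at initialization time can look like, here is a minimal sketch that rejects unknown language codes as soon as a subclass is defined. LanguageCheckedBase, KNOWN_LANGUAGE_CODES, and define_bad_model are hypothetical names invented for this example; vLLM's own check and language table may differ.

```python
from typing import ClassVar, Mapping

# Hypothetical allow-list for this sketch; vLLM keeps its own language table.
KNOWN_LANGUAGE_CODES = frozenset({"en", "it", "de", "fr", "es"})


class LanguageCheckedBase:
    """Stand-in base class (not vLLM's) that checks the mapping when subclassed."""

    supported_languages: ClassVar[Mapping[str, str]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        unknown = set(cls.supported_languages) - KNOWN_LANGUAGE_CODES
        if unknown:
            raise ValueError(
                f"{cls.__name__}.supported_languages has unknown codes: "
                f"{sorted(unknown)}"
            )


class GoodModel(LanguageCheckedBase):
    supported_languages = {"en": "English", "it": "Italian"}


def define_bad_model() -> None:
    # Defining the class is enough to trigger the check and raise ValueError.
    class BadModel(LanguageCheckedBase):
        supported_languages = {"en": "English", "xx": "Not a language"}
```

Failing at class-definition time surfaces a typo in a language code when the model module is imported, rather than on the first request that uses it.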

Provide an ASR configuration via get_speech_to_text_config.

This controls the general behavior of the API when serving your model:

get_speech_to_text_config()

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_speech_to_text_config(
        cls,
        model_config: ModelConfig,
        task_type: Literal["transcribe", "translate"],
    ) -> SpeechToTextConfig:
        return SpeechToTextConfig(
            sample_rate=16_000,
            max_audio_clip_s=30,
            # Set to None to disable server-side chunking if your
            # model/processor already handles chunking
            min_energy_split_window_size=None,
        )

See Audio preprocessing and chunking for what each field controls.
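As a rough illustration of how sample_rate and max_audio_clip_s interact, the sketch below computes the number of samples in one full-length clip and a lower bound on the number of clips for a long recording. SketchSTTConfig is a stand-in dataclass written for this example, not vLLM's SpeechToTextConfig, and the clip count ignores any overlap between chunks.

```python
import math
from dataclasses import dataclass


@dataclass
class SketchSTTConfig:
    """Stand-in holding two of the fields used above; not vLLM's class."""

    sample_rate: int = 16_000
    max_audio_clip_s: int = 30


def max_samples_per_clip(cfg: SketchSTTConfig) -> int:
    # Number of samples in one clip of maximum length.
    return cfg.sample_rate * cfg.max_audio_clip_s


def min_clips_needed(duration_s: float, cfg: SketchSTTConfig) -> int:
    # Lower bound only: overlap between neighbouring chunks is ignored.
    return max(1, math.ceil(duration_s / cfg.max_audio_clip_s))


cfg = SketchSTTConfig()
print(max_samples_per_clip(cfg))    # 480000
print(min_clips_needed(95.0, cfg))  # 4
```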

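Implement prompt construction via get_generation_prompt. The server passes you the resampled waveform and the task parameters; you return a valid PromptType. There are two common patterns:

Multimodal LLM with audio embeddings (e.g. Voxtral, Gemma3n)

Return a dict containing multi_modal_data (with the audio) and either a prompt string or prompt_token_ids: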

get_generation_prompt()

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        # Example using a free-form instruction prompt
        task_word = "Transcribe" if task_type == "transcribe" else "Translate"
        prompt = (
            "<start_of_turn>user\n"
            f"{task_word} this audio: <audio_soft_token>"
            "<end_of_turn>\n<start_of_turn>model\n"
        )

        return {
            "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
            "prompt": prompt,
        }

For further clarification on multimodal inputs, see Multimodal inputs.
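The example above ignores language and to_language. One way those arguments could be folded into the instruction is sketched below without any vLLM imports. build_instruction, build_prompt, and the instruction wording are hypothetical; the chat-template tokens are copied from the example above and must match your model's actual template.

```python
from typing import Any, Dict, Optional, Sequence

SAMPLE_RATE = 16_000  # assumed; matches the config example earlier on this page


def build_instruction(
    task_type: str, language: Optional[str], to_language: Optional[str]
) -> str:
    """Hypothetical instruction wording; adapt it to your model's template."""
    if task_type == "transcribe":
        suffix = f" into {language}" if language else ""
        return f"Transcribe this audio{suffix}:"
    source = f" from {language}" if language else ""
    target = f" into {to_language}" if to_language else ""
    return f"Translate this audio{source}{target}:"


def build_prompt(
    audio: Sequence[float],
    task_type: str,
    language: Optional[str] = None,
    to_language: Optional[str] = None,
) -> Dict[str, Any]:
    # Same dict shape as the example above: audio plus a text prompt.
    prompt = (
        "<start_of_turn>user\n"
        f"{build_instruction(task_type, language, to_language)} <audio_soft_token>"
        "<end_of_turn>\n<start_of_turn>model\n"
    )
    return {"multi_modal_data": {"audio": (audio, SAMPLE_RATE)}, "prompt": prompt}


print(build_instruction("translate", "Italian", "English"))
# Translate this audio from Italian into English:
```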

Encoder-decoder audio-only models (e.g. Whisper)

Return a dict with separate encoder_prompt and decoder_prompt entries:

get_generation_prompt()

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
        language: str | None,
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: str | None,
    ) -> PromptType:
        if language is None:
            raise ValueError("Language must be specified")

        prompt = {
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {
                    "audio": (audio, stt_config.sample_rate),
                },
            },
            "decoder_prompt": (
                (f"<|prev|>{request_prompt}" if request_prompt else "")
                + f"<|startoftranscript|><|{language}|>"
                + f"<|{task_type}|><|notimestamps|>"
            ),
        }
        return cast(PromptType, prompt)
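To make the decoder prompt format concrete, the string-building part of the example above can be pulled out into a standalone function and run without vLLM. build_decoder_prompt is a name chosen for this sketch; the special tokens are the ones shown in the example.

```python
def build_decoder_prompt(
    language: str, task_type: str, request_prompt: str = ""
) -> str:
    # An optional user-supplied prompt is passed as previous context.
    prefix = f"<|prev|>{request_prompt}" if request_prompt else ""
    return (
        f"{prefix}<|startoftranscript|><|{language}|>"
        f"<|{task_type}|><|notimestamps|>"
    )


print(build_decoder_prompt("en", "transcribe"))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
print(build_decoder_prompt("it", "translate", "Meeting notes."))
# <|prev|>Meeting notes.<|startoftranscript|><|it|><|translate|><|notimestamps|>
```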

`validate_language` (optional)

Language validation via validate_language

If your model requires a language and you want a default, override this method (see the Whisper implementation):

validate_language()

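from vllm.logger import init_logger

logger = init_logger(__name__)

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        if language is None:
            logger.warning(
                "Defaulting to language='en'. If you wish to transcribe "
                "audio in a different language, pass the `language` field "
                "in the TranscriptionRequest."
            )
            language = "en"
        # Delegate the actual check to the interface's implementation.
        return super().validate_language(language)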
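The default-then-delegate pattern can be tried in isolation with stand-in classes. In this sketch SketchBase imitates an assumed base behaviour (pass None through, reject codes missing from supported_languages); it is not the real SupportsTranscription implementation, and SketchWhisper is likewise hypothetical.

```python
import logging
from typing import ClassVar, Mapping, Optional

logger = logging.getLogger(__name__)


class SketchBase:
    supported_languages: ClassVar[Mapping[str, str]] = {}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        # Assumed base behaviour: pass None through, reject unknown codes.
        if language is None or language in cls.supported_languages:
            return language
        raise ValueError(f"Unsupported language: {language!r}")


class SketchWhisper(SketchBase):
    supported_languages = {"en": "English", "it": "Italian"}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        if language is None:
            logger.warning("Defaulting to language='en'.")
            language = "en"
        return super().validate_language(language)


print(SketchWhisper.validate_language(None))  # en
print(SketchWhisper.validate_language("it"))  # it
```

Applying the default before calling super() means the fallback value goes through the same check as a user-supplied one.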

`get_num_audio_tokens` (optional)

Token accounting for streaming via get_num_audio_tokens

Provide a fast duration→token estimate to improve streaming usage statistics:

get_num_audio_tokens()

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        # Return None if unknown; otherwise return an estimate.
        return int(audio_duration_s * stt_config.sample_rate // 320)  # example
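As a worked example of the estimate above: at 16 kHz with one token per 320 samples (the illustrative ratio used in the snippet, not a property of any particular model), one second of audio maps to 50 tokens. estimate_audio_tokens is a standalone helper written for this sketch.

```python
def estimate_audio_tokens(
    audio_duration_s: float,
    sample_rate: int = 16_000,
    samples_per_token: int = 320,
) -> int:
    # Same arithmetic as the example above, without the vLLM types.
    return int(audio_duration_s * sample_rate // samples_per_token)


for seconds in (1.0, 12.5, 30.0):
    print(seconds, estimate_audio_tokens(seconds))
# 1.0 50
# 12.5 625
# 30.0 1500
```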

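Audio preprocessing and chunking

The API server takes care of basic audio I/O and optional chunking before building prompts:

Resampling: input audio is resampled to SpeechToTextConfig.sample_rate using librosa.
Chunking: if SpeechToTextConfig.allow_audio_chunking is True and the duration exceeds max_audio_clip_s, the server splits the audio into overlapping chunks and generates one prompt per chunk. The overlap is controlled by overlap_chunk_second.
Energy-aware splitting: when min_energy_split_window_size is set, the server looks for low-energy regions to minimize cutting in the middle of words.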
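The chunking and energy-aware splitting steps above can be sketched in plain Python. This is one possible reading, not the server's actual code: here the last overlap_s seconds of each candidate chunk serve as the search region for the quietest window, and the next chunk starts at the chosen cut, so the resulting chunks are contiguous. The function names and parameters are hypothetical.

```python
from typing import List, Sequence


def lowest_energy_index(
    samples: Sequence[float], start: int, end: int, window: int
) -> int:
    """Return the start of the quietest `window`-sample span in [start, end)."""
    best_idx, best_energy = start, float("inf")
    for i in range(start, max(start + 1, end - window + 1), window):
        energy = sum(s * s for s in samples[i:i + window])
        if energy < best_energy:
            best_idx, best_energy = i, energy
    return best_idx


def split_audio(
    samples: Sequence[float],
    sample_rate: int,
    max_clip_s: float,
    overlap_s: float,
    window: int,
) -> List[Sequence[float]]:
    chunk = int(max_clip_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if not 0 < overlap < chunk or window < 1:
        raise ValueError("need 0 < overlap < chunk length and window >= 1")
    if len(samples) <= chunk:
        return [samples]  # short audio is passed through unchanged
    chunks, start = [], 0
    while start < len(samples):
        hard_end = start + chunk
        if hard_end >= len(samples):
            chunks.append(samples[start:])  # final, possibly shorter, chunk
            break
        # Search the tail of the candidate chunk for a quiet cut point.
        cut = lowest_energy_index(samples, hard_end - overlap, hard_end, window)
        chunks.append(samples[start:cut])
        start = cut
    return chunks
```

Because every cut lies inside the last overlap_s seconds of a candidate chunk, no chunk exceeds max_clip_s, and each iteration advances by at least chunk minus overlap samples.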

Exposing tasks automatically

If your model implements the interface, vLLM automatically advertises transcription support:

if supports_transcription(model):
    if model.supports_transcription_only:
        return ["transcription"]
    supported_tasks.append("transcription")

When enabled, the server initializes the transcription and translation handlers:

state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None

No extra registration is required beyond making your model class available through the model registry and implementing SupportsTranscription.
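The task-resolution excerpt in this section can be exercised as a self-contained sketch. The three stand-in classes and resolve_supported_tasks are hypothetical, and plain class flags replace vLLM's supports_transcription(model) check; "generate" is assumed here as the default task for a text-generation model.

```python
from typing import List


class TextOnlyModel:
    pass


class ChatAndAudioModel:
    supports_transcription = True
    supports_transcription_only = False


class AudioOnlyModel:
    supports_transcription = True
    supports_transcription_only = True


def resolve_supported_tasks(model_cls: type) -> List[str]:
    tasks = ["generate"]  # assumed default task name for this sketch
    if getattr(model_cls, "supports_transcription", False):
        if model_cls.supports_transcription_only:
            # Audio-only models expose nothing but transcription.
            return ["transcription"]
        tasks.append("transcription")
    return tasks


print(resolve_supported_tasks(TextOnlyModel))      # ['generate']
print(resolve_supported_tasks(ChatAndAudioModel))  # ['generate', 'transcription']
print(resolve_supported_tasks(AudioOnlyModel))     # ['transcription']
```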

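Examples in-tree

Whisper encoder-decoder (audio-only): vllm/model_executor/models/whisper.py
Voxtral decoder-only (audio embeddings + LLM): vllm/model_executor/models/voxtral.py. Make sure mistral-common[audio] is installed.
Gemma3n decoder-only (with a fixed instruction prompt): vllm/model_executor/models/gemma3n_mm.py

Test with the API

Once your model implements SupportsTranscription, you can test the endpoints (the API mimics OpenAI):

Transcription (ASR):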

curl -s -X POST \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=$MODEL_ID" \
  http://localhost:8000/v1/audio/transcriptions

Translation (source language → English, unless otherwise supported):

curl -s -X POST \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=$MODEL_ID" \
  http://localhost:8000/v1/audio/translations
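The same requests can also be sent from Python. The sketch below uses only the standard library to build the multipart/form-data body that curl's -F flags produce. encode_multipart and transcribe are helper names invented for this example, and the response is assumed to be JSON, as with the OpenAI API's default response format.

```python
import io
import json
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(
    fields: Dict[str, str], filename: str, file_bytes: bytes
) -> Tuple[bytes, str]:
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def transcribe(
    path: str,
    model: str,
    api_key: str,
    url: str = "http://localhost:8000/v1/audio/transcriptions",
) -> dict:
    # Requires a running server; this function is not called in the sketch.
    with open(path, "rb") as f:
        body, content_type = encode_multipart({"model": model}, "audio.wav", f.read())
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Pointing url at http://localhost:8000/v1/audio/translations would exercise the translation endpoint with the same body.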

Or check out more examples in examples/online_serving.

Note

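If your model handles chunking internally (e.g. via its processor or encoder), set min_energy_split_window_size=None in the returned SpeechToTextConfig to disable server-side chunking.
Implementing get_num_audio_tokens improves the accuracy of streaming usage metrics (prompt_tokens) without an extra forward pass.
For multilingual behavior, keep supported_languages aligned with the model's actual capabilities.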