OpenAI-Compatible Server¶
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.
In your terminal, you can install vLLM, then start the server with the vllm serve command. (You can also use our Docker image.)
To call the server, in your favorite text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the official OpenAI Python client.
Code
Tip
vLLM supports some parameters that are not supported by OpenAI, such as top_k. You can pass these parameters to vLLM via the extra_body parameter of the OpenAI client, e.g. extra_body={"top_k": 50} for top_k.
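When calling the server over raw HTTP instead of the OpenAI client, the same extra parameters are simply merged into the top-level JSON payload. A minimal sketch (the model name is just the one used elsewhere in this page; no server is contacted here):

```python
import json

# Standard Chat Completions fields understood by the OpenAI schema.
payload = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# vLLM-specific extras (what extra_body would add for you) are merged
# into the same top-level JSON object.
payload.update({"top_k": 50})

body = json.dumps(payload)
```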
Important
By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, pass --generation-config vllm when launching the server.
Supported APIs¶
We currently support the following OpenAI APIs:
- Completions API (/v1/completions)
    - Only applicable to text generation models.
    - Note: the suffix parameter is not supported.
- Responses API (/v1/responses)
    - Only applicable to text generation models.
- Chat Completions API (/v1/chat/completions)
- Embeddings API (/v1/embeddings)
    - Only applicable to embedding models.
- Transcriptions API (/v1/audio/transcriptions)
    - Only applicable to automatic speech recognition (ASR) models.
- Translation API (/v1/audio/translations)
    - Only applicable to automatic speech recognition (ASR) models.
- Realtime API (/v1/realtime)
    - Only applicable to automatic speech recognition (ASR) models.
In addition, we have the following custom APIs:
- Tokenizer API (/tokenize, /detokenize)
    - Applicable to any model with a tokenizer.
- Pooling API (/pooling)
    - Applicable to all pooling models.
- Classification API (/classify)
    - Only applicable to classification models.
- Score API (/score)
    - Applicable to embedding models and cross-encoder models.
- Re-rank API (/rerank, /v1/rerank, /v2/rerank)
    - Implements Jina AI's v1 re-rank API
    - Also compatible with Cohere's v1 and v2 re-rank APIs
    - The Jina and Cohere APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
    - Only applicable to cross-encoder models.
Chat Templates¶
In order for the language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here.
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can manually specify their chat template in the --chat-template parameter with the file path to the chat template, or the template in string form. Without a chat template, the server cannot process chat requests and all chat requests will error.
The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies both a type and a text field. An example is provided below:
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
],
},
],
)
Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support to detect this automatically, which is logged as a string like "Detected the chat template content format to be...", and internally converts incoming requests to match the detected format, which can be one of:
- "string": A string.
    - Example: "Hello world"
- "openai": A list of dictionaries, similar to the OpenAI schema.
    - Example: [{"type": "text", "text": "Hello world!"}]
If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.
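The conversion between the two content formats is mechanical; a minimal sketch (function names are illustrative, not vLLM internals):

```python
def to_openai_format(content):
    """Convert "string" content into the "openai" list-of-parts format."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content

def to_string_format(content):
    """Flatten "openai" text parts back into a plain string."""
    if isinstance(content, list):
        return "".join(part["text"] for part in content if part.get("type") == "text")
    return content
```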
Extra Parameters¶
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, you can pass them as extra parameters in the OpenAI client. Or, if you are calling the server over HTTP directly, merge them straight into the JSON payload.
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
],
extra_body={
"structured_outputs": {"choice": ["positive", "negative"]},
},
)
Extra HTTP Headers¶
Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.
Code
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
Offline API Docs¶
By default, FastAPI's /docs endpoint requires an internet connection. To enable offline access in air-gapped environments, use the --enable-offline-docs flag.
API Reference¶
Completions API¶
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_completion_client.py
Extra Parameters¶
The following sampling parameters are supported.
Code
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1, le=_LONG_INFO.max)] | None = (
None
)
allowed_token_ids: list[int] | None = None
prompt_logprobs: int | None = None
The following extra parameters are supported:
Code
prompt_embeds: bytes | list[bytes] | None = None
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
response_format: AnyResponseFormat | None = Field(
default=None,
description=(
"Similar to chat completion, this parameter specifies the format "
"of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float] | None = Field(
default=None,
description=(
"Additional request parameters with string or "
"numeric values, used by custom extensions."
),
)
Chat API¶
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: the image_url.detail parameter is not supported.
Code example: examples/online_serving/openai_chat_completion_client.py
Extra Parameters¶
The following sampling parameters are supported.
Code
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1, le=_LONG_INFO.max)] | None = (
None
)
prompt_logprobs: int | None = None
allowed_token_ids: list[int] | None = None
bad_words: list[str] = Field(default_factory=list)
The following extra parameters are supported:
Code
echo: bool = Field(
default=False,
description=(
"If true, the new message will be prepended with the last message "
"if they belong to the same role."
),
)
add_generation_prompt: bool = Field(
default=True,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
continue_final_message: bool = Field(
default=False,
description=(
"If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
'This allows you to "prefill" part of the model\'s response for it. '
"Cannot be used at the same time as `add_generation_prompt`."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
documents: list[dict[str, str]] | None = Field(
default=None,
description=(
"A list of dicts representing documents that will be accessible to "
"the model if it is performing RAG (retrieval-augmented generation)."
" If the template does not support RAG, this argument will have no "
"effect. We recommend that each document should be a dict containing "
'"title" and "text" keys.'
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
default=None,
description=(
"Additional request parameters with (list of) string or "
"numeric values, used by custom extensions."
),
)
Responses API¶
Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_responses_client_with_tools.py
Extra Parameters¶
The following extra parameters are supported in the request object:
Code
request_id: str = Field(
default_factory=lambda: f"resp_{random_uuid()}",
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
enable_response_messages: bool = Field(
default=False,
description=(
"Dictates whether or not to return messages as part of the "
"response object. Currently only supported for"
"non-background and gpt-oss only. "
),
)
# similar to input_messages / output_messages in ResponsesResponse
# we take in previous_input_messages (ie in harmony format)
# this cannot be used in conjunction with previous_response_id
# TODO: consider supporting non harmony messages as well
previous_input_messages: list[OpenAIHarmonyMessage | dict] | None = None
The following extra parameters are supported in the response object:
Code
# These are populated when enable_response_messages is set to True
# NOTE: custom serialization is needed
# see serialize_input_messages and serialize_output_messages
input_messages: ResponseInputOutputMessage | None = Field(
default=None,
description=(
"If enable_response_messages, we can show raw token input to model."
),
)
output_messages: ResponseInputOutputMessage | None = Field(
default=None,
description=(
"If enable_response_messages, we can show raw token output of model."
),
)
Embeddings API¶
Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.
Code example: examples/pooling/embed/openai_embedding_client.py
If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
Code
from typing import Literal, Union

from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse
def create_chat_embeddings(
client: OpenAI,
*,
messages: list[ChatCompletionMessageParam],
model: str,
encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
return client.post(
"/embeddings",
cast_to=CreateEmbeddingResponse,
body={"messages": messages, "model": model, "encoding_format": encoding_format},
)
Multimodal Inputs¶
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.
To serve the model:
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec_phi3v.jinja
Important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --runner pooling to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec_phi3v.jinja
Since the request schema is not defined by the OpenAI client, we call the server using the create_chat_embeddings helper defined above:
Code
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = create_chat_embeddings(
client,
model="TIGER-Lab/VLM2Vec-Full",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}
],
encoding_format="float",
)
print("Image embedding output:", response.data[0].embedding)
To serve the model:
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
Important
Like with VLM2Vec, we have to explicitly pass --runner pooling.
Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja
Important
MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.
Full example: examples/pooling/embed/vision_embedding_online.py
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
dimensions: int | None = None
normalize: bool | None = None
The following Embeddings API parameters are supported:
Code
The following extra parameters are supported:
Code
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
For chat-like input (i.e. when the `messages` parameter is passed), the following parameters are supported by default:
??? code
```python
model: str | None = None
user: str | None = None
messages: list[ChatCompletionMessageParam]
encoding_format: EncodingFormat = "float"
dimensions: int | None = None
```
The following extra parameters are supported instead:
??? code
```python
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_generation_prompt: bool = Field(
default=False,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
continue_final_message: bool = Field(
default=False,
description=(
"If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
'This allows you to "prefill" part of the model\'s response for it. '
"Cannot be used at the same time as `add_generation_prompt`."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
```
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription); you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
!!! note
    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: [:octicons-mark-github-16: examples/online_serving/openai_transcription_client.py](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_transcription_client.py)
#### API Enforced Limits
Set the maximum audio file size (in MB) that vLLM will accept via the `VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` environment variable. The default is 25 MB.
#### Uploading Audio Files
The Transcriptions API supports uploading audio files in various formats, including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.
**Using the OpenAI Python client:**
??? code
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-abc123",
)
# Upload an audio file from disk
with open("audio.mp3", "rb") as audio_file:
transcription = client.audio.transcriptions.create(
model="openai/whisper-large-v3-turbo",
file=audio_file,
language="en",
response_format="verbose_json",
)
print(transcription.text)
```
**Using curl with multipart/form-data:**
??? code
```bash
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Authorization: Bearer token-abc123" \
-F "[email protected]" \
-F "model=openai/whisper-large-v3-turbo" \
-F "language=en" \
-F "response_format=verbose_json"
```
**Supported parameters:**
- `file`: The audio file to transcribe (required)
- `model`: The model to use for transcription (required)
- `language`: Language code (e.g., "en", "zh") (optional)
- `prompt`: Optional text to guide the transcription style (optional)
- `response_format`: Response format ("json", "text") (optional)
- `temperature`: Sampling temperature between 0 and 1 (optional)
For the complete list of supported parameters, including sampling parameters and vLLM extensions, see the [protocol definitions](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L2182).
**Response format:**
For the `verbose_json` response format:
??? code
```json
{
"text": "Hello, this is a transcription of the audio file.",
"language": "en",
"duration": 5.42,
"segments": [
{
"id": 0,
"seek": 0,
"start": -0.0,
"end": 2.5,
"text": "Hello, this is a transcription",
"tokens": [50364, 938, 428, 307, 275, 28347],
"temperature": 0.0,
"avg_logprob": -0.245,
"compression_ratio": 1.235,
"no_speech_prob": 0.012
}
]
}
```
`no_speech_prob` is currently not supported in the `verbose_json` response format.
#### Extra Parameters
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
```python
```
The following extra parameters are supported:
??? code
```python
```
### Translations API
Our Translations API is compatible with [OpenAI's Translations API](https://platform.openai.com/docs/api-reference/audio/createTranslation); you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Whisper models can translate audio from one of the 55 supported non-English source languages into English.
Please note that the popular `openai/whisper-large-v3-turbo` model does not support translation.
!!! note
    To use the Translations API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: [:octicons-mark-github-16: examples/online_serving/openai_translation_client.py](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_translation_client.py)
#### Extra Parameters
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
    ```python
    ```
The following extra parameters are supported:
Realtime API¶
The Realtime API provides WebSocket-based streaming audio transcription, allowing real-time speech-to-text while recording audio.
Note
To use the Realtime API, please install with extra audio dependencies using uv pip install vllm[audio].
Audio Format¶
Audio must be sent as base64-encoded PCM16 at a 16kHz sample rate, mono channel.
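As a sketch of the required encoding, a mono float waveform (values in [-1.0, 1.0], already resampled to 16kHz) can be packed into the expected base64 PCM16 payload with only the standard library; the helper name is illustrative:

```python
import base64
import struct

def encode_pcm16_base64(samples):
    """Pack mono float samples in [-1.0, 1.0] into base64-encoded
    little-endian 16-bit PCM, the format the Realtime API expects."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")

chunk = encode_pcm16_base64([0.0, 0.5, -0.5, 1.0])
```

The resulting string is what goes into the `audio` field of an `input_audio_buffer.append` event.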
Protocol Overview¶
1. Client connects to ws://host/v1/realtime
2. Server sends a session.created event
3. Client optionally sends session.update with model/params
4. Client sends input_audio_buffer.commit when ready
5. Client sends input_audio_buffer.append events with base64 PCM16 chunks
6. Server sends transcription.delta events with incremental text
7. Server sends transcription.done with final text and usage
8. For the next utterance, repeat from step 5
9. Optionally, the client sends input_audio_buffer.commit with final=True to signal the end of audio input. This is useful when streaming audio files.
Client → Server Events¶
| Event | Description |
|---|---|
| input_audio_buffer.append | Send a base64-encoded audio chunk: {"type": "input_audio_buffer.append", "audio": "<base64>"} |
| input_audio_buffer.commit | Trigger transcription processing or finalize: {"type": "input_audio_buffer.commit", "final": bool} |
| session.update | Configure the session: {"type": "session.update", "model": "model-name"} |
Server → Client Events¶
| Event | Description |
|---|---|
| session.created | Connection established; contains the session ID and timestamp |
| transcription.delta | Incremental transcription text: {"type": "transcription.delta", "delta": "text"} |
| transcription.done | Final transcription result with usage statistics |
| error | Error notification with a message and optional code |
Python WebSocket Example¶
Code
import asyncio
import base64
import json

import websockets

async def realtime_transcribe():
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Wait for session.created
        response = await ws.recv()
        print(f"Session: {response}")

        # Commit the buffer to open the stream
        await ws.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))

        # Send audio chunks (file example)
        with open("audio.raw", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode()
                }))

        # Signal that all audio has been sent
        await ws.send(json.dumps({
            "type": "input_audio_buffer.commit",
            "final": True,
        }))

        # Receive transcription results
        while True:
            response = json.loads(await ws.recv())
            if response["type"] == "transcription.delta":
                print(response["delta"], end="", flush=True)
            elif response["type"] == "transcription.done":
                print(f"\nFinal: {response['text']}")
                break

asyncio.run(realtime_transcribe())
Tokenizer API¶
Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:
- /tokenize corresponds to calling tokenizer.encode().
- /detokenize corresponds to calling tokenizer.decode().
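A hedged sketch of calling these endpoints over HTTP. The payload builders below mirror the request shape (the model name and server URL are placeholders); the actual requests.post call is shown in comments rather than executed:

```python
# Build request bodies for the Tokenizer API endpoints.
def tokenize_body(model: str, prompt: str) -> dict:
    return {"model": model, "prompt": prompt}

def detokenize_body(model: str, tokens: list) -> dict:
    return {"model": model, "tokens": tokens}

body = tokenize_body("NousResearch/Meta-Llama-3-8B-Instruct", "Hello!")

# With a running server, you would then post, e.g.:
# import requests
# resp = requests.post("http://localhost:8000/tokenize", json=body)
# token_ids = resp.json()["tokens"]
```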
Pooling API¶
Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.
The input format is the same as the Embeddings API, but the output data can contain arbitrarily nested lists, not just a 1-D list of floats.
Code example: examples/pooling/pooling/pooling_online.py
Classification API¶
Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.
We automatically wrap any other transformer model via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.
Code example: examples/pooling/classify/classification_online.py
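Conceptually, the wrapping done by as_seq_cls_model() can be sketched in plain Python (illustrative only; the real head is a RowParallelLinear layer operating on tensors):

```python
import math

def classify_from_hidden_states(hidden_states, weight, bias):
    """Toy sketch: pool the last token's hidden state, apply a linear
    head, then softmax into per-class probabilities."""
    pooled = hidden_states[-1]                      # last-token pooling
    logits = [
        sum(w_i * h_i for w_i, h_i in zip(row, pooled)) + b
        for row, b in zip(weight, bias)             # linear head
    ]
    exps = [math.exp(lg - max(logits)) for lg in logits]
    total = sum(exps)
    return [e / total for e in exps]                # softmax

probs = classify_from_hidden_states(
    hidden_states=[[0.1, 0.2], [0.3, -0.1]],        # seq_len x hidden
    weight=[[1.0, 0.0], [0.0, 1.0]],                # num_classes x hidden
    bias=[0.0, 0.0],
)
```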
Example Requests¶
You can classify multiple texts by passing an array of strings:
curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loving the new coffee shop. The coffee is amazing.",
      "This update broke everything. So frustrating."
    ]
  }'
Response
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
You can also pass a string directly to the input field:
curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loving the new coffee shop. The coffee is amazing."
  }'
Response
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following Classification API parameters are supported:
Code
The following extra parameters are supported:
Code
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
For chat-like input (i.e. if messages is passed), the following parameters are supported:
Correspondingly, the following extra parameters are supported:
Code
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_generation_prompt: bool = Field(
default=False,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
continue_final_message: bool = Field(
default=False,
description=(
"If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
'This allows you to "prefill" part of the model\'s response for it. '
"Cannot be used at the same time as `add_generation_prompt`."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
Score API¶
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
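For the embedding-model case, the score reduces to cosine similarity between the two embeddings; a minimal sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

score = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```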
Documentation for cross-encoder models can be found at sbert.net.
Code example: examples/pooling/score/score_api_online.py
评分模板¶
一些评分模型需要特定的提示格式才能正常工作。你可以使用 --chat-template 参数指定自定义的评分模板(参见 聊天模板)。
评分模板仅支持交叉编码器模型。如果你使用嵌入模型进行评分,vLLM 不会应用评分模板。
与聊天模板类似,评分模板接收一个 messages 列表。对于评分,每条消息都有一个 role 属性——要么是 "query",要么是 "document"。对于通常的点式交叉编码器,你可以预期恰好有两条消息:一条查询和一条文档。要访问查询和文档内容,请使用 Jinja 的 selectattr 过滤器:
- Query: {{ (messages | selectattr("role", "eq", "query") | first).content }}
- Document: {{ (messages | selectattr("role", "eq", "document") | first).content }}
This approach is more robust than index-based access (messages[0], messages[1]) because it selects messages by their semantic role. It also avoids assumptions about message order if additional message types are added to messages in the future.
Example template file: examples/pooling/score/template/nemotron-rerank.jinja
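The selectattr lookup above is equivalent to the following plain-Python selection by role (an illustrative sketch, not vLLM code):

```python
def first_content_by_role(messages: list[dict], role: str) -> str:
    """Return the content of the first message with the given role,
    mirroring Jinja's `selectattr("role", "eq", role) | first`."""
    return next(m for m in messages if m["role"] == role)["content"]

messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "The capital of France is Paris."},
]
print(first_content_by_role(messages, "query"))     # the query text
print(first_content_by_role(messages, "document"))  # the document text
```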
Single inference¶
You can pass a string to both queries and documents to form a single sentence pair.
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"queries": "法国的首都是什么?",
"documents": "法国的首都是巴黎。"
}'
Response
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": \\
"object": "score",
"score": 1
}
],
"usage": {}
}
Batch inference¶
You can pass a string to queries and a list to documents to form multiple sentence pairs, where each pair consists of the string in queries and one string from documents. The total number of pairs is len(documents).
Request
Response
You can also pass a list to both queries and documents to form multiple sentence pairs, where each pair consists of a string from queries and the string at the corresponding position in documents (similar to the zip() function). The total number of pairs is len(documents).
Request
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"queries": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
Response
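The two batching modes above can be summarized with a small sketch of the pairing logic (illustrative only; this is not how vLLM implements it internally):

```python
def build_pairs(queries, documents):
    """Form (query, document) sentence pairs as the Score API does.

    - str queries + list documents: pair the single query with every
      document, giving len(documents) pairs.
    - list queries + list documents: pair element-wise, like zip().
    """
    if isinstance(queries, str):
        return [(queries, d) for d in documents]
    return list(zip(queries, documents))

print(build_pairs("q", ["d1", "d2"]))           # [('q', 'd1'), ('q', 'd2')]
print(build_pairs(["q1", "q2"], ["d1", "d2"]))  # [('q1', 'd1'), ('q2', 'd2')]
```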
Multi-modal inputs¶
You can pass multi-modal inputs (such as images) to a score model via a list in the content field of the request. Refer to the following example for illustration.
To serve the model:
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library:
Code
import requests
response = requests.post(
"http://localhost:8000/v1/score",
json={
"model": "jinaai/jina-reranker-m0",
"queries": "slm markdown",
"documents": {
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
},
},
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
},
},
],
},
},
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
Full examples:
- examples/pooling/score/vision_score_api_online.py
- examples/pooling/score/vision_rerank_api_online.py
Extra parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following Score API parameters are supported:
model: str | None = None
user: str | None = None
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
The following extra parameters are supported:
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
Re-rank API¶
Our Re-rank API can apply an embedding model or a cross-encoder model to predict the relevance score between a single query and each document in a list. Usually, the score of a pair refers to the similarity between two sentences or multi-modal inputs (such as images), on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base, as well as other models that support the score task. In addition, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with Jina AI's re-rank API interface and Cohere's re-rank API interface, ensuring compatibility with popular open-source tools.
Code example: examples/pooling/score/rerank_api_online.py
Example request¶
Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.
Request
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
Response
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
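Since the results come back sorted by relevance, the index property of each result lets you recover the original document order. A minimal sketch operating on a parsed response:

```python
# Restore the original document order of a parsed /v1/rerank response,
# using the `index` property of each result (values taken from the
# example response above).
response = {
    "results": [
        {"index": 1, "relevance_score": 0.99853515625},
        {"index": 0, "relevance_score": 0.0005860328674316406},
    ]
}

by_original_order = sorted(response["results"], key=lambda r: r["index"])
print([r["index"] for r in by_original_order])  # [0, 1]
```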
Extra parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following Re-rank API parameters are supported:
model: str | None = None
user: str | None = None
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
The following extra parameters are supported:
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
Ray Serve LLM¶
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.
Key capabilities:
- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through the Ray dashboard and metrics.
The following example shows how to deploy a large model such as DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.
Learn more in the official Ray Serve LLM documentation.