生产环境指标¶

vLLM 暴露了一系列可用于监控系统健康状况的指标。这些指标通过 vLLM 的 OpenAI 兼容 API 服务器上的 /metrics 端点提供。

你可以使用 Python 启动服务器，或使用 Docker：

vllm serve unsloth/Llama-3.2-1B-Instruct

然后查询该端点以获取服务器的最新指标：

输出

$ curl http://0.0.0.0:8000/metrics

# HELP vllm:iteration_tokens_total 每个 engine_step 中 token 数量的直方图。
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
...

以下指标被暴露：

通用指标¶

Metric Name	Type	Description
`vllm:corrupted_requests`	Counter	Corrupted requests, in terms of total number of requests with NaNs in logits.
`vllm:external_prefix_cache_hits`	Counter	External prefix cache hits from KV connector cross-instance cache sharing, in terms of number of cached tokens.
`vllm:external_prefix_cache_queries`	Counter	External prefix cache queries from KV connector cross-instance cache sharing, in terms of number of queried tokens.
`vllm:generation_tokens`	Counter	Number of generation tokens processed.
`vllm:mm_cache_hits`	Counter	Multi-modal cache hits, in terms of number of cached items.
`vllm:mm_cache_queries`	Counter	Multi-modal cache queries, in terms of number of queried items.
`vllm:num_preemptions`	Counter	Cumulative number of preemption from the engine.
`vllm:prefix_cache_hits`	Counter	Prefix cache hits, in terms of number of cached tokens.
`vllm:prefix_cache_queries`	Counter	Prefix cache queries, in terms of number of queried tokens.
`vllm:prompt_tokens`	Counter	Number of prefill tokens processed.
`vllm:request_success`	Counter	Count of successfully processed requests.
`vllm:engine_sleep_state`	Gauge	Engine sleep state; awake = 0 means engine is sleeping; awake = 1 means engine is awake; weights_offloaded = 1 means sleep level 1; discard_all = 1 means sleep level 2.
`vllm:kv_cache_usage_perc`	Gauge	KV-cache usage. 1 means 100 percent usage.
`vllm:lora_requests_info`	Gauge	Running stats on lora requests.
`vllm:num_requests_running`	Gauge	Number of requests in model execution batches.
`vllm:num_requests_waiting`	Gauge	Number of requests waiting to be processed.
`vllm:e2e_request_latency_seconds`	Histogram	Histogram of e2e request latency in seconds.
`vllm:inter_token_latency_seconds`	Histogram	Histogram of inter-token latency in seconds.
`vllm:iteration_tokens_total`	Histogram	Histogram of number of tokens per engine_step.
`vllm:kv_block_idle_before_evict_seconds`	Histogram	Histogram of idle time before KV cache block eviction. Sampled metrics (controlled by --kv-cache-metrics-sample).
`vllm:kv_block_lifetime_seconds`	Histogram	Histogram of KV cache block lifetime from allocation to eviction. Sampled metrics (controlled by --kv-cache-metrics-sample).
`vllm:kv_block_reuse_gap_seconds`	Histogram	Histogram of time gaps between consecutive KV cache block accesses. Only the most recent accesses are recorded (ring buffer). Sampled metrics (controlled by --kv-cache-metrics-sample).
`vllm:request_decode_time_seconds`	Histogram	Histogram of time spent in DECODE phase for request.
`vllm:request_generation_tokens`	Histogram	Number of generation tokens processed.
`vllm:request_inference_time_seconds`	Histogram	Histogram of time spent in RUNNING phase for request.
`vllm:request_max_num_generation_tokens`	Histogram	Histogram of maximum number of requested generation tokens.
`vllm:request_params_max_tokens`	Histogram	Histogram of the max_tokens request parameter.
`vllm:request_params_n`	Histogram	Histogram of the n request parameter.
`vllm:request_prefill_kv_computed_tokens`	Histogram	Histogram of new KV tokens computed during prefill (excluding cached tokens).
`vllm:request_prefill_time_seconds`	Histogram	Histogram of time spent in PREFILL phase for request.
`vllm:request_prompt_tokens`	Histogram	Number of prefill tokens processed.
`vllm:request_queue_time_seconds`	Histogram	Histogram of time spent in WAITING phase for request.
`vllm:request_time_per_output_token_seconds`	Histogram	Histogram of time_per_output_token_seconds per request.
`vllm:time_to_first_token_seconds`	Histogram	Histogram of time to first token in seconds.

推测解码指标¶

Metric Name	Type	Description
`vllm:spec_decode_num_accepted_tokens`	Counter	Number of accepted tokens.
`vllm:spec_decode_num_accepted_tokens_per_pos`	Counter	Accepted tokens per draft position.
`vllm:spec_decode_num_draft_tokens`	Counter	Number of draft tokens.
`vllm:spec_decode_num_drafts`	Counter	Number of spec decoding drafts.

NIXL KV 连接器指标¶

Metric Name	Type	Description
`vllm:nixl_num_failed_notifications`	Counter	Number of failed NIXL KV Cache notifications.
`vllm:nixl_num_failed_transfers`	Counter	Number of failed NIXL KV Cache transfers.
`vllm:nixl_num_kv_expired_reqs`	Counter	Number of requests that had their KV expire. NOTE: This metric is tracked on the P instance.
`vllm:nixl_bytes_transferred`	Histogram	Histogram of bytes transferred per NIXL KV Cache transfers.
`vllm:nixl_num_descriptors`	Histogram	Histogram of number of descriptors per NIXL KV Cache transfers.
`vllm:nixl_post_time_seconds`	Histogram	Histogram of transfer post time for NIXL KV Cache transfers.
`vllm:nixl_xfer_time_seconds`	Histogram	Histogram of transfer duration for NIXL KV Cache transfers.

废弃策略¶

注意：当某个指标在版本 X.Y 中被废弃时，它在版本 X.Y+1 中将被隐藏，但可通过 --show-hidden-metrics-for-version=X.Y 这一临时选项重新启用，然后在版本 X.Y+2 中被移除。