Quantization¶
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
Tip
To get started with quantization, see LLM Compressor, a library for optimizing models for deployment with vLLM that supports FP8, INT8, INT4, and other quantization formats.
The following are the supported quantization formats for vLLM:
- AutoAWQ
- BitsAndBytes
- GGUF
- GPTQModel
- Intel Neural Compressor
- INT4 W4A16
- INT8 W8A8
- FP8 W8A8
- NVIDIA Model Optimizer
- AMD Quark
- Quantized KV Cache
- TorchAO
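Most of these formats are loaded from pre-quantized checkpoints. As a minimal, hedged sketch of what that looks like (the checkpoint name is a placeholder for any AWQ-quantized model; vLLM usually infers the method from the checkpoint's quantization_config, but it can also be passed explicitly):
from vllm import LLM

# Placeholder checkpoint; any AWQ-quantized model works the same way.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)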
Supported Hardware¶
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU |
|---|---|---|---|---|---|---|---|---|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
- All Intel Gaudi quantization support has been migrated to vLLM-Gaudi.
Note
For information on quantization support on Google TPU, please refer to the TPU-Inference Recommended Models and Features documentation.
Note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to vllm/model_executor/layers/quantization or consult with the vLLM development team.
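To see where a particular NVIDIA GPU falls in the table above, you can query its SM version locally. A quick sketch using PyTorch:
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # (7, 0) = Volta, (7, 5) = Turing, (8, 0)/(8, 6) = Ampere,
    # (8, 9) = Ada, (9, 0) = Hopper
    print(f"This GPU is SM {major}.{minor}")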
Out-of-Tree Quantization Plugins¶
vLLM supports registering custom, out-of-tree quantization methods using the @register_quantization_config decorator. This allows you to implement and use your own quantization schemes without modifying the vLLM codebase.
Registering a Custom Quantization Method¶
To register a custom quantization method, create a class that inherits from QuantizationConfig and decorate it with @register_quantization_config. Its get_quant_method method dispatches to the appropriate quantize method based on the layer type:
import torch
from vllm.model_executor.layers.quantization import (
    register_quantization_config,
)
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig,
    QuantizeMethodBase,
)
from vllm.model_executor.layers.linear import LinearBase
from vllm.model_executor.layers.fused_moe import FusedMoE


@register_quantization_config("my_quant")
class MyQuantConfig(QuantizationConfig):
    """Custom quantization config."""

    def get_name(self) -> str:
        return "my_quant"

    def get_supported_act_dtypes(self) -> list:
        return [torch.float16, torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        # Minimum GPU compute capability, -1 for no restriction
        return -1

    @staticmethod
    def get_config_filenames() -> list[str]:
        # Config files to search for in model directory
        return []

    @classmethod
    def from_config(cls, config: dict) -> "MyQuantConfig":
        # Create config from model's quantization config
        return cls()

    def get_quant_method(
        self, layer: torch.nn.Module, prefix: str
    ) -> QuantizeMethodBase | None:
        # Dispatch based on layer type
        # NOTE: you only need to implement methods you care about
        if isinstance(layer, LinearBase):
            return MyQuantLinearMethod()
        elif isinstance(layer, FusedMoE):
            return MyQuantMoEMethod(layer.moe_config)
        return None
Required QuantizationConfig Methods¶
Your custom QuantizationConfig subclass must implement these abstract methods:
| Method | Description |
|---|---|
| get_name() | Returns the name of the quantization method |
| get_supported_act_dtypes() | Returns the list of supported activation dtypes (e.g., torch.float16) |
| get_min_capability() | Returns the minimum GPU compute capability (e.g., 80 for Ampere, -1 for no restriction) |
| get_config_filenames() | Returns the list of config filenames to search for in the model directory |
| from_config(config) | Class method to create the config from the model's quantization config dict (see the sketch after this table) |
| get_quant_method(layer, prefix) | Returns the quantization method for a given layer, or None to skip |
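To illustrate from_config, the sketch below extends the MyQuantConfig example above to read a field from the model's quantization config dict; the group_size key is purely hypothetical and not something vLLM requires:
@register_quantization_config("my_grouped_quant")
class MyGroupedQuantConfig(MyQuantConfig):
    """Variant of MyQuantConfig that reads a hypothetical group_size field."""

    def __init__(self, group_size: int = 128):
        super().__init__()
        self.group_size = group_size

    def get_name(self) -> str:
        return "my_grouped_quant"

    @classmethod
    def from_config(cls, config: dict) -> "MyGroupedQuantConfig":
        # `config` is the quantization_config dict from the model's config.json
        return cls(group_size=config.get("group_size", 128))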
Implementing a Quantized Linear Method¶
For linear layers, return a QuantizeMethodBase subclass from get_quant_method. You can extend UnquantizedLinearMethod as a starting point:
from vllm.model_executor.layers.linear import UnquantizedLinearMethod


class MyQuantLinearMethod(UnquantizedLinearMethod):
    """Custom quantization method for linear layers."""

    def create_weights(
        self, layer: torch.nn.Module, *weight_args, **extra_weight_attrs
    ):
        # Create quantized weights for the layer
        ...

    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # Apply custom quantization logic here
        ...
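For a more concrete, purely illustrative sketch, the toy method below fake-quantizes the weight to int8 on the fly inside apply; a real implementation would create and pack quantized weights in create_weights (or repack them in process_weights_after_loading) rather than quantizing on every forward pass. It relies on layer.weight, which the inherited UnquantizedLinearMethod.create_weights creates:
import torch
import torch.nn.functional as F

from vllm.model_executor.layers.linear import UnquantizedLinearMethod


class ToyInt8LinearMethod(UnquantizedLinearMethod):
    """Illustrative only: quantize-dequantize the weight per forward pass."""

    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        bias: torch.Tensor | None = None,
    ) -> torch.Tensor:
        weight = layer.weight  # created by UnquantizedLinearMethod.create_weights
        # Symmetric per-tensor int8 fake quantization of the weight
        scale = weight.abs().max().clamp(min=1e-8) / 127.0
        w_q = torch.clamp(torch.round(weight / scale), -128, 127)
        w_dq = (w_q * scale).to(x.dtype)
        return F.linear(x, w_dq, bias)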
Implementing a Quantized MoE Method¶
For Mixture of Experts (MoE) models, return a FusedMoEMethodBase subclass from get_quant_method. Besides create_weights, the method implements apply, which runs the MoE computation with the quantized weights, and get_fused_moe_quant_config, which returns the MoE quantization config. You can use UnquantizedFusedMoEMethod to skip MoE quantization:
from vllm.model_executor.layers.fused_moe.layer import UnquantizedFusedMoEMethod
from vllm.model_executor.layers.fused_moe.fused_moe_method_base import (
    FusedMoEMethodBase,
)
from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig


class MyQuantMoEMethod(FusedMoEMethodBase):
    """Custom quantization method for MoE layers."""

    def create_weights(
        self,
        layer: torch.nn.Module,
        num_experts: int,
        hidden_size: int,
        intermediate_size_per_partition: int,
        params_dtype: torch.dtype,
        **extra_weight_attrs,
    ):
        # Create quantized weights for the MoE layer
        ...
    def apply(
        self,
        layer: torch.nn.Module,
        router: "FusedMoERouter",
        x: torch.Tensor,
        router_logits: torch.Tensor,
    ) -> torch.Tensor:
        # Run the MoE computation using the quantized weights
        ...

    def get_fused_moe_quant_config(
        self, layer: torch.nn.Module
    ) -> FusedMoEQuantConfig | None:
        # Return the MoE quantization config
        ...
See the Fp8MoEMethod implementation in vllm/model_executor/layers/quantization/fp8.py for a reference.
Using the Plugin¶
Once registered, you can use your custom quantization method in vLLM:
# Register your quantization method (import the module containing the config)
import my_quant_plugin

from vllm import LLM

# Use the custom quantization method
llm = LLM(model="your-model", quantization="my_quant")
For more information on the plugin system, see the plugin system documentation.
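If the plugin is packaged as a general vLLM plugin, the registration import happens automatically when vLLM starts, so the explicit import above is not needed. A hedged packaging sketch, assuming the vllm.general_plugins entry-point group described in the plugin system documentation (package, module, and function names are placeholders):
# setup.py for a hypothetical my-quant-plugin package. vLLM invokes the
# entry-point callable at startup; it should import (or otherwise run)
# the module containing the @register_quantization_config class.
from setuptools import setup

setup(
    name="my-quant-plugin",
    packages=["my_quant_plugin"],
    entry_points={
        "vllm.general_plugins": [
            "register_my_quant = my_quant_plugin:register",
        ]
    },
)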