vllm.model_executor.layers.quantization.compressed_tensors.schemes.compressed_tensors_w4a16_mxfp4 ¶
CompressedTensorsW4A16Mxfp4 ¶
Bases: CompressedTensorsScheme
Compressed tensors scheme for MXFP4 weight-only quantization.
Supports models quantized with the compressed-tensors mxfp4-pack-quantized format.
MXFP4 format:

- 4-bit float weights (E2M1) packed into uint8
- Per-group E8M0 scales with group_size=32
- No global scale (unlike NVFP4)
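To make the layout above concrete, here is a minimal NumPy sketch of MXFP4 dequantization: two E2M1 nibbles per uint8, one E8M0 scale (a bare biased exponent, scale = 2**(e - 127)) per group of 32 values. The function name, the low-nibble-first packing order, and the tensor layout are assumptions for illustration, not vLLM's actual kernel.

```python
import numpy as np

# E2M1 decode table: codes 0-7 are positive, codes 8-15 have the sign bit set.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_mxfp4(packed: np.ndarray, scales_e8m0: np.ndarray,
                  group_size: int = 32) -> np.ndarray:
    """Unpack uint8-packed E2M1 weights and apply per-group E8M0 scales.

    packed:      (rows, cols // 2) uint8, two 4-bit codes per byte
    scales_e8m0: (rows, cols // group_size) uint8 biased exponents
    """
    lo = packed & 0x0F          # assumed: first value in the low nibble
    hi = packed >> 4            # second value in the high nibble
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    vals = E2M1_LUT[codes]
    # E8M0 has no sign or mantissa bits: scale = 2**(exponent - 127).
    scales = np.float32(2.0) ** (scales_e8m0.astype(np.int32) - 127)
    vals = vals.reshape(vals.shape[0], -1, group_size)
    return (vals * scales[..., None]).reshape(vals.shape[0], -1)
```

Note there is no global scale to fold in afterwards, which is the main difference from NVFP4's two-level (per-group FP8 plus per-tensor) scaling.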
Source code in vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_mxfp4.py
__init__ ¶
apply_weights ¶
Source code in vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_mxfp4.py
create_weights ¶
```python
create_weights(
    layer: Module,
    output_partition_sizes: list[int],
    input_size_per_partition: int,
    params_dtype: dtype,
    weight_loader: Callable,
    **kwargs,
)
```
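The parameter shapes this method must register follow directly from the format: the input dimension halves for the packed weight (two E2M1 values per uint8) and shrinks by group_size for the scales. A hypothetical helper illustrating the arithmetic; the actual parameter names and registration calls inside vLLM may differ.

```python
def mxfp4_param_shapes(output_size_per_partition: int,
                       input_size_per_partition: int,
                       group_size: int = 32) -> tuple[tuple[int, int], tuple[int, int]]:
    """Shapes of the packed-weight and scale tensors for a linear layer.

    Hypothetical illustration: names do not mirror vLLM internals.
    """
    assert input_size_per_partition % group_size == 0, \
        "input dim must be divisible by the MXFP4 group size"
    # Two 4-bit E2M1 values packed per uint8 along the input dimension.
    weight_packed = (output_size_per_partition, input_size_per_partition // 2)
    # One E8M0 scale byte per group of 32 input elements.
    weight_scale = (output_size_per_partition, input_size_per_partition // group_size)
    return weight_packed, weight_scale
```

For example, a 4x64 partition packs into a 4x32 uint8 weight tensor with a 4x2 scale tensor; there is no additional global-scale parameter to allocate.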