
Model Quantization

A technique for reducing AI model size and computational requirements by using lower-precision numerical representations.

What Is Model Quantization?

Model quantization is a compression technique that reduces the precision of a neural network's weights and activations from standard 32-bit or 16-bit floating point to lower-precision formats such as INT8 (8-bit integer) or INT4 (4-bit integer). This can shrink model size by 2-8x and significantly accelerate inference, often with minimal impact on output quality.
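As a concrete illustration, the mapping to INT8 can be sketched as symmetric per-tensor quantization: each float weight is rounded to an integer in [-127, 127] and recovered by multiplying with a single scale factor. This is a simplified sketch; production systems typically quantize per channel or per group of weights.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integers and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-to-nearest error is at most half a quantization step per weight.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-8
```

Storing `q` instead of `w` uses one byte per weight instead of four, which is where the 4x compression relative to FP32 comes from; packing two INT4 values per byte halves it again.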

Several quantization approaches have emerged. Post-training quantization (PTQ) converts an already-trained model without any retraining. Quantization-aware training (QAT) simulates low-precision arithmetic during training so the model learns to compensate, yielding better accuracy at a given bit width. Advanced post-training methods such as GPTQ and AWQ use a small calibration dataset to identify which weights are most sensitive to precision loss, achieving strong compression with little measurable quality degradation.

Why Quantization Matters for Deployment

Running large language models at full precision requires expensive GPU hardware with substantial memory. Quantization enables organizations to deploy powerful models on more modest infrastructure, including consumer-grade GPUs or even CPUs. A 70-billion-parameter model that normally requires multiple high-end GPUs can run on a single GPU when quantized to 4-bit precision.
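The memory arithmetic behind that claim is straightforward, since weight storage scales linearly with bits per parameter. This rough sketch counts weights only, ignoring activations, the KV cache, and per-group scale overhead:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in GB (weights only;
    ignores activations, KV cache, and quantization scale overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70-billion-parameter model at different precisions:
fp16_gb = model_memory_gb(70e9, 16)  # 140.0 GB -> multiple high-end GPUs
int8_gb = model_memory_gb(70e9, 8)   # 70.0 GB
int4_gb = model_memory_gb(70e9, 4)   # 35.0 GB -> within a single large GPU
```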

Trade-offs and Best Practices

The key trade-off is between compression and quality. INT8 quantization typically preserves over 99% of model quality, while INT4 may introduce more noticeable degradation on complex reasoning tasks. Enterprise deployments should benchmark quantized models against full-precision baselines on their specific use cases to find the optimal balance between cost and performance.
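The widening gap at lower bit widths shows up even in a toy measurement: round-tripping the same weights through 8-bit and 4-bit quantization and comparing reconstruction error. This is a simplified stand-in for real task-level benchmarking, not a substitute for it.

```python
import numpy as np

def quantization_mse(weights: np.ndarray, bits: int) -> float:
    """Mean squared error after a symmetric quantize/dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return float(np.mean((weights - q * scale) ** 2))

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=10_000)
mse_int8 = quantization_mse(w, 8)
mse_int4 = quantization_mse(w, 4)
# The coarser INT4 grid introduces substantially more error than INT8.
assert mse_int4 > mse_int8
```

Low reconstruction error does not guarantee preserved reasoning quality, which is why benchmarking on the actual target workload remains essential.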
