Quantization for LLM Inference
Large Language Models (LLMs) have transformed natural language processing but remain costly in terms of compute, memory, and energy. Quantization is a central optimization technique that reduces the bit precision of weights and activations, enabling faster and more memory-efficient inference with little loss of accuracy.
This article consolidates core concepts, practical methods, precision formats, calibration strategies, and recent research, with reference to Hugging Face Optimum’s quantization guidelines.
1. Fundamentals
Definition
Quantization converts high-precision values (e.g., FP32) into lower-precision types (FP16, INT8, INT4).
Motivation
- Reduce model size and memory usage
 - Accelerate inference
 - Lower power consumption
 - Enable deployment on constrained hardware
 
Trade-offs
- Potential accuracy loss
 - Some layers are more sensitive to quantization
 - Calibration is required to determine proper ranges
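
To make the memory motivation concrete, the following back-of-the-envelope Python sketch estimates weight storage for a 7B-parameter model at different precisions (the parameter count is an illustrative assumption; activations, the KV cache, and runtime overhead are ignored).

```python
# Rough weight-memory estimate for a 7B-parameter model at different precisions.
# Illustrative only: ignores activations, the KV cache, and runtime overhead.
PARAMS = 7_000_000_000  # assumed parameter count (LLaMA-2 7B class)

bits_per_weight = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}

for fmt, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{fmt:>9}: ~{gib:.1f} GiB")

# Approximate output: FP32 ~26.1 GiB, FP16/BF16 ~13.0 GiB, INT8 ~6.5 GiB, INT4 ~3.3 GiB
```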
 
2. Types of Quantization
- Post-Training Quantization (PTQ): Applied after training. Simple, but may reduce accuracy.
- Quantization-Aware Training (QAT): Simulates quantization during training (see the sketch after this list). Higher accuracy retention, but costly.
- Weight-Only Quantization: Weights in INT4/INT8, activations remain in higher precision.
- Weight + Activation Quantization: Both quantized, providing maximum efficiency, but harder to calibrate.
- Mixed-Precision Quantization: Assigns different bit-widths per layer based on sensitivity.
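
To make the PTQ/QAT distinction concrete, here is a minimal PyTorch sketch (the `fake_quantize` helper is illustrative, not a library API): PTQ rounds trained weights once after training, while QAT runs the forward pass on fake-quantized weights and uses a straight-through estimator so gradients still reach the full-precision weights.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric fake quantization: quantize, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                 # e.g., 127 for INT8
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_q * scale                         # float tensor carrying the quantization error

w = torch.randn(4096, 4096, requires_grad=True)

# PTQ: quantize the trained weights once, after training.
w_ptq = fake_quantize(w.detach())

# QAT: the forward pass sees quantized values, but the straight-through
# estimator lets gradients flow to the underlying full-precision weights.
w_qat = w + (fake_quantize(w) - w).detach()
loss = w_qat.square().mean()
loss.backward()                                # w.grad is populated despite the rounding
```
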
3. Precision Formats
| Precision | Size vs FP32 | Typical Usage | Notes |
|---|---|---|---|
| FP32 | Baseline | Training baseline | Highest accuracy, largest size |
| FP16/BF16 | ~2× smaller | Training & inference | Default in many frameworks |
| INT8 | ~4× smaller | Production inference | Widely supported |
| INT4 | ~8× smaller | Common for LLM inference | Good balance of efficiency and accuracy |
| ≤2-bit | ~16× smaller or more | Research only | Experimental; instability issues |
Observation: Many open-source LLaMA- and GPT-family checkpoints are distributed pre-quantized, for example as GGUF files (schemes such as q4_k_m) or as NF4 weights for bitsandbytes, making them feasible to run on consumer GPUs.
4. Symmetric vs Affine Quantization
- Affine (Asymmetric): \( x \approx S \cdot (x_q - Z) \), with scale \(S\) and zero-point \(Z\). Common for activations, since the zero-point aligns an asymmetric real range with the integer range (see the sketch after this list).
- Symmetric: Zero-point fixed at 0, so \( x \approx S \cdot x_q \). Simpler, and often used for weights.
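
A minimal NumPy sketch of both schemes (the helper functions are illustrative, not from any library):

```python
import numpy as np

def affine_quantize(x: np.ndarray, bits: int = 8):
    """Asymmetric (affine) quantization: x ≈ S * (x_q - Z)."""
    qmin, qmax = 0, 2**bits - 1                      # unsigned range, e.g., 0..255
    S = (x.max() - x.min()) / (qmax - qmin)
    Z = round(qmin - x.min() / S)                    # zero-point aligns x.min() with qmin
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.uint8)
    return x_q, S, Z

def symmetric_quantize(x: np.ndarray, bits: int = 8):
    """Symmetric quantization: x ≈ S * x_q, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                       # signed range, e.g., -127..127
    S = np.abs(x).max() / qmax
    x_q = np.clip(np.round(x / S), -qmax, qmax).astype(np.int8)
    return x_q, S

x = np.random.randn(1024).astype(np.float32)
x_q, S, Z = affine_quantize(x)
x_hat = S * (x_q.astype(np.float32) - Z)             # dequantize
print("max abs error:", np.abs(x - x_hat).max())
```

Note that the affine variant uses the full unsigned integer range, while the symmetric variant keeps zero exactly representable at the cost of one unused code.
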
5. Per-Tensor vs Per-Channel
- Per-Tensor: Single scale/zero-point across the entire tensor. Lower memory, less accurate.
 - Per-Channel: Separate parameters per channel (e.g., per output channel of a convolution, or per output row of a linear weight matrix). Higher accuracy at small additional cost (see the sketch after this list).
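
The effect is easy to see on a weight matrix where one output channel has a much larger dynamic range than the rest; a short illustrative sketch using symmetric INT8:

```python
import numpy as np

# Weight matrix of a linear layer: rows = output channels, columns = inputs.
W = np.random.randn(8, 1024).astype(np.float32)
W[0] *= 20.0                                   # one channel with a much larger range
qmax = 127                                     # symmetric INT8

# Per-tensor: one scale for the whole matrix; the large channel stretches it.
s_tensor = np.abs(W).max() / qmax
err_tensor = np.abs(W - np.clip(np.round(W / s_tensor), -qmax, qmax) * s_tensor).mean()

# Per-channel: one scale per output row; the small channels keep a tight range.
s_channel = np.abs(W).max(axis=1, keepdims=True) / qmax
err_channel = np.abs(W - np.clip(np.round(W / s_channel), -qmax, qmax) * s_channel).mean()

print(f"mean abs error  per-tensor: {err_tensor:.5f}   per-channel: {err_channel:.5f}")
```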
 
6. Calibration and Range Estimation
Quantization requires mapping real-valued activations to integer ranges. This is handled via calibration:
- Dynamic Quantization: Range estimated at runtime. Flexible but adds overhead.
 - Static Quantization: Calibration dataset used to pre-compute ranges (e.g., ~200 examples). Faster inference.
 - QAT: Model learns quantization effects during training. Most accurate.
 
Calibration Strategies
- Min–max
 - Moving average
 - Histogram-based (entropy, MSE, percentile)
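
The choice of strategy matters mostly when activations contain outliers. A small sketch (illustrative, not any framework's observer API) contrasting min–max with percentile-based range estimation:

```python
import numpy as np

# Activations gathered over a calibration set, with a few extreme outliers.
acts = np.concatenate([np.random.randn(10_000), np.array([40.0, -35.0])])

# Min–max: the range covers the outliers, so most values land in a few bins.
lo_mm, hi_mm = acts.min(), acts.max()

# Percentile: clip the tails to keep resolution for the bulk of the distribution
# (the 0.1 / 99.9 percentiles here are an example choice, not a universal setting).
lo_p, hi_p = np.percentile(acts, [0.1, 99.9])

print(f"min-max range:    [{lo_mm:6.2f}, {hi_mm:6.2f}]")
print(f"percentile range: [{lo_p:6.2f}, {hi_p:6.2f}]")
```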
 
7. Practical Workflow (INT8 Example)
1. Select the operators to quantize (linear layers, attention projections).
2. Apply dynamic quantization to evaluate feasibility (see the sketch after this list).
3. Perform static quantization with calibration passes.
4. Choose a range-estimation method (histogram, percentile).
5. Convert observers into quantized operators.
6. Validate accuracy and latency; if insufficient, switch to QAT or refine calibration.
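
As a quick feasibility check for step 2, PyTorch's built-in dynamic quantization can be applied to a small causal LM. A minimal sketch, assuming the `facebook/opt-125m` checkpoint purely as an example (static quantization with calibration follows the same pattern, with observers inserted before conversion):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"                 # small model used only as an example
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dynamic INT8 quantization of every nn.Linear module:
# weights are quantized once, activation ranges are estimated at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Quantization reduces", return_tensors="pt")
with torch.no_grad():
    out = quantized.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```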
 
8. Tools and Frameworks
- Hugging Face Optimum
  - optimum.onnxruntime – ONNX quantization and inference
  - optimum.intel – Intel hardware optimization
  - optimum.fx – PyTorch FX graph-based quantization
  - optimum.gptq – GPTQ quantization for LLMs
- bitsandbytes – PyTorch INT8/INT4 quantization (see the loading sketch after this list)
- GPTQ / AWQ – Post-training quantization methods for LLMs
- GGUF (llama.cpp) – Efficient runtime format with diverse quantization schemes
- ONNX Runtime / TensorRT-LLM – Production-grade inference frameworks
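
As an example of weight-only 4-bit loading through the Transformers + bitsandbytes integration, a minimal sketch (the checkpoint name is a placeholder; a CUDA GPU is required):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weight-only 4-bit (NF4) loading via bitsandbytes; requires a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)
print(model.get_memory_footprint() / 1024**3, "GiB")
```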
 
9. Research Trends (2024–2025)
- Outlier-Aware Weight Quantization (OWQ): Preserve precision for rare outlier weights
 - SliM-LLM: Bit allocation by weight salience
 - VPTQ: Vector quantization, enabling 2-bit compression
 - QuantX: Hardware-aware framework targeting sub-3-bit precision
 
These methods aim to push bit-widths below INT4 while retaining accuracy and efficiency.
10. System-Level Considerations
- KV-Cache Quantization: Saves memory in long-context inference; requires careful tuning.
 - Kernel Support: True speedups depend on backend INT4/INT8 matmul kernels (llama.cpp, vLLM, TensorRT).
 - Hybrid CPU–GPU Offload: GGUF enables CPU–GPU memory partitioning for larger models.
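
To see why KV-cache quantization matters at long context, here is a back-of-the-envelope sketch assuming a LLaMA-2-7B-like configuration (32 layers, 32 heads, head dimension 128):

```python
# KV-cache memory ≈ 2 (K and V) * layers * heads * head_dim * seq_len * bytes_per_value.
# Configuration assumed for illustration (LLaMA-2 7B-like).
layers, heads, head_dim = 32, 32, 128

def kv_cache_gib(seq_len: int, bytes_per_value: float) -> float:
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value / 1024**3

for seq_len in (4_096, 32_768):
    fp16 = kv_cache_gib(seq_len, 2)    # FP16/BF16 cache
    int8 = kv_cache_gib(seq_len, 1)    # INT8-quantized cache
    print(f"seq_len={seq_len:>6}: FP16 ~{fp16:.1f} GiB, INT8 ~{int8:.1f} GiB")
```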
 
11. Key Figures
- Memory reduction: 75–90% with INT4
 - LLaMA-2 7B: 13GB (FP16) → 4GB (INT4)
 - Inference speedup: 2–3× in practice
 - Accuracy degradation: typically < 1–2% with calibration
 
Conclusion
Quantization is a cornerstone of modern LLM deployment. By combining techniques such as PTQ, QAT, weight-only quantization, and mixed precision, and by leveraging frameworks like Optimum, GPTQ, and GGUF, large models can be deployed in real-world environments at reduced cost and latency.
The future of quantization lies in sub-4-bit methods, smarter calibration, and hardware-aware implementations, making LLMs increasingly efficient and accessible.