Quantization for LLM Inference

2025-10-02

Large Language Models (LLMs) have transformed natural language processing but remain costly in terms of compute, memory, and energy. Quantization is a central optimization technique that reduces the numerical precision of weights and activations, enabling faster and more memory-efficient inference with only a small loss in accuracy.

This article consolidates core concepts, practical methods, precision formats, calibration strategies, and recent research, with reference to Hugging Face Optimum’s quantization guidelines.



1. Fundamentals

Definition
Quantization converts high-precision values (e.g., FP32) into lower-precision types (FP16, INT8, INT4).
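
As a rough, library-agnostic illustration of the idea (a minimal NumPy sketch, not a production recipe), mapping FP32 values onto an 8-bit integer grid and back looks like this:

    import numpy as np

    # A toy FP32 weight tensor.
    w = np.random.randn(4, 4).astype(np.float32)

    # Symmetric INT8 quantization: one scale derived from the largest magnitude.
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)   # 1 byte per value
    w_dq = w_q.astype(np.float32) * scale                           # dequantized approximation

    print("max abs error:", np.abs(w - w_dq).max())
    print("bytes:", w.nbytes, "->", w_q.nbytes)                     # 4x smaller than FP32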

Motivation

  • Reduce model size and memory usage
  • Accelerate inference
  • Lower power consumption
  • Enable deployment on constrained hardware

Trade-offs

  • Potential accuracy loss
  • Some layers are more sensitive to quantization
  • Calibration is required to determine proper ranges


2. Types of Quantization

  • Post-Training Quantization (PTQ)
    Applied after training. Simple, but may reduce accuracy.

  • Quantization-Aware Training (QAT)
    Simulates quantization during training. Higher accuracy retention, but more costly (see the fake-quantization sketch after this list).

  • Weight-Only Quantization
    Weights in INT4/INT8, activations remain in higher precision.

  • Weight + Activation Quantization
    Both quantized, providing maximum efficiency, but harder to calibrate.

  • Mixed Precision Quantization
    Assigns different bit-widths per layer based on sensitivity.
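
To make the PTQ/QAT distinction concrete, here is a minimal, hypothetical PyTorch sketch of the "fake quantization" step that QAT inserts into the forward pass: values are rounded to the integer grid but kept in floating point, and a straight-through estimator lets gradients flow through the rounding.

    import torch

    def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        """Simulate symmetric quantization in the forward pass (QAT-style)."""
        qmax = 2 ** (num_bits - 1) - 1                          # e.g. 127 for 8 bits
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
        # Straight-through estimator: forward uses x_q, backward sees the identity.
        return x + (x_q - x).detach()

    # During QAT, weights/activations pass through fake_quantize so the model
    # learns to tolerate rounding; at export time real INT8 ops replace it.
    w = torch.randn(16, 16, requires_grad=True)
    loss = fake_quantize(w).sum()
    loss.backward()                                              # gradients flow via STE
    print(w.grad.shape)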



3. Precision Formats

Precision    | Memory Reduction | Typical Usage                     | Notes
FP32         | — (baseline)     | Training baseline                 | Highest accuracy, largest size
FP16/BF16    | ~50%             | Training & inference              | Default in many frameworks
INT8         | ~4× smaller      | Production inference              | Widely supported
INT4         | ~8× smaller      | LLM inference (current standard)  | Balance of efficiency and accuracy
≤2-bit       | —                | Experimental / research only      | Instability issues

Observation: Many open-source LLaMA-family and GPT-style weights are distributed in pre-quantized formats such as GGUF (e.g., q4_k_m), NF4, and MXFP4, making them feasible to run on consumer GPUs.



4. Symmetric vs Affine Quantization

  • Affine (Asymmetric):
    x ≈ S · (x_q − Z), with scale S and zero-point Z.
    Common for activations, to align their ranges with the integer representation.

  • Symmetric:
    Zero-point fixed at 0, so x ≈ S · x_q. Simpler, and often used for weights.
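
A small numerical sketch (NumPy, illustrative only) of how the two schemes derive their parameters for a skewed, activation-like distribution:

    import numpy as np

    x = np.random.exponential(scale=1.0, size=1024).astype(np.float32)  # skewed, mostly >= 0

    # Affine / asymmetric: use the full [0, 255] range of uint8.
    x_min, x_max = float(x.min()), float(x.max())
    S = (x_max - x_min) / 255.0
    Z = round(-x_min / S)                       # zero-point: the integer that represents 0.0
    x_q = np.clip(np.round(x / S) + Z, 0, 255).astype(np.uint8)
    x_affine = (x_q.astype(np.float32) - Z) * S

    # Symmetric: zero-point fixed at 0, range [-127, 127] of int8.
    S_sym = np.abs(x).max() / 127.0
    x_sym = np.clip(np.round(x / S_sym), -127, 127).astype(np.int8).astype(np.float32) * S_sym

    print("affine MSE:   ", np.mean((x - x_affine) ** 2))
    print("symmetric MSE:", np.mean((x - x_sym) ** 2))   # wastes half the int8 range here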



5. Per-Tensor vs Per-Channel

  • Per-Tensor: Single scale/zero-point across entire tensor. Lower memory, less accurate.
  • Per-Channel: Separate parameters per channel (e.g., convolution output channel). Higher accuracy at small additional cost.
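
A minimal PyTorch sketch of the difference for a linear layer's weight matrix, where per-channel means one scale per output row (the outlier channel is artificial, for illustration):

    import torch

    w = torch.randn(64, 128)                      # [out_features, in_features]
    w[0] *= 50.0                                  # one channel with a much larger range

    def quantize(w, scale):
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    # Per-tensor: a single scale, dominated by the outlier channel.
    s_tensor = w.abs().max() / 127.0
    err_tensor = (w - quantize(w, s_tensor)).pow(2).mean()

    # Per-channel: one scale per output channel (row).
    s_channel = w.abs().amax(dim=1, keepdim=True) / 127.0   # shape [64, 1], broadcasts
    err_channel = (w - quantize(w, s_channel)).pow(2).mean()

    print(f"per-tensor MSE:  {err_tensor.item():.6f}")
    print(f"per-channel MSE: {err_channel.item():.6f}")     # typically much lower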


6. Calibration and Range Estimation

Quantization requires mapping real-valued activations to integer ranges. This is handled via calibration:

  • Dynamic Quantization: Range estimated at runtime. Flexible but adds overhead.
  • Static Quantization: Calibration dataset used to pre-compute ranges (e.g., ~200 examples). Faster inference.
  • QAT: Model learns quantization effects during training. Most accurate.

Calibration Strategies

  • Min–max
  • Moving average
  • Histogram-based (entropy, MSE, percentile)
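
An illustrative sketch (not tied to any specific toolkit) of how min–max and percentile calibration differ only in how the clipping range is chosen from the calibration data:

    import numpy as np

    # Pretend these are activations collected from ~200 calibration examples.
    acts = np.concatenate([np.random.randn(200, 512).ravel(),
                           np.array([40.0, -35.0])])          # a few rare outliers

    def int8_range(lo, hi):
        scale = (hi - lo) / 255.0
        zero_point = int(round(-lo / scale))
        return scale, zero_point

    # Min-max calibration: outliers stretch the range and coarsen the grid.
    s_mm, z_mm = int8_range(acts.min(), acts.max())

    # Percentile calibration: clip the extreme tails (here 0.01% on each side).
    lo, hi = np.percentile(acts, [0.01, 99.99])
    s_pc, z_pc = int8_range(lo, hi)

    print("min-max scale:   ", s_mm)
    print("percentile scale:", s_pc)   # finer resolution for the bulk of values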


7. Practical Workflow (INT8 Example)

  1. Select operators to quantize (linear, attention projections).
  2. Apply dynamic quantization to evaluate feasibility.
  3. Perform static quantization with calibration passes.
  4. Choose range estimation method (histogram, percentile).
  5. Convert observers into quantized operators.
  6. Validate accuracy and latency. If insufficient, switch to QAT or refine calibration.
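
For step 2, PyTorch's built-in dynamic quantization (recent versions, CPU execution) is a quick feasibility check; a minimal sketch, where the model is a stand-in for the layers selected in step 1:

    import torch
    import torch.nn as nn

    # Stand-in for a transformer block's projections; any nn.Module works.
    model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

    # Dynamic INT8 quantization: weights stored as INT8, activation ranges
    # estimated at runtime, so no calibration dataset is needed.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    with torch.no_grad():
        print(quantized(x).shape)          # same outputs, smaller Linear weights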


8. Tools and Frameworks

  • Hugging Face Optimum

    • optimum.onnxruntime – ONNX quantization and inference
    • optimum.intel – Intel hardware optimization
    • optimum.fx – PyTorch FX graph-based quantization
    • optimum.gptq – GPTQ quantization for LLMs
  • bitsandbytes – PyTorch INT8/INT4 quantization (loading sketch after this list)

  • GPTQ / AWQ – Post-training methods for LLMs

  • GGUF (llama.cpp) – Efficient runtime format with diverse quantization schemes

  • ONNX Runtime / TensorRT-LLM – Production-grade inference frameworks
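
As a usage sketch for the bitsandbytes route via Transformers (the checkpoint name is a placeholder; a GPU with bitsandbytes installed is assumed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit weight-only quantization at load time (NF4 data type, BF16 compute).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_id = "meta-llama/Llama-2-7b-hf"      # placeholder: any causal LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                      # place layers on available GPUs/CPU
    )

    inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))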



9. Recent Research

  • Outlier-Aware Weight Quantization (OWQ): Preserve precision for rare outlier weights
  • SliM-LLM: Bit allocation by weight salience
  • VPTQ: Vector quantization, enabling 2-bit compression
  • QuantX: Hardware-aware framework targeting sub-3-bit precision

These innovations aim to push bit-widths below INT4 while retaining accuracy and efficiency.



10. System-Level Considerations

  • KV-Cache Quantization: Saves memory in long-context inference; requires careful tuning (see the sketch after this list).
  • Kernel Support: True speedups depend on backend INT4/INT8 matmul kernels (llama.cpp, vLLM, TensorRT).
  • Hybrid CPU–GPU Offload: GGUF enables CPU–GPU memory partitioning for larger models.
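
To illustrate the KV-cache point, a minimal, hypothetical sketch of storing cached keys/values as INT8 with per-head scales (production systems fuse this into their attention kernels):

    import torch

    def quantize_kv(kv: torch.Tensor):
        """kv: [batch, heads, seq_len, head_dim] -> INT8 cache plus per-head scales."""
        scales = kv.abs().amax(dim=(2, 3), keepdim=True) / 127.0   # one scale per head
        kv_q = torch.clamp(torch.round(kv / scales), -127, 127).to(torch.int8)
        return kv_q, scales

    def dequantize_kv(kv_q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
        return kv_q.float() * scales

    k = torch.randn(1, 32, 4096, 128)                  # ~64 MB in FP32 for one layer
    k_q, s = quantize_kv(k)                            # ~16 MB plus a handful of scales
    print("max abs error:", (k - dequantize_kv(k_q, s)).abs().max().item())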


11. Key Figures

  • Memory reduction: 75–90% with INT4
  • LLaMA-2 7B: 13GB (FP16) → 4GB (INT4)
  • Inference speedup: 2–3× in practice
  • Accuracy degradation: typically < 1–2% with calibration
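
The LLaMA-2 7B figures follow from back-of-the-envelope arithmetic (the ~15% INT4 overhead for scales and zero-points is an assumption):

    params = 7e9
    fp16_gb = params * 2 / 1e9             # 2 bytes per parameter -> ~14 GB (about 13 GiB)
    int4_gb = params * 0.5 / 1e9 * 1.15    # 0.5 bytes per parameter + ~15% assumed overhead -> ~4 GB
    print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")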


Conclusion

Quantization is a cornerstone of modern LLM deployment. By combining techniques such as PTQ, QAT, weight-only quantization, and mixed precision, and by leveraging frameworks like Optimum, GPTQ, and GGUF, large models can be deployed in real-world environments at reduced cost and latency.

The future of quantization lies in sub-4-bit methods, smarter calibration, and hardware-aware implementations, making LLMs increasingly efficient and accessible.