Quantization for LLM Inference

2025-10-02

Large Language Models (LLMs) have transformed natural language processing but remain costly in terms of compute, memory, and energy. Quantization is a central optimization technique that reduces the numerical precision of weights and activations, enabling faster and more memory-efficient inference with only a small loss in accuracy.

This article consolidates core concepts, practical methods, precision formats, calibration strategies, and recent research, with reference to Hugging Face Optimum’s quantization guidelines.



1. Fundamentals

Definition
Quantization converts high-precision values (e.g., FP32) into lower-precision types (FP16, INT8, INT4).
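
As a rough, library-agnostic illustration of the idea (a minimal NumPy sketch, not a production recipe), mapping FP32 values onto an 8-bit integer grid and back looks like this:

    import numpy as np

    # A toy FP32 weight tensor.
    w = np.random.randn(4, 4).astype(np.float32)

    # Symmetric INT8 quantization: one scale derived from the largest magnitude.
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)   # 1 byte per value
    w_dq = w_q.astype(np.float32) * scale                           # dequantized approximation

    print("max abs error:", np.abs(w - w_dq).max())
    print("bytes:", w.nbytes, "->", w_q.nbytes)                     # 4x smaller than FP32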

Motivation

  • Reduce model size and memory usage
  • Accelerate inference
  • Lower power consumption
  • Enable deployment on constrained hardware

Trade-offs

  • Potential accuracy loss
  • Some layers are more sensitive to quantization
  • Calibration is required to determine proper ranges


2. Types of Quantization

  • Post-Training Quantization (PTQ)
    Applied after training. Simple, but may reduce accuracy.

  • Quantization-Aware Training (QAT)
    Simulates quantization during training. Higher accuracy retention, but more costly (see the fake-quantization sketch after this list).

  • Weight-Only Quantization
    Weights in INT4/INT8, activations remain in higher precision.

  • Weight + Activation Quantization
    Both quantized, providing maximum efficiency, but harder to calibrate.

  • Mixed Precision Quantization
    Assigns different bit-widths per layer based on sensitivity.
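
To make the PTQ/QAT distinction concrete, here is a minimal, hypothetical PyTorch sketch of the "fake quantization" step that QAT inserts into the forward pass: values are rounded to the integer grid but kept in floating point, and a straight-through estimator lets gradients flow through the rounding.

    import torch

    def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        """Simulate symmetric quantization in the forward pass (QAT-style)."""
        qmax = 2 ** (num_bits - 1) - 1                          # e.g. 127 for 8 bits
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
        # Straight-through estimator: forward uses x_q, backward sees the identity.
        return x + (x_q - x).detach()

    # During QAT, weights/activations pass through fake_quantize so the model
    # learns to tolerate rounding; at export time real INT8 ops replace it.
    w = torch.randn(16, 16, requires_grad=True)
    loss = fake_quantize(w).sum()
    loss.backward()                                              # gradients flow via STE
    print(w.grad.shape)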



3. Precision Formats

Precision    | Memory Reduction | Typical Usage                     | Notes
FP32         | — (baseline)     | Training baseline                 | Highest accuracy, largest size
FP16/BF16    | ~50%             | Training & inference              | Default in many frameworks
INT8         | ~4× smaller      | Production inference              | Widely supported
INT4         | ~8× smaller      | LLM inference (current standard)  | Balance of efficiency and accuracy
≤2-bit       | —                | Experimental / research only      | Instability issues

Observation: Many open-source LLaMA-family and GPT-style weights are distributed in pre-quantized formats such as GGUF (e.g., q4_k_m), NF4, and MXFP4, making them feasible to run on consumer GPUs.



4. Symmetric vs Affine Quantization

  • Affine (Asymmetric):
    x ≈ S · (x_q − Z), with scale S and zero-point Z.
    Common for activations, to align their ranges with the integer representation.

  • Symmetric:
    Zero-point fixed at 0, so x ≈ S · x_q. Simpler, and often used for weights.
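
A small numerical sketch (NumPy, illustrative only) of how the two schemes derive their parameters for a skewed, activation-like distribution:

    import numpy as np

    x = np.random.exponential(scale=1.0, size=1024).astype(np.float32)  # skewed, mostly >= 0

    # Affine / asymmetric: use the full [0, 255] range of uint8.
    x_min, x_max = float(x.min()), float(x.max())
    S = (x_max - x_min) / 255.0
    Z = round(-x_min / S)                       # zero-point: the integer that represents 0.0
    x_q = np.clip(np.round(x / S) + Z, 0, 255).astype(np.uint8)
    x_affine = (x_q.astype(np.float32) - Z) * S

    # Symmetric: zero-point fixed at 0, range [-127, 127] of int8.
    S_sym = np.abs(x).max() / 127.0
    x_sym = np.clip(np.round(x / S_sym), -127, 127).astype(np.int8).astype(np.float32) * S_sym

    print("affine MSE:   ", np.mean((x - x_affine) ** 2))
    print("symmetric MSE:", np.mean((x - x_sym) ** 2))   # wastes half the int8 range here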



5. Per-Tensor vs Per-Channel

  • Per-Tensor: Single scale/zero-point across entire tensor. Lower memory, less accurate.
  • Per-Channel: Separate parameters per channel (e.g., convolution output channel). Higher accuracy at small additional cost.
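
A minimal PyTorch sketch of the difference for a linear layer's weight matrix, where per-channel means one scale per output row (the outlier channel is artificial, for illustration):

    import torch

    w = torch.randn(64, 128)                      # [out_features, in_features]
    w[0] *= 50.0                                  # one channel with a much larger range

    def quantize(w, scale):
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    # Per-tensor: a single scale, dominated by the outlier channel.
    s_tensor = w.abs().max() / 127.0
    err_tensor = (w - quantize(w, s_tensor)).pow(2).mean()

    # Per-channel: one scale per output channel (row).
    s_channel = w.abs().amax(dim=1, keepdim=True) / 127.0   # shape [64, 1], broadcasts
    err_channel = (w - quantize(w, s_channel)).pow(2).mean()

    print(f"per-tensor MSE:  {err_tensor.item():.6f}")
    print(f"per-channel MSE: {err_channel.item():.6f}")     # typically much lower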


6. Calibration and Range Estimation

Quantization requires mapping real-valued activations to integer ranges. This is handled via calibration:

  • Dynamic Quantization: Range estimated at runtime. Flexible but adds overhead.
  • Static Quantization: Calibration dataset used to pre-compute ranges (e.g., ~200 examples). Faster inference.
  • QAT: Model learns quantization effects during training. Most accurate.

Calibration Strategies

  • Min–max
  • Moving average
  • Histogram-based (entropy, MSE, percentile)
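
An illustrative sketch (not tied to any specific toolkit) of how min–max and percentile calibration differ only in how the clipping range is chosen from the calibration data:

    import numpy as np

    # Pretend these are activations collected from ~200 calibration examples.
    acts = np.concatenate([np.random.randn(200, 512).ravel(),
                           np.array([40.0, -35.0])])          # a few rare outliers

    def int8_range(lo, hi):
        scale = (hi - lo) / 255.0
        zero_point = int(round(-lo / scale))
        return scale, zero_point

    # Min-max calibration: outliers stretch the range and coarsen the grid.
    s_mm, z_mm = int8_range(acts.min(), acts.max())

    # Percentile calibration: clip the extreme tails (here 0.01% on each side).
    lo, hi = np.percentile(acts, [0.01, 99.99])
    s_pc, z_pc = int8_range(lo, hi)

    print("min-max scale:   ", s_mm)
    print("percentile scale:", s_pc)   # finer resolution for the bulk of values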


7. Practical Workflow (INT8 Example)

  1. Select operators to quantize (linear, attention projections).
  2. Apply dynamic quantization to evaluate feasibility.
  3. Perform static quantization with calibration passes.
  4. Choose range estimation method (histogram, percentile).
  5. Convert observers into quantized operators.
  6. Validate accuracy and latency. If insufficient, switch to QAT or refine calibration.
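
For step 2, PyTorch's built-in dynamic quantization (recent versions, CPU execution) is a quick feasibility check; a minimal sketch, where the model is a stand-in for the layers selected in step 1:

    import torch
    import torch.nn as nn

    # Stand-in for a transformer block's projections; any nn.Module works.
    model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

    # Dynamic INT8 quantization: weights stored as INT8, activation ranges
    # estimated at runtime, so no calibration dataset is needed.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    with torch.no_grad():
        print(quantized(x).shape)          # same outputs, smaller Linear weights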


8. Tools and Frameworks

  • Hugging Face Optimum

    • optimum.onnxruntime – ONNX quantization and inference
    • optimum.intel – Intel hardware optimization
    • optimum.fx – PyTorch FX graph-based quantization
    • optimum.gptq – GPTQ quantization for LLMs
  • bitsandbytes – PyTorch INT8/INT4 quantization (loading sketch after this list)

  • GPTQ / AWQ – Post-training methods for LLMs

  • GGUF (llama.cpp) – Efficient runtime format with diverse quantization schemes

  • ONNX Runtime / TensorRT-LLM – Production-grade inference frameworks
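
As a usage sketch for the bitsandbytes route via Transformers (the checkpoint name is a placeholder; a GPU with bitsandbytes installed is assumed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit weight-only quantization at load time (NF4 data type, BF16 compute).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_id = "meta-llama/Llama-2-7b-hf"      # placeholder: any causal LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                      # place layers on available GPUs/CPU
    )

    inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))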



9. Recent Research

  • Outlier-Aware Weight Quantization (OWQ): Preserve precision for rare outlier weights
  • SliM-LLM: Bit allocation by weight salience
  • VPTQ: Vector quantization, enabling 2-bit compression
  • QuantX: Hardware-aware framework targeting sub-3-bit precision

These innovations aim to push bit-widths below INT4 while retaining accuracy and efficiency.



10. System-Level Considerations

  • KV-Cache Quantization: Saves memory in long-context inference; requires careful tuning (see the sketch after this list).
  • Kernel Support: True speedups depend on backend INT4/INT8 matmul kernels (llama.cpp, vLLM, TensorRT).
  • Hybrid CPU–GPU Offload: GGUF enables CPU–GPU memory partitioning for larger models.
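
To illustrate the KV-cache point, a minimal, hypothetical sketch of storing cached keys/values as INT8 with per-head scales (production systems fuse this into their attention kernels):

    import torch

    def quantize_kv(kv: torch.Tensor):
        """kv: [batch, heads, seq_len, head_dim] -> INT8 cache plus per-head scales."""
        scales = kv.abs().amax(dim=(2, 3), keepdim=True) / 127.0   # one scale per head
        kv_q = torch.clamp(torch.round(kv / scales), -127, 127).to(torch.int8)
        return kv_q, scales

    def dequantize_kv(kv_q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
        return kv_q.float() * scales

    k = torch.randn(1, 32, 4096, 128)                  # ~64 MB in FP32 for one layer
    k_q, s = quantize_kv(k)                            # ~16 MB plus a handful of scales
    print("max abs error:", (k - dequantize_kv(k_q, s)).abs().max().item())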


11. Key Figures

  • Memory reduction: 75–90% with INT4
  • LLaMA-2 7B: 13GB (FP16) → 4GB (INT4)
  • Inference speedup: 2–3× in practice
  • Accuracy degradation: typically < 1–2% with calibration
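
The LLaMA-2 7B figures follow from back-of-the-envelope arithmetic (the ~15% INT4 overhead for scales and zero-points is an assumption):

    params = 7e9
    fp16_gb = params * 2 / 1e9             # 2 bytes per parameter -> ~14 GB (about 13 GiB)
    int4_gb = params * 0.5 / 1e9 * 1.15    # 0.5 bytes per parameter + ~15% assumed overhead -> ~4 GB
    print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")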


Conclusion

Quantization is a cornerstone of modern LLM deployment. By combining techniques such as PTQ, QAT, weight-only quantization, and mixed precision, and by leveraging frameworks like Optimum, GPTQ, and GGUF, large models can be deployed in real-world environments at reduced cost and latency.

The future of quantization lies in sub-4-bit methods, smarter calibration, and hardware-aware implementations, making LLMs increasingly efficient and accessible.