Overview

Most models are trained using 32-bit or 16-bit floating-point numbers. Quantization converts these to 8-bit or even 4-bit integers.
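The core idea can be sketched in a few lines. This is a minimal, illustrative example of symmetric 8-bit quantization (the function names and the toy weight values are made up for illustration, not taken from any particular library):

```python
def quantize_int8(weights):
    """Map a list of floats to int8 values plus a scale factor."""
    # The largest-magnitude weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the originals, with small rounding error
```

Each weight is now stored as a single byte plus a shared scale factor, instead of 4 bytes of float data; the rounding step is where the small accuracy loss comes from.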

Impact

  • Memory: Reduces the RAM/VRAM required to run the model, roughly 4x when going from 32-bit floats to 8-bit integers, and more at lower bit-widths.
  • Speed: Can significantly speed up inference on hardware with fast integer math support.
  • Accuracy: Usually causes a small, acceptable drop in model quality; the drop grows as the bit-width shrinks.
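The memory savings follow directly from the bit-width. A rough back-of-the-envelope calculation for the weights alone (a 7B-parameter model is assumed purely for illustration; activations and runtime overhead are ignored):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage for model weights in gigabytes at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7_000_000_000  # hypothetical 7B-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

Quantizing that model from 32-bit floats to 4-bit integers shrinks the weights from 28 GB to 3.5 GB, which is often the difference between fitting on a consumer GPU and not fitting at all.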

Popular Formats

  • GGUF (used with llama.cpp)
  • EXL2 (used with ExLlamaV2)
  • AWQ (Activation-aware Weight Quantization)

Related Terms