Overview
Most models are trained using 32-bit or 16-bit floating-point numbers. Quantization converts these weights to 8-bit or even 4-bit integers, trading a small amount of precision for a smaller, faster model.
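The core idea can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor 8-bit quantization, not the implementation used by any particular format; the function names are made up for the example.

```python
# Minimal sketch of symmetric int8 quantization.
# A single scale maps the float range onto integers in [-127, 127].

def quantize_int8(weights):
    # Per-tensor scale: the largest magnitude maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; the rounding error is the accuracy cost.
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)       # integers, 1 byte each
recovered = dequantize(q, scale)        # close to the original floats
```

Real formats refine this idea with per-group scales, asymmetric zero points, and calibration data, but the storage saving comes from exactly this float-to-integer mapping.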
Impact
- Memory: Reduces the RAM/VRAM required to run the model by 4x or more (e.g., fp32 → int8 and fp16 → int4 are both 4x; fp32 → int4 is 8x).
- Speed: Can significantly speed up inference on hardware that supports integer math.
- Accuracy: Usually costs a small, acceptable drop in model quality; the loss grows as bit width shrinks, so 4-bit typically degrades more than 8-bit.
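The memory figures above follow from simple arithmetic. This sketch estimates weight storage for a hypothetical 7B-parameter model (weights only, ignoring activations, KV cache, and runtime overhead):

```python
# Back-of-envelope weight memory: parameters * bits / 8 bytes, in GB.
def weight_memory_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

n = 7e9  # hypothetical 7B-parameter model
fp16 = weight_memory_gb(n, 16)  # 14.0 GB
int4 = weight_memory_gb(n, 4)   # 3.5 GB -> a 4x reduction
```

This is why a model that needs a data-center GPU at full precision can fit on a consumer card once quantized.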
Popular Formats
- GGUF (used with llama.cpp)
- EXL2 (used with ExLlamaV2)
- AWQ (Activation-aware Weight Quantization)