Overview

Most models are trained using 32-bit or 16-bit floating-point numbers. Quantization converts these to 8-bit or even 4-bit integers.
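The core idea can be sketched in a few lines. This is a minimal, illustrative example of symmetric 8-bit quantization (the function names and the toy weight values are made up for illustration, not taken from any particular library):

```python
def quantize_int8(weights):
    """Map a list of floats to int8 values plus a scale factor."""
    # The largest-magnitude weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the originals, with small rounding error
```

Each weight is now stored as a single byte plus a shared scale factor, instead of 4 bytes of float data; the rounding step is where the small accuracy loss comes from.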

Impact

  • Memory: Reduces the RAM/VRAM required to run the model, roughly 4x when going from 32-bit floats to 8-bit integers, and more at lower bit-widths.
  • Speed: Can significantly speed up inference on hardware with fast integer math support.
  • Accuracy: Usually causes a small, acceptable drop in model quality; the drop grows as the bit-width shrinks.
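The memory savings follow directly from the bit-width. A rough back-of-the-envelope calculation for the weights alone (a 7B-parameter model is assumed purely for illustration; activations and runtime overhead are ignored):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage for model weights in gigabytes at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7_000_000_000  # hypothetical 7B-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(n, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

Quantizing that model from 32-bit floats to 4-bit integers shrinks the weights from 28 GB to 3.5 GB, which is often the difference between fitting on a consumer GPU and not fitting at all.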

Popular Formats

  • GGUF (used with llama.cpp)
  • EXL2 (used with ExLlamaV2)
  • AWQ (Activation-aware Weight Quantization)

Related Terms