What is quantization?

Quantization shrinks a model by storing its weights and activations at lower numeric precision, for example 8-bit or 4-bit integers instead of 16-bit floats, cutting memory and speeding up inference with minimal quality loss.

Quantization reduces the numeric precision used to represent a model's parameters. A model trained in 16-bit floating point can often be served in 8-bit or even 4-bit integer formats, roughly halving or quartering its memory footprint and accelerating inference, because lower-precision arithmetic is faster and moves less data through memory, which is usually the bottleneck. The art is keeping quality intact: naive rounding loses accuracy, so techniques like per-channel scaling, GPTQ, AWQ, and quantization-aware training preserve the parts of the weight distribution that matter most. Post-training quantization is applied after a model is trained and is the common path for serving open-weight models on commodity GPUs or even laptops; runtimes like llama.cpp and Ollama lean on it heavily. Quantization is what makes it practical to run a capable model locally or to fit a large model on fewer accelerators, and it composes with distillation and LoRA adapters. The main tradeoff is that aggressive 4-bit quantization can degrade reasoning on the hardest prompts, so teams benchmark each quantized variant on their own evals rather than assuming parity with the full-precision model.