Skip to content
C Codeloom
LLMs

LLM Quantization Explained

How quantization shrinks LLMs to run on smaller hardware, the math behind 8-bit and 4-bit weights, and the trade-offs between speed, memory, and quality.

·4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • What quantization actually does
  • How 8-bit and 4-bit weights work
  • Common methods like GPTQ and AWQ
  • Quality vs memory trade-offs
  • When quantization is a bad fit

Prerequisites

  • Basic Python familiarity

Quantization is the technique that lets a 70-billion-parameter model run on a single consumer GPU instead of needing four datacenter cards. It works by storing model weights in fewer bits per number. This post unpacks how that is possible and what you give up in return.

What quantization really is

A model weight is normally a 16-bit or 32-bit floating-point number. Most of the precision is wasted; the values fall in a narrow range, and the model is surprisingly tolerant of approximations. Quantization stores each weight in fewer bits, typically 8 or 4, by mapping the original float range onto a small set of discrete levels.

At inference time, the quantized weights are read from memory and dequantized to a higher precision just before each matrix multiply, or the multiply happens in low precision directly with custom kernels. Either way, the memory footprint shrinks proportionally.

Mental model

Think of a paint store with thousands of color shades. Quantization reduces the palette to 16 or 256 colors. Each original shade gets mapped to its nearest available color. Pictures look almost the same from a distance, but you cannot reproduce fine gradients.

fp16 weight  ->  scale + zero_point
 |                  |
 v                  v
quantize ->  int4 / int8 value
 |
 v
store in compact memory
 |
 v
dequantize at use time
Weight quantization mapping

Hands-on example

Loading a quantized model with Hugging Face Transformers is straightforward.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                        bnb_4bit_compute_dtype="bfloat16")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

NF4 is a 4-bit format designed for the normal distribution of model weights. GPTQ and AWQ are post-training methods that quantize more carefully by minimizing the error on a small calibration set. Both ship pre-quantized weights you can download directly.

For serving, llama.cpp uses GGUF format with k-quants like Q4_K_M, Q5_K_M, and Q6_K. These offer a quality dial: lower numbers mean smaller files and faster CPU inference, higher numbers preserve more accuracy.

Trade-offs

Memory savings are dramatic. A 70B model in fp16 needs about 140 GB. In int4 it needs about 35 GB. That difference decides whether the model fits on one card or four.

Latency improvements depend on the kernel. Naively dequantizing on each matmul can be slower than fp16. Fused int4 kernels, like those in ExLlamaV2 or AWQ, deliver real speedups on supported hardware.

Quality degrades smoothly with bit width. 8-bit is nearly lossless on most tasks. 4-bit shows small but measurable drops on reasoning and code. 3-bit and below start to break down. Always benchmark on your own evaluation set.

Long-context performance suffers more than short-context performance. The accumulated rounding errors over thousands of tokens can amplify hallucinations.

Practical tips

Pick the highest bit width your hardware can fit. There is no point quantizing more aggressively than needed. If you can run 8-bit, do not jump straight to 4-bit.

Use a calibration-based method, like GPTQ or AWQ, over a naive round-to-nearest. The accuracy difference at 4-bit is significant, and the one-time calibration cost is small.

Benchmark with your real prompts, not generic perplexity. Perplexity is a smooth metric; tasks like math, code, and instruction following drop sharply if the quantization is bad.

Match compute dtype to your GPU. On Ampere or newer, bfloat16 compute is the right default. Older cards may need fp16. Mismatches can silently slow you down.

Watch out for tokenizer differences between original and quantized model releases. Some redistributions ship slightly different chat templates, which can make outputs look broken even when the weights are fine.

Wrap-up

Quantization makes large models practical to run on small budgets. It trades a small amount of quality for a large amount of memory and often latency. The right bit width depends on your hardware, your latency target, and your tolerance for accuracy loss. Always benchmark with the prompts you actually serve.