When researching LLMs, I found these two terms often used interchangeably: distillation and quantization. After some thought, I began to see them through the analogy of storing a digital image.
When you take a picture of an object, you’re not capturing the object itself but recording a representation — something that can later be projected or reconstructed. You decide how many pixels to keep, and for each pixel, how many bits to allocate. Every choice trades fidelity for practicality.
Distillation is like reducing the pixel count. You train a smaller model to reproduce the behavior of a larger one, keeping structure and meaning while dropping fine details. It captures the “shape” of knowledge, not every contour. The result: faster, lighter, and usually good enough for the intended resolution.
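To make that concrete, here is a minimal sketch of the classic teacher-student setup in PyTorch. The two toy MLPs below are hypothetical stand-ins for a large and a small LLM, and the temperature value is just an illustrative choice; the point is that the student is trained to match the teacher's softened output distribution rather than the original labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a large teacher and a small student (hypothetical sizes).
teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions, then pull the student toward the teacher (KL divergence).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(32, 128)  # a dummy batch of inputs

with torch.no_grad():
    teacher_logits = teacher(x)  # the "high-resolution" reference behavior

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits)
loss.backward()
optimizer.step()
```

In practice this KL term is usually mixed with the ordinary task loss on real labels; the sketch shows only the distillation part.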
Quantization is like lowering the bit depth. The architecture stays the same — same number of layers, parameters, and connections — but each weight or activation is stored with fewer bits. You keep the shape, but reduce the number of shades you can represent. Like playing a color movie on a black-and-white TV.
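In code, “fewer shades” can be as simple as the sketch below: a symmetric per-tensor int8 scheme written from scratch for illustration, not taken from any particular library. Real toolchains add per-channel scales, calibration data, and sometimes quantization-aware training, but the core move is the same: round each float onto a small integer grid and remember the scale.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map each float onto one of 255 integer "shades".
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Reconstruct an approximation of the original weights from the integer grid.
    return q.float() * scale

w = torch.randn(4, 4)            # a dummy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())   # small, but nonzero, rounding error
```

Each weight now occupies one byte instead of two or four, and the dequantized copy differs from the original only by a small rounding error.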
Both methods are forms of compression, but they act on different dimensions:
Distillation trims the model’s space — fewer neurons and layers.
Quantization trims the model’s depth — fewer bits per value.
Which to use depends on your target “display.” A small mobile device may need both: fewer pixels and lower bit depth. A server model with room to breathe might only need quantization for efficiency.
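Some back-of-the-envelope arithmetic shows how the two dimensions multiply. The parameter counts below are hypothetical, picked only for illustration:

```python
# Illustrative memory-footprint arithmetic (hypothetical model sizes, not measurements).
teacher_params = 7e9      # a hypothetical 7B-parameter model
student_params = 1.3e9    # a hypothetical distilled 1.3B-parameter student

bytes_fp16, bytes_int8 = 2, 1

print(f"teacher, fp16: {teacher_params * bytes_fp16 / 1e9:.1f} GB")  # ~14.0 GB
print(f"teacher, int8: {teacher_params * bytes_int8 / 1e9:.1f} GB")  # ~7.0 GB (quantization only)
print(f"student, fp16: {student_params * bytes_fp16 / 1e9:.1f} GB")  # ~2.6 GB (distillation only)
print(f"student, int8: {student_params * bytes_int8 / 1e9:.1f} GB")  # ~1.3 GB (both)
```

Distillation shrinks the first factor (how many values you store), quantization shrinks the second (how many bytes each value takes), and the savings compound when you apply both.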
Like photography, model optimization is about preserving what matters most for the final audience. Every reduction is a decision about what’s worth keeping.