LLM Quantization Techniques - 3rd One is Important

Introduction

Large Language Models (LLMs) are powerful but resource-heavy, demanding massive amounts of memory and compute. Quantization shrinks their size and speeds up inference, making models leaner, faster, and more efficient without changing their core intelligence or losing much accuracy.

It is like compressing a high-resolution image without losing much clarity. Quantization reduces model size so LLMs can run efficiently on edge devices, laptops, or small servers, which makes deployment faster, more affordable, and more energy-efficient without compromising their intelligence.

Step By Step LLM Quantization Workflow

1. Model Preparation

  • Choose an LLM compatible with quantization (like Llama 2, Falcon, MPT).
  • Convert it into a framework-supported format (PyTorch, TensorFlow, or ONNX).
  • Prepare representative data to calibrate activations before quantization (see the sketch below).
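
A minimal preparation sketch in Python, assuming the Hugging Face transformers library; the model ID and calibration prompts are illustrative placeholders, and any causal LM works the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: swap in your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# A handful of representative prompts for calibrating activation ranges later.
calibration_texts = [
    "Summarize the quarterly sales report in three bullet points.",
    "Explain the difference between RAM and VRAM in simple terms.",
]
calibration_batches = [tokenizer(t, return_tensors="pt") for t in calibration_texts]
```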

2. Decide Quantization Precision

  • Select a bit-width and numerical type, for example:
  • 8-bit (INT8): Good accuracy with a large speedup.
  • 4-bit (INT4): More aggressive compression, with a higher risk of accuracy loss.
  • 16-bit (FP16/BF16): Useful for GPU-based mixed precision.
  • Match precision to your hardware and the task's sensitivity to error (see the quick memory estimate below).
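
As a rough guide, weight memory scales linearly with bit-width. A quick back-of-the-envelope estimate for a 7B-parameter model (weights only; activations and the KV cache add more on top):

```python
# Weight-only memory estimate for a 7B-parameter model at different precisions.
params = 7_000_000_000

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:<10} ~{gib:.1f} GiB of weight memory")

# FP16/BF16  ~13.0 GiB
# INT8       ~6.5 GiB
# INT4       ~3.3 GiB
```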

3. Choose Quantization Method

  • Each technique offers unique trade-offs. Below are the key techniques used with LLMs today.

1. Post-Training Quantization (PTQ)

Quantizes model weights after training—no retraining needed.

Steps for Post-Training Quantization (PTQ) :

  • Convert weights to low precision.
  • Optionally quantize activations using calibration data.
  • Test and fine-tune slightly to correct drift.

Significance of PTQ :

  • Simplest entry point; ideal for testing deployment viability quickly.

Applications of PTQ :

  • Chatbots, content generators, or inference-only workloads where minor accuracy loss is acceptable.
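
A minimal PTQ sketch using PyTorch's built-in dynamic quantization on a toy model. It illustrates the core idea (convert Linear weights to INT8 after training, no retraining) rather than a production LLM recipe:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# A toy two-layer block standing in for a trained model.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
model.eval()

# Post-training: Linear weights become INT8; activations are quantized
# dynamically at runtime, so no calibration pass is required here.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768])
```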

2. Quantization-Aware Training (QAT)

The model learns in quantized form during training.

Steps for Quantization-Aware Training (QAT) :

  • Insert simulated quantization layers in training.
  • Train as usual while letting the model adapt to reduced precision.
  • Export the quantized version for inference.

Significance of QAT :

  • Preserves high accuracy with lower precision.

Applications of QAT :

  • High-accuracy applications like enterprise search, medical summarization, or reasoning assistants.
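
A minimal QAT sketch with PyTorch's eager-mode quantization-aware training on a toy block (assumes the default fbgemm backend). A real LLM QAT run would apply the same idea to transformer layers, with fake-quant modules simulating INT8 during training:

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()       # FP32 -> INT8 boundary
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 128)
        self.dequant = tq.DeQuantStub()   # INT8 -> FP32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)

model = TinyBlock()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model.train()
tq.prepare_qat(model, inplace=True)       # insert simulated quantization (fake-quant) modules

# Stand-in for a normal training loop; weights adapt to the reduced precision.
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(3):
    loss = model(torch.randn(8, 128)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()
int8_model = tq.convert(model)            # export the quantized version for inference
```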

3. GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

A post-training algorithm that quantizes weights layer by layer, updating the remaining weights in each layer to compensate for the error introduced.

Steps for GPTQ :

  • Compute weight importance via Hessian approximation.
  • Quantize layer-by-layer using grouped calibration.
  • Optimize to minimize reconstruction error.

Significance of GPTQ :

  • Balances speed and precision remarkably well; minimal accuracy loss compared to full precision.

Applications of GPTQ :

  • Popular for Llama, Mistral, and GPT-style models in inference APIs or local setups.
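
A minimal GPTQ sketch via Hugging Face transformers, which delegates to the optimum/auto-gptq backend (assumes those optional packages and a GPU are available; the small OPT checkpoint and output path are illustrative choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model used only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with calibration data drawn from the C4 corpus.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs layer by layer on load, minimizing reconstruction error
# against the calibration samples.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("opt-125m-gptq-4bit")
```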

4. AWQ (Activation-Aware Weight Quantization)

Protects the small fraction of weights whose activations contribute most to the output, scaling them before quantization instead of treating all weights equally.

Steps for AWQ :

  • Analyze activation statistics to find the most sensitive weight channels.
  • Scale those critical channels so they lose less precision when quantized.
  • Apply group-wise integer quantization to all weights.

Significance of AWQ :

  • Faster than GPTQ and often equal in accuracy.

Applications of AWQ :

  • Edge inference, CPU or GPU deployment where speed and memory savings are top priorities.
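
A minimal AWQ sketch with the AutoAWQ library (assumes `pip install autoawq`; the model path is a placeholder and the quant_config values follow AutoAWQ's documented example):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"   # placeholder: any supported model
quant_path = "mistral-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Activation-aware calibration picks per-group scales that protect the
# weight channels most important to the model's outputs.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```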

5. SmoothQuant

Rebalances weight and activation ranges layer by layer before quantizing, shifting quantization difficulty from activations to weights.

Steps for SmoothQuant :

  • Calculate scaling factors that balance weight and activation ranges.
  • Apply smooth scaling to reduce quantization error.

Significance of SmoothQuant :

  • Tames activation outliers and stabilizes accuracy at low precision.

Applications of SmoothQuant :

  • Cloud inference of LLMs with large activation variance (like transformers in multi-document summarization).
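
A toy sketch of the core SmoothQuant transform, not the official library: a per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) moves quantization difficulty from activations to weights while leaving the layer's output mathematically unchanged:

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 8) * torch.tensor([1, 1, 1, 1, 1, 1, 1, 50.0])  # one outlier channel
W = torch.randn(8, 8)        # linear weight, shape (out_features, in_features)
alpha = 0.5

act_max = X.abs().amax(dim=0)              # per-input-channel activation range
w_max = W.abs().amax(dim=0)                # per-input-channel weight range
s = act_max.pow(alpha) / w_max.pow(1 - alpha)

X_smooth = X / s                           # activations become easier to quantize
W_smooth = W * s                           # weights absorb the scale per input channel

# The layer output is preserved up to floating-point error.
print(torch.allclose(X @ W.T, X_smooth @ W_smooth.T, atol=1e-4))   # True
print(X.abs().max().item(), X_smooth.abs().max().item())           # the outlier is tamed
```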

6. BitsAndBytes (QLoRA Technique)

Mixed-precision quantization (4-bit weights with higher-precision compute) that keeps the model trainable.

Steps for BitsAndBytes :

  • Compress model weights to 4-bit quantized states.
  • Use LoRA adapters for fine-tuning.
  • Train or infer with minimal GPU memory.

Significance of BitsAndBytes :

  • Enables fine-tuning large models on consumer GPUs.

Applications of BitsAndBytes :

  • Domain-specific model fine-tuning (customer service, medical text, code assistants).
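
A minimal QLoRA-style sketch using bitsandbytes 4-bit loading plus PEFT LoRA adapters (assumes the bitsandbytes, peft, and accelerate packages and a CUDA GPU; the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "facebook/opt-350m"  # placeholder: small model for illustration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Frozen 4-bit base weights plus small trainable LoRA matrices.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```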

7. LLM-QAT (Quantization-Aware Training for LLMs)

Extends QAT to very large models by quantizing weights, activations, and the key-value cache during training.

Steps for LLM-QAT :

  • Quantize activations in the forward pass.
  • Include key-value cache quantization.
  • Jointly optimize for accuracy and reduced memory footprint.

Significance of LLM-QAT :

  • Allows maintaining long context lengths efficiently.

Applications of LLM-QAT :

  • Long-sequence tasks like summarization, dialogue memory, or document chaining.
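
A conceptual sketch, not the LLM-QAT paper's code: the straight-through-estimator fake quantization below is the building block such methods apply to activations and key-value tensors during training:

```python
import torch

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (x_q - x).detach()   # forward: x_q, backward: identity

# Example: fake-quantize a key tensor inside an attention forward pass.
k = torch.randn(2, 8, 16, requires_grad=True)   # (batch, tokens, head_dim)
k_q = fake_quant(k, bits=8)
k_q.sum().backward()
print(k.grad.abs().sum() > 0)   # gradients still reach the unquantized keys
```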

8. GGUF Format (Quantized Model Storage)

A file format used by llama.cpp and similar frameworks for compact model storage.

Steps for GGUF :

  • Export quantized weights into the GGUF structure (supports a range of 2- to 8-bit schemes).
  • Load with lightweight inference runtimes such as llama.cpp, on CPU or with optional GPU offload.

Significance of GGUF :

  • Readable by local inference engines; perfect for personal AI experiments.

Applications of GGUF :

  • On-device assistants, offline chatbots, low-memory environments.
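
A minimal sketch of running a GGUF model locally with llama-cpp-python (assumes `pip install llama-cpp-python` and that a quantized .gguf file, here a hypothetical Q4_K_M Llama 2 file, has already been downloaded):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder: your local GGUF file
    n_ctx=2048,        # context window
    n_threads=8,       # CPU threads for inference
)

out = llm("Q: What is quantization in one sentence? A:", max_tokens=48)
print(out["choices"][0]["text"])
```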

9. KV Cache Quantization

Optimizes memory-heavy attention key–value caches.

Steps for KV Cache Quantization :

  • Quantize key-value pairs per token.
  • Compress during inference to cut memory use.

Significance of KV Cache Quantization :

  • Saves memory in long-context operations without major degradation.

Applications of KV Cache Quantization :

  • Streaming inference in chat models, long-form conversation memory.
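
A toy sketch of per-token INT8 quantization of a key cache (conceptual, not any specific framework's implementation): each cached token gets its own scale, cutting memory roughly 4x versus the FP32 cache in this example:

```python
import torch

def quantize_per_token(kv: torch.Tensor):
    # kv: (batch, tokens, head_dim); one scale per (batch, token)
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

keys = torch.randn(1, 512, 128)              # 512 cached tokens
q, scale = quantize_per_token(keys)
restored = dequantize(q, scale)

print(q.element_size() * q.numel())          # 65536 bytes vs 262144 for FP32
print((keys - restored).abs().max().item())  # small reconstruction error
```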

10. Per-Channel and Per-Token Quantization

Different granularity levels for quantizing weights or activations.

Steps for Per-Channel and Per-Token Quantization :

  • Per-Channel: Quantize each feature channel separately for better accuracy.
  • Per-Token: Apply quantization per generated token to minimize accumulation of error.

Significance of Per-Channel and Per-Token Quantization :

  • Preserves accuracy for transformer blocks.

Applications of Per-Channel and Per-Token Quantization :

  • GPT-like models during real-time text generation.
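
A toy comparison of per-tensor versus per-channel INT8 weight quantization (a conceptual sketch): per-channel scales track each output channel's range, which typically reduces error in transformer linear layers:

```python
import torch

torch.manual_seed(0)
# Simulate a weight matrix whose output channels have very uneven ranges.
W = torch.randn(256, 64) * torch.logspace(-2, 1, 256).unsqueeze(1)

def quant_dequant(w, scale):
    return (w / scale).round().clamp(-127, 127) * scale

per_tensor_scale = W.abs().max() / 127                      # one scale for everything
per_channel_scale = W.abs().amax(dim=1, keepdim=True) / 127  # one scale per output channel

err_tensor = (W - quant_dequant(W, per_tensor_scale)).abs().mean()
err_channel = (W - quant_dequant(W, per_channel_scale)).abs().mean()
print(f"per-tensor error:  {err_tensor.item():.6f}")
print(f"per-channel error: {err_channel.item():.6f}")   # noticeably smaller
```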

Real-World Toolchain Examples :

  • Hugging Face Optimum: Simplifies GPTQ and SmoothQuant workflows.
  • BitsAndBytes Library: Supports QLoRA and 4-bit model fine-tuning.
  • AutoAWQ: Simple integration with transformer-based LLMs.
  • llama.cpp: Runs quantized GGUF models efficiently on consumer CPUs.

Conclusion

LLM quantization bridges the gap between massive intelligence and real-world accessibility. By applying advanced techniques like GPTQ, AWQ, SmoothQuant, and QLoRA, it compresses huge models into lighter, faster versions without sacrificing much accuracy. This optimization enables large language models to run efficiently on consumer laptops, edge devices, and smaller servers - making AI deployment more affordable, sustainable, and scalable. In short, quantization powers the shift toward faster, greener, and more inclusive AI for everyone.
