LLM Quantization Comparison

Three dominant quantization pathways have emerged for large language models: GPTQ, AWQ, and GGUF. GPTQ and AWQ are quantization algorithms, while GGUF is the model file format used by llama.cpp and carries its own quantization schemes. Each compresses weights differently and targets a specific hardware environment, so the choice depends heavily on the inference stack a user intends to run.

Hardware affinity shapes the decision from the start. GPTQ and AWQ both depend on CUDA and are designed for NVIDIA GPU acceleration; they do not support pure CPU execution. In contrast, GGUF is built for CPU-first inference and performs especially well on Apple Silicon, where unified memory lets the CPU and GPU share one copy of the weights instead of duplicating them across system RAM and VRAM, as discrete GPU setups do. Because a GGUF model ships as a single file, it is also the most consumer-friendly option for laptops and desktops without a high-end NVIDIA card.

On NVIDIA GPUs, the speed-quality trade-off tilts differently. GPTQ generally delivers the fastest token generation, while AWQ tends to preserve output quality better and holds up more strongly on long-context sequences. Both can be served through engines like vLLM for production workloads, but neither can run on a CPU-only host. On CPU and Apple Silicon, GGUF may not match the raw GPU throughput of the other two, but its ability to offload part of the model to a GPU when one is available, together with the absence of a CUDA requirement, makes it the only practical route for those platforms.

Memory planning requires a safety margin regardless of the method. A workable estimate for model loading is parameters × bytes_per_parameter × 1.2, where the 1.2 factor accounts for the KV cache and other transient buffers, and bytes_per_parameter is 0.5 at 4-bit. Real-world usable memory is usually 60–70% of the advertised capacity. A 7B model at 4-bit (≈3.5 GB of weights, ≈4.2 GB with overhead) therefore fits comfortably in many Apple Silicon Macs, whereas a 70B model at 4-bit (≈35 GB of weights) needs roughly 42 GB. That is borderline on a 64 GB M4 Max, where 60–70% leaves about 38–45 GB, and well beyond a 24 GB RTX 4090 unless part of the model is offloaded.
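The sizing rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the paragraph above: the function names are hypothetical, sizes are in decimal gigabytes, and the 0.65 default for the usable fraction is an assumed midpoint of the 60–70% range rather than a measured value.

```python
def estimate_load_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Estimated memory to load a model, in decimal GB.

    params_billion * bytes_per_parameter * overhead, where the overhead
    factor covers the KV cache and other transient buffers.
    """
    bytes_per_parameter = bits_per_weight / 8
    return params_billion * bytes_per_parameter * overhead


def usable_gb(advertised_gb: float, usable_fraction: float = 0.65) -> float:
    """Memory realistically available to the model (assumed 60-70% of advertised)."""
    return advertised_gb * usable_fraction


def fits(params_billion: float, bits_per_weight: float, advertised_gb: float,
         usable_fraction: float = 0.65) -> bool:
    """True if the estimated load fits within the usable share of memory."""
    needed = estimate_load_gb(params_billion, bits_per_weight)
    return needed <= usable_gb(advertised_gb, usable_fraction)


if __name__ == "__main__":
    for name, params, mem in [("7B on 16 GB", 7, 16),
                              ("70B on 64 GB", 70, 64),
                              ("70B on 24 GB", 70, 24)]:
        need = estimate_load_gb(params, 4)
        print(f"{name}: needs {need:.1f} GB, fits at 65% usable: {fits(params, 4, mem)}")
```

Under these assumptions the 70B case on 64 GB flips between fitting and not fitting as the usable fraction moves from 0.70 to 0.60, which is why that configuration is best treated as borderline.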

In practice, the selection heuristic follows the hardware. On CUDA-equipped machines, GPTQ offers the highest speed and AWQ the best quality-to-size ratio; both benefit from the mature CUDA ecosystem. For Apple Silicon and CPU-only scenarios, GGUF is the practical choice, trading some GPU speed for portability and the advantage of unified memory. Sizing a machine against 60–70% of its advertised memory, rather than the full figure, helps avoid out-of-memory surprises in daily use.
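The heuristic above is simple enough to write down as a small decision function. The sketch below is a hypothetical helper, not part of any library: the `has_cuda` flag and the `priority` values "speed" and "quality" are names introduced here for illustration, and the function only encodes the rule of thumb described in this article.

```python
def pick_format(has_cuda: bool, priority: str = "speed") -> str:
    """Return the quantization pathway suggested by the article's heuristic.

    has_cuda: True for a machine with an NVIDIA GPU and a working CUDA stack.
    priority: "speed" or "quality"; only consulted when CUDA is available.
    """
    if not has_cuda:
        # Apple Silicon and CPU-only hosts: GGUF is the practical route.
        return "GGUF"
    if priority == "speed":
        return "GPTQ"
    if priority == "quality":
        return "AWQ"
    raise ValueError(f"unknown priority: {priority!r}")
```

For example, `pick_format(False)` returns "GGUF" whatever the priority, while `pick_format(True, "quality")` returns "AWQ". A real selection would also check the memory budget before committing to a model size.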