Gptq

GPTQ is a post-training quantization method built for GPU inference of large language models. It compresses model weights to lower bit-widths using a layer-wise approach informed by a small calibration dataset, allowing fast execution on compatible hardware while preserving core generation quality.

Among quantization formats, GPTQ delivers the highest inference throughput on NVIDIA GPUs. When compared to awq, it trades a slight edge in quantization fidelity and long‑context handling for outright speed, making it a go‑to choice in latency‑sensitive deployments where a CUDA‑capable GPU is available.

Running a GPTQ model requires CUDA; the method does not support CPU‑only execution or the unified memory advantages of Apple Silicon. This hardware constraint sets it apart from gguf, which is designed to run efficiently on CPUs and consumer‑grade unified‑memory systems. Consequently, GPTQ is less suitable for devices without a dedicated NVIDIA GPU.

Memory needs for a GPTQ‑quantized model follow the common estimation rule: VRAM ≈ parameters × bytes per parameter × 1.2. The 1.2× multiplier covers the KV cache and other transient buffers. Because actual usable GPU memory is typically 60–70% of the advertised capacity, operators must plan with headroom to avoid out‑of‑memory errors. A side‑by‑side treatment of these tradeoffs appears in llm-quantization-comparison.