LLMs on CPU: The Power of Quantization with GGUF, AWQ, & GPTQ


Large Language Models (LLMs) have revolutionized what's possible with AI, but their immense size and computational demands often relegate them to powerful GPUs or cloud infrastructure. What if you want to run these incredible models right on your personal computer, leveraging your CPU? This is where quantization comes in: a game-changer that makes local LLM inference a reality.

Running LLMs on consumer-grade CPUs presents two primary hurdles: memory and compute. Full-precision models (typically FP32) require gigabytes of RAM, often exceeding what's available in typical laptops or desktops, and their floating-point operations can overwhelm CPU cores. Quantization offers an elegant solution by significantly reducing a model's footprint and enabling faster, more efficient processing.

In this blog, we'll explore the fascinating world of quantization, focusing on techniques like GGUF, AWQ, and GPTQ, and how they empower you to run powerful LLMs locally.

Quantization Fundamentals for CPU Deployment

To truly appreciate the magic of quantization, it helps to understand the underlying principles. At its most fundamental level, quantization is the process of reducing the numerical precision of a neural network's weights and activations. Think of it like compressing a high-resolution image (e.g., a TIFF file) into a more compact format (e.g., a JPEG). While some minuscule detail might be lost, the resulting file is significantly smaller and much easier to store, transmit, and process, often without a noticeable impact on perceived quality.

In the context of LLMs, this typically involves converting the standard 32-bit floating-point (FP32) numbers, which offer a wide range and high precision, into lower-precision integer formats, most commonly 4-bit (INT4) or 8-bit (INT8). Sometimes intermediate precision levels, such as 5-bit, are also used.

Comparison of the float32, bfloat16, and float16 numerical formats

Why is this numerical compression a game-changer for CPUs?

Massive Memory Savings

This is arguably the most immediate and impactful benefit. An FP32 number occupies 4 bytes of memory; an INT4 number occupies just half a byte. That difference multiplies across the billions of parameters in an LLM. For instance, a 7-billion-parameter model in FP32 format would require approximately 7×10⁹ × 4 bytes = 28 GB of RAM. Quantized to INT4, that same model could fit into roughly 7×10⁹ × 0.5 bytes = 3.5 GB! This dramatic reduction allows significantly larger models to be loaded into the finite RAM of consumer laptops and desktops, often determining whether a model can run at all.
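
This arithmetic is easy to reproduce. Below is a minimal sketch (using the 7B parameter count and the same 10⁹-bytes-per-GB convention as the figures above) that estimates the weight-storage footprint at different precisions; it ignores activation memory, the KV cache, and runtime overhead.

# Rough estimate of weight-storage RAM at different precisions.
# Ignores activations, KV cache, and runtime overhead.
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9  # 10^9 bytes per GB, as above

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B model @ {name}: ~{weight_memory_gb(7e9, bits):.1f} GB")
# FP32 ≈ 28.0 GB, INT4 ≈ 3.5 GB, matching the figures above.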

Faster Integer Operations

CPUs are inherently optimized for integer arithmetic. While they can perform floating-point operations, integer operations are generally faster and consume less power. By converting weights and activations to integer formats, quantization allows the CPU to leverage its strengths, leading to noticeable improvements in inference speed. This means quicker responses from your local LLM, enhancing the user experience.

Of course, no optimization comes without a trade-off. The act of reducing numerical precision inherently introduces a small amount of "quantization error." This error can subtly degrade the model's accuracy, potentially leading to slightly less coherent responses or minor factual inaccuracies compared to its full-precision counterpart. The central challenge and art of quantization lie in finding the optimal balance: achieving substantial memory and speed benefits with minimal, ideally imperceptible, impact on the model's output quality.

The most common approach for quantizing LLMs for deployment is Post-Training Quantization (PTQ). As the name suggests, this means the quantization process occurs after the model has been fully trained in its high-precision format. Unlike training-aware quantization methods that integrate quantization into the training loop, PTQ is simpler to implement and doesn't require access to the original training data or infrastructure, making it highly practical for open-source models. The PTQ algorithms strive to minimize the accuracy loss during this conversion, often by calibrating the quantization scales based on a small sample of representative data.

Simple Example of Quantization

Let's assume you have 25 weight values in FP16 format, arranged in the matrix shown below.

We need to do int8 quantization of these values. The steps are given below.

  1. Old Range = max weight value in FP16 - min weight value in FP16 = 0.932 - 0.0609 = 0.871
  2. New Range = INT8 covers the integers from -128 to 127. Hence, Range = 127 - (-128) = 255
  3. Scale = max value in New Range / max value in Old Range = 127 / 0.932 ≈ 136.2472
  4. Quantized Value = Round(Scale × Original Value)
  5. Dequantized Value = Quantized Value / Scale
  6. Rounding Error: note that when we dequantize back to FP16, the numbers are not exactly the same. The first element 0.5415 becomes 0.543, and the same issue can be seen in most of the elements. This error is the result of the quantization-dequantization round trip (a minimal code sketch of these steps follows).
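
Here is that sketch, using a handful of illustrative values (not the exact matrix from the figure) and the same absolute-max scaling from step 3.

import numpy as np

# A few illustrative FP16 weights (not the exact matrix from the figure)
weights_fp16 = np.array([0.5415, 0.0609, 0.932, 0.1122, 0.4477], dtype=np.float16)

# Step 3: scale from the FP16 range to the INT8 range
scale = 127 / float(np.max(np.abs(weights_fp16)))

# Step 4: quantize - round and clip to the INT8 range
q = np.clip(np.round(weights_fp16.astype(np.float32) * scale), -128, 127).astype(np.int8)

# Step 5: dequantize back to FP16
deq = (q.astype(np.float32) / scale).astype(np.float16)

print(q)                    # e.g. [ 74   8 127  15  61]
print(deq)                  # close to, but not exactly equal to, the originals
print(deq - weights_fp16)   # step 6: the rounding (quantization) error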

Now that we have understood the core of quantization, let's move on to the types of quantization used for LLMs.

GPTQ (Generalized Post-Training Quantization)

GPTQ is a highly influential post-training quantization method: once an LLM has been pre-trained in full precision (e.g., FP32), its parameters are converted into a lower-precision format after the fact.

While GPTQ was initially designed and is primarily preferred for GPU efficiency, its effectiveness in achieving high accuracy at low bitrates (especially 4-bit) has led to its widespread adoption and subsequent adaptation for CPU inference.

There are various approaches within GPTQ:

Static Range Quantization

This method quantizes both weights and activations into lower precision. It requires a small calibration dataset, typically a representative sample (e.g., 1000 data points) from the original pre-training dataset. By performing inference on this calibration dataset, the algorithm identifies the distribution (e.g., min-max ranges) of weights and activations for each layer. This information is then used to determine the quantization scales.

Dynamic Range Quantization

In contrast, this approach converts weights to a lower precision while dynamically generating a quantization function for activations during inference. This function is applied on-the-fly to quantize activations, providing flexibility but potentially introducing runtime overhead.

Weight Quantization

This specifically focuses on quantizing only the weights of the model. During inference, the input remains in its original float32 format. To perform computations, the quantized weights are temporarily de-quantized back to float32. This process, while saving memory for storing weights, can introduce accuracy loss due to rounding issues during de-quantization.
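
As a rough sketch of this pattern (illustrative values only, not any particular library's implementation): the weights are stored as INT8 plus a scale factor, the input stays in float32, and the weights are de-quantized just before the matrix multiplication.

import numpy as np

# Weight-only quantization, sketched: store INT8 weights plus a scale,
# keep the input in float32, and dequantize the weights at compute time.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)

scale = np.max(np.abs(W)) / 127.0
W_q = np.round(W / scale).astype(np.int8)      # what actually gets stored

x = rng.normal(size=8).astype(np.float32)      # input remains float32
y = (W_q.astype(np.float32) * scale) @ x       # dequantize, then compute
print(np.abs(y - W @ x).max())                 # small error from rounding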

Overview of Static Range Quantization Algorithm

When performing Static Range Quantization with GPTQ, the algorithm operates layer by layer within the neural network. The key steps, illustrated by the toy sketch after this list, are:

  1. Grouped Column Processing: The weight matrix of each layer is logically divided into groups of columns.
  2. Iterative Quantization and Error Compensation: These column groups are processed iteratively. Within each group (e.g., if group_size is 128, a group of 128 columns), data for one column is quantized. Immediately after, the remaining weights in that group are updated to compensate for the error introduced by this quantization.
  3. Lazy Batch Update: Once an entire group of columns is processed, the rest of the columns in the complete weight matrix (i.e., other groups) are further updated to globally compensate for the accumulated quantization error. This entire process is often referred to as a "Lazy batch update" due to its structured, compensatory adjustments.
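
The toy NumPy sketch below mirrors the grouped traversal described in these steps. It is only an illustration of the loop structure: real GPTQ redistributes each column's quantization error into the not-yet-quantized weights using approximate second-order (inverse-Hessian) information, which is omitted here, and the function name is made up for this example.

import numpy as np

def toy_groupwise_quantize(W, group_size=128, bits=4):
    # Toy illustration of GPTQ's column-group loop structure only.
    # Real GPTQ also compensates for quantization error using
    # approximate second-order (inverse-Hessian) information.
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    Q = np.zeros_like(W, dtype=np.int8)
    scales = np.zeros(W.shape[1], dtype=np.float32)
    for start in range(0, W.shape[1], group_size):  # 1. grouped column processing
        for c in range(start, min(start + group_size, W.shape[1])):
            scale = max(float(np.max(np.abs(W[:, c]))) / qmax, 1e-12)
            Q[:, c] = np.clip(np.round(W[:, c] / scale), -qmax - 1, qmax).astype(np.int8)
            scales[c] = scale
            # 2. <-- here GPTQ would update the group's remaining columns to
            #        compensate for the error (W[:, c] - Q[:, c] * scale)
        # 3. <-- and here the "lazy batch update" would propagate the group's
        #        accumulated error to the rest of the weight matrix
    return Q, scales

W = np.random.default_rng(0).normal(size=(32, 256)).astype(np.float32)
Q, scales = toy_groupwise_quantize(W)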

GPTQ Code

A GPU is needed for this kind of quantization. At first, we tried quantizing a 7B sharded Mistral model but failed: the model is first loaded onto the CPU upon download, and a T4 instance doesn't have enough CPU RAM to support that. We ended up picking a smaller model that fits in a free T4 instance on Google Colab: bigscience/bloom-3b from the Hugging Face Hub.

!pip install auto_gptq
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import TextGenerationPipeline
from transformers import AutoTokenizer

pretrained_model_name = "bigscience/bloom-3b" 
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# Tensors of bloom are of float16. Hence, torch_dtype=torch.float16. Do not leave torch_dtype as "auto" as this leads to a warning of implicit dtype conversion
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config, trust_remote_code=False, device_map="auto", torch_dtype=torch.float16)  # changing device map to "cuda" does not have any impact on T4 GPU mem usage.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)

# Calibration
examples = [
    tokenizer(
        "Automated machine learning is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment."
    )
]  # giving only 1 example here for testing. In a real-world scenario, you might want to give 500-1000 samples.
model.quantize(examples)

quantized_model_dir = "bloom3b_q4b_gs128"
model.save_quantized(quantized_model_dir)

# Inference with quantized model
device = "cuda:0"  # make use of GPU for inference.
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, torch_dtype=torch.float16)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=50)  
print(pipeline("Automated machine learning is")[0]["generated_text"])
# The sequence length of a model (bloom's is 2048) is the total number of input & output tokens;
# max_new_tokens is the number of output tokens.
# Do note that executing this code produces a warning that the model's sequence length was not found in the model config. However, I could not find an option to pass bloom's sequence length in the configuration.
# The warnings about fused modules & an unsupported model are not valid. The links I used to validate this are in the references.
Output of inference with the quantized model

GGUF (GGML Universal Format)

GGUF (GGML Universal Format) has rapidly emerged as the de facto standard for efficient LLM inference on CPUs. It is the successor to GGML, a remarkable C/C++ tensor library built to run LLM inference efficiently on commodity hardware. GGUF extends this powerful foundation, supporting a wide array of Large Language Models, including the popular LLaMA series, Falcon, and many others.

A key advantage of GGUF is its exceptional portability. While the underlying llama.cpp project, which leverages GGUF, was famously optimized for Apple Silicon processors (Apple's custom CPU-GPU integrated chips found in Macs), GGUF models are equally compatible with Windows and Linux operating systems. This means you can run these highly optimized models directly on standard x86 CPUs found in most personal computers. Furthermore, GGUF allows for a flexible hybrid approach: in scenarios where your CPU might not be sufficient for an entire model, you can offload a few layers to a compatible GPU, significantly boosting performance.

GGUF is central to the quantization process itself, providing robust support for various quantization levels, typically ranging from 2-bit to 8-bit precision. The general workflow involves taking an original, full-precision LLM (like a LLaMA model), converting it into the GGUF format, and then applying a specific quantization level to this GGUF model to achieve the desired balance of size, speed, and accuracy. This seamless conversion and quantization capability makes GGUF an indispensable tool for democratizing LLM access.

Different Types of GGUF Quantization

Within the GGUF ecosystem, quantization isn't a one-size-fits-all concept. Instead, there's a spectrum of options, each meticulously engineered to strike a distinct balance among model size, inference speed, and the overall fidelity of the LLM's output. These variations are typically identified through specific naming conventions that indicate the average number of bits per weight and describe the underlying quantization strategy. Understanding these different types is crucial for selecting the most effective GGUF model for your particular hardware setup and intended application.

Standard Quantization Levels (Qx_0 and Qx_1)

At the foundational level are the standard GGUF quantization types, often recognized by designations like Q4_0 or Q5_0. The numerical prefix indicates the average number of bits used to represent each weight (e.g., 4-bit, 5-bit). Generally, more bits mean higher precision, leading to better accuracy but also a larger file size and potentially slower inference. For example, Q8_0 signifies an 8-bit quantization, offering very high quality, often nearly indistinguishable from the original full-precision (FP16/BF16) model, but with a significantly larger file size. The _0 suffix denotes a simpler quantization scheme with a single scale factor per block of weights. While some variations like Q4_1 or Q5_1 offer slight accuracy improvements with a scale and minimum value per block, these _0 and _1 types are now largely considered legacy, superseded by more optimized methods.
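
To make the _0 idea concrete, here is a rough NumPy sketch of block-wise quantization with a single scale per block. (llama.cpp's actual Q8_0 uses 32-weight blocks with an FP16 scale; the real on-disk layout is more involved than this.)

import numpy as np

def quantize_q8_0_style(weights, block_size=32):
    # One scale per block of weights, as in the Qx_0 family (sketch only).
    w = weights.reshape(-1, block_size)
    scales = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                          # guard against all-zero blocks
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

weights = np.random.default_rng(0).normal(size=(64,)).astype(np.float32)
q, scales = quantize_q8_0_style(weights)
print(np.abs(dequantize(q, scales) - weights).max())   # small reconstruction error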

K-Quantization (Qx_K_S, Qx_K_M, Qx_K_L)

A significant leap forward in GGUF quantization is represented by the K-quants, such as Q4_K_M, Q5_K_S, or Q6_K. The 'K' signifies a "mixed" or "k-quantization" approach, which intelligently allocates bits. Instead of uniform precision, K-quants identify more critical weights or layers within the model and quantize them with higher precision, while less important weights use lower precision. This selective precision aims to minimize accuracy loss while still achieving substantial file size reduction.

Within the K-quant family, suffixes like _S (Small), _M (Medium), and _L (Large) denote variations in how this mixed precision is applied, affecting the balance of quality, size, and speed:

Qx_K_S (Small)

These are generally the most aggressive K-quants in terms of compression for a given bit depth, aiming for the smallest file size. For example, Q4_K_S uses 4-bit quantization for all tensors, offering good quality at a compact size.

Qx_K_M (Medium)

Often considered the "sweet spot" for many users, these offer a great balance. For instance, Q4_K_M might use 6-bit quantization for half of the attention and feed-forward weights, with the rest at 4-bit, resulting in higher quality than Q4_K_S with a slightly larger file. Q5_K_M is frequently recommended for preserving most model performance with significant compression.

Qx_K_L (Large)

These offer even higher accuracy within the K-quant framework by allocating more bits to a larger proportion of weights or using higher base bit depths for critical components. They result in larger files but retain more of the original model's quality.

The key innovation of K-quants is their ability to be as fast as, or even faster than, the older legacy quants, while simultaneously having lower quantization error. This makes them the preferred choice for most llama.cpp and GGUF users seeking an optimal balance.

I-Quantization (IQx_XXS, IQx_XS, IQx_S, IQx_M, IQx_NL)

More recently, I-quants (e.g., IQ2_XXS, IQ3_S, IQ4_NL) have emerged as a cutting-edge advancement in GGUF quantization. These are designed for even better performance-to-size ratios, especially when leveraging specific hardware acceleration features like cuBLAS (for NVIDIA GPUs) or rocBLAS (for AMD GPUs). I-quants incorporate advanced techniques, sometimes inspired by research like QuIP#, to intelligently quantize weights with a focus on minimizing error and maximizing efficiency. While I-quants can be faster on specific GPU setups, they might be slower on CPU-only or Apple Metal environments, indicating a hardware-specific performance trade-off. Within the I-quant family, suffixes like _XXS (Extra Extra Small), _XS (Extra Small), _S (Small), and _M (Medium) denote different balances of compression and quality, with _XXS being the most compressed, while _NL denotes a non-linear quantization grid. They are generally considered state-of-the-art for below-4-bit quantization when GPU acceleration is available.

Accuracy Retention Comparison Across LLM Quantization Methods
Multi-Dimensional Performance Comparison of LLM Quantization Methods

Code to convert a Model to GGUF Format

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

# Download model
!git lfs install
!git clone "url"

# Convert weights to fp16
!python llama.cpp/convert.py MistralSharded2 --outtype f16 --outfile "MistralSharded2/mistralsharded2.fp16.bin"
# Quantization - a GPU is required for this step, though it does not use much GPU memory.

# As per info on https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML, q4_k_m uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
!./llama.cpp/quantize "MistralSharded2/mistralsharded2.fp16.bin" "MistralSharded2/mistralsharded2.Q4_K_M.gguf" q4_k_m
Memory usage during GGUF quantization (code fragment above)
Comparing size of the original and GGUF models

Code for Inference using the GGUF model

import os
# There are 32 layers in mistral 7B. Hence offloading all 32 layers to GPU while loading for inference below.
!./llama.cpp/main -m "MistralSharded2/mistralsharded2.Q4_K_M.gguf" -n 35 --color -ngl 32 -p "Automated machine learning"
Inference results of the GGUF model
GPU usage during Inference

Quantization Aware Training (QAT)

While Post-Training Quantization (PTQ) offers a quick way to compress models, it can sometimes lead to an unacceptable loss in accuracy. This is where Quantization Aware Training (QAT) comes into play. QAT begins with either a pre-trained model or an already PTQ-quantized model, which is then fine-tuned. The primary objective of QAT is to recover or even improve the accuracy that might have been lost during initial quantization, by making the training process "aware" of the quantization process itself.

In QAT, the key innovation is the simulation of quantization during the forward pass of training. This is achieved through the insertion of "FakeQuant Nodes." These nodes perform a full quantization (e.g., converting FP32 to INT8) and then an immediate de-quantization (converting INT8 back to FP32). This "fake" process ensures that all subsequent operations in the network are performed on values that have experienced the quantization error, while still allowing the gradients to be computed in a floating-point domain.

The basic idea of QAT is elegant: during the forward pass, inputs are quantized according to the precision of the weights in a given layer. The output of the weight-input multiplication is also handled by the FakeQuant node, ensuring that quantization effects are propagated. By introducing this quantization error directly into the forward pass, these errors are then incorporated into the overall loss calculation. Consequently, during the backward pass, the optimizer learns to adjust the model's weights not just to minimize the primary task loss, but also to reduce the negative impact of quantization errors. This makes the model inherently more robust to the precision limitations of the final quantized deployment.
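
Conceptually, a FakeQuant node is nothing more than a quantize-then-dequantize operation inserted into the graph. The tiny TensorFlow snippet below (arbitrary values, 8-bit for illustration) shows the effect: the output is still float32, but every value has been snapped to the quantization grid.

import tensorflow as tf

# A FakeQuant op: quantize to 8 bits within [min, max], then immediately
# dequantize, so downstream layers see the quantization error while the
# values (and gradients) remain floating point.
x = tf.constant([-1.31, -0.4, 0.0, 0.27, 0.9], dtype=tf.float32)
x_fq = tf.quantization.fake_quant_with_min_max_args(x, min=-1.5, max=1.5, num_bits=8)
print(x_fq.numpy())   # float32 values snapped to the 8-bit grid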

QAT Code

For practical implementation, numerous code fragments and tutorials are available on official machine learning framework websites, such as TensorFlow's documentation. While specific framework versions might present unique challenges (e.g., compatibility issues with certain layers), these resources provide invaluable starting points for experimenting with QAT in your own projects. It's common practice to begin with a simplified model to grasp the core concepts before applying QAT to larger, more complex architectures.

Please note that there are a couple of code fragments available on the official TensorFlow website: one uses tensorflow and the other tf-nightly. I encountered errors of the form "layers in the model do not support QAT" while using tensorflow, so I switched to the nightly variant. I also ended up removing any fancy layers from the Keras model. Kindly ignore the quality of the Keras model, as it is only a dummy model used to demonstrate QAT. The code was executed on CPU, since it is just a dummy model for demo purposes.

! pip uninstall -y tensorflow
! pip install -q tf-nightly  # Use tf-nightly instead of tensorflow since it gets updated with fixes every day
! pip install -q tensorflow-model-optimization  # this lib is used for QAT

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


### Sample Keras model - Code generated by ChatGPT

# Generate sample data
np.random.seed(0)
data = pd.DataFrame(np.random.rand(1000, 5), columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])
target = pd.Series(np.random.randint(0, 2, size=1000), name='Target')

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the Keras model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(5,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')
quant_aware_model = tfmot.quantization.keras.quantize_model(model)
quant_aware_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fine tune to create a quant aware model
quant_aware_model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the quant aware model
test_loss, test_accuracy = quant_aware_model.evaluate(X_test_scaled, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')

Activation-Aware Quantization (AWQ)

Activation-Aware Quantization (AWQ) represents a sophisticated step forward in the quest to minimize accuracy degradation during the quantization process, especially when targeting extremely low bitrates like 4-bit. Unlike simpler quantization methods that might uniformly quantize all weights, AWQ operates with a keen "awareness" of the model's activations – the intermediate outputs generated by each layer during inference.

The core insight behind AWQ is that not all weights contribute equally to a model's output quality. The groundbreaking observation in AWQ is that protecting only a very small percentage (e.g., 1%) of the most "salient" weights can dramatically reduce quantization error. These salient weights are identified not by their own magnitude, but by analyzing the magnitude of the input activations they are connected to. Activations with large magnitudes can amplify errors if their corresponding weights are poorly quantized.

AWQ addresses this by selectively scaling the weight channels (groups of weights) by channel-specific scaling factors before quantization. This is an "equivalent transformation" that effectively protects these critical weights. By scaling down the activation magnitudes that correspond to important weights and inversely scaling up the weights themselves, AWQ ensures that even when these weights are quantized to a low bit-depth (like 4-bit integers), their impact on the model's output is preserved. This strategy maintains a remarkable level of accuracy even with highly compressed models, often approaching the performance of the unquantized FP16 model.
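
The "equivalent transformation" is easy to see in isolation. In the NumPy sketch below (the scale values and the simple per-tensor INT4 quantizer are illustrative, not the actual AWQ search), scaling a weight column up while dividing the matching activation down leaves the output unchanged before quantization; after quantization, the up-scaled salient channel loses less relative precision.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6)).astype(np.float32)   # 8 outputs, 6 input channels
x = rng.normal(size=6).astype(np.float32)        # one activation vector

# Per-input-channel scales; AWQ derives these from activation statistics
# so that "salient" channels get s > 1. Values here are hand-picked.
s = np.array([1.0, 1.0, 2.0, 1.0, 1.0, 1.0], dtype=np.float32)

W_scaled = W * s          # scale weight columns up
x_scaled = x / s          # scale matching activations down

# The transformation is exact before quantization:
assert np.allclose(W @ x, W_scaled @ x_scaled, atol=1e-5)

# Only the (scaled) weights are then quantized; the salient channel now sits
# higher on the quantization grid, so its relative rounding error shrinks
# (at a small cost to the other channels, which AWQ balances when choosing s).
def quantize_int4(M):
    step = np.max(np.abs(M)) / 7
    return np.clip(np.round(M / step), -8, 7) * step

print(np.abs(W @ x - quantize_int4(W_scaled) @ x_scaled).max())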

Crucially, AWQ is a post-training quantization (PTQ) method that does not require backpropagation or complex reconstruction during its quantization process. It relies on a small, unlabelled calibration dataset to gather activation statistics. This makes AWQ very sample-efficient and helps it avoid overfitting to the calibration data, leading to better generalization across different tasks and even multimodal LLMs. The resulting quantized models then offer significant memory and speed benefits, making them highly suitable for CPU deployment via formats like GGUF.

AWQ Code

While full AWQ implementation involves detailed numerical operations and potentially custom CUDA kernels for speed, the general process when using common libraries typically involves these steps:

Load the Pre-trained Model

Start with your full-precision (e.g., FP16 or FP32) LLM and its corresponding tokenizer. This is usually done using frameworks like Hugging Face's Transformers library.

# Conceptual: Loading a pre-trained model
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "path/to/your/unquantized_model" # e.g., "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Define Quantization Configuration

Specify the target bit-width (e.g., 4-bit), group size, and other parameters relevant to AWQ.

# Conceptual: AWQ quantization configuration
quant_config = {
    "zero_point": True,  # Use zero-point quantization
    "q_group_size": 128, # Quantize weights in groups of 128
    "w_bit": 4,          # Target 4-bit weights
    "version": "GEMM"    # Or "exllama" for specific kernels
}

Load Calibration Data

Provide a small, representative dataset (e.g., 128-256 sequences) to collect activation statistics. This is crucial for AWQ to identify salient weights.

# Conceptual: Loading a small calibration dataset
from datasets import load_dataset
# This data is used to observe activation distributions
calibration_data = load_dataset("some_calibration_dataset", split="train[:128]")

Perform AWQ Quantization

Utilize an AWQ-specific library (like autoawq or llm-awq) to apply the quantization algorithm. This step involves analyzing activations, calculating optimal scaling factors for weights, and then quantizing the weights.

# Conceptual: Applying AWQ quantization
from awq import AutoAWQForCausalLM

# Load model with AWQ library for quantization process
model = AutoAWQForCausalLM.from_pretrained(model_path)

# Perform quantization (this step involves the activation analysis and weight scaling)
model.quantize(tokenizer, quant_config=quant_config)

Save and Convert (Optional: to GGUF)

Once quantized, the model can be saved in its AWQ-quantized format (e.g., .safetensors with AWQ metadata). For CPU inference, this model would then often be converted to the GGUF format using tools from the llama.cpp ecosystem, which can leverage the AWQ-derived scales for optimized CPU performance.

# Conceptual: Saving the quantized model
quant_path = "path/to/your/quantized_model_awq"
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Conceptual: Then, typically converted to GGUF using llama.cpp's convert.py script
# This step is usually a separate command-line operation
# E.g., python llama.cpp/convert.py path/to/your/quantized_model_awq --outtype f16, followed by ./llama.cpp/quantize to reach a level such as Q4_K_M (see the GGUF section above)

Libraries like AutoAWQ (which integrates with the Hugging Face Transformers ecosystem) simplify this process significantly, providing high-level APIs to apply AWQ quantization and export models for inference.

Comparative Overview: GGUF vs. AWQ vs. GPTQ

Beyond GGUF, other prominent quantization methods like AWQ and GPTQ offer different trade-offs, primarily optimized for GPU inference.

GPTQ (Generalized Post-Training Quantization)

GPTQ is a one-shot weight quantization method that utilizes approximate second-order information to achieve high accuracy and efficiency. Its primary focus is on GPU inference. GPTQ can quantize models to very low bit levels (e.g., 8, 4, 3, or even 2 bits), leading to significant reductions in model size. In terms of speed, GPTQ excels in GPU inference, often providing substantial speedups (e.g., 5x faster than GGUF when running purely on GPU with optimized kernels like Marlin). GPTQ aims for high accuracy even at low bit widths, with minimal degradation compared to the original model. However, a critical aspect of GPTQ is its reliance on a calibration dataset. The quality of the quantized model can be directly influenced by the selection and size of this dataset, potentially leading to reduced accuracy on real-world benchmarks if not chosen carefully. The quantization process itself can also be time-consuming for very large models.

AWQ (Activation-aware Weight Quantization)

AWQ represents an innovative activation-aware weight quantization approach that prioritizes protecting important (salient) weights by observing activation distributions rather than just the weights themselves. This method aims to minimize accuracy loss, particularly for instruction-tuned and multi-modal LLMs. Similar to GPTQ, AWQ achieves significant model size reduction through low-bit weight-only quantization. AWQ offers excellent GPU inference speeds, often outperforming GPTQ, especially when implemented with optimized kernels. Crucially, AWQ does not rely on backpropagation or reconstruction (unlike GPTQ), which can make its quantization process potentially faster and less data-intensive, requiring less calibration data. In terms of accuracy, AWQ generally achieves very good quantization performance and claims to better preserve the LLM's generalization ability, being less prone to overfitting the calibration set. However, it can be more complex to implement than simpler techniques.

Choosing the Right Quantization Method

The "right" quantization method and level aren't universal; they depend heavily on your specific hardware, performance requirements, and desired accuracy.

For CPU-only or Apple Silicon devices, choose GGUF. It's specifically designed for these environments and offers the best compatibility and performance for CPU-bound inference. Experiment with different GGUF K-quant levels (e.g., Q4_K_M, Q5_K_M) to find the best balance of model size and accuracy for your system's RAM. Higher quants (Q6_K, Q8_0) offer better quality but require more RAM.

For GPU-accelerated inference (e.g., NVIDIA GPUs), prioritize AWQ or GPTQ. These methods are optimized for GPU performance.

  • If maximum speed and good accuracy are paramount, and you have sufficient VRAM, consider AWQ. It often provides a better balance of speed and accuracy, and its calibration process is generally more robust, especially for instruction-tuned and multi-modal models.
  • If you need high accuracy at very low bitrates (e.g., 3-bit, 4-bit) and are comfortable with careful calibration, GPTQ can deliver excellent results. Be mindful of the calibration dataset to avoid accuracy degradation.
  • If you're using a serving engine like vLLM or HuggingFace TGI, check which quantization methods they support and have optimized kernels for, as this can significantly impact performance.

When considering model size and memory constraints, all three methods offer significant reductions. Lower bit rates (2-bit, 3-bit, 4-bit) will always lead to smaller models. GGUF's K-quants are particularly efficient for balancing size and quality on CPU/mixed systems. GPTQ and AWQ can compress models to very small sizes while striving to maintain accuracy, making them suitable for fitting larger models onto GPUs with limited VRAM.

For rapid experimentation and development, GGUF can be faster to quantize, which is useful when iterating on many models.

When accuracy is absolutely critical, start with higher quantization levels (e.g., 8-bit or 6-bit if available) or even FP16/BF16 if your hardware allows. Always evaluate quantized models on relevant benchmarks (like MMLU, GSM8K) rather than just perplexity, as perplexity alone can be misleading.

Ultimately, the best approach is to benchmark the quantized model on your target hardware and with your specific use case to determine the optimal choice for your LLM deployment.

Deployment Best Practices

Once you've chosen your desired GGUF quantization level, the next crucial step is getting your LLM up and running effectively. This section provides practical advice for deploying and optimizing GGUF models.

Finding Pre-Quantized GGUF Models on Hugging Face Hub

The Hugging Face Hub serves as the premier central repository for pre-trained models, making it the ideal starting point for discovering pre-quantized GGUF models. Numerous community members, with TheBloke being a particularly prolific contributor, actively convert and quantize a wide array of popular LLMs into various GGUF formats.

To efficiently locate GGUF models, you can easily browse all relevant models by applying the library=gguf filter on the Hugging Face models page (accessible via hf.co/models?library=gguf). Once you select a model, always navigate to its "Files and versions" tab. Here, you will typically find multiple .gguf files, each corresponding to a different quantization level (e.g., model.Q4_K_M.gguf, model.Q5_K_S.gguf). The accompanying model card or associated discussions often provide invaluable guidance and recommendations on which quantization level is most suitable for various hardware configurations.

For downloading these models, you have the option of direct download from the Hugging Face website. For more programmatic control, the huggingface_hub Python library is highly recommended. For instance, to download a specific GGUF file, you would execute commands such as:

pip install huggingface_hub
huggingface-cli download <repo_id> <filename.gguf> --local-dir .

(Remember to replace <repo_id> with the model's repository ID, like TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF, and <filename.gguf> with the precise GGUF file name.)
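
The same download can also be done from Python with hf_hub_download (the file name below is an example; pick whichever quantization level you actually want from the repo's file list):

# Programmatic download with the huggingface_hub Python library
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",   # example repo from above
    filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",    # example file; check the repo's file list
    local_dir=".",
)
print(local_path)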

Using llama.cpp for Inference

llama.cpp stands as the reference implementation for running GGUF models, renowned for its exceptional versatility. It provides both a robust command-line interface (CLI) and a convenient Python binding, llama-cpp-python, catering to diverse user preferences.

Building llama.cpp: To get started, you'll first need to build it.

Clone and navigate into the llama.cpp repository

Compile based on your hardware

  • CPU-only: make
  • With GPU (NVIDIA CUDA): make LLAMA_CUBLAS=1 (Requires CUDA toolkit installed)
  • With GPU (AMD ROCm): make LLAMA_HIPBLAS=1 (Requires ROCm toolkit installed)
  • With Apple Metal: make LLAMA_METAL=1

CLI Inference

Once successfully built, you can execute models directly from the command line:

./main -m /path/to/your/model.Q4_K_M.gguf -p "Your prompt here" -n 256

-m: Specifies the path to your GGUF model file.

-p: Your input prompt.

-n: Maximum number of tokens to generate.

For interactive chat sessions, the command ./main -m /path/to/your/model.gguf -i -p "You are a helpful assistant." is particularly useful.

Optimizing CPU Usage (n_threads)

One of llama.cpp's core strengths lies in its highly efficient CPU utilization, achieved through sophisticated multithreading. When running llama.cpp, whether via the CLI or its llama-cpp-python binding, you can precisely control the number of CPU threads it employs using the -t (or n_threads) parameter.

Starting point: A good initial strategy is to set n_threads to the number of physical cores your CPU possesses. If your CPU supports hyperthreading, consider setting it slightly less than the total logical cores (threads) to reserve some capacity for background system processes. For instance, on a CPU with 6 physical cores and 12 logical threads, an n_threads value of 6 or 8 would be a sensible starting point.

./main -m model.gguf -p "Hello" -t 8 # Uses 8 CPU threads

Experimentation is key: However, it is crucial to understand that the optimal number of threads can vary based on your specific CPU architecture and the particular LLM being run. Therefore, experimentation is key: benchmark your setup with different n_threads values to identify the sweet spot that maximizes performance for your system. Using too many threads can, at times, lead to diminishing returns or even performance degradation due to increased overhead.
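
The same knob is exposed through the llama-cpp-python binding mentioned above. A minimal sketch (the model path and settings are examples; requires pip install llama-cpp-python):

from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",   # path to your GGUF file
    n_threads=8,                      # CPU threads; start near your physical core count
    n_ctx=2048,                       # context window; larger values use more RAM
    n_gpu_layers=0,                   # 0 = CPU-only; raise this on a GPU-enabled build
)
out = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(out["choices"][0]["text"])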

Ollama

For users seeking an even more streamlined and user-friendly experience, particularly beginners, Ollama emerges as an excellent solution. Ollama serves as a powerful wrapper around llama.cpp and other inference engines, providing an exceptionally intuitive way to download, run, and manage GGUF models with remarkable simplicity, often requiring just a single command.

Ease of Use: Ollama automates a significant portion of the setup process; after downloading the Ollama application, you can effortlessly pull and run models using commands like ollama run llama2.

Model Discovery: Ollama also maintains its own curated model library, frequently populated with popular GGUF models, which further simplifies model discovery and deployment.

Behind the Scenes: Crucially, while Ollama greatly simplifies the user experience, it consistently leverages the robust capabilities of GGUF and llama.cpp under the hood to ensure efficient local inference.

RAM Considerations for GGUF Models

Regardless of the chosen quantization level, LLMs inherently demand a substantial amount of RAM (or VRAM if offloaded to a GPU) to operate effectively.

Model Size = RAM Footprint: The size of the GGUF file directly correlates with the amount of RAM it will consume upon loading. For instance, a 7B (7 billion parameter) model quantized to Q4_K_M might occupy around 4-5 GB, necessitating at least that much available RAM.

Context Window (n_ctx): The context window (n_ctx) parameter plays a vital role in RAM consumption. This parameter dictates how many tokens the model can "remember" and process within a single interaction. A larger n_ctx value will inevitably consume more RAM; for example, setting n_ctx to 4096 tokens will require significantly more memory than 2048 tokens.

System RAM vs. VRAM:

  • CPU-only inference: The entire model must reside within your system's RAM. If you possess 16GB of RAM, running models significantly larger than 7B Q4_K_M might prove challenging, especially if you desire a larger context window or need to run other applications concurrently.
  • Hybrid (CPU+GPU) inference: In a hybrid setup, you gain the flexibility to offload a portion of the layers to your GPU's VRAM using the ngl (number of GPU layers) parameter in llama.cpp. This capability dramatically reduces the RAM footprint on your main system, enabling you to run larger models or achieve faster inference by harnessing the GPU's processing power. For example, if your GPU has 8GB of VRAM, you could offload enough layers to utilize most of that, leaving the remainder on the CPU.

Swap Space: A critical point to remember is that if your system exhausts its physical RAM, it will resort to using swap space on your hard drive or SSD. While this prevents system crashes, it will severely degrade inference speed, as disk access is orders of magnitude slower than RAM. Therefore, it is always advisable to aim for sufficient RAM to comfortably accommodate your chosen model and context window without excessive reliance on swap space.

By diligently adhering to these deployment best practices, you can effectively leverage GGUF models to run powerful LLMs locally on your own hardware, achieving an optimal balance between performance and resource utilization tailored to your specific needs.

Conclusion

Quantization is fundamentally reshaping the landscape of Large Language Models (LLMs) by reducing their size and computational demands. This critical technique compresses models, making them more efficient without significant performance loss. Key methods like GGUF (GGML Universal Format) enable broader compatibility and easier deployment on consumer hardware, while AWQ (Activation-aware Weight Quantization) and GPTQ optimize for specific model architectures, striking a balance between quantization accuracy and speed.

The benefits are profound and far-reaching. Quantization dramatically enhances privacy by allowing LLMs to run locally on devices, eliminating the need to send sensitive data to cloud servers. It significantly lowers costs associated with inference and storage, democratizing access to powerful AI. Furthermore, it boosts accessibility, bringing advanced AI capabilities to a wider range of devices, including those with limited resources.

Looking ahead, quantization is not merely an optimization but a cornerstone for the pervasive integration of AI. It paves the way for more resilient, ethical, and widely available AI applications, pushing the boundaries of what's possible on edge devices and personal machines. The future of AI is smaller, faster, and more private, thanks to the continuous advancements in quantization.

Ready to explore how these advancements can optimize your AI deployments?

Discover Our Quantization Solutions Today

Book an AI consultation

Looking to build AI solutions? Let's chat.

Schedule your consultation today - this is not a sales call; feel free to come prepared with your technical queries.

You'll be meeting Rohan Sawant, the Founder.
Behind the Blog 👀
Manideep
Writer

Manideep is a Machine Learning Intern at Ionio

Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.