By Ryan A. on May 24, 2025
Large Language Models (LLMs) are changing rapidly, and with that comes the need for good, consistent ways to store them. GGUF (GPT-Generated Unified Format) is a big step in this direction, especially for running LLMs on your own computer. It fixes several issues with its predecessor, the GGML format, by providing a stronger and more flexible way to share and run these models.
This guide explains GGUF in detail. It's written for software engineers and machine learning engineers, but we've also added bits to help anyone interested in running LLMs locally understand why GGUF matters. We'll cover what it's made of, what makes it good, and how you can use it.
Before GGUF, a format called GGML (GPT-Generated Model Language) was very common, especially because of the popular llama.cpp project. GGML is a library for machine learning calculations, and its file format was key to running LLMs efficiently on computer processors (CPUs). But the GGML format had some problems:

- Adding new features or new kinds of model information usually required breaking changes, so older model files could stop working with newer code.
- There was no flexible way to store metadata inside the file, so details like the architecture, hyperparameters, and tokenizer settings often had to be supplied separately or assumed by the loading program.

GGUF was created to fix these issues. The llama.cpp community designed it to be a unified, flexible, and future-ready file format for LLMs. The main goal was to allow new information to be added to models without making older files unusable or requiring complicated steps to update them.
A GGUF file is a structured binary file, meaning it's organized in a specific way for quick loading and use. It has several main parts arranged one after the other:

- The header
- Metadata key-value pairs
- Tensor information entries
- The tensor data itself

Let's look at each part closely.
The GGUF header is a fixed size and contains these fields:

- Magic number: the 4-byte value 0x47475546 (which spells 'GGUF' in ASCII). This tells your computer it's a GGUF file.
- Version: a 32-bit unsigned integer giving the GGUF format version.
- tensor_count: a 64-bit unsigned integer giving the number of tensors stored in the file.
- metadata_kv_count: a 64-bit unsigned integer giving the number of metadata key-value pairs that follow.

The metadata section is very flexible and a big improvement over GGML. It allows model creators to put rich details directly into the file. Each key-value pair has:

- A key: a human-readable string (for example, general.architecture).
- A value type: a code indicating whether the value is an integer, a float, a boolean, a string, or an array.
- The value itself.
Some common metadata keys include:

- general.architecture: for example, llama or falcon.
- general.name: A readable name for the model.
- [architecture_name].context_length: for example, llama.context_length (how much text the model can consider at once).
- [architecture_name].embedding_length: for example, llama.embedding_length (the size of the model's hidden/embedding dimension).
- [architecture_name].block_count: for example, llama.block_count (the number of transformer blocks/layers).
- tokenizer.ggml.model: for example, llama or gpt2 (the type of tokenizer, which breaks text into pieces the model understands).
- tokenizer.ggml.tokens: A list of strings representing the model's vocabulary.
- tokenizer.ggml.scores: (Optional) A list of numbers for token scores (used in some tokenizers).
- tokenizer.ggml.token_type: (Optional) A list of numbers for token types.
- tokenizer.ggml.merges: For some tokenizers, the rules for combining pieces of text.
- general.quantization_version: Specifies the version of the quantization scheme used to shrink the model.
- general.file_type: Often tells you the dominant way the model's weights are stored in a smaller size, e.g., Mostly Q4_K_M.

This flexibility means new, specific details can be added without breaking the basic GGUF structure. For instance, information about specialized model additions (like LoRA adapters) or details about how the model was trained could be included.
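As an illustration of that flexibility, here is a minimal, hedged sketch of writing a file with both standard and custom metadata using the gguf Python package that ships with llama.cpp (gguf-py, installable with pip install gguf). The key training.notes and the tiny tensor are made-up examples, and the exact GGUFWriter method names may differ between package versions.

# Assumes: pip install gguf  (the gguf-py package from the llama.cpp repository)
import numpy as np
from gguf import GGUFWriter

writer = GGUFWriter("tiny_example.gguf", arch="llama")

# Standard metadata keys
writer.add_string("general.name", "tiny-example")
writer.add_uint32("llama.context_length", 2048)

# A custom, application-specific key -- possible because GGUF metadata is open-ended
writer.add_string("training.notes", "fine-tuned on an internal support corpus")

# A dummy tensor, just so the file is structurally complete
writer.add_tensor("blk.0.attn_norm.weight", np.ones(16, dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()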
After the metadata, there's a list of information about each tensor, with as many entries as specified by tensor_count in the header. Each tensor info entry usually has:

- The tensor's name (e.g., blk.0.attn_norm.weight).
- The number of dimensions and the size of each dimension.
- A type identifier (e.g., GGML_TYPE_F32, GGML_TYPE_Q4_K, GGML_TYPE_Q8_0) indicating the data type and how the tensor's size was reduced (quantization). You can find the full list in ggml.h within llama.cpp.
- An offset telling you where that tensor's data begins in the tensor data section.

The final section of the file holds the actual numerical data for all the tensors. Each tensor's data is located at the offset specified in its tensor info entry. Tensors are usually stored one after another, but there might be empty space (padding) between them to ensure they line up correctly in memory. GGUF requires tensor data to be aligned to a specific boundary (e.g., 32 or 64 bytes, specified by the general.alignment metadata key, or defaulting to 32 if it's absent). This alignment is important for memory mapping (mmap), which lets the operating system load parts of the model into memory only when they're needed, saving on initial loading time and memory use.
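To make the alignment and memory-mapping idea concrete, here is a small, hedged sketch of how a reader could map the whole file and view one tensor's weights in place. It assumes the tensor's absolute file offset and element count have already been read from the tensor info entries and that the tensor is stored unquantized as float32; the offset and count below are made-up placeholders, not values from a real model.

import mmap
import numpy as np

ALIGNMENT = 32  # default when the general.alignment metadata key is absent

def align_offset(offset: int, alignment: int = ALIGNMENT) -> int:
    # Round an offset up to the next multiple of the alignment, as GGUF
    # requires for the start of each tensor's data.
    return (offset + alignment - 1) // alignment * alignment

with open("model.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Hypothetical values: in practice these come from the tensor info entries.
tensor_offset = align_offset(123456)  # absolute file offset of the tensor's data
element_count = 4096                  # number of float32 elements in the tensor

# View the tensor directly inside the mapped file; pages are only read when touched.
weights = np.frombuffer(mm, dtype=np.float32, count=element_count, offset=tensor_offset)
print(weights[:8])

Because the data is only viewed, not copied, mapping many gigabytes of weights this way costs very little until the pages are actually accessed, which is exactly what llama.cpp relies on.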
GGUF file structure: a sequential layout comprising the header, metadata key-value pairs, an array of tensor information blocks, and finally the consolidated tensor data.
GGUF offers several important benefits for people who build and use LLMs:

- Self-contained, single-file models: The weights, tokenizer, and configuration live in one file, so you don't need separate tokenizer.model or tokenizer_config.json files for basic use.
- Broad quantization support: GGUF supports the many quantization types defined in ggml.h (e.g., Q2_K, Q3_K_S, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32). This gives you precise control over the balance between model size, how fast it runs, and how accurate it is.
- Memory mapping (mmap): The format is designed to be easily memory-mapped. This allows your computer's operating system to load only the necessary parts of the model into memory as they're needed, which reduces initial loading time and memory use, especially for very large models. The way the data is aligned helps with this.
- Standardization: Because llama.cpp promotes it, GGUF is becoming a common way to store LLMs that have been shrunk for local use. This helps different tools and models work better together.

The main tools for working with GGUF files come from the llama.cpp project.
Usually, you start with a model in another format, such as one from Hugging Face Transformers (PyTorch *.bin or SafeTensors *.safetensors files). The llama.cpp project provides a Python script, convert.py (or newer variants like convert-hf-to-gguf.py), to convert these models into GGUF.
Here's an example of how you might convert a model:
# Make sure you have llama.cpp downloaded and its Python requirements installed
cd llama.cpp
python3 convert.py path/to/your/hf_model_directory \
    --outfile converted_model.gguf \
    --outtype f16   # or f32, q8_0, etc.
This script reads the original model data and settings, then writes them out in the GGUF format. You can pick the output tensor type (--outtype) for initial size reduction, though more detailed size reduction (quantization) is often a separate step.
Once you have a GGUF file (often in F16 or F32, meaning half or full precision), you can reduce its size further to make the file smaller and speed up how fast it runs on CPUs. llama.cpp has a quantize program for this.
# Compile llama.cpp to get the 'quantize' program
cd llama.cpp && make quantize

# Example: Reduce the size of an F16 GGUF to Q4_K_M
./quantize converted_model_f16.gguf quantized_model_q4km.gguf Q4_K_M

# Common quantization types:
# Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
# K-Quants (generally better quality at smaller sizes):
# Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M
# Q5_K_S, Q5_K_M, Q6_K
The K-quants (the types ending in _K) are generally recommended because they usually give better output quality (lower perplexity) for a given file size. This is due to their specific block-wise internal structure and quantization scheme.
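A quick way to see the size side of that trade-off is simply to compare the files produced above; the filenames below are the ones from the quantize example.

import os

# Compare on-disk sizes of the same model at different quantization levels.
for path in ["converted_model_f16.gguf", "quantized_model_q4km.gguf"]:
    size_gib = os.path.getsize(path) / (1024 ** 3)
    print(f"{path}: {size_gib:.2f} GiB")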
While there isn't a single dominant standalone tool just for inspecting GGUF files (apart from llama.cpp's own utilities), you can often get the information you need by loading the model with llama.cpp itself (e.g., using the main example program) with verbose output. Some community tools and Python scripts also exist to read and display GGUF metadata.
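One such option is the gguf Python package maintained alongside llama.cpp (gguf-py, installable with pip install gguf). The sketch below assumes its GGUFReader class; attribute names reflect recent versions of the package and may differ in yours.

# Assumes: pip install gguf
from gguf import GGUFReader

reader = GGUFReader("quantized_model_q4km.gguf")

# Metadata keys stored in the file (general.architecture, llama.context_length, ...)
for key in reader.fields:
    print(key)

# Tensor names, shapes, and quantization types
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)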
If you wanted to write such code yourself, you would read the header, then walk through the metadata key-value pairs, and then the tensor information, interpreting the bytes according to the GGUF layout described above. Python's struct module is helpful for reading binary data. For example, to read the header fields in Python:
import struct

with open("model.gguf", "rb") as f:
    magic = f.read(4)
    if magic != b'GGUF':
        raise ValueError("Not a GGUF file")

    version = struct.unpack("<I", f.read(4))[0]            # <I: little-endian unsigned 32-bit integer
    tensor_count = struct.unpack("<Q", f.read(8))[0]       # <Q: little-endian unsigned 64-bit integer
    metadata_kv_count = struct.unpack("<Q", f.read(8))[0]

print(f"GGUF Version: {version}")
print(f"Tensor Count: {tensor_count}")
print(f"Metadata KV Count: {metadata_kv_count}")

# More parsing would happen here...
Note: This is a simplified example. A complete program to parse GGUF is more involved.
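Continuing in the same spirit, the sketch below walks the metadata key-value pairs that follow the header. It assumes the layout described earlier: strings are a 64-bit length followed by UTF-8 bytes, and each value is preceded by a 32-bit type code. The type-code table follows the GGUF specification and should be double-checked against the format version you're targeting.

import struct

# Scalar type codes from the GGUF specification (8 = string, 9 = array).
SCALAR_FORMATS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}

def read_string(f):
    # GGUF strings: 64-bit little-endian length, then that many UTF-8 bytes.
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def read_value(f, value_type):
    if value_type == 8:   # string
        return read_string(f)
    if value_type == 9:   # array: 32-bit item type, 64-bit count, then the items
        (item_type,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        return [read_value(f, item_type) for _ in range(count)]
    fmt = SCALAR_FORMATS[value_type]
    (value,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return value

with open("model.gguf", "rb") as f:
    f.read(4)  # skip the magic number
    version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    for _ in range(kv_count):
        key = read_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        value = read_value(f, value_type)
        summary = f"[{len(value)} items]" if isinstance(value, list) else value
        print(key, "=", summary)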
It's also worth comparing GGUF with other common model storage formats:

- PyTorch .pth/.pt: These are usually specific to Python (using pickle) and store numerical data along with Python code for the model's design. They aren't meant for direct use in C/C++ programs without a Python environment.
- SafeTensors (.safetensors): A secure and fast format for storing tensors. It's great for sharing models safely and works across different frameworks. GGUF conversions often start from models whose original weights are stored as SafeTensors. While SafeTensors focuses purely on tensor storage, GGUF is more complete, bundling metadata and quantization specific to ggml-based programs.
- ONNX (.onnx): An open format designed to represent machine learning models. ONNX is more general, aiming at a wide range of hardware and runtimes. GGUF is specifically made for ggml-based programs like llama.cpp, making it very efficient for that setup, especially for running models on CPUs and using quantization methods that aren't always easy to apply to LLMs with ONNX.

Each format has its own purpose. GGUF's strength is its optimization for llama.cpp and similar ggml-based runtimes, especially for quantized LLMs on CPUs and, increasingly, on GPUs supported by llama.cpp.
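To make the "GGUF often starts from SafeTensors" point concrete, here is a hedged sketch that reads tensors with the safetensors package and hands them to the gguf package's GGUFWriter. The filenames are placeholders, and a real conversion should use llama.cpp's convert scripts, which also handle tensor renaming, dtype conversion, tokenizer data, and metadata.

# Assumes: pip install safetensors gguf
from safetensors.numpy import load_file
from gguf import GGUFWriter

tensors = load_file("model.safetensors")          # dict of name -> numpy array
writer = GGUFWriter("model_from_safetensors.gguf", arch="llama")

for name, array in tensors.items():
    # Real converters also map names to llama.cpp's tensor naming scheme.
    writer.add_tensor(name, array)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()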
The most common way to use GGUF files is to run LLMs locally with llama.cpp.
After compiling llama.cpp (which usually creates a main program):
# Example of running the model
./main -m ./models/your_quantized_model.gguf \
-p "Building a website can be done in many ways. One popular way is to" \
-n 256 \
--repeat-penalty 1.1 \
-ngl 35 # Number of layers to offload to GPU (if GPU support compiled)
Here's what these options mean:

- -m: Points to the GGUF model file.
- -p: The initial text you give the model (the prompt).
- -n: How many tokens (words or pieces of words) the model should generate.
- --repeat-penalty: Makes the model less likely to repeat the same words.
- -ngl: (Optional) How many layers of the model to offload to your graphics card (GPU), if llama.cpp was compiled with GPU support (e.g., CUDA or Metal). This makes the model run much faster.
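If you'd rather drive a GGUF model from Python than from the command line, the llama-cpp-python bindings expose roughly the same options. A minimal sketch, assuming pip install llama-cpp-python; the model path is a placeholder.

# Assumes: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your_quantized_model.gguf",
    n_ctx=2048,        # context window size
    n_gpu_layers=35,   # layers to offload to the GPU, mirroring -ngl
)

output = llm(
    "Building a website can be done in many ways. One popular way is to",
    max_tokens=256,
    repeat_penalty=1.1,
)
print(output["choices"][0]["text"])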
GGUF is actively being worked on and improved alongside llama.cpp, and we can expect it to keep evolving. While llama.cpp is the main driving force, other tools and libraries are likely to adopt GGUF because it's so efficient for running LLMs locally. The introduction of GGUF V3, for example, brought changes like allowing each tensor to carry its own quantization type, which wasn't clearly part of V1/V2 (where a global file type often indicated the main quantization scheme).
The GGUF file format is a big step toward standardizing how Large Language Models are stored, shared, and run, especially on computers with limited resources. Its flexible metadata system, strong support for quantization, and single-file design make it a key part of projects like llama.cpp and the wider community focused on running LLMs locally.
Understanding GGUF's structure and capabilities is becoming more important for engineers working with LLMs. Whether you're converting model formats, tuning quantization levels, or building applications on top of local models, GGUF provides a strong base for AI that's efficient and easy to access.