
LLM GGUF Guide: File Format, Structure, and How It Works

By Ryan A. on May 24, 2025

Guest Author

Large Language Models (LLMs) are changing rapidly, and with that comes the need for good, consistent ways to store them. GGUF (GPT-Generated Unified Format) is a big step in this direction, especially for running LLMs on your own computer. It addresses shortcomings of its predecessor, GGML, by providing a more robust and flexible way to share and use these models.

This guide explains GGUF in detail. It's written for software engineers and machine learning engineers, but we've also added bits to help anyone interested in running LLMs locally understand why GGUF matters. We'll cover what it's made of, what makes it good, and how you can use it.

Why a New Format? The Story Behind GGUF

Before GGUF, a format called GGML (GPT-Generated Model Language) was very common, especially because of the popular llama.cpp project. GGML is a library for machine learning calculations, and its file format was key to running LLMs efficiently on computer processors (CPUs). But the GGML format had some problems:

  • Hard to Change: Adding new details or information about the model often meant making big changes to the format itself. This made it tough to update the format smoothly as new types of models or ways to shrink them (quantization) came out.
  • Metadata Issues: Important info like special words the model understands, how much text it can remember, or details about its design were sometimes handled inconsistently or needed separate files.
  • Compatibility Problems: Different versions of projects using GGML could end up not working together because of small differences in the format.

GGUF was created to fix these issues. The llama.cpp community designed it to be a unified, flexible, and future-ready file format for LLMs. The main goal was to allow new information to be added to models without making older files unusable or requiring complicated steps to update them.

What's Inside a GGUF File?

A GGUF file is a structured binary file, meaning it's organized in a specific way for quick loading and use. It has several main parts arranged one after the other:

  1. Header: This holds key information about the file.
  2. Metadata Key-Value Store: A flexible area for storing all sorts of details about the model.
  3. Tensor Info: Descriptions for each piece of numerical data (tensor) in the model.
  4. Tensor Data: The actual numerical values (weights) for these tensors.

Let's look at each part closely.

Header

The GGUF header is a fixed size and contains these fields:

  • Magic Number: The four bytes 0x47 0x47 0x55 0x46, the ASCII codes for 'G', 'G', 'U', 'F'. This tells software reading the file that it's a GGUF file.
  • Version: A number indicating the GGUF version (like V1, V2, V3). Newer versions can add fields or change how tensor info is structured, but the idea is that newer software should still be able to read older versions if designed correctly.
  • Tensor Count: The total number of tensors (numerical arrays) in the file.
  • Metadata KV Count: The number of key-value pairs in the metadata section.

Metadata Key-Value Store

This part is very flexible and a big improvement over GGML. It allows model creators to put rich details directly into the file. Each key-value pair has:

  • Key: A text string (like "architecture" or "context_length").
  • Value Type: A code that says what kind of data the value is (e.g., number, text, true/false, list).
  • Value: The actual information, matching the type.

Some common metadata keys include:

  • general.architecture: for example, llama, falcon
  • general.name: A readable name for the model.
  • [architecture_name].context_length: for example, llama.context_length (how much text the model can consider at once).
  • [architecture_name].embedding_length: for example, llama.embedding_length
  • [architecture_name].block_count: for example, llama.block_count
  • tokenizer.ggml.model: for example, llama, gpt2 (for the type of tokenizer, which breaks text into pieces the model understands).
  • tokenizer.ggml.tokens: A list of strings representing the model's vocabulary.
  • tokenizer.ggml.scores: (Optional) A list of numbers for token scores (used in some tokenizers).
  • tokenizer.ggml.token_type: (Optional) A list of numbers for token types.
  • tokenizer.ggml.merges: For some tokenizers, the rules for combining pieces of text.
  • general.quantization_version: Specifies the version of how the model's size was reduced.
  • general.file_type: An overall hint about how most of the model's tensors are quantized; tools typically report it as, for example, Mostly Q4_K_M.

This flexibility means new, specific details can be added without messing up the basic GGUF structure. For instance, information about specialized model additions (like LoRA adapters) or details about how the model was trained could be included.
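
To make the layout concrete, here's a minimal sketch (in Python) of how one metadata key-value pair could be read, assuming the file handle is already positioned at the start of a pair and the file is GGUF v2 or later (where string lengths are 64-bit). It only handles a few value types; a full reader would cover all of them, including arrays.

import struct

# A few GGUF metadata value type codes (see the GGUF spec in llama.cpp for the full list)
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_FLOAT32 = 6
GGUF_TYPE_BOOL = 7
GGUF_TYPE_STRING = 8

def read_string(f):
    # GGUF strings are a little-endian uint64 length followed by UTF-8 bytes
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def read_kv_pair(f):
    # Each metadata entry: key string, uint32 value-type code, then the value itself
    key = read_string(f)
    (value_type,) = struct.unpack("<I", f.read(4))
    if value_type == GGUF_TYPE_UINT32:
        (value,) = struct.unpack("<I", f.read(4))
    elif value_type == GGUF_TYPE_FLOAT32:
        (value,) = struct.unpack("<f", f.read(4))
    elif value_type == GGUF_TYPE_BOOL:
        value = f.read(1) != b"\x00"
    elif value_type == GGUF_TYPE_STRING:
        value = read_string(f)
    else:
        raise NotImplementedError(f"Value type {value_type} is not handled in this sketch")
    return key, value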

Tensor Info

After the metadata, there's a list of information about each tensor, as many as specified by tensor_count in the header. Each tensor info entry usually has:

  • Name: A string identifying the tensor (e.g., blk.0.attn_norm.weight).
  • Number of Dimensions (n_dims): A number (e.g., 1, 2, 4).
  • Shape (dims): A list of numbers showing the size of each dimension of the tensor.
  • Type: A code (e.g., GGML_TYPE_F32, GGML_TYPE_Q4_K, GGML_TYPE_Q8_0) indicating the data type and how the tensor's size was reduced (quantization). You can find a full list in ggml.h within llama.cpp.
  • Offset: A number giving where this tensor's data begins, expressed as a byte offset relative to the start of the tensor data section (not the start of the file).
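
As a rough illustration, a single tensor info entry in a GGUF v2+ file could be decoded along these lines (read_string is the same kind of helper as in the metadata sketch above):

import struct

def read_string(f):
    # uint64 length followed by UTF-8 bytes (GGUF v2+)
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def read_tensor_info(f):
    # Tensor info entry: name, number of dimensions, dimension sizes, ggml type code, data offset
    name = read_string(f)                                    # e.g. "blk.0.attn_norm.weight"
    (n_dims,) = struct.unpack("<I", f.read(4))
    dims = struct.unpack(f"<{n_dims}Q", f.read(8 * n_dims))  # 64-bit dimension sizes in GGUF v2+
    (ggml_type,) = struct.unpack("<I", f.read(4))            # e.g. GGML_TYPE_Q4_K
    (offset,) = struct.unpack("<Q", f.read(8))               # offset within the tensor data section
    return name, dims, ggml_type, offset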

Tensor Data

This section holds the actual numerical data for all the tensors. Each tensor's data is located at the spot specified in its tensor info entry. Tensors are usually stored one after another, but there might be empty space (padding) to ensure they line up correctly in memory. GGUF needs tensor data to be aligned to a specific boundary (e.g., 32 or 64 bytes, specified by the general.alignment metadata key, or defaulting to 32 if not there). This alignment is important for memory mapping (mmap), which lets the operating system load parts of the model into memory only when they're needed, saving on initial loading time and memory use.
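
As a small worked example, the padding rule amounts to rounding each tensor's data offset up to the next multiple of the alignment value (general.alignment, or 32 by default):

ALIGNMENT = 32  # value of general.alignment, defaulting to 32 if the key is absent

def align_offset(offset: int, alignment: int = ALIGNMENT) -> int:
    # Round an offset up to the next multiple of the alignment boundary
    return (offset + alignment - 1) // alignment * alignment

print(align_offset(1000))  # 1024: the next 32-byte boundary after 1000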


GGUF file structure: a sequential layout comprising the header, metadata key-value pairs, an array of tensor information blocks, and finally the consolidated tensor data.


What Makes GGUF Special?

GGUF offers several important benefits for people who build and use LLMs:

  • It's Flexible: You can add new metadata fields without making older versions of the software stop working. Programs can simply ignore metadata keys they don't know about.
  • One File Does It All: GGUF aims to be a single file. This means it can include tokenizer information (the list of words, how they combine, and special words) right within the model file. This makes it easier to share and use. You won't need to search for separate tokenizer.model or tokenizer_config.json files for basic use.
  • Good Quantization Support: GGUF directly supports many ways to shrink model size, as defined by ggml.h (e.g., Q2_K, Q3_K_S, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32). This gives you precise control over the balance between model size, how fast it runs, and how accurate it is.
  • Efficient Memory Use (mmap): The format is designed to be easily memory-mapped, so the operating system loads only the parts of the model that are actually needed into memory. This reduces initial load time and memory use, especially for very large models, and the data alignment described above helps make it possible (see the short sketch after this list).
  • Becoming a Standard: Because llama.cpp promotes it, GGUF is becoming a common way to store LLMs that have been shrunk for local use. This helps different tools and models work better together.
  • More Model Information: You can include details like scaling parameters, suggested ways to start a prompt, or licensing info directly in the model. This makes it easier to use and reproduce results.
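
To illustrate the memory-mapping point, here's a minimal Python sketch using the standard mmap module; it maps a GGUF file and reads just the 8-byte tensor count from the header, so only that region of the file needs to be paged in.

import mmap
import struct

# Map the whole file read-only without loading it into RAM; the OS pages in
# only the regions that are actually touched.
with open("model.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Header layout: 4-byte magic, 4-byte version, then a uint64 tensor count at offset 8
    (tensor_count,) = struct.unpack_from("<Q", mm, 8)
    print(f"Tensor count: {tensor_count}")
    mm.close()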

How to Work with GGUF Files

The main tools for working with GGUF files come from the llama.cpp group.

Creating GGUF Files

Usually, you start with a model in another format, like from Hugging Face Transformers (PyTorch *.bin or SafeTensors *.safetensors files). The llama.cpp project provides a Python script, convert.py (superseded in newer versions of the repository by convert_hf_to_gguf.py), to change these models into GGUF.

Here's an example of how you might convert a model:

# Make sure you have llama.cpp downloaded and the Python tools installed
cd llama.cpp
python3 convert.py path/to/your/hf_model_directory \
  --outfile converted_model.gguf \
  --outtype f16 # Or f32, q8_0; K-quants like Q4_K_M come from a later quantize step

This script reads the original model data and settings, then writes them out in the GGUF format. You can pick the output tensor type (--outtype) for initial size reduction, though more detailed size reduction (quantization) is often a separate step.

Making GGUF Files Smaller (Quantization)

Once you have a GGUF file (often in F16 or F32 precision, i.e., half or full precision), you can quantize it further to shrink the file and speed up inference on CPUs. llama.cpp has a quantize program for this.

# Compile llama.cpp to get the 'quantize' program
cd llama.cpp && make quantize

# Example: Reduce the size of an F16 GGUF to Q4_K_M
./quantize converted_model_f16.gguf quantized_model_q4km.gguf Q4_K_M

# Common quantization types:
# Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
# K-Quants (generally better for quality at smaller sizes):
# Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M
# Q5_K_S, Q5_K_M, Q6_K

The K-Quants (Q*_K types) are generally recommended because, for a given file size, they tend to preserve output quality (measured, for example, by perplexity) better than the older quantization schemes, thanks to their super-block structure and finer-grained scale factors.

Looking Inside GGUF Files

Tooling dedicated to inspecting GGUF files is still maturing, but you can often get the information you need by loading the model with llama.cpp itself (e.g., with the main example program), which prints the metadata it reads, and community Python scripts exist that read and display GGUF metadata.

If you were to write code to do this, you would read the header, then go through the metadata key-value pairs, and then the tensor information, interpreting the data according to the GGUF rules. Python's struct module can be helpful for reading binary data.

For example, to read the magic number and version in Python:

import struct

with open("model.gguf", "rb") as f:
    magic = f.read(4)
    if magic != b'GGUF':
        raise ValueError("Not a GGUF file")
    version = struct.unpack("<I", f.read(4))[0] # <I for little-endian unsigned 32-bit integer
    tensor_count = struct.unpack("<Q", f.read(8))[0] # <Q for little-endian unsigned 64-bit integer
    metadata_kv_count = struct.unpack("<Q", f.read(8))[0] # counts are 64-bit in GGUF v2+ (32-bit in v1)
    
    print(f"GGUF Version: {version}")
    print(f"Tensor Count: {tensor_count}")
    print(f"Metadata KV Count: {metadata_kv_count}")
    # More parsing would happen here...

Note: This is a simplified example. A complete program to parse GGUF is more involved.
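
If you'd rather not walk the binary layout yourself, the llama.cpp repository also ships a small Python package, gguf (the gguf-py directory, also published on PyPI), with a reader class. The snippet below is a sketch assuming that package is installed and that its GGUFReader API exposes parsed fields and tensor descriptions as in its source.

# pip install gguf  (the gguf-py package maintained in the llama.cpp repository)
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# Metadata fields, keyed by name (e.g. "general.architecture", "llama.context_length")
for name in reader.fields:
    print(name)

# Tensor descriptions: name, shape, and quantization type for each tensor
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)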

GGUF Compared to Other Formats

  • GGML: GGUF is its direct replacement, offering better flexibility and handling of model information.
  • PyTorch .pth/.pt: These are Python-specific, pickle-based checkpoints that store tensors and usually assume the model's Python class definitions are available alongside them. They aren't meant for direct use from C/C++ programs without a Python environment.
  • SafeTensors (.safetensors): A secure and fast format for storing numerical data. It's great for sharing models safely and making them work across different programming frameworks. GGUF often starts from models that might use SafeTensors for their original weights. While SafeTensors is about storing numerical data, GGUF is more complete, including metadata and size-reduction specific to ggml programs.
  • ONNX (.onnx): An open format designed to represent machine learning models. ONNX is more general, aiming for a wide range of computer hardware and programs. GGUF is specifically made for ggml-based programs like llama.cpp, making it very efficient for that specific setup, especially for running models on CPUs and using various size-reduction methods that aren't always easy to do with ONNX for LLMs.

Each format has its own purpose. GGUF's strength is its optimization for llama.cpp and similar ggml-based programs that run models, especially for LLMs that have been reduced in size on CPUs and, increasingly, on GPUs supported by llama.cpp.

Putting It to Use: Running LLMs with GGUF and llama.cpp

The most common way to use GGUF files is with llama.cpp to run LLMs on your local machine.

After compiling llama.cpp (which usually creates a main program):

# Example of running the model
./main -m ./models/your_quantized_model.gguf \
    -p "Building a website can be done in many ways. One popular way is to" \
    -n 256 \
    --repeat-penalty 1.1 \
    -ngl 35 # Number of layers to offload to GPU (if GPU support compiled)

Here's what these options mean:

  • -m: Points to the GGUF model file.
  • -p: The initial text you give the model (the prompt).
  • -n: How many words/pieces of text (tokens) the model should generate.
  • --repeat-penalty: Makes the model less likely to repeat the same words.
  • -ngl: (Optional) How many layers of the model to move to your graphics card (GPU), if llama.cpp was set up to use your GPU (e.g., with CUDA or Metal). This makes the model run much faster.
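
The same GGUF file can also be driven from Python through the community llama-cpp-python bindings, which wrap llama.cpp. This is a minimal sketch assuming that package is installed (pip install llama-cpp-python) and that the model path is adjusted to your own file.

from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers plays the same role as -ngl in the CLI
llm = Llama(
    model_path="./models/your_quantized_model.gguf",
    n_ctx=2048,        # context window to allocate
    n_gpu_layers=35,   # set to 0 for CPU-only inference
)

output = llm(
    "Building a website can be done in many ways. One popular way is to",
    max_tokens=256,
    repeat_penalty=1.1,
)
print(output["choices"][0]["text"])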

The Future of GGUF

GGUF is actively being worked on and improved alongside llama.cpp. We can expect:

  • Newer Versions (V3, V4, ...): As LLM technology gets better, new versions of GGUF might come out to support new types of metadata, ways to arrange numerical data, or methods for reducing model size. The design tries to keep older versions working with newer software whenever possible.
  • More Tool Support: While llama.cpp is the main force, other tools and libraries might start using GGUF because it's so efficient for running LLMs locally.
  • Better Metadata Standards: The community might agree on more standard keys for common information (like prompt formats or settings for model additions), which would help different tools work together even better.

The version bumps so far illustrate this: GGUF V2 widened counts and string lengths from 32-bit to 64-bit values, and V3 added support for big-endian encodings of the format, while each tensor's quantization type has been recorded in its tensor info entry from the start.

Conclusion

The GGUF file format is a big step forward in making it standard how Large Language Models are stored, shared, and used, especially on computers with limited resources. Its flexible metadata system, strong support for reducing model size, and single-file design make it a key part of projects like llama.cpp and the wider community focused on running LLMs locally.

Understanding GGUF's structure and what it can do is becoming more important for engineers working with LLMs. Whether you're changing model formats, fine-tuning how small they are, or building applications, GGUF provides a strong base for AI that's efficient and easy to access.
