
How To Calculate GPU VRAM Requirements for a Large Language Model

By Wei Ming T. on Apr 23, 2025


Running Large Language Models (LLMs) effectively demands significant computational resources, and Graphics Processing Unit (GPU) memory, or VRAM, is often the tightest constraint. Insufficient VRAM leads to Out-of-Memory (OOM) errors that halt inference or training. Conversely, over-provisioning VRAM results in unnecessary costs and underutilized hardware.

Understanding how to estimate VRAM requirements is therefore important for any developer working with local LLMs. This knowledge allows for informed hardware selection, efficient resource management, and successful deployment or fine-tuning of these large-scale models. Calculating these needs involves considering several factors related to the model's architecture, the specific task (inference or training), and the chosen configuration.

This post is the companion guide for the VRAM calculator.

Why VRAM is Important for LLMs

LLMs are extensive neural networks, with billions of parameters defining their learned knowledge. During operation, these parameters, along with intermediate calculations (activations), and potentially gradients and optimizer states (during training), must be stored in the GPU's VRAM for rapid processing.

The available VRAM directly influences:

Feasibility: Determines if a model of a specific size and precision can be loaded onto the GPU.
Performance: Impacts the maximum batch size and sequence length, thereby affecting throughput and latency.
Training Stability: OOM errors during training can corrupt progress or necessitate restarts, wasting valuable time and compute resources.

Components Contributing to VRAM Usage

Estimating total VRAM involves summing the memory consumed by several distinct components. The significance of each component varies based on whether inference or training/fine-tuning is being performed.

Model Parameters

This is frequently the largest and most direct component to calculate. It's determined by the model's parameter count and the numerical precision used for storage.

Number of Parameters: Usually specified in billions (e.g., 7B, 70B, 180B). This is typically found on the model card or its repository.
Precision (Data Type): Dictates the bytes needed per parameter.
- FP32 (Single Precision Float): 4 bytes
- FP16 (Half Precision Float): 2 bytes
- BF16 (Bfloat16 Float): 2 bytes
- INT8 (8-bit Integer): 1 byte
- INT4 (4-bit Integer): 0.5 bytes (packed)

The formula is:

$VRAM_{params} = \text{Number of Parameters} \times \text{Bytes per Parameter}$

Example: A 7 billion parameter model (7B) loaded in FP16 precision requires: $7 \times 10^9 \text{ parameters} \times 2 \text{ bytes/parameter} = 14 \times 10^9 \text{ bytes} = 14 \text{ GB}$

Note: This calculation assumes uniform precision and is typically the most predictable VRAM component.
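The formula above can be sketched as a small helper. The function name and lookup table are illustrative, not part of any library, and results are in decimal GB ($10^9$ bytes) to match the worked example.

```python
# Bytes per parameter for the precisions listed above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def param_vram_gb(num_params: float, precision: str) -> float:
    """Weight memory in decimal GB, assuming one uniform precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Compare the same 7B model across storage precisions.
    for precision in BYTES_PER_PARAM:
        print(f"7B @ {precision}: {param_vram_gb(7e9, precision):.1f} GB")
```

Models that mix precisions (for example, quantized weights with higher-precision embeddings) would need a per-tensor sum instead of this single multiplication.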

Optimizer States (Training/Fine-tuning Only)

During training, optimizers like Adam or AdamW maintain state information for each trainable model parameter. Adam/AdamW usually store two states per parameter (momentum and variance), often in FP32, though mixed-precision setups can change this.

Adam/AdamW: Requires storing 2 values per parameter. If in FP32, this means $2 \times 4 = 8$ bytes per parameter. This is a common estimate.
Other Optimizers: SGD with momentum might store 1 state (4 bytes/param if FP32). Adafactor uses less memory.

A common estimation for AdamW:

$VRAM_{optimizer} \approx 2 \times \text{Number of Trainable Parameters} \times 4 \text{ bytes (for FP32 states)}$

If fine-tuning all parameters of a 7B model with AdamW using FP32 states: $2 \times 7 \times 10^9 \times 4 = 56 \text{ GB}$

Note: 8-bit optimizers from bitsandbytes shrink these states, and DeepSpeed ZeRO can partition them across GPUs or offload them to CPU RAM; either approach significantly reduces this memory usage.
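A minimal sketch of the optimizer-state estimate, using the per-parameter state counts given above. The function and dictionary names are illustrative. Treating an 8-bit optimizer as exactly 1 byte per state is an assumption that ignores any quantization metadata.

```python
# Number of state tensors kept per trainable parameter (from the text above).
STATES_PER_PARAM = {"adam": 2, "adamw": 2, "sgd_momentum": 1}


def optimizer_vram_gb(trainable_params: float,
                      optimizer: str = "adamw",
                      bytes_per_state: float = 4.0) -> float:
    """Optimizer-state memory in decimal GB."""
    states = STATES_PER_PARAM[optimizer]
    return trainable_params * states * bytes_per_state / 1e9


if __name__ == "__main__":
    print(f"AdamW, FP32 states:  {optimizer_vram_gb(7e9):.0f} GB")
    # Assumption: 8-bit states cost 1 byte each.
    print(f"AdamW, 8-bit states: {optimizer_vram_gb(7e9, bytes_per_state=1.0):.0f} GB")
    print(f"SGD with momentum:   {optimizer_vram_gb(7e9, 'sgd_momentum'):.0f} GB")
```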

Gradients (Training/Fine-tuning Only)

Backpropagation computes gradients for each trainable parameter. These gradients usually match the numerical precision of the trainable parameters during the backward pass.

$VRAM_{gradients} = \text{Number of Trainable Parameters} \times \text{Bytes per Parameter (Training Precision)}$

For a 7B model being fully fine-tuned in FP16: $7 \times 10^9 \times 2 \text{ bytes/parameter} = 14 \text{ GB}$
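Adding weights, gradients, and optimizer states gives a single bytes-per-parameter figure for the static part of training memory. The sketch below assumes FP16 weights and gradients with two FP32 AdamW states, as in the examples above; the function name is illustrative.

```python
def training_bytes_per_param(weight_bytes: float = 2.0,
                             grad_bytes: float = 2.0,
                             optimizer_states: int = 2,
                             state_bytes: float = 4.0) -> float:
    """Static training memory per trainable parameter, excluding activations."""
    return weight_bytes + grad_bytes + optimizer_states * state_bytes


if __name__ == "__main__":
    per_param = training_bytes_per_param()
    total_gb = 7e9 * per_param / 1e9
    print(f"{per_param:.0f} bytes/param, {total_gb:.0f} GB for a 7B model")
```

Under these assumptions a fully fine-tuned 7B model needs 12 bytes per parameter (14 + 14 + 56 = 84 GB) before any activations are counted.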

Activations (Inference & Training)

Activations are intermediate outputs of model layers from the forward pass. Their size is more complex to determine accurately, influenced by:

Batch Size: Number of sequences processed together.
Sequence Length: Length of input sequences.
Hidden Dimension Size: Size of internal vector representations.
Number of Layers: Depth of the model.
Model Architecture: Specifics like attention mechanisms.

Limitation Acknowledgement: Precise activation memory calculation is difficult due to varied layer types and optimizations like activation checkpointing. The following formula is a rough approximation for Transformers and serves as a guideline. Actual usage depends on framework implementation and model specifics.

$VRAM_{activations} \approx \text{Batch Size} \times \text{Sequence Length} \times \text{Hidden Dim} \times \text{Num Layers} \times \text{Bytes per Activation} \times K$

Where $K$ is a model-specific heuristic factor (often between 10 and 30), covering various intermediate values. Precise figures often need detailed model analysis or empirical tests.
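To see how wide the resulting range is, the approximation can be evaluated at both ends of the $K$ interval. This is only a sketch: the shape values (batch 4, sequence 2048, hidden size 4096, 32 layers) are example inputs, and the formula is best read as a training-time estimate with all layer activations kept for the backward pass.

```python
def activation_vram_gb(batch_size: int, seq_len: int, hidden_dim: int,
                       num_layers: int, bytes_per_activation: float = 2.0,
                       k: float = 10.0) -> float:
    """Rough activation memory in decimal GB using the heuristic factor K."""
    elements = batch_size * seq_len * hidden_dim * num_layers
    return elements * bytes_per_activation * k / 1e9


if __name__ == "__main__":
    for k in (10, 30):
        gb = activation_vram_gb(4, 2048, 4096, 32, k=k)
        print(f"K={k}: about {gb:.1f} GB")
```

The threefold spread between the two results is a reminder that this term should be measured rather than trusted.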

KV Cache (Inference Generation): During auto-regressive generation, the model caches past Key (K) and Value (V) states from attention layers to accelerate token prediction. This cache grows with generated sequence length and can use considerable VRAM. Its size is highly dependent on the attention mechanism structure and sequence length.

The approximate KV Cache size for a model can be generally stated as:

$VRAM_{kv\_cache} \approx 2 \cdot L \cdot N_{kv} \cdot D_{kv} \cdot S \cdot B \cdot C_b$

Where $L = \text{Num Layers}$ , $N_{kv} = \text{Num Key/Value Heads}$ , $D_{kv} = \text{Head Dim for KV projection}$ , $S = \text{Sequence Length}$ , $B = \text{Batch Size}$ , and $C_b = \text{Bytes per Cached Value}$ .

Bytes per Cached Value ( $C_b$ ): Typically 2 bytes for FP16/BF16.
Num Key/Value Heads ( $N_{kv}$ ) & Head Dim for KV ( $D_{kv}$ ): These depend on the attention architecture (see "Attention Mechanism Variants" below).

KV Cache Quantization: To reduce VRAM, the KV cache can be quantized, for example, to INT8 or even FP8 (on supported hardware). This changes $C_b$ to 1 (for INT8 or FP8), significantly reducing $VRAM_{kv\_cache}$ . This may come with a small performance/accuracy trade-off and requires framework support.

For long sequences or large batches, the KV cache can be a primary driver of inference VRAM usage. Its size estimation is also subject to implementation details.
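As a rough illustration of the KV cache formula, the sketch below evaluates it for the Llama 3 8B values this guide uses in its inference example ($L=32$, $N_{kv}=8$, $D_{kv}=128$), once with FP16 values and once with a 1-byte quantized cache. The function name is illustrative.

```python
def kv_cache_vram_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                     seq_len: int, batch_size: int,
                     bytes_per_value: float = 2.0) -> float:
    """KV cache size in decimal GB; the leading 2 covers keys and values."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


if __name__ == "__main__":
    fp16 = kv_cache_vram_gb(32, 8, 128, seq_len=2048, batch_size=4)
    int8 = kv_cache_vram_gb(32, 8, 128, seq_len=2048, batch_size=4,
                            bytes_per_value=1.0)
    print(f"FP16 cache: {fp16:.2f} GB, 8-bit cache: {int8:.2f} GB")
```

Because the result is linear in both sequence length and batch size, doubling either one doubles the cache.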

Attention Mechanism Variants and VRAM

Different attention mechanisms have varying impacts on VRAM, primarily affecting the KV cache size ( $N_{kv}$ ) and other intermediate activation sizes. Let $N_q$ be the number of query heads in the model.

Multi-Head Attention (MHA): Standard attention where each attention head has its own Query, Key, and Value projections.
- In this configuration, $N_{kv} = N_q$ .
Multi-Query Attention (MQA): All query heads share a single Key and Value head. This drastically reduces the size of the KV cache.
- Here, $N_{kv} = 1$ .
- The KV cache VRAM becomes $2 \cdot L \cdot 1 \cdot D_{kv} \cdot S \cdot B \cdot C_b$ .
Grouped-Query Attention (GQA): Query heads are divided into groups of $N_g$ heads each, and each group shares a single Key/Value head. This offers a balance between MHA and MQA.
- The number of Key/Value heads $N_{kv}$ satisfies $1 < N_{kv} < N_q$ .
- With $N_g$ query heads per group, $N_{kv} = N_q / N_g$ . For example, if $N_q = 32$ and each group holds $N_g=4$ query heads, then $N_{kv} = 8$ .
FlashAttention / FlashAttention-2: These are I/O-aware attention algorithms that compute attention without materializing the full ( $S \times S$ ) attention matrix in GPU HBM (High Bandwidth Memory). This reduces VRAM usage for that specific large intermediate matrix (part of general $VRAM_{activations}$ ) and memory access overhead, often leading to faster execution. It does not directly reduce the size of the KV cache itself, but reduces other activation memory.
Sparse Attention Mechanisms (e.g., Longformer, BigBird): These reduce the quadratic complexity of attention ( $O(S^2)$ ) to $O(S \log S)$ or linear, targeting longer sequences. This reduces VRAM for attention pattern related activations, but their KV cache may still be dense depending on implementation.

The choice of attention mechanism is an architectural detail of the LLM and significantly influences VRAM.
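The effect of $N_{kv}$ is easiest to see as KV cache bytes per token. The sketch below compares MHA, GQA, and MQA for a hypothetical 32-layer model with 32 query heads and a head dimension of 128; only the GQA row corresponds to the Llama 3 8B configuration used elsewhere in this guide.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added for each token of each sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical variants of the same model, differing only in KV head count.
VARIANTS = {"MHA": 32, "GQA": 8, "MQA": 1}

if __name__ == "__main__":
    for name, n_kv in VARIANTS.items():
        per_token = kv_bytes_per_token(32, n_kv, 128)
        tokens_per_gb = 10**9 // per_token
        print(f"{name}: {per_token} bytes/token, ~{tokens_per_gb} tokens per GB")
```

With FP16 values, the MHA variant stores 32 times as much per token as the MQA variant, which is exactly the ratio of their KV head counts.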

Temporary Buffers & Workspace

Deep learning frameworks (PyTorch, TensorFlow) and CUDA kernels often allocate temporary memory for intermediate steps, fused operations, or communication (in multi-GPU contexts). This is hard to predict precisely but usually constitutes a smaller part (e.g., 1-2 GB, but variable) of total VRAM. Optimized backends or compilation techniques (e.g., torch.compile in PyTorch, TensorRT) can sometimes reduce peak temporary memory by fusing operations more effectively. It's prudent to include a buffer for this.

Input Data

The batch of tokenized input IDs also resides in VRAM, but its size is generally minor compared to parameters, activations, or optimizer states.

Calculating VRAM for Inference

For inference, main VRAM contributors are Model Parameters and Activations (including the KV Cache).

Total Inference VRAM ≈ VRAM_params + VRAM_activations + VRAM_kv_cache + VRAM_overhead

Example: Llama 3 8B (FP16) Inference, GQA, SeqLen 2048

Model Parameters: 8B params * 2 bytes/param = 16 GB
Activations & KV Cache: Highly dependent on sequence length, batch size, and architecture. For a batch size of 4 and sequence length of 2048:
- Llama 3 8B uses GQA with $L=32$ layers, $N_{kv}=8$ Key/Value heads, $D_{kv}=128$ head dimension.
- KV Cache (FP16, $C_b=2$ ): $2 \cdot 32 \cdot 8 \cdot 128 \cdot 2048 \cdot 4 \cdot 2 \text{ bytes} \approx 1.07 \text{ GB}$ .
- Other activations might add a few more GB.
Overhead: Framework, CUDA kernels. Estimate 1-2 GB.

Estimated Total (with Llama 3 8B GQA): 16 GB (Params) + ~2-5 GB (Activations incl. 1.07GB KV Cache) + 1-2 GB (Overhead) ≈ 19-23 GB

Note: This total is an estimate. Actual usage should be monitored. Architectural details like GQA significantly matter.
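The same arithmetic can be kept as a low/high range instead of a single number. In this sketch the "other activations" range of 1 to 4 GB is an assumption chosen so that activations plus the 1.07 GB KV cache land in the ~2-5 GB band quoted above; the function name is illustrative.

```python
from typing import Tuple


def inference_vram_range_gb(params_gb: float, kv_cache_gb: float,
                            other_activations_gb: Tuple[float, float],
                            overhead_gb: Tuple[float, float]) -> Tuple[float, float]:
    """Return (low, high) total inference VRAM in GB."""
    fixed = params_gb + kv_cache_gb
    low = fixed + other_activations_gb[0] + overhead_gb[0]
    high = fixed + other_activations_gb[1] + overhead_gb[1]
    return low, high


if __name__ == "__main__":
    low, high = inference_vram_range_gb(
        params_gb=16.0,                   # 8B parameters in FP16
        kv_cache_gb=1.07,                 # batch 4, sequence length 2048
        other_activations_gb=(1.0, 4.0),  # assumed range
        overhead_gb=(1.0, 2.0),
    )
    print(f"Estimated inference VRAM: {low:.1f} to {high:.1f} GB")
```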

Figure: Main factors contributing to VRAM usage during LLM inference, including attention mechanism details. Activation calculation is approximate.

Here is one way to get the parameter count, using Hugging Face transformers:

from transformers import AutoConfig, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"
config = AutoConfig.from_pretrained(model_name)

# Most configs (including Llama) do not expose a parameter count,
# so this lookup usually returns None and the fallback below runs.
num_params = getattr(config, "num_parameters", None)

# Fallback: load the model and count (requires enough CPU RAM for the weights)
if num_params is None:
    print("Parameter count not in config, loading model to count...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name, low_cpu_mem_usage=True
    )
    # Count every parameter: frozen weights occupy VRAM too
    num_params = sum(p.numel() for p in model.parameters())
    del model  # Free up memory

bytes_per_param = 2  # For FP16
# Decimal GB (1e9 bytes), matching the worked examples in the text
vram_params_gb = (num_params * bytes_per_param) / 1e9

print(f"Model: {model_name}")
print(f"Parameters: {num_params / 1e9:.1f}B")
print(f"Est. Param VRAM (FP16): {vram_params_gb:.2f} GB")

(Note: Loading the model directly requires sufficient CPU RAM. Using low_cpu_mem_usage=True can help. Parameter count from config can sometimes be an estimate.)

Alternatively, parameter counts are often found on the model's page or documentation. The most reliable method is often examining the model source code (e.g., Llama 3).

Calculating VRAM for Fine-Tuning

Fine-tuning demands substantially more VRAM than inference, as it stores gradients and optimizer states alongside parameters and activations.

Total Training VRAM ≈ VRAM_params + VRAM_gradients + VRAM_optimizer + VRAM_activations + VRAM_overhead

Full Fine-Tuning

Here, all model parameters are updated.

Example: Llama 3 8B (FP16), AdamW (FP32 states)

Model Parameters (FP16): 8B params * 2 bytes/param = 16 GB
Gradients (FP16): 8B params * 2 bytes/param = 16 GB
Optimizer States (AdamW, FP32): 2 states/param * 8B params * 4 bytes/state = 64 GB
Activations: Depends heavily on batch size/sequence length and use of techniques like FlashAttention. Could be 10-30 GB or more (highly approximate).
Overhead: Estimate 1-2 GB.

Estimated Total: 16 + 16 + 64 + (10 to 30) + (1 to 2) ≈ 107 - 128 GB

Note: This calculation highlights significant memory needs and relies on estimations for activations and overhead.

This shows why full fine-tuning of large models often requires multiple high-VRAM GPUs.
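A short sketch of that conclusion: total the full fine-tuning estimate and divide by the VRAM of one card. The GPU-count step assumes memory can be split perfectly evenly across devices with no multi-GPU overhead, which is optimistic. Function names are illustrative.

```python
import math
from typing import Tuple


def full_finetune_range_gb(num_params: float,
                           bytes_per_param: float = 12.0,
                           activations_gb: Tuple[float, float] = (10.0, 30.0),
                           overhead_gb: Tuple[float, float] = (1.0, 2.0)) -> Tuple[float, float]:
    """(low, high) GB; 12 bytes/param = FP16 weights + FP16 grads + FP32 AdamW."""
    static_gb = num_params * bytes_per_param / 1e9
    return (static_gb + activations_gb[0] + overhead_gb[0],
            static_gb + activations_gb[1] + overhead_gb[1])


def min_gpus(total_gb: float, gpu_vram_gb: float) -> int:
    """Lower bound on GPU count, assuming perfect sharding (an idealization)."""
    return math.ceil(total_gb / gpu_vram_gb)


if __name__ == "__main__":
    low, high = full_finetune_range_gb(8e9)
    print(f"Full fine-tuning: {low:.0f} to {high:.0f} GB")
    print(f"80 GB GPUs needed (at least): {min_gpus(high, 80.0)}")
    print(f"24 GB GPUs needed (at least): {min_gpus(high, 24.0)}")
```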

Parameter-Efficient Fine-Tuning (PEFT)

Techniques like LoRA (Low-Rank Adaptation) reduce VRAM needs by freezing base model parameters and training only small adapter layers.

LoRA: Only LoRA adapter parameters (millions, not billions) need gradients and optimizer states. The base model (frozen) contributes its parameter size (often in FP16, BF16, or quantized formats via QLoRA).
QLoRA: Further reduces memory by loading the base model in a quantized format (e.g., 4-bit NF4) while training LoRA adapters (often in BF16).

Example: Llama 3 8B with LoRA (Rank=8, Alpha=16)

Base Model Parameters (Frozen, e.g., FP16): 16 GB
LoRA Parameters (Trainable, BF16): Small, e.g., ~10-50 Million parameters. Say 20M params * 2 bytes/param ≈ 40 MB.
LoRA Gradients (BF16): 20M params * 2 bytes/param ≈ 40 MB.
LoRA Optimizer States (AdamW, FP32): 2 * 20M params * 4 bytes/state ≈ 160 MB.
Activations: Still significant, similar to inference but for the full model during forward/backward passes through adapters. Estimate 10-30 GB (FlashAttention can help reduce this).
Overhead: 1-2 GB.

Estimated Total (LoRA): 16 GB (Base) + ~0.24 GB (LoRA components) + (10 to 30) GB (Activations) + (1 to 2) GB (Overhead) ≈ 27 - 48 GB

Estimated Total (QLoRA, 4-bit base): Base model params ≈ 8B * 0.5 bytes/param = 4 GB. Total ≈ 4 + ~0.24 + (10 to 30) + (1 to 2) ≈ 15 - 36 GB

Note: Activation and overhead figures are estimates. LoRA parameter count estimate is simplified.

This reduction makes fine-tuning accessible on consumer or prosumer GPUs.
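The LoRA and QLoRA totals above differ only in the bytes per parameter of the frozen base model, which the following sketch makes explicit. It reuses the 20M adapter-parameter figure and the activation and overhead ranges from the example; the function name is illustrative.

```python
from typing import Tuple


def lora_finetune_range_gb(base_params: float, base_bytes_per_param: float,
                           adapter_params: float,
                           activations_gb: Tuple[float, float] = (10.0, 30.0),
                           overhead_gb: Tuple[float, float] = (1.0, 2.0)) -> Tuple[float, float]:
    """(low, high) GB for LoRA-style fine-tuning with a frozen base model."""
    base_gb = base_params * base_bytes_per_param / 1e9
    # Adapter cost per parameter: BF16 weight (2) + BF16 gradient (2)
    # + two FP32 AdamW states (8) = 12 bytes.
    adapter_gb = adapter_params * 12.0 / 1e9
    fixed = base_gb + adapter_gb
    return (fixed + activations_gb[0] + overhead_gb[0],
            fixed + activations_gb[1] + overhead_gb[1])


if __name__ == "__main__":
    for label, base_bytes in (("LoRA, FP16 base", 2.0), ("QLoRA, 4-bit base", 0.5)):
        low, high = lora_finetune_range_gb(8e9, base_bytes, 20e6)
        print(f"{label}: {low:.1f} to {high:.1f} GB")
```

In both cases the adapters contribute only about 0.24 GB, so the base model precision and the activations dominate the total.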

# Rough estimate of LoRA parameter count
def estimate_lora_params(model_config, rank=8,
                         target_modules=('q_proj', 'v_proj')):
    hidden_size = getattr(model_config, 'hidden_size', 0)
    num_layers = getattr(model_config, 'num_hidden_layers', 0)

    # LoRA replaces W with W + BA, where A is [rank, in_features]
    # and B is [out_features, rank].
    # Simplification: every target module is treated as a square
    # hidden_size x hidden_size projection. That holds for q_proj, but
    # overestimates k_proj/v_proj in GQA models (their output width is
    # num_kv_heads * head_dim) and does not fit MLP layers, which
    # involve intermediate_size.
    params_per_module = (rank * hidden_size) + (hidden_size * rank)
    params_per_layer = len(target_modules) * params_per_module

    return num_layers * params_per_layer

# Example for Llama 3 8B-like config values
class MockConfig:  # Replace with actual loaded config object
    hidden_size = 4096  # From Llama 3 8B
    num_hidden_layers = 32  # From Llama 3 8B
    # intermediate_size = 14336  # For MLP layers like gate/up/down_proj

config = MockConfig()
# Example: Targeting only Q and V projections in attention layers
l_params_qv = estimate_lora_params(config, rank=8,
                                   target_modules=('q_proj', 'v_proj'))
print(f"Est. LoRA Params (r=8, Q/V only): {l_params_qv / 1e6:.2f}M")

(Note: Actual LoRA parameters depend heavily on which specific layers are targeted (e.g., attention Q/K/V/O, MLP layers) and their dimensions. This function assumes all targeted modules are like Q/V projections in attention. Targeting MLP layers would require using intermediate_size for some dimensions.)
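A per-module variant avoids the square-projection simplification by giving each target its own input and output width. The shapes below are derived from values quoted in this guide for Llama 3 8B (hidden size 4096, intermediate size 14336, 8 KV heads of dimension 128); the module names follow the usual Llama naming and the helper itself is illustrative.

```python
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 8 * 128, 32

# (in_features, out_features) for each linear layer in one decoder block.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(target_modules, rank: int = 8, num_layers: int = LAYERS) -> int:
    """LoRA adds A[rank, in] and B[out, rank], i.e. rank * (in + out) per module."""
    per_layer = sum(rank * sum(MODULE_SHAPES[name]) for name in target_modules)
    return num_layers * per_layer


if __name__ == "__main__":
    qv = lora_params(["q_proj", "v_proj"])
    every = lora_params(list(MODULE_SHAPES))
    print(f"Q/V only:          {qv / 1e6:.2f}M")
    print(f"All linear layers: {every / 1e6:.2f}M")
```

Under these shapes, targeting every linear layer at rank 8 gives roughly 21M adapter parameters, which is in line with the 20M figure used in the LoRA example.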

Figure: Estimated VRAM comparison for different fine-tuning methods on an 8B parameter model. Activation size is illustrative and highly dependent on batch size, sequence length, and attention optimizations. Adapter/Gradient/Optimizer sizes for LoRA/QLoRA are approximate.

Precision and Quantization Impact

Choosing the right numerical format is important for managing VRAM. Quantization can apply to weights, activations, and the KV cache.

| Precision | Bytes per Parameter/Value | Typical Use Case | Notes |
|---|---|---|---|
| FP32 | 4 | Older models, some science tasks | High precision, highest VRAM usage |
| FP16 | 2 | Common for training & inference | Good balance, potential overflow issues |
| BF16 | 2 | Common for training & inference | Wider range than FP16, less precision, good for training on newer GPUs |
| INT8 | 1 | Quantized weights/activations/KV cache | Significant VRAM saving, requires calibration |
| FP8 | 1 | Emerging for weights/activations/KV cache | Similar savings to INT8, hardware dependent (e.g., H100+) |
| INT4 | 0.5 | Aggressive weight quantization (QLoRA base) | Max VRAM saving for weights, potential accuracy drop |

Techniques like GPTQ, AWQ, or bitsandbytes (used in QLoRA) allow loading model weights with INT8 or INT4 precision. Quantizing activations or the KV cache (e.g., to INT8 or FP8) provides further savings during runtime and is increasingly supported by inference frameworks.

Multi-GPU Overhead

Using multiple GPUs ( $N_{gpus}$ ) introduces overhead compared to a single GPU, meaning performance and memory do not scale perfectly linearly. This arises from inter-GPU communication and synchronization needs.

Memory Overhead: Each GPU requires extra VRAM for communication buffers and replicated non-sharded states (depending on the distribution strategy like DeepSpeed ZeRO stage). A heuristic model suggests this overhead grows with the number of GPUs.
- Example heuristic: Additional overhead ( $VRAM_{overhead}$ ) might scale based on single-GPU base memory ( $VRAM_{base\_single}$ ) and $N_{gpus}$ : $VRAM_{overhead} \approx VRAM_{base\_single} \times 0.05 \times \sqrt{N_{gpus} - 1}$
- Limitation Note: This 5% factor and square root scaling are highly simplified assumptions. Actual memory overhead depends heavily on parallelism strategy (Data, Tensor, Pipeline Parallel, ZeRO stage), interconnects, and framework. This formula is a rough indication.
Performance Scaling: Doubling GPUs rarely doubles speed. Communication latency, synchronization, and load imbalances reduce speedup. This can be modeled with an efficiency factor per additional GPU.
- Let $Speed_{single}$ be performance on one GPU.
- Let $Efficiency$ be the scaling efficiency per additional GPU (e.g., 0.85).
- Effective throughput on $N_{gpus}$ can be estimated: $Speed_{effective} \approx Speed_{single} \times (1 + (N_{gpus} - 1) \times Efficiency)$ . The speedup over a single GPU is the factor in parentheses.
- Example with 85% efficiency on 4 GPUs: $Speed_{effective} \approx Speed_{single} \times (1 + 3 \times 0.85) = 3.55 \times Speed_{single}$
Why ~85% Efficiency?
- This 0.85 value is a practical rule of thumb, seen in empirical tests with optimized distributed setups (e.g., PyTorch DDP/FSDP, DeepSpeed) on hardware with high-bandwidth interconnects (like NVLink).
- It reflects performance loss due to:
  1. Communication Cost: Time for data transfer between GPUs.
  2. Synchronization Cost: GPUs waiting for others.
  3. Workload Imbalance: Processing time variations.
- Limitation Note: This 85% is not fixed. It can be higher (>90%) for compute-bound tasks with excellent interconnects and optimized libraries (like NCCL), or much lower for communication-bound tasks, slower interconnects, or inefficient implementations.
- Use as an initial estimate. Always profile your specific workload and hardware for accurate scaling efficiency.

Understanding these overheads is important for realistic expectations and efficient resource allocation in multi-GPU setups.
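Both heuristics from this section can be written down directly. The sketch below uses the 5% factor, square-root scaling, and 0.85 efficiency exactly as stated above, so it inherits all of their limitations; the function names are illustrative.

```python
import math


def multi_gpu_overhead_gb(base_single_gb: float, n_gpus: int,
                          factor: float = 0.05) -> float:
    """Heuristic extra VRAM: base * factor * sqrt(n_gpus - 1)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return base_single_gb * factor * math.sqrt(n_gpus - 1)


def scaling_factor(n_gpus: int, efficiency: float = 0.85) -> float:
    """Heuristic throughput multiple relative to one GPU."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return 1 + (n_gpus - 1) * efficiency


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} GPU(s): +{multi_gpu_overhead_gb(20.0, n):.2f} GB overhead, "
              f"{scaling_factor(n):.2f}x throughput")
```

With one GPU both functions reduce to the single-GPU case (zero overhead, 1x throughput), which is a quick sanity check on the formulas.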

Tools and Techniques for Estimation & Monitoring

While formulas give estimates, practical tools help verify VRAM usage.

Hugging Face Hub: Model cards often list parameter counts and architectural details (like attention type).
accelerate Library: Includes infer_auto_device_map for estimating model distribution across devices.
bitsandbytes Library: For 4-bit/8-bit weight quantization (QLoRA) and 8-bit optimizers.
nvidia-smi: Standard tool for real-time GPU monitoring.
```
watch -n 1 nvidia-smi
```
nvtop / gpustat: Interactive or concise GPU monitoring tools.

PyTorch Memory Utilities:

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"--- Device {i}: {torch.cuda.get_device_name(i)} ---")
        # Detailed allocator summary for this device
        print(torch.cuda.memory_summary(device=i))

        # Peak memory since program start or the last reset (call after
        # the workload). These functions report one device at a time,
        # so each GPU is queried explicitly.
        print(f"Max VRAM allocated: "
              f"{torch.cuda.max_memory_allocated(i) / 1024**3:.2f} GiB")
        print(f"Max VRAM reserved: "
              f"{torch.cuda.max_memory_reserved(i) / 1024**3:.2f} GiB")

    # torch.cuda.reset_peak_memory_stats()  # For sectional profiling

Other Tips & Considerations

Add a Buffer: Always add a safety margin (e.g., 10-20%) to calculated estimates for framework overhead and fragmentation.
Gradient Accumulation: Accumulate gradients over several smaller "micro-batches" before an optimizer step. This simulates a larger effective batch size for weight updates, while $VRAM_{activations}$ scales with the micro-batch size, trading compute time for memory.
Activation Checkpointing (Gradient Checkpointing): Recomputes activations during the backward pass instead of storing all. Reduces $VRAM_{activations}$ at the cost of ~20-30% more computation. Very effective for training large models or with long sequences.
Model Parallelism: For models too large for one GPU:
- Tensor Parallelism: Splits layers (weight matrices) across GPUs. Needs high inter-GPU bandwidth.
- Pipeline Parallelism: Assigns sequential layers to different GPUs.
- ZeRO (Zero Redundancy Optimizer): Partitions optimizer states, gradients, and parameters across GPUs.
CPU Offloading: Moves data (optimizer states, gradients, parameters) to CPU RAM. Reduces VRAM but impacts performance due to slower CPU-GPU transfers.
Dynamic Sequence Lengths: Handling dynamic sequence lengths (e.g., via padding or bucketing) can affect average VRAM usage compared to always using max sequence length, but peak usage will still be determined by the longest sequence in a batch.
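The buffer and gradient accumulation tips combine naturally: reserve the safety margin first, see how many sequences fit, then accumulate up to the target batch size. This is a sketch under the assumption that activation memory grows linearly with micro-batch size; the 24 GB card, 5.24 GB of fixed memory, and 2.5 GB per sequence are hypothetical inputs, and the function names are illustrative.

```python
import math


def max_micro_batch(gpu_vram_gb: float, fixed_gb: float,
                    activation_gb_per_sample: float,
                    margin: float = 0.15) -> int:
    """Largest micro-batch whose estimate, plus the margin, fits in VRAM."""
    usable_gb = gpu_vram_gb / (1.0 + margin) - fixed_gb
    return max(0, math.floor(usable_gb / activation_gb_per_sample))


def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """Micro-batches to accumulate before each optimizer step."""
    return math.ceil(target_batch / micro_batch)


if __name__ == "__main__":
    micro = max_micro_batch(24.0, fixed_gb=5.24, activation_gb_per_sample=2.5)
    steps = accumulation_steps(32, micro)
    print(f"micro-batch {micro}, {steps} accumulation steps, "
          f"effective batch {micro * steps}")
```

A result of 0 from `max_micro_batch` means the fixed components alone exceed the budget, so a smaller precision, offloading, or more GPUs would be needed.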

Conclusion and Limitations

Calculating VRAM for LLMs involves accounting for parameters, optimizer states, gradients, activations (including attention mechanism specifics like KV cache and its potential quantization), and overhead. VRAM needs differ based on task (inference, fine-tuning, PEFT) and settings (precision, batch size, sequence length, multi-GPU setup, and architectural choices like GQA or FlashAttention).

These calculations provide a baseline; actual VRAM use is influenced by framework specifics, CUDA behavior, memory fragmentation, and other implementation details. These are not fully captured by basic formulas.