Best Local LLMs for Every NVIDIA RTX 40 Series GPU

By Ryan A. on Apr 18, 2025

Guest Author

Running Large Language Models (LLMs) locally on consumer hardware is increasingly feasible, offering benefits like enhanced privacy, cost savings, and customization options. NVIDIA's RTX 40 series GPUs, with their significant VRAM and compute capabilities, provide a strong platform.

Selecting the right LLM for your specific card requires understanding the interplay between GPU specifications, model characteristics, and the available software frameworks. Success depends on matching the model's requirements, particularly its VRAM footprint, to the capabilities of your hardware.

Understanding RTX 40 Series GPUs for LLMs

The suitability of these GPUs for running LLMs primarily hinges on their Video Random Access Memory (VRAM). Larger models, or models running with less compression (quantization), demand more VRAM. Other factors like CUDA core count, Tensor Core performance (especially for lower precision formats like FP16/INT8), and memory bandwidth influence inference speed (tokens per second).

Here's a quick overview of the VRAM available across the RTX 40 series desktop lineup:

| GPU Model         | VRAM (GB) | Typical LLM Use Case Potential               |
|-------------------|-----------|----------------------------------------------|
| RTX 4090          | 24        | Large models (70B+ Q*), high performance     |
| RTX 4080 / Super  | 16        | Medium-large models (30B-70B Q), good perf.  |
| RTX 4070 Ti Super | 16        | Medium-large models (30B-70B Q), good perf.  |
| RTX 4070 Ti       | 12        | Medium models (13B-30B Q), solid performance |
| RTX 4070 / Super  | 12        | Medium models (13B-30B Q), solid performance |
| RTX 4060 Ti       | 16        | Medium-large models (30B-70B Q), value perf. |
| RTX 4060 Ti       | 8         | Small-medium models (7B-13B Q), entry-level  |
| RTX 4060          | 8         | Small models (3B-7B Q), basic use            |

* Q denotes quantized models. Performance varies significantly within tiers based on specific model, quantization, and framework.

Ensure you have the latest NVIDIA drivers installed. They often include performance improvements and compatibility fixes relevant to CUDA and LLM workloads.
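
As a quick sanity check before downloading any models, a short Python snippet (assuming a CUDA-enabled PyTorch build is installed) can confirm that the GPU and driver stack are visible and report the total VRAM:

    import torch

    # Quick check that the driver/CUDA stack works and how much VRAM is available.
    # Assumes a CUDA-enabled PyTorch build is installed.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}")
        print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GB")
        print(f"CUDA version (PyTorch build): {torch.version.cuda}")
    else:
        print("CUDA not available -- check the NVIDIA driver and PyTorch installation.")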

Factors for Choosing a Local LLM

Selecting an LLM involves balancing several technical considerations:

Model Size (Parameters)

LLMs are often categorized by their number of parameters (e.g., 7 billion, 13B, 70B). Larger models generally exhibit better reasoning and knowledge capabilities but require significantly more VRAM and compute power. The VRAM requirement scales roughly linearly with the number of parameters and depends on the precision used (e.g., FP16, INT8, INT4).
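
As a rough rule of thumb, the weight memory is simply the parameter count multiplied by the bytes stored per parameter. The sketch below illustrates that scaling; it is an approximation and ignores the KV cache, activations, and framework overhead, which add more on top:

    # Rough weight-memory estimate: parameters x bytes per parameter.
    # Ignores KV cache, activations, and framework overhead (add ~1-2 GB or more).
    def estimate_weight_vram_gb(n_params_billion: float, bits_per_param: float) -> float:
        return n_params_billion * 1e9 * (bits_per_param / 8) / 1024**3

    for params in (7, 13, 70):
        print(f"{params}B @ FP16: {estimate_weight_vram_gb(params, 16):.1f} GB, "
              f"@ 4-bit: {estimate_weight_vram_gb(params, 4):.1f} GB")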

Quantization

Quantization reduces the memory footprint of LLMs, and sometimes accelerates inference, by representing the model's weights (and sometimes activations) with lower-precision data types such as 8-bit (INT8) or 4-bit (INT4) integers instead of the standard 16-bit floating point (FP16 or BF16). Common quantization formats include:

  • GGUF (GPT-Generated Unified Format): Used extensively by llama.cpp, it bundles the model and quantization information into a single file. It supports various quantization methods (e.g., Q4_K_M, Q5_K_S, Q8_0) optimized for different quality/performance trade-offs.
  • GPTQ (Generative Pre-trained Transformer Quantization): An early popular post-training quantization method, often requiring specific library support (like AutoGPTQ).
  • AWQ (Activation-aware Weight Quantization): Another post-training method that aims to preserve model quality better during quantization by protecting the weights most sensitive to activations.
  • BitsAndBytes: A library integrated with Hugging Face Transformers that enables on-the-fly quantization (e.g., loading models directly in 8-bit or 4-bit).

Quantization significantly lowers VRAM usage, making it possible to run larger models on GPUs with limited memory. For example, a 7B-parameter model needs roughly 14GB in FP16 but only about 4-5GB with 4-bit quantization (such as Q4_K_M).
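
To make that trade-off concrete, the sketch below estimates weight sizes for a 7B model at a few common GGUF quantization levels. The bits-per-weight figures are approximate (K-quants store extra scaling metadata, so Q4_K_M is closer to ~4.8 effective bits per weight than 4):

    # Approximate effective bits per weight for common GGUF quantization types.
    # Values are rough; actual file sizes vary slightly by model architecture.
    BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

    n_params = 7e9  # a "7B" model
    for quant, bpw in BITS_PER_WEIGHT.items():
        size_gb = n_params * bpw / 8 / 1024**3
        print(f"{quant:>7}: ~{size_gb:.1f} GB of weights")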

Inference Speed

Inference speed, measured in tokens per second (tok/s), determines how quickly the model generates text. It's affected by the GPU's compute power (CUDA/Tensor cores), memory bandwidth, model size, quantization level, batch size, and the efficiency of the inference framework being used.
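
For single-stream (batch size 1) generation, decoding is usually memory-bandwidth bound: each new token requires reading essentially all of the model's weights. A back-of-envelope upper bound on tokens per second is therefore memory bandwidth divided by the model's in-memory size, as sketched below; real-world numbers come in lower because of KV cache reads, kernel overhead, and imperfect bandwidth utilization:

    # Back-of-envelope ceiling for batch-1 decode speed:
    # tokens/s <= memory bandwidth / bytes read per token (~ model size in memory).
    def rough_max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    # Example: an RTX 4090 (~1000 GB/s) running an 8B model quantized to ~5 GB.
    print(f"~{rough_max_tokens_per_sec(1000, 5):.0f} tokens/s (theoretical ceiling)")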

Task Suitability

Models are often fine-tuned for specific tasks. Some are general-purpose chat models (e.g., Llama 3 Instruct), while others might specialize in coding (e.g., Code Llama), instruction following, or specific knowledge domains. Choose a model whose training aligns with your intended application.

Licensing

LLMs have different licenses that dictate how they can be used, modified, and distributed. Common licenses include Apache 2.0 (permissive), MIT, and specific community licenses like the Llama 2 & 3 Community License, which might restrict commercial use for large companies. Always check the model's license before using it, especially for commercial applications.

Diagram: the relationship between GPU VRAM, model size, quantization, and local LLM execution.

Best LLMs for Each GPU

These recommendations focus on popular, high-performing models and assume the use of quantization (primarily 4-bit or 5-bit GGUF variants like Q4_K_M or Q5_K_M) unless otherwise stated. Performance is qualitative.

RTX 4090 (24GB VRAM)

With the most VRAM in the consumer lineup, the 4090 can handle the largest models with reasonable quantization or smaller models with higher precision.

  • Llama 3 70B Instruct (Quantized): A top-performing open model. The common 4-bit and 5-bit GGUF variants (Q4_K_M, Q5_K_M) are larger than 24GB, so part of the model must be offloaded to CPU RAM; fitting it entirely on the GPU requires very aggressive quantization (around 2-3 bits per weight). Expect usable but modest generation speeds.
  • Mixtral 8x7B (Quantized): A high-quality Mixture-of-Experts (MoE) model. It requires significant VRAM even when quantized (~25-30GB for Q4_K_M), so it slightly exceeds 24GB; a lower-bit variant or offloading a few layers to CPU RAM makes it workable. Performance is generally good once most layers sit on the GPU.
  • Command R+ (Quantized): Cohere's powerful 104B-parameter model. Even at 4-bit its weights far exceed 24GB (roughly 60GB), so it runs only with heavy CPU offloading and correspondingly slow generation; treat it as a stretch goal rather than a daily driver on a single 4090, despite its strong reasoning capabilities.
  • DeepSeek-R1-Distill-Qwen-14B (Quantized): Runs well on the 4090 with 4-bit quantization. The card can also handle DeepSeek-R1-Distill-Qwen-32B (quantized), at a reduced context length and slower generation speed.
  • Fine-tuned 30B+ Models: Ideal for running specialized, fine-tuned models based on architectures like Llama or Mistral at higher quality settings.

RTX 4080 / Super & RTX 4070 Ti Super (16GB VRAM)

These GPUs offer a good balance and are capable of running large models with quantization.

  • Mixtral 8x7B (Quantized): Q4_K_M needs roughly 25-30GB in total, so on a 16GB card a substantial share of layers must be offloaded to CPU RAM (or a lower-bit variant used). It remains usable, but expect noticeably slower generation than a fully GPU-resident model.
  • Llama 3 8B Instruct (Quantized): Runs exceptionally well. Full FP16 weights (~16GB) sit right at the card's limit, so light quantization (Q8_0 or Q6_K) is the practical choice and costs little quality.
  • Mistral 7B / Zephyr / OpenHermes (Unquantized/Quantized): Smaller, more efficient models that run very fast.
  • Phi-3 Medium (Quantized): Microsoft's capable small model runs well.
  • DeepSeek-R1-Distill-Qwen-7B (Quantized): Runs comfortably at 4-bit (~4.5GB VRAM). A great mix of performance and size.
  • Fine-tuned 13B Models (Quantized): Run comfortably at 4-bit or 5-bit quantization.

RTX 4070 / Super & RTX 4070 Ti (12GB VRAM)

These 12GB cards are competent mid-range options, suitable for medium-sized models with quantization.

  • Llama 3 8B Instruct (Unquantized/Quantized): Runs very well. FP16 (~16GB) won't fit entirely, but high-quality quantization (Q5_K_M, Q8_0) fits easily with great speed.
  • Mistral 7B Instruct (Unquantized/Quantized): Similar to Llama 3 8B, it runs extremely well and easily fits even with less aggressive quantization.
  • Phi-3 Medium (Quantized): Runs comfortably with 4-bit quantization (~7GB VRAM needed).
  • Mixtral 8x7B (Heavily Quantized): Possible with aggressive 3-bit or lower 4-bit quantization (e.g., Q3_K_M, Q4_0), potentially requiring some CPU offloading. The performance will be noticeably slower.
  • DeepSeek-R1-Distill-Qwen-7B (Quantized): Also runs well, especially with Q4 quantization.
  • Fine-tuned 7B/8B Models: Excellent platform for running specialized 7B/8B models.

RTX 4060 Ti (16GB VRAM)

This card occupies an unusual spot in the lineup: it offers the VRAM of a 4080 but with considerably less compute power and memory bandwidth. It can fit similar models but will run them more slowly.

  • Model Fit: Can technically load models similar to the 16GB 4070 Ti Super / 4080 (e.g., Mixtral Q4, potentially Llama 3 70B with heavy quantization/offloading).
  • Performance: Expect significantly lower tokens/second than the higher-tier 16GB cards due to fewer CUDA cores and narrower memory bus.
  • Recommendations: Llama 3 8B (excellent), Phi-3 Medium (good), Mixtral 8x7B (usable with Q4), DeepSeek-R1-Distill-Qwen-7B (very usable), Fine-tuned 7B/13B models.

RTX 4060 Ti (8GB VRAM) & RTX 4060 (8GB VRAM)

These entry-level cards are the most constrained by VRAM. Focus on smaller models or heavily quantized versions of medium models.

  • Llama 3 8B Instruct (Quantized): Runs well with 4-bit quantization (Q4_K_M fits comfortably within 8GB).
  • Mistral 7B Instruct (Quantized): Similar to Llama 3 8B, runs well with 4-bit quantization.
  • Phi-3 Mini / Small (Quantized): These smaller models (3.8B and 7B parameters, respectively) are ideal for 8GB cards and offer good performance and quality for their size.
  • DeepSeek-R1-Distill-Qwen-1.5B (Quantized): Light and fast, great on 8GB cards.
  • Other < 7B Models: Models like StableLM 3B and TinyLlama 1.1B run very fast.
  • Larger Models (e.g., 13B): Possible only with very aggressive quantization (e.g., Q2_K, Q3_K variants), expect slower performance and potential quality degradation.

Tools and Frameworks for Running Local LLMs

Several tools simplify the process of downloading and running LLMs locally:

  • Ollama: Provides a simple command-line interface and a local server (see the API sketch after this list). Easy setup and model management.
    # Pull and run Llama 3 8B
    ollama run llama3:8b
    
    # List downloaded models
    ollama list
    
  • LM Studio / Jan: User-friendly graphical interfaces (GUIs) for downloading and interacting with various LLMs (often using llama.cpp in the backend).
  • Text Generation WebUI (Oobabooga): A comprehensive Gradio-based web interface supporting various models, loaders (Transformers, ExLlamaV2, Llama.cpp), and features like fine-tuning and chat modes.
  • Llama.cpp: A high-performance inference engine written in C++. Primarily uses the GGUF format and supports CPU and GPU (via CUDA/Metal) acceleration. It requires compilation but offers fine-grained control.
    # Example: run Llama 3 8B GGUF with GPU offload
    # (the main binary is named llama-cli in recent llama.cpp builds)
    ./main -m ./models/llama-3-8b-instruct.Q4_K_M.gguf \
           -p "User: What is CUDA? Assistant:" \
           -n 512 --color -ngl 33  # -ngl: number of layers offloaded to the GPU
    
  • Hugging Face Transformers: The standard Python library for NLP. Can load and run many models, often combined with accelerate for multi-GPU/CPU offloading and bitsandbytes for quantization.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Load with 4-bit quantization via bitsandbytes
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # auto-distribute layers across available devices
    )

    # Basic generation (example)
    inputs = tokenizer("Explain quantization:", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    
  • vLLM / TensorRT-LLM: Specialized inference libraries focused on maximizing throughput and minimizing latency, particularly for serving scenarios. These often require model conversion steps and are targeted at advanced users prioritizing speed.
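
For the Ollama route mentioned above, the local server also exposes a simple HTTP API (on port 11434 by default), so other programs can query the model. A minimal sketch using only the Python standard library, assuming the server is running and llama3:8b has already been pulled:

    import json
    import urllib.request

    # Minimal call to a local Ollama server (default port 11434).
    # Assumes `ollama serve` is running and `ollama pull llama3:8b` has completed.
    payload = {
        "model": "llama3:8b",
        "prompt": "Explain GPU layer offloading in one paragraph.",
        "stream": False,  # return a single JSON object instead of a token stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])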

Benchmarking and Performance Tuning

Published benchmarks provide a starting point, but the best way to assess performance is to test models directly on your hardware and workload.

  • Benchmarking: Many tools include built-in benchmarks (e.g., llama.cpp includes perplexity calculation and speed tests). Simple Python scripts using time can measure generation speed for specific prompts; a minimal timing sketch follows this list.
  • Primary Metrics: Tokens per second (generation speed) and time-to-first-token (latency).
  • Tuning:
    • Quantization Level: Experiment with different GGUF quantizations (e.g., Q4_K_M vs. Q5_K_M vs. Q8_0) or bitsandbytes settings to balance VRAM, speed, and output quality.
    • GPU Layer Offloading (-ngl in Llama.cpp): For GGUF models, adjust the number of layers offloaded to the GPU. Maximize this within your VRAM limits for best speed. Start high and decrease if you encounter out-of-memory errors.
    • Batch Size: For frameworks supporting batching (like Transformers, vLLM), increasing the batch size improves throughput at the cost of additional VRAM.
    • Context Length: Longer context windows consume more VRAM (due to the KV cache). Adjust based on your needs and available memory.
    • Framework Choice: Different frameworks have varying performance characteristics. Llama.cpp, ExLlamaV2, and TensorRT-LLM are often among the fastest for inference.
    • Software Stack: Ensure CUDA Toolkit, cuDNN, PyTorch, and framework versions are compatible and up-to-date.
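
As mentioned in the benchmarking bullet above, a simple script is usually enough to measure generation speed on your own hardware. A minimal sketch using Hugging Face Transformers, assuming a model and tokenizer are already loaded as in the earlier example:

    import time

    # Crude tokens-per-second measurement for an already-loaded `model` and `tokenizer`
    # (e.g., the 4-bit Llama 3 8B from the Transformers example above).
    # Run the measurement once beforehand to warm up for more stable numbers.
    prompt = "Explain the difference between GGUF and GPTQ quantization."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=256)
    elapsed = time.perf_counter() - start

    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")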

Example: Running Llama 3 8B using Llama.cpp

This provides a concrete workflow:

  1. Install Llama.cpp: Clone the repository and build it with CUDA support (follow instructions on the official llama.cpp GitHub page).
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    # Enable CUDA support during compilation.
    # Older Makefile builds use LLAMA_CUBLAS=1; recent releases build with
    # CMake and the GGML_CUDA option instead -- adjust for your version.
    make LLAMA_CUBLAS=1
    
  2. Download a Model: Get a GGUF quantized version of Llama 3 8B Instruct. Q4_K_M is a good balance for 12GB VRAM.
    # Example download (check Hugging Face for official sources)
    mkdir models
    wget -P ./models "https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
    
  3. Run Inference: Execute the main binary, specifying the model, prompt, and GPU layers.
    ./main -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
       -p "User: Write a Python function for the Fibonacci sequence.\nAssistant:" \
       -n 256 --color --ctx-size 2048 -ngl 35
    
    • -m: Specifies the model file.
    • -p: The initial prompt.
    • -n: Maximum number of tokens to generate.
    • --color: Enables colored output.
    • --ctx-size: Sets the context window size (affects VRAM).
    • -ngl: Number of layers to offload to the GPU. For Llama 3 8B (33 layers total), -ngl 33 or higher offloads all layers if VRAM allows.
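
If you prefer to drive the same GGUF file from Python instead of the command line, the separate llama-cpp-python package (assumed here to be installed with CUDA support) wraps llama.cpp and exposes the same GPU-layer control:

    from llama_cpp import Llama

    # Load the same GGUF model via the llama-cpp-python bindings.
    # n_gpu_layers mirrors the -ngl flag; -1 offloads all layers if VRAM allows.
    llm = Llama(
        model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
        n_ctx=2048,
        n_gpu_layers=-1,
    )

    result = llm(
        "User: Write a Python function for the Fibonacci sequence.\nAssistant:",
        max_tokens=256,
    )
    print(result["choices"][0]["text"])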

Conclusion

NVIDIA's RTX 40 series GPUs remain a capable and widely available platform for running a wide range of LLMs locally. VRAM capacity is the most critical factor determining which models are feasible, with the 24GB RTX 4090 offering the most flexibility and the 8GB RTX 4060/4060 Ti providing an entry point for smaller or heavily quantized models.

Quantization techniques like GGUF and 4-bit loading via libraries are essential for fitting larger models into available VRAM. Choosing the right model means weighing its size, task suitability, and license against the VRAM and performance characteristics of your specific RTX 40 GPU.

By understanding these factors and utilizing the available tools and frameworks, technical professionals can effectively run powerful LLMs directly on their desktop hardware, gaining advantages in privacy, customization, and offline accessibility.

© 2025 ApX Machine Learning. All rights reserved.
