3 Common Myths About MoE LLM Efficiency for Local Setups

By Wei Ming T. on May 1, 2025

Mixture of Experts (MoE) models are gaining traction and now underpin many of the latest releases, such as Llama 4 and Qwen 3. They promise the power of massive models while needing less calculation for each piece of text (token) they process. That efficiency makes running these powerful models on your own computer an appealing prospect.

However, the advertised calculation savings often lead to misconceptions about how much memory (VRAM) these models need and how fast they actually run, especially when you try to run them quickly on local hardware.

Understanding how MoE models actually work helps figure out the real hardware needs and what performance to expect. Let's look at some common myths.

What Exactly is a Mixture of Experts (MoE) LLM?

An MoE setup is different from a regular 'dense' model. Instead of one big block of settings (parameters), it contains many separate, smaller 'expert' parts, usually all built with the same structure.

A 'router' part (like a traffic controller) directs the work for each input token. The router picks a few experts (for example, 2 out of 8) to process that specific token, and the outputs of the chosen experts are then combined into the final result.

The main idea is sparse activation. This means only some of the model's parts are working on any single token. This directly lowers the amount of calculation (Floating Point Operations or FLOPs) needed per token compared to a dense model of the same total size.

Diagram showing token processing flow in an MoE model. The router selects k experts for computation.
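To make the routing idea concrete, below is a minimal, illustrative sketch of a top-k MoE layer in PyTorch. The class name, layer sizes, and the plain Python loop over experts are simplifications chosen for readability, not how production frameworks implement it (they use fused, batched kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative top-k MoE layer; names and sizes are hypothetical."""

    def __init__(self, d_model=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a small linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block; all experts share one shape.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts only
        out = torch.zeros_like(x)
        # Sparse activation: only the selected experts run for each token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SimpleMoELayer()
tokens = torch.randn(16, 512)                  # a batch of 16 token embeddings
print(moe(tokens).shape)                       # torch.Size([16, 512])
```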

Training vs. Inference: Where is MoE Most Efficient?

MoE efficiency shows up differently during training (building the model) and inference (using the model).

Training Efficiency Gains

The biggest advantage of MoE appears during training. Training a dense model requires calculation proportional to its total size for every token processed. An MoE model only runs the selected experts for each token in a group (batch).

This means training an MoE model with 100 billion total parameters but only 20 billion active per token needs far less calculation per training step than training a 100 billion parameter dense model. This saving makes it feasible to train models with huge total parameter counts (hundreds of billions or even trillions) in reasonable time and with less computing power, something that would be prohibitively expensive with a dense architecture. In short, MoE lets us build bigger models.
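As a rough sanity check, the common approximation of about 6 FLOPs per active parameter per training token makes the saving concrete. The sketch below simply reuses the hypothetical 100B-total / 20B-active numbers from the paragraph above:

```python
# Rough training-compute estimate using the common ~6 FLOPs per active
# parameter per token rule of thumb (ignores attention and routing details).
FLOPS_PER_PARAM_PER_TOKEN = 6

dense_total = 100e9   # 100B dense model: every parameter is active
moe_active  = 20e9    # 100B-total MoE, but only ~20B parameters active per token

dense_flops = FLOPS_PER_PARAM_PER_TOKEN * dense_total   # per training token
moe_flops   = FLOPS_PER_PARAM_PER_TOKEN * moe_active

print(f"Dense 100B  : {dense_flops:.1e} FLOPs/token")
print(f"MoE 20B act : {moe_flops:.1e} FLOPs/token")
print(f"The MoE needs ~{dense_flops / moe_flops:.0f}x less compute per training token")
```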

Inference Efficiency

During inference (using the model), the same saving happens for each token. An MoE model does less calculation for each bit of text it creates compared to a dense model of the same total size. This calculation efficiency is real and could mean faster speed (in theory) or less energy use when comparing these two specific model types.

However, for people running models locally, this calculation efficiency doesn't automatically mean you get the benefits people expect, like lower VRAM use or faster real world speed:

  1. VRAM: As we'll discuss next, the VRAM needed still depends on the total parameter count if you want fast speeds. This cancels out memory savings unless you use slow "offloading" techniques.
  2. Practical Speed: When you compare an MoE to a smaller dense model with a similar amount of calculation per token (a common comparison when hardware is limited), the MoE's calculation advantage shrinks. Overheads such as routing, memory access patterns, and software maturity can leave it at similar or even slower real-world speeds.

So, while MoE uses calculations efficiently for its size when running, its biggest effect is allowing huge models to be trained in the first place. When using the model, the benefit is best seen as getting the performance of a very large model while only doing the amount of calculation similar to a smaller model, if you have enough memory.

Myth 1: MoE Models Use Dramatically Less VRAM

A common misconception is that because only some parameters (the active experts) are used per token, you need far less VRAM. For fast, responsive local inference, this is incorrect, or at best only partially true.

The Necessity of Full VRAM Usage

For fast performance, GPUs need model parameters stored in their dedicated, high speed memory (VRAM).

In MoE models, a component called the "router" looks at each incoming piece of text (token) and quickly picks which specialized "experts" should process it. An important point: the router may choose a completely different set of experts for the very next token.

To keep things running fast, the GPU needs instant access to the parameters of whichever expert the router picks. If an expert's parameters aren't already loaded into the fast VRAM, they have to be fetched from much slower system RAM or your computer's storage drive (like an NVMe or SSD).

This fetching process is extremely slow compared to how fast GPUs calculate. If the model constantly has to wait for parameters to be loaded, it creates significant delays. This makes the model feel sluggish and unresponsive, completely wiping out the speed benefits you'd expect from using a powerful GPU.

Therefore, to ensure the MoE model runs quickly and smoothly for local inference, all the experts, the entire collection of the model's parameters, must be loaded into VRAM from the start. This allows the router to instantly access any expert it needs without causing performance delays.

VRAM Scaling: Total vs. Active Parameters

How much VRAM you need depends mostly on the model's total number of parameters and the numerical precision used to store them (like fp16, int8, or int4). The number of active parameters (those used per token) mainly determines the amount of calculation (FLOPs), not how much memory is needed to hold the model's weights for fast inference.

Here's how the different types compare (using some technical terms defined below):

Let's define:

  • N: Total parameters in a comparable small dense model.
  • M: Number of experts in the MoE model.
  • P_expert: Number of parameters per expert.
  • P_shared: Number of shared (non-expert) parameters.
  • k: Number of active experts per token (k ≤ M).
  • P_total_MoE = M × P_expert + P_shared: Total parameters in the MoE model.
  • P_total_denseL: Total parameters in a large dense model (similar in size to P_total_MoE).

Model Type                 | Total Params   | Active Params (per token) | Compute (FLOPs/token)      | Relative VRAM (Weights)
Dense (Small)              | N              | N                         | O(N)                       | ∝ N
MoE (M experts, k active)  | P_total_MoE    | k × P_expert + P_shared   | O(k × P_expert + P_shared) | ∝ P_total_MoE
Dense (Large)              | P_total_denseL | P_total_denseL            | O(P_total_denseL)          | ∝ P_total_denseL

(Notation: O(·) means 'grows roughly with'; ∝ means 'proportional to'.)

This table shows that an MoE model's VRAM requirement scales with its total parameter count (P_total_MoE), just like a large dense model (P_total_denseL), even though its calculation cost per token is much lower (it depends on k, not M). If P_total_MoE is much larger than N, the MoE model needs far more VRAM than the small dense model.
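A quick sketch of that weight-memory arithmetic is shown below. It assumes the VRAM needed for weights is simply the parameter count times the bytes per parameter; real runtimes also need room for activations, the KV cache, and framework overhead, and the model sizes are the hypothetical ones used throughout this article:

```python
# Approximate VRAM needed just to hold the weights: total parameters x bytes each.
# Activations, KV cache and runtime buffers all come on top of this.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(total_params, precision):
    return total_params * BYTES_PER_PARAM[precision] / 1e9

# Hypothetical models matching the sizes discussed above.
models = {
    "Dense 20B (small)":           20e9,
    "MoE 100B total / 20B active": 100e9,   # VRAM follows the *total* count
    "Dense 100B (large)":          100e9,
}

for name, total in models.items():
    sizes = ", ".join(f"{p}: {weight_vram_gb(total, p):.0f} GB"
                      for p in BYTES_PER_PARAM)
    print(f"{name:30s} {sizes}")
```

Despite needing roughly 5x less calculation per token, the 100B-total MoE needs the same weight memory as the 100B dense model at every precision.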

The only way to run an MoE model using less VRAM than its total size needs is through parameter offloading. This means purposefully keeping unused experts in slower memory (CPU RAM or storage) and loading them only when needed. This works, but adds noticeable delay (latency), making the model less responsive. You sacrifice speed to run the model when you don't have enough VRAM.
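A rough, order-of-magnitude comparison shows why offloading hurts. The expert size and bandwidth figures below (roughly PCIe 4.0 x16 for host-to-GPU transfers and a high-end consumer GPU's VRAM) are assumptions for illustration, not measurements:

```python
# Rough comparison: fetching one expert's weights from system RAM over PCIe
# versus reading them directly from VRAM. All numbers are assumed, not measured.
expert_params   = 2e9          # hypothetical 2B-parameter expert
bytes_per_param = 2            # fp16
expert_bytes    = expert_params * bytes_per_param   # ~4 GB

pcie_bw = 25e9                 # ~25 GB/s effective (PCIe 4.0 x16, assumed)
vram_bw = 900e9                # ~900 GB/s (high-end consumer GPU, assumed)

print(f"Fetch over PCIe : {expert_bytes / pcie_bw * 1000:6.1f} ms")
print(f"Read from VRAM  : {expert_bytes / vram_bw * 1000:6.1f} ms")
```

Paging in a single expert can take well over a hundred milliseconds, and the router may demand different experts on every token, which is what makes offloaded MoE inference feel sluggish.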

Myth 2: MoE Guarantees Faster Inference Speed Locally

Doing fewer calculations (FLOPs) per token doesn't automatically mean MoE models are faster in real world use compared to all other options. Speed depends on what you compare it to and other factors.

Comparative Performance

  • MoE vs. Large Dense Model (Same Total Size): An MoE model (like one with 100B total parameters, 20B active) will usually run faster (more tokens per second) than a 100B dense model because it does a lot less calculation per token.
  • MoE vs. Small Dense Model (Same Active Size): Comparing a 100B MoE (20B active) to a 20B dense model isn't so simple. While the main math calculations are similar, the MoE model has extra tasks that take time:
    • Router Calculation: The router part itself needs to do calculations to pick experts. This takes some extra time.
    • How Memory is Used: Even though fewer parameters are used in the calculation, fetching the chosen experts' weights from a much larger total model stored in VRAM can be less bandwidth-friendly than reading a smaller, all-in-one dense model (see the rough estimate after this list).
    • Software Details: Speed really depends on how well the software used to run the model (like vLLM, llama.cpp, TGI) is built. Special code for handling sparse experts efficiently, smart routing, and ways to run experts at the same time are important. MoE software might be newer or less polished than software for standard dense models.
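For single-user local generation, decode speed is usually limited by how many bytes of weights must be read per token rather than by raw FLOPs. The sketch below gives a rough upper bound under assumed numbers (fp16 weights, about 900 GB/s of VRAM bandwidth, ignoring the KV cache, routing cost, and kernel overhead):

```python
# Rough upper bound on decode speed when generation is memory-bandwidth bound:
# tokens/s ~= VRAM bandwidth / bytes of weights read per generated token.
vram_bw         = 900e9   # bytes/s, assumed high-end consumer GPU
bytes_per_param = 2       # fp16

def max_tokens_per_s(active_params):
    return vram_bw / (active_params * bytes_per_param)

print(f"Dense 20B          : ~{max_tokens_per_s(20e9):.1f} tok/s")
print(f"MoE 100B (20B act) : ~{max_tokens_per_s(20e9):.1f} tok/s")
print(f"Dense 100B         : ~{max_tokens_per_s(100e9):.1f} tok/s")
```

The ceiling is the same for the MoE and the 20B dense model; routing overhead and less favourable memory access then decide which one actually lands closer to it, while the 100B dense model sits far below both.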

The main performance benefit of MoE is getting the results (like quality or knowledge) of a big model while only doing the amount of calculation similar to a smaller model. It's about getting more 'smarts' for the amount of calculation done, not always being the absolute fastest compared to smaller dense models.

Myth 3: MoE Models Are Always the Best Choice If They Fit

MoE is an interesting design, but whether it's the right choice depends on your needs and constraints. It is not always the best option.

Contextual Factors

  • Quality vs. Size: MoE tries to give big model quality with smaller model calculations, but results can differ. A well tuned dense model of the same active size might perform just as well or even better for certain tasks, depending on how they were trained and adjusted.
  • Software and Hardware: The speed you get locally depends on how well the software handles MoE and how well that works with your specific GPU. Software for dense models is often better optimized because it has been around longer. Speed differences can come from technical details like how calculations are combined or how memory is managed.
  • Adjusting the Model: Changing MoE models later (fine tuning) can be trickier than with dense models. You might need special methods to handle the experts or adjust the router, possibly needing more complex training.
  • Effect of Shrinking the Model: Shrinking the model size (quantization) is important for running large models locally. While helpful for both types, the effect on quality might be slightly different between MoE and dense models, especially when shrunk a lot (e.g., to 4 bits).

Choosing the best model means balancing the quality you want, the speed you need (both response time and overall speed), your hardware (especially VRAM), and how well the software you use runs specific MoE or dense models.

Conclusion

Mixture of Experts models are an important architectural development, delivering the capability of huge models while doing much less calculation per token. Their biggest contribution is making it practical to train such huge models in the first place, which in turn is what puts strong open models within reach of local setups.

However, understanding how they work when you use them is important. To run them fast locally without slow offloading, MoE models need enough VRAM to hold all their parts, just like a dense model of the same total size. The main efficiency gain when using them is less calculation (compared to an equally large dense model), usually not in saving VRAM or guaranteed faster speed compared to smaller dense models.

© 2025 ApX Machine Learning. All rights reserved.
