Mixture of Experts (MoE) models are gaining traction, appearing in many of the latest releases such as Llama 4 and Qwen 3. They promise the power of massive models while needing less calculation for each piece of text (token) they process. This efficiency makes people interested in running these powerful models on their own computers.
However, the advertised calculation savings often lead to wrong ideas about how much memory (VRAM) they need and how fast they run. This is especially true when trying to run them fast locally.
Understanding how MoE models actually work helps figure out the real hardware needs and what performance to expect. Let's look at some common myths.
An MoE setup is different from a regular 'dense' model. Instead of one big block of parameters (settings), it has many separate, smaller 'expert' parts, which typically share the same structure as one another.
A 'router' part (like a traffic controller) guides the work for each piece of text input. This router picks a few experts (for example, 2 out of 8) to do the processing for that specific token. The results from the chosen experts are then combined for the final output.
The main idea is sparse activation. This means only some of the model's parts are working on any single token. This directly lowers the amount of calculation (Floating Point Operations or FLOPs) needed per token compared to a dense model of the same total size.
Diagram showing token processing flow in an MoE model. The router selects k experts for computation.
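To make the routing idea concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. The class name, sizes, and the 8-experts/2-active split are illustrative choices, not taken from any particular model, and real implementations batch the expert work far more efficiently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: the router scores every expert,
    but only the top_k chosen experts actually run for each token."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # the 'traffic controller'
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # blend the chosen experts' outputs
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for idx, expert in enumerate(self.experts):
                mask = chosen[:, slot] == idx    # tokens routed to this expert in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(5, 64)                      # a batch of 5 token embeddings
print(TinyMoELayer()(tokens).shape)              # torch.Size([5, 64])
```

Note that every expert's weights exist in memory the whole time; only the computation is sparse. That distinction is the core of the VRAM discussion below.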
MoE efficiency shows up differently during training (building the model) and inference (using the model).
The biggest advantage of MoE is during training. Training a dense model needs calculations based on its total size for every piece of text processed. MoE models only use the selected experts for each piece of text in a group (batch).
This means training an MoE model with 100 billion total parameters but only using 20 billion per token needs much less calculation per training step compared to training a 100 billion parameter dense model. This saving lets people train models with huge total parameter counts (hundreds of billions or even trillions) reasonably quickly and with less computing power, something too difficult for dense models. MoE lets us build bigger models.
During inference (using the model), the same saving happens for each token. An MoE model does less calculation for each bit of text it creates compared to a dense model of the same total size. This calculation efficiency is real and could mean faster speed (in theory) or less energy use when comparing these two specific model types.
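As a rough back-of-the-envelope illustration (treating the 100 billion total / 20 billion active split mentioned above as round, hypothetical numbers, and using the common approximation of about 2 FLOPs per active parameter per generated token):

```python
# Rough forward-pass cost per token: ~2 FLOPs per *active* parameter (rule of thumb).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_100b = flops_per_token(100e9)   # 100B dense: every parameter is used
moe_100b   = flops_per_token(20e9)    # 100B-total MoE with ~20B active per token
dense_20b  = flops_per_token(20e9)    # 20B dense model for comparison

print(f"100B dense: {dense_100b:.1e} FLOPs/token")
print(f"100B MoE  : {moe_100b:.1e} FLOPs/token "
      f"(~{dense_100b / moe_100b:.0f}x less than the 100B dense model)")
print(f"20B dense : {dense_20b:.1e} FLOPs/token (same compute as the MoE)")
```

The MoE's per-token compute matches the small dense model, not the large one; its memory footprint, as the next section shows, does not.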
However, for people running models locally, this calculation efficiency doesn't automatically bring the benefits people expect, such as lower VRAM use or faster real-world speed, as the next sections explain.
So, while MoE uses calculations efficiently for its size when running, its biggest effect is allowing huge models to be trained in the first place. When using the model, the benefit is best seen as getting the performance of a very large model while only doing the amount of calculation similar to a smaller model, if you have enough memory.
A common wrong idea is that since only some parameters (the active experts) are used per token, you need much less VRAM. For running models fast with quick responses, this is incorrect or only partially true.
For fast performance, GPUs need model parameters stored in their dedicated, high speed memory (VRAM).
In MoE models, a component called the "router" looks at each incoming piece of text (token) and quickly picks which specialized "experts" should process it. An important point is that the router might choose a completely different set of experts for the very next token.
To keep things running fast, the GPU needs instant access to the parameters of whichever expert the router picks. If an expert's parameters aren't already loaded into the fast VRAM, they have to be fetched from much slower system RAM or your computer's storage drive (like an NVMe or SSD).
This fetching process is extremely slow compared to how fast GPUs calculate. If the model constantly has to wait for parameters to be loaded, it creates significant delays. This makes the model feel sluggish and unresponsive, completely wiping out the speed benefits you'd expect from using a powerful GPU.
Therefore, to ensure the MoE model runs quickly and smoothly for local inference, all the experts (that is, the model's entire set of parameters) must be loaded into VRAM from the start. This lets the router instantly access any expert it needs without causing performance delays.
How much VRAM you need mostly depends on the model's total number of parameters and the detail level used for those parameters (like fp16, int8, or int4). The number of active parameters (those used per token) mainly affects the calculation amount (FLOPs), not how much memory storage is needed for the model's settings (weights) to run fast.
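A quick way to estimate the weight memory is total parameters times bytes per parameter. The sketch below uses the same hypothetical 100 billion total / 20 billion active MoE as before, and ignores the extra memory needed for the KV cache and activations:

```python
# Rough weight-memory estimate: VRAM for weights ≈ total parameters × bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(total_params: float, precision: str) -> float:
    return total_params * BYTES_PER_PARAM[precision] / 1e9

# Hypothetical 100B-total / 20B-active MoE: the memory bill follows the *total* count.
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: ~{weight_vram_gb(100e9, precision):.0f} GB for weights "
          f"(the active count alone would suggest only ~{weight_vram_gb(20e9, precision):.0f} GB)")
```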
Here's how the different types compare (using some technical terms defined below):
Let's define:

- $P_{\text{small}}$: total parameters of a small dense model
- $P_{\text{total}}$: total parameters of the MoE model (all $N$ experts plus shared layers)
- $P_{\text{active}}$: parameters actually used per token (the $k$ selected experts plus shared layers), roughly $\frac{k}{N} P_{\text{total}}$

| Model Type | Total Params | Active Params (per token) | Compute (FLOPs/token) | Relative VRAM (Weights) |
|---|---|---|---|---|
| Dense (Small) | $P_{\text{small}}$ | $P_{\text{small}}$ | $\sim P_{\text{small}}$ | $\propto P_{\text{small}}$ |
| MoE ($N$ experts, $k$ active) | $P_{\text{total}}$ | $P_{\text{active}}$ | $\sim P_{\text{active}}$ | $\propto P_{\text{total}}$ |
| Dense (Large) | $P_{\text{total}}$ | $P_{\text{total}}$ | $\sim P_{\text{total}}$ | $\propto P_{\text{total}}$ |

(Math symbols: $\sim$ means 'roughly grows with', $\propto$ means 'proportional to')

This table clearly shows an MoE model's VRAM needs depend on its total parameters ($\propto P_{\text{total}}$), just like a large dense model. This happens even though its calculation cost per token is much lower (it depends on $P_{\text{active}}$, not $P_{\text{total}}$). If $P_{\text{total}}$ is much larger than $P_{\text{small}}$, the MoE model needs far more VRAM than the small dense model.
The only way to run an MoE model using less VRAM than its total size needs is through parameter offloading. This means purposefully keeping unused experts in slower memory (CPU RAM or storage) and loading them only when needed. This works, but adds noticeable delay (latency), making the model less responsive. You sacrifice speed to run the model when you don't have enough VRAM.
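To get a rough sense of that cost, the sketch below compares fetching one expert's weights over the PCIe bus with reading them from VRAM. The expert size and bandwidth figures are ballpark assumptions for illustration, not measurements:

```python
# Why offloading hurts: time to pull one expert's weights across the bus vs. from VRAM.
# All numbers below are ballpark assumptions, not measurements of any specific system.
expert_params = 2.5e9                  # hypothetical size of one expert (parameters)
bytes_per_param = 2                    # fp16
expert_bytes = expert_params * bytes_per_param

pcie_bandwidth = 32e9                  # ~PCIe 4.0 x16, bytes/second
vram_bandwidth = 900e9                 # ~high-end GPU memory, bytes/second

print(f"Fetch from system RAM: ~{expert_bytes / pcie_bandwidth * 1e3:.0f} ms per expert")
print(f"Read from VRAM       : ~{expert_bytes / vram_bandwidth * 1e3:.1f} ms per expert")
# Well over a hundred milliseconds per swapped expert, potentially every token,
# quickly dominates the time spent actually generating text.
```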
Doing fewer calculations (FLOPs) per token doesn't automatically mean MoE models are faster in real world use compared to all other options. Speed depends on what you compare it to and other factors.
Real-world speed also depends on how mature the inference software you use (vLLM, llama.cpp, TGI) is. Special code for handling sparse experts efficiently, smart routing, and ways to run experts in parallel all matter, and MoE support may be newer or less polished than support for standard dense models.

The main performance benefit of MoE is getting the results (like quality or knowledge) of a big model while only doing an amount of calculation similar to a smaller model. It's about getting more 'smarts' for the amount of calculation done, not about always being the absolute fastest compared to smaller dense models.
MoE is an interesting design, but if it's right depends on what you need and your limits. It's not always the best choice.
Choosing the best model means balancing the quality you want, the speed you need (both response time and overall speed), your hardware (especially VRAM), and how well the software you use runs specific MoE or dense models.
Mixture of Experts models are an important design, allowing huge models to run with much less calculation per token. Just as importantly, that efficiency makes it practical to train such huge models in the first place, which is a big part of why capable models of this size exist to run on your own computer at all.
However, understanding how they work when you use them is important. To run them fast locally without slow offloading, MoE models need enough VRAM to hold all their parts, just like a dense model of the same total size. The main efficiency gain when using them is less calculation (compared to an equally large dense model), usually not in saving VRAM or guaranteed faster speed compared to smaller dense models.