
DeepSeek-V3 Technical Analysis - MoE, Fine-Grained Quantization, DualPipe, MLA

Writer: Miguel Carreira Neves

In this blogpost we will be analyzing the recent paper on DeepSeek-V3's training and architecture, as well as discussing the model's impact on the industry and the AI community.


We will cover the following technical innovations: Mixture of Experts (MoE), Fine-Grained Quantization, DualPipe, Multi-head Latent Attention (MLA), and more.




What is DeepSeek-V3


DeepSeek-V3 is an open-source LLM built by DeepSeek, a Chinese AI company founded in 2023. DeepSeek-V3 has been proving itself to consumers and companies by showing impressive results, with people comparing it to state-of-the-art closed-source models like GPT-4o and Claude-3.5-Sonnet.


DeepSeek Chat - similar to OpenAI's ChatGPT UI

As of now, using it in their web chat interface is free and allows one to test both the V3 and R1 (reasoning) models, which are competing with OpenAI's GPT-4o and o1.


In this article we will only be covering the DeepSeek-V3 model, leaving the R1 for a future article.


Why is it relevant?


DeepSeek-V3 is, by today's standards, a large LLM: its 671B parameters largely overshadow the previous open-source king - Llama 3.1 with 405B - and DeepSeek's own older V2.5 with 236B.

DeepSeek-V3 Parameters vs Llama3.1 - saves a lot on activated parameters

A key distinction between these two families of models is that while Llama 3.1 has a Dense architecture, requiring every parameter to be involved in processing each input token, DeepSeek-V3 activates only 37B parameters per token - just 5.51% of its total.


This allows for incredible efficiency for training and inference, drastically lowering costs and latency - it is one of the many perks of leveraging a Mixture of Experts (MoE) architecture, which we will discuss later on.


LLM Benchmarks


On paper, the DeepSeek-V3 model blows every other non-reasoning LLM out of the water across various domains, including English, Coding, Math, and Chinese, placing above many SOTA models like GPT-4o and Claude-3.5 on several research metrics.

LLM Benchmark Averages* on various Research Metrics - adapted from deepseek.com

* The image shown above is an adaptation using the average of every research metric. Of course, some metrics may be more relevant and accurate than others, so take these numbers with a grain of salt.


The category where V3 seems to distinguish itself the most is Math, being 37.77% better than the 2nd best model. This is likely due to the Post-Training of V3, where the authors indicate it was trained on data generated by the reasoning model R1, distilling knowledge from that more powerful model.


These are very promising results that, to the untrained eye, may suggest this model is categorically better than all other non-reasoning models (i.e., excluding reasoning models like o1 and R1). However, time and time again the community has seen how models can excel on research metrics without that always translating to real-life scenarios.


Training Cost


Something that came as a surprise and completely shook the tech space was the astonishingly low training cost, stated to be just ~$5.6M.

Training costs of DeepSeek-V3, assuming an H800 rental price of $2 per hour - from the V3 Paper

To put things into perspective:

  • GPT-4 from 2023 cost ~$80 - $100M

  • Llama3.1 405B estimated cost* ~$38.55M - $76.79M

  • GPT-3, which was trained in 2020 with 175B parameters, cost ~$4.5M.


* Napkin math for a hypothetical scenario in which Meta rented GPU time - Meta reported that Llama 3.1 405B used 30.84 million hours of H100 80GB instances to train; looking at some GPU cloud services, we can assume rental prices of $1.25 (Hyperstack) to $2.49 (Oracle) per hour, giving roughly $38.55M - $76.79M. Of course, these prices would be much cheaper for someone like Meta, and in reality they used their own GPUs. These figures are merely illustrative, to allow a generic comparison with DeepSeek's cost - do not take them too seriously.
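For transparency, here is that napkin math spelled out - same GPU-hours and hourly rates as quoted above, and just as rough:

```python
# Back-of-the-envelope check of the Llama 3.1 405B rental-cost estimate above.
h100_hours = 30.84e6                  # GPU-hours reported by Meta for Llama 3.1 405B
for rate in (1.25, 2.49):             # $/hour: the Hyperstack and Oracle list prices quoted above
    print(f"${rate:.2f}/h -> ${h100_hours * rate / 1e6:.2f}M")
# $1.25/h -> $38.55M
# $2.49/h -> $76.79M
```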


Of course, costs have been going down and the cost of training GPT-4o is likely much lower. It is still unknown, but speculation is that it remains considerably more expensive to train than DeepSeek-V3.


It is important to note that these figures from V3 refer only to the GPU training cost and do not account for other important things without which you could not train a model like this:

  • RLHF (Reinforcement learning from human feedback)

  • Ablation studies (test how removing specific model components affects performance)

  • Smaller model runs and experiments

  • Data generation (from R1 or other models)

  • Researchers' salaries (wait, people don't just work for free!? 🤔)


Still, GPU training time cost is a good metric to have and, used with some care, it allows a general comparison of the costs of training each LLM. It paints a picture of these costs dropping drastically and of more players being able to enter the market as top competitors, now and in the near future.


API Token Cost


Paired with its low training cost comes a very low API token cost, which puts heavy pressure on other LLM providers in the market: DeepSeek-V3 is offered at a price 29.8x cheaper than GPT-4o while being at least as powerful, and possibly better for many use cases.


API Token Prices GPT-4o vs DeepSeek-V3 - from DocsBot.ai

That begs the question of how it fares against other LLM providers, which can be seen in the image below, where we simulated the total cost of making 1000 calls with 100k input and 10k output tokens each.

API Token Price Comparison - Total Cost assumes 100k input, 10k output tokens, and 1000 calls - adapted from DocsBot.ai

Looking at the results in the table above, we can see that DeepSeek provides by far the cheapest models - $14.28 and $57.19 - with even its heavyweight R1 reasoning model being much more affordable than every other model on the list.
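For the curious, the totals in the table come from plain per-token arithmetic. Below is a minimal sketch of that calculation - the two price pairs are purely illustrative placeholders, not the actual rates behind the table figures, so check each provider's pricing page:

```python
def total_cost(price_in_per_m: float, price_out_per_m: float,
               input_tokens: int = 100_000, output_tokens: int = 10_000,
               calls: int = 1_000) -> float:
    """Total API cost in dollars, given prices per million tokens."""
    total_in_m = input_tokens * calls / 1e6    # total input tokens, in millions
    total_out_m = output_tokens * calls / 1e6  # total output tokens, in millions
    return total_in_m * price_in_per_m + total_out_m * price_out_per_m

# Hypothetical prices per million tokens, for illustration only:
print(f"budget model:  ${total_cost(0.14, 0.28):.2f}")   # -> $16.80
print(f"premium model: ${total_cost(2.50, 10.00):.2f}")  # -> $350.00
```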


OpenAI's o1 has a total cost of $1560, a gigantic price tag that makes it hard to justify for medium- or large-scale applications.


LLM API token prices have been steadily dropping over the months, yet such a price drop paired with models of this quality will certainly shake up the market and force other players to step up their game - either delivering much better models or finding ways to cut prices.


Community Reactions


Overall, upon actually using and testing the model, the majority of users appear to agree that it is indeed matching benchmark expectations and is on par with the latest closed-source models like GPT-4o, with some stating that it outperforms GPT-4o outright across a variety of domains and on more complex reasoning.

DeepSeek's new reign

With its low price and high quality, DeepSeek-V3 is starting to become everyone's go-to model, taking GPT-4o's spot. Users are adopting it for personal use, in Cursor, and for Proof-of-Concept (PoC) projects, and some are perhaps already trying it in full-scale production.


There have also been emerging discussions about how its answers look strikingly similar to those of existing proprietary models, and questions about how DeepSeek was able to build a model of such quality. Additionally, the low training cost made the community wonder whether the data had higher quality, possibly obtained by distilling knowledge from other models - raising suspicions that DeepSeek models were trained on other proprietary models' outputs, a legally gray-area move.


However, it is important to note that some people have criticized DeepSeek-V3's longer-context answers (>8k tokens), which often become repetitive - for instance in storytelling - and underperform compared to GPT-4o. Importantly, there still hasn't been any research comparing V3 against other models at different context lengths, which would be necessary to judge this properly.


Who is DeepSeek


DeepSeek, founded in 2023 by Chinese entrepreneur Liang Wenfeng, has rapidly emerged as a formidable player in the artificial intelligence (AI) sector. Initially, Liang applied AI to Quantitative Trading through his hedge fund, High-Flyer, which laid the groundwork for DeepSeek's AI endeavors.


Through the use of open-source models, DeepSeek challenges the traditional dominance of proprietary AI systems, promoting transparency and inclusivity in AI development. This strategy not only enhances their models through community collaboration but also positions them as leaders in the movement toward open and accessible AI technologies.


Funding

The High-Flyer hedge fund, reported to manage around $8 billion in assets, has been instrumental in providing the financial resources necessary for DeepSeek's AI research and development. High-Flyer invested heavily in advanced computing infrastructure before U.S. export restrictions on AI chips, building two AI supercomputing clusters comprising thousands of Nvidia A100 GPUs.


High-Flyer's support allowed DeepSeek to secure ~2048 NVIDIA H800 GPUs - the high-end chips that were used to train the V3 and R1 models.



Driving Force for Innovation - U.S. GPU Export Restrictions

DeepSeek had 2048 NVIDIA H800 chips at their disposal; however, these chips are a restricted version of the H100 GPUs, designed to comply with U.S. export controls.


Since October 2023, the US has also banned exports of these slower GPU chips - A800 and H800 - to China and Hong Kong, in an attempt to slow down the AI market in these regions.

US has since banned exports to China of the H800 chips, which were used to train DeepSeek models

With DeepSeek showing just how much can be achieved with these limited H800 chips, it becomes more evident that China's technical prowess and knowledge in the AI field are on par with the US.


The exact restrictions are not fully public, but the interconnect bandwidth is capped at roughly half speed - 400 GB/s compared to the H100's original 900 GB/s - heavily limiting multi-GPU scalability. Additionally, overall FP (floating-point) performance is significantly lower, and the Tensor Cores may also be limited.


In hindsight, this interconnect cap, which limits multi-GPU scalability, appears to be what drove most of the DeepSeek team's innovations - improved methods like Mixture of Experts, Fine-Grained Quantization, DualPipe communication, and much more.

Technical Innovations


This blogpost covers these technical innovations at a high level; a future blogpost will provide an in-depth dive and will be linked here when available.


Mixture of Experts (MoE)


If you are unfamiliar with Mixture of Experts (MoE), it is a technique that divides expertise and responsibilities across different parts (expert sub-networks) of the LLM, activating only the relevant ones for each input.

MoE is like having a group of LLMs, each being specialized in different domains and contributing only when they are an expert in said domain

This is similar to our brains and how different parts of the brain specialize in tasks like vision, language, or movement. Both rely on routing information to the right "expert" for efficient processing.


For more information on MoE and its inner technical workings and how it has been leveraged in models like GPT4 and Mixtral, providing the basis for understanding DeepSeek-V3 - check our Blogpost on LLM Mixture of Experts Explained.


Innovations in MoE


DeepSeek-V3 uses MoE layers in as many places as possible, keeping only 3 dense FFN (feed-forward network) layers, which are much more computationally expensive since all of their parameters are active for every token.

MoE is what allows DeepSeek to use only a fraction of its total parameters per call

DeepSeekMoE uses finer-grained experts, preferring many small experts over a few large ones. For instance, Mixtral-8x7B used only 8 experts per layer (as discussed in our blogpost), requiring each one to remain quite generic.


V3's MoE architecture is made up of a total of 256 experts, each having ~4.1B parameters (by our estimates), and out of those 256 experts only 8 Routed Experts contribute to each token at each layer.


Mixture of Experts (MoE) of DeepSeek-V3 - Shared Experts are used to retain common knowledge and allow others to specialize


It also isolates some experts as Shared Experts, which are always active and act as common-knowledge experts, allowing the others to specialize rather than having to keep track of mundane, common knowledge.


This key idea pairs quite nicely with the finer-grained experts: since each small expert cannot afford to store common knowledge itself, offloading that to the Shared Expert lets every routed expert specialize despite its lower parameter count, while collectively the experts still cover a wide diversity of topics.


The paper reports using only 1 Shared Expert per layer, so on top of the 8 Routed Experts (out of the 256), the 1 Shared Expert is also used - only 9 experts active per token per layer, roughly 3.5% of that layer's experts.
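To make those numbers concrete, here is a minimal, illustrative sketch of this routing scheme in PyTorch. It is our own simplification, not DeepSeek's implementation: real systems use fused dispatch kernels, sigmoid-based affinities, and load balancing, all omitted here.

```python
import torch
import torch.nn as nn

class ToyFineGrainedMoELayer(nn.Module):
    """256 small routed experts, top-8 picked per token, plus 1 always-on shared expert."""
    def __init__(self, d_model=512, d_expert=128, n_routed=256, n_shared=1, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_routed, bias=False)
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                 nn.Linear(d_expert, d_model))
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared_experts = nn.ModuleList(make_expert() for _ in range(n_shared))

    def forward(self, x):                                  # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)            # token-to-expert affinities
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the 8 best experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize the kept gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed_experts):   # naive loop; real kernels batch/dispatch this
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        for expert in self.shared_experts:                 # the shared expert always contributes
            out = out + expert(x)
        return out
```

With 256 routed experts but only 8 + 1 of them active, each token touches roughly 3.5% of the layer's expert parameters, which is exactly the sparsity described above.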


Communication Speed Issues


This MoE design is one of the key reasons why DeepSeek is so cost-efficient, although it heavily exacerbates the problems caused by the H800's relatively low communication speed: it requires all-to-all GPU communication, since experts can be allocated on any GPU in the cluster.


To mitigate this, expert redundancy is created on each GPU: each machine hosts more experts than it actively uses, and high-load experts are replicated across multiple GPUs to avoid slowdowns and bottlenecks.


Other innovations are made, particularly in the Router mechanism, which lets them drop the usual auxiliary-loss-based approach for balancing the load across experts. They pioneer an Auxiliary-Loss-Free Load Balancing Strategy, covered in another paper, which we won't go into in depth for now.
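Still, as a rough sketch of the idea as we read it from the paper (simplified, not DeepSeek's actual code): each expert gets a bias term that influences only which experts are selected, not how their outputs are weighted, and that bias is nudged after each step to push load toward uniform.

```python
import torch

n_experts, top_k, gamma = 256, 8, 1e-3   # gamma = bias update speed (illustrative value)
bias = torch.zeros(n_experts)            # one routing bias per expert, not trained by gradients

def route(affinities):                   # affinities: [tokens, n_experts]
    _, idx = (affinities + bias).topk(top_k, dim=-1)   # bias only affects expert *selection*
    gates = affinities.gather(-1, idx)                 # gate values stay unbiased
    return idx, gates / gates.sum(-1, keepdim=True)

def update_bias(idx, n_tokens):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    bias.sub_(gamma * torch.sign(load - n_tokens * top_k / n_experts))
```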


DualPipe - Bi-directional Communication


To handle the H800 chip communication limitations, the researchers came up with the DualPipe solution - a technique that overlaps communication of the forward and backward pass to avoid Pipeline Bubbles (idle time).


The name DualPipe stems from the overlap of opposite-direction communications from the forward and backward passes (featuring attention, weights, inputs, etc.).

DualPipe - overlaps communication of forward and backward pass to avoid idle time

A major achievement is that, as the model grows larger, fine-grained experts can be used without adding extra communication delays, as long as the computation-to-communication ratio is kept constant.
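As a toy illustration of why this overlap matters (our own simplified timing model, not DeepSeek's scheduler), compare serializing compute and communication against overlapping them: once the pipeline is full, each overlapped step only costs the larger of the two.

```python
# Toy timing model: n microbatches, each needing compute plus all-to-all communication.
n = 8
t_compute, t_comm = 1.0, 0.8   # hypothetical time units per microbatch

# Naive schedule: every microbatch waits for its communication to finish.
t_serial = n * (t_compute + t_comm)

# Overlapped schedule: while one microbatch's tensors are in flight, another
# microbatch (e.g. from the opposite pipeline direction) is being computed.
t_overlap = t_compute + (n - 1) * max(t_compute, t_comm) + t_comm

print(f"serialized: {t_serial:.1f}   overlapped: {t_overlap:.1f}")  # 14.4 vs 8.8
```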


Fine-Grained Quantization


While other LLMs have only used quantization during fine-tuning or at inference, DeepSeek models use quantization during the main training phase, thanks to their Fine-Grained Quantization. Of course, post-training quantization can also be leveraged for cheaper inference, and there are already various quantized versions on Hugging Face, like this one.


Post-training Quantization and how it reduces the model size for efficiency - From this Blogpost

If you want an overview of what Quantization is you can check this Blogpost where we cover it from the ground up. It focuses more on post-training quantization but covers the key concepts required for clearly understanding this section.


Why Fine-Grained Quantization Matters

  • Efficiency gains: Being able to leverage quantization during training drastically lowers computational requirements.

  • More stable training: Instead of big accuracy losses from quantization, they keep training smooth by handling outliers better.

  • Better use of FP8 hardware: Their method is future-proof for upcoming GPUs (like Blackwell), which will support fine-grained quantization natively.


Short Technical Breakdown

  • They use FP8 (8-bit floating point) to reduce memory and computation costs – but FP8 has a problem: it struggles with extreme values (outliers), leading to instability.

  • Standard quantization scales everything together, which makes training sensitive to these outliers.

  • Their solution? Fine-grained quantization – instead of applying one big scale factor to an entire tensor, they apply smaller, localized scale factors:

    • For activations: Scaling is applied in small 1x128 tiles (per token, per 128 channels).

    • For weights: Scaling is applied in 128x128 blocks (per 128 input and output channels).


  • They modified how General Matrix Multiply (GEMM) operations work, adding per-group scaling inside FP8 calculations. Since standard FP8 GEMM doesn't support this, they apply the scaling during precise FP32 accumulation to maintain accuracy - a simplified sketch of the tile/block scaling idea follows below.
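Below is a simplified sketch of what tile-wise and block-wise scaling look like in practice. It is our own illustration in PyTorch, not DeepSeek's FP8 kernels: it needs a PyTorch version with float8 support, assumes dimensions divisible by 128, and omits the modified GEMM / FP32-accumulation logic entirely.

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of the E4M3 FP8 format

def quantize_activations_tilewise(x: torch.Tensor, tile: int = 128):
    """1x128 tile-wise activation quantization: one scale per token per 128 channels."""
    tokens, channels = x.shape                       # assumes channels % tile == 0
    x_tiles = x.view(tokens, channels // tile, tile)
    scales = x_tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x_tiles / scales).to(torch.float8_e4m3fn)
    return x_fp8.view(tokens, channels), scales.squeeze(-1)

def quantize_weights_blockwise(w: torch.Tensor, block: int = 128):
    """128x128 block-wise weight quantization: one scale per block."""
    out_f, in_f = w.shape                            # assumes both divisible by block
    w_blocks = w.view(out_f // block, block, in_f // block, block)
    scales = w_blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w_blocks / scales).to(torch.float8_e4m3fn)
    return w_fp8.view(out_f, in_f), scales.squeeze(-1).squeeze(1)
```

In the actual training recipe, these per-tile and per-block scales are applied while the partial FP8 products are being accumulated in FP32, which is the part that requires the modified GEMM mentioned above.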



Multi-Head Latent Attention (MLA)


Trying to avoid getting too technical here: the basic idea of MLA is to compress the attention input into a low-dimensional latent vector. Later, to compute attention, this latent vector can be mapped back to the high-dimensional space to recover the keys and values.


As a result, only the latent vector needs to be stored, leading to a significant memory reduction. MLA is an adaptation of the Multi-Head Attention technique that uses latent compression of dimensions for efficiency gains.

MLA takes inspiration from previous techniques like Multi-head but uses latent compression of dimensions for huge efficiency gains - from the V2 paper

Since the attention components (the cached KV pairs) are often among the most memory- and compute-demanding parts of a transformer, this leads to a huge positive impact on overall speed and efficiency.
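As a rough sketch of the core idea - a deliberate simplification that ignores the decoupled RoPE path, per-head splitting, and the query compression the actual MLA design includes:

```python
import torch
import torch.nn as nn

class ToyLatentKV(nn.Module):
    """Cache one small latent per token instead of full keys and values."""
    def __init__(self, d_model=1024, d_latent=128, d_kv=1024):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress hidden state
        self.up_k = nn.Linear(d_latent, d_kv, bias=False)     # recover keys on the fly
        self.up_v = nn.Linear(d_latent, d_kv, bias=False)     # recover values on the fly

    def forward(self, h, latent_cache=None):    # h: [batch, new_tokens, d_model]
        c = self.down(h)                        # only this [..., d_latent] tensor is cached
        if latent_cache is not None:
            c = torch.cat([latent_cache, c], dim=1)
        return self.up_k(c), self.up_v(c), c    # keys, values, updated latent cache
```

Compared to caching full K and V (2 x d_kv values per token), storing only the d_latent-sized latent shrinks the cache by roughly 2 x d_kv / d_latent - a 16x reduction with these illustrative sizes.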


Suggestions for NVIDIA's AI GPU Improvements


The DeepSeek team is pushing efficiency to such an extreme that they are not just optimizing software - they are actively advising hardware manufacturers like NVIDIA on how to improve future AI chips.


The paper has a whole section commenting on NVIDIA's H800 GPUs and suggesting improvements - not only for those chips, but for new industry standards as a whole.


This isn't just theoretical - they are already pushing NVIDIA's H800 GPUs to their limits. By working at the cutting edge of both hardware and software, DeepSeek isn't just following industry trends - they're shaping them.


Communication

Their implementation of all-to-all communication and FP8 training has exposed inefficiencies in current architectures, such as the reliance on expensive and valuable Streaming Multiprocessors (SMs) for communication instead of reserving them for computation.


The DeepSeek team suggests that if communication could be offloaded more efficiently - perhaps to dedicated hardware instead of general-purpose compute units (SMs) - overall AI performance could improve.


By redesigning hardware to handle communication separately, AI chips could unlock more computational power and improve efficiency without sacrificing resources for data transfer.

Tensor Cores are one of the biggest strengths of new AI chips - The DeepSeek team argues that allowing them to sit idle due to Communication using the SMs is highly inefficient

Currently, 20 out of the 132 available SMs on the NVIDIA H800 GPU are used just for communication tasks rather than actual AI computations. This creates an inefficiency because those SMs could otherwise contribute to training the model, while tensor cores - specialized for AI workloads - sit idle.


Fine-Grained & Online Quantization Native Support

Current GPUs only support per-tensor quantization, lacking native support for fine-grained approaches like DeepSeek's tile-wise and block-wise quantization. They argue that future chips should support this natively, as such fine-grained quantization is likely to become an industry standard.


Current GPU chips also appear to struggle with online quantization, requiring a lot of engineering tricks to make it work, which makes it less efficient. The researchers suggest two possible strategies for chipmakers to deal with this issue and improve online quantization performance.


They also discuss the need for support for Transposed GEMM Operations, which we won't analyze in depth here for brevity.


Overall, the core improvement suggestions revolve around quantization during training - an area that seems underexplored by chipmakers and that could significantly boost training efficiency and reduce costs.


For more information on quantization and an in-depth look, see our previous Blogpost that focuses on post-training Quantization.



Conclusion


DeepSeek-V3 is a game-changer in the AI world. It proves that open-source models can go head-to-head with the best from OpenAI and Anthropic - at a fraction of the cost.


By using smart techniques like Mixture of Experts (MoE), DualPipe communication, and Fine-Grained Quantization, DeepSeek built a powerful 671B parameter model for just ~$5.6M. This challenges the idea that only billion-dollar companies can create top-tier AI.


For the Western market, this is a big wake-up call. DeepSeek has shown that AI breakthroughs don’t just happen in Silicon Valley.


Looking ahead, DeepSeek and other Chinese AI companies are unlikely to slow down.


One thing is clear: the AI race is no longer just between US tech giants. DeepSeek has set a new standard, and the competition is just getting started.


Let's hope this competition brings major open-source advancements in the AI field that benefit the whole community.





