Deep dive into DeepSeek

Amir Moghimi (Co-founder & CTO)


Deep dive into DeepSeek for software engineers

I have compiled my research on DeepSeek from many sources into this one article and structured it the way I wish someone had done it for me.

Why do I care?

As a software engineer and CTO, I care a lot about Large Language Models (LLMs), such as OpenAI's o3 and DeepSeek's R1, because they have changed our tools, our development processes and our software products. And I think you should care too, especially if you want to stay in this field for a while!

Key points

A few key differences between DeepSeek and its mainstream rivals, such as OpenAI and Google, are:

  1. DeepSeek is around 10x cheaper for inference (i.e. usage)

  2. 100% open source and open weight

  3. Makes it possible to run powerful AI on your beefy desktop (and mobile in the future)

  4. Lower cost for pre-training and fine-tuning

What can I do with it?

With DeepSeek's reasoning model (called R1), you can generate code at a new scale and at a much lower cost. I always have concerns about cost when we dynamically generate code in our products based on user input. For example, generating a totally custom report with multiple charts based on a description from the user. It is exciting, but how much would it cost in tokens, or GPU time?

I mean a large amount of code that we have to compile, test, fix, generate again, test, fix, run, show the output to the user, get feedback and repeat this expensive loop until we get something that finally works. Well, being able to do this 10 times cheaper is a big difference that can actually make this feasible as a feature. You can find an example of this work in our FHIR converter, where we generate integration code dynamically based on user-provided examples. This can easily be applied to domains other than healthcare systems integration.

Where did DeepSeek come from?

DeepSeek is a Chinese AI company established in 2023, owned and funded by a Chinese hedge fund named High-Flyer. High-Flyer is claimed to have around US$7 billion (as of October 2024) in assets under management and is very interested in using AI in its trading algorithms. They are estimated to have access to around 50,000 Hopper GPUs.

DeepSeek has sourced talent exclusively from China, offering salaries of over US$1.3 million for promising candidates, well above what the other big Chinese tech companies pay. They have around 150 employees and are growing rapidly. So, they have a lot of money; but do they have enough to compete in the LLM game? It turns out the answer is yes, because they have changed the rules of the game, to some extent.

What rules did DeepSeek change?

DeepSeek has released an open source model called V3 that is impressive compared to GPT-4o. Given that GPT-4o was released in May 2024 and that AI moves very quickly, you may not be surprised that less compute is needed to train a stronger model. But what's amazing is how much less memory and compute is needed for inference.

For those of you who may be new to AI, inference is the process of making decisions (or predictions) based on a previously trained model and current input data. Whenever you ask ChatGPT or DeepSeek a question, you are performing an inference. It used to cost a non-negligible amount of money because it runs on expensive GPUs, but not as much anymore, thanks to the optimisations that DeepSeek has found and implemented.

Multi-Token Prediction (MTP)

If there is a sentence that has a missing word at the end, AI models (and our brains) try to predict what that word should be based on the other words. This is called Next Token Prediction (where each word is one token or more). An AI model analyses the input and calculates the probability of different tokens being the next one. It then selects the token with the highest probability as its prediction. And yes, it's all probabilistic and that's why we have hallucinations.
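To make this concrete, here is a minimal sketch (plain Python with NumPy, using a made-up five-word vocabulary and made-up scores) of how a model turns its raw output scores into probabilities and picks the most likely next token:

```python
# Minimal sketch of next-token prediction: the model produces one score (logit)
# per vocabulary token, a softmax turns the scores into probabilities, and the
# highest-probability token is chosen (greedy decoding). Vocabulary and logits
# here are invented for illustration.
import numpy as np

vocab = ["dog", "fox", "jumps", "lazy", "quick"]
logits = np.array([2.1, 0.3, 4.0, 1.2, 0.7])   # raw scores from the model

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax -> probabilities

next_token = vocab[int(np.argmax(probs))]
print(next_token, probs.round(3))              # "jumps" is the most likely token
```

In practice, models usually sample from this distribution (controlled by settings like temperature) rather than always taking the single most likely token, which is one reason the same prompt can produce different answers.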

DeepSeek V3 utilises Multi-Token Prediction (MTP), which predicts the next few tokens as opposed to a single token. This is done at a scale not seen before; it improves model performance during training, and the extra prediction modules can be discarded during inference, which lowers compute requirements.
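As a rough illustration only: the sketch below assumes a toy setup with one extra linear output head per future position, all reading the same hidden state. DeepSeek V3's actual MTP modules are sequential transformer blocks rather than simple heads, so treat this as the core idea, not the real architecture:

```python
# Toy sketch of Multi-Token Prediction: besides the usual next-token head, extra
# heads predict tokens at positions t+2, t+3, ... from the same hidden state.
# All sizes and weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, num_future = 8, 16, 3

hidden = rng.normal(size=hidden_dim)                           # hidden state at position t
heads = rng.normal(size=(num_future, vocab_size, hidden_dim))  # one projection per future token

for k, head in enumerate(heads, start=1):
    logits = head @ hidden                     # project the hidden state onto the vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over the vocabulary
    print(f"prediction for token t+{k}: token id {int(np.argmax(probs))}")
```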

FP8 (8-bit floating point) numbers

DeepSeek also uses FP8 to represent numbers in the model, which has reduced precision but allows faster computation with lower memory use. It has turned out that this reduction in accuracy is well worth the performance gains.
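The sketch below illustrates the trade-off, assuming PyTorch 2.1 or newer (which provides a `torch.float8_e4m3fn` dtype): casting a weight matrix to FP8 cuts its memory to a quarter of FP32 at the cost of some round-trip error. DeepSeek's actual mixed-precision recipe is more elaborate (fine-grained scaling and selective use of FP8), so this only shows the basic idea:

```python
# Quantise a weight matrix to FP8 and compare memory footprint and round-trip error.
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype.
import torch

weights = torch.randn(4096, 4096)              # FP32 weights: 4 bytes per value
fp8_weights = weights.to(torch.float8_e4m3fn)  # FP8 weights: 1 byte per value
restored = fp8_weights.to(torch.float32)       # cast back to measure the precision loss

print("FP32 size:", weights.numel() * weights.element_size() / 2**20, "MiB")
print("FP8 size: ", fp8_weights.numel() * fp8_weights.element_size() / 2**20, "MiB")
print("mean absolute error:", (weights - restored).abs().mean().item())
```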

Mixture of Experts (MoE) model

Another great algorithmic change in V3 has been combining many smaller expert models that specialise in different things (instead of relying on one large dense model such as GPT-4o). This is called a Mixture of Experts (MoE) model, which is known to have faster training and inference times (compared to a single model with the same number of parameters).

In the mixture of experts approach, a gate network (or router) determines which tokens are sent to which expert. The router is like another sub-model that is trained at the same time as the rest of the model. MoE models traditionally faced challenges with fine-tuning and high memory usage, but DeepSeek has managed to overcome them at a new scale. The fine-tuning challenges can lead to overfitting, which means the model can solve some problems very well but fail to generalise its knowledge.
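Here is a minimal sketch of the routing idea with made-up sizes (8 experts, top-2 routing, toy linear "experts"). DeepSeek V3 uses far more, finer-grained experts plus shared experts, but the principle is the same: the gate scores every expert and only the chosen few actually run.

```python
# Toy Mixture-of-Experts routing: a gate network scores all experts for a token,
# the top-k experts process the token, and their outputs are combined with the
# normalised gate weights. All weights here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_experts, top_k = 16, 8, 2

token = rng.normal(size=hidden_dim)                                # one token's hidden state
router_w = rng.normal(size=(num_experts, hidden_dim))              # gate network weights
experts = rng.normal(size=(num_experts, hidden_dim, hidden_dim))   # one toy linear layer per expert

scores = router_w @ token                      # how well each expert matches this token
chosen = np.argsort(scores)[-top_k:]           # indices of the top-k experts
gate = np.exp(scores[chosen])
gate /= gate.sum()                             # normalised weights over the chosen experts

# Only the chosen experts run, so compute stays low even with many experts in total.
output = sum(w * (experts[i] @ token) for w, i in zip(gate, chosen))
print("routed to experts:", chosen, "with weights", gate.round(2))
```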

Multi-head Latent Attention (MLA)

Multi-head Latent Attention (MLA) is a key innovation of DeepSeek that reduced inference costs significantly. And others are quickly jumping to adopt it. Before we can learn about MLA, we need to know a bit about attention and Key-Value (KV) Cache in LLMs. Please stay with me if you are new to transformers but feel free to skip the next section if you are already familiar with them, or just not interested in too much technical detail.

How these LLMs (transformer models) work

A key part of LLMs like V3 (a.k.a. transformer models) is the attention mechanism, which helps the model focus on important parts of the input. A simple analogy first: imagine you want to find a book about "computer networks." Here’s how the search process works:

  1. You type in your search query: "computer networks." This is the Query (Q).

  2. There is a database with descriptions or tags for each book. These descriptions are the Keys (K).

  3. Once you find a match, the result shows the actual books related to your query. These books are the Values (V).

In simple terms:

  • Query (Q): What you’re looking for

  • Key (K): Information that is indexed

  • Value (V): The content or data retrieved

In an LLM, Queries (Q), Keys (K), and Values (V) are represented as matrices of floating-point numbers stored in GPU memory. Keys (K) represent summaries of the context (information about previous words). For example, if the sentence so far is "The quick white fox jumps over the lazy ...", the keys might encode information like:

  • "quick" → adjective describing the subject

  • "white fox" → subject of the sentence

  • "jumps over" → verb/action

Values (V) represent the actual meaning of these words and their positions. For example:

  • For "quick", the value stores how "quick" relates to other words.

  • For "fox", the value would encode its role as the subject and its connection to the action.

When the model wants to generate the next word:

  1. It needs to figure out which past words are relevant to understanding this word, and the Query (Q) is a mathematical representation of this need. Think of it as the model asking itself:

    • "What context do I need to understand this word?"   

    • "Which previous words should I pay attention to?"

  2. It then compares this query against all stored Keys (K) from previous words. This comparison tells the model which past words are relevant.

  3. Each key is assigned a score based on how closely it matches the query. Higher score means the past word is more important for understanding the current word.

  4. The model then retrieves the Values (V) associated with the most relevant keys.

  5. Finally, the model combines these Values (e.g. a weighted sum of Values) to influence its decision on what the next word should be.

Every time we ask an LLM a question (i.e. run an inference for a specific input), it stores past Keys (K) and Values (V) in memory (called the KV Cache), so it doesn’t have to recompute them when generating long responses. The KV Cache consumes GPU memory, which is scarce even when we load reasonably small LLMs. Therefore, managing KV Cache size is a critical technical challenge when it comes to increasing the total context size (or conversation length).
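To tie the steps above together, here is a minimal single-head attention sketch with a KV cache, using toy dimensions and random vectors in place of real token embeddings; real models use many heads, learned embeddings and far larger sizes:

```python
# Toy single-head attention with a KV cache: each new token's Key and Value are
# appended to the cache, so earlier tokens never have to be re-encoded when
# generating long responses. The cache grows linearly with sequence length.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                         # head dimension (toy-sized)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                      # the KV cache

def attend(x):
    """Process one new token vector x and return its attention output."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    k_cache.append(k)                          # store K and V for future steps
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)                # how relevant is each past token?
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over past tokens
    return weights @ V                         # weighted sum of Values

for _ in range(5):                             # feed five toy token vectors
    out = attend(rng.normal(size=d))
print("cached K/V entries:", len(k_cache))     # one entry per processed token
```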

Why MLA?

Multi-head Latent Attention (MLA) reduced the amount of KV Cache required per query by more than 90%. Yes! It is a huge saving.

This is how it works: Instead of storing all details in full precision, MLA compresses K and V in a way that keeps only the most important information. It also does this jointly, meaning it compresses K and V together instead of separately.

Imagine you have a high-resolution image, but you need to store it using less space. Instead of storing every pixel, you use a compressed format (like JPEG) that loses some precision but is still good enough for the human eye. Similarly, MLA finds a smaller, optimised representation of K and V without losing their effectiveness in retrieving relevant information. They do this by compressing the K and V matrices using low-rank approximations, making them smaller while preserving key information.
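A minimal sketch of that low-rank idea, with made-up dimensions (cache 16 latent numbers per token instead of two 128-dimensional vectors). The real MLA design also handles rotary position embeddings and many attention heads, which are ignored here:

```python
# Toy low-rank KV compression: cache one small latent vector per token and expand
# it back into K and V on demand at attention time. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 128, 16                     # 16 cached numbers instead of 2 x 128

W_down = rng.normal(size=(d_latent, d_model))   # joint compression used for both K and V
W_up_k = rng.normal(size=(d_model, d_latent))   # reconstructs K from the latent
W_up_v = rng.normal(size=(d_model, d_latent))   # reconstructs V from the latent

x = rng.normal(size=d_model)                    # a token's hidden state
latent = W_down @ x                             # this is all that goes into the KV cache
k, v = W_up_k @ latent, W_up_v @ latent         # expanded only when attention needs them

full_cache = 2 * d_model                        # numbers cached per token without compression
mla_cache = d_latent                            # numbers cached per token with compression
print(f"cache per token: {full_cache} -> {mla_cache} numbers "
      f"({100 * (1 - mla_cache / full_cache):.0f}% smaller)")
```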

Meta CEO Mark Zuckerberg was most likely referring to MLA when he commented on the topic: "I think that there’s a number of novel things that they did that I think we’re still digesting," admitting that Meta is looking to "implement" some aspects of DeepSeek's tech "in our systems."

DeepSeek R1 (reasoning)

Based on V3, DeepSeek released another open source model called R1, which is able to achieve results comparable to OpenAI's o1 reasoning model (announced in September 2024). But how did they catch up so fast? For that, we need to know how reasoning works in LLMs.

Instead of quickly generating an answer based on a statistical guess of what the next word should be, a reasoning model will take time to break a question down into individual steps. It then follows a process called Chain of Thought (CoT) to come up with a more accurate answer.

CoT guides the model to break down a complex question into manageable steps, similar to how a human usually solves a problem. This helps the model find the method needed to reach the correct answer, which makes it better at solving complex tasks like coding, math problems, commonsense reasoning, and symbolic manipulation. Note that CoT was used as a prompt engineering technique before these reasoning models were released, but now we have access to models that have built-in chain of thought, meaning they internally reason through steps without needing explicit persuasion from the prompt. In fact, CoT prompting has been shown to reduce performance when applied to a reasoning model.
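For illustration only, here is what the difference looks like at the prompt level (the wording is made up); with a built-in reasoning model like R1 or o1, the second style is unnecessary and, as noted above, can even hurt:

```python
# A direct prompt versus an explicit Chain-of-Thought prompt (illustrative wording).
direct_prompt = "What is 17 * 24?"

cot_prompt = (
    "What is 17 * 24? "
    "Think step by step: break the multiplication into parts, "
    "show each intermediate result, then state the final answer."
)
```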

The landscape of reasoning models is still very fresh and dynamic. That is why DeepSeek could catch up so quickly and others are releasing new models every day.

A strong competitor is Google’s Gemini 2.0 Flash Thinking model, which is available for public use via their API (but is closed source) and has a much larger context window.

OpenAI has also announced benchmark results for their new o3 model, which has higher capabilities than both R1 and o1 (but it is not publicly released yet and is closed source). Yes, another very confusing name by OpenAI because after o1, you might think o2 is next, but we get o3 instead!

To be clear, considering all of the above, what DeepSeek has done is remarkable. They are disrupting the space with their powerful and open source models as a fast moving, well-funded and focused startup.

Did it really cost only US$6M to train DeepSeek V3?

No. That is said to be only their pre-training cost, which is a narrow part of the total cost. Estimates put their hardware spend at well over US$500M across the company's history, and developing their architectural innovations required considerable investment in testing new ideas, with many failures along the way. But the outputs of their research, such as MLA (Multi-head Latent Attention) and their MoE gate network design, are now available to everyone!

Does DeepSeek R1 code better than others?

No, but it's great for an open source model. Let's take a look at a few benchmarks. The Codeforces benchmark, which evaluates a model’s algorithmic capabilities (represented as a percentile ranking against human participants), shows OpenAI o1 leading with 96.6% while DeepSeek R1 achieves a very competitive 96.3%.

The SWE-bench Verified benchmark, which evaluates reasoning in software engineering tasks, also shows DeepSeek R1 performing strongly with a score of 49.2%, slightly ahead of OpenAI o1 at 48.9%. You can also review MathArena for some interesting results on AIME I 2025 (a key stage toward the Mathematical Olympiad), which shows DeepSeek R1 as a top competitor to o3-mini and o1 (and make sure to check out the high running cost of o1).

Safety

So far, it seems DeepSeek has been focusing on performance and ignoring AI safety. Their guardrails have pretty much failed every one of the 50 well-known attacks thrown at them. But we also know how bad ChatGPT was when it had just become popular. So, there's more work to be done here.

Recap

DeepSeek has changed a few key assumptions. Their models are much faster to train and run, even on a desktop system, thanks to their innovations. Their accuracy is also very competitive compared to the models offered by top players such as OpenAI and Google. DeepSeek has also released all of it as 100% open source and open weight. This means we are living in a different world now, where heaps of new possibilities are unlocked.

