Inside DeepSeek-V4.
Five things most people assume modern LLMs do, and what DeepSeek-V4 does instead. A visual explainer of the architectural choices that let a 1.6-trillion-parameter open-source model handle a one-million-token context using twenty-seven percent of the FLOPs and ten percent of the KV cache of its predecessor.
The popular mental model of how a Large Language Model works is already five years out of date. The 2017 Transformer is still in the family tree, but the frontier has moved on. Below: five comparisons, each pairing what most people assume LLMs do with what DeepSeek-V4 actually does.
§ I · Attention
From everything attends to everything to a zoom-out hierarchy.
Most people, when told that LLMs use "attention," picture a square grid where every word looks at every other word. This was true in 2017. It is no longer how a frontier model handles a million-token context, because the cost of that square grid is, in a literal sense, quadratic.
DeepSeek-V4 interleaves two layer types. Compressed Sparse Attention (CSA) takes every four tokens, crushes them into a single entry, and lets a small "lightning indexer" pick out the top thousand or so entries that matter for each query. Heavily Compressed Attention (HCA) compresses one hundred and twenty-eight tokens into a single entry and attends to all of them densely, because by then there are very few. Both layer types are wrapped around a small sliding window of recent uncompressed tokens, so local detail is never lost.
The cleanest mental image is a zoom-out map. Close to the query, individual streets. A bit further, neighborhoods. The whole rest of the city, drawn as labelled districts. The query never has to look at every street.
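To make that concrete, here is a toy version of the zoom-out in NumPy. Mean-pooling stands in for the learned compression, a dot product stands in for the lightning indexer, and every shape here is illustrative rather than the report's:

```python
import numpy as np

def compress(kv, m):
    """Pool every m consecutive token entries into one coarse entry
    (a stand-in for the learned compression in CSA / HCA)."""
    T, d = kv.shape
    return kv[: (T // m) * m].reshape(-1, m, d).mean(axis=1)

def zoom_out_attend(q, kv, m=4, top_k=1024, n_win=128):
    coarse = compress(kv, m)                 # neighborhoods instead of streets
    picks = np.argsort(coarse @ q)[-top_k:]  # indexer-style top-k selection
    ctx = np.concatenate([coarse[picks], kv[-n_win:]])  # coarse picks + raw local window
    s = ctx @ q
    w = np.exp(s - s.max())                  # softmax over ~1k entries, not 100k
    return (w / w.sum()) @ ctx

q = np.random.randn(64)
kv = np.random.randn(100_000, 64)
out = zoom_out_attend(q, kv)                 # attends to 1024 + 128 entries total
```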
For the curious: the indexer
The lightning indexer scores each compressed CSA entry against the query using a small set of indexer query heads (64 heads, head dim 128). Scores are computed in FP32, then quantized to BF16 before the top-k selection. The paper reports that quantizing the score path from FP32 to BF16 preserves 99.7% recall while doubling top-k throughput.
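BF16 is simply the top sixteen bits of an FP32 value, which makes the score-path quantization easy to emulate. A toy recall check on synthetic scores (not the paper's measurement):

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by keeping only the top 16 bits of each FP32 value."""
    return (x.astype(np.float32).view(np.uint32) & 0xFFFF0000).view(np.float32)

scores = np.random.randn(250_000).astype(np.float32)  # FP32 indexer scores
top_fp32 = set(np.argsort(scores)[-1024:])            # top-k on exact scores
top_bf16 = set(np.argsort(to_bf16(scores))[-1024:])   # top-k on truncated scores
print(len(top_fp32 & top_bf16) / 1024)                # recall of BF16 vs. FP32
```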
- CSA compression ratio: m = 4 (4 tokens compressed to 1 entry)
- HCA compression ratio: m' = 128
- Sliding window: n_win = 128 tokens
- Top-k (Pro / Flash): 1024 / 512
- Indexer query heads: 64, head dim 128
- Total query heads (Pro / Flash): 128 / 64
- KV cache reduction at 1M ctx vs. BF16 GQA8: ~50×
§ II · KV cache
From linear growth to two percent of the textbook curve.
Every token a model sees in a conversation, by default, is stored forever. Its key and value projections sit in memory, in sixteen-bit precision, ready to be attended to by every token that comes later. This is what makes long-context expensive. It is also why a chat that once ran in a few hundred megabytes can, ten thousand turns in, demand fifty gigabytes per worker.
DeepSeek-V4 attacks the problem from three sides at once. The attention layers compress what they store: four tokens to one in CSA layers, a hundred and twenty-eight to one in HCA layers. The precision drops too. Main key-value entries live in FP8, only the rotary positional dimensions need BF16, and the indexer's QK path runs in FP4. And the layers themselves are interleaved so most of the depth pays the heavily-compressed cost rather than the moderate one.
The result, at a one-million-token context: roughly two percent of the standard sixteen-bit GQA8 baseline.
The chart's punchline is the y-axis. Three of the four lines crowd into the lowest few percent of the plot area. The textbook "every token costs another sixteen-bit key-value pair" line shoots diagonally to the top right. Everything else stays close to the floor.
For the curious: where the savings come from
Per-layer KV memory cost is roughly 2 · n_heads · d_head · bytes_per_element per stored entry. The standard baseline stores one entry per token at sixteen bits per element. CSA cuts the number of entries by four. HCA cuts it by a hundred and twenty-eight. Mixed-precision storage then knocks another factor of roughly two off the per-entry cost. The factors compound: 4 × 2 for the CSA layers, 128 × 2 for the HCA layers, with the interleave ratio doing the rest.
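Plugging illustrative numbers into that formula makes the compounding visible. The head count and head dim below are a generic GQA8-style guess, not the report's exact configuration:

```python
tokens, n_heads, d_head = 1_000_000, 8, 128   # illustrative GQA8-style shapes

def kv_bytes(entries_per_token, bytes_per_el):
    """2 * n_heads * d_head * bytes per stored entry, times entry count."""
    return 2 * int(tokens * entries_per_token) * n_heads * d_head * bytes_per_el

baseline = kv_bytes(1,       2)   # one BF16 entry per token
csa      = kv_bytes(1 / 4,   1)   # 1/4 entries, ~FP8: 4 x 2 = 8x smaller
hca      = kv_bytes(1 / 128, 1)   # 1/128 entries, ~FP8: 128 x 2 = 256x smaller
print(f"{baseline / 2**30:.2f} GiB -> CSA {csa / 2**30:.2f} GiB, "
      f"HCA {hca / 2**30:.3f} GiB per layer")
```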
- CSA compression: 1/4 entries per token
- HCA compression: 1/128 entries per token
- KV main precision: FP8
- KV RoPE precision: BF16
- Indexer QK precision: FP4
- KV @ 1M ctx vs. BF16 GQA8: ~2%
- KV @ 1M ctx vs. V3.2: 10% (Pro), 7% (Flash)
§ III · Residual connections
From one stream to four braided streams, mixed by a doubly-stochastic matrix.
The residual connection is the most quietly load-bearing idea in modern deep learning. It says: at each layer, instead of replacing the running state, simply add to it. This is the reason a hundred-layer network trains at all. It is also, everywhere, the same single addition.
DeepSeek-V4 widens the running state into four parallel streams and lets each layer mix them through a learned 4×4 matrix. Plain hyper-connections, with just any old 4×4 mixing matrix, train unstably: the signal blows up or collapses across hundreds of layers. The fix is to constrain that matrix to live on a mathematical object called the Birkhoff polytope: the set of doubly-stochastic matrices whose rows and columns each sum to one. Twenty steps of a Sinkhorn-Knopp iteration project the learned matrix onto that polytope, every forward pass.
The intuition is plumbing. A single residual stream is one river flowing through the network. Hyper-connections turn it into four braided rivers. The doubly-stochastic constraint is the assurance that no layer accidentally dumps all the water into one channel and dries the other three.
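A sketch of the mechanics in NumPy. Exponentiating and then alternately normalizing rows and columns is the textbook Sinkhorn-Knopp recipe; the paper's exact parameterization of the learned matrix may differ:

```python
import numpy as np

def sinkhorn(logits, n_iter=20):
    """Push a learned 4x4 matrix toward the Birkhoff polytope by
    alternately normalizing rows and columns (Sinkhorn-Knopp)."""
    M = np.exp(logits)                        # positive entries
    for _ in range(n_iter):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

mix = sinkhorn(np.random.randn(4, 4))         # doubly-stochastic mixer
streams = np.random.randn(4, 1024)            # four braided residual streams
streams = mix @ streams                       # one layer's mixing step
print(np.linalg.norm(mix, 2))                 # spectral norm: ~1, no blow-up
```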
For the curious: the spectral norm bound
Any doubly-stochastic matrix has spectral norm exactly 1 (non-negative rows and columns summing to one cap the norm at 1, and the all-ones direction attains it). Bounding the spectral norm bounds how much the residual signal can grow in one layer. Stacking many such matrices keeps the worst-case signal growth polynomial in depth instead of exponential, which is the practical condition for stable deep training.
- Hyper-connection expansion factor: n_hc = 4 streams
- Mixing matrix shape: 4 × 4, doubly stochastic
- Sinkhorn-Knopp iterations: 20 per forward pass
- Optimizer for mHC biases / gates: AdamW (not Muon)
§ IV · Optimizer
From AdamW to Muon. A step shaped only by direction.
AdamW has been the default optimizer for transformers since GPT-2. Its update rule is, to a first approximation, a per-coordinate normalized version of the gradient: each weight's step is scaled by that weight's own gradient history, so a stretch that spans a whole weight matrix survives into the step. This is fine for moderate-sized models. At trillion-parameter scale, those stretches turn into instability.
Muon takes the gradient matrix M, computes its singular value decomposition M = UΣVᵀ, throws away Σ, and steps in the direction of UVᵀ. The step has the same direction structure as the gradient, but every singular value is now 1. Real SVD is too expensive to do per-step, so DeepSeek-V4 uses a hybrid Newton-Schulz iteration (ten steps in two stages) that approximates it cheaply.
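The exact (expensive) version of the update is short enough to state directly. A toy illustration, since real Muon never materializes the SVD:

```python
import numpy as np

M = np.random.randn(512, 256)                     # a weight matrix's gradient
U, _, Vt = np.linalg.svd(M, full_matrices=False)  # M = U @ diag(S) @ Vt
step_dir = U @ Vt                                 # same direction structure, all singular values now 1
```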
The geometric picture: AdamW takes a stretched ellipse and steps along its longest axis. Muon turns that ellipse into a sphere, then steps. Same direction, different magnitude. The result is faster convergence and better stability at scale. AdamW is still used for the bits where matrix orthogonalization doesn't make sense: embeddings, the prediction head, and RMSNorm weights.
For the curious: the Newton-Schulz coefficients
The hybrid Newton-Schulz iteration uses two stages. The first eight steps use coefficients (3.4445, -4.7750, 2.0315) for fast initial convergence. The final two use the standard stable coefficients (2, -1.5, 0.5) to refine the result. The iteration converges to the polar factor of M, which is exactly UVᵀ.
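A sketch of that schedule with the report's coefficients. The quintic step X ← aX + b(XXᵀ)X + c(XXᵀ)²X follows public Muon implementations, as does the initial Frobenius normalization, which is an assumption here:

```python
import numpy as np

def ns_step(X, a, b, c):
    """One quintic Newton-Schulz step: X <- a*X + b*(X Xt)X + c*(X Xt)^2 X."""
    A = X @ X.T
    return a * X + (b * A + c * A @ A) @ X

def orthogonalize(G, fast=(3.4445, -4.7750, 2.0315), stable=(2.0, -1.5, 0.5)):
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so all singular values <= 1
    for _ in range(8):
        X = ns_step(X, *fast)           # fast stage: rough convergence
    for _ in range(2):
        X = ns_step(X, *stable)         # stable stage: refine toward UVt
    return X

G = np.random.randn(256, 128)
O = orthogonalize(G)
print(np.abs(O.T @ O - np.eye(128)).max())  # near 0: columns ~ orthonormal
```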
- Optimizer (most params): Muon
- Optimizer (embeddings / head / RMSNorm): AdamW
- Newton-Schulz iterations: 10 (8 fast + 2 stable)
- Stage-1 coefficients: (3.4445, −4.7750, 2.0315)
- Stage-2 coefficients: (2, −1.5, 0.5)
§ V · Numerical precision
From sixteen bits to four, with the loss baked into training.
The math people quote about LLM size, "175 billion parameters times two bytes equals 350 gigabytes," is the math of a model stored in 16-bit floats. That has been the working assumption since 2018. It is the reason frontier models do not fit on single GPUs without tricks.
DeepSeek-V4 trains in mixed precision from the start. The Mixture-of-Experts expert weights live in FP4 (four bits) with quantization-aware training so the model adapts to the precision loss instead of being post-hoc compressed and hoping. The attention indexer's QK path also runs in FP4. The main KV entries live in FP8; only the rotary positional embedding dimensions need BF16. Index scores quantize from FP32 down to BF16, which doubles top-k throughput while preserving 99.7% recall.
The trick that makes all of this fit together is a lossless conversion between FP4 and FP8. FP8 has two more exponent bits than FP4, so as long as a per-block scale factor lives within FP8's dynamic range, the quantization is reversible. The whole pipeline runs in the existing FP8 framework.
For the curious: why FP4 needs a scale factor
Four bits give 16 distinct values. To represent a wide range of weights, FP4 stores them in blocks (typically 32 weights per block) with a shared scale factor. The actual weight is fp4_value × scale. As long as the scale's dynamic range fits inside FP8's exponent, the FP4 block can be losslessly dequantized into FP8 for matmul. This is the trick that lets MXFP4 ride the existing FP8 hardware path.
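A sketch of block quantization against an E2M1-style FP4 grid (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}). The 32-weight block size comes from the paper; the nearest-value rounding is a plain guess at the scheme:

```python
import numpy as np

FP4 = np.array([0., .5, 1., 1.5, 2., 3., 4., 6.])  # E2M1 magnitudes
FP4 = np.concatenate([-FP4[:0:-1], FP4])           # signed code points

def quantize_block(w):
    """Quantize one block of 32 weights to FP4 codes plus a shared scale."""
    scale = np.abs(w).max() / 6.0                   # map the largest |w| to 6
    codes = FP4[np.abs(w[:, None] / scale - FP4).argmin(axis=1)]
    return codes, scale

w = np.random.randn(32).astype(np.float32)          # one 32-weight block
codes, scale = quantize_block(w)
w_hat = codes * scale                                # fp4_value x scale
print(np.abs(w - w_hat).max())                       # per-block quantization error
```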
- MoE expert weights: FP4 (MXFP4) with QAT
- KV cache main: FP8
- KV cache RoPE dims: BF16
- Indexer QK path: FP4
- Index score precision: BF16 (down from FP32)
- Index recall preserved: 99.7%
§ VI Methodology, sources, caveats
Method
Five paired comparisons, each a "what most people assume" diagram next to a "what DeepSeek-V4 actually does" diagram, with editorial prose explaining the gap and a receipts box pinning each numerical claim to a section of the technical report.
The diagrams are hand-coded SVG, no charting library. The page is static HTML with vanilla JavaScript only for the dateline. Page weight stays under one megabyte total. No analytics, no tracking.
Sources
Primary: the DeepSeek-V4-Pro release on Hugging Face and the accompanying technical report. All numerical claims (compression ratios, head counts, precision configurations, Newton-Schulz coefficients) are sourced there.
Secondary context for the historical comparisons: the original 2017 "Attention Is All You Need" paper, the AdamW paper, and the Muon optimizer paper.
Caveats
This is a visual explainer. It simplifies for clarity. The paper is the source of truth.
The "what people assume" framing is a generalization for the educated non-specialist. ML researchers know none of these assumptions are universal; the lab is not written for them.
The 27% FLOPs and 10% KV cache numbers compare V4-Pro to V3.2 at one-million-token context. At shorter contexts, the gap narrows.
What this lab is not
Not a complete account of DeepSeek-V4. Innovations like DeepSeekMoE and Multi-Token Prediction are inherited from V3 and not covered. The lab focuses on the five most impactful and most under-covered changes.
Not authored by DeepSeek-V4 itself, despite the subject. This page was written by Claude Code based on the public technical report. Verify the primary sources.
Five comparisons. Five places where the public mental model of how LLMs work has fallen behind what frontier labs are actually shipping. The 2017 Transformer is still in the family tree, but DeepSeek-V4 shares less with it than the popular discourse suggests.
The throughline: efficiency at scale isn't won by one big idea. It's won by a stack of compromises (compress here, sparsify there, constrain a manifold, normalize a singular value, drop two bits of precision) that compound multiplicatively. The headline number is twenty-seven percent FLOPs and ten percent KV cache. The boring truth is that no single change in this paper produced that result. They all did, together.
If frontier-LLM efficiency feels mysterious, this is why. It isn't one trick. It's five.
FAQ
What is DeepSeek-V4?
DeepSeek-V4 is a 1.6-trillion-parameter open-source large language model released by DeepSeek in 2026. It handles a one-million-token context window using roughly 27% of the FLOPs and 10% of the KV cache of its predecessor DeepSeek-V3.2 — a 50× reduction in cache versus the standard BF16 GQA8 baseline.
How does DeepSeek-V4 reduce KV cache by 90%?
Through three compounding architectural choices: hybrid attention (Compressed Sparse Attention compresses 4 tokens to 1 entry, Heavily Compressed Attention compresses 128 to 1), mixed-precision storage (FP8 main, BF16 rotary positional dimensions, FP4 indexer), and interleaved layer types so that most of the depth pays the heavily-compressed cost. The factors compound multiplicatively: roughly 50× versus the BF16 GQA8 baseline, and the headline 90% (10×) versus V3.2.
What is Compressed Sparse Attention (CSA)?
CSA is one of the two attention layer types in DeepSeek-V4. It groups every 4 consecutive tokens into a single compressed entry, then a "lightning indexer" picks the top-1024 (Pro) or top-512 (Flash) most-relevant entries for each query. The query attends only to those selected entries plus a sliding window of recent uncompressed tokens, instead of attending to every previous token.
What is the Muon optimizer?
Muon is the optimizer DeepSeek-V4 uses for most parameters, replacing AdamW. It orthogonalizes the gradient before stepping: the gradient matrix M is replaced by UV^T (the polar factor from M's singular value decomposition), approximated via a 10-iteration hybrid Newton-Schulz, and the step is taken in that "shape-only" direction with all singular values normalized to 1. The result is faster convergence and better stability at trillion-parameter scale.
Is DeepSeek-V4 better than DeepSeek-V3.2?
For long-context tasks, dramatically: at one-million-token context V4-Pro uses 10% of V3.2's KV cache and 27% of its FLOPs, with comparable or better quality on benchmark tasks. At shorter contexts the gap narrows. The improvement isn't from one big idea but from five compounding architectural changes: hybrid attention, manifold-constrained hyper-connections, the Muon optimizer, FP4 quantization-aware training, and per-layer mixed precision.