Vol. XII · No. 05 · May 2026
Jake Cuth.
Experimental · Authored by Claude Code · Subject: DeepSeek-V4 architecture · Quality may vary · Verify the primary sources · Models can hallucinate

Inside DeepSeek-V4.

Five things most people assume modern LLMs do, and what DeepSeek-V4 does instead. A visual explainer of the architectural choices that let a 1.6-trillion-parameter open-source model handle a one-million token context using twenty-seven percent of the FLOPs and ten percent of the KV cache of its predecessor.

The popular mental model of how a Large Language Model works is already five years out of date. The 2017 Transformer is still in the family tree, but the frontier has moved on. Below: five comparisons, each pairing what most people assume LLMs do with what DeepSeek-V4 actually does.


From everything attends to everything to a zoom-out hierarchy.

(Diagram: keys vs. queries; each query sees a recent window of n=128 tokens, CSA entries at 4:1 with top-k selection, and HCA entries at 128:1, dense.)
Assumption: Every query attends to every key. N×N attention. The original 2017 Transformer.
Reality: The query attends to the recent window plus a top-k slice of compressed mid-range entries plus a few heavily compressed far-away entries. Far less work.

Most people, when told that LLMs use "attention," picture a square grid where every word looks at every other word. This was true in 2017. It is no longer how a frontier model handles a million-token context, because the cost of that square grid is, in a literal sense, quadratic.

DeepSeek-V4 interleaves two layer types. Compressed Sparse Attention (CSA) takes every four tokens, crushes them into a single entry, and lets a small "lightning indexer" pick out the top thousand or so entries that matter for each query. Heavily Compressed Attention (HCA) compresses one hundred and twenty-eight tokens into a single entry and attends to all of them densely, because by then there are very few. Both layer types are wrapped around a small sliding window of recent uncompressed tokens, so local detail is never lost.

The cleanest mental image is a zoom-out map. Close to the query, individual streets. A bit further, neighborhoods. The whole rest of the city, drawn as labelled districts. The query never has to look at every street.
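A back-of-envelope sketch of why the hierarchy is cheap, using the reported V4-Pro settings. Note this combines one CSA lookup and one HCA lookup into a single budget for illustration; in the real model the two layer types are interleaved and each pays only its own cost.

```python
# Entries one query touches at a 1M-token context (V4-Pro settings).
ctx = 1_000_000
window = 128              # recent uncompressed tokens
csa_topk = 1024           # top-k compressed CSA entries (4 tokens each)
hca_entries = ctx // 128  # dense pass over 128:1 compressed entries

touched = window + csa_topk + hca_entries  # 8964 entries, under 1% of ctx
print(touched)
```

Roughly nine thousand entries stand in for a million tokens: the streets nearby, a shortlist of neighborhoods, and every district.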

For the curious: the indexer

The lightning indexer scores each compressed CSA entry against the query using a small set of indexer query heads (64 heads, head dim 128). Scores are computed in FP32, then quantized to BF16 before the top-k selection. The paper reports that quantizing the score path from FP32 to BF16 preserves 99.7% recall while doubling top-k throughput.
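A minimal numpy sketch of that selection step. The shapes, the sum-over-heads pooling, and the mantissa-truncation stand-in for BF16 are illustrative assumptions; the report's exact indexer may pool and quantize differently.

```python
import numpy as np

def lightning_indexer_topk(q_idx, entry_keys, k=1024):
    """Hedged sketch of the indexer's top-k selection (hypothetical shapes).

    q_idx:      (n_heads, d_head)   indexer query heads for one query token
    entry_keys: (n_entries, d_head) one compressed key per 4-token CSA entry
    """
    # Score every compressed entry against every indexer head in FP32 ...
    scores = entry_keys.astype(np.float32) @ q_idx.astype(np.float32).T
    pooled = scores.sum(axis=1)          # combine heads into one score per entry
    # ... then quantize the score path to BF16 before selection. Truncating
    # the low 16 bits of a float32 is a stand-in for real BF16 storage.
    pooled = (pooled.astype(np.float32).view(np.uint32) & 0xFFFF0000).view(np.float32)
    k = min(k, pooled.size)
    return np.argpartition(-pooled, k - 1)[:k]   # indices of the top-k entries
```

At 1M context a CSA layer holds 250,000 compressed entries, so the indexer's scoring pass is the dominant cost; the cheap BF16 score path is what makes it affordable.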

Receipts · § I
CSA compression ratio: m = 4 (4 tokens compressed to 1 entry)
HCA compression ratio: m' = 128
Sliding window: n_win = 128 tokens
Top-k (Pro / Flash): 1024 / 512
Indexer query heads: 64, head dim 128
Total query heads (Pro / Flash): 128 / 64
KV cache reduction at 1M ctx vs. BF16 GQA8: ~50×

From linear growth to two percent of the textbook curve.

27% FLOPs vs. V3.2
10% KV cache vs. V3.2 · Pro
7% KV cache vs. V3.2 · Flash
~2% KV vs. BF16 GQA8 baseline

Every token a model sees in a conversation, by default, is stored forever. Its key and value projections sit in memory, in sixteen-bit precision, ready to be attended to by every token that comes later. This is what makes long-context expensive. It is also why a chat that once ran in a few hundred megabytes can, ten thousand turns in, demand fifty gigabytes per worker.

DeepSeek-V4 attacks the problem from three sides at once. The attention layers compress what they store: four tokens to one in CSA layers, a hundred and twenty-eight to one in HCA layers. The precision drops too. Main key-value entries live in FP8, only the rotary positional dimensions need BF16, and the indexer's QK path runs in FP4. And the layers themselves are interleaved so most of the depth pays the heavily-compressed cost rather than the moderate one.

The result, at a one-million-token context: roughly two percent of the standard sixteen-bit GQA8 baseline.

FIG. 05.2 · KV cache (GB) by sequence length
Memory grows linearly only if you let it.
(Chart: KV cache in GB against sequence length, 0 to 1M tokens. At 1M: BF16 GQA8 ≈ 250 GB · DeepSeek-V3.2 ≈ 50 GB · DeepSeek-V4-Pro ≈ 5 GB · DeepSeek-V4-Flash lower still.)

The chart's punchline is the y-axis. Three of the four lines crowd into the lowest few percent of the plot area. The textbook "every token costs another sixteen-bit key-value pair" line shoots diagonally to the top right. Everything else stays close to the floor.

For the curious: where the savings come from

Per-layer KV memory cost is roughly 2 · n_heads · d_head · bytes_per_element per stored entry. The standard baseline stores one entry per token at sixteen bits per element. CSA cuts the number of entries by four. HCA cuts it by a hundred and twenty-eight. Mixed-precision storage then knocks another factor of roughly two off the per-entry cost. The factors compound: 4 × 2 for the CSA layers, 128 × 2 for the HCA layers, with the interleave ratio doing the rest.
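Plugging that formula into concrete numbers. The layer count (61) and the GQA8 head geometry below are assumptions chosen to roughly reproduce the ~250 GB baseline in FIG. 05.2, not figures taken from the report.

```python
# Per-entry KV cost: 2 (K and V) x n_heads x d_head x bytes_per_element.
# Layer count and head geometry are illustrative assumptions, not reported.
tokens, layers, kv_heads, d_head = 1_000_000, 61, 8, 128

def kv_gb(entries_per_token, bytes_per_elem):
    per_entry = 2 * kv_heads * d_head * bytes_per_elem
    return tokens * entries_per_token * layers * per_entry / 1e9

baseline = kv_gb(1, 2)        # BF16, one entry per token  -> ~250 GB
csa_fp8  = kv_gb(1 / 4, 1)    # 4:1 compression, FP8       -> ~31 GB
hca_fp8  = kv_gb(1 / 128, 1)  # 128:1 compression, FP8     -> ~1 GB
print(baseline, csa_fp8, hca_fp8)
```

The interleave ratio between CSA and HCA layers then sets where the blended total lands between the ~31 GB and ~1 GB lines.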

Receipts · § II
CSA compression: 1/4 entries per token
HCA compression: 1/128 entries per token
KV main precision: FP8
KV RoPE precision: BF16
Indexer QK precision: FP4
KV @ 1M ctx vs. BF16 GQA8: ~2%
KV @ 1M ctx vs. V3.2: 10% (Pro), 7% (Flash)

From one stream to four braided streams, mixed by a doubly-stochastic matrix.

(Diagram: left, layers 1-5 on a single stream, x_out = x + f(x) applied at every layer; right, four streams h₁…h₄ across layers L1-L5, each layer mixing them via B, with every row and every column of B summing to 1.)
Assumption: One residual stream. Every layer adds f(x) to the running x. The 2017 design.
Reality: Four parallel streams. Each layer mixes them via a 4×4 matrix whose rows and columns each sum to 1, projected onto the Birkhoff polytope.

The residual connection is the most quietly load-bearing idea in modern deep learning. It says: at each layer, instead of replacing the running state, simply add to it. This is the reason a hundred-layer network trains at all. It is also, everywhere, the same single addition.

DeepSeek-V4 widens the running state into four parallel streams and lets each layer mix them through a learned 4×4 matrix. Plain hyper-connections, with an unconstrained 4×4 mixing matrix, train unstably: the signal blows up or collapses across hundreds of layers. The fix is to constrain that matrix to live on a mathematical object called the Birkhoff polytope, the set of doubly-stochastic matrices whose rows and columns each sum to one. Twenty steps of a Sinkhorn-Knopp iteration project the learned matrix onto that polytope, every forward pass.

The intuition is plumbing. A single residual stream is one river flowing through the network. Hyper-connections turn it into four braided rivers. The doubly-stochastic constraint is the assurance that no layer accidentally dumps all the water into one channel and dries the other three.

For the curious: the spectral norm bound

Any doubly-stochastic matrix has spectral norm exactly 1: it fixes the all-ones vector, and, as a convex combination of permutation matrices (each of norm 1), it cannot stretch any vector further. Capping the spectral norm caps how much the mixing step can grow the residual signal in one layer. Stacking many such matrices keeps the worst-case signal growth polynomial in depth instead of exponential, which is the practical condition for stable deep training.
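The projection itself is a few lines. A minimal sketch, assuming a simple clamp-to-positive parameterization of the raw matrix; the paper's exact parameterization of the pre-projection weights may differ.

```python
import numpy as np

def sinkhorn_project(A, n_iters=20, eps=1e-9):
    """Push a 4x4 matrix toward the Birkhoff polytope by alternately
    normalizing rows and columns (Sinkhorn-Knopp). Sketch only; the
    clamp to positive entries is an assumed parameterization."""
    M = np.maximum(A, eps)   # Sinkhorn needs strictly positive entries
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M
```

For a well-behaved 4×4 matrix, twenty alternations land both row and column sums within numerical noise of 1, and the projected matrix's spectral norm sits at 1 as the note above requires.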

Receipts · § III
Hyper-connection expansion factor: n_hc = 4 streams
Mixing matrix shape: 4 × 4, doubly stochastic
Sinkhorn-Knopp iterations: 20 per forward pass
Optimizer for mHC biases / gates: AdamW (not Muon)

From AdamW to Muon. A step shaped only by direction.

(Diagram: AdamW, step ∝ the gradient itself; Muon, M → Newton-Schulz → UVᵀ, step ∝ gradient direction only.)
Assumption: AdamW takes a step proportional to the gradient. If the gradient is stretched, the step is stretched.
Reality: Muon orthogonalizes the gradient first, then steps. The step direction is preserved; the step magnitude is uniform.

AdamW has been the default optimizer for transformers since GPT-2. Its update rule is, to a first approximation, a normalized version of the gradient. If the gradient is stretched in one direction, the step is stretched in that direction. This is fine for moderate-sized models. At trillion-parameter scale, those stretches turn into instability.

Muon takes the gradient matrix M, computes its singular value decomposition M = UΣVᵀ, throws away Σ, and steps in the direction of UVᵀ. The step has the same direction structure as the gradient, but every singular value is now 1. Real SVD is too expensive to do per-step, so DeepSeek-V4 uses a hybrid Newton-Schulz iteration (ten steps in two stages) that approximates it cheaply.

The geometric picture: AdamW takes a stretched ellipse and steps along its longest axis. Muon turns that ellipse into a sphere, then steps. Same direction, different magnitude. The result is faster convergence and better stability at scale. AdamW is still used for the bits where matrix orthogonalization doesn't make sense: embeddings, the prediction head, and RMSNorm weights.

For the curious: the Newton-Schulz coefficients

The hybrid Newton-Schulz iteration uses two stages. The first eight steps use coefficients (3.4445, -4.7750, 2.0315) for fast initial convergence. The final two use the standard stable coefficients (2, -1.5, 0.5) to refine the result. The iteration converges to the polar factor of M, which is exactly UVᵀ.
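The full iteration is short enough to sketch, using the two coefficient stages quoted above. The Frobenius normalization and the square-matrix assumption are mine; real Muon implementations also handle tall matrices by transposing first.

```python
import numpy as np

def orthogonalize(M, coeffs=((3.4445, -4.7750, 2.0315),) * 8 + ((2.0, -1.5, 0.5),) * 2):
    """Approximate the polar factor UV^T of M via the quintic Newton-Schulz
    iteration X <- aX + b(XX^T)X + c(XX^T)^2 X. Sketch: assumes rows <= cols;
    a tall gradient would be transposed first."""
    X = M / (np.linalg.norm(M) + 1e-7)   # scale so all singular values are <= 1
    for a, b, c in coeffs:               # 8 fast steps, then 2 stable steps
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X
```

Because each step only reshapes the singular values, the singular vectors (the "direction structure") pass through untouched; after ten steps every singular value sits near 1.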

Receipts · § IV
Optimizer (most params): Muon
Optimizer (embeddings / head / RMSNorm): AdamW
Newton-Schulz iterations: 10 (8 fast + 2 stable)
Stage-1 coefficients: (3.4445, −4.7750, 2.0315)
Stage-2 coefficients: (2, −1.5, 0.5)

From sixteen bits to four, with the loss baked into training.

(Diagram: BF16 = 16 bits, FP8 = 8 bits, FP4 = 4 bits; each split into sign, exponent, mantissa.)
Assumption: Weights live in 16-bit floats during training. Maybe 8-bit at deployment via post-hoc quantization.
Memory for 1.6T parameters: BF16 3.2 TB · FP8 1.6 TB · FP4 0.8 TB.
V4 component map: MoE expert weights FP4 · KV cache main FP8 · KV cache RoPE dims BF16 · Indexer QK path FP4.
Reality: Mixed precision is built into training. The expert weights live in FP4 from the start, with quantization-aware updates.

The math people quote about LLM size, "175 billion parameters times two bytes equals 350 gigabytes," is the math of a model stored in 16-bit floats. That has been the working assumption since 2018. It is the reason frontier models do not fit on single GPUs without tricks.

DeepSeek-V4 trains in mixed precision from the start. The Mixture-of-Experts expert weights live in FP4 (four bits) with quantization-aware training so the model adapts to the precision loss instead of being post-hoc compressed and hoping. The attention indexer's QK path also runs in FP4. The main KV entries live in FP8; only the rotary positional embedding dimensions need BF16. Index scores quantize from FP32 down to BF16, which doubles top-k throughput while preserving 99.7% recall.

The trick that makes all of this fit together is a lossless conversion between FP4 and FP8. FP8 has two more exponent bits than FP4, so as long as a per-block scale factor lives within FP8's dynamic range, the quantization is reversible. The whole pipeline runs in the existing FP8 framework.

For the curious: why FP4 needs a scale factor

Four bits give 16 distinct values. To represent a wide range of weights, FP4 stores them in blocks (typically 32 weights per block) with a shared scale factor. The actual weight is fp4_value × scale. As long as the scale's dynamic range fits inside FP8's exponent, the FP4 block can be losslessly dequantized into FP8 for matmul. This is the trick that lets MXFP4 ride the existing FP8 hardware path.
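A sketch of that block quantization in numpy, assuming the standard FP4 E2M1 value grid and 32-weight blocks with a shared power-of-two scale, per the MXFP4 convention; the paper's exact rounding rule may differ.

```python
import numpy as np

# FP4 E2M1 magnitude grid: the 8 non-negative values a 4-bit float encodes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(w):
    """Quantize one block of weights to FP4 codes plus a shared scale.
    Sketch only: round-to-nearest onto the grid, power-of-two scale."""
    amax = np.abs(w).max()
    if amax == 0:
        return np.zeros_like(w), 1.0
    # Pick a power-of-two scale so the largest magnitude fits under 6.0.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
    mags = np.abs(w) / scale
    codes = FP4_GRID[np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)]
    return np.sign(w) * codes, scale   # dequantized weight = codes * scale
```

Every code in the result is one of 16 FP4 values, and because each of those values is exactly representable in FP8 (one mantissa bit inside three, two exponent bits inside four), lifting the block into FP8 for the matmul loses nothing.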

Receipts · § V
MoE expert weights: FP4 (MXFP4) with QAT
KV cache main: FP8
KV cache RoPE dims: BF16
Indexer QK path: FP4
Index score precision: BF16 (down from FP32)
Index recall preserved: 99.7%

Method

Five paired comparisons, each a "what most people assume" diagram next to a "what DeepSeek-V4 actually does" diagram, with editorial prose explaining the gap and a receipts box pinning each numerical claim to a section of the technical report.

The diagrams are hand-coded SVG, no charting library. The page is static HTML with vanilla JavaScript only for the dateline. Page weight stays under one megabyte total. No analytics, no tracking.

Sources

Primary: the DeepSeek-V4-Pro release on Hugging Face and the accompanying technical report. All numerical claims (compression ratios, head counts, precision configurations, Newton-Schulz coefficients) are sourced there.

Secondary context for the historical comparisons: the original 2017 "Attention Is All You Need" paper, the AdamW paper, and the Muon optimizer paper.

Caveats

This is a visual explainer. It simplifies for clarity. The paper is the source of truth.

The "what people assume" framing is a generalization for the educated non-specialist. ML researchers know none of these assumptions are universal; the lab is not written for them.

The 27% FLOPs and 10% KV cache numbers compare V4-Pro to V3.2 at one-million-token context. At shorter contexts, the gap narrows.

What this lab is not

Not a complete account of DeepSeek-V4. Innovations like DeepSeekMoE and Multi-Token Prediction are inherited from V3 and not covered. The lab focuses on the five most impactful and most under-covered changes.

Not authored by DeepSeek-V4 itself, despite the subject. This page was written by Claude Code based on the public technical report. Verify the primary sources.

Five comparisons. Five places where the public mental model of how LLMs work has fallen behind what frontier labs are actually shipping. The 2017 Transformer is still in the family tree, but DeepSeek-V4 shares less with it than the popular discourse suggests.

The throughline: efficiency at scale isn't won by one big idea. It's won by a stack of compromises (compress here, sparsify there, constrain a manifold, normalize a singular value, drop two bits of precision) that compound multiplicatively. The headline number is twenty-seven percent FLOPs and ten percent KV cache. The boring truth is that no single change in this paper produced that result. They all did, together.

If frontier-LLM efficiency feels mysterious, this is why. It isn't one trick. It's five.

FAQ

What is DeepSeek-V4?

DeepSeek-V4 is a 1.6-trillion-parameter open-source large language model released by DeepSeek in 2026. It handles a one-million-token context window using roughly 27% of the FLOPs and 10% of the KV cache of its predecessor DeepSeek-V3.2 — a 50× reduction in cache versus the standard BF16 GQA8 baseline.

How does DeepSeek-V4 reduce KV cache by 90%?

Through three compounding architectural choices: hybrid attention (Compressed Sparse Attention compresses 4 tokens to 1 entry, Heavily Compressed Attention compresses 128 to 1), mixed-precision storage (FP8 main, BF16 rotary positional dimensions, FP4 indexer), and interleaved layer types so the dense-attention overhead applies to only a few layers. The factors compound multiplicatively to yield the headline 50× cache reduction.

What is Compressed Sparse Attention (CSA)?

CSA is one of the two attention layer types in DeepSeek-V4. It groups every 4 consecutive tokens into a single compressed entry, then a "lightning indexer" picks the top-1024 (Pro) or top-512 (Flash) most-relevant entries for each query. The query attends only to those selected entries plus a sliding window of recent uncompressed tokens, instead of attending to every previous token.

What is the Muon optimizer?

Muon is the optimizer DeepSeek-V4 uses for most parameters, replacing AdamW. It orthogonalizes the gradient before stepping: the gradient matrix M is replaced by an approximation of its polar factor UV^T (the SVD M = UΣVᵀ with Σ discarded), computed by a 10-iteration hybrid Newton-Schulz, and the step is taken in that "shape-only" direction with every singular value normalized to 1. The result is faster convergence and better stability at trillion-parameter scale.

Is DeepSeek-V4 better than DeepSeek-V3.2?

For long-context tasks, dramatically: at one-million-token context V4-Pro uses 10% of V3.2's KV cache and 27% of its FLOPs, with comparable or better quality on benchmark tasks. At shorter contexts the gap narrows. The improvement isn't from one big idea but from five compounding architectural changes: hybrid attention, manifold-constrained hyper-connections, the Muon optimizer, FP4 quantization-aware training, and per-layer mixed precision.