TEST. 07 The Transformer paper, anatomized
Attention is
all you need.
The eight-page paper that ended a decade of recurrent and convolutional sequence models, and started the one we are still living through. Below: every piece of the original Transformer, drawn from the source. Two of them are interactive.
Six authors, eight pages, one architecture. The Transformer replaced recurrence and convolution with stacked self-attention, won state of the art in translation at a fraction of the training cost, and within five years became the substrate for every frontier large language model. This page is the paper, in five moves, with two of them you can poke.
§ I TEST. 07.1 · Why attention
Two bottlenecks. One paper that removed both.
In 2017, the dominant sequence-to-sequence models were recurrent. LSTMs and GRUs read tokens one at a time, threading a hidden state forward through the sequence. The architecture was elegant and powerful and slow. Every step depended on the step before. That dependency made batching across the time axis impossible and made long sequences expensive in two compounding ways: more sequential operations, and a longer path between distant tokens that had to learn to talk to each other.
Convolutional alternatives (ByteNet, ConvS2S) sidestepped the
sequential problem by computing positions in parallel, but the
receptive field still grew slowly with depth. Distant tokens
either took many layers to interact, or required dilations and
tricks. Path length between far positions remained O(log n)
at best.
The 2017 paper proposed a third option. Drop recurrence. Drop
convolution. Connect every position to every other position in
one layer through attention, and let the model decide which
connections matter. Path length collapses to one. Sequential
operations collapse to one. Computational complexity moves
from O(n · d²) for a recurrent layer to
O(n² · d) for self-attention, which is
a win whenever n < d. For typical sentences
encoded with d = 512, that is essentially always.
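A back-of-the-envelope check of that arithmetic, constants ignored; n = 50 here is just a stand-in for a typical sentence length:

```python
# Rough per-layer op counts from the paper's complexity table, constant factors ignored.
n, d = 50, 512                      # typical sentence length, base-model width

self_attention = n * n * d          # O(n^2 * d)  ->  1,280,000
recurrent      = n * d * d          # O(n * d^2)  -> 13,107,200

print(self_attention < recurrent)   # True whenever n < d
```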
- Self-attention complexity / layer
- O(n² · d)
- Recurrent complexity / layer
- O(n · d²)
- Convolutional complexity / layer
- O(k · n · d²)
- Self-attention sequential ops
- O(1)
- Recurrent sequential ops
- O(n)
- Self-attention max path length
- O(1)
- Recurrent max path length
- O(n)
§ II TEST. 07.2 · Scaled dot-product attention
One equation. Hover any row to feel it.
The attention operation takes three inputs. A matrix of queries
Q, a matrix of keys K, and a matrix of
values V. For every row in Q (one query
per token), we compute a dot product against every row in
K (one key per token). The result is a matrix of
compatibility scores: how much should each query attend to each
key.
The scores are divided by √d_k for a practical reason. Without that scaling, large d_k pushes the softmax into a regime where its gradients vanish. The paper hypothesizes that for components of q and k with mean zero and variance one, their dot product has variance d_k, large enough to saturate the softmax once d_k grows. Dividing by √d_k restores stable gradients.
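A quick numerical check of that variance claim, a NumPy sketch rather than anything from the paper's code: draw unit-variance queries and keys and watch the raw dot product spread out as d_k grows.

```python
import numpy as np

# Sketch: verify that q . k has variance ~ d_k when components are ~ N(0, 1).
rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)            # one dot product per row
    print(d_k, round(dots.var(), 1))      # variance tracks d_k; dividing by sqrt(d_k) tames it
```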
A softmax over each row turns the scaled scores into attention
weights that sum to one. Those weights then blend the rows of
V to produce the output: each token's representation
is now a content-weighted mix of every other token's value.
The whole thing is one matrix multiplication, one scale, one
softmax, one multiplication.
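Here is that chain as a minimal NumPy sketch, not the paper's reference code; the shapes follow the base model's d_k = d_v = 64.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # one compatibility score per (query, key) pair
    weights = softmax(scores, axis=-1)          # each row sums to one
    return weights @ V                          # content-weighted blend of the value rows

# Toy example: 5 tokens, d_k = d_v = 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)     # shape (5, 64)
```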
The two attention variants the paper considered
Additive attention uses a feed-forward network with
a single hidden layer to compute compatibility. Mathematically
similar but slower and less space-efficient in practice, since
it cannot be expressed as a single highly-optimized matmul.
Multiplicative (dot-product) attention is what the
paper uses. It performs about the same as additive attention at small d_k,
and it is much faster and more space-efficient on the kind of hardware
everyone actually trains on, because it reduces to a highly optimized matmul.
The 1 / √d_k scaling is the only twist the paper had to add to keep it
competitive at larger d_k.
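For contrast, a sketch of both compatibility functions for a single query; the weight names (W_q, W_k, v) are illustrative, not the paper's notation.

```python
import numpy as np

def additive_scores(q, K, W_q, W_k, v):
    """Additive compatibility: a one-hidden-layer net, v^T tanh(W_q q + W_k k_j) per key."""
    return np.tanh(q @ W_q + K @ W_k) @ v       # (n,) -- one score per key

def dot_product_scores(q, K):
    """Scaled dot-product compatibility: q . k_j / sqrt(d_k) per key."""
    return K @ q / np.sqrt(q.shape[-1])         # (n,) -- a single matmul, no hidden layer

# Shapes: q is (d,), K is (n, d), W_q and W_k are (d, hidden), v is (hidden,)
```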
- Compatibility
- scaled dot product
- Scaling factor
- 1 / √dk
- Normalization
- row-wise softmax
- dk in base model
- 64
- dv in base model
- 64
- Base model dmodel
- 512
§ III TEST. 07.3 · Multi-head attention
One layer becomes eight parallel views.
A single attention layer has a problem. It averages information from every position the query attends to. Averaging is fine when every relevant relationship lives in the same representation subspace. It is destructive when different relationships need to be tracked at the same time: subject-verb agreement and preposition binding and anaphora and tense, all braided into a single weighted sum.
The fix is to run the attention operation h times in
parallel, each in its own learned linear projection of Q,
K, and V. Each head sees a different
subspace. Each head learns a different attention pattern. The
outputs concatenate, then a final linear projection collapses them
back to dmodel. Because each head operates
on dk = dv = dmodel / h,
the total compute stays comparable to single-head with full width.
In the base model, h = 8. Eight parallel attention
patterns, eight parallel learned projections, one concatenated
output. The paper's appendix shows individual heads learning to
perform syntactically and semantically distinct jobs without ever
being told to. Some attend to the previous token. Some link
verbs to their objects. Some specialize in long-range coreference.
The math, fully expanded
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where each head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
The three projection matrices per head are
W_i^Q ∈ ℝ^(d_model × d_k),
W_i^K ∈ ℝ^(d_model × d_k),
W_i^V ∈ ℝ^(d_model × d_v),
plus one shared output projection W^O ∈ ℝ^(h·d_v × d_model).
With h = 8, d_k = d_v = 64, d_model = 512,
a single multi-head attention layer has comparable compute to a
single full-dimension head. Eight views, one budget.
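The same math as a NumPy sketch, with the § II attention condensed inline; the random matrices are stand-ins for the learned projections.

```python
import numpy as np

def attention(Q, K, V):
    """The § II scaled dot-product sketch, condensed."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, one projection triple per head."""
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o       # (n, h*d_v) @ (h*d_v, d_model)

# Base-model shapes: h = 8 heads, d_model = 512, d_k = d_v = 64; weights are random stand-ins.
rng = np.random.default_rng(0)
h, d_model, d_k = 8, 512, 64
W_q, W_k, W_v = ([rng.standard_normal((d_model, d_k)) for _ in range(h)] for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model))
X = rng.standard_normal((5, d_model))                 # 5 tokens
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # shape (5, 512)
```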
- Heads in base model
- h = 8
- Per-head dimensions
- dk = dv = 64
- Concatenated output dim
- h · dv = 512
- Total compute vs single-head
- roughly equal
§ IV TEST. 07.4 · Encoder, decoder, repeat six times
Two stacks. Six layers each. One residual connection at every step.
The full Transformer is two stacks of six identical layers each.
The encoder reads the input. The decoder writes the output, one
token at a time, conditioned on what it has written and on the
encoder's output. Every sub-layer is wrapped in a residual
connection followed by layer normalization, so the running
representation is the sub-layer's output added to its input. The
whole stack outputs vectors of dimension dmodel = 512
from start to finish.
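The wrapping around every sub-layer, as a sketch; the learned gain and bias of LayerNorm are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's vector to zero mean, unit variance (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Every sub-layer in both stacks is wrapped the same way: LayerNorm(x + fn(x))."""
    return layer_norm(x + fn(x))
```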
Three flavors of attention show up. The encoder uses self-attention over its input. The decoder uses masked self-attention, which prevents each position from looking at future positions during training, preserving the auto-regressive property. The decoder also uses encoder-decoder attention, where queries come from the decoder and keys and values come from the encoder's output. That third flavor is how the decoder grounds its generation in the input sentence.
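The decoder's mask is nothing exotic: additive negative infinity above the diagonal, applied to the scores before the softmax. A sketch of the idea, not the reference implementation:

```python
import numpy as np

def causal_mask(n):
    """True wherever position i would otherwise see a future position j > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

# Applied to the raw scores before the softmax, so future tokens get zero attention weight:
# scores = np.where(causal_mask(n), -np.inf, Q @ K.T / np.sqrt(d_k))
```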
Position information is added at the bottom of both stacks via sinusoidal positional encodings. Sine and cosine waves of different frequencies, summed onto the embeddings before any attention happens. The paper found that learned positional embeddings worked equally well, but they preferred the sinusoids because they generalize to sequences longer than anything seen during training.
Sinusoidal positional encoding
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The wavelengths form a geometric progression from 2π to
10000 · 2π. The paper hypothesized that this
family lets the model attend by relative position easily, since
PE(pos + k) is a linear function of
PE(pos) for any fixed k.
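The whole table of encodings is a few lines; a NumPy sketch with the paper's constants:

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)     # wavelengths from 2*pi up to 10000*2*pi
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                           # summed onto the token embeddings
```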
- Layers per stack
- N = 6
- Model dimension
- dmodel = 512
- FFN inner dimension
- dff = 2048
- Heads
- h = 8
- Per-head dim
- dk = dv = 64
- Sub-layers per encoder layer
- 2 (attention + FFN)
- Sub-layers per decoder layer
- 3 (masked self-attn + cross-attn + FFN)
- Around every sub-layer
- residual connection + LayerNorm
§ V TEST. 07.5 · Why it won
Slide n. Watch the answer flip.
The complexity argument was the most under-appreciated part of
the paper at publication. Self-attention costs more per layer in
raw FLOPs at long sequence lengths, but it costs much less in
sequential operations because every position can be
computed at the same time. The maximum path length between any
two positions also drops to a constant. The slider below makes
that tradeoff legible. Drag n across the typical
range of inputs and watch which architecture is cheapest at each
length, holding d = 512 constant.
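What the slider computes, roughly: a sketch with constant factors ignored, d pinned at 512, and an assumed convolution kernel size of 3.

```python
# Per-layer op counts from the paper's complexity table; kernel = 3 is an assumption.
d, kernel = 512, 3

for n in (10, 50, 100, 512, 1000, 5000):
    costs = {
        "self-attention": n * n * d,            # O(n^2 * d)
        "recurrent":      n * d * d,            # O(n * d^2)
        "convolutional":  kernel * n * d * d,   # O(k * n * d^2)
    }
    print(f"n={n:5d}  cheapest per layer: {min(costs, key=costs.get)}")
# Self-attention wins below n = d = 512; recurrence takes over past it.
```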
The final ingredient was the result. The Transformer beat every previously reported model, including ensembles, on WMT 2014 English-to-German by more than two BLEU. It set a new single-model state of the art on English-to-French, training in three and a half days on eight P100 GPUs, a small fraction of the cost of the previous best. The base model trained for twelve hours and already beat every competitor.
- BLEU on WMT 2014 EN-DE (big)
- 28.4 (previous SOTA was 26.36)
- BLEU on WMT 2014 EN-FR (big)
- 41.8 (single-model SOTA)
- Base model BLEU EN-DE
- 27.3 (already beat every competitor)
- Training cost (base, FLOPs)
- 3.3 × 10¹⁸
- Training cost (big, FLOPs)
- 2.3 × 10¹⁹
- Hardware
- 8 NVIDIA P100 GPUs
- Big model training time
- 3.5 days
- Generalization (WSJ parsing F1)
- 91.3 (WSJ-only) · 92.7 (semi-supervised)
§ VI Methodology, sources, what this is and is not
Method
Five sections, each pinned to a specific section of the source paper. The two interactive widgets in § II and § V run entirely client-side in vanilla JavaScript. Heatmap weights are hand-tuned to a plausible self-attention pattern; the multi-head strip in § III is illustrative, not the output of an actual trained model.
The diagrams are hand-coded SVG and CSS grids, no charting library. The page is static HTML with vanilla JavaScript only for the dateline, the heatmap interaction, and the complexity slider. No analytics beyond the privacy-respecting Cloudflare Web Analytics that ships at the platform level.
Sources
Primary: Vaswani et al., "Attention Is All You Need," NeurIPS 2017. arXiv:1706.03762. Every numerical claim on this page traces to a section of that paper: complexity table to § 4, BLEU scores to Table 2, hyperparameters to § 5 and Table 3, sinusoidal positional encoding to § 3.5.
The reference TensorFlow implementation, tensor2tensor, is what the original authors used. The PyTorch reference is the Annotated Transformer from Harvard NLP. Both are excellent reading after this page.
Caveats
The heatmap in § II is a plausible self-attention pattern, not the output of a trained model. It is constructed to make syntactic structure visible to a first-time reader, not to reproduce specific behavior of a specific checkpoint.
The eight head patterns in § III are stylized. Real attention heads in trained Transformers are messier and the precise jobs each head learns vary across runs and seeds. The paper's appendix shows real examples; the strip on this page compresses the "different heads, different jobs" intuition into eight cartoons.
The complexity slider treats d as fixed at 512 and
ignores constant factors. Modern models use much larger d
and much longer n; the crossover point is more
nuanced today than the paper's clean n < d result.
What this lab is not
Not a complete reading of the paper. The training regime, regularization, label smoothing, learning-rate warmup, and constituency parsing experiments are summarized in receipts boxes but not visualized. The original is eight pages and is worth reading directly.
Not a working Transformer. Nothing on this page actually translates anything. For a runnable in-browser Transformer, see Hugging Face's transformers.js or the Annotated Transformer. This lab is the diagram, not the engine.
Eight pages. Six authors. One architecture. The 2017 paper is now nine years old and every frontier large language model on Earth is a direct descendant of its decoder stack. The receipts that mattered were not the BLEU scores. They were the complexity table and the parallelism argument. Recurrence had a hidden tax. Attention removed it.
What followed is harder to summarize. Encoder-only Transformers became BERT. Decoder-only Transformers became GPT. The encoder stack faded, the decoder stack ate the world, and the field is still arguing about whether to keep stretching n or replace n² with something genuinely subquadratic. Two of those arguments are running labs on this site already.
All of that started here. Attention is all you need.
FAQ
What is the Transformer?
The Transformer is the neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It dispenses with recurrence and convolution entirely, relying only on stacked self-attention and feed-forward layers. It is the architecture every modern large language model is built on, including GPT, Claude, Gemini, and DeepSeek.
What is scaled dot-product attention?
Scaled dot-product attention computes Attention(Q, K, V) = softmax(Q K^T / √d_k) V. The query Q is compared against every key K via a dot product, the result is divided by the square root of the key dimension d_k to keep gradients stable, a softmax converts the scores to weights that sum to one, and the weights are used to take a weighted sum of the value vectors V.
Why multi-head attention?
A single attention layer averages information from all positions, which blurs distinct relationships. Multi-head attention runs h parallel attention layers (h = 8 in the base model), each in its own d_k = d_v = 64 dimensional subspace, and concatenates the results. Different heads learn to attend to different things: subject-verb agreement, anaphora, syntactic role, position. The total computational cost is similar to single-head with full dimensionality.
Why is self-attention faster than RNNs at typical sequence lengths?
Self-attention costs O(n² · d) per layer but only O(1) sequential operations because every position is computed in parallel. Recurrent layers cost O(n · d²) per layer and require O(n) sequential operations. When the sequence length n is smaller than the representation dimension d (as it is for typical sentences encoded with d = 512), self-attention is both fewer FLOPs and dramatically more parallelizable. The maximum path length between any two positions also drops from O(n) to O(1).
What were the original Transformer's results?
On WMT 2014 English-to-German translation, the big Transformer achieved 28.4 BLEU, beating the previous best (including ensembles) by more than two BLEU. On WMT 2014 English-to-French it reached 41.8 BLEU. The base model trained for twelve hours on eight NVIDIA P100 GPUs at a total cost of 3.3 × 10¹⁸ FLOPs, a fraction of competitors. The model also generalized to English constituency parsing without architecture-specific tuning.