TEST. 07 The Transformer paper, anatomized
Attention is
all you need.
The eight-page paper that ended a decade of recurrent and convolutional sequence models, and started the one we are still living through. Below: every piece of the original Transformer, drawn from the source. Two of them are interactive.
Six authors, eight pages, one architecture. The Transformer replaced recurrence and convolution with stacked self-attention, won state of the art in translation at a fraction of the training cost, and within five years became the substrate for every frontier large language model. This page is the paper, in five moves, with two of them you can poke.
§ I TEST. 07.1 · Why attention
Two bottlenecks. One paper that removed both.
In 2017, the dominant sequence-to-sequence models were recurrent. LSTMs and GRUs read tokens one at a time, threading a hidden state forward through the sequence. The architecture was elegant and powerful and slow. Every step depended on the step before. That dependency made batching across the time axis impossible and made long sequences expensive in two compounding ways: more sequential operations, and a longer path between distant tokens that had to learn to talk to each other.
Convolutional alternatives (ByteNet, ConvS2S) sidestepped the
sequential problem by computing positions in parallel, but the
receptive field still grew slowly with depth. Distant tokens
either took many layers to interact, or required dilations and
tricks. Path length between far positions remained O(log n)
at best.
The 2017 paper proposed a third option. Drop recurrence. Drop
convolution. Connect every position to every other position in
one layer through attention, and let the model decide which
connections matter. Path length collapses to one. Sequential
operations collapse to one. Computational complexity moves
from O(n · d²) for a recurrent layer to
O(n² · d) for self-attention, which is
a win whenever n < d. For typical sentences
encoded with d = 512, that is essentially always.
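A back-of-the-envelope check of that arithmetic, constants ignored; n = 50 here is just a stand-in for a typical sentence length:

```python
# Rough per-layer op counts from the paper's complexity table, constant factors ignored.
n, d = 50, 512                      # typical sentence length, base-model width

self_attention = n * n * d          # O(n^2 * d)  ->  1,280,000
recurrent      = n * d * d          # O(n * d^2)  -> 13,107,200

print(self_attention < recurrent)   # True whenever n < d
```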
- Self-attention complexity / layer
- O(n² · d)
- Recurrent complexity / layer
- O(n · d²)
- Convolutional complexity / layer
- O(k · n · d²)
- Self-attention sequential ops
- O(1)
- Recurrent sequential ops
- O(n)
- Self-attention max path length
- O(1)
- Recurrent max path length
- O(n)
§ II TEST. 07.2 · Scaled dot-product attention
One equation. Hover any row to feel it.
The attention operation takes three inputs. A matrix of queries
Q, a matrix of keys K, and a matrix of
values V. For every row in Q (one query
per token), we compute a dot product against every row in
K (one key per token). The result is a matrix of
compatibility scores: how much should each query attend to each
key.
The scores are divided by √d_k for a practical reason. Without that scaling, large d_k pushes the softmax into a regime where its gradients vanish. The paper hypothesizes that for components of q and k with mean zero and variance one, their dot product has variance d_k, large enough to saturate the softmax once d_k grows. Dividing by √d_k restores stable gradients.
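A quick numerical check of that variance claim, a NumPy sketch rather than anything from the paper's code: draw unit-variance queries and keys and watch the raw dot product spread out as d_k grows.

```python
import numpy as np

# Sketch: verify that q . k has variance ~ d_k when components are ~ N(0, 1).
rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)            # one dot product per row
    print(d_k, round(dots.var(), 1))      # variance tracks d_k; dividing by sqrt(d_k) tames it
```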
A softmax over each row turns the scaled scores into attention
weights that sum to one. Those weights then blend the rows of
V to produce the output: each token's representation
is now a content-weighted mix of every other token's value.
The whole thing is one matrix multiplication, one scale, one
softmax, one multiplication.
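Here is that chain as a minimal NumPy sketch, not the paper's reference code; the shapes follow the base model's d_k = d_v = 64.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # one compatibility score per (query, key) pair
    weights = softmax(scores, axis=-1)          # each row sums to one
    return weights @ V                          # content-weighted blend of the value rows

# Toy example: 5 tokens, d_k = d_v = 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)     # shape (5, 64)
```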
The two attention variants the paper considered
Additive attention uses a feed-forward network with
a single hidden layer to compute compatibility. Mathematically
similar but slower and less space-efficient in practice, since
it cannot be expressed as a single highly-optimized matmul.
Multiplicative (dot-product) attention is what the
paper uses. It performs about the same as additive attention at small d_k,
and it is much faster and more space-efficient on the kind of hardware
everyone actually trains on, because it reduces to a highly optimized matmul.
The 1 / √d_k scaling is the only twist the paper had to add to keep it
competitive at larger d_k.
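For contrast, a sketch of both compatibility functions for a single query; the weight names (W_q, W_k, v) are illustrative, not the paper's notation.

```python
import numpy as np

def additive_scores(q, K, W_q, W_k, v):
    """Additive compatibility: a one-hidden-layer net, v^T tanh(W_q q + W_k k_j) per key."""
    return np.tanh(q @ W_q + K @ W_k) @ v       # (n,) -- one score per key

def dot_product_scores(q, K):
    """Scaled dot-product compatibility: q . k_j / sqrt(d_k) per key."""
    return K @ q / np.sqrt(q.shape[-1])         # (n,) -- a single matmul, no hidden layer

# Shapes: q is (d,), K is (n, d), W_q and W_k are (d, hidden), v is (hidden,)
```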
- Compatibility
- scaled dot product
- Scaling factor
- 1 / √dk
- Normalization
- row-wise softmax
- dk in base model
- 64
- dv in base model
- 64
- Base model dmodel
- 512
§ III TEST. 07.3 · Multi-head attention
One layer becomes eight parallel views.
A single attention layer has a problem. It averages information from every position the query attends to. Averaging is fine when every relevant relationship lives in the same representation subspace. It is destructive when different relationships need to be tracked at the same time: subject-verb agreement and preposition binding and anaphora and tense, all braided into a single weighted sum.
The fix is to run the attention operation h times in
parallel, each in its own learned linear projection of Q,
K, and V. Each head sees a different
subspace. Each head learns a different attention pattern. The
outputs concatenate, then a final linear projection collapses them
back to dmodel. Because each head operates
on dk = dv = dmodel / h,
the total compute stays comparable to single-head with full width.
In the base model, h = 8. Eight parallel attention
patterns, eight parallel learned projections, one concatenated
output. The paper's appendix shows individual heads learning to
perform syntactically and semantically distinct jobs without ever
being told to. Some attend to the previous token. Some link
verbs to their objects. Some specialize in long-range coreference.
The math, fully expanded
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where each head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
The three projection matrices per head are
W_i^Q ∈ ℝ^(d_model × d_k),
W_i^K ∈ ℝ^(d_model × d_k),
W_i^V ∈ ℝ^(d_model × d_v),
plus one shared output projection W^O ∈ ℝ^(h·d_v × d_model).
With h = 8, d_k = d_v = 64, d_model = 512,
a single multi-head attention layer has comparable compute to a
single full-dimension head. Eight views, one budget.
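The same math as a NumPy sketch, with the § II attention condensed inline; the random matrices are stand-ins for the learned projections.

```python
import numpy as np

def attention(Q, K, V):
    """The § II scaled dot-product sketch, condensed."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, one projection triple per head."""
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o       # (n, h*d_v) @ (h*d_v, d_model)

# Base-model shapes: h = 8 heads, d_model = 512, d_k = d_v = 64; weights are random stand-ins.
rng = np.random.default_rng(0)
h, d_model, d_k = 8, 512, 64
W_q, W_k, W_v = ([rng.standard_normal((d_model, d_k)) for _ in range(h)] for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model))
X = rng.standard_normal((5, d_model))                 # 5 tokens
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # shape (5, 512)
```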
- Heads in base model
- h = 8
- Per-head dimensions
- dk = dv = 64
- Concatenated output dim
- h · dv = 512
- Total compute vs single-head
- roughly equal
§ IV TEST. 07.4 · Encoder, decoder, repeat six times
Two stacks. Six layers each. One residual connection at every step.
The full Transformer is two stacks of six identical layers each.
The encoder reads the input. The decoder writes the output, one
token at a time, conditioned on what it has written and on the
encoder's output. Every sub-layer is wrapped in a residual
connection followed by layer normalization, so the running
representation is the sub-layer's output added to its input. The
whole stack outputs vectors of dimension dmodel = 512
from start to finish.
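The wrapping around every sub-layer, as a sketch; the learned gain and bias of LayerNorm are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's vector to zero mean, unit variance (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Every sub-layer in both stacks is wrapped the same way: LayerNorm(x + fn(x))."""
    return layer_norm(x + fn(x))
```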
Three flavors of attention show up. The encoder uses self-attention over its input. The decoder uses masked self-attention, which prevents each position from looking at future positions during training, preserving the auto-regressive property. The decoder also uses encoder-decoder attention, where queries come from the decoder and keys and values come from the encoder's output. That third flavor is how the decoder grounds its generation in the input sentence.
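The decoder's mask is nothing exotic: additive negative infinity above the diagonal, applied to the scores before the softmax. A sketch of the idea, not the reference implementation:

```python
import numpy as np

def causal_mask(n):
    """True wherever position i would otherwise see a future position j > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

# Applied to the raw scores before the softmax, so future tokens get zero attention weight:
# scores = np.where(causal_mask(n), -np.inf, Q @ K.T / np.sqrt(d_k))
```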
Position information is added at the bottom of both stacks via sinusoidal positional encodings. Sine and cosine waves of different frequencies, summed onto the embeddings before any attention happens. The paper found that learned positional embeddings worked equally well, but they preferred the sinusoids because they generalize to sequences longer than anything seen during training.
Sinusoidal positional encoding
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The wavelengths form a geometric progression from 2π to
10000 · 2π. The paper hypothesized that this
family lets the model attend by relative position easily, since
PE(pos + k) is a linear function of
PE(pos) for any fixed k.
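The whole table of encodings is a few lines; a NumPy sketch with the paper's constants:

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)     # wavelengths from 2*pi up to 10000*2*pi
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                           # summed onto the token embeddings
```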
- Layers per stack
- N = 6
- Model dimension
- dmodel = 512
- FFN inner dimension
- dff = 2048
- Heads
- h = 8
- Per-head dim
- dk = dv = 64
- Sub-layers per encoder layer
- 2 (attention + FFN)
- Sub-layers per decoder layer
- 3 (masked self-attn + cross-attn + FFN)
- Around every sub-layer
- residual connection + LayerNorm
§ V TEST. 07.5 · Why it won
Slide n. Watch the answer flip.
The complexity argument was the most under-appreciated part of
the paper at publication. Self-attention costs more per layer in
raw FLOPs at long sequence lengths, but it costs much less in
sequential operations because every position can be
computed at the same time. The maximum path length between any
two positions also drops to a constant. The slider below makes
that tradeoff legible. Drag n across the typical
range of inputs and watch which architecture is cheapest at each
length, holding d = 512 constant.
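What the slider computes, roughly: a sketch with constant factors ignored, d pinned at 512, and an assumed convolution kernel size of 3.

```python
# Per-layer op counts from the paper's complexity table; kernel = 3 is an assumption.
d, kernel = 512, 3

for n in (10, 50, 100, 512, 1000, 5000):
    costs = {
        "self-attention": n * n * d,            # O(n^2 * d)
        "recurrent":      n * d * d,            # O(n * d^2)
        "convolutional":  kernel * n * d * d,   # O(k * n * d^2)
    }
    print(f"n={n:5d}  cheapest per layer: {min(costs, key=costs.get)}")
# Self-attention wins below n = d = 512; recurrence takes over past it.
```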
The final ingredient was the result. The Transformer beat every previously reported model, including ensembles, on WMT 2014 English-to-German by more than two BLEU. It set a new single-model state of the art on English-to-French, training in three and a half days on eight P100 GPUs, a small fraction of the cost of the previous best. The base model trained for twelve hours and already beat every competitor.
- BLEU on WMT 2014 EN-DE (big)
- 28.4 (previous SOTA was 26.36)
- BLEU on WMT 2014 EN-FR (big)
- 41.8 (single-model SOTA)
- Base model BLEU EN-DE
- 27.3 (already beat every competitor)
- Training cost (base, FLOPs)
- 3.3 × 10¹⁸
- Training cost (big, FLOPs)
- 2.3 × 10¹⁹
- Hardware
- 8 NVIDIA P100 GPUs
- Big model training time
- 3.5 days
- Generalization (WSJ parsing F1)
- 91.3 (WSJ-only) · 92.7 (semi-supervised)
§ VI Methodology, sources, what this is and is not
Method
Five sections, each pinned to a specific section of the source paper. The two interactive widgets in § II and § V run entirely client-side in vanilla JavaScript. Heatmap weights are hand-tuned to a plausible self-attention pattern; the multi-head strip in § III is illustrative, not the output of an actual trained model.
The diagrams are hand-coded SVG and CSS grids, no charting library. The page is static HTML with vanilla JavaScript only for the dateline, the heatmap interaction, and the complexity slider. No analytics beyond the privacy-respecting Cloudflare Web Analytics that ships at the platform level.
Sources
Primary: Vaswani et al., "Attention Is All You Need," NeurIPS 2017. arXiv:1706.03762. Every numerical claim on this page traces to a section of that paper: complexity table to § 4, BLEU scores to Table 2, hyperparameters to § 5 and Table 3, sinusoidal positional encoding to § 3.5.
The reference TensorFlow implementation, tensor2tensor, is what the original authors used. The PyTorch reference is the Annotated Transformer from Harvard NLP. Both are excellent reading after this page.
Caveats
The heatmap in § II is a plausible self-attention pattern, not the output of a trained model. It is constructed to make syntactic structure visible to a first-time reader, not to reproduce specific behavior of a specific checkpoint.
The eight head patterns in § III are stylized. Real attention heads in trained Transformers are messier and the precise jobs each head learns vary across runs and seeds. The paper's appendix shows real examples; the strip on this page compresses the "different heads, different jobs" intuition into eight cartoons.
The complexity slider treats d as fixed at 512 and
ignores constant factors. Modern models use much larger d
and much longer n; the crossover point is more
nuanced today than the paper's clean n < d result.
What this lab is not
Not a complete reading of the paper. The training regime, regularization, label smoothing, learning-rate warmup, and constituency parsing experiments are summarized in receipts boxes but not visualized. The original is eight pages and is worth reading directly.
Not a working Transformer. Nothing on this page actually translates anything. For a runnable in-browser Transformer, see Hugging Face's transformers.js or the Annotated Transformer. This lab is the diagram, not the engine.
Eight pages. Six authors. One architecture. The 2017 paper is now nine years old and every frontier large language model on Earth is a direct descendant of its decoder stack. The receipts that mattered were not the BLEU scores. They were the complexity table and the parallelism argument. Recurrence had a hidden tax. Attention removed it.
What followed is harder to summarize. Encoder-only Transformers became BERT. Decoder-only Transformers became GPT. The encoder stack faded, the decoder stack ate the world, and the field is still arguing about whether to keep stretching n or replace n² with something genuinely subquadratic. Two of those arguments are running labs on this site already.
All of that started here. Attention is all you need.
FAQ
What is the Transformer?
The Transformer is the neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It dispenses with recurrence and convolution entirely, relying only on stacked self-attention and feed-forward layers. It is the architecture every modern large language model is built on, including GPT, Claude, Gemini, and DeepSeek.
What is scaled dot-product attention?
Scaled dot-product attention computes Attention(Q, K, V) = softmax(Q K^T / √d_k) V. The query Q is compared against every key K via a dot product, the result is divided by the square root of the key dimension d_k to keep gradients stable, a softmax converts the scores to weights that sum to one, and the weights are used to take a weighted sum of the value vectors V.
Why multi-head attention?
A single attention layer averages information from all positions, which blurs distinct relationships. Multi-head attention runs h parallel attention layers (h = 8 in the base model), each in its own d_k = d_v = 64 dimensional subspace, and concatenates the results. Different heads learn to attend to different things: subject-verb agreement, anaphora, syntactic role, position. The total computational cost is similar to single-head with full dimensionality.
Why is self-attention faster than RNNs at typical sequence lengths?
Self-attention costs O(n² · d) per layer but only O(1) sequential operations because every position is computed in parallel. Recurrent layers cost O(n · d²) per layer and require O(n) sequential operations. When the sequence length n is smaller than the representation dimension d (as it is for typical sentences encoded with d = 512), self-attention is both fewer FLOPs and dramatically more parallelizable. The maximum path length between any two positions also drops from O(n) to O(1).
What were the original Transformer's results?
On WMT 2014 English-to-German translation, the big Transformer achieved 28.4 BLEU, beating the previous best (including ensembles) by more than two BLEU. On WMT 2014 English-to-French it reached 41.8 BLEU. The base model trained for twelve hours on eight NVIDIA P100 GPUs at a total cost of 3.3 × 10¹⁸ FLOPs, a fraction of competitors. The model also generalized to English constituency parsing without architecture-specific tuning.