Vol. XII · No. 05 · May 2026
Jake Cuth.

Layers of
nonlinearity.

Stacked layers of linear transforms with nonlinear activations. Universal function approximators in theory; in practice, hungry for data and tuning. The shape of every modern AI system, scaled down.

A 2-N-1 multilayer perceptron with tanh hidden activations and a sigmoid output. Backprop runs live in your browser at 5 batch updates per frame. Press Run, watch the boundary form, watch the loss curve descend. Press Reset to start training over from a fresh random initialization.


Each epoch while Run is active shrinks the loss. Different seeds and hyperparameters land on different boundaries, so try Reset a few times.

The learned probability surface: smooth and nonlinear.

A multilayer perceptron stacks affine transforms (matrix multiply + bias) with nonlinear activations between them. This demo has one hidden layer with N units (configurable above), tanh activation, then a single sigmoid output unit for binary classification.

Forward pass: input → hidden activations → output probability. Backward pass: chain-rule the gradient of the binary cross-entropy backward through every layer. Update each weight by w ← w − η ∂L/∂w. Repeat until the loss stops decreasing meaningfully.

Two facts to keep in mind: every modern AI architecture is structurally this same pattern repeated at scale (with attention, convolutions, etc.); and a 2-N-1 MLP is technically a universal approximator on this domain, but "in theory" is a long way from "with this seed, this learning rate, and this much patience."

The math

Forward pass:

z₁ = W₁ · x + b₁,   h = tanh(z₁)
z₂ = W₂ · h + b₂,   ŷ = σ(z₂)
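
A minimal NumPy sketch of the same forward pass. The variable names and shapes are my own convention, not the demo's: x is a (2, 1) column of features, W1 is (N, 2), W2 is (1, N).

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # hidden layer: affine transform, then tanh
        z1 = W1 @ x + b1                      # (N, 1)
        h = np.tanh(z1)
        # output layer: affine transform, then sigmoid probability
        z2 = W2 @ h + b2                      # (1, 1)
        y_hat = 1.0 / (1.0 + np.exp(-z2))
        return z1, h, z2, y_hat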

Loss (binary cross-entropy):

L = −[y log ŷ + (1 − y) log(1 − ŷ)]
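
In code, with ŷ clipped away from 0 and 1 so the logs stay finite (the epsilon guard is my addition, not something the demo necessarily does):

    def bce_loss(y, y_hat, eps=1e-12):
        # binary cross-entropy for a single example
        p = np.clip(y_hat, eps, 1.0 - eps)
        return (-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))).item()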

Backprop, working back from the output:

∂L/∂z₂ = ŷ − y
∂L/∂W₂ = (∂L/∂z₂) · hᵀ,   ∂L/∂b₂ = ∂L/∂z₂
∂L/∂h = W₂ᵀ · (∂L/∂z₂)
∂L/∂z₁ = (∂L/∂h) ⊙ (1 − tanh²(z₁))      (⊙ = elementwise product)
∂L/∂W₁ = (∂L/∂z₁) · xᵀ,   ∂L/∂b₁ = ∂L/∂z₁

Apply with stochastic gradient descent. The same chain rule, scaled, runs every neural network in production. Convolutions, attention, layer norm — all pieces that fit into this skeleton.
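
Put together, one SGD step that mirrors the derivatives above might look like the sketch below. It reuses the hypothetical forward() and bce_loss() from the earlier snippets; lr is the learning rate η.

    def sgd_step(x, y, W1, b1, W2, b2, lr=0.1):
        z1, h, z2, y_hat = forward(x, W1, b1, W2, b2)
        # output layer: for sigmoid + binary cross-entropy, dL/dz2 = y_hat - y
        dz2 = y_hat - y                        # (1, 1)
        dW2 = dz2 @ h.T                        # (1, N)
        db2 = dz2
        # hidden layer: push the gradient back through W2, then through tanh
        dh = W2.T @ dz2                        # (N, 1)
        dz1 = dh * (1.0 - np.tanh(z1) ** 2)    # elementwise
        dW1 = dz1 @ x.T                        # (N, 2)
        db1 = dz1
        # gradient descent: w <- w - lr * dL/dw, updated in place
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
        return bce_loss(y, y_hat)

Loop that over shuffled examples for many passes and you have the whole training procedure.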


Shines

Big data, complex truth

Image, audio, text: anywhere the relationship between input and label is too convoluted for a tree or linear model and you have hundreds of thousands of examples to train on. Modern deep learning is exactly this, scaled to billions of parameters.

Shines

Smooth boundaries

Try the spiral preset above. A decision tree gets there with stair-steps; an MLP carves a smooth, continuous surface that follows the data's shape. For applications where boundary smoothness matters (signal processing, perception), that continuous surface is a real advantage.

Breaks

Tabular data

Surprising fact: on small-to-medium tabular datasets, MLPs frequently lose to gradient boosting. Mixed feature types, missing values, monotonicity constraints, and interactions that need explicit feature engineering all play to trees' strengths and against neural nets'.

Breaks

Bad initializations stick

Press Reset a few times in a row on the spiral preset. About one in three runs gets stuck in a local minimum and never finds the right boundary. For production, use Adam or RMSprop, careful initialization (Xavier / He), and patience.
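
A sketch of Glorot/Xavier-style initialization for this 2-N-1 shape (standard deviation sqrt(2 / (fan_in + fan_out)); whether the demo uses exactly this variant isn't stated):

    def init_params(n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Glorot normal: keeps tanh pre-activations in a sane range at the start
        W1 = rng.normal(0.0, np.sqrt(2.0 / (2 + n_hidden)), size=(n_hidden, 2))
        b1 = np.zeros((n_hidden, 1))
        W2 = rng.normal(0.0, np.sqrt(2.0 / (n_hidden + 1)), size=(1, n_hidden))
        b2 = np.zeros((1, 1))
        return W1, b1, W2, b2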


  • Inference: 0.50
  • Accuracy: 0.85
  • Training: 0.30
  • Small size: 0.40

Every production neural network in the world. Image classifiers, speech recognizers, language models, recommendation rerankers — every modern deep learning system is a multilayer perceptron with structural assumptions added (convolutions for images, attention for text, ResNet-style residual connections everywhere). Scale up the same forward + backward you ran in this demo and you have ChatGPT.

