FIG. 13 · ATLAS
Neural Network (MLP)
Layers of nonlinearity.
Stacked layers of linear transforms with nonlinear activations. Universal function approximators in theory; in practice, hungry for data and tuning. The shape of every modern AI system, scaled down.
A 2-H-1 multilayer perceptron with tanh hidden activation and sigmoid output. Backprop runs live in your browser at 5 batch updates per frame. Press Run, watch the boundary form, watch the loss curve descend. Reset to start training over from a fresh random initialization.
§ I · The boundary, learned
Each epoch of Run shrinks the loss. Different seeds and hyperparameters land on different boundaries — try Reset a few times.
§ II · How it works
A multilayer perceptron stacks affine transforms (matrix multiply + bias) with nonlinear activations between them. This demo has one hidden layer with N units (configurable above), tanh activation, then a single sigmoid output unit for binary classification.
Forward pass: input → hidden activations → output probability. Backward pass: chain-rule the gradient of binary cross-entropy backward through every layer. Update each weight by w ← w − η ∇w. Repeat until the loss stops decreasing meaningfully.
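The forward pass above can be sketched in a few lines of pure Python. This is a minimal illustration of the 2-H-1 architecture, not the demo's actual implementation; the weights below are made-up toy values.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-H-1 MLP: tanh hidden layer, sigmoid output."""
    # Hidden activations: h = tanh(W1 · x + b1)
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output logit and sigmoid probability: y_hat = sigma(W2 · h + b2)
    z2 = sum(w * hi for w, hi in zip(W2, h)) + b2
    y_hat = 1.0 / (1.0 + math.exp(-z2))
    return h, y_hat

# Toy 2-2-1 weights, chosen arbitrarily for illustration
W1 = [[0.5, -0.3], [0.8, 0.2]]
b1 = [0.1, -0.1]
W2 = [1.0, -1.0]
b2 = 0.0

h, y_hat = forward([1.0, 2.0], W1, b1, W2, b2)
# y_hat is a probability strictly between 0 and 1
```

The sigmoid guarantees the output lands in (0, 1), which is what lets binary cross-entropy treat it as a class probability.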
Two facts to keep in mind: every modern AI architecture is structurally this same pattern repeated at scale (with attention, convolutions, etc.); and a 2-N-1 MLP is technically a universal approximator on this domain, but "in theory" is a long way from "with this seed, this lr, and this much patience."
The math
Forward pass:
h = tanh(W₁ · x + b₁)
ŷ = σ(W₂ · h + b₂)
Loss (binary cross-entropy):
L = −[y log ŷ + (1 − y) log(1 − ŷ)]
Backprop, working back from the output:
∂L/∂z₂ = ŷ − y
∂L/∂W₂ = (∂L/∂z₂) · hᵀ
∂L/∂h = (∂L/∂z₂) · W₂
∂L/∂z₁ = ∂L/∂h · (1 − tanh²(z₁))
∂L/∂W₁ = ∂L/∂z₁ · xᵀ
Apply with stochastic gradient descent. The same chain rule, scaled, runs every neural network in production. Convolutions, attention, layer norm — all pieces that fit into this skeleton.
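The chain of equations above can be checked numerically: compute the analytic gradients exactly as written, then compare one of them against a finite-difference estimate. This is a self-contained sketch (the input, label, and weights are arbitrary toy values), and the finite-difference check is the standard way to validate a hand-derived backprop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    """One forward + backward pass, following the equations above."""
    # Forward: z1 = W1 · x + b1, h = tanh(z1), z2 = W2 · h + b2, y_hat = sigmoid(z2)
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [math.tanh(z) for z in z1]
    z2 = sum(w * hi for w, hi in zip(W2, h)) + b2
    y_hat = sigmoid(z2)
    L = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    # Backward, mirroring the derivation line by line
    dz2 = y_hat - y                                         # dL/dz2 = y_hat - y
    dW2 = [dz2 * hi for hi in h]                            # dL/dW2 = dz2 · h^T
    db2 = dz2
    dh = [dz2 * w for w in W2]                              # dL/dh  = dz2 · W2
    dz1 = [dhi * (1 - hi * hi) for dhi, hi in zip(dh, h)]   # tanh' = 1 - tanh^2
    dW1 = [[dzi * xi for xi in x] for dzi in dz1]           # dL/dW1 = dz1 · x^T
    db1 = dz1[:]
    return L, dW1, db1, dW2, db2

# Toy point and weights for the check
x, y = [0.7, -0.4], 1.0
W1 = [[0.5, -0.3], [0.8, 0.2]]; b1 = [0.1, -0.1]
W2 = [1.0, -1.0]; b2 = 0.05

L, dW1, db1, dW2, db2 = loss_and_grads(x, y, W1, b1, W2, b2)

# Finite-difference check on W1[0][0]: nudge the weight, re-measure the loss
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[0][0] += eps
Lp, *_ = loss_and_grads(x, y, W1p, b1, W2, b2)
numeric = (Lp - L) / eps
assert abs(numeric - dW1[0][0]) < 1e-4  # analytic and numeric gradients agree
```

An SGD update is then just `w -= lr * grad` for every entry of W₁, b₁, W₂, b₂, looped over the training set.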
§ III · Where it shines, where it breaks
Big data, complex truth
Image, audio, text — anywhere the relationship between input and label is too convoluted for a tree or linear model and you have hundreds of thousands of examples to train on. Modern deep learning IS this scaled to billions of parameters.
Smooth boundaries
Try the spiral preset above. A decision tree gets there with stair-steps; an MLP carves a smooth, continuous surface that follows the data's shape. For applications where boundary smoothness matters (signal processing, perception), MLPs are unmatched.
Tabular data
Surprising fact: on small-to-medium tabular datasets, MLPs frequently lose to gradient boosting. Mixed feature types, missing values, and monotonicity constraints all play to trees' strengths and against neural nets'.
Bad initializations stick
Press Reset a few times in a row on the spiral preset. About one in three runs gets stuck in a local minimum and never finds the right boundary. For production, use Adam or RMSprop, careful initialization (Xavier / He), and patience.
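Careful initialization is cheap insurance against those stuck runs. A minimal sketch of Xavier (Glorot) uniform initialization, which draws each weight from a range scaled by the layer's fan-in and fan-out — the function name here is illustrative, not the demo's API:

```python
import math
import random

def xavier_init(fan_in, fan_out, rng=random):
    """Xavier/Glorot uniform init: limit = sqrt(6 / (fan_in + fan_out)).

    Keeps activation variance roughly constant layer to layer,
    which is what makes tanh networks trainable from the start.
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

random.seed(42)
W1 = xavier_init(2, 8)  # hidden-layer weights of a 2-8-1 MLP
limit = math.sqrt(6.0 / (2 + 8))
assert all(abs(w) <= limit for row in W1 for w in row)
```

For ReLU networks the analogous He initialization scales by fan-in alone; for tanh, Xavier is the usual default.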
§ IV · Trade-off scorecard
- Inference: 0.50
- Accuracy: 0.85
- Training: 0.30
- Small size: 0.40
§ V · In production
Every production neural network in the world. Image classifiers, speech recognizers, language models, recommendation rerankers — every modern deep learning system is a multilayer perceptron with structural assumptions added (convolutions for images, attention for text, ResNet-style residual connections everywhere). Scale up the same forward + backward you ran in this demo and you have ChatGPT.