P04 Two losses. One hyperparameter.
Inside LeWorldModel.
The world model that fits in fifteen million parameters, trains on one GPU, encodes each frame as a single 192-dim vector, and plans forty-eight times faster than its closest foundation-model rival. Built from a fifteen-page paper that does in two loss terms what every prior JEPA needed seven of.
Every prior Joint-Embedding Predictive Architecture has been a hyperparameter swamp. Exponential moving averages, stop-gradients, frozen DINOv2 encoders, seven-term VICReg objectives. LeWM throws all of it out and replaces it with one regularizer that asks a single question of the embedding cloud: does it look like an isotropic Gaussian from a thousand random angles?
§ I P04.1 · Why JEPA collapses
Predict the future. Or just predict zero.
A Joint-Embedding Predictive Architecture is a deceptively simple idea. Encode the present frame to a vector. Encode the future frame to a vector. Train a small predictor to map the present vector plus the action to the future vector. The model never has to render pixels, never has to reconstruct anything, and never has to pay the bill for modeling visual texture that does not matter for control.
The catch is that nothing stops the encoder from learning to map every frame to a single point. If both sides of the prediction collapse to the same constant vector, the prediction loss is satisfied perfectly. The encoder has learned nothing about the world, and the loss curve looks like a triumph.
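To make the failure mode concrete, here is a toy sketch (not from the paper; the shapes and the 192-d latent width are placeholders) in which a constant encoder and a pass-through predictor drive the prediction loss to exactly zero while learning nothing:

```python
# Toy illustration of JEPA collapse: a degenerate encoder that ignores its
# input still achieves a perfect prediction loss.
import torch
import torch.nn.functional as F

def constant_encoder(frames: torch.Tensor) -> torch.Tensor:
    # Maps every frame to the same point in a 192-d latent space.
    return torch.zeros(frames.shape[0], 192)

def passthrough_predictor(z_now: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Ignores the action; reproducing the constant is already "perfect".
    return z_now

frames_now  = torch.rand(8, 3, 224, 224)   # arbitrary observations
frames_next = torch.rand(8, 3, 224, 224)
actions     = torch.rand(8, 2)

z_next_pred = passthrough_predictor(constant_encoder(frames_now), actions)
z_next      = constant_encoder(frames_next)

print(F.mse_loss(z_next_pred, z_next).item())  # 0.0 — the loss curve looks like a triumph
```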
Every prior JEPA has paid for collapse-prevention with extra machinery. BYOL-style methods use an exponential moving average target encoder and a stop-gradient. DINO uses centering and sharpening. PLDM bolts on a seven-term VICReg-derived objective with six tunable weights, plus inverse dynamics, plus auxiliary smoothness. DINO-WM ducks the question entirely by not training its encoder. Every fix works, and every fix introduces brittleness: EMAs need warmup tuning, VICReg coefficients drift, and frozen foundation encoders force you to live inside whatever representation DINOv2 happened to learn.
- BYOL / V-JEPA family
- EMA target encoder + stop-gradient + asymmetric projector
- DINO / DINOv2
- EMA + centering + sharpening + multi-crop
- PLDM (closest competitor)
- 7 loss terms, 6 tunable coefficients, inverse dynamics, smoothness
- DINO-WM
- encoder is frozen DINOv2, not trained at all
- LeWM
- 2 loss terms, 1 effective hyperparameter (λ = 0.1)
§ II P04.2 · SIGReg, in pictures
One target distribution. A thousand random angles.
Collapse is a distributional pathology. A degenerate encoder maps everything to a point, a line, or a low-dimensional subspace. The cleanest fix is to specify a target distribution for the embedding cloud and force the encoder to match it. The natural choice is the isotropic Gaussian: maximally entropic given a fixed variance, rotationally symmetric, and with a closed-form characteristic function.
Directly testing high-dimensional normality is intractable. SIGReg dodges that with a two-step trick. First, the Cramér-Wold theorem (1936): a multivariate distribution equals N(0, I) if and only if every one-dimensional projection of it is N(0, 1). So instead of a D-dimensional normality test, sample M random unit directions and run M one-dimensional tests. Second, the Epps-Pulley test (1983): a 1-D normality statistic based on the empirical characteristic function. Bounded gradient. Bounded Hessian. Linear-time per projection. Friendly to SGD.
Drag the slider below. The left panel shows a 2D embedding cloud. The right panel shows the histogram of one random 1-D projection of that cloud, with the target N(0, 1) curve drawn in dashes. SIGReg does not push individual embeddings apart. It pushes the shape of the cloud toward a Gaussian.
The cloud (2D scatter)
One random 1-D projection
The math: why 1-D projections are enough
Cramér-Wold (1936): a probability measure P on ℝ^D is uniquely determined by the family of distributions of its 1-D projections {u · X : ‖u‖ = 1}. So P = N(0, I) iff every 1-D projection of P is N(0, 1).
SIGReg(Z) = (1/M) Σ_{m=1..M} T_EP(Z · u^(m)), with u^(m) ~ Uniform(S^{D−1}) resampled every step, where T_EP is the Epps-Pulley statistic.
With M = 1024 directions, the Monte-Carlo sketch is empirically excellent on these tasks. Per the appendix, the result is essentially insensitive to M; only λ matters. That single effective hyperparameter is why tuning collapses from a multi-dimensional grid search to a one-dimensional sweep over λ. PLDM has six coefficients; LeWM has one.
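To make that recipe concrete, here is a minimal PyTorch sketch of a SIGReg-style regularizer: random unit directions, 1-D projections, and a characteristic-function distance to N(0, 1) integrated over a small frequency grid. The function name, the grid, and the quadrature weights are illustrative assumptions, not the authors' released code; the paper uses the Epps-Pulley statistic, and this sketch approximates its integral by quadrature.

```python
# Sketch of a SIGReg-style regularizer: penalize the distance between the
# empirical characteristic function of each random 1-D projection and the
# characteristic function of N(0, 1), weighted by the standard normal density.
import torch

def sigreg_loss(z: torch.Tensor, num_dirs: int = 1024, num_freqs: int = 17) -> torch.Tensor:
    """z: (batch, dim) embeddings. Returns a scalar anti-collapse loss."""
    n, d = z.shape

    # Random unit directions on S^{d-1}, resampled at every call (Cramér-Wold sketch).
    u = torch.randn(d, num_dirs, device=z.device, dtype=z.dtype)
    u = u / u.norm(dim=0, keepdim=True)
    proj = z @ u                                    # (batch, num_dirs) 1-D projections

    # Frequency grid and N(0, 1)-shaped quadrature weights.
    t = torch.linspace(-4.0, 4.0, num_freqs, device=z.device, dtype=z.dtype)
    w = torch.exp(-0.5 * t**2)
    w = w / w.sum()

    # Empirical characteristic function of each projection, evaluated on the grid.
    tx = proj.unsqueeze(-1) * t                     # (batch, num_dirs, num_freqs)
    ecf_real = torch.cos(tx).mean(dim=0)            # (num_dirs, num_freqs)
    ecf_imag = torch.sin(tx).mean(dim=0)
    target = torch.exp(-0.5 * t**2)                 # CF of N(0, 1) is real

    # Weighted squared CF distance, averaged over the M random directions.
    sq_err = (ecf_real - target) ** 2 + ecf_imag ** 2
    return (sq_err * w).sum(dim=-1).mean()
```

In training this would sit next to the prediction term as `prediction_loss + 0.1 * sigreg_loss(z)`, with λ = 0.1 the single effective hyperparameter.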
- Loss terms
- 2 (prediction + SIGReg)
- Effective hyperparameter
- λ = 0.1
- Random projections per step
- M = 1024 (per appendix, insensitive)
- Statistic
- Epps-Pulley empirical characteristic function
- Theoretical foundation
- Cramér-Wold theorem (1936)
- Anti-collapse mechanism
- distribution-matching, not pairwise repulsion
- EMA / stop-gradient
- none
- Reconstruction head
- none
§ III P04.3 · The token economy
One CLS token. Or one fog of patches.
The cost of imagining the next frame depends almost entirely on how many tokens you choose to represent it with. Every world model on this page does the same job at a high level. Each one pays a different bill per timestep. The dot grids below are one-to-one: each dot is one token the model carries forward every time it rolls the future.
LeWM is a single accent dot. DINO-WM is a 14×14 patch grid. V-JEPA 2 is a larger grid. Genie 3 is so dense the dots blur into a fog. Sora is so dense we cannot draw it cell-by-cell on this card. The 48× planning speedup over DINO-WM does not come from a clever attention trick. It comes from this picture.
How the planning loop compounds the per-frame cost
LeWM plans by the Cross-Entropy Method in latent space. Each cycle: sample 300 candidate action sequences of horizon H = 5, roll each one autoregressively through the predictor, score by L2 distance to the goal latent, refit a Gaussian to the top 30 elites, repeat for 10 to 30 iterations, execute the first action, replan.
Total token-rollouts per planning cycle = 300 × H × iters. Multiply by tokens-per-frame and you have the actual compute budget. LeWM at 1 token/frame and DINO-WM at 196 tokens/frame have the same loop structure but a 196× difference per step. That is where the paper's reported ~1 s vs ~47 s per planning cycle on an L40S comes from.
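A minimal sketch of that loop, assuming a `predictor(z, action)` callable that maps a batch of latents plus actions to the next latents; the names, shapes, and defaults mirror the numbers above but are not the authors' planner:

```python
# Sketch of latent-space planning with the Cross-Entropy Method (CEM).
import torch

def cem_plan(z_now, z_goal, predictor, action_dim=2,
             num_samples=300, horizon=5, num_elites=30, iters=10):
    """z_now, z_goal: (1, latent_dim). Returns the first action to execute."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)

    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * torch.randn(num_samples, horizon, action_dim)

        # Roll each candidate autoregressively through the latent predictor:
        # one token per frame, per rollout, per step.
        z = z_now.expand(num_samples, -1)
        for t in range(horizon):
            z = predictor(z, actions[:, t])

        # Score by L2 distance to the goal latent, keep the elites, refit.
        costs = (z - z_goal).norm(dim=-1)
        elite_idx = costs.topk(num_elites, largest=False).indices
        elites = actions[elite_idx]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6

    return mean[0]   # execute the first action, then replan from the new frame
```

With these defaults, one planning cycle makes 300 × 5 × 10 = 15,000 single-latent predictor calls; the same loop at 196 tokens per frame pushes roughly 196× as many tokens through the predictor per cycle.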
- CEM candidates
- 300
- Planning horizon
- H = 5
- Elite fraction
- top 30 (10%)
- CEM iterations
- 10 to 30 per cycle
- LeWM plan time
- ~1 s / cycle (L40S)
- DINO-WM plan time
- ~47 s / cycle (L40S)
- Speedup
- ~48×
- Token ratio (LeWM : DINO-WM)
- 1 : 196
§ IV P04.4 · The architecture
Two modules. Both trained end-to-end.
The whole system is two transformers and two batch-norm projectors. The encoder is a ViT-Tiny; encoder and predictor together total about fifteen million parameters. There is no EMA target, no stop-gradient, no reconstruction head, no reward head, and no proprioceptive input. Gradients flow through everything.
Action conditioning is injected via AdaLN, the same scale-shift trick used in diffusion transformers. The AdaLN parameters are zero-initialized, so at the start of training the predictor ignores the action signal entirely and only gradually learns to use it. That single detail is the difference between stable training and divergence.
Why the BN-projector matters
The ViT's final LayerNorm standardizes each CLS token within the sample (zero mean, unit variance across features), which flattens exactly the cross-sample distributional shape that SIGReg is trying to enforce. A one-layer MLP projector with batch normalization, which normalizes across the batch instead, restores that structure. Without it, SIGReg pulls against LayerNorm and training stalls.
The AdaLN zero-initialization is borrowed from DiT (Peebles & Xie). At step zero, action conditioning has no effect and the predictor behaves as if no action were given; gradient flow gradually wakes up the action pathway. Empirically, this is the difference between converging cleanly and diverging in the first few hundred steps.
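A sketch of both details under assumed shapes; the module names, the hidden widths, and the 2-d action are placeholders, not the released code:

```python
# Sketch of the two training-stability details: a BatchNorm projector after the
# CLS token, and DiT-style AdaLN conditioning whose scale/shift head starts at zero.
import torch
import torch.nn as nn

class BNProjector(nn.Module):
    """Restores cross-sample distributional structure after the ViT's LayerNorm."""
    def __init__(self, dim: int = 192):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim))

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return self.proj(cls_token)                  # (batch, dim)

class AdaLNBlock(nn.Module):
    """One predictor block with zero-initialized action conditioning."""
    def __init__(self, dim: int = 192, action_dim: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(action_dim, 2 * dim)
        nn.init.zeros_(self.ada.weight)              # scale = shift = 0 at step zero:
        nn.init.zeros_(self.ada.bias)                # the action pathway starts silent

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        scale, shift = self.ada(action).chunk(2, dim=-1)
        h = self.norm(z) * (1 + scale) + shift       # action-conditioned modulation
        return z + self.mlp(h)
```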
- Total params
- ~15M (5M encoder + 10M predictor)
- Encoder
- ViT-Tiny, 12 layers, 3 heads, d=192, patch 14
- Predictor
- causal transformer, 6 layers, 16 heads, dropout 10%
- Per-frame embedding
- 1 CLS token, 192-d
- Action injection
- AdaLN, zero-initialized
- Context window
- N = 3 (Push-T, OGBench-Cube), N = 1 (Two-Room)
- Trained on
- 1 GPU, hours (not days)
- Target encoder
- same encoder, no EMA, no stop-grad
§ V P04.5 · What it does, what it does not
Wins on planning. Loses where pretraining still helps.
The strongest result is on Push-T. LeWM matches or beats DINO-WM there even though DINO-WM is given proprioceptive inputs that LeWM does not see, and it improves over PLDM (the closest end-to-end JEPA competitor) by eighteen percent. Beyond benchmark numbers, the latent space recovers physical quantities under linear and MLP probes with r ≈ 0.99 on Push-T agent x/y, block x/y, and block angle.
Two emergent behaviors are worth flagging. First, temporal latent path straightening: the cosine similarity between consecutive latent velocity vectors increases over training. Trajectories become geometrically straighter in latent space, a known signature in primate visual cortex. Second, violation-of-expectation: the model's surprise signal spikes for object teleportation but not for color changes. It is a model trained on no rewards and no labels that is more bothered by physical impossibility than by visual perturbation.
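The straightening metric itself is one line of linear algebra; a sketch, assuming a T × d latent trajectory as input (the function name is illustrative):

```python
# Sketch of the latent path-straightening metric: cosine similarity between
# consecutive latent velocity vectors along a rollout (1.0 = perfectly straight).
import torch
import torch.nn.functional as F

def path_straightness(latents: torch.Tensor) -> torch.Tensor:
    """latents: (T, dim) latent trajectory, T >= 3."""
    v = latents[1:] - latents[:-1]                          # (T-1, dim) velocities
    return F.cosine_similarity(v[1:], v[:-1], dim=-1).mean()
```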
Where LeWM wins
- Push-T (success rate): matches or beats DINO-WM, +18% over PLDM at the apples-to-apples comparison
- Reacher: competitive with foundation-model planners
- Token economy: 200× fewer tokens per frame than DINO-WM, ~28,000× fewer than Sora
- Planning speed: ~1 s/cycle vs DINO-WM's ~47 s, V-JEPA 2-AC's ~16 s, Cosmos's ~4 min
- Trainability: one GPU, hours, single hyperparameter to tune
- Physical probes: r ≈ 0.99 on agent / block coordinates from a 192-d CLS
- Latent straightening: more straightening than PLDM despite no explicit smoothness term
- Surprise tracks physics: spike on teleport, no spike on color change
Where LeWM loses
- Two-Room navigation: 2D task with intrinsic dim ~3 is over-regularized by a 192-d Gaussian prior
- OGBench-Cube: visually rich 3D scenes still favor DINO-WM's ImageNet-scale priors
- Long-horizon reasoning: autoregressive latent rollouts compound error
- Anything generative: does not render pixels; the post-hoc decoder is for visualization only
- Anything language: no tokens, no vocabulary, not the problem class
- Passive video: requires action labels; cannot learn from raw observation
- Uncertainty: deterministic dynamics, Euclidean cost in latent space
| Model | Predicts | Anti-collapse | Plan / step | Good at |
|---|---|---|---|---|
| Dreamer V3 | pixels + reward | reconstruction + KL | imagine + actor-critic | RL on games |
| DINO-WM | future DINOv2 patches | frozen pretrained encoder | ~47 s | visually rich planning |
| PLDM | future embeddings | 7-term VICReg | fast, unstable | simple low-D environments |
| V-JEPA 2-AC | masked latent patches | EMA + stop-grad | ~16 s | foundation video, post-trained |
| Genie 3 | next pixel frame | generative loss | 24 fps real-time | interactive worlds |
| Sora | spacetime patches | diffusion | minutes / minute | photoreal video |
| LeWM | next 192-d CLS | SIGReg (Gaussian match) | ~1 s | cheap pixel-to-action planning |
The two-room failure, in one sentence
Two-Room is a 2D navigation task whose true data manifold is roughly three-dimensional. Forcing a 192-d isotropic Gaussian prior on a manifold that small distorts the representation. The paper notes this explicitly: SIGReg is a strong prior, and on data that does not deserve a strong prior it can over-regularize. PLDM and DINO-WM, with weaker priors, win on Two-Room.
§ VI Methodology, sources, what this is and is not
Method
Five sections, each pinned to a specific section of the source preprint. The two interactive widgets in § II and § III run entirely client-side in vanilla JavaScript. The SIGReg scatter and projection histogram are sampled live from a controllable distribution. The token-economy isotype is a one-dot-per-token pictogram, capped at six hundred dots before switching to a density overlay.
The diagrams are hand-coded HTML/CSS grids and a small canvas widget. No charting library, no React, no build step. Page weight stays under one megabyte; only Cloudflare Web Analytics runs at the platform level. No tracking.
Sources
Primary: Maes, Le Lidec, Scieur, LeCun, Balestriero. "LeWorldModel", March 2026 preprint (arXiv pending). Numerical claims on this page (parameter counts, planning latencies, success deltas, hyperparameters) trace to that paper or its appendix.
Prerequisite: Balestriero & LeCun, "LeJEPA" (arXiv:2511.08544, 2025), where SIGReg is introduced and proved to prevent collapse under stated smoothness assumptions.
Comparators: DINO-WM (Zhou et al., ICML 2025), PLDM (Sobal et al., 2025), V-JEPA 2 (Meta), Titans (Behrouz et al., 2024), Mamba (Gu & Dao). Numbers cited from each paper's primary results table.
Caveats
This is a preprint (v1 March 2026, revised to v2 the same month) and has not yet completed peer review. "Provable anti-collapse" is provable for SIGReg in LeJEPA under smoothness assumptions; the world-model setting adds dynamics, and the paper does not extend the proof. Read it as "principled and empirically robust" rather than "formally guaranteed for the dynamics-conditioned case."
The 200× token-ratio and 48× planning speedup numbers are reported on a single hardware setup (NVIDIA L40S) against a particular DINO-WM configuration. Directionally robust, not tight fundamental ratios.
The benchmark suite (Two-Room, Reacher, Push-T, OGBench-Cube) is small. Generalization to large-scale 3D robotics, real-robot data, or natural video has not been demonstrated.
What this lab is not
Not a tutorial on JEPAs in general. The page focuses on the single architectural shift LeWM proposes: replacing brittle anti-collapse machinery with a principled distribution-matching regularizer. The broader JEPA family (BYOL, DINO, V-JEPA, I-JEPA, JEPA-2) is referenced for context, not surveyed.
Not a runnable model. Nothing on this page actually plans anything. The interactives illustrate the loss-shape and the token economy; they are not LeWM's encoder or predictor in the browser. For runnable code, see the authors' implementation when released.
For nine years, every JEPA recipe has paid an anti-collapse tax. Exponential moving averages, stop-gradients, frozen foundation encoders, multi-term VICReg objectives. Each fix worked. Each fix added a knob. The cumulative knob count on a working JEPA was large enough that a new researcher had to first learn the swamp, then learn the model.
LeWM is the first time someone has shown the swamp was optional. One regularizer that asks one question of the embedding cloud, from a thousand random angles. Does it look Gaussian? That question is enough.
What is striking about this paper is not that it proposes a new architecture. It does not. It proposes a new discipline around the loss. Specify what the embedding distribution should be, then enforce that. Stop bolting on anti-collapse mechanisms one at a time and hoping. The downstream effects (smaller models, faster planning, single-knob training, surprise that tracks physics) all flow from that one move.
FAQ
What is LeWorldModel?
LeWorldModel (LeWM) is a Joint-Embedding Predictive Architecture for action-conditioned world modeling, introduced in March 2026 by Maes, Le Lidec, Scieur, LeCun, and Balestriero (Mila, NYU, Samsung SAIL, Brown). It is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms and a single effective hyperparameter, replacing the brittle anti-collapse machinery (EMAs, stop-gradients, frozen pretrained encoders, multi-term VICReg objectives) used by prior JEPAs.
What is SIGReg?
SIGReg, or Sketched Isotropic Gaussian Regularization, is the regularizer that prevents representation collapse in LeWM. It pushes the encoder's embedding distribution toward an isotropic Gaussian using random one-dimensional projections (Cramér-Wold theorem) and the Epps-Pulley normality test. With M random projections per step (M = 1024 in the paper) the test runs in linear time, has bounded gradients, and parallelizes cleanly. It is a distribution-matching anti-collapse term, not a pairwise repulsion.
How does LeWM compare to DINO-WM and PLDM?
LeWM uses one CLS token per frame (192 dimensions); DINO-WM uses about 196 patch tokens per frame from a frozen DINOv2 encoder. LeWM plans about forty-eight times faster (one second per planning cycle versus forty-seven seconds for DINO-WM on an L40S GPU) and is end-to-end trainable, while DINO-WM depends on a frozen pretrained encoder. Versus PLDM, the only previous end-to-end JEPA world model, LeWM collapses six tunable loss coefficients into one and achieves higher Push-T success rates with cleaner training dynamics.
Where does LeWM lose?
Three places. Two-Room navigation, where the true data manifold is roughly three-dimensional and the 192-dim isotropic Gaussian prior over-regularizes. OGBench-Cube, the visually rich 3D manipulation task where DINOv2's ImageNet-scale priors carry. And anything requiring long-horizon reasoning, where autoregressive latent rollouts compound prediction error. The paper's authors flag these explicitly.
Is LeWM a video generator?
No. LeWM is a planner, not a generator. It does not render pixels. The post-hoc decoder used in some figures is a separate, training-free transformer added after the model is trained, purely for visualization. The model itself only predicts the next 192-dimensional CLS token. Different paradigms imagine in different substrates: Sora dreams in megapixels, Genie 3 dreams in interactive scenes, LeWM imagines an arrow through a 192-dimensional cloud.