P04 Two losses. One hyperparameter.
Inside LeWorldModel.
The world model that fits in fifteen million parameters, trains on one GPU, encodes each frame as a single 192-dim vector, and plans forty-eight times faster than its closest foundation-model rival. Built from a fifteen-page paper that does in two loss terms what every prior JEPA needed seven of.
Every prior Joint-Embedding Predictive Architecture has been a hyperparameter swamp. Exponential moving averages, stop-gradients, frozen DINOv2 encoders, seven-term VICReg objectives. LeWM throws all of it out and replaces it with one regularizer that asks a single question of the embedding cloud: does it look like an isotropic Gaussian from a thousand random angles?
§ I P04.1 · Why JEPA collapses
Predict the future. Or just predict zero.
A Joint-Embedding Predictive Architecture is a deceptively simple idea. Encode the present frame to a vector. Encode the future frame to a vector. Train a small predictor to map the present vector plus the action to the future vector. The model never has to render pixels, never has to reconstruct anything, and never has to pay the bill for modeling visual texture that does not matter for control.
The catch is that nothing stops the encoder from learning to map every frame to a single point. If both sides of the prediction collapse to the same constant vector, the prediction loss is satisfied perfectly. The encoder has learned nothing about the world, and the loss curve looks like a triumph.
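To make the failure mode concrete, here is a toy sketch (not from the paper; the shapes and the 192-d latent width are placeholders) in which a constant encoder and a pass-through predictor drive the prediction loss to exactly zero while learning nothing:

```python
# Toy illustration of JEPA collapse: a degenerate encoder that ignores its
# input still achieves a perfect prediction loss.
import torch
import torch.nn.functional as F

def constant_encoder(frames: torch.Tensor) -> torch.Tensor:
    # Maps every frame to the same point in a 192-d latent space.
    return torch.zeros(frames.shape[0], 192)

def passthrough_predictor(z_now: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Ignores the action; reproducing the constant is already "perfect".
    return z_now

frames_now  = torch.rand(8, 3, 224, 224)   # arbitrary observations
frames_next = torch.rand(8, 3, 224, 224)
actions     = torch.rand(8, 2)

z_next_pred = passthrough_predictor(constant_encoder(frames_now), actions)
z_next      = constant_encoder(frames_next)

print(F.mse_loss(z_next_pred, z_next).item())  # 0.0 — the loss curve looks like a triumph
```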
Every prior JEPA has paid for collapse-prevention with extra machinery. BYOL-style methods use an exponential moving average target encoder and a stop-gradient. DINO uses centering and sharpening. PLDM bolts on a seven-term VICReg-derived objective with six tunable weights, plus inverse dynamics, plus auxiliary smoothness. DINO-WM ducks the question entirely by not training its encoder. Every fix works, and every fix introduces brittleness: EMAs need warmup tuning, VICReg coefficients drift, and frozen foundation encoders force you to live inside whatever representation DINOv2 happened to learn.
- BYOL / V-JEPA family
- EMA target encoder + stop-gradient + asymmetric projector
- DINO / DINOv2
- EMA + centering + sharpening + multi-crop
- PLDM (closest competitor)
- 7 loss terms, 6 tunable coefficients, inverse dynamics, smoothness
- DINO-WM
- encoder is frozen DINOv2, not trained at all
- LeWM
- 2 loss terms, 1 effective hyperparameter (λ = 0.1)
§ II P04.2 · SIGReg, in pictures
One target distribution. A thousand random angles.
Collapse is a distributional pathology. A degenerate encoder maps everything to a point, a line, or a low-dimensional subspace. The cleanest fix is to specify a target distribution for the embedding cloud and force the encoder to match it. The natural choice is the isotropic Gaussian: maximally entropic given a fixed variance, rotationally symmetric, and with a closed-form characteristic function.
Directly testing high-dimensional normality is intractable. SIGReg dodges that with a two-step trick. First, the Cramér-Wold theorem (1936): a multivariate distribution equals N(0, I) if and only if every one-dimensional projection of it is N(0, 1). So instead of a D-dimensional normality test, sample M random unit directions and run M one-dimensional tests. Second, the Epps-Pulley test (1983): a 1-D normality statistic based on the empirical characteristic function. Bounded gradient. Bounded Hessian. Linear-time per projection. Friendly to SGD.
Drag the slider below. The left panel shows a 2D embedding cloud. The right panel shows the histogram of one random 1-D projection of that cloud, with the target N(0, 1) curve drawn in dashes. SIGReg does not push individual embeddings apart. It pushes the shape of the cloud toward a Gaussian.
The cloud (2D scatter)
One random 1-D projection
The math: why 1-D projections are enough
Cramér-Wold (1936): a probability measure P on ℝ^D is uniquely determined by the family of distributions of its 1-D projections {u · X : ‖u‖ = 1}. So P = N(0, I) iff every 1-D projection of P is N(0, 1).
SIGReg(Z) = (1/M) Σ_{m=1..M} T_EP(Z · u^(m)), with u^(m) ~ Uniform(S^{D−1}) resampled every step, where T_EP is the Epps-Pulley statistic.
With M = 1024 directions, the Monte-Carlo sketch is empirically excellent on these tasks. Per the appendix, the result is essentially insensitive to M; only λ matters. That single effective hyperparameter is why tuning collapses from a multi-dimensional grid search to a one-dimensional sweep over λ. PLDM has six coefficients; LeWM has one.
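To make that recipe concrete, here is a minimal PyTorch sketch of a SIGReg-style regularizer: random unit directions, 1-D projections, and a characteristic-function distance to N(0, 1) integrated over a small frequency grid. The function name, the grid, and the quadrature weights are illustrative assumptions, not the authors' released code; the paper uses the Epps-Pulley statistic, and this sketch approximates its integral by quadrature.

```python
# Sketch of a SIGReg-style regularizer: penalize the distance between the
# empirical characteristic function of each random 1-D projection and the
# characteristic function of N(0, 1), weighted by the standard normal density.
import torch

def sigreg_loss(z: torch.Tensor, num_dirs: int = 1024, num_freqs: int = 17) -> torch.Tensor:
    """z: (batch, dim) embeddings. Returns a scalar anti-collapse loss."""
    n, d = z.shape

    # Random unit directions on S^{d-1}, resampled at every call (Cramér-Wold sketch).
    u = torch.randn(d, num_dirs, device=z.device, dtype=z.dtype)
    u = u / u.norm(dim=0, keepdim=True)
    proj = z @ u                                    # (batch, num_dirs) 1-D projections

    # Frequency grid and N(0, 1)-shaped quadrature weights.
    t = torch.linspace(-4.0, 4.0, num_freqs, device=z.device, dtype=z.dtype)
    w = torch.exp(-0.5 * t**2)
    w = w / w.sum()

    # Empirical characteristic function of each projection, evaluated on the grid.
    tx = proj.unsqueeze(-1) * t                     # (batch, num_dirs, num_freqs)
    ecf_real = torch.cos(tx).mean(dim=0)            # (num_dirs, num_freqs)
    ecf_imag = torch.sin(tx).mean(dim=0)
    target = torch.exp(-0.5 * t**2)                 # CF of N(0, 1) is real

    # Weighted squared CF distance, averaged over the M random directions.
    sq_err = (ecf_real - target) ** 2 + ecf_imag ** 2
    return (sq_err * w).sum(dim=-1).mean()
```

In training this would sit next to the prediction term as `prediction_loss + 0.1 * sigreg_loss(z)`, with λ = 0.1 the single effective hyperparameter.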
- Loss terms
- 2 (prediction + SIGReg)
- Effective hyperparameter
- λ = 0.1
- Random projections per step
- M = 1024 (per appendix, insensitive)
- Statistic
- Epps-Pulley empirical characteristic function
- Theoretical foundation
- Cramér-Wold theorem (1936)
- Anti-collapse mechanism
- distribution-matching, not pairwise repulsion
- EMA / stop-gradient
- none
- Reconstruction head
- none
§ III P04.3 · The token economy
One CLS token. Or one fog of patches.
The cost of imagining the next frame depends almost entirely on how many tokens you choose to represent it with. Every world model on this page does the same job at a high level. Each one pays a different bill per timestep. The dot grids below are one-to-one: each dot is one token the model carries forward every time it rolls the future.
LeWM is a single accent dot. DINO-WM is a 14×14 patch grid. V-JEPA 2 is a larger grid. Genie 3 is so dense the dots blur into a fog. Sora is so dense we cannot draw it cell-by-cell on this card. The 48× planning speedup over DINO-WM does not come from a clever attention trick. It comes from this picture.
How the planning loop compounds the per-frame cost
LeWM plans by the Cross-Entropy Method in latent space. Each cycle: sample 300 candidate action sequences of horizon H = 5, roll each one autoregressively through the predictor, score by L2 distance to the goal latent, refit a Gaussian to the top 30 elites, repeat for 10 to 30 iterations, execute the first action, replan.
Total token-rollouts per planning cycle = 300 × H × iters. Multiply by tokens-per-frame and you have the actual compute budget. LeWM at 1 token/frame and DINO-WM at 196 tokens/frame have the same loop structure but a 196× difference per step. That is where the paper's reported ~1 s vs ~47 s per planning cycle on an L40S comes from.
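A minimal sketch of that loop, assuming a `predictor(z, action)` callable that maps a batch of latents plus actions to the next latents; the names, shapes, and defaults mirror the numbers above but are not the authors' planner:

```python
# Sketch of latent-space planning with the Cross-Entropy Method (CEM).
import torch

def cem_plan(z_now, z_goal, predictor, action_dim=2,
             num_samples=300, horizon=5, num_elites=30, iters=10):
    """z_now, z_goal: (1, latent_dim). Returns the first action to execute."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)

    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * torch.randn(num_samples, horizon, action_dim)

        # Roll each candidate autoregressively through the latent predictor:
        # one token per frame, per rollout, per step.
        z = z_now.expand(num_samples, -1)
        for t in range(horizon):
            z = predictor(z, actions[:, t])

        # Score by L2 distance to the goal latent, keep the elites, refit.
        costs = (z - z_goal).norm(dim=-1)
        elite_idx = costs.topk(num_elites, largest=False).indices
        elites = actions[elite_idx]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6

    return mean[0]   # execute the first action, then replan from the new frame
```

With these defaults, one planning cycle makes 300 × 5 × 10 = 15,000 single-latent predictor calls; the same loop at 196 tokens per frame pushes roughly 196× as many tokens through the predictor per cycle.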
- CEM candidates
- 300
- Planning horizon
- H = 5
- Elite fraction
- top 30 (10%)
- CEM iterations
- 10 to 30 per cycle
- LeWM plan time
- ~1 s / cycle (L40S)
- DINO-WM plan time
- ~47 s / cycle (L40S)
- Speedup
- ~48×
- Token ratio (LeWM : DINO-WM)
- 1 : 196
§ IV P04.4 · The architecture
Two modules. Both trained end-to-end.
The whole system is two transformers and two batch-norm projectors. The encoder is a ViT-Tiny; encoder and predictor together total about fifteen million parameters. There is no EMA target, no stop-gradient, no reconstruction head, no reward head, and no proprioceptive input. Gradients flow through everything.
Action conditioning is injected via AdaLN, the same scale-shift trick used in diffusion transformers. The AdaLN parameters are zero-initialized, so at the start of training the predictor ignores the action signal entirely and only gradually learns to use it. That single detail is the difference between stable training and divergence.
Why the BN-projector matters
The ViT's final LayerNorm standardizes each CLS token within the sample (zero mean, unit variance across features), which flattens exactly the cross-sample distributional shape that SIGReg is trying to enforce. A one-layer MLP projector with batch normalization, which normalizes across the batch instead, restores that structure. Without it, SIGReg pulls against LayerNorm and training stalls.
The AdaLN zero-initialization is borrowed from DiT (Peebles & Xie). At step zero, action conditioning has no effect and the predictor behaves as if no action were given; gradient flow gradually wakes up the action pathway. Empirically, this is the difference between converging cleanly and diverging in the first few hundred steps.
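A sketch of both details under assumed shapes; the module names, the hidden widths, and the 2-d action are placeholders, not the released code:

```python
# Sketch of the two training-stability details: a BatchNorm projector after the
# CLS token, and DiT-style AdaLN conditioning whose scale/shift head starts at zero.
import torch
import torch.nn as nn

class BNProjector(nn.Module):
    """Restores cross-sample distributional structure after the ViT's LayerNorm."""
    def __init__(self, dim: int = 192):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim))

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return self.proj(cls_token)                  # (batch, dim)

class AdaLNBlock(nn.Module):
    """One predictor block with zero-initialized action conditioning."""
    def __init__(self, dim: int = 192, action_dim: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(action_dim, 2 * dim)
        nn.init.zeros_(self.ada.weight)              # scale = shift = 0 at step zero:
        nn.init.zeros_(self.ada.bias)                # the action pathway starts silent

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        scale, shift = self.ada(action).chunk(2, dim=-1)
        h = self.norm(z) * (1 + scale) + shift       # action-conditioned modulation
        return z + self.mlp(h)
```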
- Total params
- ~15M (5M encoder + 10M predictor)
- Encoder
- ViT-Tiny, 12 layers, 3 heads, d=192, patch 14
- Predictor
- causal transformer, 6 layers, 16 heads, dropout 10%
- Per-frame embedding
- 1 CLS token, 192-d
- Action injection
- AdaLN, zero-initialized
- Context window
- N = 3 (Push-T, OGBench-Cube), N = 1 (Two-Room)
- Trained on
- 1 GPU, hours (not days)
- Target encoder
- same encoder, no EMA, no stop-grad
§ V P04.5 · What it does, what it does not
Wins on planning. Loses where pretraining still helps.
The strongest result is on Push-T. LeWM matches or beats DINO-WM there even though DINO-WM is given proprioceptive inputs that LeWM does not see, and it improves over PLDM (the closest end-to-end JEPA competitor) by eighteen percent. Beyond benchmark numbers, the latent space recovers physical quantities under linear and MLP probes with r ≈ 0.99 on Push-T agent x/y, block x/y, and block angle.
Two emergent behaviors are worth flagging. First, temporal latent path straightening: the cosine similarity between consecutive latent velocity vectors increases over training. Trajectories become geometrically straighter in latent space, a known signature in primate visual cortex. Second, violation-of-expectation: the model's surprise signal spikes for object teleportation but not for color changes. It is a model trained on no rewards and no labels that is more bothered by physical impossibility than by visual perturbation.
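The straightening metric itself is one line of linear algebra; a sketch, assuming a T × d latent trajectory as input (the function name is illustrative):

```python
# Sketch of the latent path-straightening metric: cosine similarity between
# consecutive latent velocity vectors along a rollout (1.0 = perfectly straight).
import torch
import torch.nn.functional as F

def path_straightness(latents: torch.Tensor) -> torch.Tensor:
    """latents: (T, dim) latent trajectory, T >= 3."""
    v = latents[1:] - latents[:-1]                          # (T-1, dim) velocities
    return F.cosine_similarity(v[1:], v[:-1], dim=-1).mean()
```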
Where LeWM wins
- Push-T (success rate): matches or beats DINO-WM, +18% over PLDM at the apples-to-apples comparison
- Reacher: competitive with foundation-model planners
- Token economy: 200× fewer tokens per frame than DINO-WM, ~28,000× fewer than Sora
- Planning speed: ~1 s/cycle vs DINO-WM's ~47 s, V-JEPA 2-AC's ~16 s, Cosmos's ~4 min
- Trainability: one GPU, hours, single hyperparameter to tune
- Physical probes: r ≈ 0.99 on agent / block coordinates from a 192-d CLS
- Latent straightening: more straightening than PLDM despite no explicit smoothness term
- Surprise tracks physics: spike on teleport, no spike on color change
Where LeWM loses
- Two-Room navigation: 2D task with intrinsic dim ~3 is over-regularized by a 192-d Gaussian prior
- OGBench-Cube: visually rich 3D scenes still favor DINO-WM's ImageNet-scale priors
- Long-horizon reasoning: autoregressive latent rollouts compound error
- Anything generative: does not render pixels; the post-hoc decoder is for visualization only
- Anything language: no tokens, no vocabulary, not the problem class
- Passive video: requires action labels; cannot learn from raw observation
- Uncertainty: deterministic dynamics, Euclidean cost in latent space
| Model | Predicts | Anti-collapse | Plan / step | Good at |
|---|---|---|---|---|
| Dreamer V3 | pixels + reward | reconstruction + KL | imagine + actor-critic | RL on games |
| DINO-WM | future DINOv2 patches | frozen pretrained encoder | ~47 s | visually rich planning |
| PLDM | future embeddings | 7-term VICReg | fast, unstable | simple low-D environments |
| V-JEPA 2-AC | masked latent patches | EMA + stop-grad | ~16 s | foundation video, post-trained |
| Genie 3 | next pixel frame | generative loss | 24 fps real-time | interactive worlds |
| Sora | spacetime patches | diffusion | minutes / minute | photoreal video |
| LeWM | next 192-d CLS | SIGReg (Gaussian match) | ~1 s | cheap pixel-to-action planning |
The two-room failure, in one sentence
Two-Room is a 2D navigation task whose true data manifold is roughly three-dimensional. Forcing a 192-d isotropic Gaussian prior on a manifold that small distorts the representation. The paper notes this explicitly: SIGReg is a strong prior, and on data that does not deserve a strong prior it can over-regularize. PLDM and DINO-WM, with weaker priors, win on Two-Room.
§ VI Methodology, sources, what this is and is not
Method
Five sections, each pinned to a specific section of the source preprint. The two interactive widgets in § II and § III run entirely client-side in vanilla JavaScript. The SIGReg scatter and projection histogram are sampled live from a controllable distribution. The token-economy isotype is a one-dot-per-token pictogram, capped at six hundred dots before switching to a density overlay.
The diagrams are hand-coded HTML/CSS grids and a small canvas widget. No charting library, no React, no build step. Page weight stays under one megabyte; only Cloudflare Web Analytics runs at the platform level. No tracking.
Sources
Primary: Maes, Le Lidec, Scieur, LeCun, Balestriero. "LeWorldModel", March 2026 preprint (arXiv pending). Numerical claims on this page (parameter counts, planning latencies, success deltas, hyperparameters) trace to that paper or its appendix.
Prerequisite: Balestriero & LeCun, "LeJEPA" (arXiv:2511.08544, 2025), where SIGReg is introduced and proved to prevent collapse under stated smoothness assumptions.
Comparators: DINO-WM (Zhou et al., ICML 2025), PLDM (Sobal et al., 2025), V-JEPA 2 (Meta), Titans (Behrouz et al., 2024), Mamba (Gu & Dao). Numbers cited from each paper's primary results table.
Caveats
This is a preprint (v1 March 2026, revised to v2 the same month) and has not yet completed peer review. "Provable anti-collapse" is provable for SIGReg in LeJEPA under smoothness assumptions; the world-model setting adds dynamics, and the paper does not extend the proof. Read it as "principled and empirically robust" rather than "formally guaranteed for the dynamics-conditioned case."
The 200× token-ratio and 48× planning speedup numbers are reported on a single hardware setup (NVIDIA L40S) against a particular DINO-WM configuration. Directionally robust, not tight fundamental ratios.
The benchmark suite (Two-Room, Reacher, Push-T, OGBench-Cube) is small. Generalization to large-scale 3D robotics, real-robot data, or natural video has not been demonstrated.
What this lab is not
Not a tutorial on JEPAs in general. The page focuses on the single architectural shift LeWM proposes: replacing brittle anti-collapse machinery with a principled distribution-matching regularizer. The broader JEPA family (BYOL, DINO, V-JEPA, I-JEPA, JEPA-2) is referenced for context, not surveyed.
Not a runnable model. Nothing on this page actually plans anything. The interactives illustrate the loss-shape and the token economy; they are not LeWM's encoder or predictor in the browser. For runnable code, see the authors' implementation when released.
For nine years, every JEPA recipe has paid an anti-collapse tax. Exponential moving averages, stop-gradients, frozen foundation encoders, multi-term VICReg objectives. Each fix worked. Each fix added a knob. The cumulative knob count on a working JEPA was large enough that a new researcher had to first learn the swamp, then learn the model.
LeWM is the first time someone has shown the swamp was optional. One regularizer that asks one question of the embedding cloud, from a thousand random angles. Does it look Gaussian? That question is enough.
What is striking about this paper is not that it proposes a new architecture. It does not. It proposes a new discipline around the loss. Specify what the embedding distribution should be, then enforce that. Stop bolting on anti-collapse mechanisms one at a time and hoping. The downstream effects (smaller models, faster planning, single-knob training, surprise that tracks physics) all flow from that one move.
FAQ
What is LeWorldModel?
LeWorldModel (LeWM) is a Joint-Embedding Predictive Architecture for action-conditioned world modeling, introduced in March 2026 by Maes, Le Lidec, Scieur, LeCun, and Balestriero (Mila, NYU, Samsung SAIL, Brown). It is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms and a single effective hyperparameter, replacing the brittle anti-collapse machinery (EMAs, stop-gradients, frozen pretrained encoders, multi-term VICReg objectives) used by prior JEPAs.
What is SIGReg?
SIGReg, or Sketched Isotropic Gaussian Regularization, is the regularizer that prevents representation collapse in LeWM. It pushes the encoder's embedding distribution toward an isotropic Gaussian using random one-dimensional projections (Cramér-Wold theorem) and the Epps-Pulley normality test. With M random projections per step (M = 1024 in the paper) the test runs in linear time, has bounded gradients, and parallelizes cleanly. It is a distribution-matching anti-collapse term, not a pairwise repulsion.
How does LeWM compare to DINO-WM and PLDM?
LeWM uses one CLS token per frame (192 dimensions); DINO-WM uses about 196 patch tokens per frame from a frozen DINOv2 encoder. LeWM plans about forty-eight times faster (one second per planning cycle versus forty-seven seconds for DINO-WM on an L40S GPU) and is end-to-end trainable, while DINO-WM depends on a frozen pretrained encoder. Versus PLDM, the only previous end-to-end JEPA world model, LeWM collapses six tunable loss coefficients into one and achieves higher Push-T success rates with cleaner training dynamics.
Where does LeWM lose?
Three places. Two-Room navigation, where the true data manifold is roughly three-dimensional and the 192-dim isotropic Gaussian prior over-regularizes. OGBench-Cube, the visually rich 3D manipulation task where DINOv2's ImageNet-scale priors carry. And anything requiring long-horizon reasoning, where autoregressive latent rollouts compound prediction error. The paper's authors flag these explicitly.
Is LeWM a video generator?
No. LeWM is a planner, not a generator. It does not render pixels. The post-hoc decoder used in some figures is a separate, training-free transformer added after the model is trained, purely for visualization. The model itself only predicts the next 192-dimensional CLS token. Different paradigms imagine in different substrates: Sora dreams in megapixels, Genie 3 dreams in interactive scenes, LeWM imagines an arrow through a 192-dimensional cloud.