JEPA fails at 4Hz — adjacent frames are too similar to learn from. NeMo-WM replaces the gradient entirely with eight biological reward signals. AUROC 0.9999 · 26K params · 0.34ms · ~$2,000 edge hardware · 8W inference. 1411× faster than V-JEPA 2-L on identical hardware.
The neuromodulator is not a replacement for JEPA — it is an interpretability and grounding layer. Add it to DINO-WM, JEPA-WM, or V-JEPA 2-AC without changing your architecture. Eight scalar computations per batch. Zero new hyperparameters.
L_jepa is a single opaque number. DA converts it into a named verdict: trivially clamped vs genuinely surprising. Detectable in 200 training steps — before you waste days training.
Cortisol tracks rolling loss above baseline. Sprint 8d ablation: removing cortisol delays REOBSERVE onset more than 50× (step 28,500 vs 500) and slows per-epoch compression 3.5×. Empirically validated: lag-1 correlation with future loss, r=0.768.
JEPA learns visual dynamics but ignores GPS. NE adds a GPS prediction loss gated on spatial error — no encoder changes, no extra architecture. GPS displacement encodes at p=5.9e-05.
5HT penalises particle distribution collapse by measuring embedding diversity directly. Detects and corrects collapse before L_jepa reflects it. Replaces VICReg or SIGReg with a single biological signal.
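A diversity signal of this kind can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the NeMo-WM implementation: the function names, the choice of mean per-dimension standard deviation as the diversity measure, and the `target` value are all hypothetical.

```python
import math

def embedding_diversity(embeddings):
    """Mean per-dimension standard deviation across a batch of embeddings.

    embeddings: list of equal-length lists of floats (hypothetical layout).
    A value near zero means every embedding is almost identical (collapse).
    """
    n = len(embeddings)
    dims = len(embeddings[0])
    total = 0.0
    for d in range(dims):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        total += math.sqrt(variance)
    return total / dims

def collapse_penalty(embeddings, target=1.0):
    """Hinge penalty that grows as diversity falls below `target` (assumed value)."""
    return max(0.0, target - embedding_diversity(embeddings))

collapsed = [[1.0, 2.0]] * 4               # every particle identical
spread = [[-1.0, -1.0], [1.0, 1.0]]        # particles well separated
print(collapse_penalty(collapsed), collapse_penalty(spread))
```

Because the measure is computed on the embeddings themselves, it can react to collapse even while the prediction loss is unchanged.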
Fixed cosine schedules decay to zero regardless of what the model still needs to learn. DA peaked at step 1,081,000 — after 28 complete data passes — then sustained at 0.002 through the final six steps.
CLIP distilled into a 164K-parameter dual-head on the frozen encoder. Text-conditioned navigation at 4Hz. Works on any JEPA encoder — the backbone never changes.
NeMo-WM was designed for and runs natively on the CORTEX Perception Engine (multi-domain anomaly detection across six sensor domains) and the CORTEX World Model (neuromodulated navigation and planning). Fully compatible with any JEPA-based world model — and not limited to JEPA.
We tested NeMo-WM's 26,561-parameter proprioceptive encoder against Meta's V-JEPA 2 ViT-G (1034M parameters, internet-scale video pre-training) on the same RECON navigation benchmark. Visual scaling does not solve temporal self-localisation. Physics-grounded path integration does.
Evaluated on RECON outdoor robot navigation (Berkeley campus, Jackal robot, 4Hz, 545,866 samples) and five additional anomaly detection domains. All results on GMKtec EVO-X2 · AMD Ryzen AI MAX+ 395 · ~$2,000 · No GPU. Sprint 8d cortisol ablation confirmed: cortisol accelerates REOBSERVE onset 50× (step 500 vs 28,500) and provides 3.5× per-epoch compression advantage.
Every other world model uses a single prediction error to drive learning. NeMo-WM uses eight biologically-inspired neuromodulatory signals, each gating a different aspect of the loss. The JEPA prediction gradient contributed zero across 30 epochs — 1.12 million steps. Cortisol, the eighth signal, detects distribution shift one epoch ahead (r=0.768 lag-1, p<0.0001). Sprint 8d ablation confirmed: cortisol accelerates REOBSERVE onset 50× and provides 3.5× faster per-epoch compression. DA peaked at 0.003 at step 1,081,000 — training closed at peak arousal, never saturated.
In NeMo-WM, L_jepa is clamped at a free-bits floor of 0.5 for all 30 training epochs — contributing zero gradient across 1.12 million steps. Yet the predictor achieves 0.003 MSE at 2-second prediction horizons. The eight neuromodulatory signals drove that learning through GPS, contact, and Gaussian supervision — no explicit prediction objective required.
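Why a clamped loss contributes zero gradient can be shown in a few lines. This is a sketch that assumes the free-bits floor is applied as `max(loss, floor)`. The function name `free_bits` is hypothetical, and only the floor value 0.5 comes from the text above.

```python
def free_bits(loss, floor=0.5):
    """Return (clamped_loss, d_clamped/d_loss) for max(loss, floor).

    Below the floor the output is constant, so its derivative with
    respect to the raw loss is zero and no gradient flows back.
    """
    if loss > floor:
        return loss, 1.0
    return floor, 0.0

for raw in (0.3, 0.5, 0.8):
    clamped, grad = free_bits(raw)
    print(f"raw={raw:.1f} clamped={clamped:.1f} grad={grad:.1f}")
```

Any step where the raw prediction loss sits at or below 0.5 therefore leaves the parameters untouched by that term, which is the situation described for all 30 epochs.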
The non-saturation property: dopamine peaked at DA=0.003 at step 1,081,000 — after the system had seen the full dataset 28 times. Training closed at peak arousal. A fixed schedule would have decayed to zero. The biological reward responded to actual surprise regardless of training duration.
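For comparison, the sketch below evaluates a standard cosine decay at the step where DA peaked. It assumes the schedule spans the whole run, and it takes the total as 1,120,000 steps, which is the rounded "1.12 million" figure rather than an exact count. The function name is hypothetical.

```python
import math

def cosine_schedule(step, total_steps, peak=1.0):
    """Standard cosine decay from `peak` at step 0 to zero at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1_120_000    # assumption: "1.12 million steps", rounded
DA_PEAK_STEP = 1_081_000   # step of the DA peak, from the text

fraction_of_peak = cosine_schedule(DA_PEAK_STEP, TOTAL_STEPS)
print(f"schedule value at DA peak step: {fraction_of_peak:.4%} of its starting value")
```

Under these assumptions the fixed schedule has fallen below 1% of its starting value by the step at which the measured surprise signal was highest.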
This is the central finding: biological reward signals are sufficient to teach temporal world dynamics. JEPA becomes the evaluation framework, not the mechanism.
Using the AIM quantization framework (Liu, 2026), we converted NeMo-WM's K=16 particle embeddings to discrete symbol sequences and measured encoding of physical quantities via chi-squared tests. N=1,752 samples, 150 trajectories, 16 particles, 16 clusters. Ep28 canonical — eight physical signals confirmed.
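The kind of test involved can be sketched as a Pearson chi-squared statistic between two categorical sequences: the discrete symbol assigned to a particle and a binned physical quantity. This is a generic illustration, not the AIM framework's code. The function name, the toy data, and the idea of binning the physical quantity into labels are assumptions.

```python
from collections import Counter

def chi_squared(symbols, labels):
    """Pearson chi-squared statistic and degrees of freedom.

    symbols: discrete cluster IDs; labels: binned physical quantity.
    Both are equal-length sequences of hashable values.
    """
    n = len(symbols)
    joint = Counter(zip(symbols, labels))
    row_totals = Counter(symbols)
    col_totals = Counter(labels)
    stat = 0.0
    for s, row_count in row_totals.items():
        for l, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((s, l), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Toy data: symbols that track the label perfectly vs. not at all.
print(chi_squared([0, 0, 1, 1], ["slow", "slow", "fast", "fast"]))
print(chi_squared([0, 0, 1, 1], ["slow", "fast", "slow", "fast"]))
```

A p-value follows from the chi-squared distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available. A shuffled-label control of the kind reported as the null should give a statistic close to its degrees of freedom.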
Eight physical signals confirmed simultaneously. Ground-truth odometry (Jackal wheel encoders) is encoded independently of commanded velocity: the system encodes both what was asked and what actually happened. Temporal gap k is null at ep12 (p=0.345) and weakly encoded at ep28 (p=0.031), a training-dependent dissociation showing that representational equilibria shift with training duration. The null control (p=0.427) confirms calibration.
Unlike scalar loss objectives, NeMo-WM's eight neuromodulatory signals provide a continuous, human-readable training and inference narrative. No black box. No post-hoc explanations. The signals are the explanation.
DA=0.001 means mild surprise. 5HT=0.112 means representation health is good. ACh=0.445 means contact events are active. Every training step produces a readable narrative — not just a loss number.
The AIM probe independently confirms what the signals claim. High-DA batches correspond to measurable entropy increases in the quantized particle symbol distribution. The interpretability is verifiable, not assumed.
FDA, DoD, and industrial safety regulators require explainability. A cardiac anomaly detector that can explain which signal flagged an event, and why, is deployable where black-box models are not.
One model. ~$2,000 edge hardware — or Raspberry Pi for inference. No internet required. Sovereign AI that runs where the data is — not where the servers are.
NeMo-WM's proprioceptive encoder achieves AUROC 0.9999 using only velocity, angular rate, heading, and contact — the same signals a mammal uses when its visual cortex is lesioned. No camera. No GPS. No radio. No light required.
Rodents with visual cortex lesions still navigate familiar mazes. Head direction cells (heading), velocity afferents (wheel encoder), and proprioception (contact) maintain a spatial map entirely without visual input. McNaughton et al. 2006; Moser et al. 2008.
Heading signal dominates velocity at every timescale tested. HD:vel ratio ranges from ∞:1 (fine scale, k_pos=1) to 9:1 (k_pos=4). Removing heading lowers AUROC by up to 0.228. Removing velocity alone lowers AUROC by at most 0.010.
Every major world model in the literature was trained on GPU clusters drawing thousands of watts. NeMo-WM was trained at 45W, the power draw of a laptop, and infers at 8W on the AMD NPU. The entire training history consumed less electricity than a single GPU uses in a few hours.
At 8W, NeMo-WM can run on a battery pack. That means drones, field robots, wearables, and remote sensors — anywhere a GPU is not just impractical but physically impossible. A world model that needs 400W can never go in a drone. One that needs 8W can.
At 45W training power, NeMo-WM can be trained anywhere with a standard wall outlet. No data centre access required. No institutional infrastructure. A researcher with an $800 machine and an idea can reproduce these results tonight.
The entire training run — Sprints 1 through 3, hundreds of hours of continuous computation — consumed under 50 kWh. That is several orders of magnitude below comparable GPU-based world model training, and roughly equivalent to driving a car 15 miles.
8W NPU inference runs on portable power. Persistent world modelling without mains power or connectivity.
Under 50 kWh for the full training history. The environmental cost of training NeMo-WM is a rounding error compared to GPU clusters.
45W runs on a UPS battery backup. Researchers in locations with unreliable power infrastructure can train and deploy NeMo-WM.
$0.58/month for continuous inference. Industrial monitoring, cardiac surveillance, persistent navigation — previously cost-prohibitive use cases become trivial.
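The monthly figure can be checked with simple arithmetic. The sketch below assumes an electricity price of $0.10 per kWh and a 30-day month. Neither value is stated in the text, but together with the 8W inference draw they reproduce the quoted cost.

```python
def monthly_energy_cost(power_watts, price_per_kwh, days=30):
    """Energy (kWh) and cost for running a constant load around the clock."""
    kwh = power_watts * 24 * days / 1000.0
    return kwh, kwh * price_per_kwh

# 8 W inference draw from the text; price and month length are assumptions.
kwh, cost = monthly_energy_cost(power_watts=8, price_per_kwh=0.10)
print(f"{kwh:.2f} kWh per month, ${cost:.2f} per month")
```

At a different tariff the cost scales linearly, so a region paying three times the assumed rate would still spend under two dollars a month.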
I started coding in the 1970s on a TRS-80, sitting next to my dad. Then an Amiga. Then everything that came after. Fifty years of watching compute get faster, cheaper, and more capable — and learning that the interesting problems don't get easier with more hardware. They get different.
Formally I studied Toy Design — which sounds like a detour but isn't. Toy design is systems thinking under tight constraints: how does something work, who uses it, what happens when it breaks, how do you make it do more with less. That framing has followed me through software, product, SEO, clothing, and TV & film production.
Now I'm building NeMo-WM, a neuromodulated world model for edge AI. Eight biologically-inspired reward signals. 1.78M parameters. No GPU. It learns temporal dynamics without the standard JEPA gradient ever firing. The whole project is a toy design problem: maximum capability, minimum resources, runs anywhere.
At epoch 12, k=1 peaks at 0.9837. By epoch 28, k=2 overtakes it at 0.9578 — the system learned 0.5-second horizons are more discriminable than 0.25-second ones. Mechanistically linked to temporal gap k becoming encoded in the particles at epoch 28.
Temporal gap k is null at epoch 12 (p=0.345), encoded at epoch 28 (p=0.031). The expanded probe reveals eight simultaneous physical signals — including ground-truth odometry (p=1.3×10⁻¹⁸) distinct from commanded velocity. The system encodes both what was commanded and what actually happened.
After 28 complete passes through training data, dopamine reached its run peak of DA=0.003. The final six steps sustained DA=0.002. Training ended at peak arousal — never saturated. A fixed schedule would have decayed to near-zero by this point.
Slow-timescale signal tracking rolling loss excess above baseline. Empirically validated: r=0.768 lag-1 prediction of future loss (p<0.0001). Detected the Sprint 3 distribution shift one epoch ahead. Implemented in neuromodulator v16.12, all tests passing.
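A signal of this shape, and the lag-1 check used to validate it, can be sketched as follows. This is not the neuromodulator v16.12 code: the class name, the window length, and the fixed `baseline` argument are hypothetical choices made for illustration.

```python
from collections import deque

class CortisolTracker:
    """Rolling mean of the loss, reported as its excess above a baseline."""

    def __init__(self, baseline, window=100):
        self.baseline = baseline            # assumed fixed reference loss
        self.history = deque(maxlen=window)

    def update(self, loss):
        self.history.append(loss)
        rolling = sum(self.history) / len(self.history)
        return max(0.0, rolling - self.baseline)

def lag1_correlation(signal, loss):
    """Pearson r between signal[t] and loss[t+1]."""
    x, y = signal[:-1], loss[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

tracker = CortisolTracker(baseline=1.0, window=2)
print([tracker.update(loss) for loss in (1.0, 2.0, 3.0)])
```

In a validation of the kind described, `signal` would be the per-epoch cortisol value and `loss` the per-epoch loss, so a high `lag1_correlation` means the signal at one epoch anticipates the loss at the next.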
Dual-head architecture: frozen backbone, SemanticHead (98K) + CLIPBridge (65K). Sprint 6c InfoNCE: 9/9 navigation queries STRONG aligned (2.2–3.3×). Sprint 6d adds null repulsion to fix out-of-distribution rejection. 8,700× compression vs direct CLIP. No LLM required.
Training tracker shows 25/25/25/25 (aux loss enforces uniformity). Inference probe reveals genuine specialisation: by epoch 20, Expert 1 handles 77.8% of RECON decisions, Expert 3 handles 22.2%, Experts 0 and 2 completely excluded.
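An inference probe of this sort can be as simple as tallying which expert wins each routing decision. The sketch below assumes top-1 (argmax) routing over per-decision router scores. The function name and the toy scores are hypothetical.

```python
from collections import Counter

def expert_usage(router_scores):
    """Fraction of decisions won by each expert under top-1 routing.

    router_scores: list of per-decision score lists, one score per expert.
    """
    num_experts = len(router_scores[0])
    winners = Counter(
        max(range(num_experts), key=lambda i: row[i]) for row in router_scores
    )
    total = len(router_scores)
    return [winners.get(i, 0) / total for i in range(num_experts)]

# Toy scores for four experts over four decisions.
scores = [
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.8, 0.1, 0.1],
    [0.0, 0.7, 0.0, 0.3],
    [0.0, 0.2, 0.0, 0.8],
]
print(expert_usage(scores))
```

Running the same tally on held-out inference data, rather than reading the training-time tracker, is what exposes a split like the one reported above.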