The paradox runs local.
1865: Britain fears running out of coal, and the obvious fix is efficiency. Jevons looks at the logic and calls the opposite. This week NVIDIA split a model in two and the same loop started spinning in silicon.
Britain is afraid of running out of coal. The obvious fix: better engines; burn less; stretch the reserve. William Stanley Jevons looks at the logic and calls the opposite. Efficiency makes energy cheaper; cheaper energy invites use. Better engines won't save the coal; they'll eat it faster. He was right. History filed the receipt.
Same loop, spinning up again. This week, in silicon.

They split the model
NVIDIA shipped Nemotron TwoTower yesterday. Two networks, one backbone. The first tower frozen, autoregressive; holds the context, everything already committed. The second a diffusion denoiser; refines whole blocks in parallel, commits multiple tokens per step. 2.42x throughput; 98.7% of baseline quality. And the denoiser trained on ~2.1T tokens; roughly 8% of the 25T that built the backbone. You don't start over; you bolt parallel generation onto what you already own. The weights are open, the paper's on arXiv, and NVIDIA's framing writes its own Jevons epigraph: it turns the economics from a cliff to a ramp.
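To make the split concrete, here is a toy sketch of the decoding loop in Python. It is not NVIDIA's implementation: `generate`, `toy_denoiser`, the block size, and the number of refinement passes are all hypothetical stand-ins, and the "denoiser" just fills masked slots with placeholder strings. It shows only the control flow described above: the committed context grows block by block, and a single pass can settle several tokens at once.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# (committed context, current block, pass index) -> refined block
Denoiser = Callable[[List[str], List[str], int], List[str]]


def toy_denoiser(context: List[str], block: List[str], step: int) -> List[str]:
    """Hypothetical stand-in for the diffusion tower: fill every masked slot at once."""
    base = len(context)
    return [f"tok{base + i}" if tok == MASK else tok for i, tok in enumerate(block)]


def generate(
    prompt: List[str],
    new_tokens: int,
    block_size: int,
    passes_per_block: int,
    denoiser: Denoiser,
) -> Tuple[List[str], int]:
    """Grow the committed context one block at a time; return (tokens, pass count)."""
    committed = list(prompt)  # what the frozen context tower would see
    passes = 0
    produced = 0
    while produced < new_tokens:
        size = min(block_size, new_tokens - produced)
        block = [MASK] * size  # a fresh, fully masked block
        for step in range(passes_per_block):
            block = denoiser(committed, block, step)  # refine the whole block in parallel
            passes += 1
        committed.extend(block)  # commit several tokens in one go
        produced += size
    return committed, passes


if __name__ == "__main__":
    out, passes = generate(["Once", "upon"], new_tokens=10, block_size=4,
                           passes_per_block=2, denoiser=toy_denoiser)
    print(len(out) - 2, "new tokens in", passes, "denoiser passes")
```

With these made-up settings the toy commits 10 tokens in 6 passes, where one-token-at-a-time decoding would need 10. Pass counts are not wall-clock time; the 2.42x figure is a measured throughput and does not follow from this arithmetic.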
Three weeks earlier: DiffusionGemma. Google's model, NVIDIA's optimization. 256 tokens denoised per step; up to 4x faster; 1,000 tokens/sec on a single H100, 150 on a DGX Spark sitting on a desk. No cloud. No meter.
Jevons walks back in
The argument against local inference was always efficiency itself. Your GPU idles most of the day; the datacenter's never does; let the cloud amortize the silicon. Every one of these releases weakens that argument. A thousand tokens a second on a single H100, 150 on a box that sits on a desk; "good enough" is arriving locally years ahead of schedule.
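The amortization argument can be put in numbers with a rough sketch. Everything here is an assumption for illustration: the function name, the $4,000 price, the three-year life, the 10% utilization, and the 40 tokens/sec "before" figure are hypothetical; only the 150 tokens/sec comes from the DGX Spark number above. Electricity is ignored.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cost_per_million_tokens(
    hardware_cost: float,
    lifetime_years: float,
    utilization: float,
    tokens_per_sec: float,
) -> float:
    """Hardware price spread over every token generated during the machine's life."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_seconds = lifetime_years * SECONDS_PER_YEAR * utilization
    total_tokens = busy_seconds * tokens_per_sec
    return hardware_cost / total_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical desk machine that sits idle 90% of the time.
    before = cost_per_million_tokens(4000, 3, 0.10, tokens_per_sec=40)
    after = cost_per_million_tokens(4000, 3, 0.10, tokens_per_sec=150)
    print(f"before: ${before:.2f} per million tokens")
    print(f"after:  ${after:.2f} per million tokens")
```

The point of the sketch is the shape, not the dollars: at a fixed (low) utilization, amortized cost falls in direct proportion to throughput, so a faster decoder erodes the idle-GPU objection without anyone using the machine more.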
And cheap inference does what cheap energy did. We don't use less; we use absurdly more. Agent loops that run all night because nothing is metered. Drafts discarded a hundred times before anyone sees one. Ideas too expensive at cloud prices become weekend projects at local prices. The efficiency gain doesn't shrink the appetite; it feeds it. Jevons, verbatim.
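Whether cheaper tokens mean more total consumption depends on how strongly demand responds. A minimal sketch, assuming a constant-elasticity demand curve (usage proportional to cost raised to the power of minus the elasticity) and assuming the resource burned per token scales with its cost; the function name and the elasticity values are illustrative, not measured:

```python
from typing import Tuple


def rebound(cost_ratio: float, elasticity: float) -> Tuple[float, float]:
    """Return (usage multiplier, total-resource multiplier) after a cost change.

    cost_ratio: new cost per token divided by old cost (0.5 means a halving).
    elasticity: assumed price elasticity of demand, as a positive number.
    """
    usage = cost_ratio ** (-elasticity)  # cheaper tokens, more tokens demanded
    resource = cost_ratio * usage        # per-token burn times token count
    return usage, resource


if __name__ == "__main__":
    for e in (0.5, 1.0, 1.5):
        usage, resource = rebound(0.5, e)
        print(f"elasticity {e}: usage x{usage:.2f}, total resource x{resource:.2f}")
```

Under these assumptions a halving of cost always raises usage, but total resource consumption rises only when elasticity exceeds 1. That threshold is the Jevons case; below it, efficiency really does save the coal.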
The pattern, named
Optimize a resource and you expand the demand for it. But note what gets consumed. Coal's loop drained a finite reserve. This loop burns electricity and produces experiments. Every halving of cost-per-token widens the circle of who gets to tinker: labs, then startups, then one person with a gaming card and a stubborn idea. 1865 depleted something. 2026 might democratize something.
The engine got more efficient. Demand is about to do what demand does.
Jevons would recognize every beat, except, maybe, the ending.
Sources
- Tech Times, 07.02.26 — Nemotron TwoTower: 2.42x throughput without retraining
- arXiv 2606.26493 — Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
- Hugging Face — nvidia/Nemotron-TwoTower collection (open weights)
- NVIDIA RTX AI Garage, 06.10.26 — DiffusionGemma optimized for RTX / DGX Spark
Benchmark caveat: parallel block denoising trades a few points on strictly sequential tasks; HumanEval 79.27 → 75.58 at the default operating point. Reported transparently in the paper.
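For scale, the caveat's arithmetic in a few lines of Python. The two scores are the ones quoted above; the helper name is an illustration only.

```python
def retained(baseline: float, variant: float) -> float:
    """Fraction of the baseline score that the variant keeps."""
    return variant / baseline


baseline, variant = 79.27, 75.58  # HumanEval scores quoted in the caveat
drop = baseline - variant

if __name__ == "__main__":
    print(f"drop: {drop:.2f} points")
    print(f"retained: {retained(baseline, variant):.1%}")
```

That is a 3.69-point drop, or roughly 95% of the baseline score retained on this task, against the 98.7% headline figure.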