
(2/3) What Actually Drives the Rate of Return on Compute

By Marc Molas · May 26, 2026 · 9 min read

In Part 1 we walked through Sara Hooker's case that the bigger-is-better era is ending. The natural follow-up question, and the one Hooker spends most of the essay on, is this: if compute is no longer the dominant lever, what is?

Her answer: it's the rate of return on a unit of compute that matters now, and that rate is being driven by four things, only one of which is "more parameters." Let's go through them in order, because every one of them touches decisions I actually face as a CTO: which models to pick, how to design training pipelines, and what infrastructure to budget for in 2026.

1. Parameters: diminishing returns, then weirdness

In 2016 Inception had 23M parameters. In 2025 Qwen3-235B-A22B has 235 billion. That four-orders-of-magnitude jump bought real gains for a while. It has also exposed a deeply uncomfortable fact: we do not understand why we need most of those weights.

Hooker cites a body of work that makes this concrete:

You can remove the majority of trained weights after training with minimal performance loss (Gale et al. 2019; Han et al. 2015; Evci et al. 2019; Hooker et al. 2020). This is the well-known sparsity / pruning result.
But — and here's the puzzle — you cannot reach the same performance if you start with the smaller network in the first place. The extra weights are doing something during training that they aren't doing at inference.
Denil et al. (2014) showed a small set of weights can be used to predict 95% of the weights in a network. The space is enormously redundant.
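
To make the pruning result concrete, here is a minimal sketch of one-shot magnitude pruning: zero out the smallest weights by absolute value and keep the rest. It is an illustration of the mechanics only. The function name `magnitude_prune` and the toy weight vector are my own assumptions, the cited papers use more careful schedules than a single pass, and nothing here demonstrates the accuracy claim itself.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Hypothetical "trained" weights: a few large values, many near zero.
weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, 0.003, -0.3, 0.08, 1.2]
pruned = magnitude_prune(weights, sparsity=0.6)
kept = sum(1 for w in pruned if w != 0.0)
print(pruned, kept)
```

With 60% sparsity, only the four largest-magnitude weights survive. The puzzle in the text is that training a four-weight network from scratch would not, in general, land on the same solution.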

The simplest explanation is uncomfortable: deep nets are incredibly inefficient learners of the long tail. Frequent patterns are learned early and cheaply. Rare ones — exactly the ones that make a model feel "smart" on edge cases — require a disproportionate share of the compute and a disproportionate share of the weights, largely because we train with average-loss minimization and equal exposure across examples. The signal of rare attributes is diluted in batch updates.
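
The dilution argument is just arithmetic, and a toy batch shows it. The sketch below assumes scalar per-example gradients and a made-up batch of 256 examples where only one rare example still has anything to teach. The upweighting factor of 16 is a hypothetical illustration of prioritization, not something Hooker prescribes.

```python
def mean_gradient(per_example_grads, weights=None):
    """Weighted mean of scalar per-example gradients (uniform by default)."""
    if weights is None:
        weights = [1.0] * len(per_example_grads)
    total_weight = sum(weights)
    weighted_sum = sum(w * g for w, g in zip(weights, per_example_grads))
    return weighted_sum / total_weight


# Hypothetical batch: 255 frequent examples the model already fits
# (gradient 0.0) and one rare example with a large gradient (1.0).
batch = [0.0] * 255 + [1.0]

uniform = mean_gradient(batch)                            # 1/256
upweighted = mean_gradient(batch, [1.0] * 255 + [16.0])   # 16/271

print(f"uniform update:    {uniform:.5f}")
print(f"upweighted update: {upweighted:.5f}")
```

Under plain averaging, the rare example moves the weights by 1/256 of its own gradient. That is the sense in which the long tail is expensive: the model needs many passes, or many more parameters, to accumulate the same signal.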

The fair counter-argument: scaling still works. Every frontier model of the past year was trained at enormous scale, and the labs writing the biggest cheques can read a loss curve. Conceded. But "still works" is not the same claim as "best return on the next unit of compute" — and the second is the one your budget actually answers to.

Hooker calls this "building a ladder to the moon" — technically progressing, but at a cost structure that cannot keep paying off. If you accept that diagnosis, the next three levers are not optional optimizations. They are the actual frontier.

2. Data quality: the lever everyone underspends on

Data quality compensates for compute. Hooker assembles a large body of evidence — deduplication, data pruning, data prioritization — showing that better-curated training corpora reduce the parameter count needed to hit a given capability bar. Per Marion et al. (2023), Penedo et al. (2023), Singh et al. (2024b), and others, smaller datasets curated well can match or beat larger ones used naively. Training time drops directly, and the compute saving is structural, not incremental.
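
As a sense of what the cheapest form of curation looks like, here is a minimal sketch of exact deduplication after light normalization, using only the Python standard library. The `normalize` rules and the sample corpus are assumptions for illustration. Real pipelines usually add near-duplicate detection (MinHash is a common choice), which this sketch does not attempt.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs):
    """Keep the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical corpus with three trivially different copies of one sentence.
corpus = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",
    "A different sentence.",
    "The cat sat on the mat.\n",
]
clean = deduplicate(corpus)
print(len(corpus), "->", len(clean))
```

Half of this toy corpus disappears, and every training step spent on the removed copies would have been compute with no new information in it.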

Why does industry chronically underspend here? Three reasons that are familiar to anyone who's run an ML team:

Curation work doesn't quarterly-plan well. "Cleaning data" is a verb that doesn't fit on a roadmap slide. "Train a 10× bigger model" does.
Compute is buyable; curated data is built. You can wire money to NVIDIA and have GPUs next quarter. You cannot wire money and have a clean, deduplicated, balanced, license-clear corpus next quarter.
The success metrics get gamed. Benchmark improvements from data quality look identical to benchmark improvements from scale on a chart, so the credit goes to whoever was loudest about scaling, not to the data team that quietly did the deduplication.

The shift Hooker describes — from data as a frozen snapshot (MNIST, ImageNet, SQuAD) to data as a malleable, optimized object — is one of the most important paradigm changes in the essay. It's also where the most asymmetric returns exist for teams that don't have hyperscaler budgets but do have domain expertise. We'll come back to this in Part 3 under "the malleable data space."

3. Algorithmic techniques: the silent compounding

The third lever is the most underrated, mostly because it doesn't come in one big breakthrough but as a continuous trickle of techniques that each individually look like minor optimizations. Hooker enumerates a partial list of what's compensated for raw compute over the last few years:

Instruction fine-tuning. Teaching models to follow instructions on top of pre-training.
Distillation from larger teachers. A capable small "student" trained on synthetic data from a bigger "teacher" can approximate the teacher at a fraction of inference cost.
Chain-of-thought reasoning. A prompting and training pattern that improves multi-step performance with little or no additional training compute.
Increased context length. Architectural and attention changes that let the same model condition on far more information at inference time.
Retrieval-augmented generation. Outsource the long tail of facts to a retrieval layer. Reduces hallucination, reduces the need to memorize, reduces the parameter pressure.
RLHF and preference training. Constitutional AI, DPO, RLOO and other variants substantially change behavior without proportionally more parameters.
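
Of the techniques above, distillation is the easiest to show in a few lines. The sketch below is the classic soft-target formulation, where the student is penalized for diverging from the teacher's temperature-softened output distribution. That is a related but different setup from training the student on teacher-generated synthetic data, as described in the list. The logits are made up, and the sketch omits the temperature-squared scaling that is often applied to this loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

The student that mimics the teacher's full distribution, not just its top answer, gets the lower loss. That richer signal per example is one reason a small student can get close to a much larger teacher.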

Davidson et al. (2023) estimate that techniques of this kind, applied after pre-training, can deliver gains equivalent to 5× to 20× more training compute. That number is worth sitting with. A 10× compute-equivalent improvement that requires no new pre-training run is the kind of thing that breaks "bigger model next year" roadmaps.

For engineering teams the practical lesson is: most of your AI roadmap should be algorithmic, not capacity-driven. You will get more leverage from adding a properly implemented retrieval layer, a verification pass, a distilled task-specific model, or a chain-of-thought prompt structure than you will from waiting for the next model size class.
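
To show how small the first version of a retrieval layer can be, here is a toy sketch that ranks documents by word overlap with the query and builds a grounded prompt from the best matches. Everything in it is an assumption for illustration: the function names, the Jaccard scoring, the prompt wording, and the sample documents. A production system would typically use embeddings and a vector index instead of token overlap.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by Jaccard overlap with the query; return the top k."""
    q = tokens(query)
    scored = []
    for doc in documents:
        d = tokens(doc)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the query at all.
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Assemble a prompt that asks the model to answer from retrieved text."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"


# Hypothetical knowledge base.
docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Return requests must be filed within 30 days of delivery.",
]
top = retrieve("How many days do refunds take?", docs, k=1)
print(build_prompt("How many days do refunds take?", docs, k=1))
```

The model no longer has to memorize the refund policy. It only has to read it, which is a much cheaper capability to buy.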

4. Architecture: the ceiling-setter

Architecture is the lever everyone underestimates because it moves rarely. But when it moves, it resets every scaling law that came before it. Hooker is direct:

"A new architecture design can fundamentally change the relationship between compute and performance and render any existing scaling law irrelevant."

We have the historical receipts. CNNs changed the relationship for vision (Ciresan et al. 2011; Krizhevsky et al. 2012; Szegedy et al. 2014). Transformers changed the relationship for language (Vaswani et al. 2017). Each of those was a paradigm shift that made the previous compute-performance curves obsolete and unlocked an entire decade of follow-up work.

We are almost certainly due for another. The essay is blunt that current architecture "shows all the signs of plateauing in returns from additional compute" and that "the next significant step forward will require an entirely different architecture." Deep nets are particularly bad at:

Continual learning — they suffer catastrophic forgetting when new data interferes with old behaviors.
Specialization of knowledge — global gradient updates don't carve out regions of competence the way biological systems do.
Sample efficiency — they need vastly more examples than a human child does for comparable tasks.

A new architecture that fixes even one of these would re-shuffle the entire landscape. Which is why concentrating all capital expenditure on scaling the current architecture is, in Hooker's framing, under-investing in the most likely source of the next jump.

What this changes for engineering leaders

Pulling these four levers together, here's what I'd take into a planning conversation in Q3 2026:

Stop ranking models by parameter count. Rank them by capability-per-token-per-dollar on your actual task mix. The correlation between parameter count and that ratio is now weak.
Move data engineering up the org chart. If you don't have a senior person owning data curation, deduplication, license compliance, and prioritization, you're leaving the biggest free lever on the floor.
Treat algorithmic improvements as the default first move. Before commissioning a fine-tune or a bigger model deployment, exhaust the cheaper options: retrieval, prompt structure, verification passes, distillation, tool use, chain-of-thought. Most teams give up on this layer too early.
Track architecture shifts seriously. When the next post-transformer architecture lands (and it will), the teams that have over-invested in transformer-shaped infrastructure — pipelines, ops, vendor commitments — will be slowest to adapt. Architectural diversity in your stack is a hedge.
Don't confuse "AI strategy" with "model selection." The model is one decision among many. The data, the retrieval, the verification, the human-in-the-loop design — those are where the differentiated work happens.
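
The first recommendation is easy to operationalize. Here is a minimal sketch that ranks candidate models by correct answers per dollar on your own eval set, a simplified stand-in for the capability-per-token-per-dollar ratio described above. The class name, field names, and every number are made up for illustration. In practice you would plug in accuracy and cost measured on your actual task mix.

```python
from dataclasses import dataclass


@dataclass
class ModelEval:
    name: str
    params_b: float          # parameter count in billions
    task_accuracy: float     # fraction correct on your own eval set
    usd_per_1k_tasks: float  # measured cost to run 1,000 eval tasks

    @property
    def correct_per_dollar(self):
        """Expected correct answers bought per dollar spent."""
        return (self.task_accuracy * 1000) / self.usd_per_1k_tasks


# All figures below are hypothetical.
candidates = [
    ModelEval("large-model", 400, 0.92, 40.0),
    ModelEval("mid-model", 70, 0.89, 9.0),
    ModelEval("small-tuned", 8, 0.85, 1.5),
]

ranked = sorted(candidates, key=lambda m: m.correct_per_dollar, reverse=True)
for m in ranked:
    print(f"{m.name}: {m.correct_per_dollar:.1f} correct answers per dollar")
```

In this invented example the ranking by value is the exact reverse of the ranking by parameter count. A real comparison also needs a minimum accuracy bar, since a cheap model that falls below what the task requires is not a bargain.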

Hooker's framing — rate of return on a unit of compute — is the right one to internalize. It moves the conversation away from "how big" and toward "how much capability per unit of cost, and what are the levers that move it." That is a conversation engineering teams can actually win, and one CFOs can actually price.

Next in this series: Beyond scaling — the new optimization spaces for AI progress. Gradient-free methods, inference-time compute as a first-class lever, the malleable data space, agentic systems, and what the death of scaling does (and doesn't) mean for environmental impact.
