The Slow Death of Scaling: Why Bigger Is No Longer Always Better
Sara Hooker, formerly head of Cohere For AI and one of the few researchers with a foot in both the industry and academic camps, has published an essay called "On the slow death of scaling." It engages with a question that, for most of the last decade, has been treated as already answered: is bigger always better?
The honest answer, she argues, is no. And the consequences of having assumed otherwise are larger than most teams — and most regulators — have begun to reckon with. This is the first post of a three-part series unpacking the essay and what it means for anyone shipping or governing AI in 2026.
The decade that made "scale" a synonym for "progress"
The story Hooker tells starts with an accident. In 1945 Percy Spencer noticed a chocolate bar melting in his pocket near a radar magnetron tube, and we got the microwave. In the 2000s, GPUs, a class of hardware that grew out of video-game graphics, were repurposed for matrix multiplication, and we got deep learning. A 2012 Google paper used 16,000 CPU cores to classify cats; a year later, the same task was solved with two CPU cores and four GPUs.
That moment ignited a rush of compute and, with it, a culture. Ken Thompson's old joke, "when in doubt, use brute force," was elevated into Rich Sutton's bitter lesson: throw more compute at the problem, and human knowledge engineering keeps losing. Between 2017 and 2023, training costs grew by orders of magnitude: GNMT cost ~$100K to train, while Gemini Ultra crossed $100M, a jump of roughly a thousandfold on those two figures alone. The "formula" became: scale model size and training data, repeat.
The capital implications were enormous. Frontier research migrated out of academia and into a handful of industry labs. Hooker cites the geography directly: notable ML model output is now concentrated in the US and China to a degree that would have been unthinkable in 2010. Open publication culture has collapsed in parallel. Industry labs have stopped publishing not because the science got harder to write down, but because the moat moved from algorithms to capex.
The evidence that the assumption is breaking
Here's where the essay gets uncomfortable for everyone whose roadmap depends on the bigger-is-better dogma being right.
Hooker plots the Open LLM Leaderboard over two years. The trend is not subtle:
- Falcon 180B — once frontier — is easily outperformed by Llama-3 8B, Command R 35B, and Gemma 2 27B.
- Aya 23 8B and Aya Expanse 8B beat BLOOM 176B despite having 4.5% of the parameters.
- The best models under 13B routinely beat far larger ones submitted in the same window.
These are not edge cases. They are the dominant trend on a public benchmark over a multi-year period. If "bigger" still implied "better" in a meaningful, reliable way, none of this would be happening. What we are seeing is that the rate of return on a unit of compute is shifting, and the shift is being driven by things other than raw parameter count — data quality, algorithmic technique, architectural choices. We'll get into those in Part 2.
Why scaling laws have been oversold
The dominant intellectual justification for the bigger-is-better trajectory has been scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, the Chinchilla paper; Hernandez et al.), which try to predict how loss decreases as compute, data and parameters grow. They became, in Hooker's words, "a catchall phrase to justify everything from massive capital investments in AI startups to policy decisions about compute thresholds."
But the essay catalogues, with citations, a set of caveats that should make anyone using scaling laws for anything beyond a single planned training run nervous:
- They mostly predict pre-training test loss, not downstream capabilities — and the relationship between the two is "murky or inconsistent." This is the emergent properties discussion, which Hooker reframes wryly: emergent properties are just our admission that the scaling laws didn't predict what came out.
- They have been hard to replicate under slightly different assumptions about the data distribution (Besiroglu et al. 2024 on Chinchilla; Anwar et al. 2024).
- Many "power laws" rest on fewer than 100 data points (Ruan et al. 2024). In any other field this would not pass review.
- Some downstream capabilities scale erratically or do not follow power laws at all (Srivastava et al. 2023; Caballero et al. 2023).
- They hold best when architecture, optimizer and data quality stay constant — exactly the conditions least likely to hold over a multi-year planning horizon.
The honest reading is that scaling laws are useful for planning the next training run within a known regime and not much more. Treating them as a load-bearing prediction about the trajectory of AI capability over years was always a stretch.
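To make the extrapolation caveat concrete, here is a minimal sketch, assuming a purely synthetic loss curve of the form loss = a * compute^(-b) with a little multiplicative noise. The constants, the noise level, and the helper names (`fit_power_law`, `bootstrap_interval`) are illustrative assumptions, not taken from Hooker's essay or from any of the cited papers. It fits the power law on eight points in log-log space, then bootstraps the fit to compare how uncertain the prediction is inside the fitted range versus four orders of magnitude beyond it.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
    return 10.0 ** intercept, -slope  # (a, b)


def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


def bootstrap_interval(compute, loss, target, n_boot=500, seed=0):
    """5th and 95th percentile of the predicted loss at `target` across refits."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, compute.size, size=compute.size)
        if np.unique(idx).size < 2:
            continue  # a line cannot be fitted through a single distinct point
        a, b = fit_power_law(compute[idx], loss[idx])
        preds.append(predict(a, b, target))
    return np.percentile(preds, [5, 95])


# Synthetic "training runs": all values below are made up for illustration.
TRUE_A, TRUE_B = 50.0, 0.05
noise_rng = np.random.default_rng(42)
compute = np.logspace(18, 21, 8)  # eight runs between 1e18 and 1e21 FLOPs
loss = predict(TRUE_A, TRUE_B, compute) * np.exp(noise_rng.normal(0.0, 0.02, compute.size))

a_hat, b_hat = fit_power_law(compute, loss)
near = bootstrap_interval(compute, loss, target=3e19)  # inside the fitted range
far = bootstrap_interval(compute, loss, target=1e25)   # far outside the fitted range

print(f"fitted exponent b = {b_hat:.3f} (true value {TRUE_B})")
print(f"relative spread inside range:  {near[1] / near[0]:.3f}")
print(f"relative spread at 1e25 FLOPs: {far[1] / far[0]:.3f}")
```

Even in this best case, where the data really does follow a power law and nothing about the architecture or data changes, the spread of plausible predictions widens as the target moves away from the fitted range. Real multi-year forecasts also have to absorb regime changes that a fit like this cannot see.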
The policy problem this creates
This is where the essay starts to matter for anyone not actually training frontier models, which is most of us. Regulation has been built on top of the bigger-is-better assumption. The EU AI Act, the US executive orders, and the wave of compute-threshold language in 2024–25 legislation all share a structural premise: that training compute (FLOPs at training time, or by proxy, hardware access) is the best indicator of capability and therefore risk.
If Hooker is right — and the empirical evidence she presents is hard to wave away — then compute thresholds:
- Miss small-but-capable models entirely. An 8B model that outperforms a 180B model on harmful capabilities won't trip any FLOP-based threshold.
- Over-regulate large but underperforming models, creating compliance cost for capability that doesn't exist.
- Will age badly as inference-time compute, agentic systems and gradient-free techniques (Part 3) shift where capability actually accrues.
- Concentrate power further by writing the current oligopoly's scale assumptions into law.
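The first two failure modes in the list above can be illustrated with a rough calculation. The sketch below uses the common approximation that training compute for a dense transformer is about 6 × parameters × training tokens, and a 10^25 FLOP cutoff of the kind the EU AI Act applies to general-purpose models. The two models, their token counts, and their benchmark scores are hypothetical placeholders chosen for illustration, not measurements of any real system.

```python
from dataclasses import dataclass

# Assumption: a compute cutoff of the kind used in the EU AI Act.
THRESHOLD_FLOP = 1e25


@dataclass(frozen=True)
class Model:
    name: str
    params: float      # parameter count
    tokens: float      # training tokens
    eval_score: float  # hypothetical benchmark score in [0, 1]


def training_flops(model: Model) -> float:
    """Rough estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * model.params * model.tokens


def is_regulated(model: Model, threshold: float = THRESHOLD_FLOP) -> bool:
    """True if the estimated training compute meets or exceeds the cutoff."""
    return training_flops(model) >= threshold


# Hypothetical models: every number here is made up for illustration.
small = Model("small-8B", params=8e9, tokens=15e12, eval_score=0.70)
large = Model("large-180B", params=180e9, tokens=10e12, eval_score=0.60)

for m in (small, large):
    print(f"{m.name}: {training_flops(m):.2e} FLOPs, "
          f"score={m.eval_score:.2f}, regulated={is_regulated(m)}")
```

Under these assumed numbers the more capable model falls below the cutoff while the weaker one crosses it, which is the mismatch between compute and capability that the essay describes.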
The Anthropic and OpenAI "responsible scaling policies" inherit the same baked-in assumption: that scaling will keep happening and the only open question is how to scale responsibly. Hooker's challenge is more uncomfortable: what if scaling isn't the only — or even the most interesting — axis of progress?
What this means if you're shipping product, not policy
The implications cascade downward. If you're a CTO, VP Eng, or technical founder making model choices for production:
- Stop indexing on parameter count. It was always a noisy proxy and it is now actively misleading. Open leaderboard scores, task-specific evals, and your own production traffic mix tell you more than the number of billions of parameters.
- Default to "smallest model that hits the eval bar," not "largest model the budget allows." Inference cost compounds. The 8B-beats-180B reality means you can usually get away with much less than vendor marketing implies.
- Treat any vendor roadmap whose value proposition is "we will be bigger next year" with suspicion. Some of the most important capability gains of recent years (RAG, tool use, chain-of-thought, distillation) came without any increase in model size.
- Audit any internal planning document that uses scaling laws as a forecast. They are weak forecasters outside narrow training regimes. If a 3-year roadmap depends on extrapolating one, that's a risk, not a plan.
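The "smallest model that hits the eval bar" rule from the list above can be written down as a simple selection step. This is a minimal sketch, assuming you already have a task-specific eval score for each candidate; the `Candidate` type, the model names, the scores, and the costs are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    params_b: float              # size in billions of parameters
    eval_score: float            # score on your own task-specific eval, in [0, 1]
    cost_per_1k_requests: float  # hypothetical serving cost


def smallest_passing(candidates: Sequence[Candidate], eval_bar: float) -> Optional[Candidate]:
    """Return the smallest candidate whose eval score meets the bar, or None."""
    passing = [c for c in candidates if c.eval_score >= eval_bar]
    # Prefer fewer parameters; break ties on serving cost.
    return min(passing, key=lambda c: (c.params_b, c.cost_per_1k_requests), default=None)


# Hypothetical candidates: the scores and costs are made up for illustration.
candidates = [
    Candidate("model-180b", params_b=180, eval_score=0.79, cost_per_1k_requests=7.50),
    Candidate("model-35b", params_b=35, eval_score=0.84, cost_per_1k_requests=1.60),
    Candidate("model-8b", params_b=8, eval_score=0.81, cost_per_1k_requests=0.40),
]

choice = smallest_passing(candidates, eval_bar=0.80)
print(choice.name if choice else "no candidate meets the bar")
```

Parameter count appears here only as a tiebreaker for cost, never as a stand-in for quality. If no candidate clears the bar, the function returns `None` instead of silently falling back to the largest model.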
The bigger-is-better assumption was useful for a decade. It is, gracefully and slowly, dying. The interesting question is what comes next — and that's where this gets exciting again. Engineering creativity has been crowded out by capex for years. It is about to matter again.
Next in this series: What actually drives the rate of return on compute — diminishing returns on parameters, the role of data quality, the algorithmic improvements doing the real work, and why architecture is the ceiling-setter nobody talks about.


