Beyond Scaling: The New Optimization Spaces for AI Progress
In Part 1 we covered why scaling is no longer a reliable axis of progress. In Part 2 we walked through the four levers that drive the actual rate of return on a unit of compute. The natural close of the series — and the part of Sara Hooker's essay I found most energizing — is the question: where should the field go next?
Hooker's answer is that we are entering an era of expanded optimization spaces. Computer scientists used to have one big lever (train a bigger model with more data) and that was both empowering and confining. The new landscape gives us a much wider set of things to optimize, and many of them are dramatically under-explored. Let's go through the ones she highlights, then deal with two important clarifications she makes at the end.
1. Gradient-free exploration: inference-time compute as a first-class lever
For the last 30 years, the way to make a model better has been to update its parameters. More training, more data, more weights. The departure happening right now is that a lot of compute is being spent at inference time, not training time. Crucially, much of it is gradient-free: the model's parameters are never updated.
Hooker groups this family of techniques together as the new "compute light" and "gradient free" optimization spaces (her Figure 5 splits them out explicitly):
- Best-of-N sampling. Sample multiple completions, score them, return the best.
- Search and planning over generations. Tree search, beam search variants, agentic loops that explore alternatives.
- Tool use. A model that can call a calculator, a database, a code interpreter or another model effectively borrows capability it doesn't have to memorize.
- Retrieval-augmented generation. Already mentioned in Part 2 — it lives in this category.
- Agentic swarms. Multiple model instances coordinating to solve a problem one couldn't solve alone.
- Model merging. Combine the parameters of multiple fine-tuned models without further training.
- Adaptive compute. Spend more inference compute on hard problems, less on easy ones.
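To make the first and last items in that list concrete, here is a minimal sketch of best-of-N sampling with an early exit. It is my own illustration, not code from Hooker's essay. The `generate` and `score` callables are assumptions standing in for a model call and a reward model or verifier, and the `good_enough` threshold is a hypothetical knob that turns plain best-of-N into a crude form of adaptive compute.

```python
import random
from typing import Callable, Optional, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
    good_enough: float = float("inf"),
) -> Tuple[Optional[str], float, int]:
    """Sample up to n completions; return (best, best_score, samples_used).

    No parameters are updated: all the extra work is inference calls.
    Stopping once a completion reaches `good_enough` means easy prompts
    use few samples while hard prompts use the full budget of n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    best, best_score, used = None, float("-inf"), 0
    for _ in range(n):
        candidate = generate(prompt)
        used += 1
        s = score(prompt, candidate)
        if s > best_score:
            best, best_score = candidate, s
        if best_score >= good_enough:
            break  # adaptive part: stop spending compute early
    return best, best_score, used


# Toy stand-ins so the sketch runs without a model (purely hypothetical).
def toy_generate(prompt: str) -> str:
    return str(random.randint(0, 20))


def toy_score(prompt: str, completion: str) -> float:
    # Pretend the right answer is 12; closer guesses score higher.
    return float(-abs(int(completion) - 12))


if __name__ == "__main__":
    random.seed(0)
    print(best_of_n("What is 7 + 5?", toy_generate, toy_score, n=16, good_enough=0.0))
```

Leaving `good_enough` at its default gives plain best-of-N. In practice the quality of `score` is the whole game: a weak scorer just selects confidently wrong answers.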
The Davidson et al. (2023) estimate is the headline number: inference-time techniques can deliver 5×–20× improvements over base post-training performance, with minimal footprint relative to the cost of pre-training. That is an enormous leverage ratio, and it is being captured today by teams who chose to invest in this layer instead of waiting for the next model size class.
The strategic implication is subtle but important. Inference-time techniques are engineering, not training. They reward teams that can ship, instrument, evaluate and iterate quickly. The bottleneck moves from "do you have enough GPUs to train" to "do you have enough engineering velocity to compose, evaluate and ship." This is genuinely good news for organizations that don't sit on a hyperscaler-sized capex line — which, again, is most of us.
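Model merging, from the list above, is perhaps the easiest of these techniques to show in a few lines, because the simplest variant is just parameter averaging. The sketch below is an assumption-laden toy: weights are plain Python lists keyed by hypothetical parameter names, whereas a real merge would operate on tensors from checkpoints that share an architecture (and usually a common base model). Averaging is one merging scheme among several, and I am not claiming it is the one Hooker has in mind.

```python
from typing import Dict, List, Optional, Sequence

Weights = Dict[str, List[float]]


def merge_by_averaging(
    models: Sequence[Weights],
    coefficients: Optional[Sequence[float]] = None,
) -> Weights:
    """Return a weighted average of same-shaped parameter dictionaries.

    With no coefficients given, every model contributes equally.
    No gradients are computed and no training step is run.
    """
    if not models:
        raise ValueError("need at least one model")
    if coefficients is None:
        coefficients = [1.0 / len(models)] * len(models)
    if len(coefficients) != len(models):
        raise ValueError("one coefficient per model is required")

    reference = models[0]
    for m in models:
        if set(m) != set(reference):
            raise ValueError("models do not share the same parameter names")

    merged: Weights = {}
    for name, values in reference.items():
        if any(len(m[name]) != len(values) for m in models):
            raise ValueError(f"parameter {name!r} differs in size across models")
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coefficients, models))
            for i in range(len(values))
        ]
    return merged
```

The point is less the arithmetic than the cost profile: the merge is a single pass over the weights, which is why it belongs in the compute-light bucket.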
2. The malleable data space
Hooker's second new optimization space is what she calls the malleable data space, and it might be the most philosophically interesting shift in the whole essay.
For most of AI history, datasets were frozen artifacts: MNIST, ImageNet, SQuAD, C4. You picked one, you trained on it, you reported numbers. The dataset was a snapshot of the world you happened to be able to gather. The fundamental machine learning assumption was IID: samples are independent and identically distributed, drawn from one fixed distribution. We accepted whatever the world handed us.
What changes when synthetic data generation becomes cheap enough to treat data itself as something you optimize?
- You can steer the distribution toward what you actually want — including capabilities, languages, edge cases, demographic balance — rather than accepting what the corpus happens to contain.
- You can target the long tail directly. If your model is weak on a specific category, you can generate or synthesize examples for it instead of hoping the next scrape contains more of them.
- You can shrink the gap between training-time and inference-time distribution. Historically there's been a chronic mismatch: training data is determined by what you could collect; inference inputs are determined by what users actually do. Synthetic data can close that gap deliberately.
- You can make invisible populations visible. The Aya line of work (Aryabumi et al. 2024; Üstün et al. 2024; Dang et al. 2024b) is largely about using synthetic data and translation to give multilingual coverage that the open web does not provide.
This is a sharp break from "IID samples from nature." We are now able to intentionally skew the distribution toward what we hope to represent, instead of accepting a random sample of what is. That is both an enormous capability and an enormous responsibility: synthetic data done badly compounds bias rather than fixing it.
For product teams, the practical takeaway is that you should treat your training/fine-tuning data as a thing you design, not a thing you gather. If your model is weak on a slice that matters, you have a lever that didn't exist five years ago.
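As a rough illustration of what "designing" data can look like in practice, the sketch below measures error per slice on an evaluation set and then splits a synthetic-data budget in proportion to how far each slice sits above a target error rate. The function names, the slice labels and the proportional rule are all my own assumptions, chosen for simplicity; the step that actually generates and filters the synthetic examples is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def slice_error_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute error rate per slice from (slice_name, was_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for slice_name, correct in results:
        totals[slice_name] += 1
        if not correct:
            errors[slice_name] += 1
    return {name: errors[name] / totals[name] for name in totals}


def allocate_synthetic_budget(
    error_rates: Dict[str, float],
    budget: int,
    target_error: float = 0.0,
) -> Dict[str, int]:
    """Split `budget` examples in proportion to each slice's excess error.

    Slices already at or below `target_error` get nothing. Counts are
    rounded down, so the total can fall slightly short of `budget`.
    """
    excess = {
        name: max(rate - target_error, 0.0) for name, rate in error_rates.items()
    }
    total = sum(excess.values())
    if total == 0:
        return {name: 0 for name in error_rates}
    return {name: int(budget * e / total) for name, e in excess.items()}


if __name__ == "__main__":
    # Hypothetical evaluation results for two language slices.
    results = [("en", True)] * 3 + [("en", False)] + [("sw", True)] + [("sw", False)] * 3
    rates = slice_error_rates(results)
    print(rates)
    print(allocate_synthetic_budget(rates, budget=1000))
```

The allocation rule is the least important part. What matters is the loop: measure by slice, generate for the weak slices, re-measure, and check that the synthetic examples did not introduce a new skew of their own.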
3. Design and interface
The third optimization space Hooker highlights is the one most computer scientists are least equipped for: how the system interacts with the world.
"The most intelligent system will increasingly be defined by building an algorithm that can interact with the world. This means for the first time researchers who care about intelligence need also be obsessed with how a model interacts. What was previously the narrow purview of UX designers, artists and human computer interaction specialists, should now be of great interest to all computer scientists."
This lands hard because it inverts a longstanding cultural assumption. Historically, AI progress has been gated by the algorithm, and the interface has been treated as a wrapper. Hooker is saying the interface is becoming part of the algorithm, and that the most capable systems will be multi-component systems whose intelligence emerges from how the components are composed and how they touch the world, not from any single model getting larger.
This dovetails with the agentic-systems wave but reframes it. The interesting agentic systems aren't "bigger model + tools." They're carefully designed interaction surfaces: where the model gets information, where it can act, what gets shown to the human, what the human approves, how feedback flows back. That's HCI, product design and systems engineering — and it's exactly the kind of work that has historically been undervalued in AI labs.
For anyone shipping AI features in product, this is good news. The discipline you already have in UX, in trust-and-safety review, in workflow design, in human-in-the-loop architecture — that is now first-class AI work. It is no longer a wrapper around the "real" capability.
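One way to see the interface as part of the algorithm is to write the interaction surface down as code. The sketch below is a hypothetical, heavily simplified gate between a model's proposed actions and the world: reversible actions run directly, irreversible ones need a human's approval, and every decision is logged so feedback can flow back. The class names, the `reversible` flag and the tool names are all illustrative assumptions, not an established API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProposedAction:
    name: str         # tool the model wants to call (hypothetical names)
    argument: str     # payload for the tool
    reversible: bool  # can the effect be undone cheaply?


@dataclass
class InteractionSurface:
    tools: Dict[str, Callable[[str], str]]       # where the model can act
    approve: Callable[[ProposedAction], bool]    # how the human is asked
    log: List[str] = field(default_factory=list) # feedback trail

    def execute(self, action: ProposedAction) -> str:
        if action.name not in self.tools:
            self.log.append(f"rejected unknown tool: {action.name}")
            return "error: unknown tool"
        # Irreversible actions are only run after explicit human approval.
        if not action.reversible and not self.approve(action):
            self.log.append(f"human declined: {action.name}")
            return "declined by human"
        result = self.tools[action.name](action.argument)
        self.log.append(f"ran {action.name}")
        return result
```

Every design question from the paragraph above shows up as a concrete choice here: which tools are in the dictionary, what counts as reversible, what the human sees when `approve` is called, and who reads the log afterwards.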
What this does not mean: the environmental clarification
Hooker is careful to head off a specific misreading of the essay, and I want to repeat it because it's important. The slow death of scaling training compute does not mean the environmental footprint of AI is shrinking. The opposite:
"The majority of energy requirements of AI workloads is not in training, but instead the cost to productionize an ML workload and serve it to billions of users. Even if model size is trending smaller, the widespread adoption of AI means overall energy requirements will likely continue to rise."
In other words: smaller, more performant models are being deployed in vastly more places, so the aggregate energy and water footprint of AI keeps growing even as per-model training cost potentially levels off. The Strubell et al. (2019a), Patterson et al. (2021), Luccioni et al. (2025), and Wu et al. (2022) lines of work are still load-bearing. If anything, the inference-heavy future Hooker describes makes serving efficiency, hardware utilization and carbon-aware deployment more important, not less.
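The accounting behind that claim can be written as a back-of-envelope identity. The symbols are my own shorthand rather than anything from Hooker's essay, and the model assumes a constant, positive energy cost per query: $E_{\text{train}}$ is the one-off training energy, $e_q$ the energy per served query, $N$ the number of queries served, $\alpha < 1$ the per-query saving from a smaller model, and $\beta$ the growth factor in usage.

```latex
% Back-of-envelope accounting; symbols are illustrative assumptions.
\begin{align}
E_{\text{total}} &= E_{\text{train}} + N \, e_q \\
N \, e_q > E_{\text{train}} &\iff N > \frac{E_{\text{train}}}{e_q} \\
\frac{(\beta N)(\alpha e_q)}{N \, e_q} = \alpha\beta > 1 &\iff \beta > \frac{1}{\alpha}
\end{align}
```

The second line says serving dominates once query volume passes the break-even point. The third says that a model twice as efficient per query still uses more serving energy overall as soon as usage more than doubles.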
I've written before about feasible sovereign operating regions in the context of this exact tension: the cost story for AI is increasingly determined by serving infrastructure, not training. Hooker's framing reinforces that.
Will we ever go back to scaling?
Hooker's answer here is measured and worth quoting:
"As long as we are stuck with transformers as an architecture it doesn't make sense to keep scaling compute. Our current architecture shows all the signs of plateauing in returns from additional compute. While progress has revolved around deep neural networks for the last decade, there is much to suggest that the next significant step forward will require an entirely different architecture."
The implication is that scaling will return when a new architecture arrives that breaks the current returns curve and opens a new one — exactly the way transformers did in 2017. But scaling the current architecture is, increasingly, capex chasing diminishing returns. The frontier labs that will lead the next wave won't be the ones that scaled hardest. They'll be the ones that bet on a paradigm shift.
What I'm taking away from the whole series
Three threads pulled from Hooker's essay that I think matter most for anyone shipping AI in 2026:
- The interesting work is back in the hands of engineers. For a decade, AI progress was a story about who could afford the most compute. The shift toward algorithmic technique, data design, inference-time compute and interface means the interesting differentiation is once again about engineering judgment: choice of retrieval architecture, curation of training data, design of agent loops, structure of human-in-the-loop. That is recoverable territory for teams that don't have a $100M training budget.
- The dominant policy and capex assumptions are aging fast. Compute thresholds in legislation, "responsible scaling" frameworks, vendor roadmaps premised on "next year, bigger": these are all artifacts of an assumption that is now empirically weak. Any plan that depends on them deserves a fresh look.
- The next architecture is the prize. Catastrophic forgetting, sample inefficiency, the inability to specialize regions of knowledge: these are the hard problems the current architecture can't solve. Whoever solves them resets the field. That's a much more interesting bet than "more parameters."
Hooker closes the essay with a Turing quote that fits the moment: "We can only see a short distance ahead, but we can see plenty there that needs to be done." That lands because, for a long stretch, computer science felt like it didn't have plenty to do. It had one thing to do, very expensively. We are finally on the other side of that. The view from here is more uncertain, but the work is more interesting again.
This is the final post in the series. Part 1 covered why bigger is no longer always better. Part 2 walked through what actually drives the rate of return on compute.
Reference: Sara Hooker, On the slow death of scaling, 2025.


