The Feasible Sovereign Operating Region: Why Your AI Roadmap Hits an Energy–Carbon–Water Wall (2/3)

By Marc Molas·May 13, 2026·11 min read

This is post 2 of 3 in a series on Sergio Cruzes' AI Infrastructure Sovereignty paper. Part 1 framed why sovereignty is infrastructure, not data residency; part 3 covers the LLM-as-advisor architecture.

The most useful idea in Sergio Cruzes' AI Infrastructure Sovereignty paper is one I haven't seen named explicitly anywhere else: the Feasible Sovereign Operating Region (FSOR), the intersection in which three hard physical limits are satisfied jointly, not one at a time:

  1. Energy — grid capacity and the rapid fluctuations that AI workloads inject into it.
  2. Carbon intensity — the live mix of the grid feeding the site.
  3. Water — cooling availability under seasonal stress.

Most AI roadmaps I have reviewed in the last twelve months optimise one of the three and silently assume the other two will take care of themselves. They don't compose. The FSOR is the part of the plan where they have to.

This is a follow-up to the sovereignty-isn't-data-residency post. If the previous piece argued that real sovereignty lives in physical infrastructure, this one is about what happens when you actually try to operate in that physical infrastructure under the constraints the paper makes explicit.

The three limits aren't independent — that's the whole point

AI sustainability has historically been reported as three separate KPIs: PUE for power efficiency, gCO₂eq/kWh for grid carbon, WUE for water. Three numbers, three teams, three quarterly reports. The paper's contribution is to insist they are a single joint feasibility problem, not three additive ones.

A training cluster site is operable at a given moment only if all three of the following are simultaneously true:

  • The grid has the power headroom to absorb the workload's bursty profile, including millisecond-to-second spikes during collective operations.
  • The live carbon intensity of that grid is inside the policy envelope you've committed to (or the one a sustainability regulator will accept).
  • The local water budget — wet-bulb ambient, basin reserves, seasonal variation — supports the cooling load at the required power density.

Take any one of those out, and the cluster is not operating in the FSOR; it is operating outside the policy and accumulating future debt — financial, regulatory, or reputational. The three constraints can be in tension at the same site:

  • A region with abundant low-carbon energy may have insufficient grid capacity to absorb 100+ MW of bursty load without destabilising the local network.
  • A region with abundant water and grid capacity may have a high-carbon mix that pushes the workload outside its declared sustainability envelope.
  • A region with clean energy and adequate water may sit on a fiber path with latency or operator dependencies that make it inadmissible for the workload anyway.

The FSOR is the intersection. It is often smaller than any of its constituent sets suggests, and most operators have not actually computed it.
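
To make the joint test concrete, here is a minimal sketch of the FSOR check as a single conjunction rather than three separate reports. The field names and thresholds are my own illustration, not the paper's notation.

```python
# A minimal sketch of the FSOR as one joint feasibility test.
# Field names and thresholds are illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class SiteSnapshot:
    grid_headroom_mw: float      # spare capacity for the workload's bursty profile
    carbon_gco2_per_kwh: float   # live carbon intensity of the feeding grid
    water_budget_ok: bool        # cooling feasible at current wet-bulb and basin state

def in_fsor(s: SiteSnapshot, load_mw: float, carbon_cap: float) -> bool:
    # The whole point is the AND: one failed axis takes the site out of the FSOR.
    return (s.grid_headroom_mw >= load_mw
            and s.carbon_gco2_per_kwh <= carbon_cap
            and s.water_budget_ok)
```

Everything else in this post is about getting trustworthy inputs for those three fields.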

Why training is the hardest case

The paper makes a workload classification that should be on every platform team's wall:

| Workload | Power profile | Cooling demand | Network demand | Portability |
| --- | --- | --- | --- | --- |
| Training | Bursty, high | Extreme | Very high | Low |
| Inference | Sustained | Moderate–high | High | Medium |
| Batch analytics | Variable | Moderate | Low–moderate | High |

The killer column is portability. Training clusters cannot exploit distant low-carbon sites for a structural reason: collective communication latency. The speed of light in fiber fixes the floor at roughly 5 ms per 1,000 km one way. Training tolerates roughly 1 ms of round-trip collective-comm latency, which collapses the geographic radius to about 100 km. Within 100 km, you take the energy mix, the water situation and the grid headroom you find, or you don't train.
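
The arithmetic is worth running once yourself. A quick sketch, assuming signals travel at roughly two-thirds of c in optical fiber and that the latency budget is a round trip:

```python
# Back-of-the-envelope check of the training radius.
# Assumptions: ~2/3 c in optical fiber (~200,000 km/s, i.e. 200 km per ms one way),
# and the 1 ms collective-comm budget is a round trip.

FIBER_KM_PER_MS = 200.0

def max_training_radius_km(budget_ms: float, round_trip: bool = True) -> float:
    """Geographic radius reachable within a collective-comm latency budget."""
    one_way_ms = budget_ms / 2 if round_trip else budget_ms
    return one_way_ms * FIBER_KM_PER_MS

print(max_training_radius_km(1.0))   # 100.0 km: the paper's training radius
print(max_training_radius_km(5.0))   # 500.0 km: still not "wherever the grid is greenest"
```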

This is the part of the AI sustainability discourse where things start to hurt. The "we'll move our training to wherever the energy is greenest" story is a slide, not an architecture. The paper says it plainly: frontier AI training is inherently site-bound and concentrated in the few locations that satisfy all three FSOR constraints simultaneously. That's why the actual training capacity in the world clusters in a small set of geographies; it isn't industry preference, it is the FSOR doing its job.

Inference is more forgiving — replicable across regions, but still tied to demand geography. Batch analytics is the only category where carbon-aware relocation is actually achievable at scale. Most "carbon-aware AI" demos in the wild rebrand batch analytics as the headline. That is fine, but it does not address training, which is where the carbon, water and grid impact actually live.

Where the European roadmaps quietly break

If you take the FSOR seriously, several lines on the current European AI infrastructure roadmaps stop holding their own weight:

  • "Locate training in Northern Europe for the green grid." Partially true. The grid is greener; the grid headroom in many target regions is already constrained by existing data centre tenancy and hyperscaler expansions. Carbon ✓, energy ✗.
  • "Use Southern European solar capacity for AI training." Energy availability sometimes ✓; water budget under summer wet-bulb conditions often ✗. The cooling penalty in peak season collapses the FSOR.
  • "Edge AI eliminates the data centre problem." For inference at the very edge, sometimes. For training, no. The paper is explicit on this: training is site-bound and its constraints are not solved by edge fan-out.
  • "Liquid cooling solves the density problem." It solves the air-cooling ceiling (the ~20–30 kW per rack threshold). It does not solve the water budget; in many configurations it makes the WUE situation more legible, not better, because it brings the cooling demand into a regime that has to be measured rather than approximated.

I don't say any of this to be dismissive — I support each of those programmes in principle. I say it because the next round of operational incidents in European AI infrastructure will live precisely in the gap between the headline assumption ("clean energy region") and the joint FSOR ("clean energy region that, in August at 19:00 local time, with current grid concurrency, cannot operate this workload in policy").

The telemetry you need to even know your FSOR

You cannot prove you are operating inside an FSOR if you cannot measure it. The paper's reference architecture spells this out: cross-layer telemetry fusion is a precondition, not a nice-to-have. For each of the three FSOR axes, the minimum signal set is non-trivial:

Energy axis

  • Per-rack power draw and the derivative (so you see collective-op spikes, not just averages).
  • Upstream grid headroom signal — usually a commercial feed or a direct utility integration.
  • Real-time PPA accounting (what fraction of current draw is actually covered by your contracted renewables vs. spot grid).

Carbon axis

  • Live grid carbon-intensity feed (WattTime, Electricity Maps, national TSO). Update cadence ~5 minutes.
  • Marginal vs. average emissions distinction — most policy envelopes are written in averages, but for workload-shifting decisions the marginal is what matters.
  • Attribution: which workload's emissions belong to which tenant, billing entity or regulatory perimeter.

Water axis

  • Coolant inlet/outlet temperatures, flow rates.
  • Wet-bulb ambient and a forecast (so the FSOR has a next-six-hours projection, not just a current snapshot).
  • Basin / utility stewardship signal — usually an external integration, often not real-time.

In the language of my last post, this is the θ(t) state vector for sustainability. Without it, your FSOR is a guess. With it, you have the substrate to make actual scheduling, throttling and migration decisions defensibly.
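
To make that substrate tangible, here is a minimal sketch of what a fused θ(t) snapshot could look like as a schema. The field names are my own illustration of the signal sets above; the paper specifies the signals, not a schema.

```python
# A sketch of the θ(t) sustainability state vector described in this section.
# Field names are illustrative; the paper specifies the signal set, not a schema.

from dataclasses import dataclass

@dataclass
class EnergyState:
    rack_power_kw: dict[str, float]      # per-rack draw
    rack_power_dkw_dt: dict[str, float]  # derivative: catches collective-op spikes
    grid_headroom_mw: float              # upstream commercial feed or utility integration
    ppa_coverage_fraction: float         # contracted renewables vs. spot grid

@dataclass
class CarbonState:
    average_gco2_per_kwh: float          # what policy envelopes are written in
    marginal_gco2_per_kwh: float         # what workload-shifting decisions should use
    feed_age_s: float                    # staleness of the ~5-minute feed

@dataclass
class WaterState:
    coolant_inlet_c: float
    coolant_outlet_c: float
    flow_lpm: float
    wet_bulb_c: float
    wet_bulb_forecast_6h_c: list[float]  # the next-six-hours projection, not just a snapshot

@dataclass
class ThetaT:
    """Fused cross-layer telemetry at time t: the substrate for FSOR decisions."""
    t_unix: float
    energy: EnergyState
    carbon: CarbonState
    water: WaterState
```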

The reason most platform teams don't have this layer is that it costs real engineering work and produces no demo. It produces a system that, after three months, says "we cannot operate workload X at this site between 17:00 and 23:00 on summer weekdays." That is a politically uncomfortable output, which is part of why FSOR telemetry remains underbuilt.

What the agentic layer is supposed to do — and what it isn't

Cruzes proposes a control architecture in which agentic AI helps operate within the FSOR. I'll cover the LLM-vs-deterministic-agent boundary in the next post. For the FSOR specifically, the architecture says three useful things:

  1. Optimisation respects constraint primacy. Hard limits — sustainability envelopes, physics — are not soft objectives in the loss function. They override. An agent that proposes a workload placement violating the water budget should be rejected by the coordination layer, not penalised by a regularisation term.

  2. Cross-domain agreement is the actual job. A compute-placement agent that picks the lowest-carbon site is correct only if the cooling agent, the network agent and the grid agent agree it is jointly feasible. Single-domain optimisation is exactly the failure mode the FSOR exists to prevent.

  3. Digital twin validation before execution. No agent proposal touches live infrastructure until a digital twin has simulated it against the physical dynamics. This is the discipline that separates running AI for AI infrastructure from running a demo of AI for AI infrastructure.

The point I want to extract here for my regulated clients: the agentic-AI-on-data-centre story is not an autonomy story. It is a constraint enforcement story dressed up in agent vocabulary. The interesting work is making the constraints machine-readable and the validation deterministic, not making the agents smarter.
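
A minimal sketch of what "constraints override" can look like at the coordination layer, assuming a hypothetical Proposal/HardEnvelope shape rather than anything from the paper:

```python
# A sketch of constraint primacy: hard limits reject, they are never soft penalties.
# All names and shapes here are illustrative, not the paper's architecture.

from dataclasses import dataclass

@dataclass
class Proposal:
    site: str
    added_load_mw: float
    projected_water_lpm: float

@dataclass
class HardEnvelope:
    max_load_mw: float
    max_water_lpm: float

def coordinate(p: Proposal, env: HardEnvelope) -> tuple[bool, str]:
    """Coordination layer: reject infeasible proposals outright."""
    if p.added_load_mw > env.max_load_mw:
        return False, f"{p.site}: grid headroom violated"
    if p.projected_water_lpm > env.max_water_lpm:
        return False, f"{p.site}: water budget violated"
    # Only feasible proposals ever reach the digital twin, let alone live hardware.
    return True, "feasible; forward to digital-twin validation"
```

The design choice that matters is the return type: a rejection, not a penalty term that a downstream optimiser could trade away.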

What I'm critical of in the sustainability-AI hype

I deploy LLMs and agentic systems for a living. I am not going to argue the technology isn't useful. What I am going to argue is that the sustainability-AI hype is currently optimising the wrong layer.

  • "AI for grid optimisation" pitched as the answer when AI is also a major driver of grid stress. Both are true at once. The second sentence is missing from most decks.
  • Carbon-aware training relocation presented as a near-term lever. It is a long-term lever conditional on solving the 100-km latency constraint, which nobody has solved. The near-term lever is carbon-aware batch analytics, which is real and useful but isn't training.
  • WUE reporting treated as a checkbox. It is fine as a number; it becomes meaningful only inside an FSOR that joins it to the carbon and energy axes. A site with great WUE on a high-carbon grid in a water-stressed basin is not winning anything; it is moving the externality.
  • "Liquid cooling = sustainable" stated without the joint accounting. Liquid cooling is what makes >30 kW racks possible; whether it improves the overall footprint depends entirely on the FSOR maths at that specific site.

The honest version of carbon-aware AI is: for the small set of workloads with high portability, in regions with a real FSOR, we can shift load in ways that materially reduce footprint. That is genuine value. It is also a much narrower claim than the slide.

What I'd put on the infrastructure roadmap this quarter

For a platform team that has read the FSOR section of the paper and wants to act on it before the next sustainability disclosure cycle:

  1. Build a workload portability map. For each AI workload, mark its category (training / inference / batch / serving) and its actual portability budget — geographic radius, allowed downtime for migration, statefulness constraints. This single artifact tells you which workloads even could be moved on an FSOR basis.

  2. Compute the FSOR for one site. Pick your most loaded location. Compute the joint feasible region across the three axes for the next 72 hours using whatever telemetry you can assemble. Publish it as an internal dashboard. You will discover blind spots immediately. That is the value.

  3. Define the constraint envelope. Translate your committed carbon, energy and water policies into machine-readable constraints. If your policy is "average <X gCO₂/kWh," write it as a function of time-of-day and season, as in the sketch after this list. Soft policies that live only in a PDF cannot enter the FSOR.

  4. Tag workloads by portability, not just by tier. The portability tag is what your future workload scheduler — agentic or otherwise — will use to decide what can move. This is the precondition for carbon-aware batch shifting being more than a demo.

  5. Treat the FSOR as part of the risk-management file. In a regulated environment, you'll be asked about energy, carbon and water dependencies of high-risk AI systems within this regulatory cycle. Better to have an FSOR analysis in the file than to write it the week of the audit.
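
For step 3, a minimal sketch of a policy written as code rather than as a PDF. The envelope values are placeholders; substitute your committed numbers.

```python
# A sketch of a carbon policy as a machine-readable function of time-of-day
# and season. The thresholds are placeholders, not anyone's real commitment.

from datetime import datetime

def carbon_cap_gco2_per_kwh(t: datetime) -> float:
    """Committed average-carbon envelope, varying by season and hour."""
    summer = t.month in (6, 7, 8)
    evening_peak = 17 <= t.hour < 23
    if summer and evening_peak:
        return 150.0   # tightest: stressed grid, worst mix
    if summer:
        return 250.0
    return 300.0       # default envelope the disclosure commits to

# A scheduler, agentic or otherwise, can now test the envelope instead of
# interpreting it:  feasible = live_carbon <= carbon_cap_gco2_per_kwh(now)
```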

The line I'm drawing

The FSOR isn't a clever name for what we already do; it is a discipline we mostly don't do yet. The three sustainability KPIs we report quarterly are not jointly tested in operations. The hype around carbon-aware AI flattens the workload-portability distinction that decides whether anything is shiftable at all. The European sovereignty discourse picks one of the three axes and assumes the other two are someone else's problem.

If you are running AI infrastructure in 2026 and the question "what is the FSOR for workload X at site Y over the next 72 hours?" cannot be answered by your platform from a live system in under five minutes, you don't yet have an FSOR — you have three KPIs that happen to be reported on the same slide.

The work is doable. It is mostly telemetry, schema and policy work, with a small layer of agentic logic on top. The team that does it before the regulatory cycle catches up will be visibly ahead of the team that doesn't.


Sources:

  • Sergio Cruzes (Ciena Corporation), AI Infrastructure Sovereignty, arXiv:2602.10900v4, April 2026.

Running AI workloads with sustainability commitments you're not sure your infrastructure can actually honour? Talk to a CTO — we'll help you compute a real FSOR before the disclosure cycle does it for you.
