
Let the LLM Talk, Not Touch: The Closed-Loop Architecture That Actually Survives Production (3/3)

By Marc Molas·May 13, 2026·11 min read

This is post 3 of 3 in a series on Sergio Cruzes' AI Infrastructure Sovereignty paper. Part 1 framed why sovereignty is infrastructure, not data residency; part 2 covered the Feasible Sovereign Operating Region.

The third piece of Sergio Cruzes' AI Infrastructure Sovereignty paper that should travel further than it has is the bit where he draws a hard architectural line: in a closed-loop AI infrastructure system, LLMs are advisory and interpretive. They do not execute. Execution is the job of bounded, deterministic agents, validated by a digital twin, with two strictly separated feedback paths.

I deploy agentic AI in regulated environments for a living. I'm "invested" in this technology in the most literal billable sense. And I think the paper's architecture is the correct one — which is exactly why I want to flag that most products being sold as agentic platforms in 2026 silently violate it. They put the LLM closer to the actuator than the paper's design allows, then market that closeness as the feature.

If the first two pieces covered what you have to control, this one is about how the control loop has to be wired without lighting the data hall on fire.

The four-layer reference architecture, in one paragraph

The paper proposes four stacked layers:

  1. Physical — AI data centres, optical networks, energy systems. The substrate.
  2. Observability — streaming normalisation, timestamp alignment, freshness certification, cross-domain fusion. Produces the unified state vector θ(t).
  3. Coordinated Control — domain agents (compute, power, cooling, optical) + coordination tier + digital twin + an LLM assistance layer.
  4. Safe Execution — only digital twin-validated actions reach the live infrastructure.

The interesting boundary is between 3 and 4. The interesting non-boundary — the one the hype layer wants to blur — is between the LLM assistance and everything else inside layer 3.
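To pin that boundary down, here is a minimal sketch of the layer 3/4 gate in Python. The names (ProposedAction, execute_if_validated, the twin_simulate and actuator callables) are my own illustration rather than anything in the paper; the one property it encodes is that the actuator is never called on anything the twin has not cleared.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An action proposed by a tier-1 domain agent (illustrative fields)."""
    domain: str        # e.g. "cooling", "compute", "optical"
    command: str       # structured command name, never free-form LLM text
    parameters: dict


class TwinValidationError(Exception):
    """Raised when the digital twin refuses to clear an action."""


def execute_if_validated(
    action: ProposedAction,
    twin_simulate: Callable[[ProposedAction], bool],
    actuator: Callable[[ProposedAction], None],
) -> None:
    """The layer 3/4 boundary: the actuator is only ever called on an action
    the digital twin has simulated and found within constraints."""
    if not twin_simulate(action):
        # A blocked action never touches the physical layer; it goes back
        # to the coordination tier (or to a human), not to the actuator.
        raise TwinValidationError(f"twin rejected {action.domain}:{action.command}")
    actuator(action)
```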

What Cruzes actually says about LLMs

The paper is unusually explicit. The LLM layer has an "advisory and interpretive role only." It exists to:

  • Translate human intent into structured objectives the deterministic agents can consume.
  • Generate explanations of what the agentic system decided and why.
  • Be a natural-language surface on top of the actual control system, not a participant in it.

And then the paper says the quiet part out loud:

Allowing LLM outputs to drive infrastructure actions directly — without validation through the deterministic constraint-checking of the agentic system and the pre-execution simulation of the digital twin — introduces a failure mode in which plausible-sounding but incorrect instructions are executed on live infrastructure.

This is the LLM-in-production failure mode I've personally watched in five different incident reviews in the last eighteen months, none of them in data-centre control but all of them in regulated environments: the LLM produces something that looks like the right command, the surrounding system is too eager to execute it, and the post-mortem turns into a "we trusted text where we should have trusted policy" exercise. The data-centre version of that incident would not be a Slack-bot embarrassment. It would be a thermal event.

The two-tier agentic structure

Inside the coordinated control layer, the paper separates:

  • Tier 1 — domain agents. Specialised reasoners for compute placement, power management, cooling control, optical routing. Each one has hard-coded knowledge of its domain's constraints and physics. These do the actual proposing of actions.
  • Tier 2 — coordination layer. Joint feasibility checking across all tier 1 proposals. If compute wants to place a workload at site A, but the cooling agent says A is over budget given the current wet-bulb, and the optical agent says the link to A is in degraded mode, the coordinator catches the contradiction. If no jointly feasible action exists, it escalates to humans rather than picking the least bad option silently (a minimal sketch of this check follows the list).
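A minimal sketch of that tier-2 check, with names of my own invention (Proposal, EscalateToHumans) rather than the paper's: the coordinator either finds every domain in agreement or hands the decision to a human, never a silent fallback.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A tier-1 proposal plus that domain's own constraint verdict (illustrative)."""
    domain: str        # "compute", "power", "cooling", "optical"
    action: str        # e.g. "place workload W at site A"
    feasible: bool     # result of the domain agent's hard-coded constraint check
    reason: str = ""   # why it is infeasible, if it is


class EscalateToHumans(Exception):
    """No jointly feasible action exists; a human makes the call."""


def coordinate(proposals: list[Proposal]) -> list[Proposal]:
    """Tier-2 sketch: accept the joint plan only if every domain agrees it is
    feasible; otherwise escalate with the reasons rather than silently picking
    the least bad option."""
    infeasible = [p for p in proposals if not p.feasible]
    if infeasible:
        reasons = "; ".join(f"{p.domain}: {p.reason}" for p in infeasible)
        raise EscalateToHumans(f"no jointly feasible action ({reasons})")
    return proposals
```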

The LLM is not tier 1 and not tier 2. The LLM sits outside this loop. It explains what the loop did. It accepts human intent and reformulates it as a structured objective fed into the loop. It does not place workloads. It does not throttle racks. It does not reroute optical paths.

This is a defensible, regulator-friendly design. It is also a design that most "agentic" platforms on the market today do not match, because the marketing pressure is to include the LLM in the decision — that's where the magic-trick demo lives.

Two feedback paths, kept strictly separate

The detail that an engineer will appreciate and a marketer will gloss over is the two-feedback-paths discipline:

  • Feedback A: measured outcomes flow from the physical layer back up through observability. This closes the control loop. The agents learn that the action they took produced (or did not produce) the expected state change.
  • Feedback B: prediction residuals (the difference between what the digital twin expected and what actually happened) flow back into the digital twin only. This is how the twin detects its own drift from physical reality.

The paper insists these channels remain strictly separate. Conflate them and you destroy drift detection. If the digital twin gets the same measurement stream as the agent control loop, with no isolation, then a slow drift in the twin's accuracy will look like normal operational variance to the agents, and you will not see the drift until the twin clears an action that the physical system rejects, in an incident.
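In code, the discipline is unglamorous but simple. The names below (Measurement, route_feedback, the two sinks) are illustrative assumptions, not the paper's interfaces; the point is one measurement, two pipelines, and the residual only ever reaching the twin.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Measurement:
    """One observation from the physical layer (illustrative fields)."""
    metric: str        # e.g. "rack_inlet_temp_C"
    value: float       # what the sensors actually reported
    predicted: float   # what the digital twin expected for the same instant


def route_feedback(
    m: Measurement,
    control_loop_sink: Callable[[dict], None],   # Feedback A consumer
    twin_sink: Callable[[dict], None],           # Feedback B consumer
) -> None:
    """Same source telemetry, two deliberately separate pipelines."""
    # Feedback A: the measured outcome closes the agents' control loop.
    control_loop_sink({"metric": m.metric, "value": m.value})
    # Feedback B: only the prediction residual goes to the twin, so twin
    # drift stays visible instead of blending into operational variance.
    twin_sink({"metric": m.metric, "residual": m.value - m.predicted})
```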

This is the kind of architectural rigour that doesn't sell platform licences but does keep you out of a post-mortem.

Where most current "agentic" platforms quietly break this

I'll generalise from what I see in client architectures and vendor demos, without naming names:

  1. LLM in the action path. The product sells "an agent that operates your infrastructure." Under the hood, the LLM both interprets the request and emits the command. There is no deterministic tier 1 agent with hard-coded constraints between the LLM and the actuator. This is the failure mode the paper names explicitly.

  2. Digital twin as a marketing asset, not a validation gate. Many products show a 3D-rendered "digital twin" in the demo. Few of them require that the twin validate every proposed action before execution. The twin is decorative. In the paper's architecture, the twin is a gate: if the simulation shows the proposed action violating a constraint or policy, the action is blocked.

  3. Single-loop telemetry. Both the agent and the twin consume the same stream with no separation. Feedback A and B are conflated, drift detection is unreliable, and the system silently loses the property the paper insists on.

  4. No escalation contract. When the coordination layer finds no jointly feasible action, what happens? In the paper, graceful degradation with structured escalation to humans, who retain final authority. In many products, the system just picks the lowest-cost action under a fallback heuristic and writes a debug log. That is not graceful degradation; it is silent failure with a logging system.

  5. Human-on-the-loop as a checkbox. A human dashboard exists; it is reviewed weekly. Operationally, the agents have moved faster than the review cadence for months. This is the data-centre version of the HITL theatre that McKinsey's report flagged for general agentic systems. Same disease, higher blast radius.

If your platform fails any one of these tests, you have an agentic infrastructure system in the marketing sense and a demo with elevated permissions in the operational one.

Why I think the paper's architecture is correct

Three reasons, drawn from how this actually plays out at clients who have to defend the stack:

1. The LLM is excellent at the layer where its errors are recoverable. Translating "I want to schedule the next training run somewhere within our carbon envelope" into a structured objective is a great use of an LLM. If the translation is wrong, the structured objective fails validation and the request comes back with an error. No physical action was taken. Recoverable. Excellent. (A minimal sketch of that validation step follows point 3 below.)

2. The LLM is dangerous at the layer where its errors are not recoverable. Generating the exact rack-throttling command is the wrong place to use the LLM, because if the generated command is plausible-but-wrong and it executes, the physical system already moved. There is no "undo" on a thermal cycle. The paper's separation puts the LLM exactly where its strengths land and removes it from where its weaknesses bite.

3. Regulator-shaped vocabulary. A supervisor in a regulated sector will ask, in any incident review: what made the decision, what validated it, what evidence do you have? The paper's design has a clean answer for each. The LLM-in-the-action-path design has, at best, "the model decided," which is the answer that triggers the next two years of remediation work.
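To make point 1 concrete, here is a minimal sketch of the recoverable translation layer, assuming an illustrative objective schema (StructuredObjective, parse_llm_objective and the field names are mine, not the paper's). A wrong translation dies in validation; no agent ran and no actuator moved.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredObjective:
    """What the deterministic agents consume (an illustrative schema)."""
    workload_id: str
    deadline_hours: float
    max_carbon_kg: float


def parse_llm_objective(llm_output: str) -> StructuredObjective:
    """Turn the LLM's translation of human intent into a typed objective,
    or fail loudly. A wrong translation dies here: the user gets an error
    back and nothing downstream ever sees the request."""
    raw = json.loads(llm_output)  # raises a ValueError subclass on non-JSON text
    objective = StructuredObjective(
        workload_id=str(raw["workload_id"]),
        deadline_hours=float(raw["deadline_hours"]),
        max_carbon_kg=float(raw["max_carbon_kg"]),
    )
    if objective.deadline_hours <= 0 or objective.max_carbon_kg <= 0:
        raise ValueError("objective fails basic sanity constraints")
    return objective
```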

I want to be clear: I am positive on LLMs. I deploy them; I have skin in the game on AI working in production. I am not making an "LLMs are unreliable, don't use them" argument. I am making a placement argument: LLMs are the right tool at the natural-language and explanation layer, and the wrong tool at the execution layer. The paper formalises the placement that good operators were already converging on informally.

What this means for the rest of agentic AI, not just data centres

The paper is specifically about AI infrastructure control, but the architecture generalises cleanly to most regulated agentic deployments:

  • Banking agent that processes payments. LLM translates the customer's intent. Deterministic agent with policy and limits issues the actual debit. Digital twin (or pre-flight checks against a sandboxed ledger) validates before commit.
  • Healthcare triage agent. LLM mediates dialogue, summarises history. Deterministic agent applies the protocol. Human-in-the-loop on any action that produces clinical effect.
  • Industrial control agent. LLM explains setpoints to the operator and accepts setpoint targets from natural language. Deterministic controller actually moves the valve, after a simulator validates that the move does not violate process limits.

In all three, the architectural skeleton is the same as the data-centre one in the paper: the LLM never holds the actuator. It holds the explanation, the natural-language surface, and the translation of intent. The boundary doesn't move because the regulator and the physics don't move.
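Taking the banking example as a sketch, with hypothetical names throughout (PaymentInstruction, execute_payment, the sandbox and debit callables): the LLM produces, at most, the structured instruction; policy limits and the sandboxed pre-flight sit between that instruction and the money.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PaymentInstruction:
    """Structured output of the LLM's intent translation (illustrative)."""
    from_account: str
    to_account: str
    amount: float


def execute_payment(
    instr: PaymentInstruction,
    daily_limit: float,
    sandbox_preflight: Callable[[PaymentInstruction], bool],  # check against a sandboxed ledger
    issue_debit: Callable[[PaymentInstruction], None],        # the only thing that moves money
) -> None:
    """Deterministic-agent sketch: the LLM never calls issue_debit.
    Policy limits and the sandboxed pre-flight sit between intent and money."""
    if instr.amount <= 0 or instr.amount > daily_limit:
        raise ValueError("payment violates policy limits; nothing was debited")
    if not sandbox_preflight(instr):
        raise ValueError("sandboxed pre-flight failed; nothing was debited")
    issue_debit(instr)
```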

This is the same line I drew in my proof-carrying deployment and verifiable governance architecture posts, from a different angle. The Cruzes paper provides the physical-infrastructure version of an argument that is converging across regulated sectors: LLM useful, LLM not authoritative, deterministic agent in the path of consequence.

What I'd put on the platform roadmap this quarter

If I had to translate this third post into actions for a platform team running — or planning to run — agentic AI in a serious environment:

  1. Map your action graph. For every operation an "agent" can perform, mark which layer issues it: LLM, tier-1 deterministic agent, or human. If the LLM appears anywhere in the execution column, you have rework to do before the regulator does it for you.

  2. Put a digital twin in front of the actuator. Even a coarse one. The point isn't fidelity; the point is the gate. An action that the twin cannot simulate, or which the twin shows violating a constraint, does not execute. Period. This single discipline removes a category of incidents that look catastrophic in the post-mortem and trivial in retrospect.

  3. Separate feedback A and B. Outcomes go to the control loop. Twin residuals go to the twin. Same source telemetry, two pipelines, two retention policies, two ownership lines. This is unglamorous infra work; it is also the work that makes drift detection real.

  4. Write the escalation contract. Define what happens when no jointly feasible action exists. The answer is humans, with a clear handoff and a published SLA on response. Anything else is a silent fallback that will surface in an incident.

  5. Audit your vendor against the five tests above. LLM not in the action path; twin as a real validation gate; feedback paths separated; explicit escalation; a human-on-the-loop cadence that keeps pace with the agents. Any "agentic platform" that fails two or more of these is not a regulator-grade system; it is a productivity demo with elevated permissions. A minimal audit sketch follows this list.
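One way to make that audit mechanical, assuming you record each answer as a plain boolean (the keys are my own shorthand, not any vendor's API):

```python
def audit_platform(platform: dict) -> list[str]:
    """Return the list of tests a platform fails, given yes/no answers recorded
    as booleans. Unknown answers default to the failing side on purpose."""
    failures = []
    if platform.get("llm_in_action_path", True):
        failures.append("LLM appears in the action path")
    if not platform.get("twin_gates_every_action", False):
        failures.append("digital twin is not a pre-execution validation gate")
    if not platform.get("feedback_paths_separated", False):
        failures.append("feedback A and B share one pipeline")
    if not platform.get("escalation_contract", False):
        failures.append("no explicit escalation contract to humans")
    if not platform.get("human_on_loop_cadence_ok", False):
        failures.append("human-on-the-loop review slower than the agents act")
    return failures
```

Two or more entries in the returned list and you are looking at the productivity demo, not the regulator-grade system.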

The line I'm drawing — and why I'm holding it

I'm critical of the current agentic-AI hype not because the technology isn't real — it is, demonstrably, and I bill for it — but because the marketed architecture is consistently closer to the actuator than the engineering architecture should be. The Cruzes paper, working in the most demanding operational domain available (live AI infrastructure under joint physical constraints), arrives at a discipline that translates cleanly to every regulated agentic deployment: LLMs talk and explain. Deterministic agents propose. Coordinators check feasibility. Digital twins validate. Humans authorise the policy and own the escalation. The physical system only ever sees actions that have cleared all four prior gates.

The fastest agentic platform in 2026 won't be the one whose LLM gets closest to the metal. It will be the one whose LLM is honestly placed where its strengths live, with the rest of the stack engineered to absorb its weaknesses. That platform won't make the magic-trick demo. It will make the audit on a Tuesday in October at 09:30 without anybody needing to take the day off.

That's the system I want to keep building. Everything else is theatre with permissions.


Sources:

  • Sergio Cruzes (Ciena Corporation), AI Infrastructure Sovereignty, arXiv:2602.10900v4, April 2026.

Putting agentic AI into production and unsure your architecture would survive an incident review? Talk to a CTO — we'll help you place the LLM exactly where its strengths land and nowhere else.
