McKinsey 2026: AI Trust Maturity Hits 2.3. My Infrastructure Isn't Buying It Yet.

By Marc Molas · May 12, 2026 · 10 min read

McKinsey has just published its annual AI Trust Maturity Survey, this year framed around the agentic era. Around 500 organisations were surveyed between December 2025 and January 2026. Average maturity score: 2.3 out of 5, modestly up from 2.0 the year before. 62% are at least experimenting with agents, and 23% are scaling them somewhere. And the headline that actually matters: nearly two-thirds of respondents cite security and risk as the top barrier to scaling agentic AI, ahead even of regulatory uncertainty.

That number is what should land on every platform roadmap this quarter. From where I sit — DevOps and infrastructure for companies that have to defend their stack in front of a regulator — the report isn't an optimistic one. It's a list of things that aren't yet built underneath the good slides in the keynote.

McKinsey's framing: trust is no longer compliance, it's business value

This year's framing is deliberate. McKinsey says the perceived influence of some regulatory frameworks has declined, and that companies are shifting from compliance-led motivation to value-driven adoption. Translation: executives want to stop seeing AI governance as a mandatory cost and start seeing it as a revenue lever.

Fine as a discourse frame. Toxic as an operating frame if you don't understand what sits beneath it. The statistic the report cites — that organisations investing more than $25M in responsible AI see EBIT impact above 5% — doesn't hold because governance "adds value" by magic. It holds because companies that have put that money down have also built:

  • Evaluation pipelines with versioned golden sets.
  • Cost attribution per agent and per route.
  • Tool catalogues with per-agent scopes and quotas — a sketch of what I mean follows this list.
  • A dedicated AI platform team with its own on-call rotation.
  • Lineage for prompts, models, embeddings, retrieval and decisions.
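
To make the tool-catalogue point concrete, here's a minimal sketch of a deny-by-default catalogue. The agent names, tool names, scopes and quota values are hypothetical, and the schema is my assumption, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolGrant:
    """One agent's permission to call one tool (hypothetical schema)."""
    tool: str                    # e.g. "crm.flag_account"
    scopes: frozenset[str]       # least-privilege scopes, never "admin"
    calls_per_hour: int          # hard quota, enforced at the gateway
    max_cost_per_day_eur: float  # spend cap per agent per tool

# The catalogue, not the agent, is the source of truth the gateway
# consults before executing any tool call.
CATALOGUE: dict[str, list[ToolGrant]] = {
    "invoice-triage-agent": [
        ToolGrant("erp.read_invoice", frozenset({"read"}), 600, 5.0),
        ToolGrant("crm.flag_account", frozenset({"write:flags"}), 60, 1.0),
    ],
}

def is_allowed(agent: str, tool: str, scope: str) -> bool:
    """Deny by default: no catalogue entry means no access."""
    return any(
        g.tool == tool and scope in g.scopes
        for g in CATALOGUE.get(agent, [])
    )
```

The design choice that matters is the default: an agent that isn't in the table can't call anything, which is the posture an examiner expects to find.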

If your CFO sees the 5% number and concludes that governance pays, fine. But let's not confuse the conclusion: what pays is the infrastructure. Governance is what makes it defensible. Without the first you don't have a product; without the second you don't have a licence to operate.

The 23% "scaling agents" figure is smaller than it looks

The other number that will travel around board decks this month is that 23% of organisations are already scaling agents somewhere. Read literally, that's a milestone. Read by an engineer who has to keep those systems up, it's a question:

Scaled how? With what SLOs? Under which risk classification? With what incident plan?

The report is honest enough to say that only about one third of organisations report maturity of level 3 or higher across strategy, governance and agent-specific governance. The gap between "23% scaling agents" and "33% with level-3 governance" is exactly the space where the next round of AI incidents — the ones that make the trade press — will live.

In regulated sectors — banking, healthcare, energy, public sector — that gap isn't a theoretical risk. It's a finding a supervisor can close with a remediation order. The question I ask any team that wants to scale agents in these sectors is the same one an ECB or OCC examiner would ask: show me the evidence.

65% versus 23%: the difference is human-in-the-loop done properly

One of the more useful data points in the report is the gap between high performers and the rest on human validation: 65% of leaders have defined human-in-the-loop processes, versus 23% in the laggard cohort. The report correctly describes a phenomenon I see every week in technical audits: the difference between an AI system that survives internal review and one that doesn't is, almost always, the rigour of the human layer, not the quality of the model.

But human-in-the-loop is a label that hides four very different designs:

  1. Explicit-approval HITL — the agent proposes, the human signs. The pattern a regulator understands without translation. Slow, but defensible.
  2. HITL by exception — the agent acts autonomously above a confidence threshold; the human steps in below it. Requires a calibrated confidence estimator. Many teams use raw logit probability as a proxy here. It isn't one. Calibrate or die — see the sketch after this list.
  3. Post-hoc HITL — the human reviews a statistical sample after the fact. Useful for drift detection, insufficient as a primary control in regulated sectors.
  4. Theatre HITL — there is a human in the workflow, but their real job is to hit approve on batches of 200 because the queue is moving too fast. This isn't governance, it's absolution by keyboard. It surfaces at the first serious audit.
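
To show what "calibrate or die" means in design 2, here's a minimal sketch of histogram-style calibration: bucket the model's raw scores against accuracy measured on a labelled held-out set, and route on the calibrated number, never the raw one. The bucket edges and accuracy figures below are placeholders, not measurements:

```python
import bisect

# Hypothetical calibration table built offline from a labelled held-out
# set: for each raw-score bucket, the accuracy actually observed there.
BUCKET_EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]            # bucket lower edges
OBSERVED_ACCURACY = [0.52, 0.58, 0.66, 0.79, 0.91]  # measured, not assumed

def calibrated_confidence(raw_score: float) -> float:
    """Map a raw model score to the accuracy observed in its bucket.
    Scores below the first edge fall into the lowest bucket."""
    i = bisect.bisect_right(BUCKET_EDGES, raw_score) - 1
    return OBSERVED_ACCURACY[min(max(i, 0), len(OBSERVED_ACCURACY) - 1)]

def route(decision_id: str, raw_score: float, threshold: float = 0.85) -> str:
    """Design 2: autonomous above the threshold, human review below it.
    The threshold applies to *calibrated* confidence, never raw logits."""
    if calibrated_confidence(raw_score) >= threshold:
        return f"{decision_id}: auto-approved"
    return f"{decision_id}: queued for human review"
```

Note that with these placeholder numbers a raw score of 0.8 only buys you 0.79 calibrated — below the 0.85 bar — which is exactly the gap that quietly turns design 2 into design 4 when nobody measures it.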

When we work with a client from the 65% cohort, they almost always run a calibrated mix of 1 and 2 with a statistical sample of 3. When we work with a client from the 23% cohort, they almost always sit at 4 without knowing it. That's the real difference, and it's architectural before it's cultural. I've written about this at length, and I keep having to repeat it.

"Doing the wrong thing" is a new runbook problem

McKinsey introduces a distinction worth stealing as-is: in the agentic era, organisations can no longer worry only about systems that say the wrong thing, but must also contend with systems that do the wrong thing — take unintended actions, misuse tools, or operate outside their guardrails.

That shift is what breaks most of the runbooks I see at clients coming out of the chatbot era. The whole observability discipline built around latency, error rate and throughput is still necessary, but no longer sufficient. You need a second monitoring axis:

  • Per-agent tool inventory with scopes, rate limits and allowed destinations. If agent A can touch Salesforce, agent B shouldn't be able to reach it transitively via delegation.
  • Cost and action quotas per agent per time window. An infinite loop in an agent that calls a paid external API is a finance incident before it's an SRE one.
  • Behaviour alarms, not just error alarms: an agent that did one thing yesterday and a slightly different thing today against real data — even if it doesn't technically fail — is the era-defining incident signal.
  • Signed audit trail for every tool action executed, not just the model's messages. In a regulated environment, who did what to my system of record is the examiner's question, not what the LLM said. A minimal sketch of a tamper-evident trail follows this list.
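
For that last bullet, here's a minimal sketch under the assumption that a signing key lives in your KMS rather than in code: HMAC-sign each executed tool action and chain it to the previous entry, so deletion or reordering is detectable after the fact. Every name here is hypothetical:

```python
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-a-key-from-your-KMS"  # never hard-code this

def record_tool_action(agent: str, tool: str, args: dict, prev_sig: str) -> dict:
    """Append-only, tamper-evident entry for one executed tool action.
    Chaining each entry to the previous signature lets an examiner
    verify that nothing was removed or reordered."""
    entry = {
        "ts": time.time(),
        "agent": agent,
        "tool": tool,
        "args": args,
        "prev": prev_sig,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return entry

# Usage: chain entries and ship them somewhere the agent cannot write to.
genesis = "0" * 64
e1 = record_tool_action("invoice-triage-agent", "crm.flag_account",
                        {"account_id": "A-123"}, genesis)
e2 = record_tool_action("invoice-triage-agent", "erp.read_invoice",
                        {"invoice_id": "INV-9"}, e1["sig"])
```

What makes this evidence rather than logging is the last comment: the trail has to live outside the agent's own blast radius.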

If your stack doesn't generate that second stream, you aren't running agents in production. You're running a demo with elevated permissions. The distance between the two will be paid in an incident, a headline or a fine, in that order.

What actually changes in a regulated environment

The report covers the EU AI Act and the three-year horizon to full enforcement. It correctly notes that a conservative approach — anticipating likely standards on human oversight, data protection and fairness — helps organisations stay ahead. I agree. From an engineering seat, here's what "staying ahead" looks like in practice while the regulation is still solidifying:

  1. Risk-classify the system, not the model. Most teams classify the LLM. What the regulator wants classified is the full sociotechnical system: model + retrieval + tools + human flow + data. Without that map, you can't even start answering Article 9 of the AI Act.
  2. Joint versioning of model, prompt and retrieval index. A change in any of the three has to produce an immutable, signed, traceable artifact. If you version the model but not the retrieval index, you can't reproduce a six-month-old decision under a subpoena. That isn't an engineering preference anymore; it's a requirement.
  3. Data-isolation policies enforced on retrieval output, not just on input. Most leaks I see in regulated pilots come from retrieval pulling more than it should and the model reciting it with confidence. The policy has to apply before the context reaches the model, not after.
  4. Deployment gates with proof. Pushing a new prompt to production should pass a minimum eval battery — alignment, bias, leakage, tool behaviour — before it touches live traffic. The idea of proof-carrying deployment stops being academic the moment a supervisor asks for evidence of what you validated before the change. A sketch covering this point and point 2 follows the list.
  5. Controlled-withdrawal plan. Every agent in production should have a documented, tested kill switch with execution measured in minutes. Not "we can deprecate it next sprint". Minutes. In a regulated environment, the option of not acting is often safer than acting; your system has to know how.
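
As a sketch of points 2 and 4 together — the eval names and pass bars are placeholders you'd agree with risk and compliance, not standards — the shape is: one immutable identifier per (model, prompt, index) triple, and a gate that refuses the deploy unless every eval clears its bar:

```python
import hashlib
import sys

# Hypothetical minimum eval battery: required pass rates per dimension.
EVAL_BARS = {
    "alignment": 0.98,
    "bias": 0.97,
    "leakage": 1.00,        # zero tolerance on data leakage
    "tool_behaviour": 0.99,
}

def release_artifact(model_v: str, prompt_v: str, index_v: str) -> str:
    """Point 2: one immutable identifier for the (model, prompt,
    retrieval index) triple, so a six-month-old decision can be
    reproduced exactly under a subpoena."""
    return hashlib.sha256(f"{model_v}|{prompt_v}|{index_v}".encode()).hexdigest()

def gate(results: dict[str, float]) -> bool:
    """Point 4: block the deploy unless every eval clears its bar,
    and keep the failure evidence -- that is what a supervisor asks for."""
    failures = {k: v for k, v in results.items() if v < EVAL_BARS[k]}
    if failures:
        print(f"deploy blocked: {failures}", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    artifact = release_artifact("model-2026-04", "prompt-v17", "index-2026-05-10")
    if gate({"alignment": 0.99, "bias": 0.98,
             "leakage": 1.0, "tool_behaviour": 0.995}):
        print(f"deploying artifact {artifact[:12]}")
```

Sign and archive the artifact hash alongside the eval results and you have both pieces of evidence from point 4 in one place.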

None of those five come free with any agentic platform I've seen on the market this year. All five are your own architecture work. McKinsey sells them as verifiable governance architecture; I prefer to call it a runbook a lawyer can sign.

The report's bias: optimistic by construction

A note on the data. McKinsey's survey is answered, by definition, by people with direct responsibility or expertise in AI governance, risk management or AI investment decisions. That's a sample self-selected toward companies that have those functions defined at all. Reality in the mid-market is worse than the report shows — not because McKinsey is misleading, but because companies without an AI risk officer don't answer this kind of survey and therefore appear in neither the numerator nor the denominator.

If your organisation doesn't have someone who could plausibly answer this survey, your real maturity is probably not 2.3. It's closer to 1, and the first job isn't getting to 3; it's building the role that lets you measure it honestly.

What I'd put on my own roadmap this quarter

If I have to translate the report into concrete actions for a platform team in a regulated sector, this is what I'd do before the next board update:

  1. A real inventory of agents in production, not just the ones marketing calls agents. That means counting cron jobs, webhooks and scripts that call an LLM with elevated permissions.
  2. A single table that answers who can do what: agent, tools, scopes, accessible data, accountable human, behaviour metrics. If it doesn't fit in a table, you can't defend it.
  3. An explicit governance budget: people, tools, evals, platform. The report says the >$25M cohort sees returns. Your number won't be that, but the principle holds: governance without a budget is theatre.
  4. A kill-switch drill per critical agent, timed. If it takes more than ten minutes, you don't have one. A sketch of a timed drill follows this list.
  5. A grown-up conversation with risk and compliance. Governance maturity grows when engineering, risk and compliance share vocabulary. The report correctly identifies that gap as a primary barrier for many organisations; the fix is cultural and organisational before it's technical.
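
For the kill-switch drill, a minimal sketch of how I'd time it — disable_fn and verify_fn are hypothetical hooks standing in for whatever actually revokes the agent's credentials and confirms traffic has stopped in your stack:

```python
import time
from typing import Callable

def kill_switch_drill(agent: str,
                      disable_fn: Callable[[], None],
                      verify_fn: Callable[[], bool],
                      budget_s: float = 600.0) -> bool:
    """Time a controlled-withdrawal drill for one agent. The ten-minute
    budget is the bar from the roadmap above, not an industry standard."""
    start = time.monotonic()
    disable_fn()  # revoke credentials, drop routes, whatever it takes
    while not verify_fn():  # poll until no traffic reaches the agent
        if time.monotonic() - start > budget_s:
            print(f"{agent}: drill FAILED (> {budget_s:.0f}s) -- no kill switch")
            return False
        time.sleep(1.0)
    print(f"{agent}: kill switch verified in {time.monotonic() - start:.1f}s")
    return True
```

Run it on a schedule, against production-shaped traffic; a kill switch that has only ever been tested in staging is a slide, not a runbook.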

The line I'm drawing

McKinsey is right in the central observation: agentic AI moves the problem from saying to doing, and that changes the kind of governance you need underneath anything you want to call production. My question isn't whether the global sector is more mature (yes, slightly) or whether the risk is going up (clearly). My question is whether, in your specific system, an examiner could ask for the action log, the decision lineage, the human-validation history and the result of the last pre-deployment eval — and you could put all four artifacts on the table within the same hour.

If the answer is yes, you're in the 33% with real maturity and can start talking business value. If it's no, the report's 2.3 average is still aspirational for you, regardless of what the board slide says.

The companies that will win the agentic era won't be the ones that scale agents fastest. They'll be the ones who, when the regulator, the auditor or the incident investigator shows up, can open the runbook and turn the pages without looking away.


Sources:

  • McKinsey & Company, State of AI trust in 2026: Shifting to the agentic era, April 2026. mckinsey.com
  • McKinsey & Company, Trust in the age of agents — Agentic AI governance for autonomous systems. mckinsey.com
  • McKinsey & Company, Deploying agentic AI with safety and security: A playbook for technology leaders. mckinsey.com

Putting AI agents into production under a real regulator and unsure your runbook would survive the first audit? Talk to a CTO — we'll help you separate real maturity from the slide.
