← Back to all articles
strategy

You Can't Recall a Model That Runs on Your Own Hardware

By Marc Molas·June 14, 2026·11 min read

This weekend we've watched projects and prototypes break because a government in another country decided against the usage of a commodity.

Not a bug. Not a bad deploy. Not a rate limit you could back off and retry. A US export-control directive ordered the most capable public AI model in the world switched off — for every user, everywhere, including the vendor's own employees who happened to hold the wrong passport. If your product reached that model over an API, your product didn't degrade gracefully. It returned an error and stopped. I worked through what that did to the price of sovereign risk and to the IPO math in A Visa System for Intelligence. This post is about the other half of the bill: what it does to how you build.

I'm writing this from the builder's seat, not the policy desk or the cap table. I ship production systems that call these APIs, and the lesson I took from last week isn't political — it's architectural. A model you reach over the wire, sitting on a server behind a national border, is a dependency with an off switch you don't own. And the state has now shown, with a timestamp, that it will flip it. The market has already started routing around that switch. The same week one model went dark by decree, Microsoft quietly documented how to run another with no API in the loop at all. And three weeks before any of it, Nvidia — the company that sells the shovels — rewrote its own financial statements to bet that this is exactly where computing goes.

An API behind a border has a failure mode that lives in a federal building

I keep a short list of the ways a feature can die without anyone touching its code. The CrowdStrike outage was a bad upstream update — 8.5 million machines down over a file nobody at your company wrote. Unity's runtime fee was a pricing change you didn't consent to, applied retroactively to software you'd already shipped. Both are vendor-dependency failures, and both are, in the end, negotiable — you can engineer around a bad patch and you can argue down a bill.

Yesterday added a third entry with a genuinely new cause, and this one isn't negotiable. A sovereign directive: nationality-gated, effective immediately, with no SLA that covers it and no appeal beyond compliance. There is no support ticket you can file against an export-control order. The vendor itself couldn't refuse it — it could only object on the way to obeying. The visa-system piece named this sovereign recall risk, and the thing worth internalizing is that it's structurally unlike every dependency risk we already know how to manage. You can buy redundancy across regions, across providers, across clouds. You cannot buy redundancy against the proposition that the single most capable tier of model is now a controlled strategic asset, and that the government deciding so is the same one your vendor is domiciled in.

Every mitigation we reach for by reflex — multi-region, multi-cloud, a second provider — still ends at a model sitting on someone else's server, reachable only as long as a directive permits it. There is only one mitigation that removes the switch instead of hedging it: run the model on hardware you own. A week ago that sounded like something we cannot afford. It's now a resilience requirement, and the tooling to act on it shipped the same week the risk did.

The same week one model was switched off, Microsoft documented how to run another with no server at all

Here is the part that made me stop. Microsoft's Phi Silica is a 3.3-billion-parameter small language model. Until recently it ran only on the neural processing units inside Copilot+ PCs — a narrow, certified hardware tier. This June, Microsoft quietly expanded its Windows AI documentation with a new page: how to run Phi Silica on Nvidia RTX GPUs, no NPU required. The supported list reaches back across the RTX 30-series and newer, the bar is roughly 8 GB of dedicated video memory and a driver from the 560 branch or later, and execution goes through the Windows Copilot Runtime over DirectML. The documentation is blunt about the one thing that matters here: the model and the inference run entirely on the user's own hardware. No cloud API calls.

Read the requirement again and translate it out of spec-sheet language: a useful, supported, locally-run language model now targets a graphics card that millions of people already own. Not a data-center accelerator under export license. Not a certified AI PC you have to go buy. The card that's already in the tower running games. The capability didn't get cheaper — it moved into a building the state can't reach without a warrant.

Nvidia rewrote its own books to bet on the edge — three weeks before the recall

If you want to know where inference demand is actually heading, don't read the manifestos. Read the company that has the clearest view of the order book and the strongest incentive not to be wrong about it — and watch what it does when it has to make claims under oath.

In its first-quarter fiscal-2027 results on May 20, Nvidia changed how it reports its own business. The old operating segments — "Compute & Networking" and "Graphics" — are gone. In their place are two market platforms: Data Center and Edge Computing. Inside Data Center sit two sub-markets, Hyperscale and ACIE (AI Clouds, Industrial, Enterprise). And standing beside it, for the first time as a co-equal platform, is Edge Computing — defined as the devices for agentic and physical AI: PCs, game consoles, workstations, AI-RAN base stations, robotics, automotive. The category Nvidia used to call "gaming" didn't shrink; it got absorbed into a platform whose name is now about running AI at the edge. Edge Computing booked $6.4 billion in the quarter on its own line.

A company does not restructure its segment reporting on a whim. That's an audited filing, durable, expensive to change, and read closely by people who sue when it misleads them. When the firm with the best view of the future puts Edge Computing alongside the data center as a co-equal platform, it is telling you — in the most legally constrained language a business has — that it does not believe the future is one giant model sitting on one server behind one nation's border. And it said this in May, three weeks before the June recall. So this is not a reaction to the news. It's the structural bet the news then validated.

We have, in fairness, seen this movie. Compute decentralizes whenever the center accrues a liability the edge doesn't carry. Mainframe to PC, when the liability was cost and access. PC back to cloud for a decade, when the liability was operational toil. Now the pendulum is loading the other way under the weight of latency, unit economics, privacy — and, as of last week, sovereignty, which is the heaviest liability the center has ever carried, because it's the only one you can't price, insure, or negotiate. The swing isn't ideological. It's a business routing around the most expensive risk on the board.

Business routes around risk; that's the one thing it reliably does

Strip away the geopolitics and this is an ordinary observation about how companies behave. A business is, above almost everything else, a risk-routing machine. It will accept worse latency, higher upfront cost, and more engineering work to remove a tail risk that can zero its product overnight — the same way it carries insurance it hopes never to use. For two years the case for local inference was made on cost and privacy, and it lost most arguments, because the convenience of a frontier API was worth the lock-in. Last week the calculation changed, because the tail risk stopped being hypothetical and acquired a timestamp.

Now the strongest objection, head-on, because it's correct: a 3.3-billion-parameter model is not Fable 5, and it isn't close. You cannot run frontier-grade reasoning on a gaming GPU, and a great deal of what makes these tools worth paying for lives in the top tier that only the big remote models can serve. True but wrong framing. Nobody serious is proposing you move everything local. The move is to tier the work:

  • The high-volume, latency-sensitive, capability-modest 80–90% — classification, extraction, drafting, autocomplete, retrieval-augmented answers over your own documents — runs perfectly well on a 3–8B local model today. That's also, not coincidentally, the part of your stack where an outage is most expensive, because it sits in the hot path of everything.
  • The genuinely hard 5–10% that needs the frontier stays on the API — but behind a documented, tested fallback, so a recall degrades you instead of stopping you.

And the gap narrows every quarter; small models keep absorbing capability that used to require the frontier. The point of going local was never parity. It's optionality — and owning the off switch on the part of your product you can't afford to have switched off by someone else.

One more honest caveat, because it cuts the other way: the state controls the chips, too. The same administration that recalled the model has Nvidia and AMD handing it a cut of their China revenue for the privilege of exporting at all. But there's a real difference between controlling the next sale and reaching into a GPU already humming in your rack. The directive that landed last week was remote and instant. A model resident on hardware you already own exposes no remote interface for a directive to grab. Export controls slow down your next purchase. They do not recall your installed base.

What I'd put on the architecture diagram this quarter

If I were your CTO, this is the work I'd fund before the next planning cycle closes — concrete, not aspirational:

  1. Add a row to the dependency map. For every AI feature, write down which government can switch it off, and for which of your users by nationality. If that cell is blank, the design isn't finished. This belongs on the architecture diagram, not in a legal footnote.
  2. Put a stable inference interface in front of every model call, with at least one open-weight or local option already wired behind it. The model becomes swappable; the harness stays yours. The model is the commodity; the harness around it is the moat — and now, the resilience.
  3. Tier your workloads by the capability they actually require, then move the high-volume, capability-modest tier onto a local 3–8B model — Phi-class on an RTX box, or its open-weight peers. That single move takes your hottest path off the wire entirely.
  4. Write and test a fallback for every frontier-tier feature the way you'd write one for a payments provider: detect the 4xx, degrade to the local model, alert, keep serving. Then rehearse it. CrowdStrike and Unity taught us to have a fallback; the recall raised the stakes on actually testing it.
  5. Spec the hardware now. Capability you own outright cannot be repossessed by directive. An RTX box in your rack — or already in your user's tower — is a sovereignty hedge that also happens to cut your inference bill. Foundation-model economics was about not over-paying to rent capability; this is the harder-edged version of the same instinct.

Don't build the load-bearing wall out of something the wind can carry off

My grandfather ran a construction business, and he had a line he'd repeat whenever someone pitched him a venture that hinged on something outside the room: never do business that depends on which way the wind blows. He meant weather, and harvests, and political dependencies. My grandfather knew his stuff and 50 years later, I must take his advice to heart. Don't build out of a capability a government can switch off on a whim.

Last week the wind changed direction, and a model hundreds of millions of people relied on was gone before the next request landed. The frontier failed because it sat on a server behind a border, and the border has an owner. Microsoft's documentation and Nvidia's reporting change are the same instinct expressed twice, by two of the largest companies in the industry, in the same month: the durable place for a model to run is on hardware someone owns, where no directive can reach it. Not because local is faster. Because local can't be recalled.

If you're mapping your own AI supply chain for the switch you don't control, start with the companion to this piece — A Visa System for Intelligence — then come back and put "which government can switch this off" on the diagram, in writing, next to the feature it would take down.

Ready to build your engineering team?

Talk to a technical partner and get CTO-vetted developers deployed in 72 hours.