Coherence Is Not Correctness: Why a Paper Needs Testable Claims, Not Flawless Prose
Someone posted their own paper under one of my posts. Self-promotion doesn't bother me much, and this one caught my attention. The title alone should have warned me: Conditional Realism, Stewardship, and Survivable Cognition Under Finite Constraint. Forty pages on Zenodo, with a DOI, an ORCID, and a scaffolding of references all pointing back to the author's own research program, the Architecture of Limitation. It looks serious. The prose flows, it never contradicts itself, and it anticipates its own objections and defuses them gracefully.
And yet, by the end, there was nothing for me to grab onto. The difficulty wasn't the reason, although the paper is deliberately hard. The reason was that it never made a claim I could test, check, or refute from outside its own text. It was flawless and empty at the same time. The most telling part is that the paper describes its own failure mode without ever recognizing itself in the mirror.
It's worth spelling out why, because the pattern is getting more common, and in the age of language models, spotting it has quietly become a first-order engineering skill.
The kernel of truth, first
Let me be fair before I'm harsh. The paper has one good idea, and I'll lead with it so the rest doesn't read like a strawman.
The idea is this: the phrase "human in the loop" often works as a symbolic placeholder. We drop a human into a decision chain and declare the system safe, without ever defining the conditions under which that human participation is actually meaningful, proportionate, or accountable. The paper proposes replacing it with a notion of stewardship: what the human contributes isn't oversight, it's exposure to consequence. The model generates; the human bears. The asymmetry is structural, not moral.
I agree with that. It lines up directly with something I've argued for a long time: AI augments, it doesn't replace. The part of the chain you can't outsource is precisely whoever carries the cost of the error and holds an adversarial relationship with reality when the system drifts. That intuition is solid.
The problem is everything around it.
The core point: the paper is its own failure mode
Here's what actually made me write this post.
The paper, citing earlier work by the same author, coins two terms for how reasoning systems fail under pressure:
- "Coherence inflation": the moment an argument's recoverable structure starts to look like complete explanation, and growing internal consistency gets mistaken for metaphysical certainty.
- "Hallucination as geometric overflow": text that keeps its fluency, consistency, and explanatory organization while drifting past the boundaries that originally grounded the reasoning.
Read those two definitions again. They are an exact description of the paper that contains them.
Forty pages of fluent, internally consistent prose that never touches the ground. Every reference in the bibliography is another document by the same author, uploaded to the same repository, inside the same invented framework. It's a closed citation loop: the text validates itself in its own vocabulary. The paper even anticipates this charge — it literally writes that "readers may interpret the present work as recursively self-validating" — and waves it away by saying that isn't the intent. But acknowledging a vicious circle doesn't break it. The bibliography is still chasing its own tail.
And the detail that crowns it: a Zenodo DOI is not peer review. Zenodo mints a DOI for anything you upload — a PDF, a dataset, a meme. It's an archival service, not a quality stamp. The trappings of academic authority — ORCID, DOI, sections numbered in Roman numerals — are aesthetics, not substance.
What you end up with is an artifact that passes its own tests because it wrote them, and can't fail any of them because it never makes a checkable claim.
The engineering analogy
If you're an engineer, you already have the mental model for exactly what's going on here.
Picture a PR that compiles clean, passes the linter, and shows green CI. Every test passes. Then you open the tests and find that the code under test wrote them, and every assertion is a tautology: expect(x).toBe(x). The build is green. Coverage is 100%. And the system does precisely nothing.
That's the paper. Perfect syntactic coherence, zero contact with an external oracle.
We have a sharp instinct for this in software because it has bitten us so many times. We know a test that always passes is worthless. We know a system that only validates against itself — no staging environment, no real data, no user complaining — can be deeply broken and look perfectly healthy. A clean compile is not correctness. Green CI is not truth. It's internal consistency, which is a far cheaper and far less valuable property.
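To make the contrast concrete, here is a minimal sketch in Python rather than the Jest-style syntax quoted above. The function `apply_discount` and its bug are invented for illustration. One test compares the code with itself, the other compares it with a value worked out by hand.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test, with a deliberate bug:
    it adds the discount instead of subtracting it."""
    return price + price * percent / 100


def tautological_test() -> bool:
    """Compares the output with itself, so it cannot fail."""
    result = apply_discount(200.0, 10.0)
    return result == result


def oracle_test() -> bool:
    """Compares the output with a value computed independently of the code."""
    expected = 180.0  # 10% off 200, worked out by hand
    return apply_discount(200.0, 10.0) == expected


print("tautological test passes:", tautological_test())
print("oracle test passes:", oracle_test())
```

The first check stays green whatever the function does. Only the second can go red, because its expected value comes from outside the code being tested.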
Philosophy and science have the same instinct, and it has a name: falsifiability. A claim that can't be framed so that it could turn out false isn't false — in Pauli's phrase, it's "not even wrong." It doesn't enter the game. There's nothing to argue about, because there's nothing the world could contradict.
What makes a paper solid
I want to be constructive here, because the easy move is to stop at "this is fluff." The useful question is: what would it need to be solid?
Three things, in decreasing order of strength.
1. A testable claim, with experiments, data, and results
The gold standard. You assert something about the world, you design a way to measure it, you collect data, and the results either support the claim or knock it down. The key is that someone else, from the outside, can reproduce it and reach their own conclusion. The data isn't yours; it belongs to anyone willing to replicate it.
A few weeks ago I wrote about an AI alignment paper that treats the deployed system as a probability distribution over trajectories and defines alignment as topological membership in a safe set. You don't need to follow the math to see the difference in kind: that paper claims membership can be proven with finite logs using conformal bounds. It has a declared scope (information-work systems, not embodied AI). You can disagree with it, attack its assumptions, try to find a counterexample. It gives you surface to grab. That's what makes a paper part of a conversation.
2. A falsifiable claim, even if you couldn't test it yourself
Not everything needs a lab experiment on the day it's published. But it does need to be framed so it could be put to the test in principle, by someone, eventually. "Teams that monitor intermediate trajectories catch deviations earlier than teams monitoring only the final output" is a claim you may not have the data to close today, but any team can try to refute it with their own telemetry. It's arguable on shared ground.
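As a sketch of what "try to refute it with their own telemetry" could look like, assume a team has logged how long each deviation took to detect under both monitoring styles. The numbers below are invented placeholders, not results, and the function name `permutation_p_value` is my own. A one-sided permutation test is one simple way to ask whether the observed gap could be an accident of labeling.

```python
import random
from statistics import mean


def permutation_p_value(trajectory, output_only, n_permutations=5000, seed=0):
    """One-sided permutation test.

    Returns the share of random relabelings in which the 'output only'
    group is slower than the 'trajectory' group by at least the observed
    gap. A small value supports the claim; a large one counts against it.
    """
    rng = random.Random(seed)
    observed = mean(output_only) - mean(trajectory)
    pooled = list(trajectory) + list(output_only)
    n = len(trajectory)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = mean(pooled[n:]) - mean(pooled[:n])
        if gap >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical detection delays in minutes; replace with real telemetry.
trajectory_monitoring = [12, 9, 15, 11, 8, 14]
output_only_monitoring = [25, 31, 19, 28, 22, 35]

p = permutation_p_value(trajectory_monitoring, output_only_monitoring)
print(f"share of relabelings with a gap at least as large: {p:.4f}")
```

The point is not this particular statistic. The claim names two groups and a measurable outcome, so anyone with logs can run something like this and come back with a result that contradicts it.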
3. At minimum, a claim that's arguable outside the author's own head
This is the floor, and it's exactly the one the Zenodo paper fails to clear. A philosophical claim can be perfectly legitimate with no experiment at all (serious philosophy does this constantly) as long as it offers definitions, distinctions, and consequences that someone else can pick up and push back on. Structural realism, fallibilism, the problem of induction: these are long-standing philosophical positions and problems, debated for decades and in some cases centuries, precisely because they're stated sharply enough that someone can say "no, and here's why."
The paper I read doesn't do that. It repackages epistemic structural realism — a position that's been around since the 1980s — in invented vocabulary ("survivable cognition," "recoverable continuity," "operational invariants") and presents it as a proprietary architecture with "capability layers." It declares itself "operational, not metaphysical" dozens of times, yet never supplies a single operation: not a metric, not a procedure, not a criterion. The word "operational" works as a talisman. Everything is defined by abstract nominalization and nothing by a measurable operation.
It's a thought experiment sealed inside itself. And a thought experiment nobody can step into from the outside isn't research; it's a diary in academic typeface.
Why this is now an engineering problem
So far this might read like a quarrel between academics. It isn't. It's our problem, and it's turned urgent for one concrete reason: language models are coherence engines.
An LLM is optimized to produce the most plausible, most fluent, most internally consistent continuation of a text. It is not optimized to tell the truth. When it works well, the two roughly coincide. But coherence and correctness are independent axes, and a model can travel a very long way along the coherence axis with zero movement along the correctness one. It can generate forty flawless pages about a framework that doesn't exist, with a bibliography that cites itself, and every sentence will dovetail with the last.
The paper I read has all the fingerprints of being born this way — and the irony is perfect, because it describes this very phenomenon and doesn't recognize itself in it.
This is where its one good idea comes back, turned on the paper itself. The defense against empty coherence isn't to distrust AI. It's the human steward who carries the consequence and does the verifying. The model generates the prose; someone has to be the one who asks "wait, can this be checked? Against what? Who could replicate it? Do the references exist outside this document?" That function can't be outsourced to the same system that generates the text, for the same reason you don't let code write and approve its own tests.
That is, almost word for word, the thesis I keep coming back to: AI augments, it doesn't replace. The augmentation is real and enormous. But the epistemic accountability — the contact with an external oracle — stays human. The paper set out to argue this and instead proves it, by being the example of what happens when that function is missing.
A checklist, for engineers
To make this actionable and not just an elegant complaint, here are the questions I run when I read, or write, anything that claims to be a serious contribution. They're the same ones you'd run in a code review:
- Can you state the central claim so that it could be false? If there's no state of the world that would contradict it, it's not a claim; it's a definition in disguise.
- Is there any measurement? Data, an experiment, a reproducible observation. And if there isn't, at least a consequence someone could go and look for.
- Could an outsider argue with it on their own terms? Or can you only debate it after first swallowing the author's entire vocabulary?
- Are the references a closed loop? If every citation points back to the author or their own framework, the bibliography is decoration, not foundation.
- Are the terms defined as operations or as abstract nouns? "Recoverability" with no way to measure it is a word, not a concept.
- Is the appearance of authority doing the work of truth? A DOI, an ORCID, and Roman-numeral sections aren't peer review. Ask who actually evaluated this.
If a text fails most of these, it can be brilliant, it can be beautiful, it can even be right by accident — but you can't lean on it. And in production, leaning on things is the whole job.
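One item on that list, the closed citation loop, lends itself to a rough mechanical check. The sketch below assumes you have the bibliography as structured data. The `Reference` type, the function names, and the example entries are all hypothetical, and matching authors by exact name string is a simplification that real metadata would break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One bibliography entry (hypothetical structure for this sketch)."""
    title: str
    authors: frozenset


def self_citation_share(paper_authors, references):
    """Fraction of references sharing at least one author with the paper."""
    if not references:
        return 0.0
    own = sum(1 for ref in references if ref.authors & frozenset(paper_authors))
    return own / len(references)


def is_closed_loop(paper_authors, references, threshold=1.0):
    """True when the self-citation share reaches the threshold.

    The default threshold of 1.0 means every entry is a self-citation.
    """
    return bool(references) and self_citation_share(paper_authors, references) >= threshold


# Invented example data; none of these names or titles refer to real works.
authors = {"A. Author"}
bibliography = [
    Reference("Framework, Part I", frozenset({"A. Author"})),
    Reference("Framework, Part II", frozenset({"A. Author", "B. Coauthor"})),
    Reference("An Independent Survey", frozenset({"C. Outsider"})),
]

print(f"self-citation share: {self_citation_share(authors, bibliography):.2f}")
print("closed loop:", is_closed_loop(authors, bibliography))
```

A high share is a prompt to look closer, not a verdict. Authors legitimately cite their own earlier work, and the threshold here is an arbitrary choice.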
What I take away
Coherence is cheap. It always was, but it used to take talent or obsession to produce forty internally consistent pages about nothing. Now it's free and instant. Which means coherence has stopped being a quality signal, and the entire load shifts onto the properties that should have mattered all along: testability, falsifiability, exposure to contradiction from the outside.
The paper I read isn't solid, and it isn't technical — for all its vocabulary of vector geometry and distributed semantic representations, it contains not one equation, not one data point, not one experiment. But it's been useful to me, because it's the perfect case study of something we all need to learn to catch: text that sounds like a thesis and isn't one.
The engineering job — and the job of anyone who wants to think well with these tools — isn't to stop using the machine that generates fluent prose. It's to keep the discipline of asking, every single time, what this gets checked against. Because a system that only validates against itself looks perfectly healthy right up until the day you put it in front of the world.