
Coherence Is Not Correctness: Why a Paper Needs Testable Claims, Not Flawless Prose

By Marc Molas · May 30, 2026 · 12 min read

Someone posted their own paper under one of my posts. I'm not too bothered by self-promotion, but this one caught my attention. The title alone should have warned me: Conditional Realism, Stewardship, and Survivable Cognition Under Finite Constraint. Forty pages on Zenodo, with a DOI, an ORCID, and a scaffolding of references all pointing back to the author's own research program, the Architecture of Limitation. It looks serious. The prose flows, it never contradicts itself, and it anticipates its own objections and defuses them gracefully.

And yet, by the end, there was nothing for me to grab onto. Not because it was hard (it is, deliberately) but because the claims it does make are framed so that nothing outside its own text could ever count against them. It was fluent and unfalsifiable at the same time. The most telling part is that the paper names its own failure mode and then, in my reading, walks straight into it.

It's worth spelling out why, because the pattern is getting more common, and in the age of language models, spotting it has quietly become a first-order engineering skill.

The kernel of truth, first

Let me be fair before I'm harsh. The paper has one good idea, and I'll lead with it so the rest doesn't read like a strawman.

The idea is this: the phrase "human in the loop" often works as a symbolic placeholder. We drop a human into a decision chain and declare the system safe, without ever defining the conditions under which that human participation is actually meaningful, proportionate, or accountable. The paper proposes replacing it with a notion of stewardship: what the human contributes isn't oversight, it's exposure to consequence. The model generates; the human bears. The asymmetry is structural, not moral.

I agree with that. It lines up directly with something I've argued for a long time: AI augments, it doesn't replace. The part of the chain you can't outsource is precisely whoever carries the cost of the error and holds an adversarial relationship with reality when the system drifts. That intuition is solid.

The problem is everything around it.

The core point: the paper is its own failure mode

Here's what actually made me write this post.

The paper, citing earlier work by the same author, coins two terms for how reasoning systems fail under pressure:

"Coherence inflation": the moment an argument's recoverable structure starts to look like complete explanation, and growing internal consistency gets mistaken for metaphysical certainty.
"Hallucination as geometric overflow": text that keeps its fluency, consistency, and explanatory organization while drifting past the boundaries that originally grounded the reasoning.

Read those two definitions again. The second one especially reads almost like a description of the paper that contains them.

Forty pages of fluent, internally consistent prose that never touches the ground. Every reference in the bibliography is another document by the same author, uploaded to the same repository, inside the same invented framework. It's a closed citation loop: the text validates itself in its own vocabulary. The paper even anticipates this charge — it literally writes that "readers may interpret the present work as recursively self-validating" — and waves it away by saying that isn't the intent. But acknowledging a vicious circle doesn't break it. The bibliography is still chasing its own tail.

And one last thing — about how the paper lands on a reader, not about what its author claims. It's openly a preprint: self-archived on Zenodo, explicitly provided for academic and research purposes, with no claim of peer review anywhere in it. That's entirely legitimate — preprints are how a lot of real work first appears. The trap is on the reading side. A DOI, an ORCID, and sections numbered in Roman numerals can read as vetting to someone skimming, and they aren't: Zenodo mints a DOI for anything you upload — a PDF, a dataset, a meme — it's an archival service, not a quality stamp. The author never pretended otherwise. The point is that the trappings of academic authority do work on us that only scrutiny should.

What you end up with is an artifact that validates against its own vocabulary and gives you no foothold from outside it — because the claims it makes are never framed so the world could contradict them.

The engineering analogy: green CI is not truth

If you're an engineer, you already have the mental model for exactly what's going on here.

Picture a PR that compiles clean, passes the linter, and shows green CI. Every test passes. Then you open the tests and find that they were derived from the very code they are supposed to check, and every assertion is a tautology: expect(x).toBe(x). The build is green. Coverage is 100%. And the test suite verifies precisely nothing.

That's the paper. Perfect syntactic coherence, zero contact with an external oracle.

We have a sharp instinct for this in software because it has bitten us so many times. We know a test that always passes is worthless. We know a system that only validates against itself — no staging environment, no real data, no user complaining — can be deeply broken and look perfectly healthy. A clean compile is not correctness. Green CI is not truth. It's internal consistency, which is a far cheaper and far less valuable property.
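A minimal sketch of that difference, written in Python with plain comparisons rather than the Jest-style expect(x).toBe(x) mentioned above. The function buggy_total and its numbers are hypothetical, invented only for illustration: the first check compares the code's output with itself, the second compares it with a value worked out by hand, independently of the code.

```python
def buggy_total(prices: list[float], tax_rate: float) -> float:
    """Hypothetical function under test.

    Bug (deliberate, for illustration): adds the tax rate to the sum
    instead of applying it as a multiplier.
    """
    return sum(prices) + tax_rate


def tautological_check() -> bool:
    """Compares the output with itself, so it passes whatever the code does."""
    result = buggy_total([10.0, 20.0], 0.5)
    return result == result


def oracle_check() -> bool:
    """Compares the output with an expected value derived outside the code."""
    expected = 45.0  # (10 + 20) * 1.5, computed by hand
    return buggy_total([10.0, 20.0], 0.5) == expected


if __name__ == "__main__":
    print("tautological check passes:", tautological_check())
    print("oracle check passes:", oracle_check())
```

The tautological check stays green with the bug in place; only the check anchored to an outside value can fail, which is the whole point of having it.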

Philosophy and science have the same instinct, and it has a name: falsifiability. A claim that can't be framed so that it could turn out false isn't false — in Pauli's phrase, it's "not even wrong." It doesn't enter the game. There's nothing to argue about, because there's nothing the world could contradict.

I want to be fair and precise here, because this is exactly the kind of claim that should itself be checkable. The paper is not silent. It carries an explicitly labelled Provisional Hypothesis (Section XIV) and a deliberately narrowed concluding claim (Section XV), and it is careful, repeatedly, never to inflate either into certainty — "No final metaphysical closure is claimed." So the honest objection isn't that it makes no claim. It's that the claims it does make — "deterministic structure may remain operationally admissible … through the persistence of invariant structures recoverable across constrained observational frames" — are stated so that nothing you could observe would count for them or against them. That's what "not even wrong" actually means: not missing, but unfalsifiable.

What makes a paper solid

I want to be constructive here, because the easy move is to stop at "this is fluff." The useful question is: what would it need to be solid?

Three things, in decreasing order of strength.

1. A testable claim, with experiments, data, and results

The gold standard. You assert something about the world, you design a way to measure it, you collect data, and the results either support the claim or knock it down. The key is that someone else, from the outside, can reproduce it and reach their own conclusion. The data isn't yours; it belongs to anyone willing to replicate it.

A few weeks ago I wrote about an AI alignment paper that treats the deployed system as a probability distribution over trajectories and defines alignment as topological membership in a safe set. You don't need to follow the math to see the difference in kind: that paper claims membership can be proven with finite logs using conformal bounds. It has a declared scope (information-work systems, not embodied AI). You can disagree with it, attack its assumptions, try to find a counterexample. It gives you surface to grab. That's what makes a paper part of a conversation.

2. A falsifiable claim, even if you couldn't test it yourself

Not everything needs a lab experiment on the day it's published. But it does need to be framed so it could be put to the test in principle, by someone, eventually. "Teams that monitor intermediate trajectories catch deviations earlier than teams monitoring only the final output" is a claim you may not have the data to close today, but any team can try to refute it with their own telemetry. It's arguable on shared ground.
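As a rough sketch of what "refute it with their own telemetry" could look like, the function below states the condition under which the claim would be contradicted. Everything here is an assumption made for illustration: the function name, the choice of median time-to-detection as the metric, and the idea that each team logs detection times in hours. A real analysis would also need comparable teams and a proper significance test.

```python
from statistics import median


def claim_is_contradicted(
    trajectory_hours: list[float],
    output_only_hours: list[float],
) -> bool:
    """Return True if the observations contradict the claim.

    The claim predicts that teams monitoring intermediate trajectories
    detect deviations sooner, i.e. with a lower median time-to-detection
    than teams monitoring only the final output.
    """
    if not trajectory_hours or not output_only_hours:
        raise ValueError("need at least one observation per group")
    # Equal or slower detection counts against the claim.
    return median(trajectory_hours) >= median(output_only_hours)


if __name__ == "__main__":
    # Placeholder numbers, not real telemetry.
    print(claim_is_contradicted([2.0, 3.5, 1.0], [6.0, 4.0, 9.0]))
```

What matters is less the particular metric than the fact that the function can return True: there is a describable set of observations that would count against the claim.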

3. At minimum, a claim that's arguable outside the author's own head

This is the floor, and it's exactly the one the Zenodo paper fails to clear. A philosophical claim can be perfectly legitimate with no experiment at all (serious philosophy does this constantly) as long as it offers definitions, distinctions, and consequences that someone else can pick up and push back on. Structural realism, fallibilism, the problem of induction: these are long-standing philosophical debates, some of them centuries old, and they have lasted precisely because the positions are stated sharply enough that someone can say "no, and here's why."

The paper I read doesn't do that. It repackages epistemic structural realism — a position that's been around since the 1980s — in invented vocabulary ("survivable cognition," "recoverable continuity," "operational invariants") and presents it as a proprietary architecture with "capability layers." It declares itself "operational, not metaphysical" dozens of times, yet never supplies a single operation: not a metric, not a procedure, not a criterion. The word "operational" works as a talisman. Everything is defined by abstract nominalization and nothing by a measurable operation.

It's a thought experiment sealed inside itself. And a thought experiment nobody can step into from the outside isn't research; it's a diary in academic typeface.

Why this is now an engineering problem

So far this might read like a quarrel between academics. It isn't. It's our problem, and it's turned urgent for one concrete reason: language models are coherence engines.

An LLM is optimized to produce the most plausible, most fluent, most internally consistent continuation of a text. It is not optimized to tell the truth. When it works well, the two roughly coincide. But coherence and correctness are independent axes, and a model can travel a very long way along the coherence axis with zero movement along the correctness one. It can generate forty flawless pages about a framework that exists only inside its own text, with a bibliography that cites itself, and every sentence will dovetail with the last.

I'm not going to claim I know how this particular paper was written — I can't, and it doesn't matter to the argument. What matters is the property, not the provenance: it is now possible — for a person, a model, or the two working together — to produce forty fluent, internally consistent pages that never once make contact with an external check. This paper is a clean example of that property. And the irony is sharp, because it names this very failure mode and then, in my reading, exhibits it.

This is where its one good idea comes back, turned on the paper itself. The defense against empty coherence isn't to distrust AI. It's the human steward who carries the consequence and does the verifying. The model generates the prose; someone has to be the one who asks "wait, can this be checked? Against what? Who could replicate it? Do the references exist outside this document?" That function can't be outsourced to the same system that generates the text, for the same reason you don't let code write and approve its own tests.

That is, almost word for word, the thesis I keep coming back to: AI augments, it doesn't replace. The augmentation is real and enormous. But the epistemic accountability, the contact with an external oracle, stays human. The paper sets out to argue exactly this, and to my eye ends up illustrating it: what a text looks like when the question "checked against what?" is never forced on it.

A checklist, for engineers

To make this actionable and not just an elegant complaint, here are the questions I run when I read, or write, anything that claims to be a serious contribution. They're the same ones you'd run in a code review:

Can you state the central claim so that it could be false? If there's no state of the world that would contradict it, it's not a claim; it's a definition in disguise.
Is there any measurement? Data, an experiment, a reproducible observation. And if there isn't, at least a consequence someone could go and look for.
Could an outsider argue with it on their own terms? Or can you only debate it after first swallowing the author's entire vocabulary?
Are the references a closed loop? If every citation points back to the author or their own framework, the bibliography is decoration, not foundation.
Are the terms defined as operations or as abstract nouns? "Recoverability" with no way to measure it is a word, not a concept.
Is the appearance of authority doing the work of truth? A DOI, an ORCID, and Roman-numeral sections aren't peer review. Ask who actually evaluated this.

If a text fails most of these, it can be brilliant, it can be beautiful, it can even be right by accident — but you can't lean on it. And in production, leaning on things is the whole job.
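For anyone who prefers their checklists executable, here is one way the six questions could be recorded per document. This is a sketch under stated assumptions, not a scoring method from the article: the class name ClaimReview, the field names, and the reading of "fails most" as "fails more than half" are all choices made for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClaimReview:
    """One boolean per checklist question; True means the text passes it."""

    could_be_false: bool               # central claim can be stated so it could be false
    has_measurement: bool              # data, experiment, or a consequence to look for
    arguable_by_outsiders: bool        # debatable without adopting the author's vocabulary
    cites_outside_sources: bool        # references are not a closed loop
    terms_are_operational: bool        # terms defined by operations, not abstract nouns
    was_independently_evaluated: bool  # someone other than the author evaluated it

    def failed(self) -> list[str]:
        """Names of the questions the text fails, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def fails_most(self) -> bool:
        """Assumption: 'most' is read as strictly more than half."""
        return len(self.failed()) > len(fields(self)) / 2


if __name__ == "__main__":
    # A hypothetical review, not a verdict on any real paper.
    review = ClaimReview(True, False, True, False, False, False)
    print(review.failed())
    print(review.fails_most())
```

The booleans still have to come from a human reading the text; the structure only keeps the verdict from collapsing into a general impression.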

What I take away

Coherence is cheap. It always was, but it used to take talent or obsession to produce forty internally consistent pages about nothing. Now it's free and instant. Which means coherence has stopped being a quality signal, and the entire load shifts onto the properties that should have mattered all along: testability, falsifiability, exposure to contradiction from the outside.

The paper I read isn't solid, and it isn't technical — for all its vocabulary of vector geometry and distributed semantic representations, it contains not one equation, not one data point, not one experiment. But it's been useful to me, because it's the perfect case study of something we all need to learn to catch: text that sounds like a thesis and isn't one.

The engineering job — and the job of anyone who wants to think well with these tools — isn't to stop using the machine that generates fluent prose. It's to keep the discipline of asking, every single time, what this gets checked against. Because a system that only validates against itself looks perfectly healthy right up until the day you put it in front of the world.


Editor's note (June 2026). A word on what this is, and then on what I changed.

I pride myself on being accurate and fair, and I genuinely welcome criticism — a bit of banter, even — as long as it's the kind of healthy intellectual competition that makes arguments sturdier and pushes conclusions closer to verifiable claims. I also think the fast, occasionally gleeful, lightly hyperbolic register of a blog has its place. This is a blog — not a paper, not a peer-reviewed publication, and it doesn't try to be one in any scientific sense. These are notes from the engine room, after all: quick ideas, written at speed. But if something as lighthearted as a five-minute read of notes offends an honest researcher, or discourages them from their work, nothing could be further from my intention — and I'm glad to polish the prose and sharpen the arguments.

So, after this piece went out and its author, Franky Schaut, reached out, I revised it on three points. First, I removed any speculation about how the paper was produced — I can't know that, and the argument doesn't need it; what matters is the property, not the provenance. Second, I made the central objection more precise: the paper does state an explicitly provisional hypothesis (Section XIV) and a narrowed concluding claim (Section XV), and it is careful never to claim certainty — so the critique is that those claims aren't framed to be falsifiable, not that the paper makes none. Third, I clarified that the paper is openly a preprint and never claimed peer review; the point about DOIs and academic trappings is about how authority reads to a skimming reader, not about anything its author misrepresented. The author has published his own responses to the original critique, and they're worth reading alongside this.

The broader argument stands: in the age of coherence engines, fluency is not evidence, and testability is the property that carries the weight.

References

Schaut, F. (2026). Conditional Realism, Stewardship, and Survivable Cognition Under Finite Constraint. Zenodo.
