Measuring Judgment in an AI-Assisted World — Caliber

AI made output cheap and polish universal. What stays scarce is judgment that holds up, and you can't reward or build what you can't see.

[Figure 01, "new scarcity": Fluency got free. Judgment didn't.]

The memo was good. That was the problem.

You're forwarded a recommendation. It reads clean: well-structured, confident, the right words, no loose ends. A year ago, that polish meant something. It meant someone had sat with the problem long enough to make it look easy, and you could sign off on the strength of it, because the polish and the thinking almost always arrived together.

They've come apart. The same memo now takes twenty minutes and a half-decent prompt. The writing is still clean; whether anyone reasoned their way to it is anybody's guess. And the page won't tell you.

The signal that went dead

This is the part of the AI shift that doesn't make the headlines. The first-order story is the one everyone tells: work gets faster. True, and not the part that will cost you. What costs you is one level down.

For most of working life, polish was a decent stand-in for thought. If the analysis read well, someone had probably done the reasoning, and that rough correlation is what let managers and teachers and clients judge quality at a glance. AI broke the correlation. The analysis reads well now whether or not any reasoning happened behind it. The work is as good, or as thin, as it ever was. What's gone is your ability to tell which, from the page in front of you. And you're still making the call on a cue that has gone silent.

There's a slower loss underneath that one. People build judgment by being made to defend it: a hard question, a skeptical reviewer, a position you have to hold while someone leans on it. Route every difficult call through a model and that pressure goes away. The pressure was never in the way. Holding a claim, defending it, watching it get knocked down and rebuilding it stronger: that is how the judgment got there in the first place. Skip the struggle and you skip the reps that build it.

Why judgment became the scarce thing

It's tempting to say AI made everything cheap except judgment, and stop there. That's too neat. Plenty of things are still scarce: taste, relationships, capital, the data nobody else has. Judgment isn't special by process of elimination. It's special for two reasons that are easy to walk past.

The first is that it went invisible. Fluency used to be its proxy, and now it isn't, so a capability that's worth exactly what it always was has lost the thing that made it legible.

The second is that it sits under everything else. Judgment is what you use to deploy the capital, pick the relationships, and apply the taste; it's the capacity that governs the use of all the others. So this isn't one more scarce thing for the pile. It sits under the pile, and when your grip on it slips, everything stacked on top gets harder to use well.

[Figure 06, "first second order": The first-order effect is easy to see. The second-order chain is where judgment lives.]

What a "cognitive profile" for judgment looks like

When Google DeepMind's researchers tried to measure how close machines are to general intelligence, they ran into a conclusion that turns out to matter for the rest of us: you can't put it in a single number. Their 2026 paper, Measuring Progress Toward AGI: A Cognitive Framework, breaks intelligence into a taxonomy of ten faculties (reasoning, memory, metacognition, and executive function among them) and reads each one into what they call a cognitive profile of strengths and weaknesses. One score, they argue, hides more than it shows.

Human judgment behaves the same way. There's no single quantity to rank people on; there's a profile. That's the premise behind Caliber. It takes a real piece of work — a memo, a recommendation, an analysis — and reads the depth of the independent judgment behind it across seven dimensions:

Reasoning quality — is the argument actually built, or only asserted?
Evidence grounding — are the claims anchored to something checkable, or floating?
Original judgment — has the view been worked through, or is it still a restatement of the consensus?
Trade-off clarity — does it own what it gives up?
Uncertainty awareness — does it say what it doesn't know, instead of overclaiming?
Domain grounding — does it reflect how the field actually works?
Defensibility — would it survive a smart, motivated, hostile reviewer?
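One way to picture the difference between a profile and a single number is a small data structure. The sketch below is purely illustrative: the `JudgmentProfile` class, the 0-to-10 scale, and the example scores are assumptions invented for this example, not Caliber's actual implementation or output.

```python
from dataclasses import dataclass

# The seven dimensions named in the text; these identifiers are hypothetical.
DIMENSIONS = (
    "reasoning_quality",
    "evidence_grounding",
    "original_judgment",
    "trade_off_clarity",
    "uncertainty_awareness",
    "domain_grounding",
    "defensibility",
)


@dataclass(frozen=True)
class JudgmentProfile:
    scores: dict  # dimension name -> score on an assumed 0-10 scale

    def __post_init__(self):
        # Refuse a profile that leaves a dimension unread.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")

    def overall(self) -> float:
        # The single number a ranking would use.
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

    def weakest(self) -> str:
        # The dimension a reviewer would press on first.
        return min(DIMENSIONS, key=lambda d: self.scores[d])


# Two made-up memos with the same average and very different shapes.
even_memo = JudgmentProfile({d: 7 for d in DIMENSIONS})
lopsided_memo = JudgmentProfile({
    **{d: 8 for d in DIMENSIONS},
    "uncertainty_awareness": 7,
    "domain_grounding": 7,
    "defensibility": 3,
})

print(even_memo.overall() == lopsided_memo.overall())  # True
print(lopsided_memo.weakest())  # defensibility
```

Both memos average out the same, so a single score would treat them as equals. Only the profile shows that the second one would fold under a hostile review.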

[Figure 04, "seven dimensions": Judgment isn't a number. It's a profile.]

Two of those — reasoning, and the self-knowledge to admit what you don't know — are among the faculties in DeepMind's taxonomy of intelligence itself, which is the whole point. We spent a few years asking whether the machine had these capacities. Now that the machine is fluent, the question worth asking is whether the person still brings them, or whether they've gradually been handed over.

"Sharpen this": handing the thinking back

Here is where it runs against the grain of everything else in your toolbar. Caliber reads what you wrote and asks the question that exposes the soft spot, then leaves the rewriting to you. Most tools hand you a finished draft; this one hands you back the part of your own argument you'd lose on.

Picture a memo built around an escalation principle. It comes back strong on most dimensions, with one thin spot under reasoning quality. Rather than just flag it, Caliber asks the kind of question a good colleague would: "Defend your escalation principle to someone who says it makes the agent too slow and timid. Why is it still right?" The writer starts to answer, and stops. They'd inherited the principle. They hadn't reasoned it. So they sit with the question, work it through, and rewrite. The memo gets better, and so does the person who wrote it. The next one is stronger because the judgment is stronger, not because a model patched it.

[Figure 02, "sharpen this": It points at the part of the argument you'd lose on, and leaves the fix to you.]

You stay the decider the whole way. You revise the work, Caliber re-reads the new version, and you watch the line move: measured, then better, then solid. It's progress you can see, not a grade you collect, and it lands on the work, never on you. It puts back the one thing cognitive offloading takes out: the friction that builds judgment — the kind you used to need a demanding boss to supply.

[Figure 03, "the loop": Reason, read, close the gap, repeat.]

Is the score trustworthy?

Fair question, and by Caliber's own standard, the right one to ask. Defensibility is one of the things it reads in your work; it would be a poor instrument that couldn't take its own test.

So, plainly. Caliber reads the reasoning on the page, not the person behind it. A brilliant thinker can dash off one undefended paragraph, and a shaky idea can be argued beautifully. It isn't an intelligence test or a ruling on your worth, and it doesn't claim to know whether you're right. It reads whether your reasoning is built to hold up, and shows you where it isn't yet.

It doesn't guess, either. It applies one consistent standard for what defensible reasoning looks like, the same way every time, so the same piece of work lands on a near-identical read from one pass to the next. It is not perfectly deterministic; where it's unsure, it says so rather than bluffing past it. So: a consistent tool with stated limits that names its own uncertainty, not an oracle. The number was never the point on its own. The value is the specific weak spot it surfaces, and the question that helps you close it.
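A claim like "near-identical from one pass to the next" is one a reader can test for themselves. The sketch below shows one way to do it, under stated assumptions: the `summarize_reads` helper, the `tolerance` threshold, and the sample scores are all hypothetical, and nothing here describes how Caliber computes or reports its own uncertainty.

```python
from statistics import mean, pstdev


def summarize_reads(reads, tolerance=0.5):
    """Summarize repeated reads of the same piece of work.

    reads: list of dicts mapping dimension name -> score (assumed 0-10 scale).
    tolerance: hypothetical cutoff; a spread above it means the read should
    be reported as unsure rather than as a firm score.
    """
    summary = {}
    for dim in reads[0]:
        values = [read[dim] for read in reads]
        spread = pstdev(values)  # population standard deviation across passes
        summary[dim] = {
            "mean": mean(values),
            "spread": spread,
            "stable": spread <= tolerance,
        }
    return summary


# Three made-up passes over one memo.
passes = [
    {"defensibility": 6, "reasoning_quality": 7},
    {"defensibility": 6, "reasoning_quality": 3},
    {"defensibility": 6, "reasoning_quality": 8},
]

summary = summarize_reads(passes)
for dim, stats in summary.items():
    label = "stable" if stats["stable"] else "unsure: say so"
    print(f"{dim}: mean={stats['mean']:.1f} spread={stats['spread']:.2f} ({label})")
```

The design choice worth noting is that an unstable dimension is reported as unstable instead of being averaged into a confident-looking number.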

What the research says about AI and critical thinking

If this sounds like an anxiety dressed up as a product, the evidence is less comfortable than most of the marketing around AI. Surveys keep finding that heavier reliance on generative tools tracks with measurably weaker critical thinking, with cognitive offloading doing the damage in between. One widely reported study put the correlation as strong as roughly −0.68. Knowledge workers, in particular, slide into automation bias, accepting an AI's answer even when it's wrong and they would have done better on their own. The effect hits hardest early in a career, before a person's own judgment has set.
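To put the −0.68 figure in proportion, here is a quick bit of arithmetic. It assumes the number is a Pearson correlation coefficient, which the passage above does not state:

```latex
r \approx -0.68
\quad\Longrightarrow\quad
r^2 \approx (-0.68)^2 = 0.4624 \approx 0.46
```

Under that assumption, roughly 46% of the variation in one measure moves in a straight line with the other, which is a large share for a behavioral survey.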

The same research points at the way out, and it isn't "use less AI." The workers who keep their edge are the ones who move their effort from generating to evaluating — they interrogate what the model hands them instead of forwarding it. That shift, from producing to scrutinizing, is the entire protective mechanism, and it happens to be exactly what Caliber is built to provoke. This was never about discouraging anyone from using AI. It's about keeping the one muscle the tools let you skip.

What this looks like inside an organization

A personal tool is one thing. The more interesting question is what it would take for this to become part of how an organization actually works.

Start with the decision document — the memo, the business case, the recommendation that moves real money or real risk. A judgment layer would sit there, not as a gate that blocks people, but as a read that surfaces the undefended assumption while it's still cheap to fix, before the commitment hardens. Fix it in the draft and it costs a comment. Miss it, and you find out when the money is gone and someone is writing the post-mortem.

Then there's accountability, which is turning into a legal question and not only a cultural one. Under the EU AI Act's human-oversight rules — Article 14, with high-risk obligations binding from 2 August 2026 — organizations are increasingly expected to show that a real person exercised meaningful judgment over an AI-assisted decision, and could have overridden it. Most can't show that today; the readiness gap is well documented. A record of the reasoning behind a decision, and of how it held up, is the kind of evidence that gap is asking for. Caliber doesn't make anyone compliant on its own, but it produces the sort of artifact this whole direction of regulation keeps pointing at.

An enterprise version would also have to speak each field's language, because what reads as defensible in a credit memo isn't what reads as defensible in a clinical recommendation or a power-siting plan. And at the level of the whole organization, you'd get something most leaders run without: a view of where the company's shared judgment is strong and where it most needs investment. That matters more every year, as experienced people retire and the colleagues stepping into their decisions are still building the judgment those roles demand.

One point matters more than any feature. The purpose of all of this is to raise the floor and grow people, not to rank them, monitor them, or quietly build a case for letting them go. It only works as a shared gym. The day someone turns the scores into a ranking, it stops helping people and starts diminishing them. That's the one use it can't be put to. The privacy posture has to match — scores kept, the work itself never — because no one will be honest about their own thinking inside a system they're afraid of.

[Figure 05, "widening circles": The same instrument, pointed at higher stakes.]

The ambitious version

Push the idea further and the AGI researchers are a useful guide again. Their framework reads intelligence as a profile across faculties, measured against a human baseline, using targeted tasks that isolate one capacity at a time. Point that same lens at human judgment and a roadmap appears.

A richer profile, first: judgment isn't only seven things, and it behaves differently under time pressure than under ambiguity, inside your domain than outside it. A real baseline, too: not "are you clever," but "how does this reasoning compare to the defensible standard in your field." And a longer view than any single read can give: judgment as a trajectory a person and a team grow over time, which is the opposite of a one-time label. Done well, judgment stops being a vague compliment people pay each other and becomes something an organization can see, trust, and develop the way it already tracks revenue or reliability.

That's the bet under all of it. As AI takes over more of the doing, the part it can't do for you (deciding well, and standing behind the decision) is where the human contribution concentrates. The edge goes to the people who can still argue with the model: who use it for the draft and then go after the reasoning instead of forwarding it, and who, increasingly, can prove they did. That last part barely exists yet. Almost no one, today, can show that their judgment held up. Closing that gap — making human judgment something you can see and strengthen, rather than replace or rank — is the whole reason Caliber exists.